Speech Recognition with Whisper.cpp
Today I stumbled across Whispers of A.I.'s Modular Future by James Somers, a piece that, at least by the standards of publications aimed at the general public, makes an excellent case that whisper.cpp might finally be a useful and non-patronising product of the current AI hype.
What can I say? I think I'm sold. And perhaps I'm now a little bit scared, too. If you want to understand why and speak a bit of German, you can skip to The Crazy right away.
The Good
You know, so far I've ignored most of the current statistical modelling (“AI”, “Machine Learning”) – if you need a graphics chip with drivers even worse than Intel's, and that then needs 8 GB of video RAM before anything works, I'm out. And I'm also out when the only way I can use some software is on some web page because there's proprietary data behind it.
Not so for whisper.cpp. This is software as it was meant to be: trivial dependencies, compact, works on basically any hardware there is. To build it, you just run:
git clone https://github.com/ggerganov/whisper.cpp/
cd whisper.cpp
make
– and that's it. No dependency juggling down to incompatible micro versions, no fancy build system, just a few C(++) sources and a Makefile. The thing works in place without a hitch, and it has a sensible command line interface.
Well, you need the language models, of course. There are some reasonably free ones for English. The whisper.cpp distribution's models/README.md explains how to obtain some. I got myself ggml-small.en.bin, recorded a few words of English into a file zw.wav and ran:
./main -m models/ggml-small.en.bin ~/zw.wav
The machine demanded I use a sample rate of 16 kHz, so I made audacity oblige, ran the thing again, and was blown away when – admittedly after a surprisingly long time – my words appeared on the screen.
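If you would rather not fire up audacity just for the resampling, sox can do it from the command line. A minimal sketch, assuming the original recording sits in ~/zw.wav (the output file name is made up; the options are the same ones my script below passes to sox):

sox ~/zw.wav -b 16 -r 16000 -c 1 zw-16k.wav
./main -m models/ggml-small.en.bin zw-16k.wav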
I immediately tried to figure out how to stream in data but then quickly decided that's probably not worth the effort; the software needs to see words in context, and for what I plan to do – transcribing radio shows – having an intermediate WAV file really does not hurt.
I quickly cobbled together a piece of Python wrapping the conversion (using the perennial classic of audio processing, sox) somewhat cleverly, like this:
#!/usr/bin/python
# A quick hack to transcribe audio files
#
# Dependencies:
# * sox (would be mpv, but that's somehow broken)
# * a build of whispercpp (https://github.com/ggerganov/whisper.cpp/)
# * a language model (see models/README.md in the whisper source)

import contextlib
import os
import subprocess
import sys
import tempfile

WHISPER_DIR = "/usr/src/whisper.cpp"


@contextlib.contextmanager
def workdir(wd):
    """A context manager temporarily changing into the directory wd."""
    prev_dir = os.getcwd()
    try:
        os.chdir(wd)
        yield
    finally:
        os.chdir(prev_dir)


def transcribe(audio_source, model, lang):
    """transcribes an audio file, creating an in-place .txt.

    model must be the name of a model file in WHISPER_DIR/models;
    lang is the ISO language code in which the output should turn up.
    """
    audio_source = os.path.join(os.getcwd(), audio_source)
    with tempfile.TemporaryDirectory(suffix="transcribe", dir="/var/tmp") as wd:
        with workdir(wd):
            # convert to the 16 kHz mono WAV whisper.cpp insists on
            subprocess.check_call(["sox", audio_source,
                "-b", "16", "-r", "16000", "-c", "1", "audiodump.wav"])

            out_name = os.path.splitext(audio_source)[0]
            subprocess.check_call([WHISPER_DIR+"/main", "-l", lang,
                "-m", WHISPER_DIR+"/models/"+model,
                "-otxt", "-of", out_name, "audiodump.wav"])


def parse_command_line():
    import argparse
    parser = argparse.ArgumentParser(description="Wrap whisper.cpp to"
        " bulk-transcribe audio files.")
    parser.add_argument("model", type=str, help="name of ggml language"
        f" model to use, relative to {WHISPER_DIR}/models")
    parser.add_argument("audios", type=str, nargs="+",
        help="Sox-translatable audio file to transliterate.")
    parser.add_argument("--lang", type=str, default="en",
        help="Spoken language to try and recognise")

    return parser.parse_args()


if __name__ == "__main__":
    args = parse_command_line()
    for audio in args.audios:
        transcribe(audio, args.model, args.lang)
Addendum (2023-06-26)
(Added a --lang option as per ron's feedback below)
I have that as transcribe.py in my path, and I can now change into the directory holding the rip of an audiobook and say:
transcribe.py ggml-small.en.bin *.ogg
(provided I have downloaded the model as per whisper.cpp's instructions). After a little while (with high CPU usage), there is a transcript on my disk that's better than what I had typed myself even after two rounds of proof-reading, except that whisper.cpp doesn't get the paragraphs right.
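In case you are wondering about that download: the whisper.cpp repo ships a small helper script for fetching the ggml models – at least in the version I used it sits in models/ and is called download-ggml-model.sh; check models/README.md if that has changed. Getting the English small model then is roughly:

cd /usr/src/whisper.cpp
bash ./models/download-ggml-model.sh small.en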
For the first time in the current AI hype, I start getting carried away, in particular when I consider how much speech recognition sucked when I last played with it around 2003, using a heap of sorry failure called viavoice.
The Bad
Skip the rant to get to the exciting part.
Trouble is: what I'd mainly like to transcribe is German radio, and whisper.cpp does not come with a German language model. Not to worry, one would think, as whisper.cpp comes with conversion scripts for PyTorch-based Whisper models like the ones you can get from Hugging Face. I downloaded what I think is the model file and cheerfully ran:
$ python convert-h5-to-ggml.py /media/downloads/model.bin
Traceback (most recent call last):
  File "/home/src/whisper.cpp/models/convert-h5-to-ggml.py", line 24, in <module>
    import torch
ModuleNotFoundError: No module named 'torch'
Oh bummer. Well, how hard can it be? Turns out: surprisingly hard. There is no pytorch package in Debian stable. Ah… much later I realised there is one, it's just that my main system still has an i386 userland, and pytorch is only available for amd64. But I hadn't figured that out then. So, I switched to a virtual Python environment (never mix your system Python and pip) and ran:
$ pip install torch
ERROR: Could not find a version that satisfies the requirement torch
ERROR: No matching distribution found for torch
Huh? What's that? I ran pip with a couple of -v sprinkled in, which at least yielded:
[...]
Skipping link: none of the wheel's tags match: cp38-cp38-win_amd64: https://download.pytorch.org/whl/cpu/torch-1.9.0%2Bcpu-cp38-cp38-win_amd64.whl (from https://download.pytorch.org/whl/cpu/torch/)
[...]
Given no hashes to check 0 links for project 'torch': discarding no candidates
ERROR: Could not find a version that satisfies the requirement torch
ERROR: No matching distribution found for torch
[...]
The message with “Given no” has a certain lyrical quality, but other than that, from the “Skipping” messages I concluded they don't offer 32-bit builds any more.
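If you want to check for yourself which wheel tags your pip would accept – and hence why all those links get skipped on an i386 userland – pip can list them; note that pip itself marks this command as unstable:

pip debug --verbose | grep -i -A 10 "compatible tags"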
Well, how hard can it be? PyPI says the sources are on GitHub, and so I cloned that repo. Oh boy, AI at its finest. The thing pulls in a whopping 3.5 Gigabytes of who-knows-what. Oh, come on.
python setup.py build fails after a short while, complaining about missing typing_extensions. Manually running pip install typing_extensions fixes that. But I killed setup.py build after a few minutes when only 50 of 5719 files had been built. Did an AI write that software?
In the meantime, I had gone to a machine with a 64 bit userland, and to be fair the experience wasn't too bad there, except for the hellish amount of dependencies that pytorch pulls in.
So, my expectations regarding “AI code” were by and large met in that second part of the adventure, including the little detail that the internal links on https://pypi.org/project/torch/ are broken because right now their document processor does not produce id attributes on the headlines. Yeah, I know, they're giving it all away for free and all that. But still, after the brief glimpse into the paradise of yesteryear's software that whisper.cpp afforded, this was a striking contrast.
The Crazy
So, I converted the German language model doing, in effect:
git clone https://github.com/openai/whisper.git
git lfs install
git clone https://huggingface.co/bofenghuang/whisper-small-cv11-german
python convert-h5-to-ggml.py whisper-small-cv11-german/ whisper tmp
(where I took convert-h5-to-ggml.py from whisper.cpp's repo). Then I moved the resulting tmp/ggml-model.bin to german-small.ggml and ran:
transcribe.py german-small.ggml peer_review_wie_objektiv_ist_das_wissenschaftliche_dlf_20221214_1646_8a93e930.mp3
with my script above and this German-language mp3 from Deutschlandfunk. From the English experience, I had expected to get an almost flawless transliteration of the German text. What I got instead was (paragraphs inserted by me); listen to the audio in parallel if you can:
Germany. Research is on [that was: Deutschlandfunk Forschung aktuell]
A Nobel Prize for Science is not easy without further ado. They really need to find something out. For example, Vernon Smith, who is now 95 years old, is now the father of the Experimental Economy. In 2002 he won the Nobel Prize for Science.
This made such a prize and renommee also make impression on other Fachleuteen and that actually influenced the unabhängig well-office method for scientific publications. This has recently shown a study of Business Science in the Fachmagazin PNS. Anike Meyer spoke with one of the authors.
When Jürgen Huber and his colleagues thought about the experiment, it was clear to them that this is not fair. The same manuscript was given by two different authors, Vernon …