Speech Recognition with Whisper.cpp

Today I stumbled across Whispers of A.I.'s Modular Future by James Somers, a piece that, at least by the standards of publications aimed at the general public, makes an excellent case for why whisper.cpp might finally be a useful and non-patronising product of the current AI hype.

What can I say? I think I'm sold. And perhaps I'm now a little bit scared, too. If you want to understand why and speak a bit of German, you can skip to The Crazy right away.

The Good

You know, so far I've ignored most of the current statistical modelling (“AI”, “Machine Learning”) – if you need a graphics chip with drivers even worse than Intel's, and that then needs 8 GB of video RAM before anything works, I'm out. And I'm also out when the only way I can use some software is on some web page because there's proprietary data behind it.

Not so for whisper.cpp. This is software as it was meant to be: trivial dependencies, compact, works on basically any hardware there is. To build it, you just run:

git clone https://github.com/ggerganov/whisper.cpp/
cd whisper.cpp
make

– and that's it. No dependency juggling down to incompatible micro versions, no fancy build system, just a few C(++) sources and a Makefile. The thing works in place without a hitch, and it has a sensible command line interface.

Well, you need the language models, of course. There are some reasonably free ones for English. The whisper.cpp distribution's models/README.md explains how to obtain some. I got myself ggml-small.en.bin, recorded a few words of English into a file zw.wav and ran:

./main -m models/ggml-small.en.bin ~/zw.wav

The machine demanded a sample rate of 16 kHz; I made Audacity oblige, ran the thing again, and was blown away when – admittedly after a surprisingly long time – my words appeared on the screen.
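To spot a wrong sample rate before whisper.cpp complains, one can look at the WAV header first. What follows is only a minimal sketch using Python's standard wave module; the helper names wav_format and needs_resampling are made up for illustration, and the 16000 Hz default simply mirrors the demand mentioned above.

```python
import wave

WANTED_RATE = 16000  # Hz; the sample rate whisper.cpp asked for above


def wav_format(path):
    """Return (sample rate in Hz, channel count, bits per sample) of a WAV file."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate(), wav.getnchannels(), wav.getsampwidth() * 8


def needs_resampling(path, wanted_rate=WANTED_RATE):
    """Return True if the sample rate of the file differs from wanted_rate."""
    return wav_format(path)[0] != wanted_rate
```

Calling needs_resampling("zw.wav") on a typical 44.1 kHz recording would return True, which is the cue to resample before handing the file to whisper.cpp.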

I immediately tried to figure out how to stream in data but then quickly decided that's probably not worth the effort; the software needs to see words in context, and for what I plan to do – transcribing radio shows – having an intermediate WAV file really does not hurt.

I quickly cobbled together a piece of Python wrapping the conversion (using the perennial classic of audio processing, sox) somewhat cleverly, like this:

#!/usr/bin/python
# A quick hack to transcribe audio files
#
# Dependencies:
# * sox (would be mpv, but that's somehow broken)
# * a build of whisper.cpp (https://github.com/ggerganov/whisper.cpp/)
# * a language model (see models/README.md in the whisper source)

import contextlib
import os
import subprocess
import sys
import tempfile

WHISPER_DIR = "/usr/src/whisper.cpp"


@contextlib.contextmanager
def workdir(wd):
        prev_dir = os.getcwd()
        try:
                os.chdir(wd)
                yield
        finally:
                os.chdir(prev_dir)


def transcribe(audio_source, model, lang):
        """transcribes an audio file, creating an in-place .txt.

        model must be the name of a model file in WHISPER_DIR/models;
        lang is the ISO language code in which the output should turn up.
        """
        audio_source = os.path.join(os.getcwd(), audio_source)
        with tempfile.TemporaryDirectory(suffix="transcribe", dir="/var/tmp") as wd:
                with workdir(wd):
                        subprocess.check_call(["sox",
                                audio_source,
                                "-b", "16", "-r", "16000", "-c", "1",
                                "audiodump.wav"])

                        out_name = os.path.splitext(audio_source)[0]
                        subprocess.check_call([WHISPER_DIR+"/main",
                                "-l", lang,
                                "-m", WHISPER_DIR+"/models/"+model,
                                "-otxt", "-of", out_name,
                                "audiodump.wav"])


def parse_command_line():
        import argparse
        parser = argparse.ArgumentParser(description="Wrap whisper.cpp to"
                " bulk-transcribe audio files.")
        parser.add_argument("model", type=str, help="name of ggml language"
                f" model to use, relative to {WHISPER_DIR}/models")
        parser.add_argument("audios", type=str, nargs="+",
                help="Sox-translatable audio file to transcribe.")
        parser.add_argument("--lang", type=str, default="en",
                help="Spoken language to try and recognise")

        return parser.parse_args()


if __name__=="__main__":
        args = parse_command_line()
        for audio in args.audios:
                transcribe(audio, args.model, args.lang)

Addendum (2023-06-26)

(Added a --lang option as per ron's feedback below)

I have that as transcribe.py in my path, and I can now enter the rip of an audiobook and say:

transcribe.py ggml-small.en.bin *.ogg

(provided I have downloaded the model as per whisper.cpp's instructions). After a little while (with high CPU usage), there is a transcript on my disk that's better than what I would have typed myself even after two rounds of proof-reading, except that whisper.cpp doesn't get the paragraphs right.
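Since each run keeps the CPU busy for a while, it can pay to skip files that already have a transcript when re-running over a whole directory. Here is a minimal sketch of that idea; transcript_path and pending are hypothetical helpers, not part of the script above, and they assume that whisper.cpp's -otxt together with -of NAME leaves the text in NAME.txt, which is what the script relies on.

```python
import os


def transcript_path(audio_source):
    """Return where the transcript for audio_source is expected to end up.

    Assumption: the transcript sits next to the audio file, with the
    extension replaced by .txt.
    """
    return os.path.splitext(os.path.abspath(audio_source))[0] + ".txt"


def pending(audio_sources):
    """Return those audio files that do not have a transcript yet."""
    return [a for a in audio_sources
            if not os.path.exists(transcript_path(a))]
```

In the script's main loop, one could then iterate over pending(args.audios) instead of args.audios.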

For the first time in the current AI hype, I am starting to get carried away, in particular when I consider how much speech recognition sucked when I last played with it around 2003, using a heap of sorry failure called ViaVoice.

The Bad

Skip the rant to get to the exciting part.

Trouble is: What I'd mainly like to transcribe is German radio, and whisper.cpp does not come with a German language model. Not to worry, one would think, as whisper.cpp comes with conversion scripts for the PyTorch-based whisper models like those one can get from Hugging Face. I downloaded what I think is the model file and cheerfully ran:

$ python convert-h5-to-ggml.py /media/downloads/model.bin
Traceback (most recent call last):
  File "/home/src/whisper.cpp/models/convert-h5-to-ggml.py", line 24, in <module>
    import torch
ModuleNotFoundError: No module named 'torch'

Oh bummer. Well, how hard can it be? Turns out: Surprisingly hard. There is no pytorch package in Debian stable. Ah… I realised much later that there is; it's just that my main system still has an i386 userland, and pytorch is only available for amd64. But I hadn't figured that out then. So, I set up a virtual Python environment (never mix your system python and pip) and ran:

$ pip install torch
ERROR: Could not find a version that satisfies the requirement torch
ERROR: No matching distribution found for torch

Huh? What's that? I ran pip with a couple of -v sprinkled in, which at least yielded:

[...]
Skipping link: none of the wheel's tags match: cp38-cp38-win_amd64: https://download.pytorch.org/whl/cpu/torch-1.9.0%2Bcpu-cp38-cp38-win_amd64.whl (from https://download.pytorch.org/whl/cpu/torch/)
[...]
Given no hashes to check 0 links for project 'torch': discarding no candidates
ERROR: Could not find a version that satisfies the requirement torch
ERROR: No matching distribution found for torch
[...]

The message with “Given no” has a certain lyrical quality, but other than that, from the “Skipping” messages I concluded they don't have 32-bit builds any more.
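What pip is comparing in those “Skipping” lines are the tags at the end of the wheel's file name: Python tag, ABI tag and platform tag, in that order. A minimal sketch, with made-up helper names (wheel_tags, pointer_bits), that pulls the tags out of a link like the one above and reports whether the running interpreter is a 32-bit or a 64-bit build:

```python
import struct
from urllib.parse import unquote, urlparse


def wheel_tags(url_or_name):
    """Return (python tag, ABI tag, platform tag) of a wheel file name or URL.

    Wheel names look like name-version[-build]-python-abi-platform.whl.
    """
    name = unquote(urlparse(url_or_name).path.rsplit("/", 1)[-1])
    if not name.endswith(".whl"):
        raise ValueError(f"not a wheel: {name}")
    parts = name[:-len(".whl")].split("-")
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel name: {name}")
    return tuple(parts[-3:])


def pointer_bits():
    """Return the pointer width in bits of the running Python interpreter."""
    return struct.calcsize("P") * 8
```

For the link in the log, wheel_tags returns ("cp38", "cp38", "win_amd64"), that is, CPython 3.8 on 64-bit Windows. On an i386 userland, pointer_bits() would return 32, and a Linux wheel would need an i686 platform tag to be acceptable there.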

Well, how hard can it be? PyPI says the sources are on GitHub, and so I cloned that repo. Oh boy, AI at its finest. The thing pulls in a whopping 3.5 Gigabytes of who-knows-what. Oh, come on.

python setup.py build fails after a short while, complaining about missing typing_extensions. Manually running pip install typing_extensions fixes that. But I killed setup.py build after a few minutes when there were only 50/5719 files built. Has AI written that software?

In the meantime, I had gone to a machine with a 64-bit userland, and to be fair the experience wasn't too bad there, except for the hellish number of dependencies that pytorch pulls in.

So, my expectations regarding “AI code” were by and large met in that second part of the adventure, including the little detail that the internal links on https://pypi.org/project/torch/ are broken because right now their document processor does not produce id attributes on the headlines. Yeah, I know, they're giving it all away for free and all that. But still, after the brief glimpse into the paradise of yesteryear's software that whisper.cpp afforded, this was a striking contrast.

The Crazy

So, I converted the German language model doing, in effect:

git clone https://github.com/openai/whisper.git
git lfs install
git clone https://huggingface.co/bofenghuang/whisper-small-cv11-german
python convert-h5-to-ggml.py whisper-small-cv11-german/ whisper tmp

(where I took convert-h5-to-ggml.py from whisper.cpp's repo). Then I moved the resulting tmp/ggml-model.bin to german-small.ggml and ran:

transcribe.py german-small.ggml peer_review_wie_objektiv_ist_das_wissenschaftliche_dlf_20221214_1646_8a93e930.mp3

with my script above and this German-language mp3 from Deutschlandfunk. From the English experience, I had expected an almost flawless transcription of the German text. What I got instead was the following (paragraphs inserted by me; listen to the audio in parallel if you can):

Germany. Research is on [that was: Deutschlandfunk Forschung aktuell]

A Nobel Prize for Science is not easy without further ado. They really need to find something out. For example, Vernon Smith, who is now 95 years old, is now the father of the Experimental Economy. In 2002 he won the Nobel Prize for Science.

This made such a prize and renommee also make impression on other Fachleuteen and that actually influenced the unabhängig well-office method for scientific publications. This has recently shown a study of Business Science in the Fachmagazin PNS. Anike Meyer spoke with one of the authors.

When Jürgen Huber and his colleagues thought about the experiment, it was clear to them that this is not fair. The same manuscript was given by two different authors, Vernon Smith, who won a Nobel Prize and Sabio Inua, who still has no Doctor-titel, but has an African background. Who has better chances, to publish the study?

We have expected a certain effect for Nobel Prize-täger, but we were honestly surprised and shocked how big the effect is. Six times as often, the Doctor-titel was given the study without much work for publication, if it was the Nobel Prize-täger. Was the Doctor-titel as a given, two-thirds of the Good-achter could not be printed for real, the Nobel Prize-täger was less than a quarter of a recitation.

But that is humanly, but does not correspond to the objectivity, which has in science, says Professor for Finance at the University of Graz, Jürgen Huber.

The Good-achting of science works through other sciences, it is important for publications for research, for jobs, for who is a Professor etc. It is very important that we design this process as objectively as possible, so that the money, which is often used by public hand, can be used best, so that we have progress.

As a good-touch, so-called Peer Review, can be very different. The most common option in nature and economic science is simply blinded. The Good-achter's knowledge of who are the authors is not necessary. This process is not particularly objective, there are not many studies, which have shown that in recent years there are different examples.

The easiest conclusion would be, we must stay at least in the anonymized version of this good-touching of papers, because then there is no disadvantage for the unrecognized, no advantage for the known, so that it really works. But it is not enough, to remove the name from the deckboard. The whole manuscript must be written so that the Good-achter, the identity of the authors, cannot be guessed. And that is, in times of Internet and Preprince, especially in small research fields, quite difficult.

Studies have shown that double blinding, in less than half of the cases, is possible. Despite that, many researchers have in double blinded Review, the better alternative. In terms of professional magazines and magazines, usually there are three-quarters of participants for it. The scientists are very open for changes and say, it was extremely exciting, what you showed, we need more such studies. So, the science is, thank God, still, so to speak, open and healthy enough to learn.

In fact, the willingness of the publishers of magazine with alternative Good-achter-tests to experiment, is now significantly growing. Nature has been offering, since 2015, the possibility of, in itself, free will, to decide for a double-blinded Good-achter-test. IOP, a big publisher from the field of Physique and Engineering, was adopted in 2020. But with moderate success. Just twelve percent of the enrichment by Nature and nearly twenty-by-Iop, use the double-blind version.

The change in theory is simpler than in practice, and for that, it's also the Noble vs. Nobody-Studio by Jürgen Huber and his colleagues. It was unrecognisable, and afterwards, I'm very disappointed. Its Studio for Peer Review is neither double-blinded nor classic, but in a less-gray-sounding way, that some magazines offer, only for particularly renommated authors, such as Nobel-priestriker, about VIP-cursing. The authors are allowed to choose their Good-achter-tests themselves.

I'm not sure if many colleagues have said, "Hey, friend, you're using your status by yourself now, I can say to our defense, we've been through a review, but not any family and friends, they're actually experts, and it was also a really tedious review process, but yes, the point is, it's well-taken."

As an objective, the scientific well-behindersystem is an extension of Annicka Meyer.

As I had not expected a translation component within the language model, this immediately struck me as a miracle. The thing hears German and writes some vintage-2010 machine translation into English. I still stand in awe, although I am starting to believe this could actually have happened by design.

Regrettably, it's not what I want – I'd like a German transcription. Well: I suppose I'll have to do a bit more research for that. Can anyone help me out here? Is there a “small” German-to-German model out there somewhere?
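One thing that may be worth ruling out before hunting for another model: whisper.cpp's main takes the spoken language through its -l flag, and the script above passes its --lang option (default: en) straight through to it. The following is a minimal sketch of the command line the script assembles; whisper_command is a hypothetical helper written only to make the language parameter visible, and whether passing de is enough for this particular model is an assumption I have not verified here.

```python
def whisper_command(whisper_dir, model, wav, out_name, lang="en"):
    """Return the argument list for whisper.cpp's main, as in the script above.

    lang is the code of the language spoken in the recording; it ends up
    after the -l flag.
    """
    return [whisper_dir + "/main",
            "-l", lang,
            "-m", whisper_dir + "/models/" + model,
            "-otxt", "-of", out_name,
            wav]
```

With the script as it stands, the corresponding call would be transcribe.py --lang de german-small.ggml followed by the name of the mp3.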
