Engelszüngeln: Speech Recognition with Whisper.cpp

The Good
The Bad
The Crazy

Today I stumbled across Whispers of A.I.'s Modular Future by James Somers, a piece that, at least by the standards of publications aimed at the general public, makes an excellent case for why whisper.cpp might finally be a useful and non-patronising product of the current AI hype.

What can I say? I think I'm sold. And perhaps I'm now a little bit scared, too. If you want to understand why and speak a bit of German, you can skip to The Crazy right away.

The Good

You know, so far I've ignored most of the current statistical modelling (“AI”, “Machine Learning”) – if something needs a graphics chip with drivers even worse than Intel's, and that chip then needs 8 GB of video RAM before anything works, I'm out. And I'm also out when the only way I can use some software is on some web page because there's proprietary data behind it.

Not so for whisper.cpp. This is software as it was meant to be: trivial dependencies, compact, works on basically any hardware there is. To build it, you just run:

git clone https://github.com/ggerganov/whisper.cpp/
cd whisper.cpp
make

– and that's it. No dependency juggling down to incompatible micro versions, no fancy build system, just a few C(++) sources and a Makefile. The thing works in place without a hitch, and it has a sensible command line interface.

Well, you need the language models, of course. There are some reasonably free ones for English. The whisper.cpp distribution's models/README.md explains how to obtain some. I got myself ggml-small.en.bin, recorded a few words of English into a file zw.wav and ran:

./main -m models/ggml-small.en.bin ~/zw.wav

The machine demanded I use a sample rate of 16 kHz. I made Audacity oblige, ran the thing again, and was blown away when – admittedly after a surprisingly long time – my words appeared on the screen.
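If you would rather find out about the sample rate before whisper.cpp complains, the standard library can tell you. The following is only a sketch: the function names are mine, and Python's wave module reads plain PCM WAV files only, so anything more exotic will raise an error instead of returning a result.

```python
import wave


def wav_format(path):
    """Return (sample_rate_hz, channels, bits_per_sample) of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        return (wav.getframerate(), wav.getnchannels(), wav.getsampwidth() * 8)


def is_whisper_ready(path, wanted_rate=16000):
    """True if the file already has the 16 kHz sample rate whisper.cpp asked for."""
    sample_rate, _channels, _bits = wav_format(path)
    return sample_rate == wanted_rate
```

Calling is_whisper_ready on the recording before the first run would have told me right away that a resampling step was due.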

I immediately tried to figure out how to stream in data but then quickly decided that's probably not worth the effort; the software needs to see words in context, and for what I plan to do – transcribing radio shows – having an intermediate WAV file really does not hurt.

I quickly cobbled together a piece of Python wrapping the conversion (using the perennial classic of audio processing, sox) somewhat cleverly, like this:

#!/usr/bin/python
# A quick hack to transcribe audio files
#
# Dependencies:
# * sox (would be mpv, but that's somehow broken)
# * a build of whispercpp (https://github.com/ggerganov/whisper.cpp/)
# * a language model (see models/README.md in the whisper source)

import contextlib
import os
import subprocess
import sys
import tempfile

WHISPER_DIR = "/usr/src/whisper.cpp"


@contextlib.contextmanager
def workdir(wd):
        prev_dir = os.getcwd()
        try:
                os.chdir(wd)
                yield
        finally:
                os.chdir(prev_dir)


def transcribe(audio_source, model, lang):
        """transcribes an audio file, creating an in-place .txt.

        model must be the name of a model file in WHISPER_DIR/models;
        lang is the ISO language code in which the output should turn up.
        """
        audio_source = os.path.join(os.getcwd(), audio_source)
        with tempfile.TemporaryDirectory(suffix="transcribe", dir="/var/tmp") as wd:
                with workdir(wd):
                        subprocess.check_call(["sox",
                                audio_source,
                                "-b", "16", "-r", "16000", "-c", "1",
                                "audiodump.wav"])

                        out_name = os.path.splitext(audio_source)[0]
                        subprocess.check_call([WHISPER_DIR+"/main",
                                "-l", lang,
                                "-m", WHISPER_DIR+"/models/"+model,
                                "-otxt", "-of", out_name,
                                "audiodump.wav"])


def parse_command_line():
        import argparse
        parser = argparse.ArgumentParser(description="Wrap whisper.cpp to"
                " bulk-transcribe audio files.")
        parser.add_argument("model", type=str, help="name of ggml language"
                f" model to use, relative to {WHISPER_DIR}/models")
        parser.add_argument("audios", type=str, nargs="+",
                help="Sox-translatable audio file to transcribe.")
        parser.add_argument("--lang", type=str, default="en",
                help="Spoken language to try and recognise")

        return parser.parse_args()


if __name__=="__main__":
        args = parse_command_line()
        for audio in args.audios:
                transcribe(audio, args.model, args.lang)

Addendum (2023-06-26)

(Added a --lang option as per ron's feedback below)

I have that as transcribe.py in my path, and I can now enter the rip of an audiobook and say:

transcribe.py ggml-small.en.bin *.ogg

(provided I have downloaded the model as per whisper.cpp's instructions). After a little while (with high CPU usage), there is a transcript on my disk that's better than what I had typed myself even after two rounds of proof-reading, except that whisper.cpp doesn't get the paragraphs right.
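Since the script hands the audio path minus its extension to -of and asks for text output with -otxt, each transcript lands next to its audio file as a .txt. For long bulk runs that get interrupted, something like the following sketch could pick out what is still missing; the helper names are hypothetical and not part of transcribe.py.

```python
import os


def transcript_path(audio_source):
    """Return where transcribe() asks whisper.cpp to write the text.

    The audio path without its extension goes to -of, and -otxt adds
    the .txt extension.
    """
    return os.path.splitext(os.path.abspath(audio_source))[0] + ".txt"


def untranscribed(audio_files):
    """Return the audio files that have no transcript next to them yet."""
    return [name for name in audio_files
            if not os.path.exists(transcript_path(name))]
```

Looping over untranscribed(args.audios) instead of args.audios would then skip files that already have a transcript.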

For the first time in the current AI hype, I start getting carried away, in particular when I consider how much speech recognition sucked when I last played with it around 2003, using a heap of sorry failure called ViaVoice.

The Bad

Skip the rant to get to the exciting part.

Trouble is: What I'd mainly like to transcribe is German radio, and whisper.cpp does not come with a German language model. Not to worry, one would think, as whisper.cpp comes with conversion scripts for the PyTorch-based whisper models like those one can get from Hugging Face. I downloaded what I thought was the model file and cheerfully ran:

$ python convert-h5-to-ggml.py /media/downloads/model.bin
Traceback (most recent call last):
  File "/home/src/whisper.cpp/models/convert-h5-to-ggml.py", line 24, in <module>
    import torch
ModuleNotFoundError: No module named 'torch'

Oh bummer. Well, how hard can it be? Turns out: Surprisingly hard. There is no pytorch package in Debian stable. Ah… much later I realised there is; it's just that my main system still has an i386 userland, and pytorch is only available for amd64. But I hadn't figured that out then. So, I set up a virtual Python environment (never mix your system python and pip) and ran:

$ pip install torch
ERROR: Could not find a version that satisfies the requirement torch
ERROR: No matching distribution found for torch

Huh? What's that? I ran pip with a couple of -v sprinkled in, which at least yielded:

[...]
Skipping link: none of the wheel's tags match: cp38-cp38-win_amd64: https://download.pytorch.org/whl/cpu/torch-1.9.0%2Bcpu-cp38-cp38-win_amd64.whl (from https://download.pytorch.org/whl/cpu/torch/)
[...]
Given no hashes to check 0 links for project 'torch': discarding no candidates
ERROR: Could not find a version that satisfies the requirement torch
ERROR: No matching distribution found for torch
[...]

The message with “Given no” has a certain lyrical quality, but other than that, from the “Skipping” messages I concluded they don't have 32-bit builds any more.
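What pip compares in those “Skipping” lines are the three tags at the end of each wheel file name (Python version, ABI, platform). A small sketch, with function names of my own choosing, that pulls the tags out of a file name like the one in the log and shows what the running interpreter looks like:

```python
import struct
import sysconfig
from urllib.parse import unquote


def parse_wheel_tags(wheel_name):
    """Split a wheel file name into (python_tag, abi_tag, platform_tag).

    Wheel names end in -{python tag}-{abi tag}-{platform tag}.whl, so the
    last three dash-separated fields are the tags pip looks at.
    """
    stem = unquote(wheel_name)  # the log shows "+" URL-encoded as %2B
    if not stem.endswith(".whl"):
        raise ValueError(f"not a wheel file name: {wheel_name}")
    python_tag, abi_tag, platform_tag = stem[:-len(".whl")].split("-")[-3:]
    return python_tag, abi_tag, platform_tag


def interpreter_summary():
    """Return (pointer_bits, platform) for the running interpreter."""
    return struct.calcsize("P") * 8, sysconfig.get_platform()
```

On a 32-bit userland the first element of interpreter_summary() is 32, and a wheel whose platform tag says amd64 or x86_64 is not going to match.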

Well, how hard can it be? PyPI says the sources are on GitHub, and so I cloned that repo. Oh boy, AI at its finest. The thing pulls in a whopping 3.5 Gigabytes of who-knows-what. Oh, come on.

python setup.py build fails after a short while, complaining about missing typing_extensions. Manually running pip install typing_extensions fixes that. But I killed setup.py build after a few minutes when there were only 50/5719 files built. Has AI written that software?

In the meantime, I had gone to a machine with a 64 bit userland, and to be fair the experience wasn't too bad there, except for the hellish amount of dependencies that pytorch pulls in.

So, my expectations regarding “AI code” were by and large met in that second part of the adventure, including the little detail that the internal links on https://pypi.org/project/torch/ are broken because right now their document processor does not produce id attributes on the headlines. Yeah, I know, they're giving it all away for free and all that. But still, after the brief glimpse into the paradise of yesteryear's software that whisper.cpp afforded, this was a striking contrast.

The Crazy

So, I converted the German language model doing, in effect:

git clone https://github.com/openai/whisper.git
git lfs install
git clone https://huggingface.co/bofenghuang/whisper-small-cv11-german
python convert-h5-to-ggml.py whisper-small-cv11-german/ whisper tmp

(where I took convert-h5-to-ggml.py from whisper.cpp's repo). Then I moved the resulting tmp/ggml-model.bin to german-small.ggml and ran:

transcribe.py german-small.ggml peer_review_wie_objektiv_ist_das_wissenschaftliche_dlf_20221214_1646_8a93e930.mp3

with my script above and this German-language mp3 from Deutschlandfunk. From the English experience, I had expected to get an almost flawless transliteration of the German text. What I got instead was the following (paragraphs inserted by me; listen to the audio in parallel if you can):

Germany. Research is on [that was: Deutschlandfunk Forschung aktuell]

A Nobel Prize for Science is not easy without further ado. They really need to find something out. For example, Vernon Smith, who is now 95 years old, is now the father of the Experimental Economy. In 2002 he won the Nobel Prize for Science.

This made such a prize and renommee also make impression on other Fachleuteen and that actually influenced the unabhängig well-office method for scientific publications. This has recently shown a study of Business Science in the Fachmagazin PNS. Anike Meyer spoke with one of the authors.

When Jürgen Huber and his colleagues thought about the experiment, it was clear to them that this is not fair. The same manuscript was given by two different authors, Vernon Smith, who won a Nobel Prize and Sabio Inua, who still has no Doctor-titel, but has an African background. Who has better chances, to publish the study?

We have expected a certain effect for Nobel Prize-täger, but we were honestly surprised and shocked how big the effect is. Six times as often, the Doctor-titel was given the study without much work for publication, if it was the Nobel Prize-täger. Was the Doctor-titel as a given, two-thirds of the Good-achter could not be printed for real, the Nobel Prize-täger was less than a quarter of a recitation.

But that is humanly, but does not correspond to the objectivity, which has in science, says Professor for Finance at the University of Graz, Jürgen Huber.

The Good-achting of science works through other sciences, it is important for publications for research, for jobs, for who is a Professor etc. It is very important that we design this process as objectively as possible, so that the money, which is often used by public hand, can be used best, so that we have progress.

As a good-touch, so-called Peer Review, can be very different. The most common option in nature and economic science is simply blinded. The Good-achter's knowledge of who are the authors is not necessary. This process is not particularly objective, there are not many studies, which have shown that in recent years there are different examples.

The easiest conclusion would be, we must stay at least in the anonymized version of this good-touching of papers, because then there is no disadvantage for the unrecognized, no advantage for the known, so that it really works. But it is not enough, to remove the name from the deckboard. The whole manuscript must be written so that the Good-achter, the identity of the authors, cannot be guessed. And that is, in times of Internet and Preprince, especially in small research fields, quite difficult.

Studies have shown that double blinding, in less than half of the cases, is possible. Despite that, many researchers have in double blinded Review, the better alternative. In terms of professional magazines and magazines, usually there are three-quarters of participants for it. The scientists are very open for changes and say, it was extremely exciting, what you showed, we need more such studies. So, the science is, thank God, still, so to speak, open and healthy enough to learn.

In fact, the willingness of the publishers of magazine with alternative Good-achter-tests to experiment, is now significantly growing. Nature has been offering, since 2015, the possibility of, in itself, free will, to decide for a double-blinded Good-achter-test. IOP, a big publisher from the field of Physique and Engineering, was adopted in 2020. But with moderate success. Just twelve percent of the enrichment by Nature and nearly twenty-by-Iop, use the double-blind version.

The change in theory is simpler than in practice, and for that, it's also the Noble vs. Nobody-Studio by Jürgen Huber and his colleagues. It was unrecognisable, and afterwards, I'm very disappointed. Its Studio for Peer Review is neither double-blinded nor classic, but in a less-gray-sounding way, that some magazines offer, only for particularly renommated authors, such as Nobel-priestriker, about VIP-cursing. The authors are allowed to choose their Good-achter-tests themselves.

I'm not sure if many colleagues have said, "Hey, friend, you're using your status by yourself now, I can say to our defense, we've been through a review, but not any family and friends, they're actually experts, and it was also a really tedious review process, but yes, the point is, it's well-taken."

As an objective, the scientific well-behindersystem is an extension of Annicka Meyer.

As I had not expected a translation component within the language model, this immediately struck me as a miracle. The thing hears German and writes some vintage-2010 machine translation into English. I still stand in awe, although I start believing this could actually have happened by design.

Regrettably, it's not what I want – I'd like a German transliteration. Well: I suppose I'll have to do a bit more research for that. Can anyone help me out here? Is there a “small” German-to-German model out there somewhere?

Comment 1 on 2023-06-26 by ron

Hi, I just stumbled in here via a Google search, because I have been working with Georgi Gerganov's whisper implementation for weeks.

You wrote:

"Regrettably, it's not what I want – I'd like a German transliteration."

That is easy. You only need to pass '-l de' as a parameter when calling the program to get the transcription in German.

For example, calling:

./main -f samples/input.wav -m models/ggml-large.bin -l de -t 12 -pc -oj -ot 25000

produces a JSON file with the timestamps and the recognised text,
uses 12 threads for that,
colour-codes the token probabilities (in the console), and
starts reading input.wav only from second 25.
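For anyone scripting this, the call above can be assembled in Python along the following lines. This is only a sketch: the function name and its defaults are made up for illustration, and it uses nothing but the flags from the call quoted above. Note that -ot takes milliseconds, which is why second 25 shows up as 25000.

```python
def whisper_args(wav, model, lang="de", threads=12, offset_seconds=25):
    """Build the argument list for the ./main call quoted above."""
    return ["./main",
            "-f", wav,
            "-m", model,
            "-l", lang,           # spoken language, ISO 639-1 code
            "-t", str(threads),   # number of threads
            "-pc",                # colour-code token probabilities on the console
            "-oj",                # write a JSON file with timestamps and text
            "-ot", str(int(offset_seconds * 1000))]  # start offset in ms
```

The resulting list can be passed to subprocess.check_call as it is.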

The smaller models are of course all multilingual as well.

Tip: Compiling with CUBLAS=1 (see the GitHub page) works by now and roughly halves the transcription time for me.

(The language codes follow ISO 639-1, by the way.)

Have phun!

Comment 2 on 2023-06-26 by anselm

Oh wow. Those who read the docs clearly have the edge. And those who don't embarrass themselves in public (as I just did). I have taken up ron's hint and adapted my script as shown above. Now:

transcribe.py --lang de german-small.ggml peer_review_wie_objektiv_ist_das_wissenschaftliche_dlf_20221214_1646_8a93e930.mp3

yields the following very decent text (on my work machine a little faster than typing it up, but considerably less stressful; paragraphs again by me, and I have not tried to get the speakers annotated):

Deutschlandfunk Forschung aktuell

Ein Nobelpreis bekommen Forschende nicht einfach ohne Weiteres, sie müssen schon wirklich etwas Wegweisendes herausgefunden haben, sowie zum Beispiel Vernon Smith, der mittlerweile fünfundneunzigjährige gilt als Vater der Experimentalökonomie, 2002 erhielt er dafür den Nobelpreis. Dass solche Preise und Renommee auch bei anderen Fachleuten Eindruck machen und das eigentlich als unabhängig gehandelte Gutatter-Verfahren bei wissenschaftlichen Publikationen beeinflussen, das hat kürzlich eine Studie von Wirtschaftswissenschaftlern im Fachmagazin PNS gezeigt. Anike Meyer hat mit einem der Autoren gesprochen.

Als Jürgen Huber und seine Kollegen sich das Experiment überlegt hatten, war ihnen schon klar, das ist kein fairer Vergleich. Das selbe Manuskript wird von zwei unterschiedlichen Autoren eingereicht, Vernon Smith, der einen Nobelpreis gewonnen hat und Sabio Inua, den noch keinen Doktortitel hat, dafür aber einen afrikanischen Nachnamen.

Wer hat bessere Chancen, die Studie zu veröffentlichen?

Wir haben erwartet, einen gewissen Effekt für Nobelpreisträger wird schon geben, aber wir waren ehrlich gesagt extrem überrascht und ein Kleinbiss schockiert, wie groß der Effekt ist.

Sechsmal so oft empfallen Gutachter die Studie ohne größere Überarbeitung zur Veröffentlichung, wenn der Name des Nobelpreisträgers traufstand. War der Doktorant als Autor angegeben, hielten zwei Drittel der Gutachter das gleiche Manuskript gar nicht für würdig gedruckt zu werden, der Nobelpreisträger erhielt von weniger als einem Viertel einer Ablehnung.

Das ist menschlich, entspricht aber nicht dem Anspruch auf Objektivität, den die Wissenschaft hat, meint der Professor für Finanzwirtschaft an der Uni Graz, Jürgen Huber.

Die Begutachtung von wissenschaftlichen Arbeiten durch andere Wissenschaftler ist für Publikationen wichtig, für Forschungsgelder, dafür, wer Jobs bekommt, wer Professor wird usw.

Also das ist ganz wichtig, dass wir diesen Prozess so objektiv, so fair gestalten wie möglich, damit auch die Gelder, die er oft von der öffentlichen Handkommen bestmöglich verwendet werden, damit wir Fortschritte haben.

Wie genau eine Begutachtung ein sogenannter Peer Review abläuft, kann ganz unterschiedlich sein, die in Natur- und Wirtschaftswissenschaften gebräuchlichste Variante ist einfach verblendet, die Gutachter wissen, wer die Autoren sind, umgekehrt gilt es nicht.

Das dieses Verfahren nicht besonders objektiv ist, ist nichts Neues, zahlreiche Studien haben das in den letzten Jahren an unterschiedlichen Beispielen gezeigt. Sozusagen die einfachste Schlussfolgerung wäre, wir müssen zumindest bei der anonymisierten Version dieser Begutachtung von Papers bleiben, denn dann ist zumindest kein Nachteil für die Unbekannten, kein Vorteil für die Bekannten, damit es wirklich funktioniert, reicht es allerdings nicht, die Namen nur vom Deckblatt zu entfernen. Das ganze Manuskript muss so geschrieben sein, dass die Gutachter die Identität der Autoren nicht erraten können, und das ist in Zeiten von Internet und Preprince, gerade bei kleinen Forschungsfördern ziemlich schwierig, Studien haben ergeben, dass Doppelverblindung in weniger als der Hälfte der Fälle gelingt. Protzdem sehen viele Forscher im doppelt verblendeten Review die bessere Alternative, in Umfragen von Fachmagazin und Verlagen sind regelmäßig drei Viertel der Teilnehmende dafür.

Die Wissenschaftler sind sehr offen für Veränderungen und sagen, mal extrem spannend, was ihr dort zeigt, wir brauchen mehr solche Studien. Also die Wissenschaft ist, Gott sei Dank noch, sozusagen offen und gesund genug, zu lernen.

Tatsächlich ist gerade die Bereitschaft der Verleger von Fachzeitschriften mit alternativen Gutachter-Verfahren zu experimentieren, spätestens mit der Open Science-Bewegung deutlich gewachsen. Nature bietet seit 2015 die Möglichkeit an sich, freiwillig für ein doppelt verblendetes Gutachter-Verfahren zu entscheiden. Iop ein großer Verleger aus dem Bereich Physik und Ingenieurswesen ist 2020 nachgezogen, allerdings mit mäßigem Erfolg. Nur zwölf Prozent der Einreichung bei Nature und knapp zwanzig bei Iop nutzen die doppelt blind-Version.

Das Veränderung in der Theorie einfacher ist als in der Praxis, dafür ist auch die Nobel-versus-Nobody-Studie von Jürgen Huber und seinen Kollegen ein Beispiel.

Das war Unwissenheitig, im Nachhinein auch sehr bedauere.

Ihre Studie zum Peer Review ist weder doppelt noch klassisch einfach verblendet begutachtet worden, sondern in einem weniger gebräuchlichen Verfahren, das einige Zeitschriften nur für besonders renommierte Autoren anbieten, wie eben Nobelpreisträger, sozusagen über die VIP-Abkürzung. Dabei dürfen die Autoren ihre Gutachter selber aussuchen.

Ist mir nachher von vielen Kollegen gesagt worden, Herr Freund, jetzt nutzt ihr selber den Status bei es, ich kann zu unserer Verteidigung sagen, wir haben also als Review aber nicht irgendwelche Family und Friends vorgeschlagen, sind tatsächlich Experten, und es war auch ein richtiger Müsomer Review-Prozess, durch den wir gegangen sind, aber ja, der Punkt ist, ist well-taken, ja?

Wie objektiv ist das wissenschaftliche Gutachtersystem ein Beitrag von Anneke Meyer.

This is seriously useful – nowhere near perfect, of course, but good enough that a quick revision will do. A larger model would probably reduce the rate of spelling errors, but even so: I suppose you will be reading more generous quotes from DLF here in the future.
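To put a number on that error rate, one could compare the raw transcript against the hand-corrected version word by word. The following is a rough sketch of the usual word error rate (word-level edit distance divided by the length of the reference); the function name is mine, and the example further down rests on my assumption that “Protzdem” in the transcript should have been “Trotzdem”.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between the texts, divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j]: edit distance between the reference words seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

With the assumed correction, word_error_rate("Trotzdem sehen viele Forscher", "Protzdem sehen viele Forscher") counts one wrong word out of four.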

Thanks for the hint, ron!

