Tag Software

  • Browsing Peace and Privacy With dnsmasq

    Screenshot of the dnsmasq extra configuration page in freetz

    You can even have the DNS-based adblocking discussed here in your whole network if your router runs dnsmasq (it probably does) and you can edit its configuration (you probably can't). As shown here, with freetz you can.

    I'm not a big fan of in-browser adblocking. For one, I have my doubts about several of the extensions – Adblock Plus, for instance, comes from a for-profit company, though I grant that this critique may be partisan. Also, I like to switch browsers freely and certainly don't want to maintain block lists for each of them. And finally, quite a few clients other than browsers may render HTML and hence ads.

    At least with the pages I want (and don't want) to read, there's a much lighter alternative: DNS-based adblocking. You see, on the relatively few commercial pages I occasionally have reason to visit, ads, tracking pixels, and nasty javascript typically are served from a rather small set of domains – doubleclick.net, googleadservices.com, and a few more like these. If I can make my computer resolve these names to 127.0.0.1 – that is, my computer in IPv4, or yours, if you type that address – everything your browser would pull from these servers is instantly gone in everything rendering HTML.

    So, how do you do that? Well, you first make sure that your computer does the name resolution itself[1]. On Debian, you do that by installing the packages resolvconf (without a second e; in a systemd environment I think you want to use systemd-resolved instead) and dnsmasq; that's really all, and that ought to work out of the box in all reasonably common situations:

    $ sudo apt install resolvconf dnsmasq

    You will probably have to bring your network down and up again for this to take effect.

    Once that's done, you can tell dnsmasq what names to resolve to what. The man page dnsmasq(8) documents what to do under the --address option – you could actually configure dnsmasq through command line options exclusively – where you can read:

    -A, --address=/<domain>[/<domain>...]/[<ipaddr>]

    Specify an IP address to return for any host in the given domains. […] A common use of this is to redirect the entire doubleclick.net domain to some friendly local web server to avoid banner ads. The domain specification works in the same was [sic, as of bullseye] as for --server […]

    – and from the documentation of --server you learn that <domain> is interpreted as a suffix (if you will), such that if you give an address for, say, google.com, it will also be used for foo.google.com or foo.bar.google.com.

    But where do these address expressions go? Well, at least in Debian, dnsmasq will read (essentially, see the README in there) any file you drop into /etc/dnsmasq.d and add its content to its configuration. Having configuration snippets in different files really helps maintenance and dist-upgrades in general; in this case, it also helps distributing the blacklist, as extra configuration that may be inappropriate on a different host is kept in some other file.

    I tend to prefix snippet names with numbers in case order might one day matter. So, I have a file /etc/dnsmasq.d/10spamreduce.conf containing:
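    A minimal version of such a snippet, covering just the domains mentioned in this post (an illustration only – extend the list to taste):

```
# answer 127.0.0.1 for these domains and all their subdomains
address=/doubleclick.net/127.0.0.1
address=/googleadservices.com/127.0.0.1
address=/fonts.gstatic.com/127.0.0.1
address=/facebook.com/127.0.0.1
```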


    When you do the same thing, you should restart dnsmasq and then see the effect like this:

    $ sudo service dnsmasq restart
    $ dig +short fonts.gstatic.com

    As you can see, I have also included some trackers and other sources of annoyance in my address list. Of course, if you actually want to read Facebook (ugh) or need to pull Google's fonts (ughugh), you'll have to adapt that list a bit.

    In case you have interesting and useful contributions to this list: Please do write in!

    [1]Regrettably, with things like DNS over HTTPS, it could be that your browser actually will not use your computer's DNS resolver. Adblocking hence is one extra reason to disable DoH when you see it.
  • Work-Life Balance and Privacy with Bash, D-Bus, gajim and ifupdown

    A small screenshot showing an offline icon

    Sunday morning: my gajim is automatically offline. This post explains how I'm doing that.

    I still consider XMPP the open standard for “chat” (well, instant messaging), and I have been using Psi as an XMPP client for almost 20 years now. However, since Psi has occasionally crashed on me recently (as in: at least since Bullseye), presumably on receiving some message, I consider it a certainty that it is remotely exploitable. Given its large codebase, I don't think I want to fix whatever is wrong myself, and I don't think there are still people maintaining Psi.

    I therefore migrated to gajim last week; after all, one of the nice things about open standards is that there are usually multiple implementations. This, however, made me update an ancient hack that automatically manages my status so that I'm XMPP-offline when it's nobody's business whether or not my machine is on.

    In this post, I'd like to tell you how that works, hoping it may be useful to solve other (but similar; for instance: get offline when doing talks) problems, too.

    Not Always Online

    First off, the major reason I'm not much of a fan of synchronous messaging (which IM is, and email is not) is that it requires some sort of “presence” notification: something needs to know whether I am online, and where I can be reached. At least in XMPP, additionally all your contacts get to know that, too.[1]

    While I admit that can be useful at times, during the night and on weekends I really don't want to publish when my computer is on and when it's not. Hence I have so far told my Psi, and am now telling my gajim, not to automatically re-connect on weekends or between 20:00 and 7:00. That I can specify this perhaps somewhat unique preference at all illustrates how great it is to have shell integration everywhere. The ingredients are:

    • ifupdown, Debian's native network management. If you're using systemd or NetworkManager or something, I think these use other hooks [if you've tried it, let me know so I can update this].
    • D-Bus, a framework to communicate between programs sitting on a common X11 display (though with gajim, D-Bus becomes somewhat hidden).
    • the shell, which lets you write little ad-hoc programlets and duct-tape together all the small utilities that have accumulated in Unix since the early 1970s (here: logger, date, and egrep).

    Inter-Process Communication with D-Bus

    The first thing I want to do is take gajim offline before a network interface goes down. That way, people don't have to wait for timeouts to see I am unavailable (unless someone pulls the cable or the Wifi disappears – without a network, gajim can't sign off). That means I have to control a running gajim from the outside, and the standard way to do that these days is through D-Bus, a nifty, if somewhat over-complicated, way of calling functions within programs from other programs.

    One of these other programs is qdbus, which lets you inspect what listens on your session's (or, with an option, the system's) D-Bus and what functions you can call where. For instance:

    $ qdbus org.gajim.Gajim /org/gajim/Gajim
    method void org.gtk.Actions.SetState(QString action_name, QDBusVariant value, QVariantMap platform_data)

    In Psi, with a bit of fiddling, a generic D-Bus tool was enough to switch the state. Since there's a QDBusVariant in the arguments gajim's SetState method wants according to the qdbus output, I don't think I could get away with that after the migration – qdbus does not seem to be able to generate that kind of argument.

    Enter gajim-remote

    But gajim comes with a D-Bus wrapper of its own, gajim-remote, and with that, you can run something like:

    gajim-remote change_status offline

    Except that won't work out of the box. That's because gajim comes with remote control disabled by default.

    To enable it, go to Preferences → Advanced, click Advanced Configuration Editor there, and then look for the remote_control configuration item and switch it on. I have no idea why they've hidden that eminently useful setting so well.

    Anyway, once you've done that, you should be able to change your status with the command above and:

    gajim-remote change_status online

    ifupdown's Hooks

    I now need to arrange for these commands to be executed when network interfaces go up and down. These days, it would probably be smart to go all the way and run a little daemon listening to D-Bus events, but let me be a bit less high-tech, because last time I looked, something like that required actual and non-trivial programming.

    In contrast, if you are using ifupdown to manage your machine's network interfaces (and I think you should), all it takes is a bit of shell scripting. That's because ifupdown executes the scripts in /etc/network/if-up.d once a connection is up, and the ones in /etc/network/if-down.d before it brings a connection down in a controlled fashion. These scripts see a few environment variables that tell them what's going on (see interfaces(5) for a full list), the most important of which are IFACE (the name of the interface being operated on), and MODE, which would be start or stop, depending on what ifupdown is doing.

    The idea is to execute my change_status commands from these scripts. To make that a bit more manageable, I have a common script for both if-up.d and if-down.d. I have created a new subdirectory /etc/network/scripts for such shared ifupdown scripts, and I have placed the following file in there as jabber:

    # State management of gajim: one script shared between if-up.d and if-down.d
    DESKTOP_USER=anselmf
    logger Jabber: $MODE $IFACE $LOGICAL
    case $MODE in
    start)
      case $IFACE in
      eth* | wlan* | n900)
        # stay offline except Mon-Fri between 07:00 and 19:59
        if ! date +'%w/%H' | grep '[1-5]/\(0[789]\|1[0-9]\)' > /dev/null; then
          exit 0
        fi
        su - $DESKTOP_USER -c 'DISPLAY=:0 gajim-remote change_status online "Got net"' > /dev/null || exit 0
        ;;
      esac
      ;;
    stop)
      case $IFACE in
      eth* | wlan* | n900)
        if [ tonline = "t`su $DESKTOP_USER -c 'DISPLAY=:0 gajim-remote get_status'`" ]; then
          su - $DESKTOP_USER -c "DISPLAY=:0 gajim-remote change_status offline 'Losing network'" || exit 0
          sleep 0.5
        fi
        ;;
      esac
      ;;
    esac
    After chmod +x-ing this file, I made symbolic links like this:

    ln -s /etc/network/scripts/jabber /etc/network/if-down.d/
    ln -s /etc/network/scripts/jabber /etc/network/if-up.d/

    – and that should basically be it (once you configure DESKTOP_USER).

    Addendum (2023-12-02)

    Let me admit that this never really worked terribly well with gajim, mainly because – I think – its connections don't time out, and so once a status update has failed for one reason or another, gajim would be in a sort of catatonic state. That's one of the reasons I switched to pidgin, whose state management in turn broke when upgrading to Debian bookworm. My current script is near the bottom of this December 2023 post.

    Debugging Admin Scripts

    Because it is a mouthful, let me comment a bit about what is going on:

    logger Jabber: $MODE $IFACE $LOGICAL

    logger is a useful program for when you have scripts started deep within the bowels of your system. It writes messages to syslog, which effectively lets you do printf debugging of your scripts. Once everything works for a script like this, you probably want to comment the logger lines out.

    Note that while developing scripts of this kind, it is usually better to just get a normal shell, set the environment variables (or pass the arguments) that you may have obtained through logger, and then run them interactively, possibly with a -x option (print all statements executed) passed to sh. For instance:

    $ MODE=start IFACE=wlan0 sh -x /etc/network/scripts/jabber
    + DESKTOP_USER=anselmf
    + logger Jabber: start wlan0
    + case $MODE in
    + case $IFACE in
    + date +%w/%H
    + grep '[1-5]/\(0[789]\|1[0-9]\)'
    + exit 0

    – that way, you see exactly what commands are executed, and you don't have to continually watch /var/log/syslog (or journalctl if that's what you have), not to mention (for instance) bring network interfaces up and down all the time.

    Case Statements in Bourne's Legacy

    The main control structure in the script is:

    case $MODE in

    Case statements are one of the more powerful features of descendants of the Bourne shell. Read about them in the excellent ABS in case you are a bit mystified by the odd syntax and the critically important ;; lines.

    The particular case construct here is there so I can use the same script for if-up.d and if-down.d: it dispatches on whatever is in MODE. In case MODE is something other than start or stop, we silently do nothing. That is not always a good idea – programs failing without complaint are a major reason for the lack of hair on my head – but since this isn't really user-callable, it's probably acceptable behaviour.

    General rule of thumb, though: Be wary of case .. esac without a *) (which gives commands executed when nothing …
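    To make the point concrete, here is a toy dispatcher with such a catch-all branch (function name and messages invented for this sketch):

```shell
# A toy version of the MODE dispatch with an explicit default branch.
dispatch() {
    case "$1" in
    start)
        echo "announcing presence" ;;
    stop)
        echo "announcing absence" ;;
    *)
        echo "dispatch: unexpected mode '$1'" >&2
        return 1 ;;
    esac
}
dispatch start
```

    With the *) branch in place, a typo in MODE at least leaves a complaint on stderr instead of silently doing nothing.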

  • BahnBonus Without a Google Id and on Your Own Computer

    Screenshot: a colourful app screen with little information and a donation appeal. It is the Bahn's BahnBonus app.

    Object of desire: the BahnBonus app that will get me back into the DB lounges. Entirely without Apple, and with no more than a simple overdose of Google.

    About a year ago, I made a grand confession: yes, I participate in one of those stupid, snooping customer loyalty programmes, and of all things in the Bahn's, where frequent travellers cosily sip cocoa in armchairs while ordinary passengers freeze outside on the platforms or have to squabble over the few seats in the station buildings: formerly bahn.comfort, now BahnBonus.

    In the post cited above, I mourned the lounge access, because for about a year now the Bahn has only let people into the lounges who use remotely administered computers (“smartphones”), and rather new ones at that. Instead of the old plastic card, it now takes a, ahem, app. That app (as I now know and suspected back then) does not do much more than display the login screen of the Bahn's web site and then generate links to QR codes. Probably somewhat naively, I hoped at the time that the Bahn would put the few lines of Javascript this takes onto their normal web page, too.

    That did not happen. When the Bahn recently sent me BahnBonus paper advertising (“You have gold status!”), I once again wrote a mail to Bahn support asking how things stood with QR codes on the web page. Once again, the answer was an unexplained no. Since the Bahn did, however, enclose quite a few vouchers (in particular for lounge access) with that negative answer, I read it as “no, we'll never do that, even if it is easy”. Maybe it is about collecting data, maybe it is simply corporate policy.

    Anyway: if I want to sip cocoa in the warmth again, I somehow need lasting access to those QR codes. Remotely administered computers are an option for me at best in virtual machines, and so I thought to myself: let me try whether I can get the BahnBonus app running on my normal computer, too.

    Turns out: that works, and if you let Google romp about in a VM, even with acceptable effort. Let me write down what it took for me; it may help a little with other digital coercions as well.

    Setting up Android in QEMU

    For the following steps, I assume a Debian (bullseye) on an Intel or AMD system that is not substantially older than 15 years. In principle, though, almost everything should also work on any other platform that runs Qemu.

    If you skid anywhere in the following steps, please let me know – I will gladly extend this tale so that it also speaks to people who are not excessively nerdy.

    (1) Install Qemu – Qemu is, in the first place, an emulator of all kinds of hardware. But since Android is enormously resource-hungry (well: by my standards), everything would be terribly slow if the Android code were not executed directly by the CPU in your machine – so I will use Qemu as a virtualiser and only very much secondarily as an emulator. In any case, make sure your qemu can do KVM. In return, you only need the amd64 flavour, not all the other architectures, and in particular not ARM. On Bullseye, something like this should do:

    apt install qemu-system-gui qemu-system-amd64

    [I myself, out of stinginess, took qemu-system-x86 at this point; that works, too, and makes everything a bit more compact].

    (2) Get Android-x86 – I frankly admit that I did not bother much about the trustworthiness of the people behind the port of Android to x86 processors. I simply downloaded a suitable ISO image from their FOSSHUB page (a crapicity of 10 raises hopes); if you have installed the amd64 Qemu, you now want the “64-bit ISO file”.

    (3) Create a container for the Android file system – your Android has to store its files somewhere, and you certainly do not want to give it access to your real file system. So, create a “virtual” hard disk for the Qemu data. One gigabyte will not do, not even on i386. If you are not worried about disk space, better build one with four gigabytes right away (4G at the end of the command line).

    Also, pick some place where a lump of that size does not hurt badly. I will use ~/containers here (which you should then probably take out of your backup):

    mkdir -p ~/containers
    qemu-img create -f qcow2 ~/containers/android.img 2G


    Now there is the problem that your future Android has to send its screen output somewhere. Qemu can render into an ordinary X window, but that is – for reasons I have not investigated – terribly slow. What worked well for me: VNC. If you cannot get along with that, try QDISPLAY="-display gtk" below (which just might be dead slow).

    (4) Start the Android installer – this takes a few options so that the thing both gets onto the network and finds the two necessary files (the virtual disk and the Android installer):

    QDISPLAY="-display vnc=localhost:0"
    qemu-system-amd64 $QDISPLAY -enable-kvm -m 2000 \
      -net nic -net user -drive file=$HOME/containers/android.img,format=qcow2 \
      -boot d -cdrom /media/downloads/android-x86-9.0-r2.iso

    You will quite certainly have to adapt the path in the -cdrom option so that it points to the ISO you just downloaded. Now let a VNC client loose on localhost:5900; these days, I recommend remmina (from the Debian package of the same name).[1]

    (5) Configure the Android container – select Installation to Hard disk, then Create/Modify Devices. You end up in a good old text-based partitioner. As the disk label, you do not want GPT (because that later gives trouble with the boot loader GRUB). The storage you are partitioning there is what you created in step 3. Put the entire “disk” into one partition, say Write (don't worry: with the options above, you cannot break any data of yours), and then Quit.

    You then get back into the Android installer. After an Ok, you can select the file system – take ext4.

    The installer then asks whether you want a GRUB – yes, you do, or else your Android will later only come up with a lot of trouble.

    You probably do not want the System Directory read/write (unless you seriously intend to play with the Android). That saves quite a bit of space.

    (6) Let Android onto the network – at this point, the installer should offer to start Android-x86. Do that. Select a language – I left it at “English (United States)”.

    It may happen (it happened to me) that the thing crashes after the language dialog and is back at the installer's Grub prompt. If that happens, terminate the qemu (that is: Control-C in its window) and look below under starting the VM for the command line that brings up the Qemu without the installer. We are dealing with commercial software here, so booting until it is healthy is an entirely legitimate option.
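    In case that later section is not at hand, a plausible reconstruction of that command line – simply the invocation from step 4 without the -boot d and -cdrom options – would be:

```
QDISPLAY="-display vnc=localhost:0"
qemu-system-amd64 $QDISPLAY -enable-kvm -m 2000 \
  -net nic -net user -drive file=$HOME/containers/android.img,format=qcow2
```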

    In any case: after the language selection, the thing wants to get onto the network, and I am afraid there is no point in forbidding it that. So, scan for networks. You should see exactly one, VirtWifi or some such. Select it, sigh, and meanwhile have a tcpdump -n running on your real machine to see what all your new Android is chatting with (cf. Die Wunden lecken).

    The “Checking for Updates” burned 100% CPU for minutes on end at my end (one does not even want to know what there is to compute in it). Since I generally do not find the “I am currently doing something” feedback in the emulated Android particularly great, you might spend the time bringing the CPU load display on your desktop up to scratch (my tip: wmcore).

    Then Android asks whether it can pull in your data from somewhere. Sure: Google would like that. Fortunately, there is a small “Don't Copy” button. The Skip button in the next dialog, the Google sign-in, is just as small, and Google nags once more, extra, when you choose it. Choose it anyway. Date and Time, for a change, can be waved through without concern; then comes a dialog on “Google Services”, all of which you have to switch off manually.

    That, apparently, is the user-friendliness (“user experience”) whose lack in the free software world I keep hearing so much about. I don't believe you can get around accepting that Google may put stuff onto the VM always and at any time. But then, that's what it is a VM for.

    I find the following “Protect your Tablet” dialog interesting, because the user guidance that just now wanted to foist trust in Google on me now sows mistrust against other people, and that with a second extra admonition dialog on top if I don't feel like device PINs. Honestly: when I am dealing with Google, I am not going to worry about human thieves…

    I do not understand the concluding question about the home app either. Just do something. With that, you are in the Android business.

    Apps Without an App Store

    (7) Tidy up the home screen – if you want to clean up the “home screen” right away: long-press an icon and drag it. A “Remove” field appears, onto which you can drag the icon. Best do that with everything except the Chrome. We will need that in a moment. The revolting Google bar cannot, I believe, be removed by these means. Why bother, anyway – as you have just acknowledged, the container belongs to Google in any case.

    (8) Find the Bahn app – as far as I know, the Bahn does not publish APKs (i.e., packages) of its app. Hence, you will have to …

  • Speech Recognition with Whisper.cpp

    Today I stumbled across Whispers of A.I.'s Modular Future by James Somers, a piece that, at least by the standards of publications aimed at the general public, makes an excellent point of why whisper.cpp might finally be some useful and non-patronising output of the current AI hype.

    What can I say? I think I'm sold. And perhaps I'm now a little bit scared, too. If you want to understand why and speak a bit of German, you can skip to The Crazy right away.

    The Good

    You know, so far I've ignored most of the current statistical modelling (“AI”, “Machine Learning”) – if you need a graphics chip with drivers even worse than Intel's, and that then needs 8 GB of video RAM before anything works, I'm out. And I'm also out when the only way I can use some software is on some web page because there's proprietary data behind it.

    Not so for whisper.cpp. This is software as it was meant to be: trivial dependencies, compact, works on basically any hardware there is. To build it, you just run:

    git clone https://github.com/ggerganov/whisper.cpp/
    cd whisper.cpp
    make

    – and that's it. No dependency juggling down to incompatible micro versions, no fancy build system, just a few C(++) sources and a Makefile. The thing works in place without a hitch, and it has a sensible command line interface.

    Well, you need the language models, of course. There are some reasonably free ones for English. The whisper.cpp distribution's models/README.md explains how to obtain some. I got myself ggml-small.en.bin, recorded a few words of English into a file zw.wav and ran:

    ./main -m models/ggml-small.en.bin ~/zw.wav

    The machine demanded I use a samplerate of 16 kHz, I made audacity oblige, ran the thing again and was blown away when – admittedly after a surprisingly long time – my words appeared on the screen.

    I immediately tried to figure out how to stream in data but then quickly decided that's probably not worth the effort; the software needs to see words in context, and for what I plan to do – transcribing radio shows – having an intermediate WAV file really does not hurt.

    I quickly cobbled together a piece of Python wrapping the conversion (using the perennial classic of audio processing, sox) somewhat cleverly, like this:

    # A quick hack to transcribe audio files
    # Dependencies:
    # * sox (would be mpv, but that's somehow broken)
    # * a build of whispercpp (https://github.com/ggerganov/whisper.cpp/)
    # * a language model (see models/README.md in the whisper source)
    import contextlib
    import os
    import subprocess
    import sys
    import tempfile

    WHISPER_DIR = "/usr/src/whisper.cpp"


    @contextlib.contextmanager
    def workdir(wd):
            """runs the controlled block within wd, restoring the previous
            working directory on exit."""
            prev_dir = os.getcwd()
            try:
                    os.chdir(wd)
                    yield
            finally:
                    os.chdir(prev_dir)


    def transcribe(audio_source, model, lang):
            """transcribes an audio file, creating an in-place .txt.

            model must be the name of a model file in WHISPER_DIR/models;
            lang is the ISO language code in which the output should turn up.
            """
            audio_source = os.path.join(os.getcwd(), audio_source)
            with tempfile.TemporaryDirectory(suffix="transcribe", dir="/var/tmp") as wd:
                    with workdir(wd):
                            # whisper.cpp wants 16 bit/16 kHz/mono; have sox make that
                            subprocess.check_call(["sox", audio_source,
                                    "-b", "16", "-r", "16000", "-c", "1",
                                    "audiodump.wav"])
                            out_name = os.path.splitext(audio_source)[0]
                            subprocess.check_call([WHISPER_DIR+"/main",
                                    "-l", lang,
                                    "-m", WHISPER_DIR+"/models/"+model,
                                    "-otxt", "-of", out_name,
                                    "audiodump.wav"])


    def parse_command_line():
            import argparse
            parser = argparse.ArgumentParser(description="Wrap whisper.cpp to"
                    " bulk-transcribe audio files.")
            parser.add_argument("model", type=str, help="name of ggml language"
                    f" model to use, relative to {WHISPER_DIR}/models")
            parser.add_argument("audios", type=str, nargs="+",
                    help="Sox-translatable audio file to transliterate.")
            parser.add_argument("--lang", type=str, default="en",
                    help="Spoken language to try and recognise")
            return parser.parse_args()


    if __name__=="__main__":
            args = parse_command_line()
            for audio in args.audios:
                    transcribe(audio, args.model, args.lang)

    Addendum (2023-06-26)

    (Added a --lang option as per ron's feedback below)

    I have that as transcribe.py in my path, and I can now enter the rip of an audiobook and say:

    transcribe.py ggml-small.en.bin *.ogg

    (provided I have downloaded the model as per whisper.cpp's instructions). After a little while (with high CPU usage), there is a transcript on my disk that's better than what I had typed myself even after two rounds of proof-reading, except that whisper.cpp doesn't get the paragraphs right.

    For the first time in the current AI hype, I start getting carried away, in particular when I consider how much speech recognition sucked when I last played with it around 2003, using a heap of sorry failure called viavoice.

    The Bad

    Skip the rant to get to the exciting part.

    Trouble is: What I'd mainly like to transcribe is German radio, and whisper.cpp does not come with a German language model. Not to worry, one would think, as whisper.cpp comes with conversion scripts for the PyTorch-based whisper models like those one can get from Hugging Face. I downloaded what I think is the model file and cheerfully ran:

    $ python convert-h5-to-ggml.py /media/downloads/model.bin
    Traceback (most recent call last):
      File "/home/src/whisper.cpp/models/convert-h5-to-ggml.py", line 24, in <module>
        import torch
    ModuleNotFoundError: No module named 'torch'

    Oh bummer. Well, how hard can it be? Turns out: surprisingly hard. There is no pytorch package in Debian stable. Ah… I only realised much later that there is; it's just that my main system still has an i386 userland, and pytorch is only available for amd64. But I hadn't figured that out then. So, I enabled a virtual python environment (never mix your system python and pip) and ran:

    $ pip install torch
    ERROR: Could not find a version that satisfies the requirement torch
    ERROR: No matching distribution found for torch

    Huh? What's that? I ran pip with a couple of -v sprinkled in, which at least yielded:

    Skipping link: none of the wheel's tags match: cp38-cp38-win_amd64: https://download.pytorch.org/whl/cpu/torch-1.9.0%2Bcpu-cp38-cp38-win_amd64.whl (from https://download.pytorch.org/whl/cpu/torch/)
    Given no hashes to check 0 links for project 'torch': discarding no candidates
    ERROR: Could not find a version that satisfies the requirement torch
    ERROR: No matching distribution found for torch

    The message with “Given no” has a certain lyric quality, but other than that, from the “Skipping” messages I concluded they don't have 32 bit builds any more.

    Well, how hard can it be? Pypi says the sources are on github, and so I cloned that repo. Oh boy, AI at its finest. The thing pulls in a whopping 3.5 Gigabytes of who-knows-what. Oh, come on.

    python setup.py build fails after a short while, complaining about missing typing_extensions. Manually running pip install typing_extensions fixes that. But I killed setup.py build after a few minutes when there were only 50/5719 files built. Has AI written that software?

    In the meantime, I had gone to a machine with a 64 bit userland, and to be fair the experience wasn't too bad there, except for the hellish amount of dependencies that pytorch pulls in.

    So, my expectations regarding “AI code” were by and large met in that second part of the adventure, including the little detail that the internal links on https://pypi.org/project/torch/ are broken because right now their document processor does not produce id attributes on the headlines. Yeah, I know, they're giving it all away for free and all that. But still, after the brief glimpse into the paradise of yesteryear's software that whisper.cpp afforded, this was a striking contrast.

    The Crazy

    So, I converted the German language model doing, in effect:

    git clone https://github.com/openai/whisper.git
    git lfs install
    git clone https://huggingface.co/bofenghuang/whisper-small-cv11-german
    python convert-h5-to-ggml.py whisper-small-cv11-german/ whisper tmp

    (where I took convert-h5-to-ggml.py from whisper.cpp's repo). Then I moved the resulting tmp/ggml-model.bin to german-small.ggml and ran:

    transcribe.py german-small.ggml peer_review_wie_objektiv_ist_das_wissenschaftliche_dlf_20221214_1646_8a93e930.mp3

    with my script above and this German-language mp3 from Deutschlandfunk. From the English experience, I had expected to get an almost flawless transliteration of the German text. What I got instead was (paragraphs inserted by me); listen to the audio in parallel if you can:

    Germany. Research is on [that was: Deutschlandfunk Forschung aktuell]

    A Nobel Prize for Science is not easy without further ado. They really need to find something out. For example, Vernon Smith, who is now 95 years old, is now the father of the Experimental Economy. In 2002 he won the Nobel Prize for Science.

    This made such a prize and renommee also make impression on other Fachleuteen and that actually influenced the unabhängig well-office method for scientific publications. This has recently shown a study of Business Science in the Fachmagazin PNS. Anike Meyer spoke with one of the authors.

    When Jürgen Huber and his colleagues thought about the experiment, it was clear to them that this is not fair. The same manuscript was given by two different authors, Vernon …

  • Trailing blanks, vim and git

    Trailing blanks may be␣␣␣␣␣
    evil when git displays diffs.␣␣␣␣␣␣␣
    Time to remove them.

    I'm currently going through a major transition on my main machine in that I have configured my vim to strip trailing blanks, that is, to automatically remove space characters (as in U+0020) immediately before the ends of lines[1].

    Why do I do this? I suppose it mainly started with PEP 8, a style guide for Python source code which says trailing whitespace is evil. It has a point, but I have to say trailing whitespace really became a problem only when style checkers started rejecting trailing blanks, which then made all kinds of tools – including other peoples' editors – automatically strip trailing whitespace.

    That, in turn, causes the diffs coming out of version control systems to inflate, usually without anyone – neither the people leaving the trailing whitespace nor the ones whose tools remove them – actually wanting that. And well, I tackled this about now because I was fed up with humongous continuous integration runs failing at the very end because they found a blank at the end of some source file.

    So, while I can't say I'm convinced trailing whitespace actually is as evil as all that, I still have to stomp it out to preserve everyone's nerves.

    Configuring vim to replace trailing blanks with nothing when saving files is relatively straightforward (at least if you're willing to accept a cursor jump now and then). The internet is full of guides explaining what to do to just about any depth and sophistication.

    Me, I am using a variant of a venerable vintage 2010 recipe that uses an extra function to preserve the state over a search/replace operation to avoid jumping cursors. What I particularly like about it is that the Preserve function may come in handy in other contexts, too:

    function! Preserve(command)
      " run command without changing vim's internal state (much)
      let _s=@/
      let prevpos = getcurpos()
      execute a:command
      let @/=_s
      call cursor(prevpos[1], prevpos[2])
    endfunction

    au BufWritePre * if !&binary | call Preserve("%s/  *$//e") | endif

    That is now in my ~/.vimrc.

    But I still have all the repositories containing files having trailing blanks. To keep their histories comprehensible, I want to remove all trailing blanks in one commit and have that commit only do these whitespace fixes. The trouble is that even with version control (that lets you back out of overzealous edits) you will want to be careful what files you change. Strip trailing blanks in a (more or less) binary file and you will probably break that file.

    So, here is what I do to fix trailing blanks in files that need it while leaving alone the ones that would break, using this blog's VCS (about) as an example:

    1. In preparation, make sure you have committed all other changes. Bulk operations are dangerous, and you may want to roll back everything in case of a fateful typo. Also, you don't want to pollute some other, meaningful commit with all the whitespace noise.

    2. In the root of the repository, look for candidate files containing trailing blanks, combining find and grep like this:

      find . -type f | xargs grep -l ' $'

      A brief reminder what's going on here: grep -l just lists file names with matches of the regular expression, ' $' is a regular expression matching a blank at the end of a line; xargs is a brilliant program reading command line arguments for the program named in its arguments from stdin, and the find invocation prints all names of actual files (as opposed to directories) below the current directory.

      It may be preferable to use some grep with built-in find functionality (I sometimes use ripgrep), but if I can make do with basic GNU or even better POSIX, I do, because that's something that's on many boxes rather reliably.

      The price to pay in this particular case: this recipe won't work if you have blanks in your file names (using -print0 in find and -0 in xargs would fix things here, but then the next step would break). Do yourself a favour and don't have blanks in your filenames. Having dashes in them looks-better-anyway: it makes you look like a die-hard-LISP-person.

    3. Now use egrep -v to filter file names, building patterns of names to ignore and process later, respectively. For instance, depending on your VCS, you will usually have lots of matches in .git or .svn or whatever, and most of these will break when you touch them (not to mention they won't spoil your history anyway). Conversely, it is clear that I want to strip trailing blanks on ReStructuredText files. My two patterns now grow in two separate egrep calls, one for files I don't want to look at, the other for files I will want to strip trailing blanks in:

      find . -type f |\
        egrep -v '\.git' |\
        egrep -v '\.rst$' | xargs grep -l ' $'

      This prints a much smaller list of names of files for which I have not yet decided whether or not to strip them.

    4. Repeat this: On the side of files I shouldn't touch, I spot some names ending in .jpeg, .png, and .db. On the side of files that need processing, I notice .html, .css, and .py. So, my next iteration is:

      find . -type f |\
        egrep -v '\.git|\.(jpeg|png|db)$' |\
        egrep -v '\.(rst|html|css|py)$' |\
        xargs grep -l ' $'

      That's a still smaller list of file names, among which I spot the index files used by my search engine in .xapian_db, .pyc files used by Python, and a vim .swp file. On the other hand I do want to process some files without an extension, so my next search command ends up as:

      find . -type f |\
        egrep -v '\.git|\.xapian_db|\.(jpeg|png|db|pyc|swp)$' |\
        egrep -v 'README|build-one|\.(rst|html|css|py)$' |\
        xargs grep -l ' $'

      That's it – this only leaves a few files as undecided, and I can quickly eyeball their names to ascertain I do not want to touch them. My second pattern now describes the set of files that I want to strip trailing blanks from.

    5. Stripping trailing blanks is easily done from the command line with sed and its in-place (-i) option: sed -i 's/  *$//' <file1> <file2>...[2]. The file names I can produce with find alone, because at least GNU find supports the extended regular expressions I have just produced in my patterns; it needs a -regextype option to correctly interpret them, though:

      find . -regextype egrep -regex 'README|build-one|.*\.(rst|html|css|py)$' |\
        xargs grep -l ' $'

      The advantage of using find alone over simply inverting the egrep (by dropping the -v) is that my gut feeling is the likelihood of false positives slipping through is lower this way. However, contrary to the egrep above, find's -regex needs to match the entire file name, and so I need the .* before my pattern of extensions, and editing REs might very well produce false positives to begin with… Ah well.

      Have a last look at the list and then run the in-place sed:

      find . -regextype egrep -regex 'README|build-one|.*\.(rst|html|css|py)$' |\
        xargs grep -l ' $' |\
        xargs sed -i 's/  *$//'
    6. Skim the output of git diff (or svn diff or whatever). Using the blacklist built above, you can see whether you have indeed removed trailing whitespace from files you wanted to process:

      find . -type f |\
        egrep -v '\.git|\.xapian_db|\.(jpeg|png|db|pyc|swp)$' |\
        xargs grep -l ' $'

      If these checks have given you some confidence that the trailing blanks have vanished and nothing else has been damaged, commit with a comment stressing that only whitespace has been changed. Then take a deep breath before tackling the next repo in this way.
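
    If you would rather script the whole cycle than chain find, egrep, and sed, here is a rough Python sketch of the same recipe; it also sidesteps the blanks-in-filenames problem mentioned above. The two regular expressions are the ones worked out for this blog's repository and would of course need adapting elsewhere:

```python
import os
import re

# The skip/process patterns from the worked example above; adapt per repo.
SKIP = re.compile(r'\.git|\.xapian_db|\.(jpeg|png|db|pyc|swp)$')
STRIP = re.compile(r'README|build-one|.*\.(rst|html|css|py)$')

def strip_trailing_blanks(root="."):
    """Remove trailing blanks from files matching STRIP but not SKIP."""
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            # note: like find -regex, STRIP must match the whole path
            if SKIP.search(path) or not STRIP.fullmatch(path):
                continue
            with open(path, encoding="utf-8") as f:
                text = f.read()
            cleaned = re.sub(r" +$", "", text, flags=re.M)
            if cleaned != text:
                with open(path, "w", encoding="utf-8") as f:
                    f.write(cleaned)
```

    As with the sed recipe, run this only on a repository with everything else committed, and skim the diff afterwards.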

    [1]This post assumes your sed and you agree on what marks the end of the line. Given it's been quite a while since I've last had to think about CRs or CRLFs, it would seem that's far less of a problem these days than it used to be.
    [2]Incidentally, that's a nice example for why I was so hesitant about stripping white space for all these years: Imagine some edits make it so a line break sneaks in between sed -i 's/ and *$//'. Then both blanks that are there are gone, and even if the text is reflowed again later, it will still be broken (though not catastrophically so in this particular case).
  • My First Libreoffice Macro in Python

    Screenshot: a libreoffice window with the string "Libreoffice" selected, behind a browser window with a wikipedia search result for libreoffice.

    This is what I was after: Immediate Wikipedia search from within Libreoffice. In the document, you can also see the result of a Python dir() as produced by the inspect macro discussed below.

    While I still believe the creation of „office“ software was one of the more fateful turns in the history of computing and this would be a much better world if there hadn't been VisiCalc and WordStar, not to mention all the software they spun off, I do get asked about Libreoffice quite a bit by people I have helped to get off of Windows.

    The other day one of them said: „You know, wouldn't it be nifty if I could mark a term, hit F3 and then that'd do a Wikipedia search for that term?“ I could sympathise with that, since the little one-line CLI I have on my desktop has a function pretty much for this, too. That program, however, probably is too mean for people using Libreoffice. But tomorrow is that guy's birthday, and so I thought: how hard can it be to teach this to Libreoffice directly?

    Turns out it's harder (for me) than I thought. Which is why I'm writing this post: perhaps a few people will find it among all the partially outdated or (to me) not terribly helpful material. I think I'd have appreciated a post like this before I started investigating Libreoffice's world of macros.

    In particular, I'd have liked the reassuring words: „There's no reason to fiddle with the odd dialect of BASIC that's built in, and there's no reason either to use the odd IDE they have.” The way things were, I did fiddle around with both until I couldn't seem to find a way to open a URL from within StarBasic or whatever that thing is called today. At that point I noticed that all it takes for Python support is the installation of a single small package. In addition, for all I can see the BASIC variant has just about as much relevant documentation as the Python API. So… let's use the latter.


    (a) To enable Python macros in libreoffice version 7 (as in Debian bullseye), you have to install the libreoffice-script-provider-python package.

    (b) The extensions go into a directory deep within your XDG .config. So, create and enter this directory:

    mkdir ~/.config/libreoffice/4/user/Scripts/python/
    cd ~/.config/libreoffice/4/user/Scripts/python/

    I'm calling this directory the script path below.

    Figuring out Libreoffice's API

    My main problem with this little project has been that I could not figure out Libreoffice's macro-related documentation. The least confusing material still seems to be maintained by openoffice (rather than libreoffice), and what I ended up doing was using Python introspection to discover attribute names and then entering the more promising ones into the search box of the openoffice wiki. I strongly suspect that's not how it's meant to work. If you know about better ways: please drop me a note and I will do an update here.

    But: How do you introspect given these macros do not (easily) have a stdout, and there seems to be no support for the Python debugger either?

    Based on an example from openoffice, I figured out that to Libreoffice, macros written in Python are just functions in Python modules in the script path, and that the basic entry point to libreoffice is through a global variable that the libreoffice runtime tricks into the interpreter's namespace, namely XSCRIPTCONTEXT. With this realisation, the example code, and some guessing I came up with this (save into the script path as introspect.py):

    def introspect():
        desktop = XSCRIPTCONTEXT.getDesktop()
        model = desktop.getCurrentComponent()
        text = getattr(model, "Text", None)
        if not text:
            # We're not in writer
            return
        text.End.String = str(dir(model))

    If all goes well, this will append a string representation of dir(model) to the end of the document, and in this way you can look at just about any part of the API – perhaps a bit clumsily, but well enough.

    But first, run Python itself on your new module to make sure there are no syntax errors:

    python introspect.py

    That is important because if Python cannot parse your module, you will not find the function in the next step, which is linking your function to a key stroke.

    To do that, in libreoffice, create a new text document and do Tools → Customize from the menu. Then, in Range, you should find LibreOffice Macros, then My Macros, and in there the introspect module you just created. Click it, and you should be able to select introspect (or whatever function name you picked) under Function. Then, select the key F4 in the upper part of this dialog and click Modify[1].

    After you did that, you can hit F4, and you will see all attributes that the result of getCurrentComponent has. As I said, pasting some of these attribute names into the search box on openoffice's wiki has helped, and of course you can further introspect the values of all these attributes. Thankfully, libreoffice auto-reloads modules, and so traversing all these various objects in this way is relatively interactive.

    I freely admit that I have also used this text.End.String = trick to printf-debug when I did the next steps.

    The Wikipedia-Opening Script

    These next steps included figuring out the CurrentSelection object and, in particular, resisting the temptation to get its Text attribute (which points to its parent, the whole document). Instead, use the String attribute to retrieve what the user has selected. The rest is standard python fare with a dash of what I suppose is cargo-culting on my end (the supportsService thing seeing whether I can subscript the selection; I lifted that from another example I ran into on the openoffice wiki):

    import webbrowser
    from urllib.parse import quote as urlquote

    def search_in_wikipedia():
        """Do a wikipedia search for the current selection"""
        desktop = XSCRIPTCONTEXT.getDesktop()
        model = desktop.getCurrentComponent()
        sel = model.CurrentSelection
        if sel.supportsService("com.sun.star.text.TextRanges"):
            selected = sel[0].String.strip()
            if selected:
                webbrowser.open_new_tab(
                    "https://de.wikipedia.org/w/index.php?search="
                    + urlquote(selected))

    For all I can see, the Wikipedia search URI is the same across the instances modulo the authority part – so, replace the de in the URL with the code for whatever language you (or persons you need a birthday present for) prefer. Then, save it to the script path and bind it to a function key as before.
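
    To play with that URI pattern outside of Libreoffice, you could use a little helper like the following (a sketch of mine; the function name and the lang parameter are not part of the macro above):

```python
from urllib.parse import quote as urlquote

def wikipedia_search_url(term, lang="de"):
    """Build a Wikipedia full-text search URL; swap lang for your language code."""
    return "https://{}.wikipedia.org/w/index.php?search={}".format(
        lang, urlquote(term))

print(wikipedia_search_url("peer review", "en"))
# -> https://en.wikipedia.org/w/index.php?search=peer%20review
```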

    I'm sure this can (and should) be made a whole lot more robust with a bit more actual Libreoffice-fu. But it seems to work nicely enough. And me, I have a very personal (in a somewhat twisted sense) birthday present.

    [1]I always find that describing operations in GUIs tends to sound like incomprehensible origami instructions. Is there a term for that kind of language already? Gooeynese, perhaps?
  • SPARQL 4: Be Extra Careful on your Birthday

    A Yak staring at the photographer

    Enough Yak Shaving already.

    I am now three parts into my yak-shaving investigation of Wikidata and SPARQL (one, two, three) with the goal of figuring out whether birthdays are more or less dangerous – measured by whether or not people survive them – than ordinary days.

    At the end of the last instalment I was stumped by timeouts on Wikidata, and so much of this post is about massaging my query so that I get some answers in the CPU minute that one gets on Wikidata's triplestore. While plain optimisation certainly is necessary, working on six million people within a minute is probably highly ambitious whatever I do[1]. I will hence have to work on a subset. Given what I have worked out in part 3,

    # This will time out and just waste CPU.
    SELECT (count(?person) AS ?n)
    WHERE {
      ?person rdfs:label ?personName.
      ?person wdt:P569 ?bdate.
      hint:Prior hint:rangeSafe true.
      ?person wdt:P570 ?ddate.
      hint:Prior hint:rangeSafe true.
      FILTER (MONTH(?bdate)>1 || DAY(?bdate)>1)
      FILTER (MONTH(?bdate) = MONTH(?ddate)
        && DAY(?bdate) = DAY(?ddate)
        && YEAR(?bdate) != YEAR(?ddate))
      FILTER (YEAR(?bdate)>1850)
      FILTER (lang(?personName)="en")
    }

    – how do I do a subset? Proper sampling takes almost as much time as working with the whole thing. But for now, I'd be content with just evaluating my query on whatever subset Wikidata likes to work on. Drawing such a (statistically badly sampled) subset is what the LIMIT clause you have already seen in part one does. But where do I put it, since, if things work out, my query would only return a single row anyway?

    Subqueries in SPARQL, and 50 Labels per Entity

    The answer, as in SQL, is: A subquery. In SPARQL, you can have subqueries like this:

    SELECT (count(?person) as ?n)
    WHERE {
      { SELECT ?person ?personName ?bdate ?ddate
        WHERE {
          ?person rdfs:label ?personName;
            wdt:P569 ?bdate;
          wdt:P570 ?ddate.
        }
        LIMIT 50
      }
      FILTER (lang(?personName)="en")
    }

    – so, within a pair of curly braces, you write another SELECT clause, and its result is then the input for another WHERE, FILTER, or other SPARQL construct. In this case, by the way, I'm getting 50 records with all kinds of labels in the subquery and then filter out everything that's not English. Amazingly, only one record out of these 50 remains: there are clearly at least 50 statements on labels for the first entity Wikidata has picked here.

    Raising the inner limit to 500, I get 10 records. For the particular sample that Wikidata chose for me, a person indeed has 50 labels on average. Wow. Raising the limit to 5000, which probably lowers the ratio of pharaohs to non-pharaohs in the sample, gives 130 records, which translates into 38 labels per person.
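
    These labels-per-person figures are just the inner limit (label triples pulled) divided by the records surviving the English-only filter:

```python
# inner LIMIT -> records left after FILTER(lang(...)="en"),
# i.e. the numbers from the experiments described above
samples = {50: 1, 500: 10, 5000: 130}

for limit, records in samples.items():
    print(limit, "->", round(limit / records, 1), "labels per person")
```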

    Clearly, adding the labels is an expensive operation, and since I do not need them for counting, I will drop them henceforth. Also, I am doing my filtering for off-January 1st birthdays in the subquery. In this way, I probably have a good chance that everything coming out of the subquery actually counts in the outer filter, which means I can compute the rate of people dying on their birthday by dividing my count by the limit.

    Let's see where this gets us:

    SELECT (count(?person) AS ?n)
    WHERE {
      { SELECT ?person ?bdate ?ddate
        WHERE {
          ?person wdt:P569 ?bdate.
          hint:Prior hint:rangeSafe true.
          ?person wdt:P570 ?ddate.
          FILTER (MONTH(?bdate)>1 || DAY(?bdate)>1)
          FILTER (YEAR(?bdate)>1850)
          hint:Prior hint:rangeSafe true.
        }
        LIMIT 500
      }
      FILTER (MONTH(?bdate) = MONTH(?ddate)
        && DAY(?bdate) = DAY(?ddate)
        && YEAR(?bdate) != YEAR(?ddate))
      hint:Prior hint:rangeSafe true.
    }

    Named Subqueries and Planner Barriers

    That's returning a two for me, which is not implausible, but for just 500 records it ran for about 20 seconds, which does not feel right. Neither pulling the 500 records nor filtering them should take that long.

    When a database engine takes a lot longer than one thinks it should, the thing to do is to take a look at the query plan, in which the database engine states in which sequence it will compute the result, which indexes it intends to use, etc.

    Working out a good query plan is hard, because in general you need to know the various partial results to find one; in my example, for instance, the system could first filter for everyone born after 1850 and then find the death dates for them, or it could first match the death dates to the birthdays (discarding everything that does not have a death day in the process) and then filter for 1850. If there were many people with birthdays but no death days (for instance, because your database talks mostly about living people), the second approach might be a lot faster. But you, that is, the database engine, have to have good statistics to figure that out.
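
    To make that concrete, here is a toy cost model in Python; all row counts are entirely invented, and real planners of course reckon with much subtler statistics:

```python
# Invented row counts for the two join orders discussed above
n_birth = 6_000_000      # rows with a birth date
n_death = 300_000        # of those, rows that also have a death date
n_late  = 5_000_000      # birth dates after 1850

# Plan 1: filter on YEAR first, then look up death dates for all survivors
cost_filter_first = n_birth + n_late
# Plan 2: join birth to death dates first (most rows drop out), then filter
cost_join_first = n_birth + n_death

print(cost_filter_first, cost_join_first)
```

    With these (made-up) numbers, joining first wins by almost a factor of two, simply because most rows have no death date and drop out early.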

    Since that is such a hard problem, it is not uncommon that the query planner gets confused and re-orders operations such that things are a lot slower than they would be if it hadn't done the ordering, and that's when one should use some sort of explain feature (cf. Wikidata's optimisation hints). On Wikidata, you can add an explain=detail parameter to the query and then somehow display the bunch of HTML you get back.

    I did that and, as I usually do when I try this kind of thing, found that query plans are notoriously hard to understand, in particular when one is unfamiliar with the database engine. But in the process of figuring out the explain thing, I had discovered that SPARQL has the equivalent of SQL's common table expressions (CTEs), which gave me an excuse to tinker rather than think about plans. Who could resist that?

    In SPARQL, CTEs are called named subqueries and used like this:

    SELECT (count(?person) AS ?n)
    WITH { SELECT ?person ?bdate ?ddate
        WHERE {
          ?person wdt:P569 ?bdate.
          hint:Prior hint:rangeSafe true.
          ?person wdt:P570 ?ddate.
          hint:Prior hint:rangeSafe true.
          FILTER (MONTH(?bdate)>1 || DAY(?bdate)>1)
          FILTER (YEAR(?bdate)>1850)
        }
        LIMIT 30000
      } AS %selection
    WHERE {
      INCLUDE %selection
      FILTER (MONTH(?bdate) = MONTH(?ddate)
        && DAY(?bdate) = DAY(?ddate)
        && YEAR(?bdate) != YEAR(?ddate))
      hint:Prior hint:rangeSafe true.
    }

    – you write your subquery in a WITH-block and give it a name that you then INCLUDE in your WHERE clause. In several SQL database systems, such a construct provides a planner barrier, that is, the planner will not rearrange expressions across a WITH.

    So does, according to the optimisation hints, Wikidata's current SPARQL engine. But disappointingly, things don't speed up. Hmpf. Even so, I consider named subexpressions more readable than nested ones[2], so for this post, I will stay with them. In case you come up with a brilliant all-Wikidata query, you probably want to go back to inline subqueries, though, because then you probably do not want to constrain the planner too much.

    Finally: Numbers. But what do they Mean?

    With my barrier-protected subqueries, I have definitely given up on working with all 6 million persons with birthdays within Wikidata. Interestingly, I could go from a limit of 500 to one of 30'000 and stay within the time limit. I never went back to try and figure out what causes this odd scaling law, though I'd probably learn a lot if I did. I'd almost certainly learn even more if I tried to understand why with a limit of 50'000, the queries tended to time out. But then 30'000 persons are plenty for my purpose provided they are drawn reasonably randomly, and hence I skipped all the tempting opportunities to learn.

    And, ta-da: With the above query, I get 139 matches (i.e., people who died on their birthdays).

    What does that say on the danger of birthdays? Well, let's see: If birthdays were just like other days, one would expect 30'000/365 deaths on birthdays in this sample, which works out to a bit more than 80. Is the 140 I am finding different from that 80 taking into account statistical noise? A good rule of thumb (that in the end is related to the grand central limit theorem) is that when you count n samples, your random error is something like √n if everything is benevolent. For 140, that square root is about 12, which we physicist-hackers like to write as σ = 12, and then we quickly divide the offset (i.e., 140 − 80 = 60) by that σ and triumphantly proclaim that “we've got a 5-σ effect”, at which point everyone is convinced that birthdays are life-threatening.
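
    Spelled out with the exact numbers rather than the rounded ones, the back-of-envelope estimate goes:

```python
from math import sqrt

sample_size = 30_000
observed = 139                   # people dying on their birthday

expected = sample_size / 365     # if birthdays were ordinary days
sigma = sqrt(observed)           # Poissonian rule of thumb: error ~ sqrt(n)
significance = (observed - expected) / sigma

print(round(expected), round(sigma), round(significance, 1))
# -> 82 12 4.8, i.e. "a bit more than 80", sigma = 12, roughly a 5-sigma effect
```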

    This is related to the normal distribution (“Gauss curve”) that has about 2/3s of its area within “one σ” (which is its width at half maximum and would be the standard deviation of something you draw from such a distribution), 95% of …

  • SPARQL 3: Who Died on their Birthdays?

    Many Yaks grazing on a mountain meadow

    A lot of Yak Shaving left to do here.

    Now that I have learned how to figure out dates of birth and death in Wikidata and have made myself sensible tools to submit queries, I can finally start to figure out how I can let Wikidata pick out people dying on the same day of the year they were born on, just like Mary Lea Heger.

    I first fetched people with their birthdays and dates of death:

    SELECT ?person ?bday ?dday
    WHERE {
      ?person wdt:P569 ?bday.
      ?person wdt:P570 ?dday.
    }
    LIMIT 2

    Consider that query for a while and appreciate that by giving two triple patterns using the same variable, ?person, I am performing what in SQL databases would be a join. That that's such a natural thing in SPARQL is, I'd say, a clear strong point for the language.

    Here is what I got back when I ran this through my wpd shell function from part two:


    This seems to work, except the dates look a bit odd. Did these people really die more than a hundred years before they were born? Ah, wait: these are negative dates. Opening the person URIs as per part one in a browser, one learns that Q18722 is pharaoh Senusret II, and at least his birthday clearly is… not very certain. If these kinds of estimates are common, I probably should exclude January 1st from my considerations.

    Getting Labels

    But the first thing I wanted at that point was to not have to click on the URIs to see names. I knew enough about RDF to simply try and get labels according to RDF schema:

    SELECT ?personName ?bday ?dday
    WHERE {
      ?person rdfs:label ?personName.
      ?person wdt:P569 ?bday.
      ?person wdt:P570 ?dday.
    }
    LIMIT 10

    That's another SQL join, by the way. Nice. Except what comes back is this:

    personName=Huang Di
    personName=ኋንግ ዲ
    personName=هوان جي دي
    personName=Emperador mariellu

    If you select the URI in ?person in addition to just the name, you'll see that we now got many records per person. Actually, one per label, as is to be expected in a proper join, and since there are so many languages and scripts out there, many persons in Wikidata have many labels.

    At this point I consulted Bob DuCharme's Learning SPARQL and figured a filter on the language of the label is what I wanted. This does not call for a further join (i.e., triple pattern), as the language is something like a property of the object literal (i.e., the string) itself. To retrieve it, there is a function determining the language, used with a FILTER clause like this:

    SELECT ?personName ?bday ?dday
    WHERE {
      ?person rdfs:label ?personName.
      ?person wdt:P569 ?bday.
      ?person wdt:P570 ?dday.
      FILTER (lang(?personName)="en")
    }
    LIMIT 10

    FILTER is a generic SPARQL thing that is more like a SQL WHERE clause than SPARQL's WHERE clause itself. We will be using it a lot more below.

    There is a nice alternative to this kind of joining and filtering I would have found in the wikidata user manual had I read it in time. You see, SPARQL also defines a service interface, and that then allows queriers to mix and match SPARQL-speaking services within a query. Wikidata has many uses for that, and one is a service that can automatically add labels with minimal effort. You just declare that you want that service to filter your query, and then you write ?varLabel to get labels instead of URIs for ?var, as in:

    # don't run this.  It'll put load on Wikidata and then time out
    SELECT ?personLabel ?bday ?dday
    WHERE {
      ?person wdt:P569 ?bday.
      ?person wdt:P570 ?dday.
      SERVICE wikibase:label {
        bd:serviceParam wikibase:language "en" .
      }
    }
    LIMIT 10

    The trouble with that: This would first pull all result triples (a couple of million) out of wikidata and then hand over these triples to the wikibase:label service, which would then add the labels and hand back all the labelled records. Only then will the engine discard all result rows but the first 10. That obviously is terribly inefficient, and Wikidata will kill this query after a minute.

    So: Be careful with SERVICE when intermediate result sets could be large. I later had to use subqueries anyway; I could have used them here, too, to work around the millions-of-triples problem, but at that point I hadn't progressed to these subqueries, let alone means to defeat the planner (that's part four material).

    Determining Day and Month

    Turtle (about my preference for which you could already read in part two) has a nifty abbreviation where you can put a semicolon after a triple and then leave out the subject in the next triple; the parser will then re-use the previous subject. That works in SPARQL, too, so I can write my query above with fewer keystrokes:

    SELECT ?personName ?bday ?dday
    WHERE {
      ?person rdfs:label ?personName;
        wdt:P569 ?bday;
        wdt:P570 ?dday.
      FILTER (lang(?personName)="en")
    }
    LIMIT 10

    Now I need to figure out the birthday, as in: date within a year. In Bob DuCharme's book I found a SUBSTR function, and BIND clauses that let you compute values and bind them to variables. Treating the dates as ISO strings (“YYYY-MM-DD“) should let me pull out month and day starting at index 6 (gna: SPARQL starts with index 1 rather than with 0 as all deities in known history wanted), and then five characters, no? Let's see:

    SELECT ?personName ?bdate ?ddate
     WHERE {
       ?person rdfs:label ?personName;
         wdt:P569 ?bday;
         wdt:P570 ?dday.
       BIND (SUBSTR(?bday, 6, 5) as ?bdate)
       BIND (SUBSTR(?dday, 6, 5) as ?ddate)
       FILTER (lang(?personName)="en")
     }
     LIMIT 3

    This gives:

    personName=Sobekhotep I
    personName=Amenemhat I
    personName=Senusret II

    Well, that is a failure. And that's because my assumptions on string indices are wrong in general, that is: for people born before the Christian era, and then again for people born after 9999-12-31. Which makes me wonder: Does Wikidata have people born in the future? Well, see below.
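
    You can see the problem by slicing (made-up, but Wikidata-style) date literals in Python, whose 0-based s[5:10] corresponds to SPARQL's 1-based SUBSTR(s, 6, 5):

```python
modern = "1952-03-11T00:00:00Z"    # an ordinary date literal
ancient = "-1881-01-01T00:00:00Z"  # an invented pharaonic estimate, note the sign

# Python's s[5:10] is SPARQL's SUBSTR(s, 6, 5)
print(modern[5:10])    # prints 03-11 – month and day, as hoped
print(ancient[5:10])   # prints -01-0 – the leading minus shifts everything
```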

    I'll cheat a bit and spare you a few of the dead ends I actually followed trying to fix this, because they are not very educational. Also, when I struggled with the timeouts I will discuss in a moment I learned about the right way to do this on Wikidata's optimisation page: When something is a date, you can apply the functions DAY, MONTH, and YEAR on it, and that will plausibly even use indexes and thus be a lot faster.

    Thinking about YEAR and having seen the fantasy dates for the ancient Egyptian pharaohs, I also decided to exclude everyone before 1850; that ought to suffice for letting me forget about Gregorian vs. Julian dates, and the likelihood that the dates are more or less right ought to be a lot higher in those days than in the 14th century, say.

    With that, I can write the “birth day equals death day” condition in a filter without further qualms. The resulting query is starting to look imposing:

    SELECT ?person ?personName ?bdate ?ddate
    WHERE {
      ?person rdfs:label ?personName.
      ?person wdt:P569 ?bdate.
      hint:Prior hint:rangeSafe true.
      ?person wdt:P570 ?ddate.
      hint:Prior hint:rangeSafe true.
      FILTER (MONTH(?bdate) = MONTH(?ddate)
        && DAY(?bdate) = DAY(?ddate))
      FILTER (YEAR(?bdate)>1850)
      FILTER (lang(?personName)="en")
    }
    LIMIT 10

    The odd triples with hint:Prior are hints for Wikidata's triple store that, or so I understand Wikidata's documentation, encourage it to use indexes on the properties mentioned in the previous lines; the query certainly works without those, and I frankly have not bothered to see if they actually do anything at all for the present queries. There. Accuse me of practising Cargo Cult if you want.

    Anyway, the result is looking as awful as I had expected given my first impressions with January 1st, and clearly, ensuring birthdays after 1850 is not enough to filter out junk:

    personName=Enrico Mezzasalma
    personName=Peter Jonsson
    personName=Galceran Marquet

    It seems Wikidata even uses 2000-01-01 as some sort of NULL value. Dang. Let's filter out January 1st, then:

    SELECT ?person ?personName ?bdate ?ddate
    WHERE {
      ?person rdfs:label …
  • SPARQL 2: Improvising a client

    A Yak on a mountain path, watching the observer

    There is still a lot of hair on the Yak I am shaving in this little series of posts on SPARQL. All the Yaks shown in the series lived on the Valüla Mountain in Vorarlberg, Austria.

    This picks up my story on figuring out whether birthdays are dangerous using SPARQL on Wikidata. You can probably skip this part if you're only interested in writing SPARQL queries to Wikidata and are happy with the browser form they give you. But you shouldn't. On both counts.

    At the end of part one, I, for one, was unhappy about the Javascript-based UI at Wikidata and had decided I wanted a user interface that would let me edit my queries in a proper editor (in particular, locally on my machine, giving me the freedom to choose my tooling).

    My browser's web inspector quickly showed me that the non-Javascript web UI simply sent a query argument to https://query.wikidata.org/sparql. That's easy to do using curl, except I want to read the argument from a file (that is, the one I am editing in my vi). Helpfully, curl's man page informs on the --form option:

    This enables uploading of binary files etc. To force the 'content' part to be a file, prefix the file name with an @ sign. To just get the content part from a file, prefix the file name with the symbol <. The difference between @ and < is then that @ makes a file get attached in the post as a file upload, while the < makes a text field and just get the contents for that text field from a file.

    Uploads, Multipart, Urlencoded, Oh My!

    In this case, Wikidata probably does not expect actual uploads in the query argument (and the form does not submit it in this way), so < it ought to be.

    To try it, I put:

    SELECT ?p ?o
    WHERE {
      wd:Q937 ?p ?o.
    }
    LIMIT 5

    (the query for everything Wikidata says about Albert Einstein, plus a LIMIT clause so I only pull five triples, both to reduce load on Wikidata and to reduce clutter in my terminal while experimenting) into a file einstein.rq. And then I typed:

    curl --form 'query=<einstein.rq' https://query.wikidata.org/sparql

    into my shell. Soberingly, this gives:

    Not writable.

    Huh? I was not trying to write anything, was I? Well, who knows: Curl, in its man page, says that using --form does a POST with a media type of multipart/form-data, which many web components (mistakenly, I would argue) take as a file upload. Perhaps the remote machinery shares this misconception?

    Going back to the source of https://query.wikidata.org/, it turns out the form there does a GET, and the query parameter hence does not get uploaded in a POST but rather appended to the URL. Appending to the URL isn't trivial with curl (I think), but curl's --data option at least POSTs the parameters in application/x-www-form-urlencoded, which is what browsers do when you don't have uploads. It can read from files, too, using @<filename>. Let's try that:

    curl --data query=@einstein.rq https://query.wikidata.org/sparql

    Oh bother. That returns a lengthy message with about a ton of Java traceback and an error message in its core:

    org.openrdf.query.MalformedQueryException: Encountered " <LANGTAG> "@einstein "" at line 1, column 1.
    Was expecting one of:
        "base" ...
        "prefix" ...
        "select" ...
        "construct" ...
        "describe" ...
        "ask" ...

    Huh? Apparently, my query was malformed? Helpfully, Wikidata says what query it saw: queryStr=@einstein.rq. So, curl did not make good on its promise of putting in the contents of einstein.rq. Reading the man page again, this time properly, I have to admit I should have expected that: “if you start the data with the letter @”, it says there (emphasis mine). But haven't I regularly put in query parameters in this way in the past?

    Sure I did, but I was using the --data-urlencode option, which is what actually simulates a browser and has a slightly different syntax again:

    curl --data-urlencode query@einstein.rq https://query.wikidata.org/sparql

    Ha! That does the trick. What comes back is a bunch of XML, starting with:

    <sparql xmlns='http://www.w3.org/2005/sparql-results#'>
        <variable name='p'/>
        <variable name='o'/>
          <binding name='p'>
          <binding name='o'>
            <literal datatype='http://www.w3.org/2001/XMLSchema#integer'>1692345626</literal>
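    As an aside, what --data-urlencode (mimicking a browser) actually did to the query parameter can be reproduced with Python's standard library (my illustration, not part of the original workflow):

```python
# My aside: the application/x-www-form-urlencoded encoding a
# browser (and curl's --data-urlencode) applies to form fields
# when there are no file uploads involved.
from urllib.parse import urlencode

body = urlencode({"query": "SELECT ?p ?o WHERE { wd:Q937 ?p ?o. } LIMIT 5"})
print(body)
# query=SELECT+%3Fp+%3Fo+WHERE+%7B+wd%3AQ937+%3Fp+%3Fo.+%7D+LIMIT+5
```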

    Making the Output Friendlier: Turtle?

    Hm. That's not nice to read. I thought: Well, there's Turtle, a nice way to write RDF triples in plain text. In RDF land, people rather regularly support the HTTP accept header, a wildly underused and cool feature of HTTP that lets a client say what kind of data it would like to get (see Content negotiation in the Wikipedia). So, I thought, perhaps I can tell Wikidata to produce Turtle using accept?

    This plan looks like this when translated to curl:

    curl --header "accept: text/turtle" \
      --data-urlencode query@einstein.rq https://query.wikidata.org/sparql

    Only, the output does not change: Wikidata ignores my request.

    Thinking about it again, Wikidata is well advised to do so (though it could have produced a 406 Not Acceptable response, but that would probably be even less useful). The most important thing to remember from part one is that RDF talks about triples of subject, predicate, and object. In SPARQL, you have a SELECT clause, which means a result row in general will not consist of subject, predicate, and object. Hence, the service couldn't possibly return results in Turtle: what does not consist of RDF triples cannot be serialised as RDF triples.

    Making the Output Friendlier: XSLT!

    But then what do I do instead to improve result readability? For quick and (relatively) easy XML manipulation on the command line, I almost always recommend xmlstarlet. I'll grant that its man page has ample room for improvement, and the command line options of xmlstarlet sel (use its -h option for explanations) are somewhat obscure, but compared to writing XSL stylesheets it is compact, and it just works.

    If you inspect the response from Wikidata, you will notice that the results come in result elements, which for every variable in your SELECT clause have one binding element, which in turn has a name attribute and then some sort of value in its content; for now, I'll settle for fetching either uri or literal (again, part one has a bit more on what that might mean). What I need to tell xmlstarlet thus is: “Look for all result elements and produce one output record per such element. Within each, make a name/value pair from a binding's name attribute and any uri or literal element you find.” In code, I furthermore need to add an XML prefix definition (that's totally orthogonal to RDF prefixes). With the original curl and a pipe, this results in:

    curl --data-urlencode query@einstein.rq https://query.wikidata.org/sparql \
    | xmlstarlet sel -T -N s="http://www.w3.org/2005/sparql-results#" -t \
      -m //s:result --nl -m s:binding -v @name -o = -v s:uri -v s:literal --nl

    Phewy. I told you xmlstarlet sel had a bit of an obscure command line. I certainly don't want to type that every time I run a query. Saving keystrokes that are largely constant across multiple command invocations is what shell aliases are for, or, because this one would be a bit long and fiddly, shell functions. Hence, I put the following into my ~/.aliases (which is being read by the shell in most distributions, I think; in case of doubt, ~/.bashrc would work whenever you use bash):

    function wdq() {
      curl -s --data-urlencode "query@$1" https://query.wikidata.org/sparql \
      | xmlstarlet sel -T -N s="http://www.w3.org/2005/sparql-results#" -t \
        -m //s:result --nl -m s:binding -v @name -o = -v s:uri -v s:literal --nl
    }

    (notice the $1 instead of the constant file name here). With an exec bash – my preferred way to get a shell reflecting the current startup scripts –, I can now type:

    wdq einstein.rq | less

    and get a nicely paged output like:

    o=भौतिकशास्त्रातील नोबेल पारितोषिकविजेता शास्त्रज्ञ.

    We will look at how to filter out descriptions in languages one can't read, let alone speak, in the next instalment.
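    If you would rather avoid xmlstarlet, the same name=value extraction can be sketched with Python's standard library (my alternative sketch, not something the post uses; the sample XML is abridged from a response like the one above):

```python
# My alternative sketch (not from the post): the same name=value
# extraction the xmlstarlet pipeline performs, done in Python.
import xml.etree.ElementTree as ET

NS = {"s": "http://www.w3.org/2005/sparql-results#"}

sample = """<sparql xmlns='http://www.w3.org/2005/sparql-results#'>
  <results><result>
    <binding name='o'><literal>1692345626</literal></binding>
  </result></results>
</sparql>"""

pairs = []
for result in ET.fromstring(sample).iterfind(".//s:result", NS):
    for binding in result.iterfind("s:binding", NS):
        value = binding.find("s:uri", NS)       # resources...
        if value is None:
            value = binding.find("s:literal", NS)  # ...or literals
        pairs.append(f"{binding.get('name')}={value.text}")

print(pairs)  # ['o=1692345626']
```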

    For now, I'm reasonably happy with this, except of course I'll get many queries wrong initially, and then Wikidata does not return XML at all. In that case, xmlstarlet produces nothing but an unhelpful error message of its own, because it …

  • SPARQL and Wikidata 1: Setting out

    Yaks toddling along a mountain path

    If you continue, you will read about a first-rate example of Yak Shaving

    While listening to a short biography of the astrophysicist Mary Lea Heger (my story; sorry, in German), I learned that she died on her birthday. That made me wonder: How common is that? Are people prone to die on their birthdays, perhaps because the parties are so strenuous, perhaps because they consider them a landmark that they are so determined to reach that they hold on to dear life until they have reached it? Or are they perhaps less likely to die because all that attention strengthens their spirits?

    I figured that could be a nice question for Wikidata, a semantic database that feeds Wikipedia with all kinds of semi-linguistic or numeric information. Even if you are not Wikipedia, you can run fairly complex queries against it using a language called SPARQL. I've always wanted to play with that, in particular because SPARQL seems an interesting piece of tech. Answering the question of the lethality of birthdays turned out to be a mildly exciting (in a somewhat nerdy sense) journey, and I thought my story of how I did my first steps with SPARQL might be suitably entertaining.

    Since it is a relatively long story, I will split it up into a few instalments. This first part relates a few preliminaries and then does the first few (and very simple) queries. The preliminaries are mainly introducing the design of (parts of) RDF with my take on why people built it like that.

    Basics: RDF in a few paragraphs

    For motivating the Resource Description Format RDF and why people bother with it, I couldn't possibly do better than Norman Gray in his witty piece on Jordan, Jordan and Jordan. For immediate applicability, Wikidata's User Manual is hard to beat.

    But if you're in a hurry, you can get by with remembering that within RDF, information is represented in triples of (subject, predicate, object). This is somewhat reminiscent of a natural-language sentence, although the “predicate” typically would be a full verb phrase, possibly with a few prepositions sprinkled in for good measure. Things typically serving as predicates are called “properties” in RDF land, and the first example for those defined in Wikidata, P10[1], would be something like has-a-video-about-it-at or so, as in:

    "Christer Fuglesang", P10, http://commons.wikimedia.org/wiki/Special:FilePath/Christer%20Fuglesang%20en.webm
    "Christer Fuglesang", P10, http://commons.wikimedia.org/wiki/Special:FilePath/Christer%20Fuglesang%20ru.webm

    If you know about first order logic: It's that kind of predicate. And please don't ask who Christer Fuglesang is, the triples just came up first in a query you will learn about a bit further down.

    This was a bit of a simplification, because RDF will not usually refer to a thing (an “entity“ in RDF jargon) with a string (“literal”), if only because there could be multiple Christer Fuglesangs and a computer would be at a loss as to which one I mean in the two triples above. RDF instead talks about “resources“, which is anything that has a URI and encompasses both entities and properties. So, a statement as above would actually combine three URIs:

    http://www.wikidata.org/entity/Q317382, http://www.wikidata.org/prop/direct/P10, http://commons.wikimedia.org/wiki/Special:FilePath/Christer%20Fuglesang%20en.webm


    That is a lot of stuff to type, and thus almost everything in the RDF world supports URL abbreviation using prefixes. Basically, in some way you say that whenever there's wdt: in a token, some magic replaces it by http://www.wikidata.org/prop/direct/. If you know about XML namespaces (and if you're doing any sort of non-trivial XML, you should): same thing, except that the exact syntax for writing down the mapping from prefixes to URIs depends on how you represent the RDF triples.

    These “words with a colon that expand to long URIs by some find-and-replace rules“ were called CURIEs, compact URIs, for a while, but I think that term has become unpopular again. I consider this a bit of a pity, because it was nice to have a name for them, and such a physics-related one on top. But it seems nobody cared enough to push the W3C draft for that ahead.

    As with XML namespaces, each RDF document could use its own prefix mapping; if you want, you could instruct an RDF processor to let you write wikidata-direct-property: for http://www.wikidata.org/prop/direct/ rather than wdt: as usual. But that would be an unfriendly act. At least for the more popular collections of RDF resources, there are canonical prefixes: you don't have to use them, but everyone will hate you if you don't. In particular, don't bind well-known prefixes like foaf: (see below) to URIs other than the canonical ones except when seeing whether a piece of software does it right or setting traps for unsuspecting people you don't like.

    Then again, for what we are doing here, you do not need to bother about prefix mappings at all, because the Wikidata engine has all prefixes we will use predefined and built in. So, as long as you are aware that you can replace the funny prefixes with URI parts and that there is some place where those URI parts are defined, you are fine.
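    In case the find-and-replace machinery sounds abstract: it really is just a table lookup, as this little Python sketch (mine, covering only the two prefixes used in this post) shows:

```python
# My sketch of CURIE-style prefix expansion: plain find-and-replace
# on the part before the colon, driven by a prefix table.
PREFIXES = {
    "wd": "http://www.wikidata.org/entity/",
    "wdt": "http://www.wikidata.org/prop/direct/",
}

def expand(token):
    prefix, colon, local = token.partition(":")
    if colon and prefix in PREFIXES:
        return PREFIXES[prefix] + local
    return token  # not a known CURIE: leave it alone

print(expand("wd:Q42"))    # http://www.wikidata.org/entity/Q42
print(expand("wdt:P569"))  # http://www.wikidata.org/prop/direct/P569
```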

    Long URIs in RDF?

    You could certainly ask why we're bothering with the URIs in the first place if people in practice use the canonical prefixes almost exclusively. I think the main reason RDF was built on URIs is that its parents on the one hand wanted to let everyone “build” resources with minimal effort. On the other hand, they wanted to ensure as best they could that two people would not accidentally use the same resource identifier while meaning different things.

    To ensure the uniqueness of identifiers, piggybacking on the domain name system, which already makes sure that there are never two machines called, say, blog.tfiu.de in the world, is a rather obvious move. In HTTP URIs, domain names show up as the authority (the host part, the thing between the double slash and the next slash), and so with URIs of that sort you can start creating (if you will) your resources and would never conflict with anyone else as long as you hold on to your domain.
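    A two-liner with Python's urllib (my illustration) shows where the authority sits in such a URI:

```python
# My illustration: the authority of an HTTP URI is the part the
# domain name system already keeps globally unique.
from urllib.parse import urlparse

uri = "http://www.wikidata.org/prop/direct/P569"
print(urlparse(uri).netloc)  # www.wikidata.org
```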

    In addition, nobody can predict which of these namespace URIs will become popular enough to warrant a globally reserved prefix of their own; you see, family-safe prefixes with four (or fewer) letters are a rather scarce resource, and you don't want to run a registry of those. If you did, you would become really unpopular with all the people you had to tell things like “no, your stuff is too unimportant for a nice abbreviation, but you can have aegh7Eba-veeHeql1:“

    The admittedly unwieldy URIs in practice also have a big advantage, and one that would require some complex scheme like the Handle system if you wanted to replicate it with prefixes: most of the time, you can resolve them.

    Non-speaking URIs

    While RDF itself does not care, most URIs in this business actually resolve to something that is readable in a common web browser; you can always try it with the resources I will be mentioning later. This easy resolution is particularly important in the case of Wikidata's URIs, which are essentially just numbers. Except for a few cases (wd:Q42 is Douglas Adams, and wd:Q1 is the Universe), these numbers don't tell you anything.

    There is no fixed rule that RDF URIs must have a lexical form that does not suggest their meaning. As a counterexample, http://xmlns.com/foaf/0.1/birthday is a property linking a person with their, well, birthday in a popular collection of RDF resources[2] called foaf (as in friend of a friend – basically, you can write RDF-compliant address books with that).

    There are three arguments I have heard against URIs with such a speaking form:

    • Don't favour English (a goal that the very multilingual Wikipedia projects might plausibly have).
    • It's hard to automatically generate such URIs (which certainly is an important point when someone generates resources as quickly and with minimal supervision as Wikidata).
    • People endlessly quarrel about what should be in the URI when they should really be quarrelling about the label, i.e., what is actually shown to readers in the various natural languages, and still more about the actual concepts and definitions. Also, you can't repair the URI if you later find you got the lexical form slightly wrong, whereas it's easy to fix labels.

    I'm not sure which of these made Wikidata choose their schema of Q<number> (for entities) and P<number> (for properties) – but all of them would apply, so that's what we have: without looking …

  • PSA: netsurf 3 does not accept cookies from localhost

    As I have already pointed out in April, I consider simple and compact web browsers a matter of freedom (well, Freedom as in speech, actually), and although there's been a bit of talk about ladybird lately, my favourite in this category still is netsurf, which apparently to this day is lean enough to run on vintage 1990 Atari TT machines. I'll freely admit I have not tried it, but the code is there.

    Yesterday, however, netsurf drove me crazy for a while: I was developing a web site, making sure it works with netsurf. This website has a cookie-based persistent login feature, and that didn't work. I sent my Set-Cookie headers all right – ngrep is your friend if you want to be sure, somewhat like this:

    sudo ngrep -i -d lo cookie port 8888

    Ngrep also clearly showed that netsurf really did not send any Cookie headers, so the problem wasn't on the cookie header parsing side of my program, either.

    But why did the cookies disappear? Cookie policy? Ha: netsurf does accept a cookie from Google, and crunching this would be the first thing any reasonable policy would do. Did I perhaps fail to properly adhere to the standards (which is another thing netsurf tends to uncover)? Hm: looking up the cookie syntax spec gave me some confidence that I was doing the right thing. Is my Max-Age ok? Sure, it is.

    The answer to this riddle: netsurf does not store cookies if it cannot sort them into a hierarchy of host names, and it never can do that for host names without dots (as in localhost, for instance). Given the ill-thought-out Domain attribute one can set for cookies (see the spec linked above if you want to shudder), I even have a solid amount of sympathy for that behaviour.

    But given that that is something that will probably bite a lot of people caring about freedom enough to bother with netsurf, I am still a bit surprised that my frantic querying of search engines on that matter did not bring up the slightly unconventional cookie handling of netsurf. Let us hope this post's title will change that. Again, netsurf 3 will not store cookies for not only localhost but any host name without dots in it. Which is a bit inconvenient for development, and hence despite my sympathy I am considering a bug report.
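    If it helps, my reading of the behaviour boils down to something like this (a sketch of the observed logic, emphatically not netsurf's actual code):

```python
# A sketch of the observed behaviour, not netsurf's actual code:
# nspsl_getpublicsuffix() finds no public suffix for a dotless
# host name, so the freshly parsed cookie is never stored.
def netsurf_keeps_cookie(host):
    return "." in host  # crude stand-in for the public-suffix lookup

print(netsurf_keeps_cookie("localhost"))        # False: cookie dropped
print(netsurf_keeps_cookie("victor.local.de"))  # True: cookie stored
```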

    Meanwhile, I've worked around the problem by adding:

    127.0.0.1 victor.local.de

    to my /etc/hosts (the name really doesn't matter as long as it will never clash with an actual site you want to talk to and it contains one or more dots) and access the site I'm developing as http://victor.local.de. Presto: my cookie comes back from netsurf all right.

    A Debugging Session

    So, how did I figure this riddle out? The great thing about Debian and halfway compact software like netsurf is that it makes it reasonably simple to figure out such (mis-) features. Since I firmly believe that the use of debuggers is a very basic skill everyone touching a computer should have, let me give a brief introduction here.

    First, you need to get the package's source. Make sure it matches the version of the program that you actually run; to do that, copy the deb line in /etc/apt/sources.list for the repository the package comes from (note that this could be the security repo if you got updates from there). In the copied line, replace deb with deb-src. In my case, that would be:

    deb-src https://deb.debian.org/debian bullseye main

    On a freshly installed Debian, it's likely you already have a line like this; consider commenting out the deb-src lines when not working with source code, as that will make your apt operations a bit faster.

    After an apt update, I can now pull the source. To keep your file system tidy, I put all such sources into children of a given directory, perhaps /usr/src if you're old-school, or ~/src if not:

    mkdir -p src/netsurf
    cd src/netsurf
    apt-get source netsurf-gtk

    I'm creating the intermediate netsurf directory because apt-get source creates four items in the directory, and in case you're actually building a package (which you could, based on this), more entries will follow; keeping all that mess outside of src helps a lot. Note that apt-get source does not need any special privileges. You really should run it as yourself.

    By the way, this is the first part where monsters like webkit make this kind of thing really strenuous: libwebkit sources (which still fall far short of a full browser) pull 26 megabytes of archive expanding to a whopping 300 megabytes of source-ish goo.

    To go on, enter the directory that apt-get source created; in my case, that was netsurf-3.10. You can now look around, and something like:

    find . -name "*.c" | xargs grep "set-cookie"

    quickly brought me to a file called netsurf/content/urldb.c (yeah, you can use software like rgrep for „grep an entire tree“; but then the find/xargs combo is useful for many other tasks, too).

    Since I still suspected a problem when netsurf parses my set-cookie header, the function urldb_parse_cookie in there caught my eye. It's not pretty that that function is such an endless beast of hand-crafted C (rather than a few lines of lex[1]), but it's relatively readable C, and they are clearly trying to accommodate some of the horrible practices out there (which is probably the reason they're not using lex), so just looking at the code cast increasing doubts on my hypothesis of some minor standards breach on my end.

    In this way, idly browsing the source code went nowhere, and I decided I needed to see the thing in action. In order to not get lost in compiled machine code while doing that, one needs debug symbols, i.e., information that tells a debugger what compiled stuff resulted from what source code. Modern Debians have packages with these symbols in an extra repository; you can guess the naming scheme from the sources.list line one has to use for bullseye:

    deb http://debug.mirrors.debian.org/debian-debug bullseye-debug main

    After another round of apt update, you can install the package netsurf-gtk-dbgsym (i.e., just append a -dbgsym to the name of the package that contains the program you want to debug). Once that's in, you can run the GNU debugger gdb:

    gdb netsurf

    which will drop you into a command line prompt (there's also a cool graphical front-end to gdb in Debian, ddd, but for little things like this I've found plain gdb to be less in my way). Oh, and be sure to do that in the directory with the extracted sources; only then can gdb show you the source lines (ok: you could configure it to find the sources elsewhere, but that's rarely worth the effort).

    Given we want to see what happens in the function urldb_parse_cookie, we tell gdb to come back to us when the program enters that function, and then to start the program:

    (gdb) break urldb_parse_cookie
    Breakpoint 1 at 0x1a1c80: file content/urldb.c, line 1842.
    (gdb) run
    Starting program: /usr/bin/netsurf

    With that, netsurf's UI comes up and I can go to my cookie-setting page. When I try to set the cookie, gdb indeed stops netsurf and asks me what to do next:

    Thread 1 "netsurf" hit Breakpoint 1, urldb_parse_cookie (url=0x56bcbcb0,
        cookie=0xffffbf54) at content/urldb.c:1842
    1842  {
    (gdb) n
    1853    assert(url && cookie && *cookie);

    n (next) lets me execute the next source line (which I did here). Other basic commands include print (to see values), list (to see code), s (to step into functions, which n will just execute as one instruction), and cont (which resumes execution).

    In this particular debugging session, everything went smoothly, except I needed to skip over a loop that was boring to step through. This is exactly what gdb's until command is for: typing it at the end of the loop will fast forward over the loop execution and then come back once the loop is finished (at which point you can see what its net result is).

    But if the URL parsing went just fine: Why doesn't netsurf send back my cookie?

    Well, tracing on after the function returned eventually lead to this:

    3889      suffix = nspsl_getpublicsuffix(dot);
    3890      if (suffix == NULL) {

    and a print suffix confirmed: suffix for localhost is NULL. Looking at the source code (you remember the list command, and I usually keep the source open in an editor window, too) confirms that this makes netsurf return before storing the freshly parsed cookie, and a cookie not stored is a cookie not sent back to the originating site. Ha!

    You do not want to contemplate what such a session would look like with a webkit browser or, worse, firefox or chromium, not to mention stuff you don't have the source …

  • Entering Wild Unicode in Vim

    Screenshot of a vi session showing the unicode PILE OF POO symbol

    This is what this post is about: being able to type PILE OF POO (a.k.a. U+1f4a9) in vim in…tuitively. For certain notions of intuition.

    As a veteran of writing texts in TeX, I've long tended to not bother with “interesting” characters (like the quotes I've just used, or the plethora of funny characters one has for writing math) in non-TeX texts. That is, until I started writing a lot of material not (directly) formatted using TeX, as for instance this post. And until reasonably robust Unicode tooling was widely available.[1]

    I enter most of my text in vim, and once I decided I wanted the exotic unicode characters I experimented with various ways to marry unicode and vim. What I ended up doing is somewhat inspired by TeX, where one enters funny characters through macros: a backslash and a few reasonably suggestive letters, as perhaps \sigma for σ or \heartsuit for ❤.

    What lets me do a similar thing in vim are insert-mode abbreviations and unicode escapes. I've found the abbreviations do not inconvenience me otherwise if I conclude them with a slash. And so I now have quite a few definitions like:

    iab <expr> scissors/ "\u2702"

    in my ~/.vimrc. This lets me type scissors/␣ to get ✂ (and blank/␣ to get the visible blank after the scissors). This works reasonably well for me; it's only when the abbreviation is not bounded by blanks that I have to briefly leave the insert mode to make sure that vim recognises the abbreviation. For instance y▶y – where the abbreviation needs to directly abut the letter – I have to type as y<ESC>aarrleft/ <BACKSPACE>y. I don't know about people who didn't grow up with TeX, but in my world such a thing passes as really natural, and for me it easily beats the multibyte keymaps that, I think, the vim authors would recommend here.

    And how do I figure out the unicode code points (i.e., the stuff after the \u)? Well, there is the unicode(1) command (Debian package unicode), which sounds cool but in reality only points me to what I'm looking for every other time or so: It's hard to come up with good words to look for characters the name of which one doesn't know.
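    By the way, Python's unicodedata module (my aside; the post itself only mentions unicode(1)) can do the same lookups in both directions:

```python
# My aside: code point/name lookups like unicode(1)'s, done with
# Python's standard library unicodedata module.
import unicodedata

print(unicodedata.name("\u2702"))                   # BLACK SCISSORS
print(hex(ord(unicodedata.lookup("PILE OF POO"))))  # 0x1f4a9
```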

    In practice, most of the time I look at the various code blocks linked from the Wikipedia unicode page. Going by their titles in my experience is a good way to optically hunt for glyphs I'm looking for. The result is the following abbreviations – if you make interesting new ones, do send them in and I will update this list:

    iab <expr> deg/ "\u00B0"
    iab <expr> pm/ "\u00B1"
    iab <expr> squared/ "\u00B2"
    iab <expr> cubed/ "\u00B3"
    iab <expr> times/ "\u00D7"
    iab <expr> half/ "\u00BD"
    iab <expr> acirc/ "\u00E2"
    iab <expr> egrave/ "\u00E8"
    iab <expr> idia/ "\u00EF"
    iab <expr> subtwo/ "\u2082"
    iab <expr> euro/ "\u20ac"
    iab <expr> trademark/ "\u2122"
    iab <expr> heart/ "\u2764"
    iab <expr> smile/ "\u263A"
    iab <expr> arrow/ "\u2192"
    iab <expr> emptyset/ "\u2205"
    iab <expr> bullet/ "\u2022"
    iab <expr> intersects/ "\u2229"
    iab <expr> scissors/ "\u2702"
    iab <expr> umbrella/ "\u2614"
    iab <expr> peace/ "\u262e"
    iab <expr> point/ "\u261b"
    iab <expr> dots/ "\u2026"
    iab <expr> mdash/ "\u2014"
    iab <expr> sum/ "\u2211"
    iab <expr> sqrt/ "\u221a"
    iab <expr> approx/ "\u2248"
    iab <expr> neq/ "\u2260"
    iab <expr> radio/ "\u2622"
    iab <expr> hazmat/ "\u2623"
    iab <expr> pick/ "\u26cf"
    iab <expr> eject/ "\u23cf"
    iab <expr> check/ "\u2713"
    iab <expr> alpha/ "\u03b1"
    iab <expr> beta/ "\u03b2"
    iab <expr> gamma/ "\u03b3"
    iab <expr> delta/ "\u03b4"
    iab <expr> epsilon/ "\u03b5"
    iab <expr> zeta/ "\u03b6"
    iab <expr> eta/ "\u03b7"
    iab <expr> theta/ "\u03b8"
    iab <expr> kappa/ "\u03ba"
    iab <expr> lambda/ "\u03bb"
    iab <expr> mu/ "\u03bc"
    iab <expr> nu/ "\u03bd"
    iab <expr> Delta/ "\u0394"
    iab <expr> Xi/ "\u039e"
    iab <expr> pi/ "\u03c0"
    iab <expr> rho/ "\u03c1"
    iab <expr> sigma/ "\u03c3"
    iab <expr> chi/ "\u03c7"
    iab <expr> supp/ "\u207A"
    iab <expr> supm/ "\u207B"
    iab <expr> tripleeq/ "\u2261"
    iab <expr> cdot/ "\u22c5"
    iab <expr> powern/ "\u207f"
    iab <expr> chevopen/ "\u00bb"
    iab <expr> chevclose/ "\u00ab"
    iab <expr> element/ "\u2208"
    iab <expr> notelement/ "\u2209"
    iab <expr> subset/ "\u2282"
    iab <expr> superset/ "\u2283"
    iab <expr> blank/ "\u2423"
    iab <expr> block/ "\u2588"
    iab <expr> achtel/ "\u266a"
    iab <expr> clef/ "\U1d11e"
    iab <expr> arrleft/ "\u25B6"
    iab <expr> arrdown/ "\u25BC"
    iab <expr> poop/ "\U1f4a9"

    One last thing one should know: quite a few interesting unicode characters are outside of what's known as the “Basic Multilingual Plane”, which is a pompous way to say: beyond the first 65536 code points. That in particular includes all the emojis (please don't torture me with those), but also the timeless PILE OF POO character, the rendering of which in the Hack font is shown in the opening image. Addressing codepoints above 65536 needs more than four hex characters, and to make vim grok those, you need to say \U rather than \u.
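    For instance (euro is from the list above; the astral-plane examples use \U as just described):

```vim
" up to four hex digits: \u suffices within the Basic Multilingual Plane
iab <expr> euro/ "\u20ac"
" beyond the BMP, \U is needed; it accepts up to eight hex digits
iab <expr> clef/ "\U1d11e"
iab <expr> poop/ "\U1f4a9"
```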

    [1]Which, I give you, has been the case as of about 10 years ago, so it's not like all this is bleeding-edge.
  • HTML and Word in mutt

    I read my mail using mutt, and even though I was severely tempted by astroid, mutt just works too nicely for me to make moving away an attractive proposition. And it is a fine piece of software. If you're still stuck with Thunderbird (let alone some webmail interface in the browser) and wonder what text-based software you might adopt, right after vim I'd point you to mutt.

    I'm saying all that because the other day I complained about a snooping mail marketing firm (in German) abusing MIME's multipart/alternative type to clickbait people reading plain text mails into their tracker-infested web pages, and I promised to give an account of how I configured my mutt to cope with HTML mails and similar calamities.

    The basic mechanism is ~/.mutt/mailcap. That's a file analogous to /etc/mailcap, for which there's a man page, mailcap (5)[1]. That explains how, in general, software uses this file to figure out which program to use to display (or print or compose) files of which types.

    Mutt reads system-wide mailcaps, too, but I've found I generally want to handle a solid number of media differently in mails than, say, in browsers or from the shell[2], and hence I'm keeping most of this configuration in mutt's private mailcap. For HTML mail, I've put into that file:

    text/html; w3m -dump -T %t -I utf-8 -cols 72 -o display_link_number=1 %s; copiousoutput

    This uses w3m to format HTML rather than the lynx that the mutt docs suggest. Lynx these days really is too basic for my taste (I'm not even sure whether it has learned utf-8). Still, this will not execute javascript or retrieve images, so most of the ugly aspects of HTML mails are sidestepped. The copiousoutput option makes mutt use its standard pager when showing the program's output, and thus HTML mail will look almost like sane mail.
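    For illustration, the %s and %t in the mailcap line are placeholders that the mail client fills in before running the command: %s with the name of a temporary file holding the decoded body part, %t with its media type. A minimal sketch of that substitution – real mailcap processing, per RFC 1524, knows more placeholders and quoting rules:

```python
def expand_mailcap(command, filename, content_type):
    """Expand the two most common mailcap placeholders.

    %s is replaced by the file holding the body part, %t by its
    media type; this ignores the further placeholders and quoting
    rules that full RFC 1524 processing would have to handle.
    """
    return command.replace("%t", content_type).replace("%s", filename)

print(expand_mailcap(
    "w3m -dump -T %t -I utf-8 -cols 72 -o display_link_number=1 %s",
    "/tmp/mutt-part.html",
    "text/html"))
```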

    To make that really seamless, you need an extra setting in your ~/.mutt/muttrc:

    auto_view text/html

    This makes mutt automatically render HTML (which, contrary to the behaviour of gmail or thunderbird, I consider relatively safe if it's the parsimonious w3m that does the rendering). In addition, since I still believe in the good in humans, I trust that when there is both HTML and plain text in a mail, the plain text will be better suited for my text terminal, and so I tell mutt to prefer text/plain, which, again in the muttrc, translates into:

    alternative_order text/plain

    And that's it: If the villains at cleverreach (the marketing firm I complained about) didn't have their treacherous text/plain alternative, my w3m would render their snooping HTML without retrieving their tracking pixel, and I could read whatever they send me without them ever knowing if and when. I'm still not sure if that's the reason they have the nasty clickbait text/plain alternative. In general, I support the principle that you should never explain with malice what you can just as well explain with stupidity. But then we're dealing with a marketing firm here…

    Anyway: The best part of this setup is that you can quote-reply to HTML mails, giving your replies inline as $DEITY wanted e-mail to work. That is something that also is nice when folks send around MS office files (I get the impression that still happens quite a lot outside of my bubble). To cater for that, I have in my mailcap:

    application/msword; antiword %s; copiousoutput
    application/vnd.openxmlformats-officedocument.wordprocessingml.document; docx2txt %s - ; copiousoutput

    and consequently in the muttrc:

    auto_view application/msword
    auto_view application/vnd.openxmlformats-officedocument.wordprocessingml.document

    I admit I actually enjoy commenting inline when replying to office documents, and I trust antiword (though perhaps docx2txt a bit less) to not do too many funny things, so that I think I can run the risk of auto-rendering MS office files. I've not had to regret this in the, what, 15 years that I've been doing it (in the antiword case; according to my git history, I've only given in to autorendering nasty docx in 2019).

    I mention in passing that I have similar rules for libreoffice, but there I have a few lines of python to do the text rendering, and that is material for another post (also, folks decent enough to use libreoffice are usually decent enough to not send around office files, and hence auto-displaying ODT is much less of a use case).

    Two more remarks: This actually cooperates nicely with rules not using copiousoutput. So, for instance, I also have in my mailcap:

    text/html; x-www-browser file://%s

    With that, if need be I can still navigate to an HTML file in the attachments menu and then fire up a “normal” browser (with all the privacy implications).

    And: people indecent enough to mail around MS office files often are not even decent enough to configure their mail clients to produce proper media types. Therefore, mutt lets you edit these back to sanity. Just hit v, go to the misdeclared attachment and then press ^E. Since the “Office Open XML” (i.e., modern Microsoft Office) media types are so insanely long and unmemorisable, I have made up a profane media type that I can quickly type and remember for that particular purpose:

    application/x-shit; libreoffice %s
    [1]In case you're not so at home in Unix, writing “mailcap (5)” means you should type man 5 mailcap (for “show me the man page for mailcap in man section 5”) to read or skim the documentation on that particular thing. Explicitly specifying a section makes a lot of sense for things like getopt (which exists in sections 1 and 3) and otherwise is just an indication that folks ought to have a look at the man page.
    [2]You can use programs like see (1) or even compose (1) to use the information in your non-mutt mailcaps.
  • Merging Master Documents in Libreoffice

    I just had to help a poor soul who had written three books in Microsoft Word for Windows, using Word master documents, mainly, I suppose, because 20 years ago – when these projects started – Word could not cope with several hundred pages at once. Leaving aside that one should not do real work with office software: since the poor soul was migrating to Linux, the whole lot had to move to Libreoffice. Neither my patience nor the poor soul's desire for redemption was sufficient for a migration to sensible technologies (TeX, ReStructuredText, Docbook, or whatever).

    First disillusionment: Word master documents (as do Libreoffice master documents) store their paths absolutely, so they cannot be moved without trickery, in particular not from a Windows file system to a Linux file system. As so often, one wonders how that can ever have looked like a good design.

    After a brief look at working with master documents, I decided that computers by now really are big enough to hold 500-page documents in memory, and that the whole master document magic with its attendant complications no longer needs to be.

    But then: how do you copy 20 or 30 such files together? What I found on this in the remaining Libreoffice documentation (the wiki that is supposed to be it these days does not convince me, by the way) and on the wider net struck me as rather… unsatisfying – now and then it brought back memories of Windows forums (“to repair the sound driver, you have to uninstall the CD driver. Worked for me”). So I thought it might be useful if I contributed a few recipes, too, even though – disclaimer – I don't use office software myself and in that sense am at best one-eyed here.

    How did I turn a set of ODTs or DOCs into a single ODT?

    1. Create a directory and copy all the subdocuments into it.

    2. Rename the files so that a simple sort puts them into the right order (I simply prefixed the names with 000_, 001_, and so on).

    3. Start Libreoffice, New → Master Document.

    4. Press F5 to open the Navigator and click the insert icon there; that pops up a selection box.

    5. In this selection box, go to the directory with the subdocuments and have them sorted into the order in which they are to appear in the final document.

    6. Select all files in the directory, e.g., with Control-A or shift-clicking, then confirm the import.

    7. File → Export, choosing ODT as the target format; that is the first step towards getting away from the embedding in the master document. I'll call this file joined.odt from here on.

    8. The ODT produced this way is, unfortunately, write-protected throughout, and I had not found a way to conjure that write protection away by clicking before I lost my patience with documentation, menus, and above all forums, and went at it with epubedit (see below). With the little script down there, you can run the following in a shell:

      epubedit joined.odt
      sed -ie 's/text:protected="[^"]*"//g' content.xml

      (you can of course also delete all the text:protected attributes with an editor, or indeed with the excellent xmlstarlet)[1]. Then exit the shell spawned by epubedit; that rewrites joined.odt.

    9. Open the new joined.odt in libreoffice.

    10. Edit → Links, again select everything (^A), and then press the Break Link button.

    The result of this procedure is a contiguous “document” (if one does not have high expectations of documents).

    At least in my case, however, the work only began there, because each subdocument had its own, crazy paragraph styles. I will describe in a moment how I cleaned that up, but first we quickly need to talk about the epubedit just mentioned.


    What actually makes Open Document files somewhat more pleasant to deal with than certain other office files I could mention here is that they are really just zip archives containing text files described by standards that are not unreasonable in themselves (e.g., XML and CSS). That, for instance, is what made the trick above with replacing the text:protected attributes possible.
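    You can see (and exploit) that property with nothing but a zip tool; here is a small Python sketch that builds a toy content.xml inside an in-memory zip and then strips the text:protected attributes with a regular expression, just as the sed call above does (the toy markup is of course made up and far simpler than real ODF):

```python
import io
import re
import zipfile

# build a toy "ODT": a zip archive containing a minimal content.xml
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("content.xml",
        '<office:text><text:section text:protected="true"'
        ' text:name="sect1">Hello</text:section></office:text>')

# read content.xml back and drop all text:protected attributes,
# which is what the sed one-liner does inside the epubedit shell
with zipfile.ZipFile(buf) as zf:
    content = zf.read("content.xml").decode("utf-8")
cleaned = re.sub(r'text:protected="[^"]*"', "", content)
print(cleaned)
```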

    They share this architecture with ebooks in the epub format, and to quickly make small fixes to those, years ago I wrote myself a little shell script:

    #!/bin/sh
    if [ $# -ne 1 ]; then
            echo "Usage: $0 <epub> -- unpack an epub, open a shell, pack it again."
            exit 0
    fi

    # remember where the archive is so it can be re-packed from inside $workdir
    fullpath=$(realpath "$1")

    workdir=$(mktemp -d /tmp/workXXXXXX)
    cleanup() {
            rm -rf "$workdir"
    }
    trap cleanup EXIT

    if [ ! -f "$1".bak ]; then
            cp -a "$1" "$1".bak
    fi

    unzip "$1" -d "$workdir"
    (cd "$workdir"; bash)
    cd "$workdir"
    zip -r "$fullpath" *

    Take this, put it somewhere in your path as – say – epubedit, and make it executable. For any epub or odt you can then say epubedit datei.odt and end up in a shell running in the root directory of the respective zip archive. There you can edit to your heart's content – for ODTs, the content is in content.xml – and when you are done, you exit the shell and have a correspondingly modified ODT or epub.

    Because something tends to go wrong in the process, the script makes a backup of the original file (unless such a backup already exists; experience shows that you usually prefer to keep the original backup…).


    Now, the joined document is at least just a single file, and one that – wow! – can even be moved. At least given its genesis in my case, i.e., the many individual Word files, it is still barely usable, because Word generated a few hundred paragraph styles, mostly with names as helpful as Formatvorlage_20_16_20_pt_20_Block_20_Erste_20_Zeile_3a__20__20_05_20_cm_20_Zchn or Fußnotentext1_20_Zchn or indeed apple-converted-space. This problem is bad, and I eventually accepted that it cannot be solved without a small program and some manual work.

    At first, the program only dug the style names out of the document and dumped them to standard output. By now, that has become the basis for a mapping file, and for the mapping as such, regular expressions were still sufficient. Because of the dependencies between the styles, however, plenty of junk still remained in the list of styles in use. So I finally had to bring in proper XML processing after all to rework styles.xml. The result is the program defuse-libreoffice-style.py. If you put this program into your home directory for the duration of the processing, you would unify the styles as follows:

    1. epubedit joined.odt; everything else happens in the shell this opens.

    2. python3 ~/defuse_libreoffice-style.py > ~/style-map.txt – if you are not keeping the script in your home directory, you will have to adjust this path. And I put the style mapping into my home rather than into the current directory so that the mapping (which is quite a bit of work) is not lost the moment you leave the shell. I, at least, needed a few attempts, especially the first time around, before the mapping fit well.

    3. Edit the file ~/style-map.txt with a text editor (that is, certainly not with libreoffice itself). It contains lines like:

      Footnote_20_Symbol -> Footnote_20_Symbol

      – about 200 of them in my case. The task now is to reduce the right-hand sides of these lines to a handful of styles (Textkörper, Überschrift_1, Überschrift_2, Zitat, Fußnotenzeichen, and Fußnote were my minimum); the line above, for instance, I turned into:

      Footnote_20_Symbol -> Fußnotenzeichen

      It is not always easy to figure out what a given style was once supposed to do; most of the time, though, Word has left some sort of hint in the name after all.

    4. When the mapping is done, run the Python script again; if it gets an argument, it interprets it as a mapping and adjusts both content.xml and styles.xml accordingly:

      python3 ~/defuse_libreoffice-style.py ~/style-map.txt
    5. To see which styles are still left, you can run the script one more time without arguments; that prints the remaining styles to the terminal:

      python3 ~/defuse_libreoffice-style.py

      If anything is still there that should not remain, you can adjust style-map.txt and run step (4) again (or start over from the backup of the ODT).

    6. Finally, leave the epubedit shell and check in libreoffice whether everything worked. Libreoffice will probably report that the document is damaged (but not say what exactly; here it comes back to bite me that I did not read the Open Document standards and instead simply hacked away merrily). Whatever it does to repair it, though, has always worked well for me – so: take heart.
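    As an aside, the mapping file format from step 3 is simple enough that parsing it takes only a few lines; this is a hypothetical sketch of that part, not the actual code of defuse_libreoffice-style.py:

```python
def parse_style_map(text):
    """Parse lines of the form 'old_style -> new_style' into a dict.

    Blank lines and lines without an arrow are ignored; whitespace
    around the style names is stripped.
    """
    mapping = {}
    for line in text.splitlines():
        if "->" not in line:
            continue
        old, new = (part.strip() for part in line.split("->", 1))
        mapping[old] = new
    return mapping

sample = "Footnote_20_Symbol -> Fußnotenzeichen\n\napple-converted-space -> Textkörper\n"
print(parse_style_map(sample))
```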

    And in case anybody looks at the Python code: No, even though the StyleSanitiser at least does proper XML processing (in contrast to the RE hacks above), this is still not generic Open Document, because I have hard-coded Libreoffice's specific choice of the text: prefix in it, which “proper” software should not do. But SAX with real namespaces is no fun, and for now I do not expect this code will ever run on ODTs that do not come from Libreoffice.

    And the index entries?

    The books each also had an index. For one document that worked well; in the others, the index contained a few proper terms, a few terms with pointless typographic quotation marks, and a great many entries for the empty word. I have no idea how that came about.

    The repair is again helped by the fact that ODT at its core is XML that is not entirely unreasonable. The markup for an index entry, for example, looks like this:

    Karl Valentin …
  • Replacing root-tail when there is a compositor

    Since there hasn't been real snow around here this year until right this morning, I've been running xsnow off and on recently[1]. And that made me feel the lack of a compositor on my everyday desktop. Certainly, drop shadows and fading windows aren't all that necessary, but I've been using a compositor on the big screen at work for about a decade now, and there are times when the extra visual cues are nice. More importantly, the indispensable xcowsay only has pseudo-transparency when there's no compositor ever since it moved to gtk-3 (i.e., in Debian bullseye).

    Well: enough is enough. So, I'm now running picom in my normal desktop sessions (which are managed by sawfish).

    Another near-indispensable part of my desktop is that the syslog is shown in a part of the root window (a.k.a. desktop background), somewhat like this:

    Windows, and a green syslog in the background

    This was enabled by the nice program root-tail for ages, but alas, it does not play well with compositors. It claims its --windowed flag does provide a workaround, but at least for me that failed in rather crazy ways (e.g., ghosts of windows were left behind). I figured that might be hard to fix and thought about an alternative. Given compositors are great for making things transparent: well, perhaps I can replace root-tail with a heavily customised terminal?

    The answer: essentially, yes.

    My terminal program is rxvt-unicode (urxvt). In the presence of a compositor, you can configure it for a transparent background by telling it not to use its pseudo-transparency (+tr), telling it to use an X visual with an alpha channel (-depth 32), and then using a background colour with the desired opacity prefixed in square brackets. For a completely transparent terminal, that is:

    urxvt -depth 32 +tr -bg "[0]#000000"

    This still has the scrollbar sticking out, which for my tail -f-like application I don't want; a +sb turns it off. Also, I'm having black characters in my terminals by default, which really doesn't work with a transparent background. Making them green looks techy, and it even becomes readable when I'm making the background 33% opaque black:

    urxvt -depth 32 +tr +sb -bg "[33]#000000" -fg green

    To replace root-tail, I have to execute my tail -f, put the window into the corner and choose a somewhat funky font. To simply let me reference the whole package from startup files, I'm putting all that into a shell script, and to avoid having the shell linger around, all that this script does is call an exec (yes, for interactive use this probably would be an alias):

    exec /usr/bin/urxvt -title "syslog-on-root" +sb \
            +tr -depth 32 -bg "[33]#000000" -g 83x25-0-0 \
            -fg green -fn "xft:monofur-11:weight=black" \
            -e tail -f /var/log/syslog

    This already looks pretty much as it should, except that it's a normal window with frames and all, and worse, when alt-tabbing through the windows, it will come up, and it will also pollute my window list.

    All that needs to be fixed by the window manager, which is why I gave the window a (hopefully unique) title and then configured sawfish (sawfish-config, “Window Rules”) to make windows with that name depth 16, fixed-position, fixed-size, sticky, never-focus, cycle-skip, window-list-skip, task-list-skip, ignore-stacking-requests. I think one could effect about the same with a judicious use of wmctrl – if you rig that up, be sure to let me know, as I give you it would be nice to make that part a bit more independent of the window manager.

    There's one thing where this falls short of root-tail: Clicks into this are not clicks into the root window. That hurts me because I have root menus, and it might hurt other people because they have desktop icons. On the other hand, I can now mouse-select from the syslog, which is kind of nice, too. Let's see.

    [1]Well, really: Mainly because I'm silly.
  • Perhaps I should be moving to gentoo

    I'm reading PDFs quite a bit, most of them in my beloved zathura. So, I was dismayed when today I paged through a book that showed in zathura as on the left side of this figure:

    Renderings of a PDF in poppler and mupdf.

    The metrics are off so badly that readability suffers.

    Rather than try to fix the PDF, I remembered I had for a long time wanted to look into using mupdf as a backend for zathura rather than its default poppler, if only because poppler used to have a shocking amount of rather serious bugs a couple of years ago (now that I think of it: It's been a while since I last heard of any – hm).

    Bringing up the PDF in mupdf looked a lot better (the right panel in the above figure). Which then led to a bout of yak shaving, because there is a plugin for zathura that promised to do what I wanted, zathura-pdf-mupdf, but of course nobody has bothered to package it up for Debian yet. So… let's try to build it.

    It's probably not a good sign that the package's README talks about make to build the thing, whereas the web page talks about a build system with commands meson and ninja (that, frankly, I had never heard about before, but at least it's in Debian). But, never mind, let's do meson build && cd build && ninja (oh wow).

    Of course, building fails with something like:

    ../zathura-pdf-mupdf/index.c: In function ‘build_index’:
    ../zathura-pdf-mupdf/index.c:68:7: error: unknown type name ‘fz_location’; did you mean ‘fz_catch’?
           fz_location location = fz_resolve_link(ctx, document, outline->uri, &x, &y);

    A quick web search shows that this fz_location is part of the mupdf API and has indeed undergone an API change. So, I backported libmupdf from Debian testing (I'm on Debian stable almost always), and because that needs mujs, I backported that, too. Mujs sounds a lot like javascript in PDF, and that's where gentoo first came to mind: with its USE flags it would probably make it easier to just keep javascript out of my PDF rendering engines altogether. Which is something I'd consider an excellent idea.

    Anyway, with a bit of hacking around – I don't have a libmupdf-third library that the meson build file mentions but perhaps doesn't need any more – I then got the plugin to build.

    Regrettably, zathura still would not use mupdf to render, saying:

    error: Could not load plugin '/usr/lib/i386-linux-gnu/zathura/libpdf-mupdf.so'
    undefined symbol: jpeg_resync_to_restart

    Again asking a search engine about typical scenarios that would lead to this failure when loading a plugin, there's quite a bit of speculation, some of it about using libjpeg-turbo instead of libjpeg. Which made me look at what this plugin links against. Fasten your seat belts:

    $ ldd /usr/lib/i386-linux-gnu/zathura/libpdf-mupdf.so
            linux-gate.so.1 (0xf7fa7000)
            libgirara-gtk3.so.3 => /usr/lib/i386-linux-gnu/libgirara-gtk3.so.3 (0xf5d23000)
            libcairo.so.2 => /usr/lib/i386-linux-gnu/libcairo.so.2 (0xf5bd3000)
            libglib-2.0.so.0 => /usr/lib/i386-linux-gnu/libglib-2.0.so.0 (0xf5a9a000)
            libc.so.6 => /lib/i386-linux-gnu/libc.so.6 (0xf58bc000)
            libgtk-3.so.0 => /usr/lib/i386-linux-gnu/libgtk-3.so.0 (0xf50bc000)
            libgdk-3.so.0 => /usr/lib/i386-linux-gnu/libgdk-3.so.0 (0xf4fae000)
            libpango-1.0.so.0 => /usr/lib/i386-linux-gnu/libpango-1.0.so.0 (0xf4f5f000)
            libgio-2.0.so.0 => /usr/lib/i386-linux-gnu/libgio-2.0.so.0 (0xf4d57000)
            libgobject-2.0.so.0 => /usr/lib/i386-linux-gnu/libgobject-2.0.so.0 (0xf4cf2000)
            libjson-c.so.3 => /usr/lib/i386-linux-gnu/libjson-c.so.3 (0xf4ce5000)
            libpthread.so.0 => /lib/i386-linux-gnu/libpthread.so.0 (0xf4cc2000)
            libpixman-1.so.0 => /usr/lib/i386-linux-gnu/libpixman-1.so.0 (0xf4c12000)
            libfontconfig.so.1 => /usr/lib/i386-linux-gnu/libfontconfig.so.1 (0xf4bc5000)
            libfreetype.so.6 => /usr/lib/i386-linux-gnu/libfreetype.so.6 (0xf4b02000)
            libpng16.so.16 => /usr/lib/i386-linux-gnu/libpng16.so.16 (0xf4ac3000)
            libxcb-shm.so.0 => /usr/lib/i386-linux-gnu/libxcb-shm.so.0 (0xf4abe000)
            libxcb.so.1 => /usr/lib/i386-linux-gnu/libxcb.so.1 (0xf4a90000)
            libxcb-render.so.0 => /usr/lib/i386-linux-gnu/libxcb-render.so.0 (0xf4a81000)
            libXrender.so.1 => /usr/lib/i386-linux-gnu/libXrender.so.1 (0xf4a75000)
            libX11.so.6 => /usr/lib/i386-linux-gnu/libX11.so.6 (0xf4926000)
            libXext.so.6 => /usr/lib/i386-linux-gnu/libXext.so.6 (0xf4911000)
            libz.so.1 => /lib/i386-linux-gnu/libz.so.1 (0xf48f0000)
            librt.so.1 => /lib/i386-linux-gnu/librt.so.1 (0xf48e5000)
            libm.so.6 => /lib/i386-linux-gnu/libm.so.6 (0xf47df000)
            libpcre.so.3 => /lib/i386-linux-gnu/libpcre.so.3 (0xf4766000)
            /lib/ld-linux.so.2 (0xf7fa9000)
            libgmodule-2.0.so.0 => /usr/lib/i386-linux-gnu/libgmodule-2.0.so.0 (0xf4760000)
            libpangocairo-1.0.so.0 => /usr/lib/i386-linux-gnu/libpangocairo-1.0.so.0 (0xf4750000)
            libXi.so.6 => /usr/lib/i386-linux-gnu/libXi.so.6 (0xf473d000)
            libXcomposite.so.1 => /usr/lib/i386-linux-gnu/libXcomposite.so.1 (0xf4739000)
            libXdamage.so.1 => /usr/lib/i386-linux-gnu/libXdamage.so.1 (0xf4734000)
            libXfixes.so.3 => /usr/lib/i386-linux-gnu/libXfixes.so.3 (0xf472d000)
            libcairo-gobject.so.2 => /usr/lib/i386-linux-gnu/libcairo-gobject.so.2 (0xf4721000)
            libgdk_pixbuf-2.0.so.0 => /usr/lib/i386-linux-gnu/libgdk_pixbuf-2.0.so.0 (0xf46f4000)
            libatk-1.0.so.0 => /usr/lib/i386-linux-gnu/libatk-1.0.so.0 (0xf46cb000)
            libatk-bridge-2.0.so.0 => /usr/lib/i386-linux-gnu/libatk-bridge-2.0.so.0 (0xf4693000)
            libxkbcommon.so.0 => /usr/lib/i386-linux-gnu/libxkbcommon.so.0 (0xf464d000)
            libwayland-cursor.so.0 => /usr/lib/i386-linux-gnu/libwayland-cursor.so.0 (0xf4644000)
            libwayland-egl.so.1 => /usr/lib/i386-linux-gnu/libwayland-egl.so.1 (0xf463f000)
            libwayland-client.so.0 => /usr/lib/i386-linux-gnu/libwayland-client.so.0 (0xf4630000)
            libepoxy.so.0 => /usr/lib/i386-linux-gnu/libepoxy.so.0 (0xf451e000)
            libharfbuzz.so.0 => /usr/lib/i386-linux-gnu/libharfbuzz.so.0 (0xf4407000)
            libpangoft2-1.0.so.0 => /usr/lib/i386-linux-gnu/libpangoft2-1.0.so.0 (0xf43ee000)
            libXinerama.so.1 => /usr/lib/i386-linux-gnu/libXinerama.so.1 (0xf43e7000)
            libXrandr.so.2 => /usr/lib/i386-linux-gnu/libXrandr.so.2 (0xf43da000)
            libXcursor.so.1 => /usr/lib/i386-linux-gnu/libXcursor.so.1 (0xf43cd000)
            libthai.so.0 => /usr/lib/i386-linux-gnu/libthai.so.0 (0xf43c1000)
            libfribidi.so.0 => /usr/lib/i386-linux-gnu/libfribidi.so.0 (0xf43a5000)
            libdl.so.2 => /lib/i386-linux-gnu/libdl.so.2 (0xf439f000)
            libmount.so.1 => /lib/i386-linux-gnu/libmount.so.1 (0xf4333000)
            libselinux.so.1 => /lib/i386-linux-gnu/libselinux.so.1 (0xf4306000)
            libresolv.so.2 => /lib/i386-linux-gnu/libresolv.so.2 (0xf42ec000)
            libffi.so.6 => /usr/lib/i386-linux-gnu/libffi.so.6 (0xf42e2000)
            libexpat.so.1 => /lib/i386-linux-gnu/libexpat.so.1 (0xf42a5000)
            libuuid.so.1 => /lib/i386-linux-gnu/libuuid.so.1 (0xf429b000)
            libXau.so.6 => /usr/lib/i386-linux-gnu/libXau.so.6 (0xf4296000)
            libXdmcp.so.6 => /usr/lib/i386-linux-gnu/libXdmcp.so.6 (0xf428f000)
            libdbus-1.so.3 => /lib/i386-linux-gnu/libdbus-1.so.3 (0xf4230000)
            libatspi.so.0 => /usr/lib/i386-linux-gnu/libatspi.so.0 (0xf41fb000)
            libgraphite2.so.3 => /usr/lib/i386-linux-gnu/libgraphite2.so.3 (0xf41cd000)
            libdatrie.so.1 => /usr/lib/i386-linux-gnu/libdatrie.so.1 (0xf41c3000)
            libblkid.so.1 => /lib/i386-linux-gnu/libblkid.so.1 (0xf4163000)
            libbsd.so.0 => /usr/lib/i386-linux-gnu/libbsd.so.0 (0xf4144000)
            libsystemd.so.0 => /lib/i386-linux-gnu/libsystemd.so.0 (0xf4099000)
            liblzma.so.5 => /lib/i386-linux-gnu/liblzma.so.5 (0xf406d000)
            liblz4.so.1 => /usr/lib/i386-linux-gnu/liblz4.so.1 (0xf404d000)
            libgcrypt.so.20 => /lib/i386-linux-gnu/libgcrypt.so.20 (0xf3f6a000)
            libgpg-error.so.0 => /lib/i386-linux-gnu/libgpg-error.so.0 (0xf3f45000)

    Now, I appreciate that glueing two pieces of relatively complex code together can be a bit involved, but: 69 libraries!? Among them Xrandr, Xinerama, wayland (which I don't use), systemd (even if I used it: what would that plugin have to do with it?), gpg-error, selinux, and then some things I've never heard about.

    I'm sorry, but this is wrong. Which makes me think hard if gentoo's USE flags might not really be the way to go in this day and age of exploding dependencies.

    Holy cow.

    In case you came here from a search engine that hit on one of the error messages: After writing this, I was hungry and let it sit. The one thing I can tell you is that the elusive jpeg_resync_to_restart is in libjpeg, indeed. What I don't know yet is why that library hasn't made it into the heap of libraries that the plugin links to.

    I'd suspect that zathura and mupdf are built against different libjpegs – but then I'd have to explain how that would have happened. Hm.

    Addendum (2021-02-13)

    Well, I couldn't let it sit, so here's what I needed to do (and I suspect I'd have found it in one of the upstream branches):

    1. The libjpeg thing really is that the libjpeg library needs to be linked into the plugin in Debian; the details I can't quite work out, because I'd say the linker should be able to figure that out, but clearly it does not, because the situation is actually catered for in the plugin's meson file. However, you need to manually flip a switch: this would be a ./configure run in autoconf, but here, the command line is:

      meson setup --wipe -D link-external=true  build
    2. However, one link-time dependency is missing with the mupdf from Debian bullseye, and that's the spooky mujs. To fix this, patch the build file like so:

      diff --git a/meson.build b/meson.build
      index 23cdc6a..24929ca 100644
      --- a/meson.build
      +++ b/meson.build
      @@ -19,8 +19,8 @@ zathura = dependency('zathura', version: '>=0.3.9')
       girara = dependency('girara-gtk3')
       glib = dependency('glib-2.0')
       cairo = dependency('cairo' …
  • An Xlib-based Screen Ruler in Python

    In my bad prediction (in both senses of the word) on what we'll see in the intensive care stations I've mentioned a screen ruler I've written.

    Well, since I've still not properly published it and I doubt I'll ever do that without further encouragement, let me at least mention it here. It's pyscreenruler. I've written it because all the other screen rulers for X11 I found in Debian (and a bit beyond) would only let you measure either horizontal or vertical lines. I, however, needed to measure slopes, as in this part of the curve for the SARS-2 patient count in German critical care stations:

    Plot: curve with a ruler

    – which, by the way, shows that, surprisingly to me, our patient numbers still go down exponentially (if slowly, with a halving time of almost two months). Hm. But today's point is pyscreenruler, which is what has produced the ruler in that image.
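    A halving time like that can be read off two counts taken some days apart, assuming exponential decay; here's a minimal sketch (the counts and interval below are made up for illustration, not the actual patient numbers):

    ```python
    import math

    # Estimate the halving time of an exponentially decaying count
    # from two readings taken `days` days apart.
    def halving_time(n0, n1, days):
        rate = math.log(n1 / n0) / days   # per-day rate, negative for decay
        return math.log(0.5) / rate       # days until the count halves

    # A hypothetical drop from 1000 to 920 patients within a week:
    print(round(halving_time(1000, 920, 7)))  # → 58, i.e. almost two months
    ```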

    When I wrote this last November, I quickly found that Tkinter (which is what I still usually use to write quick graphical hacks) doesn't really support shaped windows – and anything but shaped windows would, I figured, be extremely painful here. Plain Xlib would make the shaping part relatively straightforward. So, I based the thing directly on Xlib for the remaining code, too.

    That brought forth fond memories of programs I wrote for the Atari ST's GEM in the late 1980s. For instance, the explicit event loop; on the left, code from November 2021, on the right, code I last touched in March 1993 (according to the time stamp):

    def loop(self):                     void event_loop(watch *clstr)
       drag_start = False               { [...]
       while 1:
           e = self.d.next_event()        do
                                           { event =  evnt_multi(
           if e.type==X.DestroyNotify:        MU_MESAG|(clstr->on?MU_TIMER:0),
               sys.exit(0)                      2,0x0,1, [...]
                                              if (event&MU_MESAG)
           elif e.type==X.Expose:               handle_message(pipe,clstr);
               [...]                          if (event&MU_TIMER)
           elif e.type==X.KeyPress:             updclck(clstr);
               [...]                       }
           elif e.type==X.ButtonPress:     while(1);
               [...]                    }

    I'd not indent C like on the right any more.
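    Stripped of all Xlib specifics, the dispatch pattern in such a loop boils down to something like this (a toy sketch with hypothetical string event types; real python-xlib events carry numeric types from Xlib.X):

    ```python
    from collections import namedtuple

    # Hypothetical stand-in for X events; python-xlib's have a numeric
    # .type (X.DestroyNotify, X.Expose, X.KeyPress, ...) instead.
    Event = namedtuple("Event", ["type"])

    def loop(events):
        """Dispatch events by type until a 'destroy' event arrives."""
        actions = []
        for e in events:
            if e.type == "destroy":
                break
            elif e.type == "expose":
                actions.append("redraw")
            elif e.type == "keypress":
                actions.append("key")
        return actions

    print(loop([Event("expose"), Event("keypress"),
                Event("destroy"), Event("expose")]))
    # → ['redraw', 'key']
    ```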

    For the record, Xlib's next_event call looks simpler than GEM's evnt_multi because you configure all the various parameters – what events to listen for, timeouts, and so forth – in separate calls.

    Oh, the contortions one has to go through to have updates when a part of the window is exposed (I'm playing it simple and just use offscreen pixmaps)! Or the fact that the whole thing will (I think) not run on displays with 8 and 16 bits of colour. Or the (seemingly) odd behaviour of programs if you do not at least implicitly wait for the window to be mapped (in actual pyscreenruler, I sidestep the problem by deferring all painting to the screen to the expose handler). And X won't tell you about mapping unless you ask for it. And so on. Note to self: next time you write an Xlib program, let Christophe Tronche remind you of the various little quirks before hacking away.

    Anyway: if you need a rotatable ruler, install the python3-xlib package (that's really all it needs), download pyscreenruler, make it executable, and start it. If you click and drag in the inner half of the ruler, you can move it around; if you click and drag on the outer parts, you can rotate the thing (which you can also do by hitting + and -).

    What's missing: well, I could imagine letting people change the length of the thing (shift-drag on the edges, perhaps?), and then giving length and angle as text somewhere. And it's thoroughly pixel-based right now, which it shouldn't be when there are displays with 200 dpi and more out there.

    Send mails and/or patches to make me properly fix it up. Since I've really missed such a tool, I could even imagine packaging it for Debian...
