Speech Recognition with Whisper.cpp
Today I stumbled across Whispers of A.I.'s Modular Future by James Somers, a piece that, at least by the standards of publications aimed at the general public, makes an excellent case that whisper.cpp might finally be a useful and non-patronising product of the current AI hype.
What can I say? I think I'm sold. And perhaps I'm now a little bit scared, too. If you want to understand why and speak a bit of German, you can skip to The Crazy right away.
The Good
You know, so far I've ignored most of the current statistical modelling (“AI”, “Machine Learning”) – if you need a graphics chip with drivers even worse than Intel's, and that then needs 8 GB of video RAM before anything works, I'm out. And I'm also out when the only way I can use some software is on some web page because there's proprietary data behind it.
Not so for whisper.cpp. This is software as it was meant to be: trivial dependencies, compact, works on basically any hardware there is. To build it, you just run:
git clone https://github.com/ggerganov/whisper.cpp/
cd whisper.cpp
make
– and that's it. No dependency juggling down to incompatible micro versions, no fancy build system, just a few C(++) sources and a Makefile. The thing works in place without a hitch, and it has a sensible command line interface.
Well, you need the language models, of course. There are some reasonably free ones for English. The whisper.cpp distribution's models/README.md explains how to obtain some. I got myself ggml-small.en.bin, recorded a few words of English into a file zw.wav and ran:
./main -m models/ggml-small.en.bin ~/zw.wav
The machine demanded I use a sample rate of 16 kHz, so I made audacity oblige, ran the thing again, and was blown away when – admittedly after a surprisingly long time – my words appeared on the screen.
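If you would rather not fire up audacity just for the resampling, sox can do it from the command line. A minimal sketch, assuming the original recording sits in ~/zw.wav (the output file name is made up; the options are the same ones my script below passes to sox):

sox ~/zw.wav -b 16 -r 16000 -c 1 zw-16k.wav
./main -m models/ggml-small.en.bin zw-16k.wav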
I immediately tried to figure out how to stream in data but then quickly decided that's probably not worth the effort; the software needs to see words in context, and for what I plan to do – transcribing radio shows – having an intermediate WAV file really does not hurt.
I quickly cobbled together a piece of Python wrapping the conversion (using the perennial classic of audio processing, sox) somewhat cleverly, like this:
#!/usr/bin/python
# A quick hack to transcribe audio files
#
# Dependencies:
# * sox (would be mpv, but that's somehow broken)
# * a build of whispercpp (https://github.com/ggerganov/whisper.cpp/)
# * a language model (see models/README.md in the whisper source)

import contextlib
import os
import subprocess
import sys
import tempfile

WHISPER_DIR = "/usr/src/whisper.cpp"


@contextlib.contextmanager
def workdir(wd):
    """A context manager temporarily changing into the directory wd."""
    prev_dir = os.getcwd()
    try:
        os.chdir(wd)
        yield
    finally:
        os.chdir(prev_dir)


def transcribe(audio_source, model, lang):
    """transcribes an audio file, creating an in-place .txt.

    model must be the name of a model file in WHISPER_DIR/models;
    lang is the ISO language code in which the output should turn up.
    """
    audio_source = os.path.join(os.getcwd(), audio_source)
    with tempfile.TemporaryDirectory(suffix="transcribe", dir="/var/tmp") as wd:
        with workdir(wd):
            # convert to the 16 kHz mono WAV whisper.cpp insists on
            subprocess.check_call(["sox", audio_source,
                "-b", "16", "-r", "16000", "-c", "1", "audiodump.wav"])

            out_name = os.path.splitext(audio_source)[0]
            subprocess.check_call([WHISPER_DIR+"/main", "-l", lang,
                "-m", WHISPER_DIR+"/models/"+model,
                "-otxt", "-of", out_name, "audiodump.wav"])


def parse_command_line():
    import argparse
    parser = argparse.ArgumentParser(description="Wrap whisper.cpp to"
        " bulk-transcribe audio files.")
    parser.add_argument("model", type=str, help="name of ggml language"
        f" model to use, relative to {WHISPER_DIR}/models")
    parser.add_argument("audios", type=str, nargs="+",
        help="Sox-translatable audio file to transliterate.")
    parser.add_argument("--lang", type=str, default="en",
        help="Spoken language to try and recognise")

    return parser.parse_args()


if __name__ == "__main__":
    args = parse_command_line()
    for audio in args.audios:
        transcribe(audio, args.model, args.lang)
Addendum (2023-06-26)
(Added a --lang option as per ron's feedback below)
I have that as transcribe.py in my path, and I can now change into the directory holding the rip of an audiobook and say:
transcribe.py ggml-small.en.bin *.ogg
(provided I have downloaded the model as per whisper.cpp's instructions). After a little while (with high CPU usage), there is a transcript on my disk that's better than what I had typed myself even after two rounds of proof-reading, except that whisper.cpp doesn't get the paragraphs right.
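In case you are wondering about that download: the whisper.cpp repo ships a small helper script for fetching the ggml models – at least in the version I used it sits in models/ and is called download-ggml-model.sh; check models/README.md if that has changed. Getting the English small model then is roughly:

cd /usr/src/whisper.cpp
bash ./models/download-ggml-model.sh small.en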
For the first time in the current AI hype, I start getting carried away, in particular when I consider how much speech recognition sucked when I last played with it around 2003, using a heap of sorry failure called viavoice.
The Bad
Skip the rant to get to the exciting part.
Trouble is: what I'd mainly like to transcribe is German radio, and whisper.cpp does not come with a German language model. Not to worry, one would think, as whisper.cpp comes with conversion scripts for PyTorch-based Whisper models like the ones you can get from Hugging Face. I downloaded what I think is the model file and cheerfully ran:
$ python convert-h5-to-ggml.py /media/downloads/model.bin
Traceback (most recent call last):
  File "/home/src/whisper.cpp/models/convert-h5-to-ggml.py", line 24, in <module>
    import torch
ModuleNotFoundError: No module named 'torch'
Oh bummer. Well, how hard can it be? Turns out: surprisingly hard. There is no pytorch package in Debian stable. Ah… much later I realised there is one, it's just that my main system still has an i386 userland, and pytorch is only available for amd64. But I hadn't figured that out then. So, I switched to a virtual Python environment (never mix your system Python and pip) and ran:
$ pip install torch
ERROR: Could not find a version that satisfies the requirement torch
ERROR: No matching distribution found for torch
Huh? What's that? I ran pip with a couple of -v sprinkled in, which at least yielded:
[...]
Skipping link: none of the wheel's tags match: cp38-cp38-win_amd64: https://download.pytorch.org/whl/cpu/torch-1.9.0%2Bcpu-cp38-cp38-win_amd64.whl (from https://download.pytorch.org/whl/cpu/torch/)
[...]
Given no hashes to check 0 links for project 'torch': discarding no candidates
ERROR: Could not find a version that satisfies the requirement torch
ERROR: No matching distribution found for torch
[...]
The message with “Given no” has a certain lyrical quality, but other than that, from the “Skipping” messages I concluded they don't offer 32-bit builds any more.
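If you want to check for yourself which wheel tags your pip would accept – and hence why all those links get skipped on an i386 userland – pip can list them; note that pip itself marks this command as unstable:

pip debug --verbose | grep -i -A 10 "compatible tags"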
Well, how hard can it be? PyPI says the sources are on GitHub, and so I cloned that repo. Oh boy, AI at its finest. The thing pulls in a whopping 3.5 Gigabytes of who-knows-what. Oh, come on.
python setup.py build fails after a short while, complaining about missing typing_extensions. Manually running pip install typing_extensions fixes that. But I killed setup.py build after a few minutes when only 50 of 5719 files had been built. Did an AI write that software?
In the meantime, I had gone to a machine with a 64 bit userland, and to be fair the experience wasn't too bad there, except for the hellish amount of dependencies that pytorch pulls in.
So, my expectations regarding “AI code” were by and large met in that second part of the adventure, including the little detail that the internal links on https://pypi.org/project/torch/ are broken because right now their document processor does not produce id attributes on the headlines. Yeah, I know, they're giving it all away for free and all that. But still, after the brief glimpse into the paradise of yesteryear's software that whisper.cpp afforded, this was a striking contrast.
The Crazy
So, I converted the German language model doing, in effect:
git clone https://github.com/openai/whisper.git
git lfs install
git clone https://huggingface.co/bofenghuang/whisper-small-cv11-german
python convert-h5-to-ggml.py whisper-small-cv11-german/ whisper tmp
(where I took convert-h5-to-ggml.py from whisper.cpp's repo). Then I moved the resulting tmp/ggml-model.bin to german-small.ggml and ran:
transcribe.py german-small.ggml peer_review_wie_objektiv_ist_das_wissenschaftliche_dlf_20221214_1646_8a93e930.mp3
with my script above and this German-language mp3 from Deutschlandfunk. From the English experience, I had expected to get an almost flawless transliteration of the German text. What I got instead was (paragraphs inserted by me); listen to the audio in parallel if you can:
Germany. Research is on [that was: Deutschlandfunk Forschung aktuell]
A Nobel Prize for Science is not easy without further ado. They really need to find something out. For example, Vernon Smith, who is now 95 years old, is now the father of the Experimental Economy. In 2002 he won the Nobel Prize for Science.
This made such a prize and renommee also make impression on other Fachleuteen and that actually influenced the unabhängig well-office method for scientific publications. This has recently shown a study of Business Science in the Fachmagazin PNS. Anike Meyer spoke with one of the authors.
When Jürgen Huber and his colleagues thought about the experiment, it was clear to them that this is not fair. The same manuscript was given by two different authors, Vernon …