Articles from edv

  • View with Netsurf

    A screenshot of a browser window

    An early version of this post rendered in netsurf.

    I believe about the worst threat to software freedom these days is web browsers. That is not only because they already are, for many people out there, a more relevant applications platform than their primary operating system, and because almost everything that gets run in them is extremely non-Free software. I've been linking to a discussion of this problem from these pages since this blog's day one as part of my quip on “best viewed with javascript disabled”.

    No, they are also a threat because the “major” browser engines are so humongous that they are in effect locking out most platforms (which simply don't have enough power to run them). And they are a threat because the sheer size and complexity of their code bases make it essentially impossible for an individual to fix almost any relevant bug in them related to rendering, javascript execution, or network interactions.

    That is why I am so grateful to the authors and maintainers of both dillo (Debian: dillo) and netsurf (Debian: netsurf-gtk, mainly), small browsers with maintainable code bases. While dillo is really basic and is missing so much of CSS and modern HTML that on today's web even many non-adversarial sites become barely usable, netsurf is usually just fine for websites respecting user rights.

    Flex layouts and the article elements: The good part of 20 years of web development after the Web 1.0.

    I have to admit I nevertheless only use it in very specific contexts, mostly because luakit with its vi-like key bindings and lua extensibility in the end usually wins out even though I don't trust the webkit rendering engine for two cents[1]. And that is why I hadn't noticed that this blog has rendered a lot worse than it should have in netsurf. This is particularly shameful because that was mostly because I have taken liberties with web standards that I should not have taken. Apologies: Netsurf was right and I was wrong.

    I have improved that quite a bit this morning. Given I am using flex layouts quite liberally here, and these don't work in Debian stable's netsurf, the rendered pages do look quite a bit different in netsurf than on the “major” browsers. But the fallbacks are ok as far as I am concerned. Since flex layouts are among the few “innovations” in the post-Web 1.0 ecosystem that are actually a good idea, I gladly accept these fallbacks. Let me stress again that it is a feature of a friendly web rather than a bug that pages look different in different user agents.

    Dillo, regrettably, is another matter because of the stupid^Wunderconsidered colour games I'm playing here. As things are right now, the light background below text like this one sits on an HTML5 article element, which dillo ignores. Hence, the text is black on dark green, which, well, may be barely readable but really is deeply sub-optimal. Since I consider the article element and its brethren real progress in terms of markup (the other positive “innovation” post Web-1.0), I will not change that markup just to make this render better in dillo. I may finally re-think the silly dark green background soon-ish, though.

    [1]If you feel like this, too, let's team up and massage luakit's front end to work with netsurf's rendering engine. Given the close entanglement of luakit with the webkitgtk API, this certainly will result in a very different program, and almost certainly there would be no way to re-use luakit extensions. Still, I could very well see such a thing become my main browser.
  • Giving in to Network Effects

    In my first Fediverse notes, I mused that I'd choose a larger community if I had to choose again.

    Well, after watching the Fediverse for a little while, I figured that while I may not actually have to choose again, I really want to. My old community, social.dev-wiki.de, had about 650 profiles that had posted 4500 toots between them. This undoubtedly counts as small, and that has the double effect that not terribly many toots are coming in on the federated feed (I can't bring myself to write “timeline”) because people on the instance don't follow too many others, and that toots I produce don't get distributed very far because there are not many instances with people following someone on that small instance. A double negative network effect.

    This is particularly unwelcome when globally searching for hashtags (as I did last Sunday when I thought the local elections in Saarland might reflect in the Fediverse). Sure, I can help fix that by starting to follow accounts from other instances, but that's a bit of a chicken-and-egg thing, since in my own instance's feeds I don't even see those other accounts. Perhaps sitting on the public feed of the “flagship” instance (mastodon.social has about 640'000 profiles) for a while might have helped.

    Fediverse relays are a honking great idea.

    But it also felt odd to be behind the most active profile on that instance, and so I decided to compromise. That is, I don't give in to the pressures of the network effect altogether. I am therefore not switching to the flagship instance (which does feel a bit central-ish). But I am switching to troet.cafe, which boasts 2600 users (a factor of 4 over my old instance) with 150'000 posts (a factor of 30) between them. Plus, it uses a “relay” that somewhat mitigates the problem outlined above by essentially creating a sub-federation of smaller instances exchanging public toots regardless of whether people on them follow each other.

    So, I made the move today.

    It is nice that Mastodon has built-in support for moving; there is an “Import and Export” item in the settings menu that guides one reasonably clearly through the process and transfers the followers (small deal on my profile at the moment). It is then that it gets a little lame.

    You see, I'd have expected that when I get an archive of my profile under “Export” I ought to be able to import it again under the new profile's “Import”. But that is not how it works; it seems the downloaded archive cannot be uploaded, and whatever is in there is either lost (the toots) or needs to be manually restored (profile pictures). Instead, what can be uploaded are CSVs of followed people, block lists and the like. And these are not in the archive one downloads but need to be downloaded from the old profile and re-uploaded to the new profile one by one.

    Is this really the way it's supposed to be? Have I missed something?

    Ok, one doesn't move every day, but if I keep being a Fedinaut, I will probably move again one day – to my own instance. It would be nice if by then there were smoother migration paths.

  • 'Failed to reset ACL' with elogind: Why?

    As I've blogged the other day, I like having my machine's syslog on the screen background so I notice when the machine is unwell and generally have some idea what it thinks it is doing. That also makes me spot milder distress signals like:

    logind-uaccess-command[30337]: Failed to reset ACL on /dev/bus/usb/002/061: Operation not supported
    

    I've ignored those for a long time since, for all I can see, logind-like software does nothing that on a normal machine sudo and a few judicious udev rules couldn't do just as well – and are doing on my box. The only reason there's elogind (a logind replacement that can live without systemd) on my box is because in Debian, kio – which in bullseye 270 packages depend upon – depends upon something like logind. The complaints in the syslog thus came from software I consider superfluous and I'd rather not have at all, which I felt was justification enough to look the other way.

    But then today curiosity sneaked in: What is going on there? Why would whatever elogind tries break on my box?

    Well, the usual technique of pasting relevant parts of the error message into some search engine leads to elogind PR #47 (caution: github will run analytics on your request). This mentions that the message results from a udev rule that tries to match hotplugged devices with users occupying a “seat”[1]. The rule calls some binary that would make sure that the user on the “seat” has full access to the device without clobbering system defaults (e.g., that members of the audio group can directly access the sound hardware) – and to keep the others out[2]. The Unix user/group system is not quite rich enough for this plan, and hence a thing called POSIX ACLs would be used for it, a much more complicated and fine-grained way of managing file system access rights.

    Well, the udev rules mentioned in the bug indeed live on my box, too, namely in /lib/udev/rules.d/73-seat-late.rules, which has the slightly esoteric:

    TAG=="uaccess", ENV{MAJOR}!="", RUN{program}+="/lib/elogind/elogind-uaccess-command %N $env{ID_SEAT}"
    

    I frankly have not researched what exactly adds the uaccess tag that this rule fires on, and when it does that, but clearly it does happen in Debian bullseye. Hence, this rule fires, and thus the failing elogind-uaccess-command is started.

    But why does it fail? Well, let's see what it is trying to do. The great thing about Debian is that as long as you have a (proper) deb-src line in your /etc/apt/sources.list, you can quickly fetch the source code of anything on your box:

    cd /usr/src  # well, that's really old-school.  These days, you'll
                 # probably have your sources somewhere else
    mkdir elogind # apt-get source produces a few files
    cd elogind    # -- keep them out of /usr/src proper
    apt-get source elogind
    cd <TAB>  # there's just one child directory
    

    To see where the source of the elogind-uaccess-command would be, I could have used a plain find, but in cases like these I'm usually lazy and just recursively grep for sufficiently specific message fragments, as in:

    find . -name "*.c" | xargs grep "reset ACL"
    

    This brings up src/uaccess-command/uaccess-command.c, where you'll find:

    k = devnode_acl(path, true, false, 0, false, 0);
    if (k < 0) {
             log_full_errno(errno == ENOENT ? LOG_DEBUG : LOG_ERR, k, "Failed to reset ACL on %s: %m", path);
             if (r >= 0)
                     r = k;
     }
    

    Diversion: I like the use of the C ternary operator to emit a debug or error message depending on whether or not things failed because the device file that should have its ACL adapted does not exist.

    So, what fails is a function called devnode_acl, which does not have a manpage but can be found in login/logind-acl.c. There, it calls a function acl_get_file, and that has a man page. Quickly skimming it would suggest the prime suspect for failures would be the file system, as that may simply not support POSIX ACLs (which, as I just learned, aren't really properly standardised). Well, does it?

    An apropos acl brings up the chacl command that would let me try acls out from the shell. And indeed:

    $ chacl -l /dev/bus/usb/001/003
    chacl: cannot get access ACL on '/dev/bus/usb/001/003': Operation not supported
    

    Ah. That in fact fails. To remind myself what file system we are talking about, I ran mount | grep "/dev " (the trailing blank on the search pattern is important), which corrected my memory from “it's a tmpfs” to “it's a devtmpfs”; while it turns out that the difference between the two does not matter for the problem at hand, your average search engine will bring up the vintage 2009 patch at https://lwn.net/Articles/345480/ (also from the abysses from which systemd came) when asked for “devtmpfs acl”, and a quick skim of that patch made me notice:

    #ifdef CONFIG_TMPFS_POSIX_ACL
    (something)
    

    This macro comes from the kernel configuration. Now, I'm still building the kernel on my main machine myself, and looking at the .config in my checkout of the kernel sources confirms that I have been too cheap to enable POSIX ACLs on my tmpfses (for a machine with, in effect, just a single user who's only had contact with something like POSIX ACLs ages ago on an AFS, that may be understandable).
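
    If you would rather not eyeball the .config, a check like this can be scripted. What follows is a minimal sketch, assuming the usual .config conventions (OPTION=y or OPTION=m for enabled options, a "# OPTION is not set" comment for disabled ones); the function name is made up for this illustration.

```python
def kernel_option_state(config_text, option):
    """Return the value of option ('y', 'm', ...) or None if it is not set.

    config_text is the content of a kernel .config file; option is a
    name like "CONFIG_TMPFS_POSIX_ACL".
    """
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options show up as a comment of this exact form.
        if line == f"# {option} is not set":
            return None
        # Enabled options are plain assignments.
        if line.startswith(option + "="):
            return line.split("=", 1)[1]
    # The option is not mentioned at all.
    return None
```

    Feed it the content of the .config in your kernel checkout and ask for CONFIG_TMPFS_POSIX_ACL; a result of None corresponds to the situation described above.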

    Well, I've enabled it and re-built my kernel, and I'm confident that after the next reboot the elogind messages will be gone. And who knows, perhaps the thing may actually save me a custom udev rule or two in the future because it automagically grants me access to whatever I plug in.

    Then again: Given there's now an API for Javascript from the web to read USB devices (I'm not making this up) and at least so far I'm too lazy to patch that out of my browsers… perhaps giving me (and hence these browsers) that sort of low-level access is not such a good idea after all?

    [1]See Multiseat on Wikipedia if you have no idea what I'm talking about. If you've read that you can probably see why I consider logind silly for “normal” computers with either a single user or lots of users coming in through the network.
    [2]Mind you, that in itself is totally reasonable: it would suck if everyone on a machine could read the USB key you've just plugged into a terminal; except that it's a rare configuration these days to have multiple persons share a machine that anyone but an administrator could plug anything into.
  • Now on the Fediverse

    Mastodon logo

    AGPL (copyright)

    While I believe that RSS (or rather Atom a.k.a. RFC 4287) is a great standard for subscribing to media like blogs[1], I strongly suspect that virtually nobody pulls my RSS feed. I'm almost tempted to log for a while to ascertain that. Then again, just based on how few people still run RSS aggregators (me, I'm using a quick self-written hack based on python3-feedparser) I am already quite confident the RSS mainly sits idly on my server.

    At least outside of my bubble, I guess what RSS was designed for has been superseded by the timelines of Facebook, Twitter, and their less shopworn ilk. As a DIY zealot, of course none of that is an option for me. What is an option in this field (and what certainly can do with a bit more public attention) is what these days is commonly called the Fediverse, that is, various sites, servers and client programs in the rough vicinity of microblogging, held together by W3C's ActivityPub protocol.

    What this technobabble means in practice: If you already are in the Fediverse, you can follow @Anselm@social.dev-wiki.de and get a toot whenever I post something here (note, however, that most posts will be in German).

    If you're not in the Fediverse yet, well, choose a community[2] – if I had to choose again, I'd probably take a larger community, as that increases one's initial audience: other communities will, for all I understand, only carry your public toots (i.e., messages) if someone in them has subscribed someone from your community –, get a client – I'm using tootle as a GUI and toot for the CLI – and add my Fediverse id.

    To accommodate tooting about new posts, I have made two changes to my pelican tooling: For one, post.py3 now writes a skeleton toot for the new post, like this:

    with open("next-toot", "w", encoding="utf-8") as f:
      f.write(f"{headline} – https://blog.tfiu.de/{slug}.html\n#zuengeln\n")
    

    And I have a new Makefile target:

    toot:
      (cat next-toot; echo "Post?"; read x)
      toot post < next-toot
    

    In that way, when I have an idea what the toot for the article should contain while I'm writing the post, I edit next-toot, and after I've run my make install, I'm doing make toot to notify the Fediverse.

    A side benefit: if you'd like to comment publicly and don't want to use the mail contact below, you can now do that through Mastodon and company.

    [1]That it is a great standard is already betrayed by the fact that its machine-readable specification is in Relax NG rather than XML schema.
    [2]This article is tagged DIY although I'm not running a Mastodon (or other ActivityPub server) instance myself because, well, I could do that. I don't, for now, because Mastodon is not packaged for Debian (and for all I can tell neither are alternative ActivityPub servers). Looking at Mastodon's source I can understand why. Also, I won't rule out that the whole Fediverse thing will be a fad for me (as was identi.ca around 2009), and if I bother to set up unpackaged infrastructure, I need to be dead sure it's worth it.
  • Mutt says: “error encrypting data: Unusable public key”

    Today, I replied to an encrypted mail, and right after the last “yes, go ahead, send this stuff already”, my mail client mutt showed an error:

    error encrypting data: Unusable public key
    

    Hu? What would “unusable” mean here? The message when all PGP keys are expired looks quite a bit different. And indeed, the key in question was not expired at all:

    $ gpg --list-keys person@example.net
    pub   rsa4096/0xDEEEEEEEEEEEEEEE 2015-03-21 [SCA] [expires: 2023-02-01]
          FINGERPINTWITHHELDFINGERPRINTWITHHELDFIN
    uid                   [  full  ] Person <person@example.net>
    

    – this should do for another year or so. Or should it?

    Feeding the message to a search engine brings up quite a few posts, most of them from times when keyservers would mess up subkeys, i.e., the cryptographic material that is used to actually encrypt stuff (as opposed to the main key that usually just authenticates these subkeys).

    This obviously did not apply here, since keyservers have long been fixed in this respect. But subkeys were the right hint. If you compare the output above with what such a command will output for the feedback key for this blog:

    $ gpg --list-keys zuengeln@tfiu.de
    pub   rsa3072/0x6C4D6F3882AF70AD 2021-01-28 [SC]
          60505502FB15190B10DBF1436C4D6F3882AF70AD
    uid                   [ultimate] Das Engelszüngeln-Blog <zuengeln@tfiu.de>
    sub   rsa3072/0x3FCFC394D8DF7140 2021-01-28 [E]
    

    you'll notice that the Person's key above does not have a sub line, i.e., there are no subkeys.

    How can that happen? Gnupg won't create such a thing without serious amounts of coercion, and such a key is largely useless.

    Well, it turns out it doesn't happen. The subkeys are there, gnupg just hides them because that's what it does with expired subkeys by default. If you override that default, you'll get:

    $  gpg --list-options show-unusable-subkeys --list-keys person@example.net
    pub   rsa4096/0xDEEEEEEEEEEEEEEE 2015-03-21 [SCA] [expires: 2023-02-01]
          FINGERPINTWITHHELDFINGERPRINTWITHHELDFIN
    uid                   [  full  ] Person <person@example.net>
    sub   rsa4096/0xEEEEEEEEEEEEEEEE 2015-02-01 [E] [expired: 2020-01-31]
    sub   elg4096/0xEEEEEEEEEEEEEEEE 2020-02-01 [E] [expired: 2022-01-31]
    

    So, that's the actual meaning of the error message about “Unusable public key”: “No usable subkey”.
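
    To spot this condition without squinting at listings, one can parse gpg's machine-readable output. The following is only a sketch under a few assumptions: it expects the text gpg prints with --with-colons, and it relies on the colon format's field layout (record type in field 1, validity in field 2 with "e" meaning expired, key id in field 5, expiry time in field 7). The function name is invented for this illustration.

```python
from datetime import datetime, timezone


def expired_subkeys(colon_listing):
    """Return (key id, expiry date) pairs for subkeys marked as expired.

    colon_listing is the text printed by gpg --with-colons --list-keys.
    The expiry date is None when it is not given in seconds since the epoch.
    """
    found = []
    for line in colon_listing.splitlines():
        fields = line.split(":")
        # Only look at subkey records whose validity flag says "expired".
        if len(fields) < 7 or fields[0] != "sub" or fields[1] != "e":
            continue
        expires = None
        if fields[6].isdigit():
            expires = datetime.fromtimestamp(
                int(fields[6]), tz=timezone.utc).date()
        found.append((fields[4], expires))
    return found
```

    The input would come from something like gpg --list-options show-unusable-subkeys --with-colons --list-keys person@example.net; if the function returns entries and no unexpired encryption subkey is left in the listing, mutt's complaint is explained.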

    What's a fix for that? Well, for all I know you cannot force gnupg to encrypt for an expired key, so the way to temporarily fix things (for instance, to tell people to make their keys permanent[1]) is to turn the clock. There's the nice program faketime that just changes the time for whatever runs below it. That's great because on modern computers, changing the system time has all kinds of ugly side effects (not to mention you'd have to kill the ntpd that your computer quite likely runs to keep your computer's clock synchronised with the rest of the world).

    Since I'm using mutt as a mailer, I'd use faketime like this:

    faketime 2022-01-31 mutt
    

    I'm fairly confident this would work with, say, thunderbird as well, though it might be a problem if the times of an X server and client are dramatically different.

    But that's really no substitute for an updated key: In most people's mailboxes, such mails will be way down in the swamp of rotting mails from one month ago[2]. And mail servers sometimes refuse to transport mail that's so far from the past.

    Then again, to my own surprise, every time I had to go to such extremes because I didn't have a non-expired key, the recipients eventually noticed.

    [1]Let me again advertise non-expiring keys. The main arguments for these are that (a) essentially nobody directly attacks keys, so it really doesn't matter if a key is used for a decade or more, and (b) PGP is hard enough for muggles even without auto-destructing keys. The net effect of expiring keys on privacy is thus negative, because they keep people off using PGP and even trying to understand crypto. And you can always revoke keys, in particular when we have educated people to now and then sync their keyring with keyservers.
    [2]As a side note: While inbox zero sounds too much like one of those market-radical self-improvement fads to me, I've been religious about less-than-a-page inbox for the past decade or so and found it did improve a relevant part of my life.
  • Wakealarm: Device or resource busy

    The other day I wanted a box doing regular (like, daily) file system backups and really not much else to switch off while idle and then wake up for the next backup. Easy, I thought, install the nvram-wakeup package and that's it.

    Alas, nvram-wakeup mumbled something about an unsupported BIOS that sounded suspiciously like a lot of work that would benefit almost nobody, as the box in question houses an ancient Supermicro board that's probably not very common any more.

    So, back to the roots. Essentially any x86 box has an rtc that can wake it up, and Linux has had an interface to that forever: Cat a unix timestamp (serialised to a decimal number) into /sys/class/rtc/rtc0/wakealarm, as discussed in the kernel documentation's sysfs-class-rtc file:

    (RW) The time at which the clock will generate a system wakeup event. This is a one shot wakeup event, so must be reset after wake if a daily wakeup is required. Format is seconds since the epoch by default, or if there's a leading +, seconds in the future, or if there is a leading +=, seconds ahead of the current alarm.

    That doesn't tell the full story, though. You see, I could do:

    BACKUP_AT="tomorrow 0:30"
    echo `date '+%s' -d "$BACKUP_AT"` > /sys/class/rtc/rtc0/wakealarm
    

    once, and the box came back, but when I then tried it again, the following happened:

    echo `date '+%s' -d "$BACKUP_AT"` > /sys/class/rtc/rtc0/wakealarm
    bash: echo: write error: Device or resource busy
    

    Echoing anything with + or += did not work either; I have not tried to ascertain why, but suspect that's functionality for more advanced RTC chips.

    Entering the error message into a search engine did bring up an lkml thread from 2007, but on lkml.iu.edu the thread ends with an open question: How do you disable the wakealarm? Well: the obvious guess of echo "" does not work. My second guess, however, did the trick: You reset the kernel wakealarm by writing a 0 into it:

    echo 0 > /sys/class/rtc/rtc0/wakealarm
    

    – after which it is ready to be written to again.
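
    For a backup script that re-arms the alarm on every run, the reset-then-set sequence can be wrapped up. This is a minimal sketch, not what runs on my box: the function names are made up, the default path is the rtc0 one used above, and the timestamp is computed from local time in the way date -d "tomorrow 0:30" does.

```python
from datetime import datetime, timedelta

WAKEALARM = "/sys/class/rtc/rtc0/wakealarm"


def tomorrow_at(now, hour, minute):
    """Return the datetime for hour:minute on the day after now."""
    next_day = now + timedelta(days=1)
    return next_day.replace(hour=hour, minute=minute, second=0, microsecond=0)


def set_wakealarm(when, path=WAKEALARM):
    """Clear any pending alarm, then arm the RTC for the datetime when."""
    # Opening the file twice makes sure the kernel sees two separate
    # writes: first the reset, then the new alarm time.
    with open(path, "w") as f:
        f.write("0\n")
    with open(path, "w") as f:
        f.write(f"{int(when.timestamp())}\n")
```

    As root, set_wakealarm(tomorrow_at(datetime.now(), 0, 30)) would then schedule the next wakeup regardless of whether an old alarm is still pending.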

    And now that I've written this post I notice that the 2007 thread indeed goes on, as on narkive, and a bit further down, Tino summed up this entire article as:

    Please note that you have to disable the old alarm first, if you want
    to set a new alarm. Otherwise, you get an error. Example:
    
    echo 12345 > /sys/class/rtc/rtc0/wakealarm
    echo 0 > /sys/class/rtc/rtc0/wakealarm
    echo 23456 > /sys/class/rtc/rtc0/wakealarm
    

    Ah well. Threading is an important feature in mail clients, even if they're just archives.

  • Inlining xs:include in XML Schema

    Screenshot: Fragmented XSD schema

    Please don't do it like this: for users of a schema, having to pull it in a dozen fragments is just pain and no gain. See below for a program that lets you heal this particular disease.

    While I'm a big fan of XML – which is governed by a very well-written standard and is (DTDs aside) about as easy to process as something context-free can be –, I have always been a lot more skeptical about XML Schema, which is horrendously complex, has a few nasty misfeatures[1] and generally has had a major role in giving XML a bad name.

    But well, it's there, and it won't go away. That ought to be reason enough to not encumber it with further and totally avoidable pain. As, for instance, splitting up a single schema into fragments of a couple of lines and then using xs:include liberally to re-assemble the fragments at the client side. Datacite, I'm looking at you. Regrettably, they're not the only ones doing that. And as opposed to splitting a domain mapping into different schemas – which might improve re-usability –, this lexical splitting really helps nobody except perhaps the authors.

    The use of xs:include is a pain in particular when one tries to implement redistributable validators, as these then need to keep a lot of files in a defined hierarchy. Just pointing to the vendor's site is not an option, because the software would hit that every time it validates something, which is, if nothing else, a privacy and stability problem.

    Well, today I had another case of XSD splititis, and this time it was bad enough that I decided to merge the fragments. I had expected people had written “inliners” expanding xs:include in XSDs into standalone XSDs. After all, it's basically a lexical thing (well, excepting namespace mappings and perhaps re-indentation). Five minutes of operating a search engine didn't bring up anything, though, and so I wrote a quick cure: expand-xsd-include.py.

    I'll be the first to admit that it is a hack at this point, mainly because I blindly discard the root tags for the included documents. That's wrong because these might add or, worse, change the mapping from prefixes to XML namespaces. In its current state, this will fail badly if included documents use different or extra mappings and declare them in the root element; declarations further down are ok.

    Another problem resulting from keeping namespace processing off on the parser is that I hardcode the prefix for the XSD schema to xs. If your schema uses something else, change XSD_PREFIX in the script.
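
    To illustrate the basic idea (this is not the code of expand-xsd-include.py, which works with namespace processing switched off), here is a minimal sketch using Python's namespace-aware xml.etree.ElementTree. It assumes that all fragments are local files and that includes only occur directly below the schema root; the function name is made up. Like the script, it throws away the root elements of included documents. And since ElementTree forgets prefix declarations, prefixes other than xs that are only used inside attribute values would end up undeclared when the result is serialised.

```python
import os
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"
INCLUDE_TAG = "{%s}include" % XSD_NS

# Make ElementTree write the schema namespace with the usual xs prefix.
ET.register_namespace("xs", XSD_NS)


def inline_includes(path, seen=None):
    """Return the root element of the schema at path, xs:include expanded.

    Only the children of an included root element are kept; whatever
    that root declares itself (attributes, prefixes) is dropped.
    """
    path = os.path.abspath(path)
    seen = set() if seen is None else seen
    seen.add(path)
    root = ET.parse(path).getroot()

    merged = []
    for child in list(root):
        if child.tag != INCLUDE_TAG:
            merged.append(child)
            continue
        # schemaLocation is resolved relative to the including file.
        target = os.path.abspath(
            os.path.join(os.path.dirname(path), child.get("schemaLocation")))
        if target not in seen:  # skip repeated and circular includes
            merged.extend(inline_includes(target, seen))

    # Replace the root's children by the merged list.
    for child in list(root):
        root.remove(child)
    root.extend(merged)
    return root
```

    The merged schema could then be written with ET.ElementTree(root).write("merged.xsd", encoding="utf-8", xml_declaration=True), where merged.xsd is just an example name.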

    Mending these deficiencies wouldn't be an undue effort, and if you have XSDs that need it, let me know and I'll do proper namespace processing. Or perhaps, in addition, teach the thing to pull the input files via http. Meanwhile, I suspect that the large majority of atomised XSDs can be merged with this code, and so I thought I might as well put it online in its slightly embarrassing shape.

    Let me know if you use it. And if you distribute fragmented XSDs: Why not use the script to assemble your XSD before publishing it?

    [1]The worst XSD misfeature IMHO are the namespaced attribute values; where XML has been designed to be parsable without external DTDs (ok, not generally, but under well-defined conditions, and it's been a long time since I saw a document that didn't meet those), parsing results with namespaced attribute values depend on whether or not the parser knows the XSD. And that would even be bad without the ugly schemaLocation hacks in both schema and schema instance.
  • Explaining Tags in Pelican

    Right after I had celebrated the first anniversary of this blog with the post on my Pelican setup, I decided to write another plugin I've been planning to write for a while: taginfo.py.

    Addendum (2022-10-07)

    Don't take it from here; rather, see https://codeberg.org/AnselmF/pelican-ext

    This is for:

    Blog screenshot

    that is, including explanations on pages for tags, telling people what the tag is supposed to mean.

    To use taginfo, put the file into your plugins folder, add taginfo to the PLUGINS list in your pelicanconf.py, and then create a folder taginfo next to your content folder. In there, for each tag you want to comment, create a file <tagname>.rstx (or just rst). Such a file has to contain reStructuredText, where pelican's extensions (e.g., {filename} links) do not work (yet). I suppose it wouldn't be hard to support them; if you're interested in this plugin, feel free to poke me in case you'd like to see the extra pelican markup.
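
    The lookup described here is simple enough to sketch. The following is an illustration and not the plugin's actual code: the function name is invented, and preferring .rstx over .rst when both exist is an assumption.

```python
from pathlib import Path


def find_tag_description(taginfo_dir, tagname):
    """Return the Path of the description file for tagname, or None."""
    for extension in (".rstx", ".rst"):
        candidate = Path(taginfo_dir) / (tagname + extension)
        if candidate.is_file():
            return candidate
    # No description has been written for this tag.
    return None
```

    A tag without a file in the taginfo folder simply yields None, so a template can render such tag pages without any explanation.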

    To make the descriptions visible, you need to change your tag.html template (typically in theme/templates/tag.html) in order to arrange for tag.make_description() to be called when rendering the document. Me, I'm doing it like this:

    {% block content_title %}
    <h1>Tag <em>{{ tag }}</em></h1>
    <div id="taginfo">
            {{ tag.make_description() }}
    </div>
    {% endblock %}
    

    (And I still find jinja templates exceptionally ugly).

  • How I'm Using Pelican

    I started this blog on January 14th last year. To celebrate the anniversary, I thought I could show how I'm using pelican (the blog engine I'm using); perhaps it'll help other people using it or some other static blog generator.

    Posting and Writing

    First, I structure my content subdirectory (for now) such that each article has the ISO-formatted date as its name, which makes that source name rather predictable (for linking using pelican's {filename} replacement), short, and gives the natural sort order sensible semantics.

    Also, I want to start each post from a template, and so among the first things I did was write a little script to automate name generation and template instantiation. Over the past year, that script has evolved into post.py3.

    Addendum (2022-03-15)

    I've changed a few things in the meantime; in particular, I am now opening a web browser because I got tired of hunting for the URI when it was scrolled off the screen before I first had something to open, and to make that work smoothly, I'm building the new post right after creating its source.

    It sits next to pelican's Makefile and is in the blog's version control. With this, starting this post looked like this:

    $ ./post.py3 "How I'm Using Pelican"
    http://blog/how-i-m-using-pelican.html
    remake.sh output/how-i-m-using-pelican.html
    

    Addendum (2022-05-26)

    The output is now a bit different, and now I do open the browser window – see below.

    What the thing printed is the URL the article will be seen under (I've considered using the webbrowser module to automatically open it, but for me just pasting the URL into my “permanent” blog browser window works better). The second line gives a command to build the document for review. This remake.sh script has seen a bit of experimentation while I tried to make the specification of what to remake more flexible. I've stopped that, and now it's just:

    #!/bin/bash
    pelican --write-selected "$1"
    

    When you add:

    CACHE_CONTENT = True
    LOAD_CONTENT_CACHE = True
    CONTENT_CACHING_LAYER = 'generator'
    

    to your pelicanconf.py, rebuilding just the current article should be relatively quick (about 1 s on my box). Since I like to proofread on the formatted document, that's rather important to me.

    Addendum (2022-05-26)

    N…no. This part I'm now doing very differently. See Quick RST Previews.

    If you look at post.py3's code, you will see that it also fixes the article's slug, i.e., the path part of the URL. I left this to Pelican for a while, but it annoyed me that even minor changes to a blog title would change the article's URI (and hence also the remake statement). I was frankly tempted to not bother having elements of the title in the slug at all, as I consider this practice SEO, and I am a fanatical enemy of SEO. But then I figured producing shorter URIs isn't worth that much, in particular when I'd like them to be unique and easy to pronounce. In the end I kept the title-based slugs.
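
    For illustration, a title-based slug of the kind shown in the session above can be derived with a single regular expression. This is a sketch and not necessarily what post.py3 does; the function name is made up, and everything outside of ASCII letters and digits (including umlauts) simply collapses into dashes.

```python
import re


def make_slug(title):
    """Turn a post title into a lowercase, dash-separated URL path part."""
    # Every run of characters that is not a-z or 0-9 becomes one dash;
    # dashes at either end are dropped.
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
```

    Computing this once when the post is created and writing the result into the article's metadata is what keeps the URI stable when the title changes later.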

    The script also picks the local file name as per the above consideration with some disambiguation if there's multiple posts on one day (which has only happened once in the past year). Finally, the script arranges for adding the new post to the version control system. Frankly, from where I stand now, I'd say I had overestimated the utility of git for blogging. But then, a git init is cheap, and who knows when that history may become useful.

    I'm not using pelican's draft feature. I experimented with it for a while, but I found it's a complication that's not worth anything given I'm always finishing a post before starting the next. That means that what otherwise would be the transition from draft to published for me is the make install. The big advantage of starting with status:published is that under normal circumstances, an article never changes its URI.

    Local Server Config and Media

    Another pelican feature I'm not using is attaching static files. I have experimented with that initially, but when the first larger binary files came in, I realised they really shouldn't be under version control. Also, I never managed to work out a smooth and non-confusing way to have pelican copy these files predictably anyway.

    What I ended up doing is have an unversioned, web-published directory that contains all non-article (“media”) files. On my local box, that's in /var/www/blog-media, and to keep a bit of order in there, the files sit in per-year subdirectories (you'll spot that in the link to the script above). The blog directory with the sources and the built documents, on the other hand, is within my home. To assemble all this, I have an /etc/apache2/sites-enabled/007-blog.conf containing:

    <VirtualHost *:80>
      ServerName blog
      DocumentRoot /home/anselm/blog/output
    
      Alias /media /var/www/blog-media
    
      ProxyPass /bin/ http://localhost:6070/
    
      <Directory "/home/anselm/blog/output">
        AllowOverride None
        Options Indexes FollowSymLinks
        Require all granted
      </Directory>
    
      <Directory ~ "/\.git">
        Require all denied
      </Directory>
    </VirtualHost>
    

    which needs something like:

    127.0.0.1 localhost blog
    

    in your /etc/hosts so the system knows what the ServerName means. The ProxyPass statement in there is for CGIs, which of course apache could do itself; more on this in some future post. And I'm blocking the access to git histories for now (which do exist in my media directory) because I consider them fairly personal data.

    Deployment

    Addendum (2022-07-10)

    I'm now doing this quite a bit differently because I have decided the procedure described here is a waste of bandwidth (which matters when all you have is GPRS). See Maintaining Static Blogs Using git push.

    When I'm happy with a post, I remake the whole site and push it to the publishing box (called sosa here). I have added an install target to pelican's Makefile for that:

    install: publish
      rsync --exclude .xapian_db -av output/ sosa:/var/blog/generated/
      rsync -av /var/www/blog-media/ sosa:/var/blog/media/
      ssh sosa "BLOG_DIR=/var/blog/generated/ /var/blog/media/cgi/blogsearch"
    

    As you can see, on the target machine there's a directory /var/blog belonging to me, and I'm putting the text content into the generated and the media files into the media subdirectory. The exclude option to the rsync and the call to blogsearch are related to my local search: I don't want the local index on the published site so I don't have to worry about keeping it current locally, and the call to blogsearch updates the index after the upload.

    The publication site uses nginx rather than apache. Its configuration (/etc/nginx/sites-enabled/blog.conf) looks like this (TLS config removed):

    server {
      include snippets/acme.conf;
      listen 80;
      server_name blog.tfiu.de;
    
      location / {
        root /var/blog/generated/;
      }
    
      location /media/ {
        alias /var/blog/media/;
      }
    
      location /bin/ {
        proxy_pass http://localhost:6070;
        proxy_set_header Host $host;
      }
    
      location ~ \.git/ {
        deny all;
      }
    }
    

    – again, the clause for /bin is related to local search and other scripting.

    Extensions

    Addendum (2022-10-07)

    Don't take the code from here; rather, see https://codeberg.org/AnselmF/pelican-ext

    In addition to my local search engine discussed elsewhere, I have also written two pelican plugins. I have not yet tried to get them into pelican's plugin collection because… well, because of the usual mixture of doubts. Words of encouragement will certainly help to overcome them.

    For one, again related to searching, it's articlemtime.py. This is just a few lines making sure the time stamps on the formatted articles match those of their input files. That is very desirable to limit re-indexing to just the changed articles. It might also have advantages for, for instance, external search engines or harvesters working with the HTTP if-modified-since header; but then these won't see changes in the non-article material on the respective pages (e.g., the tag cloud). Whether or not that is an advantage I can't tell.
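
    The core of such a time stamp transfer can be sketched with the standard library alone. This is not the plugin's code (for that, see the repository linked above); the function name copy_mtime is an assumption for illustration, and hooking it into Pelican is left out entirely.

```python
import os


def copy_mtime(source_path, output_path):
    """Give output_path the modification time of source_path."""
    src_stat = os.stat(source_path)
    out_stat = os.stat(output_path)
    # Keep the output's access time, replace only its modification time.
    os.utime(output_path, ns=(out_stat.st_atime_ns, src_stat.st_mtime_ns))
```

    Called for every (input file, formatted article) pair after a build, this leaves unchanged articles with unchanged time stamps, which is what an incremental indexer needs.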

    Links to blog posts

    The citedby plugin in action: These are the articles that cite this post right now.

    The other custom extension I wrote when working on something like the third post in total, planning to revisit it later since it has obvious shortcomings. However, it has been good enough so far, and rather than doing it properly and then writing a post of its own, I'm now mentioning it here. It's citedby.py, and it adds links to later articles citing an article. I think this was known as a pingback in the Great Days of Blogs, though this is just within the site; whatever the name, I consider this kind of thing eminently useful when reading an old post, as figuring out how whatever was discussed unfolded later is half of your average story.

    The way I'm currently doing it is admittedly not ideal. Essentially, I'm keeping a little sqlite database with the cited-citing pairs. This is populated when writing the articles (and pulls the information from the rendered HTML, which perhaps is a bit insane, too). This means, however, that a newly-made link will only …

  • Replacing root-tail when there is a compositor

    Since there hasn't been real snow around here this year until right this morning, I've been running xsnow off and on recently[1]. And that made me feel the lack of a compositor on my everyday desktop. Certainly, drop shadows and fading windows aren't all that necessary, but I've been using a compositor on the big screen at work for about a decade now, and there are times when the extra visual cues are nice. More importantly, the indispensable xcowsay only has pseudo-transparency when there's no compositor ever since it moved to gtk-3 (i.e., in Debian bullseye).

    Well: enough is enough. So, I'm now running picom in my normal desktop sessions (which are managed by sawfish).

    Another near-indispensable part of my desktop is that the syslog is shown in a part of the root window (a.k.a. desktop background), somewhat like this:

    Windows, and a green syslog in the background

    This was enabled by the nice program root-tail for ages, but alas, it does not play well with compositors. It claims its --windowed flag does provide a workaround, but at least for me that failed in rather crazy ways (e.g., ghosts of windows were left behind). I figured that might be hard to fix and thought about an alternative. Given compositors are great for making things transparent: well, perhaps I can replace root-tail with a heavily customised terminal?

    The answer: essentially, yes.

    My terminal program is rxvt-unicode (urxvt). In the presence of a compositor, you can configure it for a transparent background by telling it to not use its pseudo-transparency (+tr), telling it to use an X visual with an alpha channel (-depth 32) and then using a background colour with the desired opacity prefixed in square brackets. For a completely transparent terminal, that is:

    urxvt -depth 32 +tr -bg "[0]#000000"
    

    This still has the scrollbar sticking out, which for my tail -f-like application I don't want; a +sb turns it off. Also, I'm having black characters in my terminals by default, which really doesn't work with a transparent background. Making them green looks techy, and it even becomes readable when I'm making the background 33% opaque black:

    urxvt -depth 32 +tr +sb -bg "[33]#000000" -fg green
    

    To replace root-tail, I have to execute my tail -f, put the window into the corner and choose a somewhat funky font. To simply let me reference the whole package from startup files, I'm putting all that into a shell script, and to avoid having the shell linger around, all that this script does is call an exec (yes, for interactive use this probably would be an alias):

    #!/bin/sh
    exec /usr/bin/urxvt -title "syslog-on-root" +sb \
            +tr -depth 32 -bg "[33]#000000" -g 83x25-0-0 \
            -fg green -fn "xft:monofur-11:weight=black" \
            -e tail -f /var/log/syslog
    

    This already looks pretty much as it should, except that it's a normal window with frames and all, and worse, when alt-tabbing through the windows, it will come up, and it will also pollute my window list.

    All that needs to be fixed by the window manager, which is why I gave the window a (hopefully unique) title and then configured sawfish (sawfish-config, “Window Rules”) to make windows with that name depth 16, fixed-position, fixed-size, sticky, never-focus, cycle-skip, window-list-skip, task-list-skip, ignore-stacking-requests. I think one could achieve about the same with a judicious use of wmctrl – if you rig that up, be sure to let me know, as I grant you it would be nice to make that part a bit more independent of the window manager.
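
    As a starting point for the wmctrl route, here is an untested sketch that only covers the part of the rules that maps onto standard window states. It assumes a window manager honouring these states, and it relies on wmctrl's -b option accepting at most two properties per call. The function names are made up for this example, and never-focus, fixed position and frame removal are not covered at all.

```python
import subprocess

TITLE = "syslog-on-root"  # must match the -title passed to urxvt


def wmctrl_commands(title=TITLE):
    """Return wmctrl invocations approximating some of the sawfish rules."""
    # wmctrl -b takes add/remove/toggle plus one or two properties per call.
    property_pairs = [
        ("sticky", "below"),
        ("skip_taskbar", "skip_pager"),
    ]
    return [
        ["wmctrl", "-r", title, "-b", "add," + ",".join(pair)]
        for pair in property_pairs
    ]


def apply_rules(title=TITLE):
    """Run the commands; needs wmctrl installed and the window already mapped."""
    for command in wmctrl_commands(title):
        subprocess.run(command, check=True)
```

    Since the window has to exist before wmctrl can find it by title, apply_rules would have to run a moment after the terminal has started.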

    There's one thing where this falls short of root-tail: Clicks into this are not clicks into the root window. That hurts me because I have root menus, and it might hurt other people because they have desktop icons. On the other hand, I can now mouse-select from the syslog, which is kind of nice, too. Let's see.

    [1] Well, really: Mainly because I'm silly.
  • Getting Online on the Train by Hand

    Screenshot

    With sufficient patience and competence, this is the page you eventually get from the captive portal in Go-Ahead's trains.

    Among the annoying consequences of the madness of “intellectual property” are captive portals, that is, web pages you are first redirected to in public WLANs. Only those who let family-size packs of Javascript run and finally lie by checkbox that they have read and accepted the terms of use are allowed onto the net. For opening this security hole alone (“go into some unknown network and let your browser execute all the code that comes out of it, and then pull megabytes of images on top – perhaps there is a buffer overflow in one of the image decoders, too”), the intellectual property mafia deserves tarring and feathering.

    And, well, surprisingly often the junk is simply broken. A particular insult to the engineering department of my heart is when all the exciting hi-tech that lets you get online from moving trains works just fine, and still not a single bit makes it through the line because a “web programmer” botched the stupid captive portal.

    Broken is what the WLAN looked like to me today in a Go-Ahead train. I find the story of how I fought my way onto the net regardless instructive with respect to manually fiddling with IP networks, and so I thought I could use the next train (run by Deutsche Bahn and hence not even equipped with broken internet) to write up what I did. The tooling I am using is somewhat, uh, oldschool. For instance, rather than ifconfig and route I should probably use a program with the lovely name ip these days. But alas, I still find its command line pretty ghastly, and as long as the good old programs from the dim and distant past are still lying around on essentially every Linux, I just cannot bring myself to migrate.

    It started with the simple ifupdown configuration for the train network:

    iface roam inet dhcp
      wireless-essid freeWIFIahead!
    

    in /etc/network/interfaces.d/roam. With that, I can run sudo ifup wlan0=roam, and the thing should connect (if, unlike me, you have not turned off the interface renaming of the systemd environment, there would be one of those complicated “predictable” alphabet soups like wp4e1 or so in front of the =).

    Only: the network did not come up. A look at /var/log/syslog (as I said: somewhat old-fashioned tooling; modern stuff would need a wild journalctl command line here) yields:

    Jan  2 1████████ victor kernel: wlan0: associate with be:30:7e:07:8e:82 (try 1/3)
    Jan  2 1████████ victor kernel: wlan0: RX AssocResp from be:30:7e:07:8e:82 (capab=0x401 status=0 aid=4)
    Jan  2 1████████ victor kernel: wlan0: associated
    Jan  2 1████████ victor kernel: IPv6: ADDRCONF(NETDEV_CHANGE): wlan0: link becomes ready
    Jan  2 1████████ victor dhclient[18221]: Listening on LPF/wlan0/█████████████████
    Jan  2 1████████ victor dhclient[18221]: Sending on   LPF/wlan0/█████████████████
    Jan  2 1████████ victor dhclient[18221]: Sending on   Socket/fallback
    Jan  2 1████████ victor dhclient[18221]: DHCPDISCOVER on wlan0 to 255.255.255.255 port 67 interval 6
    Jan  2 1████████ victor dhclient[18221]: DHCPDISCOVER on wlan0 to 255.255.255.255 port 67 interval 14
    Jan  2 1████████ victor dhclient[18221]: DHCPDISCOVER on wlan0 to 255.255.255.255 port 67 interval 1
    Jan  2 1████████ victor dhclient[18221]: No DHCPOFFERS received.
    

    This means: The local “router” (access point, AP) let me into the network (“associated”; that is more or less plugging in the network cable), but then the DHCP server that should have assigned me an IP address (with which I could then get out via the local network) did not do just that: “No DHCPOFFERS received”.

    When that is the case – the machine is part of an ethernet segment and hence sees network traffic, but it gets no IP address and so cannot talk to anyone over TCP/IP – it is worth having a look at the packets travelling in the network segment. You can do that with fat graphical monsters like wireshark, but for a quick look tcpdump will do any time. However, since we do not have an internet connection yet, we certainly cannot resolve IP addresses (“192.168.1.15”) to names (“blog.tfiu.de”). In fact, trying to would make tcpdump grind to a halt. That is why I ordered numerical address output using -n:

    tcpdump -n
    

    In the output, I see a whole lot of ARP requests, that is, attempts to find out the local addresses of ethernet cards for IP addresses the router believes ought to be in the local ethernet segment, for instance:

    12:█████████383626 ARP, Request who-has 10.1.224.139 tell 10.1.0.1, length 28
    

    From this line alone I can guess that the gateway, i.e., the router through which I could get to the internet, will probably be the machine 10.1.0.1. Addresses from the 10 block are slightly magic in that they are not (publicly) routed and can hence (in contrast to normal IP addresses) occur any number of times on the internet. That makes them popular for relatively secluded subnets like the one here in the train (from where you get to the proper net through Network Address Translation, NAT) – and machines with a .1 at the end conventionally tend to be routers.
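
    The guess made above by eye can be sketched in a few lines: whoever asks most of the “who-has” questions is probably the router. The regular expression only covers the line format shown above, and the function name guess_gateway is an assumption for this example.

```python
import re
from collections import Counter

# Matches lines like
#   ... ARP, Request who-has 10.1.224.139 tell 10.1.0.1, length 28
ARP_REQUEST = re.compile(
    r"ARP, Request who-has (?P<target>[\d.]+) tell (?P<asker>[\d.]+)")


def guess_gateway(tcpdump_lines):
    """Return the address sending the most ARP requests, or None."""
    askers = Counter()
    for line in tcpdump_lines:
        match = ARP_REQUEST.search(line)
        if match:
            askers[match.group("asker")] += 1
    if not askers:
        return None
    return askers.most_common(1)[0][0]
```

    Feeding it the lines of a tcpdump -n run gives a candidate gateway; it remains a guess, of course, just like the manual reading.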

    If I want to talk to the gateway, I still need an IP address of my own. Without a DHCP server, there is little I can do but take one myself. That normally is an unfriendly act, because whoever re-uses an address someone else in the subnet already has effectively breaks the other machine's connection. So: if you do something like the following, first watch the tcpdump for a while and make sure nobody else is around with the address you chose. If you are forced to pull stunts like this more often, have a look at arping – it lets you check in a controlled fashion whether an address is free.

    In my case, the network was quiet – presumably hardly anyone else got a connection either. So I felt sufficiently safe to try something the router would probably accept as “in its own network”. I tentatively stuck the address I guessed in this way onto my network interface:

    sudo ifconfig wlan0 inet 10.1.0.105 up
    

    A good first guess for such purposes is an address in which only the last byte differs from the router address (in ancient jargon: “in the same class C network”). In this case it would be conceivable to change more, because in ancient jargon 10.0.0.0 is a class A network (“the netmask is 255.0.0.0”), which, with the command above (which does not explicitly give a different netmask), makes the local machine send packets addressed to any 10.something address to wlan0. However, hardly anyone routes by these rules any more, and so it is not unlikely that the router will want to route addresses that are too far away, or at least will not simply send them back into the local network segment. That, roughly, is the logic that led me to the IP address above.
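
    The address reasoning can be replayed with Python's ipaddress module. The /24 network below is the conservative guess described here (only the last byte differs from the router's address); the train network's actual netmask is unknown.

```python
import ipaddress

router = ipaddress.ip_address("10.1.0.1")
mine = ipaddress.ip_address("10.1.0.105")

# Addresses in 10.0.0.0/8 are private and not routed publicly.
is_private = router.is_private

# The classful default for 10.x.y.z: a class A network, netmask 255.0.0.0.
class_a = ipaddress.ip_network("10.0.0.0/255.0.0.0")

# The conservative guess: only the last byte differs from the router's.
same_24 = ipaddress.ip_network("10.1.0.0/24")

print(is_private, mine in class_a, mine in same_24, router in same_24)
# prints: True True True True
```

    An address like 10.1.224.139 from the ARP request is still inside the class A network but outside the /24, which is exactly the kind of “too far away” address the paragraph warns about.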

    If my guesses were right, after the ifconfig I should have been able to speak IP with the suspected router at 10.1.0.1. The classic connectivity test is ping, and indeed:

    ping 10.1.0.1
    

    dutifully got back packet after packet. Wow! IP, at least!

    The next thing a DHCP server normally configures is the “default route”, that is, what the machine should do with packets it does not know anything better for. Machines on the edge of the network (and we unwashed masses are essentially always more or less on the edge of the network) as a rule send all packets not destined for the local network to one (and only one) router, their gateway. With my old tools, you set that gateway like this:

    sudo route add default gw 10.1.0.1
    

    With that I could be on the net. A quick ssh to one of my machines out there gets nowhere, though: my machine cannot resolve names. Oh right, that is another thing a DHCP server normally does: give the local machine addresses at which it can resolve names (the DNS servers). Through (by now potentially very complicated) detours, these addresses in a sense end up in the file /etc/resolv.conf; at any rate, that is where the C library expects them.

    Now, the DHCP server happens to be broken, and you cannot guess the addresses of DNS servers that work from within a particular network. Sometimes – typically in private networks – the router itself will do. On the other hand, Google runs a DNS server at 8.8.8.8, and much as I dislike recommending Google services: I do use 8.8.8.8 in emergencies like this one. Except that I am still in the captive portal, and the router might block my attempts to talk to the Google DNS. Many captive portals do not do that (for relatively good reasons). And indeed:

    ping 8.8.8.8
    

    comes back fine. On the other hand, the captive portal can of course fake anything for me. Is there really a DNS server?

    The tool of choice for playing with DNS servers is dig [1]. In the simplest case, dig gets a name as its parameter; that will lead nowhere here, as my machine does not have DNS yet. One step up in complexity, you additionally pass a name server address after an @, like this:

    dig blog.tfiu.de @8.8.8.8
    

    And that comes back! With the right information, not some rubbish the captive portal makes up. With that, I can change my /etc/resolv.conf for my temporary network connection to:

    nameserver 8.8.8.8 …
  • El-Cheapo Internationalisation in Pelican and Jinja

    This blog is mainly in German, but in particular computer-related posts I'm writing in (some version of) English, as I expect they might be useful and/or interesting to quite a few people who don't speak German. On the English-language pages, I was always a bit unhappy that the footers (and to a certain degree, menu items) were in German.

    Now, the standard way to “internationalize” programs is gettext, and sure enough, Pelican's template engine Jinja supports i18n with gettext. But my actual use case is mainly to replace large blocks (like the full footer, which has little markup but quite a lot of text), and having gettext-style message catalogues for those seemed unattractive to me.

    Instead, I thought I could somehow work through jinja computed includes, perhaps with something like:

    {% import 'messages-'+article.lang+'.html' as messages %}
    

    For each language I want to support, I'd then have one messages file (messages-en.html, messages-de.html, …), a bit like the message catalogues in gettext, but containing jinja markup.

    The first question to answer was: What jinja markup? After a few experiments, it seems to me that using jinja blocks – which initially had seemed idiomatic to me – to write these messages files is at least tricky. After a while I rather settled for macros, such that the message file could contain something like:

    {% macro basefoot() -%}
     <footer id="main-footer" class="clearfix">
       <ul class="nobull">
       <li><a href="/pages/wer.html">[Wer?/Kontakt]</a>
         | <a href="/archives.html">[Archiv]</a></li>
       ...
    {%- endmacro %}
    

    and the application in the template would then be:

    {% block footer %}
      {{ messages.basefoot() }}
    {% endblock footer %}
    

    This works nicely for internationalising large tree fragments and is not a lot worse than gettext for single strings (though I admit message catalogues are a bit nicer to work with when translating).

    The one big problem was how to select the messages. For all the infrastructure pages (archive, tags, categories), I'm happy to use the default language (who looks at them, after all?). Hence,

    {% import 'messages-'+DEFAULT_LANG+'.html' as messages %}
    

    in the base.html template sounded like a good idea and went well enough. Then I thought I could use:

    {% import 'messages-'+article.lang+'.html' as messages %}
    

    in article.html and something similar in page.html. But alas: That won't work, because, as in python, jinja imports only happen once per run and are simply namespace operations when repeated.

    I then tried to put the import statement into a named block, hoping that perhaps block overriding would suppress the default import. I suspect it wouldn't have because block content, I gather, is executed unconditionally, but never mind: it won't work anyway because (at least that's what I gather) blocks have python namespaces of their own, and hence imports in a block are not visible outside of it. Hence, once I put the import statement into a jinja block, my messages object is gone from where I need it.

    So, I ended up with the following hack in base.html:

    {# Yeah, hard-coding the various cases here *is* lame, but
       you can't override an import in a child template as far
       as I can see, so this seems the least ugly option #}
    {% if article %}
    {% import 'messages-'+article.lang+'.html' as messages %}
    {% elif page %}
    {% import 'messages-'+page.lang+'.html' as messages %}
    {% else %}
    {% import 'messages-'+DEFAULT_LANG+'.html' as messages %}
    {% endif %}
    

    – this is ugly, breaks the encapsulation of the article and page templates, and it generally sucks, but for now it's good enough for me. I suppose the clean way to do this would be through a pelican extension providing a variable main_language (perhaps), computed in much the same way as this.
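
    The selection logic such an extension would have to provide is small. Here is a sketch of just that logic as a plain function; how to hook it into Pelican is deliberately left open, and modelling the template context as a dictionary is an assumption of this example.

```python
def main_language(context, default_lang):
    """Pick the language of whatever the current template is rendering."""
    # Articles and pages carry a lang attribute; index pages have neither.
    for key in ("article", "page"):
        content = context.get(key)
        if content is not None:
            return getattr(content, "lang", default_lang)
    return default_lang
```

    With a main_language variable computed like this, base.html would get by with a single import of 'messages-'+main_language+'.html' and would no longer need to know about articles and pages.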

    Let's see if this hack comes back to bite me; for now, I like the way this works out and that most of the stuff on the article pages is now English on English pages and German on German pages.

    Ceterum censeo: the more template languages I use for producing XML (and yes, I'm aware jinja has a much wider scope), the more I'm convinced the low adoption of stan is another instance of IT discarding a clearly superior design. What a pity it is that, for all I can see, all the popular templating engines work against the existing XML markup rather than, as in stan, with it.

  • TV Live Streams with Python and mpv

    Addendum (2022-10-11)

    I have now put the program on codeberg.

    The bad habit of moving everything into the browser is to be condemned from a freedom perspective alone – “the platform” dictates the user interface, can switch off the software at any time, and (potentially) sees even the tiniest click of the user (cf. WWWorst App Store). With media libraries and live streams, there is the additional problem that video-playing browsers, at least in the past, liked to use two or three times as much power as proper video software, to say nothing of avoidable stuttering due to poor use of hardware and the electronic waste resulting from it.

    So, there are many reasons to want to escape the web prisons in the video department in particular. For the usual video platforms, there are masterpieces of developer patience like youtube-dl or streamlink for that.

    For the live streams of the public broadcasters, however, I have not found anything packaged, at least. Many years ago, a friend had given me a few screen-scraping lines of Python that did the job. These days, however, web sites need to be completely rewritten (“jquery is totally old, and angular.js has seen better days, too”) and relaunched every few months, and the script broke with a relaunch around 2015. So when I wanted to watch the Tagesschau on Friday without having DVB hardware, I looked around for a new version of the script.

    The result was an ancient page with mpv command lines and a pointer to a directory of live streams maintained by the MediathekView people. While there are old-looking timestamps at the top of that, the log shows that the thing is indeed being maintained.

    From the latter I have knitted livetv.py (yes, that is a download link), another one of my one-file programs. Install mpv (and Python, sure), do a chmod +x livetv.py, and then say ./livetv.py ARD Livestream – done. It is more convenient, of course, to just put the script somewhere into your path.

    The argument the program wants can be any string that uniquely identifies a station, i.e., occurs in only one station name. The program prints the available stations when called without arguments:

    $ livetv.py
    3Sat Livestream
    ...
    PHOENIX Livestream
    

    With the current list you can, for instance, say livetv.py Hamburg, because “Hamburg” (even after normalisation to lower case) only occurs in one station name, whereas “SWR” results in a query back:

    $ livetv.py SWR
    SWR BW Livestream? SWR RP Livestream?
    

    “SWR BW” (with or without quotes on the command line) then is unique, upon which livetv hands over to mpv.
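
    The matching rule just described (a case-insensitive substring that must hit exactly one station name) can be sketched as follows. This is not the code from livetv.py; the function names and the choice of ValueError are assumptions made for this example.

```python
def match_stations(query, station_names):
    """Return all station names containing query, ignoring case."""
    needle = query.lower()
    return [name for name in station_names if needle in name.lower()]


def select_station(query, station_names):
    """Return the unique match; raise ValueError naming the candidates otherwise."""
    matches = match_stations(query, station_names)
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise ValueError("No station matches %r" % query)
    # Ambiguous: list the candidates as questions, as in the output above.
    raise ValueError(" ".join(name + "?" for name in matches))
```

    On the command line, the query would be the arguments joined with blanks, which is why quoting “SWR BW” makes no difference.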

    I expect the broadcasters will keep merrily changing the URLs of their streams. That is why the program can fetch new URLs from the MediathekView people, by calling:

    $ livetv.py update
    

    If all goes well, this rewrites the program file, and hence only works if you have write permission on the directory livetv.py is in – which had better not be the case if you have moved it to, say, /usr/local/bin.

    Speaking of security considerations: The update part currently trusts the MediathekView repo a little – while I do defuse the most obvious problems arising from copying downloaded material into executable code, I do not promise to withstand more sophisticated attacks. Apart from the update part, I consider the program uncritical security-wise. It also does not talk to the network itself but leaves that to mpv.

    By default, livetv.py tells mpv to pick a “reasonable” stream, which at the moment translates to “2 Mbit/s or less”. If you have a different idea of “reasonable”, you can use the --max-bitrate option, which is simply passed on to mpv's --hls-bitrate. With it, you can say

    $ livetv.py --max-bitrate min arte.de
    

    for something that, for the stations I have checked, still works on very old devices,

    $ livetv.py --max-bitrate max arte.fr
    

    for HD madness, or

    $ livetv.py --max-bitrate 4000000 dw live
    

    for a stream that uses no more than 4 Mbit/s.

    Technical Details

    The biggest fiddle was getting the channel list parsed, because for reasons beyond my imagination (MediathekView people: I'd be really curious why you did it this way), the stations come in a JSON object (rather than a list), and every station has the same key:

    "X" : [ "3Sat", "Livestream", ...
    "X" : [ "ARD", "Livestream", ...
    

    – a simple json.loads hence returns a dictionary containing just one channel.

    Even though I have never seen anything like this before, it apparently is not all that unusual, for the json parser from the Python standard library is prepared for it. When constructing a JSONDecoder object, you can pass a function in object_pairs_hook that gets to decide what should happen with such multiply occupied keys. The parser passes it a sequence of key-value pairs.
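
    The difference is easy to see on a made-up two-entry sample in the shape shown above (the real channel definitions have many more elements):

```python
import json

SAMPLE = '{"X": ["3Sat", "Livestream"], "X": ["ARD", "Livestream"]}'

# Plain parsing builds a dict, so the later value overwrites the earlier one.
as_dict = json.loads(SAMPLE)

# With object_pairs_hook=list, every key-value pair survives as a tuple.
as_pairs = json.loads(SAMPLE, object_pairs_hook=list)

print(as_dict)
print(as_pairs)
```

    Here, as_dict only knows about ARD, while as_pairs still has both entries.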

    For my particular application, all I want is to pull out a mapping from station titles (at index 2 of the channel definition) to stream URLs (at index 8) and throw away the rest of the information. Hence, code like this is enough for me:

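    import json

    def load_stations():
      """Map station titles to stream URLs from the embedded channel list."""
      channels = {}
      def collect(args):
        # Called for every JSON object with its (key, value) pairs,
        # so all the repeated "X" keys are seen here.
        for name, val in args:
          if name=="X":
            # val[2] is the station title, val[8] the stream URL
            channels[val[2]] = val[8]
    
      # LIST_CACHE is the JSON text embedded in the program
      dec = json.JSONDecoder(object_pairs_hook=collect)
      dec.decode(LIST_CACHE)
    
      return channels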
    

    – the channels dictionary that collect fills bit by bit is, by Python's scoping rules, the one load_stations defines. The collect function thus is a closure, a function that wraps up parts of the environment it was defined in and takes them along. This kind of thing very often makes life easier for authors of code – but perhaps not for its later readers. At any rate, it takes some thought to see that the collect function is called as a side effect of dec.decode(...) and that this is what fills channels.

    The other interesting aspect of the code is that I did not want to store the list of live streams separately somewhere. After all, the whole thing is supposed to be a one-file program that runs simply and without installation wherever there is Python and mpv. A look at the commit log of the channel list reveals that it has changed more than a dozen times in the last year alone (many thanks to the maintainers at this point!). So, there needs to be a way to keep it current unless I want to fetch the list from the net on every call. That, however, I want on no account, less to spare github than because otherwise github could see who uses livetv.py and when.

    I could fetch the list on the first program start and store it somewhere in the home (or even under /var/tmp). But then at least the first call leaves a data point at github, and rather surprisingly so for new users. I can prevent that by simply storing the list in the program itself, that is, by writing self-modifying code.

    That is not actually hard in interpreted languages, as with them source code and executed program are identical. Another of the great ideas in Unix is that (the equivalent of) sys.argv[0] contains the path to the file being executed. And so I figured I would just grab my own program code and replace the assignment of LIST_CACHE (the json literal from github) by sledgehammer, that is, by regular expression. In code:

    self_path = sys.argv[0]
    with open(self_path, "rb") as f:
      src = f.read()
    
    src = re.sub(b'(?s)LIST_CACHE = """.*?"""',
      b'LIST_CACHE = """%s"""'%(in_bytes.replace(b'"', b'\\"')),
      src)
    
    with open(self_path, "wb") as f:
      f.write(src)
    

    That the writing is a separate, almost atomic step is a precaution: if something goes wrong during the replacement and the program raises an exception, the open(... "wb") has not run yet. It truncates the program file, after all, and as long as it has not done that, you get a second chance. My reasoning for doing all this on byte strings is similar, by the way: encoding can always run into problems that may end in partially written files. If I avoid re-encoding, at least that kind of error cannot occur.
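
    For comparison, a stricter variant of the same precaution is to write the new content to a temporary file and rename it over the original. This is a different technique from the one in the snippet above, and the function name is made up for this sketch; it assumes the temporary file can be created in the same directory as the program.

```python
import os
import shutil
import tempfile


def replace_file_atomically(path, new_content):
    """Replace the file at path with new_content (bytes) via a rename."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live on the same file system for the
    # final rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(new_content)
        shutil.copymode(path, tmp_path)  # keep e.g. the executable bit
        os.replace(tmp_path, path)
    except BaseException:
        # Something failed before the rename: drop the temporary file,
        # the original is still untouched.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

    With this, the program file is either the complete old version or the complete new one, even if the process dies while writing.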

    Be that as it may: this code does not work. And it fails in a way that is rather typical within the family of quines and other self-application problems: the re.sub also catches its own first two arguments, because both match the pattern LIST_CACHE = """.*?""". Hence, livetv.py update would replace these two with the json literal containing the station definitions as well. The program changed in this way has two syntax errors, because the json of course does not fit into the string literals, and even if it did, no further updates would work, since the search and replacement patterns would have been replaced away.

    A solution in this case is downright cheap: in Python, you can also write a space as '\x20' (the ASCII character number 0x20 or 32), and with that the regular expression no longer matches itself:

    re.sub(b'(?s)LIST_CACHE\x20= """.*?"""',
      b'LIST_CACHE\x20= """%s"""'...
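
    The effect can be checked without touching any real program file. The following sketch counts how often each pattern matches in a stand-in for the program's source; the two source snippets and the name count_matches are made up for this illustration.

```python
import re

NAIVE_PATTERN = rb'(?s)LIST_CACHE = """.*?"""'
SAFE_PATTERN = rb'(?s)LIST_CACHE\x20= """.*?"""'

# Stand-ins for the program text: one real assignment plus the re.sub call.
NAIVE_SOURCE = rb'''LIST_CACHE = """old list"""
src = re.sub(b'(?s)LIST_CACHE = """.*?"""',
  b'LIST_CACHE = """%s"""' % payload, src)
'''

SAFE_SOURCE = rb'''LIST_CACHE = """old list"""
src = re.sub(b'(?s)LIST_CACHE\x20= """.*?"""',
  b'LIST_CACHE\x20= """%s"""' % payload, src)
'''


def count_matches(pattern, source):
    """Return how many non-overlapping matches pattern has in source."""
    return len(re.findall(pattern, source))
```

    The naive pattern finds three matches in the naive source (the assignment and both arguments of re.sub), whereas the \x20 variant only finds the assignment in its own source.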
    

    Security questions

    A program that builds data from the net into itself really ought to proceed a good deal more carefully than this one does. Imagine someone gets something like:

    { "Filmliste": [....,
      "X": ["...
      "Ignore": ['"""; os.system("rm -r ~"); """']
    }
    

    committed into the MediathekView repo; that would still work fine for MediathekView, since the object with the key Ignore would almost certainly simply be ignored.

    Whoever then runs livetv.py update, however, gets the whole lot packed into Python source code, and the content of the Ignore key is read by the Python parser. It sees the long string being closed by the three quotation marks. After that comes a normal Python statement, which here deletes the user's home directory. Python will faithfully execute it. Boom.

    Fortunately, it does not work like that in reality, because in the real code I escape quotation marks (the .replace(b'"', b'\\"')). With that …
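
    The difference the escaping makes can be inspected safely by only parsing the generated source rather than running it. In this sketch, the payload is a harmless stand-in, and the names embed and statement_count are invented for the illustration.

```python
import ast


def embed(payload, escape):
    """Return the Python source line that would hold payload (bytes)."""
    if escape:
        # same escaping as in the article: every " becomes \"
        payload = payload.replace(b'"', b'\\"')
    return (b'LIST_CACHE = """' + payload + b'"""\n').decode("utf-8")


def statement_count(source):
    """Count top-level statements; ast.parse never executes the source."""
    return len(ast.parse(source).body)


# hypothetical hostile payload with a harmless stand-in statement
HOSTILE = b'x"""; print("injected"); """y'
```

    Unescaped, the embedded payload parses into three statements, one of them the injected call. Escaped, it parses into the single assignment, and the string value is the original payload again.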

  • Fixing "No sandbox user" the Right Way

    I'm setting up an ancient machine – a Pentium M box with a mere 256 MB of RAM – with current Debian bullseye, and I'm impressed that that still works: this machine is almost 20 years old. Hats off to the Debian folks.

    But that's not really my story. Instead, this is about fixing what's behind the message:

    No sandbox user '_apt' on the system, can not drop privileges
    

    from apt. As you probably have just done, my first reaction was to feed that message to a search engine.

    Quite a few pages were returned, and all the ones I looked at suggested simply creating the user using one of the many ways a Debian box has for that. That is not totally unreasonable, but it does not really address the underlying cause, and hence I thought I should do better.

    The immediate cause is that, for whatever deeper reason, a maintainer script has not run properly. Maintainer scripts are shell scripts that Debian packages run after installation or before removal, and they are usually the place where packages create users and do similar housekeeping. Just creating the user may or may not be enough, depending on what else the maintainer script would have done.

    Hence, the better way to fix things is to re-run the maintainer script, as that would either run the full routine or at least give an error message that lets you figure out the deeper cause of the problem. Dpkg runs the maintainer script(s) automatically when you re-install the package in question.

    But what is that “package in question” that should have created the user? You could guess, and in this particular case your guess would quite likely be right, but a more generally applicable technique is to simply see what script should have created the user. That's not hard to do once you know that the maintainer scripts are kept (next to other package metadata) in /var/lib/dpkg/info/; so, with GNU grep's -r (recursive) option, you can run:

    grep -lr "_apt" /var/lib/dpkg/info/
    

    which gives the names of all files below that directory that contain _apt. On my box, that is:

    /var/lib/dpkg/info/python3-apt.md5sums
    /var/lib/dpkg/info/libperl5.32:i386.symbols
    /var/lib/dpkg/info/apt.postinst
    /var/lib/dpkg/info/python3-apt.list
    

    Ah-ha! The string is mentioned in the post-installation script of the apt package. Peeking inside this file, you see:

    if [ "$1" = 'configure' ]; then
            # add unprivileged user for the apt methods
            adduser --force-badname --system --home /nonexistent  \
                --no-create-home --quiet _apt || true
    fi
    

    So: this really tries to create the user when the package is being configured, but it ignores any errors that may occur in the process (the || true). That explains why the system installation went fine and I got the warnings later (rather than a hard error during the installation).
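
    The search step from above can also be narrowed to maintainer scripts only, which skips hits in file lists and checksums. A minimal sketch in Python, assuming the usual four maintainer script suffixes; the function name is made up for this example.

```python
import os

# file name suffixes dpkg uses for maintainer scripts
MAINTAINER_SUFFIXES = (".preinst", ".postinst", ".prerm", ".postrm")


def maintainer_scripts_mentioning(needle, info_dir="/var/lib/dpkg/info"):
    """Return sorted names of maintainer scripts in info_dir containing needle (bytes)."""
    hits = []
    for name in sorted(os.listdir(info_dir)):
        if not name.endswith(MAINTAINER_SUFFIXES):
            continue
        with open(os.path.join(info_dir, name), "rb") as f:
            if needle in f.read():
                hits.append(name)
    return hits
```

    On a Debian box, maintainer_scripts_mentioning(b"_apt") would then only report scripts, not .list or .md5sums files.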

    Just re-configuring the apt package would therefore be enough to either fix things or at least see an error message. But really, unless it's a huge package I tend to save on brain cycles and just run apt reinstall, which in this particular case leads to the somewhat funky command line:

    apt reinstall apt
    

    For me, this fixed the problem – and I've not bothered to fathom why the user creation failed during initial system setup. If you've seen the same problem and still have a record of the installation, perhaps you could investigate and file a bug if necessary?

  • Stemming for the Search Engine

    First off, here is a quick reference for the search syntax on this site (the search form links here):

    • Phrase searches ("this is a phrase")
    • Exclusions (-dontmatch)
    • Matches only when two words appear within 10 tokens of each other (matches NEAR appear)
    • Trailing wildcard as in file patterns (trail*)
    • Searches don't use stemming by default, but stem for German when introduced with l:de and for English when introduced with l:en
    • See also the Xapian syntax.

    If you only came here for the search syntax, that's it, and you can stop reading here.

    Otherwise, if you have read the previous post on my little search engine, you will remember I was a bit unhappy that I completely ignored the language of the posts and had wanted to support stemming so that you can find, ideally, documents containing any of "search", "searches", "searching", and "searched" when searching for any of these. Being able to do that (without completely ruining precision) is obviously language-dependent, which means the first step to make it happen is to properly declare the language of your posts.

    As discussed in the previous post, my blogsearch script only looks at elements with the CSS class indexable, and so I decided to have the language declaration there, too. In my templates, I hence now use:

    <div class="indexable" lang="{{ article.lang }}">
    

    or:

    <div class="indexable" lang="{{ page.lang }}">
    

    as appropriate.

    This is interpreted by the indexer rather straightforwardly by pulling the value out of the attribute and asking xapian for a stemmer for the named language. That works for at least most European two-letter language codes, because those happen to coincide with what's legal in HTML's lang universal attribute. It does not work for the more complex BCP 47 language tags like de-AT (where no actually existing stemmer would give results different from plain de anyway) or even sr-Latn-RS (for which, I think, no stemmer exists).
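
    One conceivable way to make tags like de-AT usable would be to cut them down to their primary subtag before asking for a stemmer. This is only a sketch of that idea and not what blogsearch does; the function name is hypothetical.

```python
def stemmer_language(lang_attr):
    """Reduce an HTML lang value such as "de-AT" to its primary subtag "de".

    Returns None if the attribute is missing or empty.
    """
    tag = (lang_attr or "").strip().lower()
    primary = tag.split("-", 1)[0]
    # The result would then be handed to the stemmer lookup, which may
    # still fail for languages without a stemmer.
    return primary or None
```

    For sr-Latn-RS this yields sr, for which the stemmer lookup may of course still fail.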

    On searching, I was worried that enabling stemming would break unstemmed searches, but xapian's indexes are clever enough that that's not a problem. But I still cannot stem queries by default, because it is hard to guess their language from just a word or two. Hence, I have defined a query syntax extension: If you prefix your query with l:whatever, blogsearch will try to construct a xapian stemmer from whatever. If that fails, you'll get an error; if it succeeds, it will stem the query in that language.

    As an aside, I considered for a moment whether it is a terribly good idea to hand through essentially unfiltered user input to a C++ API like xapian's. I eventually settled for just making it a bit harder to craft buffer overflows by saying:

    lang = parts[0][2:30]
    

    – that is, I'm only allowing through up to 28 characters of language code. Not that I expect that anything in between my code and xapian's core has an overflow problem, but this is a cheap defensive measure that would also limit the amount of code someone could smuggle in in case some vulnerability did sneak in. Since it's essentially free, I'd say that's reasonable defensive programming.
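
    Put together, the prefix handling could look roughly like the following sketch. That the query is split on whitespace is an assumption, as are the function name and the constant; only the slicing mirrors the line quoted above.

```python
MAX_LANG_LENGTH = 28  # same limit as the [2:30] slice


def split_language_prefix(query):
    """Split "l:en some words" into ("en", "some words").

    Returns (None, query) when there is no l: prefix.
    """
    parts = query.split(None, 1)
    if not parts or not parts[0].startswith("l:"):
        return None, query
    lang = parts[0][2:2 + MAX_LANG_LENGTH]
    rest = parts[1] if len(parts) > 1 else ""
    return lang, rest
```

    The language string returned here is what would be passed on to xapian, never more than 28 characters of it.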

    In closing, I do not think stemmed searches will be used a lot, and as usual with these very simple stemmers, they leave a lot to be desired from a linguistic point of view. Compare, for instance, a simple search for going with the result of l:en going to see where this is supposed to go (and compare with the result when stemming as German). And then compare with l:en went, which should return the same as l:en going in an ideal world but of course doesn't: Not with the simple snowball stemmer that xapian employs.

    I'm still happy the feature's there, and I'm sure I'll need it one of these days.

    And again, if you need a CGI that can index and query your static HTML collection with low deployment effort: you're welcome.

  • Moving Clipboard Content Between Displays and Machines with Xclip

    Since Corona started, I've had to occasionally run zoom and other questionable telecon software. I don't want that proprietary junk on my main machine, partly because I'm a raving Free software lunatic, partly because binary packages from commercial vendors outside of the Debian main repository have a way of blowing up things years after one has put them on a box. I take some pride in never having re-installed my primary machine since 1996, so there would have been lots of opportunity for binary junk to accumulate.

    Hence I took a spare box I had sitting around idly, quickly put a simple Debian on its disk and then dumped all the questionable proprietary code next to its systemd and pulseaudio, reckoning that shredding that file system once the zoom pandemic is over will give me a lot of satisfaction.

    But now the various links, room ids and whatnot come in on the proper machine. Until a few days ago, I used to move them over to the zoom machine by having a screen open there, ssh-ing in from my main box, running screen -x to attach the screen that is already running in the ssh session, and then pasting the link into that shared screen. It works, but it feels clunky.

    The other day, I finally realised there's a better way using a nifty thing called xclip. I had already used xclip for ages whenever I have two displays running on a single box and I need to copy and paste between the two displays; that happens when I'm at work. Then, I use the following key bindings (in this case for sawfish) on both ends:

    (bind-keys global-keymap "M-C-v"
            '(system "xclip -in < ~/.current-clipboard"))
    (bind-keys global-keymap "M-C-c"
            '(system "xclip -out > ~/.current-clipboard"))
    

    This lets me hit Alt-Ctrl-C on the first display and Alt-Ctrl-V on the second, and I'll then have what was in the primary selection on the first in the primary selection on the second.

    When later webkit on gtk3 started to copy links into the X11 clipboard rather than the primary selection and I wanted a quick way to get them to where I can middle-mouse them in again, I added another xclip binding to my sawfishrc:

    (bind-keys global-keymap "M-RET"
      '(system "xclip -out -selection clipboard | xclip -in"))
    

    – that's Meta-Return copying the content of the clipboard to the primary selection, and I've come to use that one quite extensively after initially piling quite a bit of abuse on the gtk3 policy of using the clipboard.

    What I noticed the other day was that xclip also lets me conveniently transport the telecon links. I've created an alias for that:

    alias zoomclip='xclip -o | ssh zoom "DISPLAY=:0 xclip -r -l 1 -i"'
    

    (zoom here is the name of the target machine). My new workflow is: select the string to transmit, run zoomclip in a terminal, hit the middle mouse button on the target machine to paste what I selected on the source machine. I'm not sure if it saves a lot of time over the old screen-based method, but it sure feels niftier, and I'd say that's reason enough for that alias.

    Note that the DISPLAY=:0 in the remote command is necessary because xclip of course is a normal X client and needs to know what display to talk to; and you want the local display on the target machine, not the display on the source machine. The -l 1, on the other hand, makes the xclip on the remote machine exit once you have pasted the content. Leave the option out if you expect to need to paste the thing multiple times. But without the -l 1, due to the way the selections are built on X11 (i.e., the system doesn't store selection content, you're always directly sending stuff between clients), xclip runs (and hence the ssh connection is being maintained) until some other client takes over the selection.

  • A Local Search Engine for Pelican-based Blogs

    As the number of posts on this blog approaches 100, I figured some sort of search functionality would be in order. And since I'm wary of “free” commercial services and Free network search does not seem to go anywhere[1], the only way to offer search that is both practical and respectful of the digital rights of my readers is to have a local search engine. True, having a search engine running somewhat defeats the purpose of a static blog, except that there's a lot less code necessary for doing a simple search than for running a CMS, and of course you still get to version-control your posts.

    I have to admit that the “less code” argument is a bit relative given that I'm using xapian as a full-text indexer here. But I've long wanted to play with it, and it seems reasonably well-written and well-maintained. I have hence written a little CGI script enabling search over static collections of HTML files, which means in particular pelican blogs. In this post, I'll tell you first a few things about how this is written and then how you'd run it yourself.

    Using Xapian: Indexing

    At its core, xapian is not much more than an inverted index: Essentially, you feed it words (“tokens”), and it will generate a database pointing from each word to the documents that contain it.

    The first thing to understand when using xapian is that it doesn't really have a model of what exactly a document is; the example indexer code, for instance, indexes a text file such that each paragraph is treated as a separate document. All xapian itself cares about is a string (“data”, but usually rather metadata) that you associate with a bunch of tokens. This pair receives a numeric id, and that's it.

    There is a higher-level thing called omega built on top of xapian that does identify files with xapian documents and can crawl and index a whole directory tree. It also knows (to some extent) how to pull tokens from a large variety of file types. I've tried it, and I wasn't happy; since pelican creates all those ancillary HTML files for tags, monthly archives, and whatnot, when indexing with omega, you get lots of really spurious matches as soon as people enter a term that's in an article title, and entering a tag or a category will yield almost all the files.

    So, I decided to write my own indexer, also with a view to later extending it to language detection (this blog has articles in German and English, and they eventually should be treated differently). The core is rather plain in Python:

    for dir, children, names in os.walk(document_dir):
      for name in fnmatch.filter(names, "*.html"):
        path = os.path.join(dir, name)
        doc = index_html(indexer, path, document_dir)
    

    That's enough for iterating over all HTML files in a pelican output directory (which document_dir should point to).

    In the code, there's a bit of additional logic in the do_index function. This code enables incremental indexing, i.e., only re-indexing a file if it has changed since the last indexing run (pelican fortunately manages the file timestamps properly).

    Addendum (2021-11-13)

    It didn't, actually; see the search engine update post for how to fix that.

    What I had to learn the hard way is that since xapian has no built-in relationship between what it considers a document and an operating system file, I need to explicitly remove the previous document matching a particular file. The function get_indexed_paths produces a suitable data structure for that from an existing database.
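
    The bookkeeping this implies can be sketched independently of xapian. The following assumes that get_indexed_paths yields a mapping from path to a pair of document id and modification time; that shape, like the function name plan_reindex, is an assumption made for this sketch, and it relies on file timestamps being trustworthy (see the addendum above).

```python
def plan_reindex(files_on_disk, indexed):
    """Decide what to remove from and what to add to the index.

    files_on_disk: {path: mtime} for the HTML files found now
    indexed: {path: (doc_id, mtime)} as recorded in the database
    Returns (doc_ids_to_delete, paths_to_index).
    """
    to_delete, to_index = [], []
    for path, mtime in sorted(files_on_disk.items()):
        known = indexed.get(path)
        if known is None:
            to_index.append(path)          # new file
        elif known[1] < mtime:
            to_delete.append(known[0])     # stale document must go first
            to_index.append(path)
    for path, (doc_id, _) in sorted(indexed.items()):
        if path not in files_on_disk:
            to_delete.append(doc_id)       # file has vanished
    return to_delete, to_index
```

    The important part is that a changed file produces both a deletion and a fresh indexing job; without the deletion, the old document would stay in the database next to the new one.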

    The indexing also defines my document model; as said above, as far as xapian is concerned, a document is just some (typically metadata) string under user control (plus the id and the tokens, obviously). Since I want structured metadata, I need to structure that string, and these days, json is the least involved thing to have structured data in a flat string. That explains the first half of the function that actually indexes one single document, the path of which comes in in f_name:

    def index_html(indexer, f_name, document_dir):
      with open(f_name, encoding="utf-8") as f:
        soup = bs4.BeautifulSoup(f, "lxml")
      doc = xapian.Document()
      meta = {
        "title": soup_to_text(soup.find("title")),
        "path": remove_prefix(f_name, document_dir),
        "mtime": os.path.getmtime(f_name),}
      doc.set_data(json.dumps(meta))
    
      content = soup.find(class_="indexable")
      if not content:
        # only add terms if this isn't some index file or similar
        return doc
      print(f"Adding/updating {meta['path']}")
    
      indexer.set_document(doc)
      indexer.index_text(soup_to_text(content))
    
      return doc
    

    – my metadata thus consists of a title, a path relative to pelican's output directory, and the last modification time of the file.

    The other tricky part in here is that I only index children of the first element with an indexable class in the document. That's the key to keeping out all the tags, archive, and category files that pelican generates. But it means you will have to touch your templates if you want to adapt this to your pelican installation (see below). All other files are entered into the database, too, in order to avoid needlessly re-scanning them, but no tokens are associated with them, and hence they will never match a useful query.

    Addendum (2021-11-13)

    When you add the indexable class to your templates, also declare the language in order to support stemming; this would look like lang="{{ page.lang }}" (substituting article for page as appropriate).

    There is a big lacuna here: the recall, i.e., the ratio between the number of matching documents actually returned for a query and the number of documents that should (in some sense) match, really suffers in both German and English if you don't do stemming, i.e., fail to strip off grammatical suffixes from words.

    Stemming is of course highly language-dependent. Fortunately, pelican's default metadata includes the language. Less fortunately, my templates don't communicate that metadata yet – but that would be quick to fix. The actual problem is that when I stem my documents, I'll also have to stem the incoming queries. Will I stem them for German or for English?

    I'll think about that problem later and for now don't stem at all; if you remember that I don't stem, you can simply append an asterisk to your search term; that's not exactly the same thing, but ought to be good enough in many cases.

    Using xapian: Searching

    Running searches using xapian is relatively straightforward: You open the database, parse the query, get the set of matches and then format the metadata you put in during indexing into links to the matches. In the code, that's in cgi_main; one could do paging here, but I figure spitting out 100 matches will be plenty, and distributing 100 matches on multiple HTML pages is silly (unless you're trying to optimise your access statistics; since I don't take those, that doesn't apply to me).
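
    The last of these steps, turning the stored metadata back into a link, needs no xapian at all. A minimal sketch, assuming the json metadata with title and path described above; the function name and the exact markup are made up here.

```python
import html
import json


def format_match(data):
    """Turn one document's json metadata string into an HTML list item."""
    meta = json.loads(data)
    # Escape both values: titles may well contain & or <.
    return '<li><a href="{}">{}</a></li>'.format(
        html.escape(meta["path"], quote=True),
        html.escape(meta["title"]))
```

    In a CGI, this would be called once per match with the data string of the matching document.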

    The part with the query parser deserves a second look, because xapian supports a fairly rich query language, where I consider the most useful features:

    • Phrase searches ("this is a phrase")
    • Exclusions (-dontmatch)
    • Matches only when two words appear within 10 tokens of each other (matches NEAR appear)
    • Trailing wildcard as in file patterns (trail*)

    That last feature needs to be explicitly enabled, and since I find it somewhat unexpected that keyword arguments are not supported here, and perhaps even that the flag constant sits on the QueryParser object, here's how enabling wildcards in xapian looks in code:

    qp = xapian.QueryParser()
    parsed = qp.parse_query(query, qp.FLAG_WILDCARD)
    

    Deploying this on your Pelican Installation

    You can re-use my search script on your site relatively easily. It's one file, and if you're running an apache or something else that can run CGIs[2], making it run first is close to trivial: Install your equivalents of the Debian python3-xapian, python3-bs4, and python3-lxml packages. Perhaps you also need to explicitly allow CGI execution on your web server. In Debian's apache, that would be a2enmod cgi, elsewhere, you may need to otherwise arrange for mod_cgi or its equivalent to be loaded.

    Then you need to dump blogsearch somewhere in the file system.

    Addendum (2022-10-07)

    Don't take it from here; rather, see https://codeberg.org/AnselmF/pelican-ext

    While Debian has a default CGI directory defined, I'd suggest putting blogsearch somewhere next to your blog; I keep everything together in /var/blog (say), have the generated output in /var/blog/generated and would then keep the script in a directory /var/blog/cgi. Assuming this and apache, you'd then have something like:

    DocumentRoot /var/blog/generated
    ScriptAlias /bin /var/blog/cgi
    

    in your configuration, presumably in a VirtualHost definition. In addition, you will have to tell the script where your pelican directory is. It expects that information in the environment variable BLOG_DIR; so, for apache, add:

    SetEnv BLOG_DIR /var/blog/generated
    

    to the VirtualHost.

    After restarting your web server, the script would be ready (with the configuration above …

  • Math with ReStructuredText and Pelican

    I recently wrote a piece on estimating my power output from CO₂ measurements (in German) and for the first time in this blog needed to write at least some not entirely trivial math. Well: I was seriously unhappy with the way formulae came out.

    Ugly math of course is very common as soon as you leave the lofty realms of LaTeX. This blog is made with ReStructuredText (RST) in pelican. Now, RST at least supports the math interpreted text role (“inline”) and directive (“block” or in this case rather “displayed”) out of the box. To my great delight, the input syntax is a subset of LaTeX's, which remains the least cumbersome way to input typeset math into a computer.

    But as I said, once I saw how the formulae came out in the browser, my satisfaction went away: there was really bad spacing, fractions weren't there, and things were really hard to read.

    In consequence, when writing the post I'm citing above, rather than reading the docutils documentation to research whether the ugly rendering was a bug or a non-feature, I wrote a footnote:

    Sorry für die hässlichen Formeln. Vielleicht schreibe ich mal eine Erweiterung für ReStructuredText, die die ordentlich mit TeX formatiert. Oder zumindest mit MathML. Bis dahin: Danke für euer Verständnis.

    (Sorry for the ugly formulae. Perhaps one of these days I'll write an RST extension that properly formats using TeX. Or at least MathML. Until then: thanks for your understanding.)

    This is while the documentation clearly said, just two lines below the example that was all I had initially bothered to look at:

    For HTML, the math_output configuration setting (or the corresponding --math-output command line option) selects between alternative output formats with different subsets of supported elements.

    Following the link at least would have told me that MathML was already there, saving me some public embarrassment.

    Anyway, when yesterday I thought I might as well have a look at whether someone had already written any of the code I was talking about in the footnote, rather than properly reading the documentation I started operating search engines (shame on me).

    Only when those led me to various sphinx and pelican extensions and I peeked into their source code did I finally end up at the docutils documentation again. And I noticed that the default math rendering was so ugly just because I didn't bother to include the math.css stylesheet. Oh, the miracles of reading documentation!

    With this, the default math rendering suddenly turns from “ouch” to “might just do”.

    But since I now had seen that docutils supports MathML, and since I have wanted to have a look at it at various times in the past 20 years, I thought I might as well try it, too. It is fairly straightforward to turn it on; just say:

    [html writers]
    math_output: MathML
    

    in your ~/.docutils (or perhaps via a pelican plugin).

    I have to say I am rather underwhelmed by how my webkit renders it. Here's what the plain docutils stylesheet works out to in my current luakit:

    Screenshot with ok formulae.

    And here's what it looks like via MathML:

    Screenshot with less ok formulae.

    For my tastes, the spacing is quite a bit worse in the MathML case; additionally, the Wikipedia article on MathML mentions that the Internet Explorer never supported it (which perhaps wouldn't bother me too much) and that Chromium withdrew support at some point (what?). Anyway: plain docutils with the proper css is the clear winner here in my book.

    I've not evaluated mathjax, which is another option in docutils math_output and is what pelican's render_math plugin uses. Call me a luddite, but I'll file requiring people to let me execute almost arbitrary code on their box just so they see math into the big folder labelled “insanities of the modern Web”.

    So, I can't really tell whether mathjax would approach TeX's quality, but the other two options clearly lose out against real TeX, which using dvipng would render the example to:

    Screenshot with perfect formulae

    – the spacing is perfect, though of course the inline equation has a terrible break (which is not TeX's fault). It hence might still be worth hacking a pelican extension that collects all formulae, returns placeholder image links for them and then finally does a big dvipng run to create these images. But then this will mean dealing with a lot of files, which I'm not wild about.

    What I'd like to ideally use for the small PNGs we are talking about here would be inline images using the data scheme, as in:

    <img src="..."/>
    

    But since I would need to create the data string when docutils calls my extension function, in that scheme I cannot collect all the math rendering for a single run of LaTeX and dvipng. That in turn would mean either creating new TeX and dvipng processes for each piece of math, which really sounds bad, or hacking some wild pipeline involving both, which doesn't sound like a terribly viable proposition either.

    While considering this, I remembered that matplotlib renders quite a bit of TeX math strings, too, and it lets me render them without any fiddling with external executables. So, I whipped up this piece of Python:

    import base64
    import html
    import io
    import matplotlib
    from matplotlib import mathtext
    
    matplotlib.rcParams["mathtext.fontset"] = "cm"
    
    def render_math(tex_fragment):
        """returns self-contained HTML for a fragment of TeX (inline) math.
        """
        res = io.BytesIO()
        mathtext.math_to_image(f"${tex_fragment}$",
          res, dpi=100, format="png")
        encoded = base64.b64encode(res.getvalue()).decode("ascii")
        # escape the TeX source so quotes or < cannot break the alt attribute
        return (f'<img src="data:image/png;base64,{encoded}"'
            f' alt="{html.escape(tex_fragment)}" class="math-png"/>')
    
    if __name__=="__main__":
        # raw string: \i, \s and \, are not valid Python escape sequences
        print(render_math(r"\int_0^\infty \sin(x)^2\,dx"))
    

    This prints the HTML with the inline formula, which with the example provided looks like this: \int_0^\infty \sin(x)^2\,dx – ok, there's a bit too much cropping, I'd have to trick in transparency, there are no displayed styles as far as I can tell, and clearly one would have to think hard about CSS rules to make plausible choices for scale and baseline – but in case my current half-satisfaction with docutils' text choices wears off: This is what I will try to use in a docutils extension.

  • Forced https Redirects Considered Harmful

    I don't remember where I first saw the admonition that “not everything that does HTTP is a browser” – but I'd like to underscore this here. One corollary to this is:

    Please do not unconditionally redirect to https!

    People may have good reasons to choose unencrypted http, and sometimes they don't get to choose, in particular in embedded systems (where https may be prohibitively large) or when you cannot upgrade the ssl libraries and sooner or later the server no longer considers any of the ciphers you know safe.

    Case in point: I have a command line program to query bahn.de (python3 version)…

    Addendum (2022-09-04)

    After many years of relative stability, the Bahn web page has significantly changed their markup, which broke this script. There is a new bahnconn now.

    …which screen-scrapes the HTML pages that Deutsche Bahn's connection service hands out. I know bahn.de has a proper API, too, and I'm sure it would be a lot faster if I used it, but alas, my experiments with it were unpromising, with what's on the web working much better; perhaps I'll try again next time they change their HTML. But that's beside the point here.

    The point is: In contrast to browsers capable of rendering bahn.de's HTML/javascript combo, this script runs on weak hardware like my Nokia N900. Unfortunately, the N900 is more or less frozen at the state of something like Debian Lenny, because its kernel has proprietary components that (or so I think) deal with actually doing phone calls, and hence I can't upgrade it beyond 2.6.29. And that means more or less (sure, I could start building a lot of that stuff from source, but eventually the libc is too old, and newer libcs require at least kernel 2.6.32) that I'm stuck with Python 2.5 and an OpenSSL of that time. Since about a year ago, these have no ciphers any more that the bahn.de server accepts. But it redirects me to https nevertheless, and hence the whole thing breaks. For no good reason at all.

    You see, encryption buys me nothing when querying train connections. The main privacy breach here is bahn.de storing the request, and there I'm far better off with my script, as that (at least if more people used it) is a lot more anonymous than my browser with all the cookies I let Deutsche Bahn put into it and all the javascript goo they feed it. I furthermore see zero risk in letting random people snoop my train routes individually and now and then. The state can, regrettably, ask Deutsche Bahn directly ever since the Ottokatalog of about 2002. There is less than zero risk of someone manipulating the bahn.de responses to get me on the wrong trains.

    Now, I admit that when lots of people do lots of queries in the presence of adversarial internet service providers and other wire goblins, this whole reasoning will work out differently, and so it's probably a good idea to nudge unsuspecting muggles towards https. Well: That's easy to do without breaking things for wizards wishing to do http.

    Doing it right

    The mechanism for that is the upgrade-insecure-requests header that essentially all muggle browsers now send (don't confuse it with the upgrade-insecure-requests CSP). This does not lock out old clients while still giving muggles some basic semblance of crypto.

    And it's not hard to do, either. In Apache, you add:

    <If "%{req:Upgrade-Insecure-Requests} == '1'">
      Header always set Vary Upgrade-Insecure-Requests
      Redirect 307 "/" "https://<your domain>/"
    </If>
    

    rather than the unconditional redirect you'd otherwise have; I suppose you can parameterise this rule so you don't even have to edit in your domain, but since I'm migrating towards nginx on my servers, I'm too lazy to figure out how. Oh, and you may need to enable mod_headers; on Debian, that would be a2enmod headers.

    In nginx, you can have something like:

    set $do_http_upgrade "$https$http_upgrade_insecure_requests";
    location / {
    
      # (whatever you otherwise configure)
    
      if ($do_http_upgrade = "1") {
         add_header Vary Upgrade-Insecure-Requests;
         return 307 https://$host$request_uri;
      }
    }
    

    in your server block. The trick with the intermediate do_http_upgrade variable makes sure we don't redirect if we already are on https; browsers shouldn't send the header on https connections, but I've seen redirect loops without this trick (origin).
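
    To spell out the decision both configurations make, here is a minimal Python sketch. It is not from the original setup; the function name redirect_target and its parameters are made up for illustration. It redirects only when the request came in over plain http and the client announced Upgrade-Insecure-Requests: 1; everybody else is served as requested.

```python
def redirect_target(scheme, host, request_uri, headers):
    """Return (status, location, extra_headers) for a redirect, or None.

    scheme is "http" or "https"; headers maps lower-cased header names
    to their values.  All names here are illustrative only.
    """
    wants_upgrade = headers.get("upgrade-insecure-requests", "").strip() == "1"
    if scheme == "http" and wants_upgrade:
        # 307 keeps method and body; Vary tells caches that the
        # response depends on this request header.
        return (307,
            "https://%s%s" % (host, request_uri),
            {"Vary": "Upgrade-Insecure-Requests"})
    # Old or deliberately plain clients, and anything already on https,
    # are served as requested.
    return None
```

    The explicit check on the scheme plays the role of the intermediate variable in the nginx version: a client that sends the header on an https connection is not redirected again.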

    Browser considerations

    Me, I am by now taking it as a sign of quality if a server doesn't force https redirects and instead honours upgrade-insecure-requests. For instance, that way I can watch what some server speaks with the Javascript it executes on my machine without major hassle, and that's something that gives me a lot of peace of mind (but of course it's rather rare these days). In celebration of servers doing it right, I've configured my browser – luakit – to not send upgrade-insecure-requests; where I consider https a benefit rather than a liability for my privacy, I can remember switching to it myself, thank you.

    The way to do that is to drop a file no_https_upgrade_wm.lua into .config/luakit containing:

    local _M = {}
    
    luakit.add_signal("page-created",
        function(page)
            page:add_signal("send-request", function(p, _, headers)
                if headers["Upgrade-Insecure-Requests"] then
                    headers["Upgrade-Insecure-Requests"] = nil
                end
            end)
    end)
    

    (or fetch the file here). And then, in your rc.lua, write something like:

    require_web_module("no_https_upgrade_wm")
    

    ...and for bone-headed websites?

    In today's internet, it's quite likely that a given server will stink. As a matter of fact, since 1995, the part of the internet that stinks has consistently grown 20 percentage points[1] faster than the part that doesn't stink, which means that by now, essentially the entire internet stinks even though there's much more great stuff in it than there was in 1995: that's the miracle of exponential growth.

    But at least for escaping forced https redirects, there is a simple fix in that you can always run a reverse proxy to enable http on https-only services. I'm not 100% sure just how legal that is, but as long as you simply hand through traffic and it's not some page where cleartext on the wire can realistically hurt worse than the cleartext on the server side, I'd claim you're ethically in the green. So, to make the Deutsche Bahn connection finder work with python 2.5, all that was necessary was a suitable host name, an nginx, and a config file like this:

    server {
      listen 80;
      server_name bahnauskunft.tfiu.de;
    
      location / {
        proxy_pass https://reiseauskunft.bahn.de;
        proxy_set_header Host $host;
      }
    }
    
    [1]This figure is of course entirely made up<ESC>3bC only a conservative guess.
  • Election Bingo

    Bullshit bingo cards for the election

    If you want bingo cards like these: the Wahlbingo CGI will gladly make them for you. Every reload produces new cards!

    Even though several recent posts may suggest otherwise: I certainly do not want to join in the election excitement oozing from billboards and radio speakers. But I am looking forward a little to the Bundestag election next Sunday all the same, because I get to play my election bingo again. You get different cards with every reload of the bingo page, and at least Webkit browsers should print them such that two bingo cards end up on one A4 page.

    The rules are: Whoever picks up a sufficiently similar phrase in the election coverage gets to cross out a field (half of the fun, of course, is loudly settling whether a phrase was similar enough). Whoever first has four…

    Nachtrag (2022-03-07)

    Experience shows that it is more fun with three.

    …fields in a horizontal, vertical, or diagonal line has won. Periodic boundary conditions apply, so you may extend a line beyond the edge as if your own card tiled the plane.
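
    The winning condition with periodic boundary conditions can be sketched in a few lines of Python. This is not taken from the CGI's sources; the function name has_won, the square card of side length size, and the representation of the crossed-out fields as a set of (row, column) pairs are all assumptions made for illustration.

```python
def has_won(crossed, size, run=4):
    """Return True if crossed contains run fields in a line.

    crossed is a set of (row, column) pairs on a size x size card;
    run should not exceed size.  Lines may be horizontal, vertical,
    or diagonal, and they wrap around the edges as if the card
    tiled the plane.
    """
    directions = [(0, 1), (1, 0), (1, 1), (1, -1)]
    for row, col in crossed:
        for d_row, d_col in directions:
            # The modulo implements the periodic boundary conditions.
            if all(((row + i*d_row) % size, (col + i*d_col) % size) in crossed
                    for i in range(run)):
                return True
    return False
```

    Passing run=3 gives the variant with three fields in a line.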

    By the way, the whole thing was a terribly quick hack on some election night. I have just turned it, even more hastily, into a CGI for the blog. If you want to build something tidier on top of it: the sources. I gladly accept donations to the phrase corpus by mail.

  • Reading a zyTemp Carbon Dioxide Monitor using Tkinter on Linux

    Last weekend I had my first major in-person conference since the SARS-2 pandemic began: about 150 people congregated from all over Germany to quarrel and, more importantly, to settle quarrels. But it's still Corona, and thus the organisers put in place a whole bunch of disease control measures. A relatively minor one of these was monitoring the CO2 levels in the conference hall as a proxy for how much aerosol may have accumulated. The monitor devices they got were powered by USB, and since I was sitting on the stage with a computer having USB ports anyway, I was asked to run (and keep an eye on) the CO2 monitor for that area.

    A photo of the CO2 meter

    The CO2 sensor I got my hands on. While it registers as a Holtek USB-zyTemp, on the back it says “TFA Dostmann Kat.Nr. 31.5006.02“. I suppose the German word for what's going on here is “Wertschöpfungskette“ (I'm not making this up. The word, I mean. Why there are so many companies involved I really can only guess).

    When plugging in the thing, my syslog[1] intriguingly said:

    usb 1-1: new low-speed USB device number 64 using xhci_hcd
    usb 1-1: New USB device found, idVendor=04d9, idProduct=a052, bcdDevice= 2.00
    usb 1-1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
    usb 1-1: Product: USB-zyTemp
    usb 1-1: Manufacturer: Holtek
    usb 1-1: SerialNumber: 2.00
    hid-generic 0003:04D9:A052.006B: hiddev96: USB HID v1.10 Device [Holtek USB-zyTemp] on usb-0000:00:14.0-1/input0
    hid-generic 0003:04D9:A052.006C: hiddev96: USB HID v1.10 Device [Holtek USB-zyTemp] on usb-0000:00:14.0-1/input0
    

    So: The USB is not only there for power. The thing can actually talk to the computer. Using the protocol for human interface devices (HID, i.e., keyboards, mice, remote controls and such) perhaps is a bit funky for a measurement device, but, on closer reflection, fairly reasonable: just as the mouse reports changes in its position, the monitor reports changes in CO2 levels and temperatures of the air inside of it.

    Asking Duckduckgo for the USB id "04d9:a052" (be sure to make it a phrase search with the quotes or you'll be bombarded by pages on bitcoin scams) yields a blog post on decrypting the wire protocol and, even better, a github repo with a few modules of Python to read out values and do all kinds of things with them.

    However, I felt like the amount of code in that repo was a bit excessive for something that's in the league of what I call a classical 200 lines problem – meaning: a single Python script that works without any sort of installation should really do –, since all I wanted (for now) was a gadget that shows the current values plus a bit of history.

    Hence, I explanted and streamlined the core readout code and added some 100 lines of Tkinter to produce co2display.py3, showing an interface like this:

    A co2display screenshot

    This is what opening a window (the sharp drop of the curve on the left), then opening a second one (the even sharper drop following) and closing it again while staying in the room (the gentle slope on the right) looks like in co2display.py. In case it's not obvious: The current CO2 concentration was 420 ppm, and the temperature 23.8 degrees Centigrade (where I'm sure the thing doesn't measure to tenths of Kelvins; but then who cares about tenths of Kelvins?) when I took that screenshot.

    If you have devices like the zyTemp yourself, you can just download the program, install the python3-hid package (or its equivalent on non-Debian boxes) and run it; well, except that you need to make sure you can read the HID device nodes as non-root. The easiest way to do that is to (as root) create a file /etc/udev/rules.d/80-co2meter.rules containing:

    ATTR{idVendor}=="04d9", ATTR{idProduct}=="a052", SUBSYSTEM=="usb", MODE:="0666"
    

    This udev rule simply says that whenever a device with the respective ids is plugged in, any device node created will be world-readable and world-writable (and yeah, it does over-produce a bit[2]).

    After adding the rule, unplug and replug the device and then type python3 co2display.py3. Ah, yes, the startup (i.e., the display until actual data is available) probably could do with a bit of extra polish.

    First Observations

    I'm rather intrigued by the dynamics of CO2 levels measured in that way (where I've not attempted to estimate errors yet). In reasonably undisturbed nature at the end of the summer and during the day, I've seen 250 to 280 ppm, which would be consistent with mean global pre-industrial levels (which Wikipedia claims is about 280 ppm). I'm curious how this will evolve towards winter and next spring, as I'd guess Germany's temporal mean will hardly be below the global one of a bit more than 400 (again according to Wikipedia).

    In a basically empty train I've seen 350 ppm yesterday, a slightly stuffy train about 30% full was at 1015 ppm, about as much as I have in my office after something like an hour of work (anecdotally, I think half an hour of telecon makes for a comparable increase, but I can hardly believe that idle chat causes more CO2 production than heavy-duty thinking. Hm).

    On a balcony 10 m above a reasonably busy road (of order one car every 10 seconds) in a lightly built-up area I saw 330 ppm under mildly breezy conditions, dropping further to 300 as the wind picked up. Surprisingly, this didn't change as I went down to the street level. I can hardly wait for those winter days when the exhaust gases are strong in one's nose: I cannot imagine that won't be reflected in the CO2.

    The funkiest measurements I made on the way home from the meeting that got the device into my hands in the first place, where I bit the bullet and joined friends who had travelled there in a car (yikes!). While speeding down the Autobahn, depending on where I measured in the small car (a Mazda if I remember correctly) carrying four people, I found anything from 250 ppm near the ventilation flaps to 700 ppm around my head to 1000 ppm between the two rear passengers. And these values were rather stable as long as the windows were closed. Wow. Air flows in cars must be pretty tightly engineered.

    Technics

    If you look at the program code, you'll see that I'm basically polling the device:

    def _update(self):
      try:
        self._take_sample()
        ...
      finally:
        self.after(self.sample_interval, self._update)
    

    – that's how I usually do timed things in tkinter programs, where, as normal in GUI programming, there's an event loop external to your code and you cannot just say something like time.sleep() or so.

    Polling is rarely pretty, but it's particularly inappropriate in this case, as the device (or so I think at this point) really sends data as it sees fit, and it clearly would be a lot better to just sit there and wait for its input. Additionally, _take_sample, written as it is, can take quite a bit of time, and during that time the UI is unresponsive, which in this case means that resizes and redraws don't take place.

    That latter problem could easily be fixed by pushing the I/O into a thread. But then this kind of thing is what select was invented for, or, these days, wrappers for it (or rather its friends) usually subsumed under “async programming“.

    However, marrying async and the Tkinter event loop is still painful, as evinced by this 2016 bug against tkinter. It's still open. Going properly async on the CO2monitor class in the program will still be the next thing to do, presumably using threads.
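
    For the thread-based variant, a common pattern is to let a background thread do the blocking reads and hand the results to the GUI through a queue that the Tk side empties without blocking. The following is a minimal sketch of that pattern, not the code in co2display.py3; start_reader, drain, and the read_sample callable are hypothetical names.

```python
import queue
import threading


def start_reader(read_sample, samples, stop_event):
    """Run the blocking read_sample() in a background thread.

    read_sample is a placeholder for whatever blocks on the device;
    each result is put into the samples queue until stop_event is set.
    """
    def loop():
        while not stop_event.is_set():
            samples.put(read_sample())

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread


def drain(samples):
    """Return everything currently queued without ever blocking."""
    items = []
    while True:
        try:
            items.append(samples.get_nowait())
        except queue.Empty:
            return items
```

    In the widget, _update would then call drain() on the queue and re-schedule itself with self.after() as before; since drain() never waits for the device, resizes and redraws keep working, and only the main thread ever touches Tk.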

    Ah, that, and recovering from plugging the device out and in again, which would also improve behaviour as people suspend the machine.

    Apart from that, there's just one detail I should perhaps highlight in the code: The

    self.bind("<Configure>", lambda ev: self._update_plot())
    

    in the constructor. That makes the history plot re-scale if the UI is re-sized, and I've always found it a bit under-documented that <Configure> is the event to listen for in this situation. But perhaps that's just me.

    Nachtrag (2021-10-19)

    I've updated co2display.py3 as published here, since I've been hacking on it quite a bit in the meantime. In particular, if you rename the script co2log.py (or anything else with “log” in it), this will run as a plain logger (logging into /var/log/co2-levels by default), and there's a systemd unit at the end of the script that lets you run this automatically; send a HUP to the process to make it re-open its log; this may be useful together with logrotate if you let this run for weeks or months.
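
    In case it is unclear what re-opening on HUP means in practice, here is a minimal sketch of the idea. It is not how co2display.py3 necessarily does it; the class ReopeningLog and the function install_hup_handler are hypothetical names for illustration.

```python
import signal


class ReopeningLog:
    """An append-only log file that can be re-opened on request."""

    def __init__(self, path):
        self.path = path
        self.file = open(path, "a")

    def write_line(self, line):
        self.file.write(line + "\n")
        self.file.flush()

    def reopen(self, signum=None, frame=None):
        # After logrotate has moved the old file away, opening the
        # same path again creates a fresh file.
        self.file.close()
        self.file = open(self.path, "a")


def install_hup_handler(log):
    """Make SIGHUP re-open log (POSIX only)."""
    signal.signal(signal.SIGHUP, log.reopen)
```

    Without the re-open, the process would keep writing into the renamed file, because an open file handle follows the file rather than its name.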

    You can also enable logging while letting the Tk UI run by passing a -d option …

  • Corona Film, Part 2

    At the end of my post on the Corona incidence film I had confidently said that the RKI data set would also yield plots of age medians. Well… I promised too much there, because what is actually in it are age groups, and these are so coarse that a halfway serious estimate of sufficiently decorative medians (that is: with at least 256 distinct values) takes some thought.

    But even without much care, the age groups suffice to compute a kind of score (I'll reveal in a moment how I did that), and this gives, for example, this film:

    (should your browser botch this: Download).

    Nachtrag (2021-11-03)

    Just so nobody wonders how autumn figures get into a post from August: I have continued the film until November 2021 and will probably update it a few more times in the future.

    What is there to see? The colour scale shows something I have called the age score. For it, I have translated the age groups from the RKI data into numbers like this:

    Age group   Score
    A00-A04     2
    A05-A14     10
    A15-A34     20
    A35-A59     47
    A60-A79     70
    A80+        85
    unbekannt   ignored

    I average what comes out of that over all reported cases within 14 days. The more robust and more honest alternative would probably be to compute an interpolated median, but I refrained from that if only because I would then possibly have had to assume an upper limit for A80+; as it stands, it is at best a score whose comparability between districts is so-so, given that their age distributions probably differ quite widely. It certainly has more substance than university or cafeteria rankings, though (which is not a strong claim).
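
    As a minimal sketch of this computation (not the code from corona.py; the function name age_score and the input format of (age group, case count) pairs are assumptions for illustration), the score is simply a case-weighted mean of the numbers from the table above:

```python
# Representative ages as given in the table above; "unbekannt" is
# deliberately absent, so such cases are skipped.
AGE_SCORES = {
    "A00-A04": 2,
    "A05-A14": 10,
    "A15-A34": 20,
    "A35-A59": 47,
    "A60-A79": 70,
    "A80+": 85,
}


def age_score(cases):
    """Return the mean score of (age_group, case_count) pairs, or None.

    cases is assumed to hold all records of one district within the
    14-day window.
    """
    total = weighted = 0
    for age_group, count in cases:
        if age_group not in AGE_SCORES:
            continue
        total += count
        weighted += AGE_SCORES[age_group] * count
    if total == 0:
        # No usable cases: there is no score for this district.
        return None
    return weighted / total
```

    For instance, three cases in A15-A34 and one in A60-A79 give (3·20 + 70)/4 = 32.5, regardless of how many cases of unknown age there are.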

    What was really fiddly about this visualisation was that for long periods many districts either have no data at all, simply because there were no infections, or the estimate rests on so few cases that it means rather little. The latter would produce a strong bubbling signal that in reality is only the noise of bad estimates or, respectively, ill-defined distributions.

    That is why I have put the incidence into the transparency; to keep some distance from the white background I start right at 20%, and because 100 cases seem robust enough to me, I set 100% opacity from a 14-day incidence of 100 upwards. Where data is missing entirely, I only draw the outlines of the districts.
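
    A possible mapping from incidence to opacity along these lines is sketched below. The 20% floor and the saturation at an incidence of 100 are from the text; that the values in between are interpolated linearly is an assumption of this sketch, as is the function name opacity.

```python
def opacity(incidence, floor=0.2, saturation=100.0):
    """Map a 14-day incidence to an alpha value between floor and 1."""
    if incidence is None:
        # No data at all: the caller draws only the outline.
        return None
    # Clamp to [0, saturation], then scale linearly (an assumption).
    fraction = min(max(incidence, 0.0), saturation) / saturation
    return floor + (1.0 - floor) * fraction
```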

    What have I learned from the film? Well, frankly, considerably less than from the incidence film in the last post. I had actually hoped that (with the current colour map) there would be a dramatic swing from reddish to bluish when vaccinations got going in the large care homes for the elderly in early 2021. But that is visible at best if you watch for it closely. In general I was surprised how low the age scores mostly are. Had I thought about it beforehand, both the RKI's incidence heat map and some prose on the age distributions would have strongly suggested that already, for instance in the latest weekly report of the RKI:

    Of all deaths, 79,101 (86%) were persons aged 70 years and older; the median age was 84 years. In contrast, the share of those over 70 in the total number of reported COVID-19 cases is about 13%.

    – in terms of consequences Corona was above all a problem of fairly old people, but the pandemic was carried almost entirely by the younger ones.

    It is perhaps instructive that the districts mostly go from blue to red, so outbreaks start among relatively young people and move towards older ones. This can already be seen in the Heinsberg outbreak, which starts with a score of 36 (I would never have predicted that for a Kappenabend) and keeps rising fairly monotonically. At about 55 I lost track of it. This, if you will, redshift is a phenomenon that can be observed quite frequently in the film. My eerie suspicion is that the outsourced cleaning and care workers, who in the name of cost savings are not infrequently herded as a crew through several care homes in a row, played a large part in this.

    It was rather to be expected that university towns regularly show up among the “young” districts, Göttingen for example in the otherwise quiet June of 2020, while at the same time in Gütersloh the Tönnies migrant workers have markedly higher age scores – impressive that they endure the drudgery in our slaughterhouses into the A35-A59 bin that was strong in this outbreak.

    What I would not have expected to this extent is the green-red divide between West and East Germany in the second wave, particularly clear in January 2021. A good part of it will surely be basic demography, because in large parts of East Germany there are no longer all that many young people who could fall ill in the first place. But it should not be that much different in many rural districts of West Germany either. Hm. At some point I need demographic data for Germany resolved by age and district.

    Take the Landkreis Hof, which in June 2021 is among the five youngest districts: I would actually expect a rather old population there. The low score at that time is therefore surely a consequence of, you guessed it, the wild parties of the youth of which we already heard so much in the summer of 2020. Naughty kids.

    In other words: Unfortunately I have not drawn very deep insights from the visualisation. If what is shown there were not rather serious, one could at least enjoy the lava-lamp-like appearance.

    Technics

    Nachtrag (2022-10-27)

    The code is now on Codeberg.

    Rewriting the code from the previous post was an interesting exercise that in particular brought with it a rather natural application of inheritance (against the backdrop of the Gang of Four's recommendation to normally think about composition rather than inheritance), namely in the Plotter class. The parameterisation of what is iterated over (_iter_maps, iter_freqs, iter_age_scores) was, well, interesting, too.

    In the process, the program has acquired an (almost) proper command line interface:

    $ python3 mkmovie.py --help
    usage: mkmovie.py [-h] [-d] [-i N] [-m ISODATE] {inc,age}
    
    Make a movie from RKI data
    
    positional arguments:
      {inc,age}             select what kind of movie should be made
    
    optional arguments:
      -h, --help            show this help message and exit
      -d, --design_mode     just render a single frame from a dict left in a
                            previous run.
      -i N, --interpolate N
                            interpolate N frames for one day
      -m ISODATE, --min-date ISODATE
                            discard all records earlier than ISODATE
    

    With that, the film above is produced by:

    $ python3 mkmovie.py --min-date=2020-02-20 -i 7 age
    

    The current code: mkmovie.py and corona.py.
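
    For readers who have not used argparse: a command line interface like the one in the help output above can be declared roughly as follows. This is a reconstruction from the help text only, not the actual code in mkmovie.py; the function name parse_command_line, the destination name kind, and the default values are assumptions.

```python
import argparse


def parse_command_line(args=None):
    """Parse args (or sys.argv if args is None) as shown in the help text."""
    parser = argparse.ArgumentParser(
        description="Make a movie from RKI data")
    parser.add_argument("-d", "--design_mode", action="store_true",
        help="just render a single frame from a dict left in a"
            " previous run.")
    parser.add_argument("-i", "--interpolate", type=int, default=1,
        metavar="N", help="interpolate N frames for one day")
    parser.add_argument("-m", "--min-date", type=str, default=None,
        metavar="ISODATE",
        help="discard all records earlier than ISODATE")
    # A positional argument with choices shows up as {inc,age}.
    parser.add_argument("kind", choices=["inc", "age"],
        help="select what kind of movie should be made")
    return parser.parse_args(args)
```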

  • Corona as a Film

    Incidences may no longer be the ideal means of describing the current Corona risk situation, even though there will hardly be a faster signal as long as PCR is not run on sewage across the board. But for all the deficits and inaccuracies that have been grumbled about here, too, they are still hard to beat when it comes to a picture, resolved by time and place, of what the virus – pardon, the viruses – have been up to.

    And after I recently started playing with the RKI's big infection data set, it occurred to me that, starting from the code discussed there, it should not be difficult to put together a film that visualises the course of the incidences at the district level. Well, I have done that:

    (should your browser botch this: Download; district polygons: © GeoBasis-DE / BKG 2021). In contrast to the well-known images from the RKI report (which compute the incidence by the reporting date), I am by the way using the reference date here, that is, the date of infection where possible and the reporting date only otherwise.

    Nachtrag (2021-11-03)

    Just so nobody wonders how autumn figures get into a post from August: I have continued the film until November 2021 and will probably update it a few more times in the future.

    A few things I have learned from this:

    • There was evidently quite a bit of Corona noise in the republic even before Heinsberg (at the beginning of the video). How much of that is simply typos is of course hard to say. But some of it will be real, and then it is very remarkable that in none of these cases a clinically conspicuous outbreak ensued. That would be a very clear illustration of the high overdispersion of at least the original virus: almost all infected people infect(ed) [1] nobody, so that, as long as outbreaks are small, they quickly disappear again.
    • I had completely forgotten how quickly, at the beginning of the pandemic, the consequences of the beer festival in Tirschenreuth overtook those of the Kappenabend in Heinsberg (namely around March 10). It pays to keep an eye on eastern Bavaria in March 2020.
    • The somewhat seething appearance of the picture, especially in calmer phases – perhaps like a lazily simmering porridge in which every now and then a bubble rises at more or less random places – shows once more that Corona spreads mainly in outbreaks. It surely does so in less calm phases, too, but then there are bubbles everywhere (“at a rolling boil”), so that this is no longer noticeable.
    • The big outbreaks of the summer of 2020 (above all Gütersloh and Dingolfing) were astonishingly isolated events. Considering that there are, after all, a whole lot of slaughterhouses and other similarly dreadful businesses in Germany, one does wonder why there were not more outbreaks in the Tönnies style. Were the other outfits all more careful? Or were they simply lucky because nobody with high virus shedding came into their production halls? Were outbreaks perhaps overlooked? Purely demographically, after all, the Tönnies people, for instance, were so young and fit that only a few ended up in hospital and nobody died.
    • Even in the second and third waves, an astonishing amount of structure remained in the infection dynamics – more than I would have expected after looking at the static RKI plots and the relatively parallel incidences of the federal states, but above all contrary to the claim that the dynamics were “diffuse”. Given the continuing “seething”, I would actually be surprised if B.1.1.7 or the B.1.617.x spread so much differently from the original wild type.
    • In particular, in the later waves, too, there are districts that briefly “bubble up” and then quickly become inconspicuous again. It would surely be instructive to know why that quickly-becoming-inconspicuous did not happen for a long time in the districts in southern Saxony and southern Thuringia.

    Technics

    Nachtrag (2022-10-27)

    The code is now on Codeberg.

    Making this film was, by the way, not quite that simple after all, mainly because I wrestled for a while with the polygons for the districts. My first plan was to simply pull them from Openstreetmap. But I did not want to operate on a complete OSM dump, and so I was very taken with osm-boundaries. With this service it should actually have been rather easy to get the “administrative boundaries” of districts (that would be level 6 there) as geojson.

    But apart from the fact that the interactive selection liked to fail with “cannot download tree”, that my Webkit chewed on the Javascript for a long time, and that Cloudflare regularly aborted the downloads prematurely (the makers of osm-boundaries do not quite understand this themselves), the labels of the districts are very different from what the RKI has: “Solingen (Klingenstadt)” on the RKI side was just as annoying as the missing distinction between urban and rural districts (Stadt- and Landkreise) in the Openstreetmap names (which at least could be worked around by looking at the areas).

    But even once I had hacked together the mapping between the various identifiers, some white spots remained, that is, districts that I simply could not find in Openstreetmap. At that point I went to the official source, namely the Bundesamt für Kartographie und Geodäsie, specifically the VG2500 data set, which to my great relief also contains the RKI's district identifiers (1001 for Flensburg-Stadt, for instance). Well, except for Berlin, which the RKI splits up, which in turn needs some fiddling in the code to reunify Berlin.
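
    The reunification of Berlin could look roughly like the following sketch. It is not the code from corona.py, and it rests on an assumption that should be checked against the actual data: that the RKI numbers the Berlin districts 11001 to 11012 while the polygon data set has a single entry 11000 for the whole city. The function name merge_berlin is made up, too.

```python
def merge_berlin(counts):
    """Fold counts for Berlin's districts into one entry.

    counts maps integer district ids to case counts.  This assumes
    the Berlin districts carry the ids 11001 to 11012 and the merged
    city should get 11000; check both against your data.
    """
    merged = {}
    for kreis_id, count in counts.items():
        if 11001 <= kreis_id <= 11012:
            kreis_id = 11000
        merged[kreis_id] = merged.get(kreis_id, 0) + count
    return merged
```

    Note that this works for case counts; incidences would additionally need the populations of the merged districts added up.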

    But alas: The stuff comes as Shape, according to the BKG a “de facto industry standard”, with which, however, I had never had anything to do before and which comes as binary sauce spread over several files. At least the Debian-packaged Cartopy can deal with it. Phew. Only: brazenly letting add_geometries loose on the geometries that fall out of the reader results in an empty map.

    In what followed I rather drowned in all the reference systems that geographers have to grapple with. Ah, how good we have it in astronomy. Yes, sure, we too have a whole lot of different equators and zero points (e.g., two different systems for galactic coordinates, and loads of historical equatorial systems that moreover were defined by different star catalogues): But in the end these are essentially rotations with tiny wrinkles, and at worst things just come out in the wrong place, which would not have mattered one bit for my present purposes.

    But here: nothing on the whole map. Plotting absolutely requires the right source reference system, here (as the .prj of the VG2500 reveals) the EPSG system number 25832 (twenty-five thousand eight hundred and thirty-two! Holy Cow!). Cartopy can even deal with that, but it then pulls the description of the system from an online service on every program run, and in my world that is a complete no-go. That is why I quickly cobbled together a Proj4Projection class that manually feeds the string that comes from the online service into the underlying library. Warning: This was written without domain knowledge; that I completely fake the validity bounds there is probably poison outside this specific application.

    The rest of the code is harmless Python that massages the input data into shape. Because the RKI data is unfortunately sorted by district rather than by date, I have to take the complete data set into RAM; on hardware that is not totally ancient that is no drama, though.

    What is still rather important for the visual impression: I interpolate linearly between the days (the iter_interpolated function). That is useful because the transitions are then not so hard, and also so that the film is not just 25 seconds (that is, about 600 frames) long but runs for something like two minutes.
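
    The principle of the interpolation can be sketched as below. This is not the iter_interpolated function from mkmovie.py, whose signature I am not reproducing here; interpolate_days, its arguments, and the treatment of missing districts as 0 are assumptions for illustration.

```python
def interpolate_days(daily_values, n_frames):
    """Yield n_frames linearly interpolated dicts per day transition.

    daily_values is a sequence of dicts mapping district id to a
    number, one dict per day; ids missing on a day count as 0.
    """
    for today, tomorrow in zip(daily_values, daily_values[1:]):
        ids = set(today) | set(tomorrow)
        for step in range(n_frames):
            # weight runs from 0 (today) towards 1 (tomorrow).
            weight = step / n_frames
            yield {
                i: (1 - weight) * today.get(i, 0) + weight * tomorrow.get(i, 0)
                for i in ids
            }
    if daily_values:
        # The last day is shown as it is.
        yield dict(daily_values[-1])
```

    With n_frames set to 7, every day contributes seven frames, which stretches the film by that factor and softens the transitions.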

    Whoever wants to rebuild this, or make, say, age-group-specific films, or ones with the median age – the data set would certainly support that – or ones that do not show everything above 350 as saturated or something like that, needs mkmovie.py and corona.py. The source data is expected in an external directory defined in corona.py (DATA_DIR); which data, and where it comes from, is stated at the top of mkmovie.py.

    [1]With the later virus variants it may admittedly no longer be quite that simple, which is why I am writing such a cautious past tense here.
  • Long live PGP

    The other day I ran into two rants against PGP, What's the matter with PGP?, which still is relatively reasonable, and then the raving PGP Problem by people running a security consulting shop called Latacora. It's this second diatribe that made me write this post, because the amount of badmouthing of PGP done there, on the blog of a company promising to teach startups “security“ on top, is not only unwarranted, it's also actively damaging to meaningful encryption.

    Let me start with what I think are the fundamental fallacies of the Latacora folks: They seem to think that identity, and hence key management, is something that others can do for you, and that thinking about crypto is “bad user experience”. In contrast, I'd argue that crypto you don't notice isn't crypto at all (but instead somewhere on the obfuscation spectrum), and that identity management done by others is equivalent to them encrypting for you.

    You see, if you're encrypting something, you're encrypting it for someone. In public key encryption, this “someone” has two aspects: a real-world entity (your friend, a bank, whatever), and a public key. Any crypto system that does not make this separation transparent, and does not ask you how well you think the two things match (and whether you care), is fundamentally broken. That's because that question plainly has to be answered. If it's not you who answers it, it's someone else. And hence that someone else is free to change the mapping from key to real-world entity, and hence they determine which real-world entity gets to read what you've encrypted.

    This, by the way, is how https is regularly subverted by businesses, anti-virus software, and occasionally state actors, who simply make your browser trust their word on who is what. At that moment, they can inspect everything your browser exchanges with the rest of the world, be it in “anti-virus portals” or bluntly in surveillance systems. This is also why very few users of “encrypted” messengers would even notice if the operating company snooped on them.

    The big advantage of PGP is what the Latacora people call “obnoxious UX”. Yes, you have to make up your mind on keys, yes, you have to explicitly manage them: but that is what you need to understand if you want meaningful encryption, and plastering abstractions on top of that only gives extra levels people have to understand – and that can break. No: if you want to do encryption, you'll have to understand key management, and PGP makes that as explicit and transparent as any crypto system I've ever seen. That's a feature, not a bug.

    Actually, this one thing is far more important than regular key rotation (as an aside: I'm not aware of anyone ever having broken a PGP secret key because too much material has been encrypted using it; it's certainly not a major reason for failing encryption) or the latest cryptographic primitives (even 1024-bit RSA keys still require serious investment to break, 20 years after they were state of the art).

    More generally, the scenarios requiring frequent key rotation mostly imagine a targeted attack from a state actor, following you into hotel rooms to steal your secret key and install key loggers to skim your passphrase (or similar). Frankly: it's an illusion to believe a muggle (or even normal wizards doing their daily work) could withstand such a determined and competent state actor for a long time.

    But that's not necessary. It's already great progress if Google, the (normal) police, and the people at your employer's computation centre can't read the content of your mails unless they really try hard. Actually, most “adversaries” are not terribly determined and/or terribly competent. You hence don't need a perfect cryptosystem. It just needs to withstand the most basic of man-in-the-middle attacks (which fends off the not terribly competent adversaries), and it needs to require at least a bit of individual effort to break each person's crypto (which fends off the not terribly determined ones). Any system with centralised identity management fails at least on the side of the individual effort – get the central entity to collude, or break it, and you have it all. And since a commercial central entity has to legally sit somewhere, it fails the withstand-the-most-basic-attacks criterion as well, at least with respect to its state of residence: The average state has no trouble obtaining court orders as soon as it moans “terrorism”.

    PGP, on the other hand, does a fairly good job on both counts once people have grokked it, at least a much better one than anything SSL-based I've ever seen. People can understand the basic operations of PGP if they want, something that is much harder with SSL and X.509. The problem is that few people want to understand that. But in that case, any kind of crypto is doomed. Hence, the problem isn't PGP, it is, as so often, education. Working on that is effort well spent, as once people have understood PGP to the level of confidently using it, they have a much better chance of being able to competently use other, less explicit crypto systems.

    Having said that: sure, PGP and its UIs could be improved. But you can't get around PGP's long-term keys for e-mail, and whatever alternative you'd come up with, you'll still need keyrings and reasonable UIs to mark up the trust in there. And, in particular, you'll still need open standards so you don't have to take a single company's word for what it does.

    That's basically where I think Latacora's arguments are harmful. But since most of the claims in the article are fairly outrageous, I can't resist commenting on them, too:

    • “Absurd Complexity” – that's not a terribly credible charge given the page comes over HTTPS, which in many ways is a lot more complex than OpenPGP, in particular because it contains the nightmare that is X.509. But really: Much as I am all for reducing complexity, I'm even more for maintaining backward compatibility. Being able to read encrypted mails I got in the 1990s matters to me, and anything flexible enough to at least support modern crypto and deal with archive data will have to have a certain amount of complexity. Start something new, and in 10 years it'll look the same. Or worse. Only it won't be able to read archival data, and it'll have repeated the history of early bugs that all software goes through, in particular when you have to worry about side channel attacks.
    • “Swiss Army Knife” – that would be more convincing if they said how exactly PGP signatures are “mediocre” or the encryption is a “pretty bad job”. I accept the argument they make a bit down that people may want to have forward secrecy in IM and it's easy to have there, so going for PGP alternatives may be a good idea there. But then I can't remember the last time I used PGP over XMPP, so these alternatives have existed for a long time, without any need to kill PGP.
    • “Mired in Backwards Compatibility” – well, as said above, backwards compatibility is great if your systems live a long time. And OpenPGP is doing a reasonable job of providing both backwards compatibility and evolvability. That rolling out new features isn't instantaneous, in particular for a federated, largely volunteer effort, is useless criticism: Build another distributed, open, volunteer effort, and it'll go the same way. Oh, and just incidentally: HTTPS is from the nineties, too, and the X.509 standard was published in 1988.
    • “Obnoxious UX”, “Long-Term Secrets”, “Incoherent Identity” – these are the core of Latacora's fallacies; see above.
    • “Broken Authentication” – I can't say I've thought through the problem they're pointing to here; if it is as bad as they claim, there's no reason not to fix it within OpenPGP rather than invent something entirely new.
    • “Leaks Metadata” – sure. As long as there's SMTP, there's no way around having quite a bit of cleartext: intermediate mail servers will have an idea who's mailing whom (though of course there's still Mixmaster). Having e-mail, or at least something that doesn't require me to be always online, is so important that the metadata leaks are an almost negligible price to pay, at least compared to the fine-grained activity profile you leak when you leave your phone online all the time, or the loss of control when you (or people you trust) can't run the necessary infrastructure components (as in: a mail server) any more.
    • “No Forward Secrecy” – fair enough, but again that's hard to have in store-and-forward e-mail, in particular when you'd like to have mailing lists and the like. Of course, it's true that »[a]gainst serious adversaries and without forward secrecy, breaches are a question of “when”, not “if”.« But on the other hand, most serious adversaries lose interest in the cleartext hours, weeks, or at worst a few years after the communication, or at least stop being terribly serious. So, for these intents and purposes holding …
