Select And Merge Pages From Lots Of PDFs Using pdftk

For most of my ad-hoc PDF manipulation needs (cut and paste pages, watermark, fill forms, attach files, decrypt, etc.), I rely on pdftk: it is fast, Debian-packaged (in pdftk-java), and as reliable as can be expected given the swamp of half-baked PDF writers. So, when I recently wanted to create a joint PDF from the first pages of about 50 other PDFs, I immediately started thinking along the lines of ls and perhaps a cat -b (which would number the lines and thus the files) and then pdftk.

Why cat -b? Well, to do cut-and-merge with pdftk, you have to come up with a command line like:

pdftk A=input1.pdf B=input2.pdf cat A1-4 B5-8 output merged.pdf

This would produce a document merged.pdf from pages 1 through 4 of input1.pdf and pages 5 through 8 of input2.pdf. I hence need to produce a “handle” for each input file, for which something containing the running number would appear an obvious choice.

My initial plan had therefore been to turn lines like 1 foo.pdf from ls | cat -b into doc1=foo.pdf with a dash of sed and go from there. If I were more attentive than I am, I would immediately have realised that this won't fly: With handles containing digits, pdftk would have no robust way to tell whether doc12 means “page 12 from doc”, “page 2 from doc1”, or “all pages from doc12”. Indeed, pdftk's man page says:

Input files can be associated with handles, where a handle is one or more upper-case letters[.]
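
To make that constraint concrete, here is a minimal Python sketch of a check for what the quoted sentence allows. The name is_valid_handle is made up for this illustration, and the regular expression is only one reading of the man page sentence, not something taken from pdftk's source:

```python
import re

# Assumption: "one or more upper-case letters" means ASCII A-Z only.
HANDLE_PATTERN = re.compile(r"[A-Z]+")

def is_valid_handle(candidate):
    """returns True if candidate consists of uppercase ASCII letters only."""
    return HANDLE_PATTERN.fullmatch(candidate) is not None

for candidate in ["A", "DOC", "doc1", "A1"]:
    print(candidate, is_valid_handle(candidate))
```

Under that reading, doc1 fails twice over: it contains lowercase letters and a digit.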

Oh dang. I briefly meditated whether I could cook up unique sequences of uppercase handles (remember, I had about 50 files, so just single uppercase letters wouldn't have done it) using a few shell hacks. But I then decided[1] that's beyond my personal shell script limit and calls for a more systematic programming language like, umm, Python[2].

The central function in the resulting little program is something that writes integers using uppercase letters only. Days later, I can't explain why I did not simply exploit the fact that there are a lot more uppercase letters than there are decimal digits, and that hence making uppercase labels from integers is solvable using str.translate. A slightly overcompact rendering of that would be:

# The digits 0-9 have the ASCII codes 48 through 57; shift each to A-J.
DIGIT_TO_LETTER = {code: chr(code+17) for code in range(48, 58)}
def int_to_uppercase(i):
    return str(i).translate(DIGIT_TO_LETTER)

(if you don't remember the ASCII table: 48 is the ASCII code for zero, and 48+17 is 65, which is the ASCII code for the uppercase A).
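
As a quick sanity check, the following sketch applies that translation to the first 50 integers (roughly the number of files in question) and verifies that all resulting labels are distinct. The table is spelled out here for the ten digit codes 48 through 57 so that the block runs on its own:

```python
# Map the ASCII codes of the digits 0-9 to the letters A-J.
DIGIT_TO_LETTER = {code: chr(code + 17) for code in range(48, 58)}

def int_to_uppercase(i):
    return str(i).translate(DIGIT_TO_LETTER)

labels = [int_to_uppercase(i) for i in range(50)]
print(labels[0], labels[12], labels[49])   # A BC EJ

# Distinct integers have distinct decimal strings, and the mapping is
# one-to-one per character, so the labels are distinct as well.
assert len(set(labels)) == len(labels)
```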

But that's not what I did, perhaps because of professional deformation (cf. my crusade against base-60). Instead, I went for a base-26 representation using uppercase letters only, just like the common base-16 (“hex”) representation, which, however, uses 0-9 and A-F and is thus unsuitable here. With this, you would count like this (where the more significant “digits” are on the right rather than on the western-conventional left because it doesn't matter here and saves a reverse):

A, B, C, D..., X, Y, Z, AB, BB, CB, ... ZB, AC, BC...
0, 1, ..............25, 26, 27,.......      52, 53

I freely admit I was at first annoyed that my handles went from Z to AB (rather than AA). It did take me longer than I care to confess here to realise that's because A is the zero here, and just like 01 is the same as 1 decimal[3], AA is equal to A (and BA equal to B) in that system. Consequently, my function for unique handles didn't produce AA even though I hadn't realised the problem when writing the function – there's nothing as practical as a good theory.
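
The “A is the zero” argument is perhaps easiest to see with a decoder. The following is a sketch only; handle_to_int is a name invented for this illustration and not part of the script:

```python
def handle_to_int(hdl):
    """returns the integer encoded by an uppercase handle.

    The least significant "digit" comes first, and A stands for zero.
    """
    return sum((ord(letter) - 65) * 26**pos
        for pos, letter in enumerate(hdl))

for hdl in ["Z", "AB", "ZB", "AC", "A", "AA", "B", "BA"]:
    print(hdl, handle_to_int(hdl))
```

Both A and AA decode to 0, and both B and BA decode to 1, which is why a handle with a trailing A can never be the shortest way to write a number.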

With that function, the full ad-hoc script to pick page one (that's encoded in the f"{hdl}1" in case you want other page ranges) from all files matching /some/dir/um*.pdf looks like this:

import glob
import subprocess

def make_handle(ind):
    """returns a pdftk handle for a non-negative integer.

    This is a sequence of one or more uppercase letters.
    """
    hdl = []
    while True:
        hdl.append(chr(65+ind%26))
        ind = ind//26
        if not ind:
            break
    return "".join(hdl)


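# Pair each input file with its handle, in sorted file name order.
sources = [(make_handle(ind), name)
    for ind, name in enumerate(sorted(glob.glob("/some/dir/um*.pdf")))]
# Build "pdftk A=... B=... cat A1 B1 ... output output.pdf"; the "1"
# after each handle selects page one of that file.
subprocess.check_call(["pdftk"]+[f"{hdl}={name}" for hdl, name in sources]+
    ["cat"]+[f"{hdl}1" for hdl, _ in sources]+
    ["output", "output.pdf"])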
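
If you want a different page range, or would like to inspect the command line before running it, a variant along the following lines might help. This is a sketch under assumptions: build_pdftk_args and its pages and output parameters are invented for this illustration, and the range syntax is the A1-4 form from the example near the top.

```python
def make_handle(ind):
    """returns a pdftk handle for a non-negative integer."""
    hdl = []
    while True:
        hdl.append(chr(65 + ind % 26))
        ind = ind // 26
        if not ind:
            break
    return "".join(hdl)

def build_pdftk_args(names, pages="1", output="output.pdf"):
    """returns a pdftk argument list taking the same pages from each file.

    pages is appended to each handle verbatim, e.g. "1" or "1-4".
    """
    sources = [(make_handle(ind), name) for ind, name in enumerate(names)]
    return (["pdftk"]
        + [f"{hdl}={name}" for hdl, name in sources]
        + ["cat"]
        + [f"{hdl}{pages}" for hdl, _ in sources]
        + ["output", output])

# Nothing is executed here; the list is only printed for inspection.
print(build_pdftk_args(["a.pdf", "b.pdf"], pages="1-2"))
```

The returned list can be passed to subprocess.check_call unchanged once it looks right.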

Looking back, it is not only the massively silly base-26 handles that are unnecessarily complicated. Had I realised from the beginning that I would be using Python in the end, I would probably have gone for pdfrw right away; while the complexity in terms of Debian dependencies is roughly the same (“one over what you'll already have”), avoiding a subprocess call is almost always a win[4].
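
For comparison, a pdfrw-based version might look roughly like the following. This is an untested sketch, not what I actually ran: the function name first_pages_with_pdfrw is made up, and it assumes pdfrw's PdfReader and PdfWriter classes with the pages attribute and the addpage and write methods.

```python
def first_pages_with_pdfrw(names, output="output.pdf"):
    """writes the first page of each PDF in names to output."""
    # Imported here so this module still loads where pdfrw is not installed.
    from pdfrw import PdfReader, PdfWriter

    writer = PdfWriter()
    for name in names:
        # pages is a list of page objects; index 0 is the first page.
        writer.addpage(PdfReader(name).pages[0])
    writer.write(output)
```

Calling first_pages_with_pdfrw(sorted(glob.glob("/some/dir/um*.pdf"))) after an import glob would then stand in for the whole pdftk invocation, with no handles needed at all.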

But these misgivings are one reason why I wrote this post: This is a compact illustration of the old programmers' wisdom to “Plan to throw one away – you will anyway”. Except that for tiny little ad-hoc scripts like this, a bit of baroque adornment and an extra process do not hurt, and the code above ought to work just fine if you need to produce a PDF document from some fixed page range of a few dozen or a few hundred other PDF documents.

[1]Decided foolishly, by the way, as tr 0123456789 ABCDEFGHIJ immediately turns a sequence of distinct integers into a sequence of distinct uppercase-only strings.
[2]I don't feel too good about being in the mainstream for a change, but I can prove that I'd have chosen Python long before it became fashionable.
[3]Not in Python, though, where 01 thankfully is a syntax error, and not necessarily in C, where you may be surprised to see that, for instance, 077 works out to 63 decimal. I would rank this particular folly among the most questionable design decisions in the history of programming languages.
[4]That, and my growing suspicion that “you'll already have a Java runtime on your box” is quickly becoming a rather daring assumption. Once the assumption is plain wrong, pdftk stops being a cheap dependency, as it will pull in a full JRE.