For most of my ad-hoc PDF manipulation needs (cut and paste pages,
watermark, fill forms, attach files, decrypt, etc.), I rely on
pdftk: fast, Debian-packaged (in pdftk-java), and as reliable as can be
expected given the swamp of half-baked PDF writers. So, when I
recently wanted to create a joint PDF from the first pages of about 50
other PDFs, I immediately started thinking along the lines of ls and
perhaps a cat -b (which would number the lines and thus files) and
then pdftk.
Why cat -b? Well, to do cut-and-merge with pdftk, you have to come
up with a command line like:
pdftk A=input1.pdf B=input2.pdf cat A1-4 B5-8 output merged.pdf
This would produce a document merged.pdf from pages 1 through 4 of
input1.pdf and pages 5 through 8 of input2.pdf. I hence
need to produce a “handle” for each input file, for which something
containing the running number would appear to be an obvious choice.
My initial plan had therefore been to turn lines like 1 foo.pdf from
ls | cat -b into doc1=foo.pdf with a dash of sed and go from
there. If I were more attentive than I am, I would immediately have
realised that won't fly: with handles containing digits, pdftk would
have no robust way to tell whether doc12 means “page 12 from doc”,
“page 2 from doc1”, or “all pages from doc12”. Indeed, pdftk's man page
says:
Input files can be associated with handles, where a handle is one or
more upper-case letters[.]
Oh dang. I briefly meditated whether I could cook up unique sequences
of uppercase handles (remember, I had about 50 files, so just single
uppercase letters wouldn't have done it) using a few shell hacks. But I
then decided that's beyond my personal shell script limit and
calls for a more systematic programming language like, umm, Python.
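The ambiguity that rules out digits in handles is easy to make concrete. What follows is only a sketch of the reasoning and certainly not pdftk's actual parser: possible_readings is a made-up helper that lists every way to split a token into a handle and an optional page number, once with digits allowed in handles and once with uppercase letters only, as the man page demands.

```python
import re


def possible_readings(token, handle_pattern):
    """Return all (handle, page) splits of token that look plausible.

    A split counts if the handle part matches handle_pattern and the
    rest is either empty ("all pages") or a plain page number.
    This is an illustration only, not how pdftk parses its arguments.
    """
    readings = []
    for cut in range(1, len(token) + 1):
        handle, page = token[:cut], token[cut:]
        if re.fullmatch(handle_pattern, handle) and (
                page == "" or page.isdigit()):
            readings.append((handle, page))
    return readings


# Handles that may contain digits: three competing interpretations.
print(possible_readings("doc12", r"[a-z0-9]+"))
# Uppercase-only handles: only one split survives.
print(possible_readings("AB12", r"[A-Z]+"))
```

With digits in the handle, doc12 can be read in three ways; with letters-only handles, the boundary between handle and page number is always where the letters end.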
The central function in the resulting little program is something that
writes integers using uppercase letters only. Days later, I can't
explain why I did not simply exploit the fact that there are a lot
more uppercase letters than there are decimal digits, and hence making
uppercase labels from integers is solvable using str.translate.
A slightly overcompact rendering of that would be:
DIGIT_TO_LETTER = {code: chr(code+17) for code in range(48, 58)}

def int_to_uppercase(i):
    return str(i).translate(DIGIT_TO_LETTER)
(if you don't remember the ASCII table: 48 is the ASCII code for zero,
and 48+17 is 65, which is the ASCII code for the uppercase A).
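For the record, here is a slightly less compact variant of the same idea with a quick check attached. It is a sketch that assumes non-negative integers; building the table with str.maketrans instead of ASCII arithmetic is my substitution to make the digit-to-letter mapping visible.

```python
# Map each decimal digit to a letter: 0 -> A, 1 -> B, ..., 9 -> J.
DIGIT_TO_LETTER = str.maketrans("0123456789", "ABCDEFGHIJ")


def int_to_uppercase(i):
    """Return the decimal digits of i spelled with the letters A-J."""
    return str(i).translate(DIGIT_TO_LETTER)


handles = [int_to_uppercase(i) for i in range(50)]
print(handles[0], handles[12], handles[49])   # A BC EJ

# Distinct integers have distinct decimal strings, so the handles are unique.
assert len(set(handles)) == len(handles)
```

Since this is just decimal notation in disguise, uniqueness comes for free and no digit arithmetic is needed at all.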
But that's not what I did, perhaps because of professional deformation
(cf. my crusade against base-60). Instead, I went for a base-26
representation using uppercase letters only, just like the common
base-16 (“hex”) representation that, however, uses 0-9 and A-F and thus
is unsuitable here. With this, you would count like this (where more
significant “digits” are on the right rather than on the
western-conventional left here because it doesn't matter and saves a
reverse):
A, B, C, D, ..., X,  Y,  Z,  AB, BB, CB, ..., ZB, AC, BC, ...
0, 1, 2, 3, ..., 23, 24, 25, 26, 27, 28, ..., 51, 52, 53, ...
I freely admit I was at first annoyed that my handles went from Z to AB
(rather than AA). It did take me longer than I care to confess here to
realise that's because A is the zero here, and just as 01 is the same
as 1 in decimal, AA is equal to A (and BA equal to B) in that
system. Consequently, my function for unique handles didn't produce AA
even though I hadn't realised the problem when writing the function –
there's nothing as practical as a good theory.
With that function, the full ad-hoc script to pick pages one (that's
encoded in the f"{hdl}1" in case you want other page ranges) from
all files matching /some/dir/um*.pdf looks like this:
import glob
import subprocess


def make_handle(ind):
    """returns a pdftk handle for a non-negative integer.

    This is a sequence of one or more uppercase letters.
    """
    hdl = []
    while True:
        # 65 is the ASCII code for A; least significant digit comes first
        hdl.append(chr(65+ind%26))
        ind = ind//26
        if not ind:
            break
    return "".join(hdl)


# pairs of (handle, file name), in sorted file name order
sources = [(make_handle(ind), name)
    for ind, name in enumerate(sorted(glob.glob("/some/dir/um*.pdf")))]

# pdftk A=first.pdf B=second.pdf ... cat A1 B1 ... output output.pdf
subprocess.check_call(["pdftk"]+[f"{hdl}={name}" for hdl, name in sources]+
    ["cat"]+[f"{hdl}1" for hdl, _ in sources]+
    ["output", "output.pdf"])
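The earlier observation that A plays the role of zero can be checked with a small decoder. This is a sketch only: handle_to_int is a hypothetical helper that is not part of the script, and make_handle is repeated here merely so the snippet runs on its own.

```python
def make_handle(ind):
    """Same algorithm as in the script: base 26, least significant digit first."""
    hdl = []
    while True:
        hdl.append(chr(65 + ind % 26))
        ind //= 26
        if not ind:
            break
    return "".join(hdl)


def handle_to_int(hdl):
    """Decode a handle; hypothetical helper, not part of the script."""
    return sum((ord(char) - 65) * 26**pos for pos, char in enumerate(hdl))


# A is the zero digit, so padding with A does not change the value.
print(handle_to_int("A"), handle_to_int("AA"))    # 0 0
print(handle_to_int("B"), handle_to_int("BA"))    # 1 1
print(make_handle(25), make_handle(26), make_handle(51), make_handle(52))
# Z AB ZB AC

# Encoding and then decoding gives back the input, so handles are unique.
assert all(handle_to_int(make_handle(i)) == i for i in range(1000))
```

Because make_handle never emits a superfluous most significant A, each integer gets exactly one handle, which is all pdftk needs.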
Looking back, the massively silly base-26 handles are not the only
thing that is unnecessarily complicated. Had I realised from the
beginning that I would be using Python in the end, I would
probably have gone for pdfrw right away; while the complexity in terms
of Debian dependencies is roughly the same (“one over what you'll
already have”), avoiding a subprocess call is almost always a win.
But these misgivings are one reason why I wrote this post: This is a
compact illustration of the old programmers' wisdom to “Plan to throw
one away – you will anyway”. Except that for tiny little ad-hoc scripts
like this, a bit of baroque adornment and an extra process do not hurt
and the code above ought to work just fine if you need to produce a PDF
document from some fixed page range of a few dozen or hundred other PDF
documents.
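For comparison, the pdfrw route mentioned above might look roughly like the following. I have not used this for the job described here, so take it as an untested sketch based on pdfrw's PdfReader and PdfWriter classes; merge_first_pages is a name made up for illustration.

```python
def merge_first_pages(source_names, target_name):
    """Write page one of each PDF in source_names to target_name."""
    # Imported here so this sketch can be loaded without pdfrw installed.
    from pdfrw import PdfReader, PdfWriter

    writer = PdfWriter()
    for name in source_names:
        # pages is a list of page objects; index 0 is the first page.
        writer.addpage(PdfReader(name).pages[0])
    writer.write(target_name)


# Assumed usage, mirroring the pdftk script:
#   import glob
#   merge_first_pages(sorted(glob.glob("/some/dir/um*.pdf")), "output.pdf")
```

No handles are needed at all in this variant, since pages are picked as Python objects rather than named on a command line.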