A Local Search Engine for Pelican-based Blogs

As the number of posts on this blog approaches 100, I figured some sort of search functionality would be in order. And since I'm wary of “free” commercial services and Free network search does not seem to go anywhere[1], the only way to offer search that is both practical and respectful of the digital rights of my readers is to have a local search engine. True, having a search engine running somewhat defeats the purpose of a static blog, except that there's a lot less code necessary for doing a simple search than for running a CMS, and of course you still get to version-control your posts.

I have to admit that the “less code” argument is a bit relative given that I'm using xapian as a full-text indexer here. But I've long wanted to play with it, and it seems reasonably well-written and well-maintained. I have hence written a little CGI script enabling search over static collections of HTML files, which means in particular pelican blogs. In this post, I'll tell you first a few things about how this is written and then how you'd run it yourself.

Using Xapian: Indexing

At its core, xapian is not much more than an inverted index: Essentially, you feed it words (“tokens”), and it will generate a database pointing from each word to the documents that contain it.
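To make the idea concrete, here is a toy inverted index in plain Python. It only illustrates the concept and says nothing about how xapian stores things internally; the name build_inverted_index and the naive whitespace tokenisation are assumptions made up for this sketch.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercased token to the set of document ids containing it.

    documents: dict of document id -> text (a stand-in for real documents).
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


docs = {
    1: "static blogs are simple",
    2: "search on static sites",
}
index = build_inverted_index(docs)
print(sorted(index["static"]))  # ids of all documents containing "static"
```

Looking up a word is then a single dictionary access, which is what makes this structure attractive for full-text search.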

The first thing to understand when using xapian is that it doesn't really have a model of what exactly a document is; the example indexer code, for instance, indexes a text file such that each paragraph is treated as a separate document. All xapian itself cares about is a string (“data”, but usually rather metadata) that you associate with a bunch of tokens. This pair receives a numeric id, and that's it.

There is a higher-level thing called omega built on top of xapian that does identify files with xapian documents and can crawl and index a whole directory tree. It also knows (to some extent) how to pull tokens from a large variety of file types. I've tried it, and I wasn't happy; since pelican creates all those ancillary HTML files for tags, monthly archives, and whatnot, when indexing with omega, you get lots of really spurious matches as soon as people enter a term that's in an article title, and entering a tag or a category will yield almost all the files.

So, I decided to write my own indexer, also with a view to later extending it to language detection (this blog has articles in German and English, and they eventually should be treated differently). The core is rather plain in Python:

import fnmatch
import os

# document_dir is pelican's output directory
for dirpath, children, names in os.walk(document_dir):
  for name in fnmatch.filter(names, "*.html"):
    path = os.path.join(dirpath, name)
    doc = index_html(indexer, path, document_dir)

That's enough for iterating over all HTML files in a pelican output directory (which document_dir should point to).

In the code, there's a bit of additional logic in the do_index function. This code enables incremental indexing, i.e., only re-indexing a file if it has changed since the last indexing run (pelican fortunately manages the file timestamps properly [Update (2021-11-13): it didn't, actually; see the search engine update post for how to fix that]). What I had to learn the hard way is that since xapian has no built-in relationship between what it considers a document and an operating system file, I need to explicitly remove the previous document matching a particular file. The function get_indexed_paths produces a suitable data structure for that from an existing database.
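The decision logic behind incremental indexing can be sketched without xapian. This is not the code from do_index: plan_reindex and the shape of the indexed mapping (path to a pair of document id and mtime) are hypothetical stand-ins for whatever get_indexed_paths really returns, and removing documents for files that vanished from disk is an assumption of this sketch.

```python
def plan_reindex(indexed, on_disk):
    """Decide which stale documents to delete and which files to index.

    indexed: dict path -> (doc_id, mtime) as recorded in the database
    on_disk: dict path -> current mtime of the file
    Returns (doc_ids_to_delete, paths_to_index).
    """
    to_delete, to_index = [], []
    for path, mtime in on_disk.items():
        if path not in indexed:
            to_index.append(path)        # new file, nothing to remove
            continue
        doc_id, indexed_mtime = indexed[path]
        if mtime > indexed_mtime:
            to_delete.append(doc_id)     # drop the old document first
            to_index.append(path)
    # assumption: files that vanished from disk leave the index as well
    for path, (doc_id, _) in indexed.items():
        if path not in on_disk:
            to_delete.append(doc_id)
    return to_delete, to_index


indexed = {"a.html": (1, 100.0), "b.html": (2, 100.0), "gone.html": (3, 50.0)}
on_disk = {"a.html": 100.0, "b.html": 200.0, "new.html": 150.0}
to_delete, to_index = plan_reindex(indexed, on_disk)
print(to_delete, to_index)
```

The point is the explicit delete step: without it, every changed file would end up in the database twice.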

The indexing also defines my document model; as said above, as far as xapian is concerned, a document is just some (typically metadata) string under user control (plus the id and the tokens, obviously). Since I want structured metadata, I need to structure that string, and these days, JSON is the least involved way to have structured data in a flat string. That explains the first half of the function that actually indexes one single document, the path of which comes in as f_name:

import json
import os

import bs4
import xapian

# soup_to_text and remove_prefix are small helpers defined elsewhere in the script
def index_html(indexer, f_name, document_dir):
  with open(f_name, encoding="utf-8") as f:
    soup = bs4.BeautifulSoup(f, "lxml")
  doc = xapian.Document()
  # the document data is a JSON string carrying the metadata
  meta = {
    "title": soup_to_text(soup.find("title")),
    "path": remove_prefix(f_name, document_dir),
    "mtime": os.path.getmtime(f_name),}
  doc.set_data(json.dumps(meta))

  content = soup.find(class_="indexable")
  if not content:
    # only add terms if this isn't some index file or similar
    return doc
  print(f"Adding/updating {meta['path']}")

  indexer.set_document(doc)
  indexer.index_text(soup_to_text(content))

  return doc

My metadata thus consists of a title, a path relative to pelican's output directory, and the last modification time of the file.
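On the search side, that string comes back out of the database and needs decoding again. A minimal round trip, with made-up values and no xapian involved, would look roughly like this:

```python
import json

meta = {
    "title": "A Local Search Engine",   # made-up example values
    "path": "search-engine.html",
    "mtime": 1624000000.0,
}
data = json.dumps(meta)        # the flat string that would go into the document
restored = json.loads(data)    # what the search code would decode again
print(restored["path"])
```

Since the string is opaque to xapian, adding another metadata field later only requires touching the code that writes and reads this dictionary.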

The other tricky part in here is that I only index children of the first element with an indexable class in the document. That's the key to keeping out all the tags, archive, and category files that pelican generates. But it means you will have to touch your templates if you want to adapt this to your pelican installation (see below). All other files are entered into the database, too, in order to avoid needlessly re-scanning them, but no tokens are associated with them, and hence they will never match a useful query.

[Update (2021-11-13): When you add the indexable class to your templates, also declare the language in order to support stemming; this would look like lang="{{ page.lang }}" (substituting article for page as appropriate).]

There is a big lacuna here: the recall, i.e., the ratio between the number of matching documents actually returned for a query and the number of documents that should (in some sense) match, really suffers in both German and English if you don't do stemming, i.e., fail to strip off grammatical suffixes from words.

Stemming is of course highly language-dependent. Fortunately, pelican's default metadata includes the language. Less fortunately, my templates don't communicate that metadata yet – but that would be quick to fix. The actual problem is that when I stem my documents, I'll also have to stem the incoming queries. Will I stem them for German or for English?

I'll think about that problem later and for now don't stem at all; if you remember that I don't stem, you can simply append an asterisk to your search term; that's not exactly the same thing, but ought to be good enough in many cases.
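To see why a trailing asterisk is only a rough substitute for stemming, consider plain prefix matching over a made-up vocabulary. Python's fnmatch serves purely as an illustration here; xapian expands wildcards against its own term list.

```python
import fnmatch

vocabulary = ["index", "indexer", "indexing", "indices", "india"]

# "index*" catches the regular inflections but misses the irregular plural
regular = fnmatch.filter(vocabulary, "index*")
print(regular)

# a shorter prefix catches "indices" too, at the price of unrelated words
broad = fnmatch.filter(vocabulary, "ind*")
print(broad)
```

A stemmer would map the related forms to one term without dragging in "india", which is the difference hinted at above.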

Using Xapian: Searching

Running searches using xapian is relatively straightforward: You open the database, parse the query, get the set of matches and then format the metadata you put in during indexing into links to the matches. In the code, that's in cgi_main; one could do paging here, but I figure spitting out 100 matches will be plenty, and distributing 100 matches on multiple HTML pages is silly (unless you're trying to optimise your access statistics; since I don't take those, that doesn't apply to me).
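The last step, turning stored metadata into links, needs nothing from xapian. The following is a sketch and not the code in cgi_main: format_matches is a hypothetical name, and the input is assumed to be the JSON strings stored per document at indexing time, already pulled from the match set.

```python
import html
import json


def format_matches(raw_metadata, limit=100):
    """Turn JSON metadata strings into an HTML list of links."""
    items = []
    for raw in list(raw_metadata)[:limit]:
        meta = json.loads(raw)
        # paths are relative to the output directory, so root them at "/"
        href = html.escape("/" + meta["path"].lstrip("/"), quote=True)
        title = html.escape(meta["title"])
        items.append(f'<li><a href="{href}">{title}</a></li>')
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


sample = [json.dumps(
    {"title": "Tags & Categories", "path": "tags.html", "mtime": 0})]
result = format_matches(sample)
print(result)
```

Escaping both the title and the path matters because they come from files on disk and may contain characters that are special in HTML.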

The part with the query parser deserves a second look, because xapian supports a fairly rich query language. I consider these its most useful features:

  • Phrase searches ("this is a phrase")
  • Exclusions (-dontmatch)
  • Matches only when two words appear within 10 tokens of each other (matches NEAR appear)
  • Trailing wildcard as in file patterns (trail*)

That last feature needs to be explicitly enabled, and since I find it somewhat unexpected that keyword arguments are not supported here, and perhaps even that the flag constant sits on the QueryParser object, here's how enabling wildcards in xapian looks in code:

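qp = xapian.QueryParser()
# explicit flags replace the defaults, so keep FLAG_DEFAULT (phrases, boolean operators, +/-)
parsed = qp.parse_query(query, qp.FLAG_DEFAULT | qp.FLAG_WILDCARD)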

Deploying this on your Pelican Installation

You can re-use my search script on your site relatively easily. It's one file, and if you're running an apache or something else that can run CGIs[2], making it run first is close to trivial: Install your equivalents of the Debian python3-xapian, python3-bs4, and python3-lxml packages. Perhaps you also need to explicitly allow CGI execution on your web server. In Debian's apache, that would be a2enmod cgi, elsewhere, you may need to otherwise arrange for mod_cgi or its equivalent to be loaded.

Then you need to dump blogsearch somewhere in the file system. While Debian has a default CGI directory defined, I'd suggest putting blogsearch somewhere next to your blog; I keep everything together in /var/blog (say), have the generated output in /var/blog/generated and would then keep the script in a directory /var/blog/cgi. Assuming this and apache, you'd then have something like:

DocumentRoot /var/blog/generated
ScriptAlias /bin /var/blog/cgi

in your configuration, presumably in a VirtualHost definition. In addition, you will have to tell the script where your pelican directory is. It expects that information in the environment variable BLOG_DIR; so, for apache, add:

SetEnv BLOG_DIR /var/blog/generated

to the VirtualHost.
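On the script's side, picking up that setting amounts to reading the process environment. A minimal sketch of how this might look (get_blog_dir and the error message are hypothetical, not lifted from blogsearch):

```python
import os
import sys


def get_blog_dir(environ=os.environ):
    """Return the directory named by BLOG_DIR, or exit with a message."""
    blog_dir = environ.get("BLOG_DIR")
    if not blog_dir:
        sys.exit("BLOG_DIR is not set; point it to pelican's output directory")
    return blog_dir


# passing a dictionary instead of os.environ makes this easy to try out
print(get_blog_dir({"BLOG_DIR": "/var/blog/generated"}))
```

Because apache's SetEnv and the env command both end up in the same process environment, one code path serves the CGI and the command line alike.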

After restarting your web server, the script would be ready (with the configuration above at <server-url>/bin/blogsearch, where the details obviously depend on your configuration) and ought to output its HTML form. But of course, there's no index yet, so searching will fail.

Before you create the index: As said above, blogsearch will ignore all material that is not a child of the (first) element with a class indexable in each document. Hence, you will have to change your article and page templates (in pelican) and embed the material you want to index into something like:

<div class="indexable">
  ...
</div>

It would not be hard to extend that scheme to multiple indexable elements in one page – just use BeautifulSoup's find_all method and write something like " ".join(soup_to_text(el) for el in content) in the index_text call. But I like the concept of having exactly one non-fluff region per page. For now.
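For illustration, here is what collecting text from several indexable regions could look like. To keep the sketch free of extra packages, it swaps BeautifulSoup for the standard library's html.parser, so it is not what blogsearch does; IndexableText is a made-up name, and the depth counting assumes well-nested markup with explicit closing tags.

```python
from html.parser import HTMLParser

# elements that never have an end tag and must not affect nesting depth
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class IndexableText(HTMLParser):
    """Collect the text inside every element carrying class "indexable"."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while inside an indexable element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or "indexable" in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


page = """<body><nav>Tags</nav>
<div class="indexable"><p>First <b>region</b></p><br></div>
<aside>fluff</aside>
<div class="indexable">Second region</div></body>"""
parser = IndexableText()
parser.feed(page)
text = " ".join(parser.chunks)
print(text)
```

The joined string plays the role of the argument to index_text; the navigation and the aside never make it into the index.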

Once you've re-made your site, you can index; the cgi doubles as its indexer, so the index will be built by adapting

env BLOG_DIR=/var/blog/generated /var/blog/cgi/blogsearch

to your system and letting it run. The script will only re-index what has changed. Since the index is written into the document directory (the documents are public anyway, so that seems rather harmless to me), whoever executes that will have to have write access there. Me, I'm running this as part of my install target.

Finally, you probably want to add a search widget somewhere in your base template. I'm using:

<form action="/bin/blogsearch" method="GET"
  style="display:flex;width:100%;flex-flow:row wrap">
  <input type="text" value="" name="q"
    placeholder="xapian syntax"
    style="flex-grow:1"/>
  <input type="submit" value="Find"
    style="flex-grow:0"/>
</form>

In case you're wondering where the HTML output of the script comes from: The templates are inline in the script. If you'd like to adapt those, let me know and I'll write something that lets you override the templates without having to hack the script.

To Do

In addition to the open question of doing proper, multi-language stemming, there is also the snippet thing. That commercial search engines give short contexts of the matched words is a useful feature. But that's hard to do unless I store the full article texts (well, as far as I index them) in the database.

[Update 2021-11-13: Multi-language stemming is now built in; start your queries with l:(xapian language code) to have the queries stemmed, and see above for how to declare the document languages]

Sure, by today's standards that's almost no data at all. I think I should just do it. But somehow it still feels wrong, and perhaps I should pull the snippets from the actual files? Hm.
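Whichever place the text ends up coming from, cutting out the context is the easy part. A naive sketch (make_snippet is a hypothetical helper; it only finds literal, case-insensitive occurrences, so wildcard or stemmed matches would need more work):

```python
def make_snippet(text, term, width=30):
    """Return the first occurrence of term with up to `width` characters
    of context on either side, or None if the term does not occur."""
    pos = text.lower().find(term.lower())
    if pos < 0:
        return None
    start = max(0, pos - width)
    end = min(len(text), pos + len(term) + width)
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end] + suffix


article = "At its core, xapian is not much more than an inverted index."
snippet = make_snippet(article, "Xapian", width=10)
print(snippet)
```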

[1] As someone who's run a YaCy node for several years, I feel entitled to that somewhat devastating statement.
[2] More on what to do when you're not running a CGI-enabled web server (as I do on the public site) in an upcoming post.
