Engelszüngeln: Stemming for the Search Engine

First off, here is a quick reference for the search syntax on this site (the search form links here):

Phrase searches ("this is a phrase")
Exclusions (-dontmatch)
Matches only when two words appear within 10 tokens of each other (matches NEAR appear)
Trailing wildcard as in file patterns (trail*)
Searches don't use stemming by default, but stem for German when introduced with l:de and for English when introduced with l:en
See also the Xapian syntax.

If you only came here for the search syntax, that's it, and you can stop reading here.

Otherwise, if you have read the previous post on my little search engine, you will remember I was a bit unhappy that I completely ignored the language of the posts. I had wanted to support stemming so that, ideally, a search for any of "search", "searches", "searching", and "searched" finds documents containing any of them. Being able to do that (without completely ruining precision) is obviously language-dependent, which means the first step to make it happen is to properly declare the language of your posts.

As discussed in the previous post, my blogsearch script only looks at elements with the CSS class indexable, and so I decided to have the language declaration there, too. In my templates, I hence now use:

<div class="indexable" lang="{{ article.lang }}">

or:

<div class="indexable" lang="{{ page.lang }}">

as appropriate.

The indexer interprets this rather straightforwardly: it pulls the value out of the attribute and asks xapian for a stemmer for the named language. That works for at least most European two-letter language codes, because those happen to coincide with what's legal in HTML's global lang attribute. It does not work for the more complex BCP 47 language tags like de-AT (where no actually existing stemmer would give results different from plain de anyway) or even sr-Latn-RS (for which, I think, no stemmer exists).
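To illustrate the first half of that step, here is a minimal sketch of pulling the lang value out of indexable elements. It is not the actual blogsearch code: it uses only Python's standard html.parser, and the names IndexableLangParser and indexable_langs are made up for this example. The resulting string is what would then be handed to xapian when asking for a stemmer.

```python
from html.parser import HTMLParser


class IndexableLangParser(HTMLParser):
    """Collect the lang attribute of every element with class "indexable".

    Hypothetical helper for illustration; not taken from blogsearch.
    """

    def __init__(self):
        super().__init__()
        self.langs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # class may hold several space-separated names, or be missing.
        classes = (attrs.get("class") or "").split()
        if "indexable" in classes:
            # None means no language was declared, so the indexer
            # would have to index this element without stemming.
            self.langs.append(attrs.get("lang") or None)


def indexable_langs(html_text):
    """Return the declared languages of all indexable elements, in order."""
    parser = IndexableLangParser()
    parser.feed(html_text)
    parser.close()
    return parser.langs
```

For a page rendered from the templates above with article.lang set to de, indexable_langs would return a list containing just "de".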

On searching, I was worried that enabling stemming would break unstemmed searches, but xapian's indexes are clever enough that that's not a problem. But I still cannot stem queries by default, because it is hard to guess their language from just a word or two. Hence, I have defined a query syntax extension: If you prefix your query with l:whatever, blogsearch will try to construct a xapian stemmer from whatever. If that fails, you'll get an error; if it succeeds, it will stem the query in that language.

As an aside, I considered for a moment whether it is a terribly good idea to hand through essentially unfiltered user input to a C++ API like xapian's. I eventually settled for just making it a bit harder to craft buffer overflows by saying:

lang = parts[0][2:30]

– that is, I'm only allowing through up to 28 characters of language code. Not that I expect that anything between my code and xapian's core has an overflow problem, but this would limit the amount of code someone could smuggle in should some vulnerability sneak in after all. Since it's essentially free, I'd say that's reasonable defensive programming.
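For context, here is a sketch of how the l: prefix might be split off a query around that slice. The function name split_language_prefix and the max_len parameter are assumptions for this example; only the [2:30] slice (28 characters after the two-character l: marker) comes from the line above.

```python
def split_language_prefix(query, max_len=28):
    """Split an optional leading "l:<code>" token off a query string.

    Returns (lang, rest). lang is None if the query has no prefix.
    Hypothetical helper; blogsearch itself may do this differently.
    """
    # Split off only the first whitespace-separated token, so the
    # rest of the query (phrases, NEAR, wildcards) stays untouched.
    parts = query.split(None, 1)
    if parts and parts[0].startswith("l:"):
        # Same effect as parts[0][2:30] for the default max_len:
        # skip "l:" and keep at most max_len characters.
        lang = parts[0][2:2 + max_len]
        rest = parts[1] if len(parts) > 1 else ""
        return lang, rest
    return None, query
```

With this, the query l:en going yields the language en and the remaining query going, while a query without the prefix comes back unchanged and without a language.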

In closing, I do not think stemmed searches will be used a lot, and as usual with these very simple stemmers, they leave a lot to be desired from a linguistic point of view. Compare, for instance, a simple search for "going" with the result of "l:en going" to see where this is supposed to go (and compare with the result when stemming as German). Then compare with "l:en went", which in an ideal world should return the same as "l:en going" but of course doesn't, at least not with the simple Snowball stemmer that xapian employs.

I'm still happy the feature's there, and I'm sure I'll need it one of these days.

And again, if you need a CGI that can index and query your static HTML collection with low deployment effort: you're welcome.

