SPARQL 2: Improvising a client

A Yak on a mountain path, watching the observer

There is still a lot of hair on the Yak I am shaving in this little series of posts on SPARQL. All the Yaks shown in the series lived on the Valüla Mountain in Vorarlberg, Austria.

This picks up my story on figuring out whether birthdays are dangerous using SPARQL on Wikidata. You can probably skip this part if you're only interested in writing SPARQL queries to Wikidata and are happy with the browser form they give you. But you shouldn't. On both counts.

At the end of part one, I, for one, was unhappy about the JavaScript-based UI at Wikidata and had decided I wanted a user interface that would let me edit my queries in a proper editor (in particular, locally on my machine, giving me the freedom to choose my tooling).

My browser's web inspector quickly showed me that the non-JavaScript web UI simply sent a query argument to https://query.wikidata.org/sparql. That's easy to do using curl, except I want to read the argument from a file (that is, the one I am editing in my vi). Helpfully, curl's man page has this to say on the --form option:

This enables uploading of binary files etc. To force the 'content' part to be a file, prefix the file name with an @ sign. To just get the content part from a file, prefix the file name with the symbol <. The difference between @ and < is then that @ makes a file get attached in the post as a file upload, while the < makes a text field and just get the contents for that text field from a file.
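
To make the difference concrete, here is a sketch with made-up file names and endpoint (the quotes keep the shell from treating < as a redirection):

# @ attaches report.pdf as a proper file upload, file name and all:
curl --form "attachment=@report.pdf" https://example.org/submit
# < merely fills the field's value with the contents of notes.txt:
curl --form "comment=<notes.txt" https://example.org/submit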

Uploads, Multipart, Urlencoded, Oh My!

In this case, Wikidata probably does not expect actual uploads in the query argument (and the form does not submit it in this way), so < it ought to be.

To try it, I put:

SELECT ?p ?o
WHERE {
  wd:Q937 ?p ?o.
}
LIMIT 5

(the query for everything Wikidata says about Albert Einstein, plus a LIMIT clause so I only pull five triples, both to reduce load on Wikidata and to reduce clutter in my terminal while experimenting) into a file einstein.rq. And then I typed:

curl --form "query=<einstein.rq" https://query.wikidata.org/sparql

into my shell. Soberingly, this gives:

Not writable.

Huh? I was not trying to write anything, was I? Well, who knows: Curl, in its man page, says that using --form does a POST with a media type of multipart/form-data, which many web components (mistakenly, I would argue) take as a file upload. Perhaps the remote machinery shares this misconception?
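
If you want to see what curl actually puts on the wire in a case like this, its --trace-ascii option helps; a - as its argument sends the trace to stdout:

curl --trace-ascii - --form "query=<einstein.rq" \
  https://query.wikidata.org/sparql | head -n 30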

Going back to the source of https://query.wikidata.org/, it turns out the form there does a GET, and the query parameter hence does not get uploaded in a POST but rather appended to the URL. Appending to the URL isn't trivial with curl (I think), but curl's --data option at least POSTs the parameters in application/x-www-form-urlencoded, which is what browsers do when you don't have uploads. It can read from files, too, using @<filename>. Let's try that:

curl --data query=@einstein.rq https://query.wikidata.org/sparql

Oh bother. That returns a lengthy message consisting of about a ton of Java traceback with an error message at its core:

org.openrdf.query.MalformedQueryException: Encountered " <LANGTAG> "@einstein "" at line 1, column 1.
Was expecting one of:
    "base" ...
    "prefix" ...
    "select" ...
    "construct" ...
    "describe" ...
    "ask" ...

Huh? Apparently, my query was malformed? Helpfully, Wikidata says what query it saw: queryStr=@einstein.rq. So, curl did not make good on its promise of putting in the contents of einstein.rq. Reading the man page again, this time properly, I have to admit I should have expected that: “if you start the data with the letter @”, it says there (emphasis mine). But haven't I regularly passed query parameters from files in this way in the past?

Sure I did, but I was using the --data-urlencode option, which is what actually simulates a browser and has a slightly different syntax again:

curl --data-urlencode query@einstein.rq https://query.wikidata.org/sparql

Ha! That does the trick. What comes back is a bunch of XML, starting with:

<sparql xmlns='http://www.w3.org/2005/sparql-results#'>
  <head>
    <variable name='p'/>
    <variable name='o'/>
  </head>
  <results>
    <result>
      <binding name='p'>
        <uri>http://schema.org/version</uri>
      </binding>
      <binding name='o'>
        <literal datatype='http://www.w3.org/2001/XMLSchema#integer'>1692345626</literal>
      </binding>
    </result>
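
Incidentally, appending to the URL turns out not to be hard after all: curl's -G option moves --data-style arguments into the URL and issues a GET, just like the browser form does:

curl -G --data-urlencode query@einstein.rq https://query.wikidata.org/sparql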

Making the Output Friendlier: Turtle?

Hm. That's not nice to read. I thought: Well, there's Turtle, a nice way to write RDF triples in plain text. In RDF land, people rather regularly support the HTTP accept header, a wildly underused and cool feature of HTTP that lets a client say what kind of data it would like to get (see Content negotiation on Wikipedia). So, I thought, perhaps I can tell Wikidata to produce Turtle using accept?

This plan looks like this when translated to curl:

curl --header "accept: text/turtle" \
  --data-urlencode query@einstein.rq https://query.wikidata.org/sparql

Only, the output does not change: Wikidata ignores my request.

Thinking again, the service is well advised to do so (though it could have produced a 406 Not Acceptable response; that would probably have been even less useful). The most important thing to remember from part one is that RDF talks about triples of subject, predicate, and object. In SPARQL, you have a SELECT clause, which means a result row in general will not consist of subject, predicate, and object. Hence, the service couldn't possibly return results in Turtle: what does not consist of RDF triples cannot be serialised as RDF triples.
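
If I had asked for triples in the first place – with a CONSTRUCT query instead of a SELECT – the accept header would at least have had a chance. A sketch I have not actually tried against Wikidata:

cat > einstein-construct.rq <<'EOF'
CONSTRUCT { wd:Q937 ?p ?o. }
WHERE {
  wd:Q937 ?p ?o.
}
LIMIT 5
EOF
curl --header "accept: text/turtle" \
  --data-urlencode query@einstein-construct.rq https://query.wikidata.org/sparql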

Making the Output Friendlier: XSLT!

But then what do I do instead to improve result readability? For quick and (relatively) easy XML manipulation on the command line, I almost always recommend xmlstarlet. Granted, its man page has ample room for improvement, and compared to writing XSL stylesheets, the command line options of xmlstarlet sel (use its -h option for explanations) are somewhat obscure; but it just works and is compact.

If you inspect the response from Wikidata, you will notice that the results come in result elements, which for every variable in your SELECT clause have one binding element, which in turn has a name attribute and then some sort of value in its content; for now, I'll settle for fetching either uri or literal (again, part one has a bit more on what that might mean). What I need to tell xmlstarlet thus is: “Look for all result elements and produce one output record per such element. Within each, make a name/value pair from a binding's name attribute and any uri or literal element you find.” In code, I furthermore need to add an XML prefix definition (that's totally orthogonal to RDF prefixes). With the original curl and a pipe, this results in:

curl --data-urlencode query@einstein.rq https://query.wikidata.org/sparql \
| xmlstarlet sel -T -N s="http://www.w3.org/2005/sparql-results#" -t \
  -m //s:result --nl -m s:binding -v @name -o = -v s:uri -v s:literal --nl
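
For those who would rather decode that than copy it blindly, here is my reading of the options (cf. xmlstarlet sel -h):

# -T                      output plain text instead of XML
# -N s=...                bind the XML namespace prefix s: for the patterns below
# -t                      start an output template
# -m //s:result --nl      for each result element: emit a newline first
# -m s:binding            then, for each binding element within the result:
# -v @name -o =           print the binding's name attribute and a "="
# -v s:uri -v s:literal   print whichever of uri or literal is present
# --nl                    and terminate the line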

Phewy. I told you xmlstarlet sel had a bit of an obscure command line. I certainly don't want to type that every time I run a query. Retyping what is largely constant across invocations is what shell aliases are for – or, because this one would be a bit long and fiddly, shell functions. Hence, I put the following into my ~/.aliases (which is being read by the shell in most distributions, I think; in case of doubt, ~/.bashrc would work whenever you use bash):

function wdq() {
  curl -s --data-urlencode "query@$1" https://query.wikidata.org/sparql \
  | xmlstarlet sel -T -N s="http://www.w3.org/2005/sparql-results#" -t \
    -m //s:result --nl -m s:binding -v @name -o = -v s:uri -v s:literal --nl
}

(notice the $1 instead of the constant file name here). With an exec bash – my preferred way to get a shell to reflect the current startup scripts – I can now type:

wdq einstein.rq | less

and get a nicely paged output like:

p=http://schema.org/version
o=1692345626

p=http://schema.org/dateModified
o=2022-07-31T01:52:04Z

p=http://schema.org/description
o=ލިޔުންތެރިއެއް

p=http://schema.org/description
o=ಗಣಿತಜ್ಞ

p=http://schema.org/description
o=भौतिकशास्त्रातील नोबेल पारितोषिकविजेता शास्त्रज्ञ.

We will look at how to filter out descriptions in languages one can't read, let alone speak, in the next instalment.

For now, I'm reasonably happy with this, except of course I'll get many queries wrong initially, and then Wikidata does not return XML at all. In that case, xmlstarlet produces nothing but an unhelpful error message of its own, because it refuses anything but XML.

To cope with that, it's nice to have a version of the function that just pages the raw response. I'm calling it wikidata-query-raw or wdqr for short:

function wdqr() {
  curl -s --data-urlencode "query@$1" \
    https://query.wikidata.org/sparql |& less
}
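
One could also combine the two into a function that first tries to format the response and only shows it raw when xmlstarlet balks – a sketch of my own, untested beyond the happy path:

function wdqa() {
  local resp
  resp=$(curl -s --data-urlencode "query@$1" https://query.wikidata.org/sparql)
  # try to format as name=value records; fall back to the raw response
  # (xmlstarlet exits non-zero when its input is not XML – and also when
  # nothing matched, in which case you get the raw XML, too)
  echo "$resp" | xmlstarlet sel -T -N s="http://www.w3.org/2005/sparql-results#" -t \
    -m //s:result --nl -m s:binding -v @name -o = -v s:uri -v s:literal --nl \
    2>/dev/null || echo "$resp"
}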

roqet

You might argue that all that experimentation was a bad idea anyway, because there is a proper SPARQL client packaged in Debian: roqet, from the rasqal-utils package. That does all of the above in a proper, non-shell-hacked way. Or should, at least.

When I discovered roqet, I first tried:

roqet einstein.rq

The way I have written einstein.rq above, this results in an error message:

roqet: Error -  - The namespace prefix in "wd:Q937" was not declared.
roqet: Error - URI https://query.wikidata.org/sparql:3 - syntax error, unexpected $end, expecting '}'

Clearly, roqet tries to parse the SPARQL, and since I have been relying on Wikidata for binding the prefix (wd:) to the proper URI, that has to fail (if you do not agree, see part one). So, let's add the prefix mapping to einstein.rq using the fairly intuitive SPARQL syntax for that:

PREFIX wd: <http://www.wikidata.org/entity/>
SELECT ?p ?o
WHERE {
  wd:Q937 ?p ?o.
}
LIMIT 5

– the button with the pin on https://query.wikidata.org/ lets you generate those, for instance. Then,

$ roqet einstein.rq
roqet: Running query from file einstein.rq
roqet: Query has a variable bindings result
roqet: Query returned 0 results

What? Why are there 0 results? Well… Thinking again… we have not told roqet that we would like to query Wikidata, have we? Indeed, we have not.

This kind of silent failure is not unusual in RDF land: the whole system is so flexible and easy-going that it's easy to write things that are fairly nonsensical but technically correct. Here, roqet simply looks for statements on wd:Q937 in its internal triple store and understandably finds nothing.
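
For completeness: if I read roqet's man page correctly, its -D option would let one load RDF into that local store before running the query – something like this, with a hypothetical local dump file:

roqet -D einstein-dump.ttl einstein.rq

But I want Wikidata's live data, not a local copy.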

A quick grep in roqet's man page digs up:

-p, --protocol SERVICE-URI
       Call  the  SPARQL HTTP protocol SERVICE-URI to execute the query
       instead of executing it inside the Rasqal query engine locally
       (when -e is given, or a query string given)

Let's see:

$ roqet -p https://query.wikidata.org/sparql einstein.rq
[some noise elided]
roqet: Error - URI [noise] - Resolving URI failed with HTTP status 403

What? HTTP 403 is a stark “Forbidden”; in contrast to 401, where you could perhaps authenticate, this is an unconditional No. Yikes. Why would that be, given I've hammered the URI from curl without any problem? Do the Wikidata folks lock out roqet? Or perhaps the library it is built on?

Well, this is one of the many cases in which I wish people wouldn't force https on me. With plain http, I could watch what roqet and Wikidata discuss between the two of them, as in:

sudo ngrep ".*" host query.wikidata.org and port 80

Regrettably, Wikidata forces HTTPS on everyone, too, so all that yields is:

T 2620:0:862:ed1a::1:80 -> [...]:40638 [AP] #6
HTTP/1.1 301 TLS Redirect..Date: Sun, 31 Jul 2022 09:16:59 GMT..Server: Var
nish..X-Varnish: 379230833..X-Cache: cp3060 int..X-Cache-Status: int-front.
...

Boo. Please don't do that – think of cases like this one, or of un-upgradable ancient hardware. Use upgrade-insecure-requests if you want, but let people use http if they want to.

I could now run a man-in-the-middle attack against myself to see, after all, what my computer and Wikidata tell each other, but that's complex (starting with: making roqet use a proxy). Perhaps roqet can say a bit more about what Wikidata replies? Grepping for “debug” in the man page only brings up the -d option, which does not apply here (that's about query parsing, not service communication).
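
For the record, that plan would have looked roughly like this – with mitmproxy (packaged in Debian) as the proxy, and assuming that raptor's underlying libcurl honours the usual proxy environment variables, which I have not verified:

# in one terminal: run a local intercepting proxy
mitmproxy --listen-port 8080
# in another: point the client at it; for https, the client would
# additionally have to trust mitmproxy's CA certificate
https_proxy=http://localhost:8080 roqet -p https://query.wikidata.org/sparql einstein.rq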

But given the forced https upgrade, I gave up my privacy and asked Google about “wikidata 403 rasqal” (rasqal is the library underlying roqet), and based on the matches, I again gave up my privacy and looked at this stackoverflow question from September 2021 that unfortunately went unanswered.

This is how forced https kills people's privacy.

Aborted Debugging

Well, the RDF community apparently is rather small. I still was not prepared to MITM myself, and instead fetched the source (see my story on understanding a netsurf issue for my practices in that department).

Entering the unpacked src directory, things look reasonably clean: A few dozen C files and headers, a parser generated by lex and yacc – nice. But where do they do their http connections that I'd like to watch? grep http clearly will go nowhere because of the sheer number of false positives. grep connection? No… the underlying SPARQL notion is that of a SERVICE, so… rasqal_service.c? Hm… This talks a lot about raptor_uri. Raptor is a well-known piece of RDF machinery. So that's what it uses to query Wikidata?

A quick apt-get source raptor2-utils indeed brings up a file raptor_www_curl.c, which contains a lot of code dealing with raptor_uri-s. That gives me a lot of symbols I might want to set breakpoints on when debugging, so I did as in the netsurf case and installed the debug symbols for libraptor2-0. Except: there are none. Oh, yikes. Perhaps I had better MITM myself after all? Nah, I had tired of this particular game at this point.

Given I'm happy with my clients for now, I have just filed a bug on Wikidata rather than keep looking for a workaround.

Let's see what comes of that.

