SPARQL 2: Improvising a client
There is still a lot of hair on the Yak I am shaving in this little series of posts on SPARQL. All the Yaks shown in the series lived on the Valüla Mountain in Vorarlberg, Austria.
This picks up my story on figuring out whether birthdays are dangerous using SPARQL on Wikidata. You can probably skip this part if you're only interested in writing SPARQL queries to Wikidata and are happy with the browser form they give you. But you shouldn't. On both counts.
At the end of part one, I, for one, was unhappy about the Javascript-based UI at Wikidata and had decided I wanted a user interface that would let me edit my queries in a proper editor (in particular, locally on my machine, giving me the freedom to choose my tooling).
My browser's web inspector quickly showed me that the non-Javascript web UI simply sent a query argument to https://query.wikidata.org/sparql. That's easy to do using curl, except I want to read the argument from a file (that is, the one I am editing in my vi). Helpfully, curl's man page has this to say on the --form option:
This enables uploading of binary files etc. To force the 'content' part to be a file, prefix the file name with an @ sign. To just get the content part from a file, prefix the file name with the symbol <. The difference between @ and < is then that @ makes a file get attached in the post as a file upload, while the < makes a text field and just get the contents for that text field from a file.
Uploads, Multipart, Urlencoded, Oh My!
In this case, Wikidata probably does not expect actual uploads in the query argument (and the form does not submit it in this way), so < it ought to be.
To try it, I put:
SELECT ?p ?o
WHERE {
wd:Q937 ?p ?o.
}
LIMIT 5
(the query for everything Wikidata says about Albert Einstein, plus a LIMIT clause so I only pull five triples, both to reduce load on Wikidata and to reduce clutter in my terminal while experimenting) into a file einstein.rq. And then I typed:
curl --form query=<einstein.rq https://query.wikidata.org/sparql
into my shell. Soberingly, this gives:
Not writable.
Huh? I was not trying to write anything, was I? Well, who knows: Curl, in its man page, says that using --form does a POST with a media type of multipart/form-data, which many web components (mistakenly, I would argue) take as a file upload. Perhaps the remote machinery shares this misconception?
Going back to the source of https://query.wikidata.org/, it turns out the form there does a GET, and the query parameter hence does not get uploaded in a POST but rather appended to the URL. Appending to the URL isn't trivial with curl (I think), but curl's --data option at least POSTs the parameters in application/x-www-form-urlencoded, which is what browsers do when you don't have uploads. It can read from files, too, using @<filename>. Let's try that:
curl --data query=@einstein.rq https://query.wikidata.org/sparql
Oh bother. That returns a lengthy message with a ton of Java traceback and an error message at its core:
org.openrdf.query.MalformedQueryException: Encountered " <LANGTAG> "@einstein "" at line 1, column 1.
Was expecting one of:
"base" ...
"prefix" ...
"select" ...
"construct" ...
"describe" ...
"ask" ...
Huh? Apparently, my query was malformed? Helpfully, Wikidata says what query it saw: queryStr=@einstein.rq. So, curl did not make good on its promise of putting in the contents of einstein.rq. Reading the man page again, this time properly, I have to admit I should have expected that: “if you start the data with the letter @”, it says there (emphasis mine). But haven't I regularly put in query parameters in this way in the past?
Sure I did, but I was using the --data-urlencode option, which is what actually simulates a browser and has a slightly different syntax again:
curl --data-urlencode query@einstein.rq https://query.wikidata.org/sparql
Ha! That does the trick. What comes back is a bunch of XML, starting with:
<sparql xmlns='http://www.w3.org/2005/sparql-results#'>
<head>
<variable name='p'/>
<variable name='o'/>
</head>
<results>
<result>
<binding name='p'>
<uri>http://schema.org/version</uri>
</binding>
<binding name='o'>
<literal datatype='http://www.w3.org/2001/XMLSchema#integer'>1692345626</literal>
</binding>
</result>
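Incidentally, appending the parameters to the URL is less tricky than I feared: curl's -G (or --get) flag makes it append all --data and --data-urlencode parameters to the URL as a query string and issue a GET, just like the browser form does. A sketch as a little helper (the function name is my invention):

```shell
# curl's -G (--get) flag turns --data-urlencode parameters into a query
# string appended to the URL and issues a GET -- mimicking the browser
# form. The function name wdq_get is my own invention.
wdq_get() {
  curl -sG --data-urlencode "query@$1" https://query.wikidata.org/sparql
}
```

Calling wdq_get einstein.rq then performs essentially the same request the browser form would.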
Making the Output Friendlier: Turtle?
Hm. That's not nice to read. I thought: Well, there's Turtle, a nice way to write RDF triples in plain text. In RDF land, people rather regularly support the HTTP accept header, a wildly underused and cool feature of HTTP that lets a client say what kind of data it would like to get (see Content negotiation in the Wikipedia). So, I thought, perhaps I can tell Wikidata to produce Turtle using accept?
This plan looks like this when translated to curl:
curl --header "accept: text/turtle" \
    --data-urlencode query@einstein.rq https://query.wikidata.org/sparql
Only, the output does not change: Wikidata ignores my request.
Thinking about it again, the service is well advised to do so (though it could have produced a 406 Not Acceptable response, which would probably be even less useful). The most important thing to remember from part one is that RDF talks about triples of subject, predicate, and object. In SPARQL, you have a SELECT clause, which means a result row in general will not consist of subject, predicate, and object. Hence, the service couldn't possibly return results in Turtle: what does not consist of RDF triples cannot be serialised as RDF triples.
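If what you are after really is Turtle, there is a way: SPARQL's CONSTRUCT form returns a genuine RDF graph rather than variable bindings, and a graph can be serialised as Turtle. Here is a sketch (the helper name and file name are my inventions, and I have not verified that Wikidata honours the accept header for CONSTRUCT queries):

```shell
# Hypothetical helper: run a query file and ask for Turtle output.
# Whether Wikidata honours the accept header here is an assumption.
wdq_turtle() {
  curl -s --header "accept: text/turtle" \
    --data-urlencode "query@$1" https://query.wikidata.org/sparql
}

# A CONSTRUCT query yields actual triples, so Turtle makes sense for it:
cat > einstein-construct.rq <<'EOF'
CONSTRUCT { wd:Q937 ?p ?o. }
WHERE { wd:Q937 ?p ?o. }
LIMIT 5
EOF
```

With that in place, wdq_turtle einstein-construct.rq ought to produce a small Turtle document rather than the XML bindings.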
Making the Output Friendlier: XSLT!
But then what do I do instead to improve result readability? For quick and (relatively) easy XML manipulation on the command line, I almost always recommend xmlstarlet. I'll grant you that its man page has ample room for improvement and that the command line options of xmlstarlet sel (use its -h option for explanations) are somewhat obscure; but compared to writing XSL stylesheets it is compact, and it just works.
If you inspect the response from Wikidata, you will notice that the results come in result elements, which for every variable in your SELECT clause have one binding element, which in turn has a name attribute and then some sort of value in its content; for now, I'll settle for fetching either uri or literal (again, part one has a bit more on what that might mean). What I need to tell xmlstarlet thus is: “Look for all result elements and produce one output record per such element. Within each, make a name/value pair from a binding's name attribute and any uri or literal element you find.” In code, I furthermore need to add an XML namespace prefix definition (that's totally orthogonal to RDF prefixes). With the original curl and a pipe, this results in:
curl --data-urlencode query@einstein.rq https://query.wikidata.org/sparql \
    | xmlstarlet sel -T -N s="http://www.w3.org/2005/sparql-results#" -t \
    -m //s:result --nl -m s:binding -v @name -o = -v s:uri -v s:literal --nl
Phewy. I told you xmlstarlet sel had a bit of an obscure command line. I certainly don't want to type that every time I run a query. Saving keystrokes that are largely constant across multiple command invocations is what shell aliases are for, or, because this one would be a bit long and fiddly, shell functions. Hence, I put the following into my ~/.aliases (which is being read by the shell in most distributions, I think; in case of doubt, ~/.bashrc would work whenever you use bash):
function wdq() {
curl -s --data-urlencode "query@$1" https://query.wikidata.org/sparql \
    | xmlstarlet sel -T -N s="http://www.w3.org/2005/sparql-results#" -t \
-m //s:result --nl -m s:binding -v @name -o = -v s:uri -v s:literal --nl
}
(notice the $1 instead of the constant file name here). With an exec bash – my preferred way to get a shell reflecting the current startup scripts –, I can now type:
wdq einstein.rq | less
and get a nicely paged output like:
p=http://schema.org/version
o=1692345626

p=http://schema.org/dateModified
o=2022-07-31T01:52:04Z

p=http://schema.org/description
o=ލިޔުންތެރިއެއް

p=http://schema.org/description
o=ಗಣಿತಜ್ಞ

p=http://schema.org/description
o=भौतिकशास्त्रातील नोबेल पारितोषिकविजेता शास्त्रज्ञ.
We will look at how to filter out descriptions in languages one can't read, let alone speak, in the next instalment.
For now, I'm reasonably happy with this, except of course I'll get many queries wrong initially, and then Wikidata does not return XML at all. In that case, xmlstarlet produces nothing but an unhelpful error message of its own, because it …
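One obvious improvement would be a variant of wdq that inspects the HTTP status before handing the response to xmlstarlet, so that Wikidata's error message at least becomes visible. A sketch (the function name and the use of a temporary file are my choices):

```shell
# Sketch of an error-aware wdq: capture the HTTP status via -w and only
# run xmlstarlet on a 200 response; otherwise dump the server's (usually
# plain-text) complaint. Function name and temp file are my choices.
wdq_checked() {
  local tmp status
  tmp=$(mktemp) || return 1
  status=$(curl -s -o "$tmp" -w '%{http_code}' \
    --data-urlencode "query@$1" https://query.wikidata.org/sparql)
  if [ "$status" = 200 ]; then
    xmlstarlet sel -T -N s="http://www.w3.org/2005/sparql-results#" -t \
      -m //s:result --nl -m s:binding -v @name -o = -v s:uri -v s:literal \
      --nl "$tmp"
  else
    echo "wdq: Wikidata returned HTTP $status:" >&2
    cat "$tmp" >&2
    rm -f "$tmp"
    return 1
  fi
  rm -f "$tmp"
}
```

With that, a botched query at least shows the MalformedQueryException from above instead of xmlstarlet's own grumbling.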