SPARQL and Wikidata 1: Setting out
If you continue, you will read about a first-rate example of Yak Shaving
While listening to a short biography of the astrophysicist Mary Lea Heger (my story; sorry, in German), I learned that she died on her birthday. That made me wonder: How common is that? Are people prone to die on their birthdays, perhaps because the parties are so strenuous, perhaps because they consider them a landmark that they are so determined to reach that they hold on to dear life until they have reached it? Or are they perhaps less likely to die because all that attention strengthens their spirits?
I figured that could be a nice question for Wikidata, a semantic database that feeds Wikipedia with all kinds of semi-linguistic or numeric information. Even if you are not Wikipedia, you can run fairly complex queries against it using a language called SPARQL. I've always wanted to play with that, in particular because SPARQL seems an interesting piece of tech. Answering the question of the letality of birthdays turned out to be a mildly exciting (in a somewhat nerdy sense) journey, and I thought my story of how I did my first steps with SPARQL might be suitably entertaining.
Since it is a relatively long story, I will split it up into a few instalments. This first part relates a few preliminaries and then does the first few (and very simple) queries. The preliminaries are mainly introducing the design of (parts of) RDF with my take on why people built it like that.
Basics: RDF in a few paragraphs
For motivating the Resource Description Format RDF and why people bother with it, I couldn't possibly do better than Norman Gray in his witty piece on Jordan, Jordan and Jordan. For immediate applicability, Wikidata's User Manual is hard to beat.
But if you're in a hurry, you can get by with remembering that within RDF, information is represented in triples of (subject, predicate, object). This is somewhat reminiscent of a natural-language sentence, although the “predicate“ typically would be a full verb phrase, possibly with a few prepositions sprinkled in for good measure. Things typically serving as predicates are called “property“ in RDF land, and the first example for those defined in wikidata, P10[1], would be something like has-a-video-about-it-at or so, as in:
"Christer Fuglesang", P10, http://commons.wikimedia.org/wiki/Special:FilePath/Christer%20Fuglesang%20en.webm "Christer Fuglesang", P10, http://commons.wikimedia.org/wiki/Special:FilePath/Christer%20Fuglesang%20ru.webm
If you know about first order logic: It's that kind of predicate. And please don't ask who Christer Fuglesang is, the triples just came up first in a query you will learn about a bit further down.
This was a bit of a simplification, because RDF will not usually refer to a thing (an “entity“ in RDF jargon) with a string (“literal”), if only because there could be multiple Christer Fuglesangs and a computer would be at a loss as to which one I mean in the two triples above. RDF instead talks about “resources“, which is anything that has a URI and encompasses both entities and properties. So, a statement as above would actually combine three URIs:
http://www.wikidata.org/entity/Q317382, http://www.wikidata.org/prop/direct/P10, http://commons.wikimedia.org/wiki/Special:FilePath/Christer%20Fuglesang%20en.webm
CURIEs
That is a lot of stuff to type, and thus almost everything in the RDF world supports URL abbreviation using prefixes. Basically, in some way you say that whenever there's wpt: in a token, some magic replaces it by http://www.wikidata.org/prop/direct/. Ff you know about XML namespaces (and if you're doing any sort of non-trivial XML, you should): same thing, except that the exact syntax for writing down the mapping from prefixes to URIs depends on how you represent the RDF triples.
These “words with a colon that expand to long URIs by some find-and-replace rules“ were called CURIEs, compact URIs, for a while, but I think that term has become unpopular again. I consider this a bit of a pity, because it was nice to have a name for them, and such a physics-related one on top. But it seems nobody cared enough to push the W3C draft for that ahead.
As with XML namespaces, each RDF document could use its own prefix mapping; if you want, you could instruct an RDF processor to let you write wikidata-direct-property: for http://www.wikidata.org/prop/direct/ rather than wpt: as usual. But that would be an unfriendly act. At least for the more popular collections of RDF resources, there are canonical prefixes: you don't have to use them, but everyone will hate you if you don't. In particular, don't bind well-known prefixes like foaf: (see below) to URIs other than the canonical ones except when seeing whether a piece of software does it right or setting traps for unsuspecting people you don't like.
Then again, for what we are doing here, you do not need to bother about prefix mappings at all, because the wikidata engine has all prefixes we will use prefined and built in. So, as long as you are aware that you can replace the funny prefixes with URI parts and that there is some place where those URIs parts are defined, you are fine.
Long URIs in RDF?
You could certainly ask why we're bothering with the URIs in the first place if people in practice use the canonical prefixes almost exclusively. I think the main reason RDF was built on URIs was because its parents on the one hand wanted to let everyone “build” resources with minimal effort. On the other hand, they wanted to ensure as best they could that two people would not accidentally use the same resource identifier while meaning different things.
To ensure the uniqueness of identifiers, piggybacking on the domain name system, which already makes sure that there are never two machines called, say, blog.tfiu.de in the world, is a rather obvious move. In HTTP URIs, domain names show up as the authority (the host part, the thing between the first colon and the double slash), and so with URIs of that sort you can start creating (if you will) your resources and would never conflict with anyone else as long as hold on to your domain.
In addition, nobody can predict which of these namespace URIs will become popular enough to warrant a globally reserved prefix of their own; you see, family-safe prefixes with four (or fewer) letters are a rather scarce resource, and you don't want to run a registry of those. If you did, you would become really unpopular with all the people you had to tell things like “no, your stuff is too unimportant for a nice abbreviation, but you can have aegh7Eba-veeHeql1:“
The admittedly unwieldy URIs in practice also have a big advantage, and one that would require some complex scheme like the Handle system if you wanted to replicate it with prefixes: most of the time, you can resolve them.
Non-speaking URIs
While RDF itself does not care, most URIs in this business actually resolve to something that is readable in a common web browser; you can always try it with the resources I will be mentioning later. This easy resolution is particularly important in the case of Wikidata's URIs, which are essentially just numbers. Except for a few cases (wd:Q42 is Douglas Adams, and wd:Q1 is the Universe), these numbers don't tell you anything.
There is no fixed rule that RDF URIs must have a lexical form that does not suggest their meaning. As a counterexample, http://xmlns.com/foaf/0.1/birthday is a property linking a person with its, well, birthday in a popular collection of RDF resources[2] called foaf (as in friend of a friend – basically, you can write RDF-complicant address books with that).
There are three arguments I have heard against URIs with such a speaking form:
- Don't favour English (a goal that the very multilingual Wikipedia projects might plausibly have).
- It's hard to automatically generate such URIs (which certainly is an important point when someone generates resources as quickly and with minimal supervision as Wikidata).
- People endlessly quarrel about what should be in the URI when they should really be quarrelling about the label, i.e., what is actually shown to readers in the various natural languages, and still more about the actual concepts and definitions. Also, you can't repair the URI if you later find you got the lexical form slightly wrong, whereas it's easy to fix labels.
I'm not sure which of these made Wikidata choose their schema of Q<number> (for entities) and P<number> (for properties) – but all of them would apply, so that's what we have: without looking …
![[RSS]](./theme/image/rss.png)