SPARQL and Wikidata 1: Setting out

If you continue, you will read about a first-rate example of Yak Shaving

While listening to a short biography of the astrophysicist Mary Lea Heger (my story; sorry, in German), I learned that she died on her birthday. That made me wonder: How common is that? Are people prone to die on their birthdays, perhaps because the parties are so strenuous, perhaps because they consider them a landmark that they are so determined to reach that they hold on to dear life until they have reached it? Or are they perhaps less likely to die because all that attention strengthens their spirits?

I figured that could be a nice question for Wikidata, a semantic database that feeds Wikipedia with all kinds of semi-linguistic or numeric information. Even if you are not Wikipedia, you can run fairly complex queries against it using a language called SPARQL. I've always wanted to play with that, in particular because SPARQL seems an interesting piece of tech. Answering the question of the lethality of birthdays turned out to be a mildly exciting (in a somewhat nerdy sense) journey, and I thought my story of how I took my first steps with SPARQL might be suitably entertaining.

Since it is a relatively long story, I will split it up into a few instalments. This first part relates a few preliminaries and then does the first few (and very simple) queries. The preliminaries are mainly introducing the design of (parts of) RDF with my take on why people built it like that.

Basics: RDF in a few paragraphs

For motivating the Resource Description Framework (RDF) and why people bother with it, I couldn't possibly do better than Norman Gray in his witty piece on Jordan, Jordan and Jordan. For immediate applicability, Wikidata's User Manual is hard to beat.

But if you're in a hurry, you can get by with remembering that within RDF, information is represented in triples of (subject, predicate, object). This is somewhat reminiscent of a natural-language sentence, although the “predicate” typically would be a full verb phrase, possibly with a few prepositions sprinkled in for good measure. Things typically serving as predicates are called “properties” in RDF land, and the first of those defined in Wikidata, P10[1], would be something like has-a-video-about-it-at or so, as in:

"Christer Fuglesang", P10, http://commons.wikimedia.org/wiki/Special:FilePath/Christer%20Fuglesang%20en.webm
"Christer Fuglesang", P10, http://commons.wikimedia.org/wiki/Special:FilePath/Christer%20Fuglesang%20ru.webm

If you know about first-order logic: It's that kind of predicate. And please don't ask who Christer Fuglesang is; the triples just came up first in a query you will learn about a bit further down.

This was a bit of a simplification, because RDF will not usually refer to a thing (an “entity” in RDF jargon) with a string (a “literal”), if only because there could be multiple Christer Fuglesangs and a computer would be at a loss as to which one I mean in the two triples above. RDF instead talks about “resources”, a term that covers anything that has a URI and encompasses both entities and properties. So, a statement as above would actually combine three URIs:

http://www.wikidata.org/entity/Q317382, http://www.wikidata.org/prop/direct/P10, http://commons.wikimedia.org/wiki/Special:FilePath/Christer%20Fuglesang%20en.webm

CURIEs

That is a lot of stuff to type, and thus almost everything in the RDF world supports URL abbreviation using prefixes. Basically, in some way you say that whenever there's wdt: in a token, some magic replaces it by http://www.wikidata.org/prop/direct/. If you know about XML namespaces (and if you're doing any sort of non-trivial XML, you should): same thing, except that the exact syntax for writing down the mapping from prefixes to URIs depends on how you represent the RDF triples.

These “words with a colon that expand to long URIs by some find-and-replace rules” were called CURIEs, compact URIs, for a while, but I think that term has become unpopular again. I consider this a bit of a pity, because it was nice to have a name for them, and such a physics-related one on top. But it seems nobody cared enough to push the W3C draft for that ahead.

As with XML namespaces, each RDF document could use its own prefix mapping; if you want, you could instruct an RDF processor to let you write wikidata-direct-property: for http://www.wikidata.org/prop/direct/ rather than wdt: as usual. But that would be an unfriendly act. At least for the more popular collections of RDF resources, there are canonical prefixes: you don't have to use them, but everyone will hate you if you don't. In particular, don't bind well-known prefixes like foaf: (see below) to URIs other than the canonical ones except when seeing whether a piece of software does it right or setting traps for unsuspecting people you don't like.

Then again, for what we are doing here, you do not need to bother about prefix mappings at all, because the Wikidata query engine has all the prefixes we will use predefined and built in. So, as long as you are aware that you can replace the funny prefixes with URI parts and that there is some place where those URI parts are defined, you are fine.
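
For engines that do not come with built-in prefixes, the mapping has to be written out. As a minimal sketch in SPARQL (the query language this series is about; the SELECT part is explained further down), the declarations for the two Wikidata prefixes used in this post would look roughly like this. The variable name ?video is merely my choice for illustration:

```sparql
# Explicit prefix declarations. On Wikidata's own query service
# these are predefined, so spelling them out is optional there.
PREFIX wd:  <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

# wd:Q317382 now expands to http://www.wikidata.org/entity/Q317382,
# wdt:P10 to http://www.wikidata.org/prop/direct/P10.
SELECT ?video
WHERE {
  wd:Q317382 wdt:P10 ?video.
}
```

The PREFIX lines do nothing beyond the find-and-replace described above; binding other names to the same URIs would work just as well technically, with the social consequences already mentioned.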

Long URIs in RDF?

You could certainly ask why we're bothering with the URIs in the first place if people in practice use the canonical prefixes almost exclusively. I think the main reason RDF was built on URIs is that its parents, on the one hand, wanted to let everyone “build” resources with minimal effort. On the other hand, they wanted to ensure as best they could that two people would not accidentally use the same resource identifier while meaning different things.

To ensure the uniqueness of identifiers, piggybacking on the domain name system, which already makes sure that there are never two machines called, say, blog.tfiu.de in the world, is a rather obvious move. In HTTP URIs, domain names show up as the authority (the host part, the thing between the double slash and the next slash), and so with URIs of that sort you can start creating (if you will) your resources and will never conflict with anyone else as long as you hold on to your domain.

In addition, nobody can predict which of these namespace URIs will become popular enough to warrant a globally reserved prefix of their own; you see, family-safe prefixes with four (or fewer) letters are a rather scarce resource, and you don't want to run a registry of those. If you did, you would become really unpopular with all the people you had to tell things like “no, your stuff is too unimportant for a nice abbreviation, but you can have aegh7Eba-veeHeql1:”.

The admittedly unwieldy URIs in practice also have a big advantage, and one that would require some complex scheme like the Handle system if you wanted to replicate it with prefixes: most of the time, you can resolve them.

Non-speaking URIs

While RDF itself does not care, most URIs in this business actually resolve to something that is readable in a common web browser; you can always try it with the resources I will be mentioning later. This easy resolution is particularly important in the case of Wikidata's URIs, which are essentially just numbers. Except for a few cases (wd:Q42 is Douglas Adams, and wd:Q1 is the Universe), these numbers don't tell you anything.

There is no fixed rule that RDF URIs must have a lexical form that does not suggest their meaning. As a counterexample, http://xmlns.com/foaf/0.1/birthday is a property linking a person with their, well, birthday in a popular collection of RDF resources[2] called foaf (as in friend of a friend – basically, you can write RDF-compliant address books with that).

There are three arguments I have heard against URIs with such a speaking form:

  • Don't favour English (a goal that the very multilingual Wikipedia projects might plausibly have).
  • It's hard to automatically generate such URIs (which certainly is an important point when someone generates resources as quickly and with minimal supervision as Wikidata).
  • People endlessly quarrel about what should be in the URI when they should really be quarrelling about the label, i.e., what is actually shown to readers in the various natural languages, and still more about the actual concepts and definitions. Also, you can't repair the URI if you later find you got the lexical form slightly wrong, whereas it's easy to fix labels.

I'm not sure which of these made Wikidata choose their scheme of Q<number> (for entities) and P<number> (for properties) – but all of them would apply, so that's what we have: without looking up the label(s), there's really no indication what your average Wikidata resource is.

SPARQL

With these preliminaries, I think one can understand what Wikidata is: It's a queryable collection of RDF triples (a “triplestore”), where the subjects often correspond to Wikipedia entries, the properties often are useful to fill Wikipedia templates, and the objects can be a lot of things: Dates, video URLs as we saw above, other entities described in the Wikipedia. Or there's something completely different in there.

I believe the Wikipedia still primarily uses Wikidata to fill templated boxes. Consider, for instance, the page for the Lick Observatory (which is where Mary Lea Heger, the person who started my present effort, was active). The box in the article basically reflects a lot of triples linking wd:Q461613 (I used the search service mentioned below to figure out that's Wikidata's CURIE for the Lick Observatory) with, for instance, wd:Q629500 (that's James Lick; predicate: wdt:P138, named after) or the literal 1283 metres (predicate: wdt:P2044, elevation above sea level).

You can also see all of that and more in the dump of the Wikidata triples at http://www.wikidata.org/entity/Q461613 (remember: this URI is totally equivalent to wd:Q461613 given Wikidata's prefix mapping). To generate the box in the Wikipedia article from that data, you would be asking “what predicates and objects do you know when the subject is wd:Q461613?” or, more likely in this case where you put in very specific information in a very specific order, “what object(s) do you have for the subject wd:Q461613 and the predicate wdt:P138?”.

But then there are the category-like things, such as “People born on April 5th”. That leads to a query where you give the predicate and the object (more or less the day April 5th) and look for all matching subjects. Or perhaps even “what went on on April 5th?”, where you might just give the object and ask for all subjects and predicates.

To easily accommodate all of these modes, SPARQL lets you give a triple template, where you fill in whatever parts you have and write the equivalent of question marks for what you're looking for. Add in a bit of syntax (e.g., triples are separated by dots) and a SELECT clause as in SQL to let you pick specific parts of your matched triples, and you end up with a first SPARQL query:

SELECT ?s ?o
WHERE {
  ?s wdt:P10 ?o.
}
LIMIT 10

Try it – just looking at the output, I think, helps a lot when you want to work out the basics of SPARQL.

Oh: You remember P10? That was has-a-video-about-it-at, so this query will return 10 pairs of entities and URIs of videos about them.
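
The same template mechanism covers the other modes mentioned above. As a sketch, fixing both the subject and the predicate and leaving only the object open asks who the Lick Observatory is named after; the identifiers are the ones quoted in the discussion of the article box:

```sparql
# Subject and predicate given, object wanted:
# Lick Observatory (wd:Q461613), named after (wdt:P138).
SELECT ?o
WHERE {
  wd:Q461613 wdt:P138 ?o.
}
```

Going by that box, this should come back with wd:Q629500, James Lick, though what exactly is returned depends on the state of Wikidata at the time you run it.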

And what about birthdays now?

Except for the thing with the properties named P<something> in Wikidata, this is about how much I knew when I set out to figure out how many people died on their birthdays and boldly went to Wikidata's query service at https://query.wikidata.org/. Armed with that amount of knowledge, I of course saw no need to read the user manual before starting to type. As usual, in the end I wished I had at least skimmed it right at the start. As usual, I didn't.

As it was, I first wondered: “What property is died-on?”. Fortunately, I had read enough documentation to have seen the search service at https://www.wikidata.org/wiki/Special:Search that helps turn natural language into Wikidata identifiers. Typing “death” there gave me

  • Death (Q161936) personification of death – 57 statements, 50 sitelinks
  • Death (Q192843) American death metal band – 39 statements, 46 sitelinks
  • melodic death metal (Q253918) subgenre of death metal – 16 statements, 51 sitelinks
  • death (Q4) permanent cessation of vital functions – 137 statements, 210 sitelinks

and so forth. The first thing that surprised me was that none of these things had the hundreds of thousands of “statements” (so: triples?) I would have expected for died-on. Of course, had I read the user manual, I would have noticed that all of these are entities (Q<whatever>). And it probably would have occurred to me that the number of statements probably refers to the number of triples in which the entity in question is in the subject position, whereas I was after a property and then the number of triples that have my died-on property in the predicate position.

Why read when you can type?

You can almost always replace reading documentation by clever experimentation (it's just slower), and at that point I figured: Well, Einstein is dead, so I should find the property I'm after by checking him out. Typing “Einstein” in the search service immediately brought up

  • Albert Einstein (Q937) – German-born theoretical physicist; developer of the theory of relativity (1879–1955) – 502 statements, 290 sitelinks
  • Bose–Einstein condensate (Q46202) state of matter of a dilute gas of bosons cooled to temperatures very near absolute zero – 24 statements, 59 sitelinks
  • 2001 Einstein (Q146709) Inner main belt asteroid – 27 statements, 43 sitelinks

and so forth, so Q937 it was. Ha! I was ready for my first useful SPARQL query:

SELECT ?p ?o
WHERE {
  wd:Q937 ?p ?o.
}

(try it)

This yields (right now) 1445 triples, such as Einstein's name in all kinds of languages and scripts. But where is his birthday? I tried the browser search for 1955 (the year of his death), but to no avail.

Ha! The result browser is paged, so by default only the first 200 of the 1445 result triples are displayed in one go. But one can tell the pager to retrieve 1000 entries at a time, so that's what I did, and presto, I saw something that looks just like birthday and date of death:

  • wdt:P569 14 March 1879
  • wdt:P570 18 April 1955

The result browser shows the properties as links, and when one clicks on the link for wdt:P569, it's explained as “date on which the subject was born”. Even without that link, the list of default prefixes under the query form's pin button shows that the wdt: prefix is just http://www.wikidata.org/prop/direct/, and so you could have come up with http://www.wikidata.org/prop/direct/P569 as the property's URI yourself. Pointing a browser there again resolves to a nice explanation.
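
With the two properties identified, there is no need to page through all of Einstein's triples any more. The following is a sketch of a query asking for just these two values. The variable names ?born and ?died are my choice, and the second pattern spells out the full URI in angle brackets merely to show that it is interchangeable with the wdt: form:

```sparql
# Two triple patterns, separated by dots, both with Einstein
# (wd:Q937) as subject.
SELECT ?born ?died
WHERE {
  # prefixed form of the date-of-birth property
  wd:Q937 wdt:P569 ?born.
  # full-URI form of wdt:P570, which looked like the date of death
  wd:Q937 <http://www.wikidata.org/prop/direct/P570> ?died.
}
```

In practice you would use the prefixed form throughout; mixing the two notations as above is only meant to illustrate the expansion.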

And this actually has the many triples I had expected for predicates like died-on or born-on. Like SQL, SPARQL has an aggregate function COUNT() to learn how many matches you have, except that you must give an alias in SPARQL:

SELECT (COUNT(*) AS ?n)
WHERE {
  ?s wdt:P569 ?o.
}

(try it)

This gives n=5816933 at the moment, which is within my range of expectation for the number of entries with birthdays in the Wikipedia.

Oh no! Javascript and Local Storage

But at that point I was intensely grinding my teeth, because in keeping with the general suckiness of today's Web, the form is overloaded with Javascript magic, which makes focus management really break on my browser (luakit) when you have Javascript and local storage enabled – basically, you would click into the text field but would still type to the browser in visual mode. And without Javascript, you get back raw JSON when submitting the form. With only Javascript and no local storage, you cannot type at all: the Javascript will swallow your keystrokes without comment. And don't get me started about clipboard and selection management, though I grant you that's mostly webkitgtk's fault in my case.

So, regardless of how I configure my browser, the query form sucks. And hence I simply had to make it so I could edit my queries with a proper editor and still have halfway readable results. That's the topic of the next instalment.

For today, let me close with pointing out that just because something is a property doesn't mean it can only be in the predicate position of a triple. For instance, our birthday property has a few triples in which it is the subject. You have seen the sort of query before:

SELECT ?p ?o
WHERE {
  wdt:P569 ?p ?o.
}

(try it)

In this case, that's not much, just a declaration that this is a data property to programs that know the Web Ontology Language OWL, which is another common collection of RDF resources with a cute acronym, this time intended to talk about relationships of RDF resources among themselves.

Frankly, I would at least have expected a label and a description in what is coming back for that query. This information must be somewhere in the triplestore, since all the items on http://www.wikidata.org/prop/direct/P569 almost certainly originated from RDF triples. I suspect that's related to the fact that Wikidata properties exist in a few varieties (look for “truthy” in the user manual). But working that out is for another day.

[Read on]

[1] For reasons I'd be curious to learn, properties P0 to P9 do not exist in Wikidata. If you know the story, do let me know!
[2] This one is rather well organised, and when you have a set of reasonably well-connected RDF resources, some people like to call it an “ontology”. Don't be scared when they do. It's still just a bunch of RDF triples, just one that claims to have a bit more inner structure than random collections of triples.
