SPARQL 4: Be Extra Careful on your Birthday

Enough Yak Shaving already.
I am now three parts into my yak-shaving investigation of Wikidata and SPARQL (one, two, three) with the goal of figuring out whether birthdays are more or less dangerous – measured by whether or not people survive them – than ordinary days.
At the end of the last instalment I was stumped by timeouts on Wikidata, and so much of this post is about massaging my query so that I get some answers within the CPU minute one gets on Wikidata's triplestore. While plain optimisation certainly is necessary, working on six million people within a minute is probably highly ambitious whatever I do[1]. I will hence have to work on a subset. Given what I have worked out in part 3, the query would be:
# This will time out and just waste CPU.
SELECT (count(?person) AS ?n)
WHERE {
  ?person rdfs:label ?personName.
  ?person wdt:P569 ?bdate.
  hint:Prior hint:rangeSafe true.
  ?person wdt:P570 ?ddate.
  hint:Prior hint:rangeSafe true.
  FILTER (MONTH(?bdate)>1 || DAY(?bdate)>1)
  FILTER (MONTH(?bdate) = MONTH(?ddate)
    && DAY(?bdate) = DAY(?ddate)
    && YEAR(?bdate) != YEAR(?ddate))
  FILTER (YEAR(?bdate)>1850)
  FILTER (lang(?personName)="en")
}
– how do I do a subset? Proper sampling takes almost as much time as working with the whole thing. But for now, I'd be content with just evaluating my query on whatever subset Wikidata likes to work on. Drawing such a (statistically badly sampled) subset is what the LIMIT clause you have already seen in part one does. But where do I put it, since, if things work out, my query would only return a single row anyway?
Subqueries in SPARQL, and 50 Labels per Entity
The answer, as in SQL, is: A subquery. In SPARQL, you can have subqueries like this:
SELECT (count(?person) AS ?n)
WHERE {
  {
    SELECT ?person ?personName ?bdate ?ddate
    WHERE {
      ?person rdfs:label ?personName;
        wdt:P569 ?bdate;
        wdt:P570 ?ddate.
    }
    LIMIT 50
  }
  FILTER (lang(?personName)="en")
}
– so, within a pair of curly braces, you write another SELECT clause, and its result is then the input for another WHERE, FILTER, or other SPARQL construct. In this case, by the way, I'm getting 50 records with all kinds of labels in the subquery and then filtering out everything that's not English. Amazingly, only one record out of these 50 remains: there are clearly at least 50 label statements on the first entity Wikidata has picked here.
Raising the inner limit to 500, I get 10 records. For the particular sample that Wikidata chose for me, a person indeed has 50 labels on average. Wow. Raising the limit to 5000, which probably lowers the ratio of pharaohs to non-pharaohs in the sample, gives 130 records, which translates into 38 labels per person.
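If you want to see the label counts directly, here is a little sketch of how one might group a (just as badly drawn) sample by entity and count; this is a toy query of mine, not something you need for the rest of the post:

SELECT ?person (COUNT(?personName) AS ?nLabels)
WHERE {
  {
    # Same badly-sampled subset trick as above
    SELECT ?person ?personName
    WHERE {
      ?person rdfs:label ?personName.
    }
    LIMIT 500
  }
}
GROUP BY ?person
ORDER BY DESC(?nLabels)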
Clearly, adding the labels is an expensive operation, and since I do not need them for counting, I will drop them henceforth. Also, I am moving my filtering for off-January-1st birthdays into the subquery. In this way, chances are that everything coming out of the subquery is a valid trial for the outer filter, which means I can compute the rate of people dying on their birthday by dividing my count by the limit.
Let's see where this gets us:
SELECT (count(?person) AS ?n)
WHERE {
  {
    SELECT ?person ?bdate ?ddate
    WHERE {
      ?person wdt:P569 ?bdate.
      hint:Prior hint:rangeSafe true.
      ?person wdt:P570 ?ddate.
      FILTER (MONTH(?bdate)>1 || DAY(?bdate)>1)
      FILTER (YEAR(?bdate)>1850)
      hint:Prior hint:rangeSafe true.
    }
    LIMIT 500
  }
  FILTER (MONTH(?bdate) = MONTH(?ddate)
    && DAY(?bdate) = DAY(?ddate)
    && YEAR(?bdate) != YEAR(?ddate))
  hint:Prior hint:rangeSafe true.
}
Named Subqueries and Planner Barriers
That's returning a two for me, which is not implausible – if birthdays were ordinary days, one would expect about 500/365 ≈ 1.4 matches – but for just 500 records it ran for about 20 seconds, which does not feel right. Neither pulling the 500 records nor filtering them should take that long.
When a database engine takes a lot longer than one thinks it should, the thing to do is to take a look at the query plan, in which the database engine states the sequence in which it will compute the result, which indexes it intends to use, and so on.
Working out a good query plan is hard, because in general you need to know the sizes of the various partial results to find one; in my example, for instance, the system could first select everyone born after 1850 and then find the death dates for them, or it could first match the death dates to the birth dates (discarding everything that does not have a death date in the process) and then filter for 1850. If there were many people with birth dates but no death dates (for instance, because your database talks mostly about living people), the second approach might be a lot faster. But you, that is, the database engine, have to have good statistics to figure that out.
Since that is such a hard problem, it is not uncommon for the query planner to get confused and re-order operations such that things are a lot slower than they would be if it had left them alone, and that's when one should use some sort of explain feature (cf. Wikidata's optimisation hints). On Wikidata, you can add an explain=detail parameter to the query and then somehow display the bunch of HTML you get back.
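For reference, that would look something like this (assuming the public endpoint at query.wikidata.org/sparql; consult the optimisation hints linked above for the authoritative parameter spelling, and fill in your own URL-encoded query):

https://query.wikidata.org/sparql?explain=detail&query=<your URL-encoded query>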
I did that and, as I usually do when I try this kind of thing, found that query plans are notoriously hard to understand, in particular when one is unfamiliar with the database engine. But in the process of figuring out the explain thing, I had discovered that SPARQL has the equivalent of SQL's common table expressions (CTEs), which gave me an excuse to tinker rather than think about plans. Who could resist that?
In SPARQL, CTEs are called named subqueries and used like this:
SELECT (count(?person) AS ?n)
WITH {
  SELECT ?person ?bdate ?ddate
  WHERE {
    ?person wdt:P569 ?bdate.
    hint:Prior hint:rangeSafe true.
    ?person wdt:P570 ?ddate.
    hint:Prior hint:rangeSafe true.
    FILTER (MONTH(?bdate)>1 || DAY(?bdate)>1)
    FILTER (YEAR(?bdate)>1850)
  }
  LIMIT 30000
} AS %selection
WHERE {
  INCLUDE %selection
  FILTER (MONTH(?bdate) = MONTH(?ddate)
    && DAY(?bdate) = DAY(?ddate)
    && YEAR(?bdate) != YEAR(?ddate))
  hint:Prior hint:rangeSafe true.
}
– you write your subquery in a WITH-block and give it a name that you then INCLUDE in your WHERE clause. In several SQL database systems, such a construct provides a planner barrier, that is, the planner will not rearrange expressions across a WITH.
So does, according to the optimisation hints, Wikidata's current SPARQL engine. But disappointingly, things don't speed up. Hmpf. Even so, I consider named subqueries more readable than nested ones[2], so for this post, I will stay with them. If you come up with a brilliant all-Wikidata query, you probably want to go back to inline subqueries, though, because then you probably do not want to constrain the planner too much.
Finally: Numbers. But what do they Mean?
With my barrier-protected subqueries, I have definitely given up on working with all 6 million persons with birthdays within Wikidata. Interestingly, I could go from a limit of 500 to one of 30'000 and stay within the time limit. I never went back to try and figure out what causes this odd scaling law, though I'd probably learn a lot if I did. I'd almost certainly learn even more if I tried to understand why with a limit of 50'000, the queries tended to time out. But then 30'000 persons are plenty for my purpose provided they are drawn reasonably randomly, and hence I skipped all the tempting opportunities to learn.
And, ta-da: With the above query, I get 139 matches (i.e., people who died on their birthdays).
What does that say on the danger of birthdays? Well, let's see: If birthdays were just like other days, one would expect 30'000/365 deaths on birthdays in this sample, which works out to a bit more than 80. Is the 140 I am finding different from that 80 taking into account statistical noise? A good rule of thumb (that in the end is related to the grand central limit theorem) is that when you count n samples, your random error is something like √(n) if everything is benevolent. For 140, that square root is about 12, which we physicist-hackers like to write as σ = 12, and then we quickly divide the offset (i.e., 140 − 80 = 60) by that σ and triumphantly proclaim that “we've got a 5-σ effect”, at which point everyone is convinced that birthdays are life-threatening.
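For the record, here is that arithmetic in one line (with the generous rounding from above):

\[
n_{\mathrm{expected}} = \frac{30\,000}{365} \approx 82,
\qquad \sigma = \sqrt{139} \approx 12,
\qquad \frac{139 - 82}{\sigma} \approx 4.8,
\]

which, rounded as physicist-hackers round, is the 5 σ proclaimed above.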
This is related to the normal distribution (“Gauss curve”) that has about two thirds of its area within “one σ” (roughly its half-width at half maximum, and the standard deviation of something you draw from such a distribution), 95% of …