SPARQL 4: Be Extra Careful on your Birthday

[Photo: A Yak staring at the photographer. Caption: Enough Yak Shaving already.]

I am now three parts into my yak-shaving investigation of Wikidata and SPARQL (one, two, three) with the goal of figuring out whether birthdays are more or less dangerous – measured by whether or not people survive them – than ordinary days.

At the end of the last instalment I was stumped by timeouts on Wikidata, and so, much of this post is about massaging my query so that I get some answers in the CPU minute that one gets on Wikidata's triplestore. While plain optimisation certainly is necessary, working on six million people within a minute is probably highly ambitious whatever I do[1]. I will hence have to work on a subset. Given what I have worked out in part 3,

# This will time out and just waste CPU.
SELECT (count(?person) AS ?n)
WHERE {
  ?person rdfs:label ?personName.
  ?person wdt:P569 ?bdate.
  hint:Prior hint:rangeSafe true.
  ?person wdt:P570 ?ddate.
  hint:Prior hint:rangeSafe true.

  FILTER (MONTH(?bdate)>1 || DAY(?bdate)>1)
  FILTER (MONTH(?bdate) = MONTH(?ddate)
    && DAY(?bdate) = DAY(?ddate)
    && YEAR(?bdate) != YEAR(?ddate))
  FILTER (YEAR(?bdate)>1850)

  FILTER (lang(?personName)="en")
}

– how do I do a subset? Proper sampling takes almost as much time as working with the whole thing. But for now, I'd be content with just evaluating my query on whatever subset Wikidata likes to work on. Drawing such a (statistically badly sampled) subset is what the LIMIT clause, which you have already seen in part one, does. But where do I put it, since, if things work out, my query would only return a single row anyway?

Subqueries in SPARQL, and 50 Labels per Entity

The answer, as in SQL, is: A subquery. In SPARQL, you can have subqueries like this:

SELECT (count(?person) as ?n)
WHERE {
  { SELECT ?person ?personName ?bdate ?ddate
    WHERE {
      ?person rdfs:label ?personName;
        wdt:P569 ?bdate;
        wdt:P570 ?ddate.
    }
    LIMIT 50
  }
  FILTER (lang(?personName)="en")
}

– so, within a pair of curly braces, you write another SELECT clause, and its result is then the input for another WHERE, FILTER, or other SPARQL construct. In this case, by the way, I'm getting 50 records with all kinds of labels in the subquery and then filtering out everything that's not English. Amazingly, only one record out of these 50 remains: there are clearly at least 50 label statements for the first entity Wikidata has picked here.

Raising the inner limit to 500, I get 10 records. For the particular sample that Wikidata chose for me, a person indeed has 50 labels on average. Wow. Raising the limit to 5000, which probably lowers the ratio of pharaohs to non-pharaohs in the sample, gives 130 records, which translates into 38 labels per person.
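Of course, one can also ask for those label counts directly rather than inferring them from the shrinking result set; a quick sanity check along these lines (a sketch, counting labels for a small, arbitrary sample) should do:

# How many labels does each person in a tiny, arbitrary sample carry?
SELECT ?person (COUNT(?label) AS ?nLabels)
WHERE {
  { SELECT DISTINCT ?person
    WHERE { ?person wdt:P569 ?bdate. }
    LIMIT 10
  }
  ?person rdfs:label ?label.
}
GROUP BY ?person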

Clearly, adding the labels is an expensive operation, and since I do not need them for counting, I will drop them henceforth. Also, I am now filtering for non-January-1st birthdays in the subquery. In this way, everything coming out of the subquery should belong to the population I actually want to sample, which means I can compute the rate of people dying on their birthday by dividing my count by the limit.

Let's see where this gets us:

SELECT (count(?person) AS ?n)
WHERE {
  { SELECT ?person ?bdate ?ddate
    WHERE {
      ?person wdt:P569 ?bdate.
      hint:Prior hint:rangeSafe true.
      ?person wdt:P570 ?ddate.
      FILTER (MONTH(?bdate)>1 || DAY(?bdate)>1)
      FILTER (YEAR(?bdate)>1850)
      hint:Prior hint:rangeSafe true.
    }
    LIMIT 500
  }

  FILTER (MONTH(?bdate) = MONTH(?ddate)
    && DAY(?bdate) = DAY(?ddate)
    && YEAR(?bdate) != YEAR(?ddate))
  hint:Prior hint:rangeSafe true.
}

Named Subqueries and Planner Barriers

That's returning a two for me, which is not implausible, but for just 500 records it ran for about 20 seconds, which does not feel right. Neither pulling the 500 records nor filtering them should take that long.

When a database engine takes a lot longer than one thinks it should, one should take a look at the query plan, in which the database engine states in which sequence it will compute the result, which indexes it intends to use, and so on.

Working out a good query plan is hard, because in general you need to know the various partial results to find one; in my example, for instance, the system could first pick out everyone born after 1850 and then find the death dates for them, or it could first match the death dates to the birth dates (discarding everything that does not have a death date in the process) and then filter for 1850. If there were many people with birthdays but no death dates (for instance, because your database talks mostly about living people), the second approach might be a lot faster. But you – that is, the database engine – have to have good statistics to figure that out.

Since that is such a hard problem, it is not uncommon that the query planner gets confused and re-orders operations such that things are a lot slower than they would be if it hadn't done the ordering, and that's when one should use some sort of explain feature (cf. Wikidata's optimisation hints). On Wikidata, you can add an explain=detail parameter to the query and then somehow display the bunch of HTML you get back.

I did that and, as I usually do when I try this kind of thing, found that query plans are notoriously hard to understand, in particular when one is unfamiliar with the database engine. But in the process of figuring out the explain thing, I had discovered that SPARQL has the equivalent of SQL's common table expressions (CTEs), which gave me an excuse to tinker rather than think about plans. Who could resist that?

In SPARQL – or at least in Blazegraph, the engine behind Wikidata's query service – CTEs are called named subqueries and are used like this:

SELECT (count(?person) AS ?n)
WITH { SELECT ?person ?bdate ?ddate
    WHERE {
      ?person wdt:P569 ?bdate.
      hint:Prior hint:rangeSafe true.
      ?person wdt:P570 ?ddate.
      hint:Prior hint:rangeSafe true.
      FILTER (MONTH(?bdate)>1 || DAY(?bdate)>1)
      FILTER (YEAR(?bdate)>1850)
    }
    LIMIT 30000
  } AS %selection
WHERE {
  INCLUDE %selection

  FILTER (MONTH(?bdate) = MONTH(?ddate)
    && DAY(?bdate) = DAY(?ddate)
    && YEAR(?bdate) != YEAR(?ddate))
  hint:Prior hint:rangeSafe true.
}

– you write your subquery in a WITH-block and give it a name that you then INCLUDE in your WHERE clause. In several SQL database systems, such a construct provides a planner barrier, that is, the planner will not rearrange expressions across a WITH.

So does, according to the optimisation hints, Wikidata's current SPARQL engine. But disappointingly, things don't speed up. Hmpf. Even so, I consider named subexpressions more readable than nested ones[2], so for this post I will stay with them. If you come up with a brilliant all-Wikidata query, though, you probably want to go back to inline subqueries, because then you do not want to constrain the planner too much.
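The heaviest hammer in that department, by the way – mentioned, if I read them correctly, in the optimisation hints – would be to switch the optimizer off altogether, so that the engine evaluates the patterns in exactly the order you wrote them:

# Heavy-handed and probably unwise for all-Wikidata queries:
# disable the join-order optimizer so the patterns run as written.
hint:Query hint:optimizer "None".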

Finally: Numbers. But what do they Mean?

With my barrier-protected subqueries, I have definitely given up on working with all 6 million persons with birthdays within Wikidata. Interestingly, I could go from a limit of 500 to one of 30'000 and stay within the time limit. I never went back to try and figure out what causes this odd scaling law, though I'd probably learn a lot if I did. I'd almost certainly learn even more if I tried to understand why, with a limit of 50'000, the queries tended to time out. But then 30'000 persons are plenty for my purpose provided they are drawn reasonably randomly, and hence I skipped all the tempting opportunities to learn.

And, ta-da: With the above query, I get 139 matches (i.e., people who died on their birthdays).

What does that say about the danger of birthdays? Well, let's see: If birthdays were just like other days, one would expect 30'000/365 deaths on birthdays in this sample, which works out to a bit more than 80. Is the 140 I am finding different from that 80 once statistical noise is taken into account? A good rule of thumb (that in the end is related to the central limit theorem) is that when you count n samples, your random error is something like √n if everything is benevolent. For 140, that square root is about 12, which we physicist-hackers like to write as σ = 12, and then we quickly divide the offset (i.e., 140 − 80 = 60) by that σ and triumphantly proclaim that “we've got a 5-σ effect”, at which point everyone is convinced that birthdays are life-threatening.
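In symbols, that back-of-the-envelope estimate is just (nothing new, only the arithmetic of the last paragraph restated):

\[
N_{\text{expected}} \approx \frac{30000}{365} \approx 82, \qquad
\sigma \approx \sqrt{140} \approx 12, \qquad
\frac{140 - 80}{\sigma} = \frac{60}{12} = 5 .
\]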

This is related to the normal distribution (“Gauss curve”), which has about two thirds of its area within “one σ” (σ being the standard deviation of whatever you draw from such a distribution, and roughly half the curve's width at half maximum), 95% of its area within “two σ”, and more than 99% within “three σ”. Reckless as we gung-ho physicists are, we usually translate that into “the chance that a three-σ effect I've just measured is pure chance is less than a percent”. And hence a “five σ” effect is almost certainly real for people like me.

All that is a lot more handwaving than hard math, but if whatever you're looking at behaves reasonably Gaussian (which is another rather handwavy term), these are surprisingly good rules of thumb.

Control: What About Day -1 and Day +1?

Anyway, in this particular case I am more concerned about SPARQL thinkos than about statistical noise, and hence I would like to compare this result with the deaths before and after the birthday. To avoid producing dates like July 0th or August 32nd, I need to restrict the analysis to days 2 to 27 of the month; this lets me just add or subtract 1 from the day and still have a valid date, and I have no reason to assume the days I am filtering out would change my result.

I also trust that I just get one row per person (that wouldn't be true if a person had more than one date of birth or death, which is something I'd have to ascertain if this were science) and write count(*) rather than count(?person), which – according to the optimisation hints, common sense, and my experience – is a good deal faster. In sum:

SELECT (count(*) AS ?n)
WITH { SELECT ?person ?bdate ?ddate
    WHERE {
      ?person wdt:P569 ?bdate.
      hint:Prior hint:rangeSafe true.
      ?person wdt:P570 ?ddate.
      hint:Prior hint:rangeSafe true.
      FILTER (DAY(?bdate)>1 && DAY(?bdate)<27)
      FILTER (YEAR(?bdate)>1850)
    }
    LIMIT 50000
  } AS %selection
WHERE {
  INCLUDE %selection

  FILTER (MONTH(?bdate) = MONTH(?ddate)
    && DAY(?bdate) = DAY(?ddate)
    && YEAR(?bdate) != YEAR(?ddate))
}

I can vary this query by writing DAY(?bdate)-1 = DAY(?ddate) to count people who died on the day before their birthday, and DAY(?bdate)+1 = DAY(?ddate) for those dying right after their birthday. When I tried this a week ago or so, I counted 164/198/138 deaths (day before/birthday/day after). Out of curiosity, I re-ran the queries just now, which gave me 169/206/144.
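For reference, the day-before variant of the outer filter looks like this (the day-after one uses +1 instead):

FILTER (MONTH(?bdate) = MONTH(?ddate)
  && DAY(?bdate)-1 = DAY(?ddate)
  && YEAR(?bdate) != YEAR(?ddate))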

That the two sets of counts differ is to be expected: the engine is perfectly entitled to return a different selection of 50000 matches each time I run the query, and quite likely quite a few triples were edited or added this past week. Considering that, it is almost surprising that the three values are all within “one sigma” (in the square-root sense introduced above) of what I got a week ago.

The clear result: at least famous people (the sort that has triples in Wikidata) indeed die quite a lot more often on their birthdays than on other days. The death rate is even somewhat higher on the day before the birthday, but pretty normal on the day after (the base rate here would be 50000/365, or about 140).

I suppose the short version of this is: “Hangovers don't kill you”.

Control: What about a Month? And Well-Defined Sampling?

At this point, I got curious how things would look for dates that are obviously uncorrelated. How do I pick those? For simplicity, let us use a month earlier and a month later.

To do that, I need to further restrict the analysis to months between February and November, for the same reason that made me discard all people born on days outside the range 2…27. From a sampling point of view, discarding whole months is a lot more dangerous than kicking out days at the beginning and end of months, because people do behave differently in winter than in summer, and people born in winter are now severely underrepresented in the sample. But on the other hand, I can now look at a month earlier and later, and so I am willing to pay that price[3]:

FILTER (MONTH(?bdate)-1 = MONTH(?ddate)   # or +1 for a month after
  && DAY(?bdate) = DAY(?ddate))
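Put together – a sketch; the on-birthday and month-after runs only differ in the outer FILTER, I keep the YEAR clause from the previous query, and the limit is already down to 30'000 for reasons given in the next paragraph – the month-before query becomes:

SELECT (count(*) AS ?n)
WITH { SELECT ?person ?bdate ?ddate
    WHERE {
      ?person wdt:P569 ?bdate.
      hint:Prior hint:rangeSafe true.
      ?person wdt:P570 ?ddate.
      hint:Prior hint:rangeSafe true.
      FILTER (DAY(?bdate)>1 && DAY(?bdate)<27)
      FILTER (MONTH(?bdate)>1 && MONTH(?bdate)<12)
      FILTER (YEAR(?bdate)>1850)
    }
    LIMIT 30000
  } AS %selection
WHERE {
  INCLUDE %selection

  FILTER (MONTH(?bdate)-1 = MONTH(?ddate)
    && DAY(?bdate) = DAY(?ddate)
    && YEAR(?bdate) != YEAR(?ddate))
}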

Regrettably, with the additional condition, I get timeouts for 50000 records. With a sample of 30000 records, I get 81/122/78 for a month before/on birthday/a month later. That's again a strong signal, and again I think the LIMIT clause will not have introduced any biases relevant to my present question.

But I was still somewhat uneasy about that uncontrolled sampling. Can't I do better than that?

Turns out I can. In the meantime, I had actually skimmed the optimisation hints and found that my usage of YEAR(?bdate) is pretty much discouraged (which I had not expected from my SQL instincts, where I'd have created an index on exactly that expression). Instead, Wikidata recommends filtering against full datetime instances. For instance, to get all people born in the 1910s, one could write:

FILTER("1910-00-00"^^xsd:dateTime <= ?bdate &&
  ?bdate < "1920-00-00"^^xsd:dateTime)

The ^^xsd:dateTime is RDF's equivalent of C's casts, i.e., a forced type conversion, which certainly helps the triplestore to realise that it can apply its indexes. With this, I can tell Wikidata to look at all people born in the 1910s without a timeout:

SELECT (count(*) AS ?n)
WITH { SELECT ?person ?bdate ?ddate
    WHERE {
      ?person wdt:P569 ?bdate.
      hint:Prior hint:rangeSafe true.
      ?person wdt:P570 ?ddate.
      hint:Prior hint:rangeSafe true.

      FILTER (DAY(?bdate)>1 && DAY(?bdate)<27)
      FILTER (MONTH(?bdate)>1 && MONTH(?bdate)<12)
      FILTER("1910-00-00"^^xsd:dateTime <= ?bdate &&
         ?bdate < "1920-00-00"^^xsd:dateTime)
    }
  } AS %selection
WHERE {
  INCLUDE %selection

  FILTER (MONTH(?bdate) = MONTH(?ddate)
    && DAY(?bdate) = DAY(?ddate)
    && YEAR(?bdate) != YEAR(?ddate))
}

The results, generated with the query modifications above: 320/315/370/308/311 for month before/day before/on birthday/day after/month after. The signal of a dangerous pre-birthday has vanished – statistically, all the numbers except the one for the birthday itself are pretty much the same (√300 ≈ 17). This might make me a tad skeptical, because with my first query it would have seemed fairly reliable that the pre-birthday is dangerous. Hm.

On the other hand, perhaps the 1910ers are somewhat exceptional, because for people born in the 1920s, the slight preference for dying right before the birthday could be back; the counts for people in Wikidata born between 1920 and 1930 (some of those still have to watch out for their birthdays!) are 358/400/484/390/377.
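That run simply swaps the decade bounds in the subquery's date filter – something like:

FILTER("1920-00-00"^^xsd:dateTime <= ?bdate &&
  ?bdate < "1930-00-00"^^xsd:dateTime)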

Whatever the wisdom on the vicinity of the birthday: that the birthday itself is dangerous, and that Mary Lea Heger's case is more the rule than the exception – that much is fairly robust.

Appendix: Born in the Future

In part three I had wondered whether Wikidata lists persons born in the future and found some. But due to timeouts I could not find out what person had their birthday in the most remote future. With the tricks from this instalment, I can get around that and quickly reduce the result set to a level where it can be sorted long before the timeout strikes:

SELECT ?person ?bdate ?name
WHERE {
  ?person wdt:P569 ?bdate;
    rdfs:label ?name.
  hint:Prior hint:rangeSafe true.
  FILTER("2100-00-00"^^xsd:dateTime <= ?bdate)
  FILTER(lang(?name)="en")
}
ORDER BY DESC(?bdate)
LIMIT 5

The result (as produced by my wdq shell function from part two) looks like this:

person=http://www.wikidata.org/entity/Q113122082
bdate=171998-01-01T00:00:00Z
name=Ash “Kash” Straughn

person=http://www.wikidata.org/entity/Q113437436
bdate=9964-03-21T00:00:00Z
name=John Agyare

person=http://www.wikidata.org/entity/Q2715136
bdate=5355-01-01T00:00:00Z
name=Belgarion

person=http://www.wikidata.org/entity/Q3337177
bdate=4000-01-01T00:00:00Z
name=Nausicaä

person=http://www.wikidata.org/entity/Q64141306
bdate=4000-01-01T00:00:00Z
name=Jhil

So, yes, my original recipe of taking characters 6-10 to get the month and day would have failed for one person from the future: the six-digit year in 171998-01-01 pushes the month and day out of that character range. Although… I can't say I've ever heard of any of these persons (well, Nausicaä I have heard of, but not one born in the year 4000).

Simply opening the links in a browser shows that Mrs. Straughn actually is not a person from the future at all and was really born on 1988-08-17 (I believe; at any rate her birthday in the far future is just a parse error). John Agyare is a Ghanaian politician probably born in 1964. Belgarion's birthday, finally, apparently is not a typo. That guy is a king in some fantasy trilogy (of the kind that has five parts) by David Eddings. Full disclosure: I don't even Eddings. Should I?

Incidentally, I have fixed Straughn's and Agyare's birthdays; I guess looking at extreme values in Wikidata and fixing obvious errors could become a rewarding hobby. I am particularly tempted to remove all the spurious birth dates of 2000-01-01. But then it seems these (all 42590 of them) do have a special meaning. If you run:

SELECT ?s
WHERE {
  ?s wdt:P569 '2000-01-01'^^xsd:dateTime.
}
LIMIT 30

and follow a few of these links, you will see that the web frontend makes “20th century” from these dates. One last thing learned here: these “even” dates are magic values. Which of course makes me wonder what happens to people who were actually born on January 1st, 2000.

But that's for another day. I declare this little investigation closed. Thanks for staying with me until the end.

[1]

That is perhaps not entirely true, as a Wikidata wizard could exploit the fact that individual days are entities, too; for instance, May 5th is wd:Q2550. Such entities have pointers to the days preceding and following them, and I could imagine that there is a way from persons to such entities via, say, categories. Perhaps there is a way to write a query like ?person property-birthday ?day; property-deathday ?day (where one would again have to filter out stillborn babies); I'd expect such a thing to be blazingly fast. Or, to find persons who died on the day after their birthday, something like ?person property-birthday ?day. ?day property-followed-by ?next-day. ?person property-deathday ?next-day.
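Spelled out – with ex:birthdayItem, ex:deathdayItem and ex:followedBy as pure placeholders for properties I do not know to exist – the day-after idea might look like:

PREFIX ex: <http://example.org/hypothetical/>
SELECT ?person
WHERE {
  # entirely hypothetical property names, see above
  ?person ex:birthdayItem ?day ;
          ex:deathdayItem ?nextDay .
  ?day ex:followedBy ?nextDay .
}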

But if these properties exist, they are not easy to find. Naively looking for subjects and predicates for May 5th does not go anywhere:

SELECT ?name ?p
WHERE {
  ?s ?p wd:Q2250;
    rdfs:label ?name
  FILTER(lang(?name)="en")
}

I do not understand a single row of what is coming back there right now:

p=http://www.wikidata.org/prop/direct/P921
name=Telemarkskanalen – minute by minute

p=http://www.wikidata.org/prop/direct/P301
name=Category:Telemark Canal

p=http://www.wikidata.org/prop/direct/P361
name=Bandak–Norsjø Canal

p=http://www.wikidata.org/prop/direct/P361
name=Norsjø–Skien Canal

p=http://www.wikidata.org/prop/direct/P971
name=Category:Canal locks of Telemarkskanalen

p=http://www.wikidata.org/prop/direct/P971
name=Category:Ships of Telemarkskanalen

p=http://www.wikidata.org/prop/direct/P1830
name=Telemarkskanalen FKF

Um… pardon me? I think I better call it a day.

[2]Cf. the Zen of Python, python -c "import this" | grep nested
[3]An alternative for the months would be to compare “modulo 12”, as in (Python syntax) (month-1)%12+1 – try it. But I was too lazy to work out how that's written in SPARQL, concerned that these extra computations further eat into my CPU budget, and confident that people born in December and January aren't that different anyway, so I didn't do that.
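For the record, a way to get that wrap-around without a modulo operator (which plain SPARQL 1.1 lacks) might be an IF() – untested, and it may well eat further into the CPU budget, as feared:

# Untested sketch of the month-before comparison with wrap-around:
# if the birthday is in January, the month before it is December.
FILTER (IF(MONTH(?bdate) = 1, 12, MONTH(?bdate) - 1) = MONTH(?ddate)
  && DAY(?bdate) = DAY(?ddate))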
