SPAQRL 3: Who Died on their Birthdays?

Many Yaks grazing on a mountain meadow

A lot of Yak Shaving left to do here.

Now that I have learned how to figure out dates of birth and death in Wikidata and have made myself sensible tools to submit queries, I can finally start to figure out how I can let Wikidata pick out people dying on the same day of the year they were born on, just like Mary Lea Heger.

I first fetched people with their birthdays and dates of death:

SELECT ?person ?bday ?dday
WHERE {
  ?person wdt:P569 ?bday.
  ?person wdt:P570 ?dday.
}
LIMIT 2

Consider that query for a while and appreciate that by giving two triple patterns using the same variable, ?person, I am performing what in SQL databases would be a join. That that's such a natural thing in SPARQL is, I'd say, a clear strong point for the language.

Here is what I got back when I ran this this through my wpd shell function from part two:

person=http://www.wikidata.org/entity/Q18722
dday=-1871-06-29T00:00:00Z
bday=-2000-01-01T00:00:00Z

person=http://www.wikidata.org/entity/Q18734
dday=-1884-01-01T00:00:00Z
bday=-2000-01-01T00:00:00Z

This seems to work, except the dates look a bit odd. Did these people really die more than a hundred years before they were born? Ah, wait: these are negative dates. Opening the person URIs as per part one in a browser, I one learns that Q18722 is pharaoh Senusret II, and at least his birthday clearly is… not very certain. If these kinds of estimates are common, I probably should exclude January 1st from my considerations.

Getting Labels

But the first thing I wanted at that point was to not have to click on the URIs to see names. I knew enough about RDF to simply try and get labels according to RDF schema:

SELECT ?personName ?bday ?dday
WHERE {
  ?person rdfs:label ?personName.
  ?person wdt:P569 ?bday.
  ?person wdt:P570 ?dday.
}
LIMIT 10

That's another SQL join, by the way. Nice. Except what comes back is this:

dday=-2596-01-01T00:00:00Z
bday=-2710-01-01T00:00:00Z
personName=Хуан-ди

dday=-2596-01-01T00:00:00Z
bday=-2710-01-01T00:00:00Z
personName=Huang Di

dday=-2596-01-01T00:00:00Z
bday=-2710-01-01T00:00:00Z
personName=ኋንግ ዲ

dday=-2596-01-01T00:00:00Z
bday=-2710-01-01T00:00:00Z
personName=هوان جي دي

dday=-2596-01-01T00:00:00Z
bday=-2710-01-01T00:00:00Z
personName=Emperador mariellu

If you select the URI in ?person in addition to just the name, you'll see that we now got many records per person. Actually, one per label, as is to be expected in a proper join, and since there are so many languages and scripts out there, many persons in Wikidata have many labels.

At this point I consulted Bob DuCharme's Learning SPARQL and figured a filter on the language of the label is what I wanted. This does not call for a further join (i.e., triple pattern), as the language is something like a property of the object literal (i.e., the string) itself. To retrieve it, there is a function determining the language, used with a FILTER clause like this:

SELECT ?personName ?bday ?dday
WHERE {
  ?person rdfs:label ?personName.
  ?person wdt:P569 ?bday.
  ?person wdt:P570 ?dday.

  FILTER (lang(?personName)="en")
}
LIMIT 10

FILTER is a generic SPARQL thing that is more like a SQL WHERE clause than SPARQL's WHERE clause itself. We will be using it a lot more below.

There is a nice alternative to this kind of joining and filtering I would have found in the wikidata user manual had I read it in time. You see, SPARQL also defines a service interface, and that then allows queriers to mix and match SPARQL-speaking services within a query. Wikidata has many uses for that, and one is a service that can automatically add labels with minimal effort. You just declare that you want that service to filter your query, and then you write ?varLabel to get labels instead of URIs for ?var, as in:

# don't run this.  It'll put load on Wikidata and then time out
SELECT ?personLabel ?bday ?dday
WHERE {
  ?person wdt:P569 ?bday.
  ?person wdt:P570 ?dday.

  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
  }
}
LIMIT 10

The trouble with that: This would first pull out all result triples (a couple of million) out of wikidata and then hand over these triples to the wikibase:label service, which would then add the labels and hand back all the labelled records. Only then will the engine discard the all result rows but the first 10. That obviously is terribly inefficient, and Wikidata will kill this query after a minute.

So: Be careful with SERVICE when intermediate result sets could be large. I later had to use subqueries anyway; I could have used them here, too, to work around the millions-of-triples problem, but at that point I hadn't progressed to these subqueries, let alone means to defeat the planner (that's part four material).

Determining Day and Month

Turtle (about my preference for which you could already read in part two) has a nifty abbreviation where you can put a semicolon after a triple and then leave out the subject in the next triple. SPARQL will put the previous subject into this next triple. That works in SPARQL, too, so I can write my query above with fewer keystrokes:

SELECT ?personName ?bday ?dday
WHERE {
  ?person rdfs:label ?personName;
    wdt:P569 ?bday;
    wdt:P570 ?dday.

  FILTER (lang(?personName)="en")
}
LIMIT 10

Now I need to figure out the birthday, as in: date within a year. In Bob DuCharme's book I found a SUBSTR function, and BIND clauses that let you compute values and bind them to variables. Treating the dates as ISO strings (“YYYY-MM-DD“) should let me pull out month and date starting at index 6 (gna: SPARQL starts with index 1 rather than with 0 as all deities in known history wanted), and then five characters, no? Let's see:

SELECT ?personName ?bdate ?ddate
 WHERE {
   ?person rdfs:label ?personName;
     wdt:P569 ?bday;
     wdt:P570 ?dday.

   BIND (SUBSTR(?bday, 6, 5) as ?bdate)
   BIND (SUBSTR(?dday, 6, 5) as ?ddate)

   FILTER (lang(?personName)="en")
 }
 LIMIT 3

This gives:

personName=Sobekhotep I
bdate=-01-0
ddate=-01-0

personName=Amenemhat I
ddate=-02-1

personName=Senusret II
bdate=-01-0
ddate=-06-2

Well, that is a failure. And that's because my assumptions on string indices are wrong in general, that is: for people born before the Christian era, and then again for people born after 9999-12-31. Which makes we wonder: Does Wikidata have people born in the future? Well, see below.

I'll cheat a bit and spare you a few of the dead alleys I actually followed trying to fix this, because they are not very educational. Also, when I strugged with the timeouts I will discuss in a moment I learned about the right way to do this on Wikidata's optimisation page: When something is a date, you can apply the functions DAY, MONTH, and YEAR on it, and that will plausibly even use indexes and thus be a lot faster.

Thinking about YEAR and having seen the fantasy dates for the ancient Egyptian pharaohs, I also decided to exclude everyone before 1850; that ought to suffice for letting me forget about Gregorian vs. Julian dates, and the likelihood that the dates are more or less right ought to be a lot higher in those days than in the 14th century, say.

With that, I can write the “birth day equals death day“ in a filter without further qualms. The resulting query is starting to look imposing:

SELECT ?person ?personName ?bdate ?ddate
WHERE {
  ?person rdfs:label ?personName.
  ?person rdfs:label  wdt:P569 ?bdate.
  hint:Prior hint:rangeSafe true.
  ?person rdfs:label wdt:P570 ?ddate.
  hint:Prior hint:rangeSafe true.

  FILTER (MONTH(?bdate) = MONTH(?ddate)
    && DAY(?bdate) = DAY(?ddate))
  FILTER (YEAR(?bdate)>1850)

  FILTER (lang(?personName)="en")
}
LIMIT 10

The odd triples with hint:Prior are hints for Wikidata's triple store that, or so I understand Wikidata's documentation, encourages it to use indexes on the properties mentioned in the previous lines; the query certainly works without those, and I frankly have not bothered to see if they actually do anything at all for the present queries. There. Accuse me of practising Cargo Cult if you want.

Anyway, the result is looking as awful as I had expected given my first impressions with January 1st, and clearly, ensuring birthdays after 1850 is not enough to filter out junk:

person=http://www.wikidata.org/entity/Q112689783
ddate=0080-01-01T00:00:00Z
bdate=1920-01-01T00:00:00Z
personName=Enrico Mezzasalma

person=http://www.wikidata.org/entity/Q19976926
ddate=1342-01-01T00:00:00Z
bdate=2000-01-01T00:00:00Z
personName=Peter Jonsson

person=http://www.wikidata.org/entity/Q19291026
ddate=1400-01-01T00:00:00Z
bdate=2000-01-01T00:00:00Z
personName=Galceran Marquet
...

It seems Wikidata even uses 2000-01-01 as some sort of NULL value. Dang. Let's filter out January 1st, then:

SELECT ?person ?personName ?bdate ?ddate
WHERE {
  ?person rdfs:label ?personName;
    wdt:P569 ?bdate;
    wdt:P570 ?ddate.

  FILTER (MONTH(?bdate) > 1 || DAY(?bdate) >1)
  FILTER (MONTH(?bdate) = MONTH(?ddate)
    && DAY(?bdate) = DAY(?ddate))
  FILTER (YEAR(?bdate)>1850)

  FILTER (lang(?personName)="en")
}
LIMIT 10

This yields:

person=http://www.wikidata.org/entity/Q75246593
ddate=1851-02-03T00:00:00Z
bdate=1851-02-03T00:00:00Z
personName=Maria de Gloria de Bragança e Saxe-Coburgo-Gotha, Infanta de Portugal

person=http://www.wikidata.org/entity/Q76312566
ddate=1851-04-05T00:00:00Z
bdate=1851-04-05T00:00:00Z
personName=Constance Mary Lousada

person=http://www.wikidata.org/entity/Q75654920
ddate=1852-05-03T00:00:00Z
bdate=1852-05-03T00:00:00Z
personName=unknown son Tottenham

Nice. This works. Except… Wait a second: These people died on the exact date of their birth. It seems if you are a Saxecoburgotsky you can still get a Wikidata entry even if they are a stillborn child. Given that birth is about the most dangerous thing many of us do in their entire lives (in particular in the Global South, were of course giving birth usually is still more dangerous), having stillborn children in this calculation would severely taint it. So, let's filter out those, too:

SELECT ?person ?personName ?bdate ?ddate
WHERE {
  ?person rdfs:label ?personName;
    wdt:P569 ?bdate;
    wdt:P570 ?ddate.

  FILTER (MONTH(?bdate)>1 || DAY(?bdate)>1)
  FILTER (MONTH(?bdate) = MONTH(?ddate)
    && DAY(?bdate) = DAY(?ddate)
    && YEAR(?bdate) != YEAR(?ddate))
  FILTER (YEAR(?bdate)>1850)

  FILTER (lang(?personName)="en")
}
LIMIT 10

This returns:

person=http://www.wikidata.org/entity/Q75309787
ddate=1855-06-15T00:00:00Z
bdate=1854-06-15T00:00:00Z
personName=Henry William Spencer Dashwood

person=http://www.wikidata.org/entity/Q75984563
ddate=1859-09-01T00:00:00Z
bdate=1857-09-01T00:00:00Z
personName=William Hall Greaves

person=http://www.wikidata.org/entity/Q76020909
ddate=1865-06-01T00:00:00Z
bdate=1860-06-01T00:00:00Z
personName=Lilian Josephine Asquith

These still are persons that died when they were rather young, but I'd argue that biologically, the first birthday is not really very special any more, and hence having many people die then already counts as irregular.

So… it seems I'm done. Let's count how many of these there are:

# Don't run this yourself
SELECT (count(?person) AS ?n)
WHERE {
  ?person rdfs:label ?personName;
    wdt:P569 ?bdate;
    wdt:P570 ?ddate.

  FILTER (MONTH(?bdate)>1 || DAY(?bdate)>1)
  FILTER (MONTH(?bdate) = MONTH(?ddate)
    && DAY(?bdate) = DAY(?ddate)
    && YEAR(?bdate) != YEAR(?ddate))
  FILTER (YEAR(?bdate)>1850)

  FILTER (lang(?personName)="en")
}

This, regrettably, times out. Well, there are 5.8 million triples the machinery has to look at for the birthday part – I know that because I ran:

SELECT (count(*) as ?n)
WHERE {
  ?person wdt:P569 ?o
}

which is fast because there is an index on the predicates –, and the result of my query cannot be served from indexes alone, not to mention the tuples will be all over the disk (assuming the triple store isn't in RAM alone), so even even pulling in all these blocks will possibly take more than the minute Wikidata gives us.

Bummer. In the next instalment of this little series, I will talk about ways to approach my problem nevertheless.

Appendix: People born in the future

Since I only asked myself whether Wikidata knows about people of the future (trekkies, flock to Wikidata!) while writing this post, let me cover this outside of the main narrative. I've tried:

SELECT ?name ?bdate
WHERE {
        ?thing wdt:P570 ?bdate.
  hint:Prior hint:rangeSafe true.
              ?thing rdfs:label ?name.
        FILTER(lang(?name)='en')
        FILTER(YEAR(?bdate)>2023)
}
LIMIT 10

That, indeed, returns science fiction characters:

bdate=2263-01-01T00:00:00Z
name=Spock

bdate=2100-01-01T00:00:00Z
name=Rick Deckard

bdate=2064-01-01T00:00:00Z
name=Susan Calvin
...

But when I try to find the person born in the farthest future, like this:

# Don't run this, it'll time out
SELECT ?name ?bdate
WHERE {
        ?thing wdt:P570 ?bdate.
  hint:Prior hint:rangeSafe true.
              ?thing rdfs:label ?name.
        FILTER(lang(?name)='en')
        FILTER(YEAR(?bdate)>2023)
}
ORDER BY DESC(?bdate)
LIMIT 10

I get timeouts again. I think I'll revisit this in the next instalment, too.

[Read on]

Zitiert in: SPARQL 4: Be Extra Careful on your Birthday SPARQL 2: Improvising a client

Kategorie: edv

Letzte Ergänzungen