Extension talk:WikiGrok/Claim suggestions

Discuss stuff here :)

tl;dr
At worst, we can semi-reliably extract a year (of birth/death for people, of formation for bands, and of release for albums) from categories, infoboxes, persondata, and lead sentences, in order of increasing complexity. At best, we can extract the full date from infoboxes, persondata, and lead sentences.

Infoboxes and persondata
Both infoboxes and persondata use templates whose dates typically conform to MOS:YEAR. Extracting dates from infoboxes and persondata then involves some "parsing" of the article's source, probably with a regular expression or two.
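As a rough illustration of the "regular expression or two", here is a minimal Python sketch. It assumes the infobox uses a {{Birth date|...}} / {{Death date|...}} style template with positional year, month, and day, and that persondata uses DATE OF BIRTH / DATE OF DEATH fields with a plain-text date. The function names are made up for this example, and real articles will have more variants than this covers.

```python
import re

MONTHS = ("January February March April May June July "
          "August September October November December").split()
MONTH_RE = "|".join(MONTHS)

# Plain-text dates as they might appear in persondata (assumed formats).
DMY = re.compile(rf"(\d{{1,2}}) ({MONTH_RE}),? (\d{{3,4}})")
MDY = re.compile(rf"({MONTH_RE}) (\d{{1,2}}), (\d{{3,4}})")
YEAR = re.compile(r"\b(\d{3,4})\b")

# {{Birth date|1950|1|2}}, {{birth date and age|df=yes|1950|1|2}}, ...
TEMPLATE = re.compile(
    r"\{\{\s*(birth|death)[ _]date[^|}]*\|([^{}]*)\}\}", re.IGNORECASE)
# | DATE OF BIRTH = 2 January 1950
PERSONDATA = re.compile(r"\|\s*DATE OF (BIRTH|DEATH)\s*=\s*([^|\n}]*)")


def parse_plain_date(text):
    """Return (year, month, day); month and day are None if only a year is found."""
    m = DMY.search(text)
    if m:
        return int(m.group(3)), MONTHS.index(m.group(2)) + 1, int(m.group(1))
    m = MDY.search(text)
    if m:
        return int(m.group(3)), MONTHS.index(m.group(1)) + 1, int(m.group(2))
    m = YEAR.search(text)
    if m:
        return int(m.group(1)), None, None
    return None


def extract_dates(wikitext):
    """Return {"birth": (y, m, d), "death": (y, m, d)} for whatever is found."""
    found = {}
    # Date templates first: their numeric positional parameters are unambiguous.
    for m in TEMPLATE.finditer(wikitext):
        nums = [int(p) for p in m.group(2).split("|") if p.strip().isdigit()][:3]
        if nums:
            found.setdefault(m.group(1).lower(),
                             tuple(nums + [None] * (3 - len(nums))))
    # Persondata only fills in what the templates did not provide.
    for m in PERSONDATA.finditer(wikitext):
        date = parse_plain_date(m.group(2))
        if date:
            found.setdefault(m.group(1).lower(), date)
    return found
```

The first match wins here, so an infobox date template takes priority over persondata. That ordering is an arbitrary choice for the sketch, not a recommendation.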

Categories
Birth and death categories are a good source of birth/death year suggestions. There are well-formed categories for births and deaths in most years from the 17th century BC (e.g. Category:1646_BC_deaths) to the 21st century (e.g. Category:2014_births).
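Because the category names are so regular, a single pattern should cover them. A minimal Python sketch, assuming titles look like the two examples above (with either underscores or spaces); the function name is hypothetical:

```python
import re

# Matches e.g. "Category:1646_BC_deaths" and "Category:2014 births".
CATEGORY = re.compile(r"^Category:(\d{1,4})[ _](BC[ _])?(births|deaths)$")


def year_from_category(title):
    """Return ("birth" | "death", year) or None if the title does not match.

    BC years are returned as negative numbers purely as a marker;
    this is not astronomical year numbering.
    """
    m = CATEGORY.match(title)
    if not m:
        return None
    year = int(m.group(1))
    if m.group(2):
        year = -year
    return m.group(3)[:-1], year
```

Decade and century categories are deliberately not matched, since they don't give a single year to suggest.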

Lead sentences
There isn't any requirement that birth and death dates in lead sentences follow a set format or appear in a specific position. Biographical articles tend to have the dates following the subject's name, possibly separated by an em dash. Like infoboxes and persondata, extracting dates from lead sentences will require a handful of regular expressions that match a variety of date formats and positions. This shouldn't be a huge performance hit because we're only ever trying to match one sentence.
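To give a feel for what one of those regular expressions might look like, here is a Python sketch for the common case where the dates sit in parentheses straight after the subject's name. The accepted formats (day-month-year, month-day-year, bare year, optional "born", and a hyphen, en dash, or em dash between the dates) are assumptions for this example; anything else in the parentheses, such as a pronunciation or place of birth, makes it fail to match.

```python
import re

MONTH_RE = ("January|February|March|April|May|June|July|"
            "August|September|October|November|December")

# One date: "2 January 1950", "January 2, 1950", or just "1950".
DATE = (rf"(?:\d{{1,2}} (?:{MONTH_RE}) \d{{3,4}}"
        rf"|(?:{MONTH_RE}) \d{{1,2}}, \d{{3,4}}"
        rf"|\d{{3,4}})")

# "(2 January 1950 - 4 March 2001)" or "(born January 2, 1950)";
# \u2013 and \u2014 are the en dash and em dash.
LEAD_DATES = re.compile(
    rf"\(\s*(?:born\s+)?({DATE})(?:\s*[\u2013\u2014-]\s*({DATE}))?\s*\)")


def lead_sentence_dates(sentence):
    """Return {"birth": text, "death": text or None}, or None if nothing matches."""
    m = LEAD_DATES.search(sentence)
    if not m:
        return None
    return {"birth": m.group(1), "death": m.group(2)}


def year_of(date_text):
    """Pull the year out of a date string matched by DATE (the year is always last)."""
    return int(re.search(r"\d{3,4}$", date_text).group())
```

Even when the full date can't be trusted, year_of gives the "at worst" year suggestion from the same match.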

Generating a corpus of suggestions
There are three approaches to ensuring that WikiGrok has a good corpus of suggestions, which can all share the same extraction code:


 * 1) Bulk extract suggestions from articles for eligible items
 * 2) Extract suggestions when the article is saved by listening to the PageContentSaveComplete or CategoryAfterPageAdded hooks
 * 3) Extract suggestions when suggestions are requested during a WikiGrok game

Approaches 2 and 3 are complementary. If we only used approach 2, suggestions would only exist for articles that happen to be saved, so it'd take a non-trivial amount of time to generate a good corpus of suggestions.
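A small Python sketch of how approaches 2 and 3 could share one extraction function and one store. Everything here (the class, its methods, the in-memory dict) is hypothetical and only meant to show the shape; in the extension itself this would presumably be PHP hook handlers writing to a real table.

```python
class SuggestionCorpus:
    """Toy store that is filled both on save and on demand."""

    def __init__(self, extract, fetch_wikitext):
        self._extract = extract          # shared extraction code
        self._fetch = fetch_wikitext     # loads an article's source by title
        self._store = {}                 # stand-in for a database table

    def on_article_saved(self, title, wikitext):
        # Approach 2: refresh suggestions whenever the article changes.
        self._store[title] = self._extract(wikitext)

    def suggestions_for(self, title):
        # Approach 3: extract lazily the first time a game asks for this item.
        if title not in self._store:
            self._store[title] = self._extract(self._fetch(title))
        return self._store[title]
```

With both paths in place, an article that hasn't been edited since deployment still gets suggestions the first time it is requested, and later edits keep them fresh.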

Cheap suggestions
We currently have one "cheap claim suggestion" (one that only requires looking at the existing Wikidata claims for that item): Extension:MobileFrontend/WikiGrok/Claim suggestions. Here are some other ideas:

Country of origin is US, but original language is not set to English: 14549 pages http://wdq.wmflabs.org/api?q=claim%5B495:30%5D%20AND%20noclaim%5B364:1860%5D

Country of origin is France, but original language is not set to French: 13407 pages http://wdq.wmflabs.org/api?q=claim%5B495:142%5D%20AND%20noclaim%5B364:150%5D

Country of origin is UK, but original language is not set to English: 4998 pages http://wdq.wmflabs.org/api?q=claim%5B495:145%5D%20AND%20noclaim%5B364:1860%5D

Country of origin is Germany, but original language is not set to German: 854 pages http://wdq.wmflabs.org/api?q=claim%5B495:183%5D%20AND%20noclaim%5B364:188%5D

Politician (P106:Q82955), country of citizenship US (P27:Q30), no member of political party set (P102) [We can ask Democrat or Republican randomly]: ~5800 pages
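The country/language queries above all follow the same "has claim X, lacks claim Y" pattern, so the URLs can be generated rather than written by hand. A minimal Python sketch that only builds the URL (it doesn't call the service); the function name and argument shape are made up for this example:

```python
from urllib.parse import quote

WDQ_API = "http://wdq.wmflabs.org/api?q="


def wdq_url(present, absent):
    """Build a WDQ URL for items that have one claim but lack another.

    present and absent are (property, item) pairs of numeric IDs,
    e.g. (495, 30) for country of origin (P495) = US (Q30).
    """
    query = (f"claim[{present[0]}:{present[1]}] "
             f"AND noclaim[{absent[0]}:{absent[1]}]")
    # Keep ":" unescaped so the result matches the hand-written URLs above.
    return WDQ_API + quote(query, safe=":")


# Country of origin is US, original language is not English.
print(wdq_url((495, 30), (364, 1860)))
```

A table of (country, language) ID pairs fed through this would cover the four cases above and make it easy to add more countries.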