Talk:Wikibase/Indexing

Gremlin query examples
--Smalyshev (WMF) (talk) 22:57, 29 November 2014 (UTC)
 * Top 10 countries by population:
 * People born in a city with more than 100k inhabitants:
 * Largest 10 cities in Europe that have a female mayor:

Countries by population
g.listOf('Q6256').as('c').groupBy{it}{it.claimValues('P1082').preferred.latest}.cap .scatter.filter{it.value.size>0}.transform{it.value = it.value.P1082value.collect{it?it as int:0}.max; it} .order{it.b.value <=> it.a.value}.transform{[it.key.wikibaseId, it.key.labelEn, it.value]}

List of occupations
tree[28640][][31,279] g.wd('Q28640').treeIn('P279').instances.dedup.namesList

List of potential nationalities
Warning: big query, do not run unbounded. tree[350][][17,131] AND (claim[31:6256] OR claim[31:15634554]) claim[31:5] AND claim[19] AND between[569,1750]

g.listOf('Q5').as('humans').claimValues('P569').filter{it.P569value != 'somevalue' && it.P569value > Date.parse('yyyy', '1750')} .back('humans').claimVertices('P19').toCountry.as('countries').select(['humans', 'countries']){it.labelEn}{it.labelEn}

People born before 1880 having no date of death
Warning: big query, do not run unbounded. claim[31:5] AND noclaim[570] AND between[569,0,1880] g.listOf('Q5').as('humans').claimValues('P569').filter{it.P569value && it.P569value < DateUtils.fromYear(1880)} .back('humans').filter{!it.out('P570').hasNext}[0..10]

instance of human; occupation writer; not occupation author
claim[31:5] AND claim[106:36180] AND noclaim[106:482980]

g.wd('Q36180').in('P106').has('P31link', CONTAINS, 'Q5').filter{!it.out('P106').has('wikibaseId', 'Q482980').hasNext}[0..100]

Places in the U.S. that are named after Francis of Assisi
(TREE[30][150][17,131] AND CLAIM[138:676555])

g.wd('Q676555').in('P138').filter{it.toCountry.has('wikibaseId', 'Q30').hasNext}.namesList

All items in the taxonomy of the Komodo dragon
TREE[4504][171,273,75,76,77,70,71,74,89]

g.wd('Q4504').as('loop').out('P171').loop('loop'){true}{true}.dedup.namesList

All animals on Wikidata
TREE[729][][171,273,75,76,77,70,71,74,89] g.wd('Q729').as('loop').in('P171').loop('loop'){it.object.in('P171').hasNext}{true}.dedup.namesList

Bridges in Germany
(CLAIM[31:(TREE[12280][][279])] AND TREE[183][150][17,131])

g.wd('Q12280').treeIn('P279').in('P31').as('b').toCountry.has('wikibaseId', 'Q183').back('b').namesList

Bridges across the Danube
(CLAIM[31:(TREE[12280][][279])] AND CLAIM[177:1653])

g.wd('Q12280').treeIn('P279').in('P31').as('b').out('P177').has('wikibaseId', 'Q1653').back('b').namesList

Items with VIAF string "64192849"
STRING[214:'64192849']

g.E.has('P214value', '64192849').outV.namesList

People who were born 1924-1925, and died 2012-2013
(BETWEEN[569,+00000001924-00-00T00:00:00Z,+00000001926-00-00T00:00:00Z] AND BETWEEN[570,+00000002012-00-00T00:00:00Z,+00000002014-00-00T00:00:00Z])"}    g.listOf('Q5').outE('P569').interval('P569value', DateUtils.fromYear(1924), DateUtils.fromYear(1926))       .outV.outE('P570').interval('P570value', DateUtils.fromYear(2012),  DateUtils.fromYear(2013)).outV

Items 15km around the center of Cambridge, UK
AROUND[625,52.205,0.119,15] g.E.has('P625value', WITHIN, Geoshape.circle(52.205,0.119,15)).outV.namesList

WHo was born in 427 BCE
g.E.has('property','P569').has('P569value', DateUtils.fromYear(-427)).outV.namesList

Reconciliation from OpenRefine
Hey guys. I might be on a different page from what you have in mind for this feature, but it would be great if the service would be able to act as a reconciliation service for OpenRefine.

For those who haven't worked with OpenRefine, it's a web tool for data cleaning - the way I see it is a spreadsheet application with all the right buttons and features for data analysis. Reconciliation is a semi-automated process of matching text names to database IDs (keys) and it currently works out of the box with Freebase. This is most of the times enough for English-language data, as Freebase extracts some of its data from Wikipedia, but I found that it doesn't work so well with other languages. As Wikidata is much more multilingual and (hopefully) much more dynamic than Freebase, it would really help a lot if OpenRefine users could connect directly to Wikidata.

Some implementation notes:
 * most of what reconciliation does can do can be done by calling the Wikidata API and parsing the return value with some scripts. Of course, this is not as straightforward for non-programmers.
 * these is an OpenRefine extension that allows reconciliation against SPARQL endpoints and rdf dumps, so this might a quick way to have this functionality.

Looking forward to your input on this request.--Strainu (talk) 20:48, 12 December 2014 (UTC)


 * It would be interesting to know what kind of requests such tool would need from Wikidata API. --Smalyshev (WMF) (talk) 07:15, 17 December 2014 (UTC)
 * I'm not sure I understand the question: are you refering to the API between OpenRefine and Wikidata or to the information that one could extract from Wikidata. Could you please elaborate?--Strainu (talk) 22:53, 17 December 2014 (UTC)
 * More what kind of queries OpenRefine needs to run - i.e. what types of queries, how big the result sets would be, etc. --Smalyshev (WMF) (talk) 05:24, 23 December 2014 (UTC)


 * Well, it would work in two steps : first, it would do a search by name (possibly filtered by language) for the all the entries in the list. this can be arbitrarily large (up to several tens of thousands of elements) but I suppose it does support paging.  Next, individual requests for different properties will be made for each element. In worse case scenarios, it would request all the properties of the wikidata entry, but more often it would be specific to 1 or 2 properties. These request can also be grouped in batches - - Strainu (talk) 10:59, 4 January 2015 (UTC)


 * Search by name may be a bit tricky, since names are not unique, and it may already be covered by existing ElasticSearch. In general, this looks like something that is mostly already covered - you can already search by name and extract the data via JSON exports, isn't that the case? Maybe I'm missing something and other wikidata people could chime in here, or maybe even better on the IRC or mailing list. --Smalyshev (WMF) (talk) 22:09, 10 January 2015 (UTC)

Production use
While there may be some minor points re WDQ, it is used. It is used a lot, it is even vital for the Wikidata users. All the others are at this stage pie in the sky. This is not a tabula rasa situation.

There has also been a previous development for Wikidata query. Where can we read about that.. It was shelved for reasons that were not made public. It would have been standards compliant and it would have been in use for many months now. Thanks, GerardM (talk) 08:09, 26 December 2014 (UTC)

Tools
I would like to know if it will be possible to start a Gremlin query from tools developed by users and use the result for some automatic editing. --Molarus (talk) 23:32, 11 January 2015 (UTC)
 * At some point we'll have a user facing query language but it likely will have restrictions that Gremlin doesn't have. We'll install this in labs and we'll release a relatively easy to install locally version that you can use for tooling.  NEverett (WMF) (talk) 23:11, 28 January 2015 (UTC)

SQL-based service
Hello, one more suggestion: use read-only SQL DB as query engine. E. g. query "STRING[214:'64192849']" will look like:
 * "SELECT EntityID FROM Claim WHERE Property = 214 AND StringValue = '64192849'"

Points:
 * Robust, scalable well-known engine (for example MySQL).
 * Many users know SQL language already.
 * SQL has very rich query syntax.
 * Very simple service implementation (redirect request to DB and translate result to JSON).

Negative:
 * Query is more complex than can be.

This service can be used as Phase 1 or Low Level Service until better solution appears. Ivan A. Krestinin (talk) 12:03, 19 January 2015 (UTC)

WikiGrok Needs

 * More detailed explanation on how "Item eligibility" and "Potential claims generation" supposed to work with regard to Query Engine
 * Top 5/10 list of high priority queries, including
 * Query description
 * Output expectance (yes/no, list - where, in which format)
 * Interactivity - online/offline, up-to-date reqs for lists - i.e. how frequently the list should be updated
 * For lists - do we need exhaustive list or some sample?
 * More queries with lower priority, along the same lines