Talk:Wikibase/Indexing/Data Model

Criticism (by Daniel)
I see several issues with the heuristics for "best" values described here.
 * 1) It doesn't match the spec. The wikibase data model defines the semantics of ranks such that for a query, the only "preferred" claims for a given properties should be considered, if there are any. If there are no "preferred" claims, only the "normal" claims shall be considered. The claims that are thus defined to be relevant to queries according to this are referred to as the "best" claims for that property.
 * 2) It leads to surprises. The graph database is intended to be used for queries, not searches. Queries have a well defined result set, which should be clearly predictable to the author of the query. Predictability is important; A search index may used heuristics to follow the actual content. A query index should have clearly defined behavior, and allow content to me modeled accordingly.
 * 3) Applying such heuristics takes away one of the main incentives to actually rank statements manually (resp by bot). Explicit ranking is extremely valuable, and useful for using values in infoboxes etc. One reason we don't see many "preferred" ranks on Wikidata is that they don't have much effect yet. Once people see how ranking effects query results, this will hopefully be used a lot more. The heuristics suggested here would obscure this effect.
 * 4) The heuristics may have averse "political" consequences. When designing the wikibase model, we took great care to allow for competing views and contradictions. Having e.g. census data ignored because it's a year older than information from another entity may lead to confusion and even animosity (yes, people get into fights about the population of China, or Israel, or India, because it very much depends on which regions you include as territory - this is highly political stuff).
 * 5) One of the wiki principles is: avoid magic, let the community edit content. This means here: leave it to the community if, when, and where they want to apply heuristics like "the newest value is the best". They can write a bot that changes the rank accordingly, with a record in the history, discussions on the wiki, etc.

Stas's response

 * 1)  I think we still comply with the semantics since if "preferred" is present we will just consider that value(s), and only that. If it is not present, I think we should not ignore the fact that right now we have no way to know the US population, at least by query, or have no good way to know 10 most populous countries without scanning through every population figure of every country that exists in the database.
 * 2) I'm not sure how having "best" value makes it unpredictable. It's just a form of materialized view, or an index if you will, just a bit smarter one that DB can provide natively, since the DB does not know our data but we do. Where predictability issue comes from? You still have exactly the same data and get exactly the same result as if you wrote the "10 most populous" query yourself by manually sorting population data by qualifiers for each country. No difference in data will ever happen (and, of course, you can still do the manual query by completely ignoring the best values and going to raw ones). I just propose to write part of this query for you and materialize the result, knowing the user will have to do it anyway.
 * 3) Here I see your point, but I don't think having the engine help you would prevent people from improving the data. I think, on the contrary, that having engine that is actually useful and easy to use would make more people use it and as such be driven to improve the data feeding it.
 * 4) This is easily fixed by setting one of the values as preferred. The additional heuristic only kicks in if there is no human decision, so if any humans disagree, they can always override it. Even have multiple preferred values, if desired. In any case, that'd be better than having US population simultaneously being 50 mln, 150 mln and 300 mln - I don't see any context in which that would be of any practical use.
 * 5) Well, I'm new here but I'm not sure why having an index helping to optimize for common case would be contrary to wiki values. Of course, if we expect the preferred issue to be fixed by the community before the system would go to any production use then the whole issue is irrelevant and we don't need any heuristics - we can just consider the preferred values. But if we expect it to be useful on the data that is not cleaned up yet I think it still can be useful.

In any case, the "best value" part is not integral to the rest of the model, so I'll work on the rest and we'll see what we do with it and if we need it at all after we have the rest of it. --Smalyshev (WMF) (talk) 20:46, 5 December 2014 (UTC)