Talk:Wikibase/Indexing/Data Model

Criticism (by Daniel)
I see several issues with the heuristics for "best" values described here.
 * 1) It doesn't match the spec. The Wikibase data model defines the semantics of ranks such that, for a query, only the "preferred" claims for a given property should be considered, if there are any. If there are no "preferred" claims, only the "normal" claims shall be considered. The claims thus defined as relevant to queries are referred to as the "best" claims for that property.
 * 2) It leads to surprises. The graph database is intended to be used for queries, not searches. Queries have a well-defined result set, which should be clearly predictable to the author of the query. Predictability is important: a search index may use heuristics to follow the actual content, but a query index should have clearly defined behavior and allow content to be modeled accordingly.
 * 3) Applying such heuristics takes away one of the main incentives to actually rank statements manually (or by bot). Explicit ranking is extremely valuable, and useful for using values in infoboxes etc. One reason we don't see many "preferred" ranks on Wikidata is that they don't have much effect yet. Once people see how ranking affects query results, it will hopefully be used a lot more. The heuristics suggested here would obscure this effect.
 * 4) The heuristics may have adverse "political" consequences. When designing the Wikibase model, we took great care to allow for competing views and contradictions. Having e.g. census data ignored because it's a year older than information from another entity may lead to confusion and even animosity (yes, people get into fights about the population of China, or Israel, or India, because it very much depends on which regions you include as territory - this is highly political stuff).
 * 5) One of the wiki principles is: avoid magic, let the community edit content. This means here: leave it to the community if, when, and where they want to apply heuristics like "the newest value is the best". They can write a bot that changes the rank accordingly, with a record in the history, discussions on the wiki, etc.
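
The rank semantics described in point 1 can be sketched as follows. This is only an illustration in Python: the `Claim` type and the `best_claims` function are hypothetical names chosen for this example and are not part of Wikibase.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Claim:
    """Hypothetical, simplified claim: a property id, a value, and a rank."""
    prop: str
    value: object
    rank: str = "normal"  # one of "preferred", "normal", "deprecated"


def best_claims(claims: List[Claim], prop: str) -> List[Claim]:
    """Return the "best" claims for one property.

    If any "preferred" claims exist, only those are returned; otherwise
    only the "normal" claims are returned. Deprecated claims never qualify.
    """
    for_prop = [c for c in claims if c.prop == prop]
    preferred = [c for c in for_prop if c.rank == "preferred"]
    if preferred:
        return preferred
    return [c for c in for_prop if c.rank == "normal"]
```

Under this reading the result depends only on the ranks, so a query author can predict it from the data without knowing any additional heuristic.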

Stas's response

 * 1) I think we still comply with the semantics, since if "preferred" is present we will consider just that value (or values), and only that. If it is not present, I think we should not ignore the fact that right now we have no way to know the US population, at least by query, and no good way to know the 10 most populous countries without scanning through every population figure of every country that exists in the database.
 * 2) I'm not sure how having a "best" value makes it unpredictable. It's just a form of materialized view, or an index if you will, just a bit smarter than the one the DB can provide natively, since the DB does not know our data but we do. Where does the predictability issue come from? You still have exactly the same data and get exactly the same result as if you wrote the "10 most populous" query yourself by manually sorting population data by qualifiers for each country. No difference in data will ever happen (and, of course, you can still do the manual query by completely ignoring the best values and going to the raw ones). I just propose to write part of this query for you and materialize the result, knowing the user will have to do it anyway.
 * 3) Here I see your point, but I don't think having the engine help you would prevent people from improving the data. I think, on the contrary, that having an engine that is actually useful and easy to use would make more people use it and as such be driven to improve the data feeding it.
 * 4) This is easily fixed by setting one of the values as preferred. The additional heuristic only kicks in if there is no human decision, so if any humans disagree, they can always override it. They can even have multiple preferred values, if desired. In any case, that'd be better than having the US population simultaneously being 50 mln, 150 mln and 300 mln - I don't see any context in which that would be of any practical use.
 * 5) Well, I'm new here, but I'm not sure why having an index that helps to optimize for the common case would be contrary to wiki values. Of course, if we expect the preferred issue to be fixed by the community before the system goes to any production use, then the whole issue is irrelevant and we don't need any heuristics - we can just consider the preferred values. But if we expect it to be useful on data that is not cleaned up yet, I think it still can be useful.

In any case, the "best value" part is not integral to the rest of the model, so I'll work on the rest and we'll see what we do with it and if we need it at all after we have the rest of it. --Smalyshev (WMF) (talk) 20:46, 5 December 2014 (UTC)
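
For comparison, the fallback behavior argued for in the response can be sketched like this. It is a rough illustration only, and it assumes the heuristic is "the newest value is the best", as mentioned in point 5 of the criticism. The `DatedClaim` type, its `as_of` field (standing in for a date qualifier), and `materialized_best` are hypothetical names, not the actual proposal's implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DatedClaim:
    """Hypothetical claim with a rank and an optional date qualifier."""
    value: int
    rank: str = "normal"          # "preferred", "normal" or "deprecated"
    as_of: Optional[int] = None   # assumed: year taken from a date qualifier


def materialized_best(claims: List[DatedClaim]) -> List[DatedClaim]:
    """Pick "best" values, falling back to the newest dated normal claim."""
    preferred = [c for c in claims if c.rank == "preferred"]
    if preferred:
        # A human (or bot) decision exists, so the heuristic is not applied.
        return preferred
    dated = [c for c in claims if c.rank == "normal" and c.as_of is not None]
    if dated:
        newest = max(c.as_of for c in dated)
        return [c for c in dated if c.as_of == newest]
    # No preferred claims and nothing to sort by: all normal claims remain.
    return [c for c in claims if c.rank == "normal"]
```

The only difference from the spec-only reading is the middle branch: with no preferred claims, several dated normal values collapse to the newest one instead of all being returned.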

Multiple Qualifiers

 * Note that this assumes each qualifier will be present only once. Wikibase allows multiple qualifiers with the same property. We need a different solution, but since it is preferable to have these data indexable, we should not be using complex structures here.
 * Could you provide an example with the same qualifier used more than once on the same claim? I'd like to see the semantics of it to figure out a different solution. There are a number of options that could still leave it indexable, but I'm not sure how that works, so I'd need some examples. --Smalyshev (WMF) (talk) 21:13, 5 December 2014 (UTC)
 * Found such a case: https://www.wikidata.org/wiki/Q801 - "head of state" has multiple start/ends for some. Not sure yet how to handle it, as starts and ends should match, and that means they need to be kept in an ordered structure. Titan has multivalues, but it looks like only on vertices, not on edges. Will check further. --Smalyshev (WMF) (talk) 07:45, 6 December 2014 (UTC)
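
The start/end matching problem can be illustrated with a small sketch. It assumes, purely for illustration, that repeated qualifiers are kept as ordered lists and that the i-th start belongs to the i-th end; nothing in this discussion confirms that Wikibase guarantees this pairing. The function name and the `"start"`/`"end"` keys are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def paired_intervals(
    qualifiers: Dict[str, List[int]],
    start_key: str = "start",
    end_key: str = "end",
) -> List[Tuple[int, Optional[int]]]:
    """Pair repeated start/end qualifier values by position.

    A start without a matching end yields an open interval (end is None).
    """
    starts = qualifiers.get(start_key, [])
    ends = qualifiers.get(end_key, [])
    intervals = []
    for i, start in enumerate(starts):
        end = ends[i] if i < len(ends) else None
        intervals.append((start, end))
    return intervals
```

If the order of the two lists is lost, for example by storing each qualifier as an unordered multivalue, this pairing can no longer be reconstructed, which is why an ordered structure is needed.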

Qualifiers as properties or edges

 * Qualifiers can reference other items. This should be modeled as an edge, but it's not possible to attach an edge to an edge. To allow this, qualifiers would need to be nodes in their own right.
 * As far as I can see, the query engine makes it very easy to go from a string to the vertex named after that string with a transform clause, so not having edges on edges won't be a problem for querying. Modeling a qualifier as a vertex might be a problem, though, since the qualifier is attached to a claim, which is now an edge. Given the multiple qualifiers issue above, we may have to convert claims to vertices too. If claims are vertices, qualifiers can be edges or vertices too. Not sure what is best, since this will produce a lot more edges and may impact performance. --Smalyshev (WMF) (talk) 21:13, 5 December 2014 (UTC)
 * Actually, Titan can attach a special kind of edge to an edge - http://s3.thinkaurelius.com/docs/titan/0.5.2/advanced-schema.html - so maybe qualifiers can be made to work this way. This is a Titan-specific feature, so we need to check how it influences querying in Gremlin. --Smalyshev (WMF) (talk) 21:28, 7 December 2014 (UTC)

Importing deprecated data

 * Although deprecated statements will probably not be queried that often, we should try to import and index all data.
 * We can import deprecated data, but if we want to avoid putting an "and exclude deprecated data" condition on every clause of every query, we should probably store it somewhere separate - for example with edges marked 'P31_deprecated' or something like that, so it won't be part of regular queries. Would that work? We can have DSL clauses that say "include deprecated" in the query language, but I think we don't want to make users explicitly exclude deprecated data in default queries. Which is why I think that if we want to keep deprecated data, we should separate it from the regular data. --Smalyshev (WMF) (talk) 21:19, 5 December 2014 (UTC)
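
The separate-label idea can be sketched as below. This is a minimal illustration, not a schema decision: the `_deprecated` suffix follows the 'P31_deprecated' example above, while `edge_label`, `labels_for_query` and the `include_deprecated` flag are hypothetical names.

```python
from typing import List


def edge_label(prop_id: str, rank: str) -> str:
    """Choose the edge label under which a claim is stored at import time."""
    # Deprecated claims get their own label so default queries never see them.
    return f"{prop_id}_deprecated" if rank == "deprecated" else prop_id


def labels_for_query(prop_id: str, include_deprecated: bool = False) -> List[str]:
    """Return the edge labels a query on prop_id should traverse."""
    labels = [prop_id]
    if include_deprecated:
        # Opt-in only, mirroring an "include deprecated" DSL clause.
        labels.append(f"{prop_id}_deprecated")
    return labels
```

With this layout a default query on P31 traverses only the 'P31' label, and deprecated data is reached only when the caller asks for it.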