Talk:Wikibase/Indexing/Data Model

Criticism (by Daniel)
I see several issues with the heuristics for "best" values described here.
 * 1) It doesn't match the spec. The Wikibase data model defines the semantics of ranks such that, for a query, only the "preferred" claims for a given property should be considered, if there are any. If there are no "preferred" claims, only the "normal" claims shall be considered. The claims thus defined as relevant to queries are referred to as the "best" claims for that property.
 * 2) It leads to surprises. The graph database is intended to be used for queries, not searches. Queries have a well-defined result set, which should be clearly predictable to the author of the query. Predictability is important; a search index may use heuristics to follow the actual content. A query index should have clearly defined behavior, and allow content to be modeled accordingly.
 * 3) Applying such heuristics takes away one of the main incentives to actually rank statements manually (or by bot). Explicit ranking is extremely valuable, and useful for using values in infoboxes etc. One reason we don't see many "preferred" ranks on Wikidata is that they don't have much effect yet. Once people see how ranking affects query results, it will hopefully be used a lot more. The heuristics suggested here would obscure this effect.
 * 4) The heuristics may have adverse "political" consequences. When designing the Wikibase model, we took great care to allow for competing views and contradictions. Having e.g. census data ignored because it's a year older than information from another entity may lead to confusion and even animosity (yes, people get into fights about the population of China, or Israel, or India, because it very much depends on which regions you include as territory - this is highly political stuff).
 * 5) One of the wiki principles is: avoid magic, let the community edit content. This means here: leave it to the community if, when, and where they want to apply heuristics like "the newest value is the best". They can write a bot that changes the rank accordingly, with a record in the history, discussions on the wiki, etc.
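
For illustration, the rank semantics described in point 1 can be sketched as follows. This is only a minimal sketch: the dictionary layout (`value`, `rank`) and the function name `best_claims` are assumptions made for the example, not the actual Wikibase API.

```python
from typing import Dict, List


def best_claims(claims: List[Dict]) -> List[Dict]:
    """Return the "best" claims of one property on one entity.

    Preferred claims win if there are any; otherwise the normal claims
    are used. Deprecated claims are never part of the result.
    """
    preferred = [c for c in claims if c["rank"] == "preferred"]
    if preferred:
        return preferred
    return [c for c in claims if c["rank"] == "normal"]


# Hypothetical population claims, used only to exercise the function.
population = [
    {"value": 50_000_000, "rank": "normal"},
    {"value": 150_000_000, "rank": "deprecated"},
    {"value": 300_000_000, "rank": "preferred"},
]
```

With this data, `best_claims(population)` yields only the preferred claim; without a preferred claim, every normal claim is returned and no further selection is made.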

Stas's response

 * 1) I think we still comply with the semantics, since if "preferred" is present we will consider just those value(s), and only those. If it is not present, I think we should not ignore the fact that right now we have no way to know the US population, at least by query, and no good way to find the 10 most populous countries without scanning through every population figure of every country in the database.
 * 2) I'm not sure how having a "best" value makes it unpredictable. It's just a form of materialized view, or an index if you will, just a somewhat smarter one than the DB can provide natively, since the DB does not know our data but we do. Where does the predictability issue come from? You still have exactly the same data and get exactly the same result as if you wrote the "10 most populous" query yourself by manually sorting population data by qualifiers for each country. No difference in data will ever happen (and, of course, you can still do the manual query by completely ignoring the best values and going to the raw ones). I just propose to write part of this query for you and materialize the result, knowing the user will have to do it anyway.
 * 3) Here I see your point, but I don't think having the engine help you would prevent people from improving the data. I think, on the contrary, that having an engine that is actually useful and easy to use would make more people use it and so be driven to improve the data feeding it.
 * 4) This is easily fixed by setting one of the values as preferred. The additional heuristic only kicks in if there is no human decision, so if any humans disagree, they can always override it. Even have multiple preferred values, if desired. In any case, that'd be better than having US population simultaneously being 50 mln, 150 mln and 300 mln - I don't see any context in which that would be of any practical use.
 * 5) Well, I'm new here, but I'm not sure why having an index that helps optimize for the common case would be contrary to wiki values. Of course, if we expect the preferred issue to be fixed by the community before the system goes to any production use, then the whole issue is irrelevant and we don't need any heuristics - we can just consider the preferred values. But if we expect the system to be used on data that is not cleaned up yet, I think the heuristic can still be useful.

In any case, the "best value" part is not integral to the rest of the model, so I'll work on the rest and we'll see what we do with it and if we need it at all after we have the rest of it. --Smalyshev (WMF) (talk) 20:46, 5 December 2014 (UTC)
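
To make the proposed heuristic concrete, here is a rough sketch of "preferred first, otherwise the newest normal value". The qualifier key `point_in_time`, the use of plain years, and the function name are all assumptions for the example; the actual proposal may select values differently.

```python
from typing import Dict, List


def best_or_latest(claims: List[Dict], time_key: str = "point_in_time") -> List[Dict]:
    """Preferred claims if any; otherwise the most recent normal claim(s).

    The heuristic part only applies when no claim is ranked "preferred".
    """
    preferred = [c for c in claims if c["rank"] == "preferred"]
    if preferred:
        return preferred

    normal = [c for c in claims if c["rank"] == "normal"]
    dated = [c for c in normal if time_key in c.get("qualifiers", {})]
    if not dated:
        # Nothing to sort by: fall back to all normal claims.
        return normal

    latest = max(c["qualifiers"][time_key] for c in dated)
    return [c for c in dated if c["qualifiers"][time_key] == latest]


# Hypothetical figures, dated by year.
unranked = [
    {"value": 50_000_000, "rank": "normal", "qualifiers": {"point_in_time": 1880}},
    {"value": 150_000_000, "rank": "normal", "qualifiers": {"point_in_time": 1950}},
    {"value": 300_000_000, "rank": "normal", "qualifiers": {"point_in_time": 2010}},
]
```

Here `best_or_latest(unranked)` returns only the 2010 figure. Marking any one claim as preferred bypasses the date comparison entirely, which is the override described in point 4.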


 * I think this is mostly a matter of perspective and priority: to you (I suppose) the most important thing is to have something that returns useful results asap, for use by Grok and others. For me it's more important to be consistent with our data model, and integrate community processes, even if it takes a couple of months longer that way. I think this needs discussion on the product level, it's not just an engineering decision. -- Daniel Kinzler (WMDE) (talk) 18:27, 8 December 2014 (UTC)
 * As far as I understood the queries that Wikigrok would need, they would not involve properties where such a heuristically best value would be possible (like place of birth, nationality). Additionally most of the Wikigrok examples involve things that have no claim (of any rank) for a Property, but another Property of a specific value (rank preferred or normal). (Example: no alma mater and instance of human; for each of those humans make a pass over the linked Wikipedia articles to get the Wiki links; for each of those links check if they refer to a Wikidata.org item that is instance of University. See Extension:MobileFrontend/WikiGrok/Claim_suggestions.) --Jan Zerebecki 19:29, 8 December 2014 (UTC)
 * Btw, when I referred to the "wiki principle", I wasn't referring to Wikipedia values, but rather to the more general principle of wikis: everything is editable, nothing is automatic. Of course we could make Wikipedia's "featured article" on the frontpage update automatically by writing a MediaWiki extension to do it. Or implement a workflow for article deletion discussions in software. But we never did, for good reasons. -- Daniel Kinzler (WMDE) (talk) 18:27, 8 December 2014 (UTC)
 * And yea, the whole idea of Wikidata is kind of against the "nothing is automatic" thing. But we do try to avoid magic under the hood. -- Daniel Kinzler (WMDE) (talk) 18:29, 8 December 2014 (UTC)
 * So, for now I am implementing runtime preferred, latest and current clauses that allow applying the heuristic parts to the query at runtime. If that proves to be a big performance hurdle, we'll revisit optimizing those with e.g. additional edges on import. --Smalyshev (WMF) (talk) 21:26, 11 December 2014 (UTC)

Multiple Qualifiers

 * Note that this assumes each qualifier will be present only once. Wikibase allows multiple qualifiers with the same property. We need a different solution, but since it is preferable to have these data indexable, we should not be using complex structures here.
 * Could you provide an example with the same qualifier used more than once on the same claim? I'd like to see the semantics of it to figure out a different solution. There are a number of options that could still leave it indexable, but I'm not sure how that works, so I'd need some examples. --Smalyshev (WMF) (talk) 21:13, 5 December 2014 (UTC)
 * Found such a case: https://www.wikidata.org/wiki/Q801 - "head of state" has multiple starts/ends for some. Not sure yet how to handle it, as starts and ends should match, which means they need to be kept in an ordered structure. Titan has multivalues, but it looks like only on vertices, not on edges. Will check further. --Smalyshev (WMF) (talk) 07:45, 6 December 2014 (UTC)
 * That would be better handled as separate statements, I think. But of course, we still need to decide how to handle such a thing when we encounter it. I think picking the first one, and listing statements with multi-value qualifiers, would be sensible, unless we find a legit use case for this. I can't think of any, but the software doesn't make assumptions about it. -- Daniel Kinzler (WMDE) (talk) 18:18, 8 December 2014 (UTC)
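
A minimal sketch of the "pick the first one and list the offenders" approach suggested above. The input layout (qualifier property mapped to a list of values) and all names are assumptions for illustration only.

```python
from typing import Any, Dict, List, Tuple


def flatten_qualifiers(
    statement_id: str, qualifiers: Dict[str, List[Any]]
) -> Tuple[Dict[str, Any], List[Tuple[str, str, int]]]:
    """Keep the first value per qualifier property.

    Returns the flat (indexable) mapping plus a report of
    (statement id, property, value count) for every qualifier
    that carried more than one value.
    """
    flat: Dict[str, Any] = {}
    multi: List[Tuple[str, str, int]] = []
    for prop, values in qualifiers.items():
        if not values:
            continue
        flat[prop] = values[0]
        if len(values) > 1:
            multi.append((statement_id, prop, len(values)))
    return flat, multi


# Hypothetical statement with two start/end pairs, as in the Q801 case.
flat, report = flatten_qualifiers(
    "stmt-1", {"start": [1990, 2000], "end": [1995, 2005], "role": ["acting"]}
)
```

The report can then be used to find statements that would be better split into separate ones.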

Qualifiers as properties or edges

 * Qualifiers can reference other items. This should be modeled as an edge, but it's not possible to attach an edge to an edge. To allow this, qualifiers would need to be nodes in their own right.
 * The query engine makes it very easy to go from a string to the vertex named after that string with a transform clause, as far as I can see, so not having edges on edges won't be a problem for querying. Modeling a qualifier as a vertex might be a problem, though, since a qualifier is attached to a claim, which is now an edge. Given the multiple qualifiers issue above, we may have to convert claims to vertices too. If claims are vertices, qualifiers can be edges or vertices too. Not sure what is best, since this will produce a lot more edges and may impact performance. --Smalyshev (WMF) (talk) 21:13, 5 December 2014 (UTC)
 * Actually, Titan can attach a special kind of edge to an edge - http://s3.thinkaurelius.com/docs/titan/0.5.2/advanced-schema.html so maybe qualifiers can be made to work this way. This is a Titan-specific feature, so we need to check how it influences querying in Gremlin. --Smalyshev (WMF) (talk) 21:28, 7 December 2014 (UTC)
 * I think this is an issue in principle, but not much of one in practice. Can't think of a use case that wouldn't be covered by a string match off-hand. -- Daniel Kinzler (WMDE) (talk) 18:20, 8 December 2014 (UTC)
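
The "claims as vertices" option mentioned above could look roughly like the following sketch. It uses a toy in-memory edge list rather than Titan or Gremlin, and every identifier in it (`Graph`, `add_claim`, the `value` label, the IDs) is hypothetical.

```python
from typing import Dict, List, Tuple


class Graph:
    """Toy directed graph stored as (source, label, target) triples."""

    def __init__(self) -> None:
        self.edges: List[Tuple[str, str, str]] = []

    def add_edge(self, source: str, label: str, target: str) -> None:
        self.edges.append((source, label, target))

    def out(self, source: str, label: str) -> List[str]:
        """Targets reachable from source over edges with the given label."""
        return [t for s, l, t in self.edges if s == source and l == label]


def add_claim(
    g: Graph, claim_id: str, subject: str, prop: str, value: str,
    qualifiers: Dict[str, List[str]],
) -> None:
    """Store a claim as its own vertex so qualifiers can be ordinary edges."""
    g.add_edge(subject, prop, claim_id)   # entity -> claim vertex
    g.add_edge(claim_id, "value", value)  # claim vertex -> main value
    for qprop, qvalues in qualifiers.items():
        for qvalue in qvalues:            # repeated qualifiers are just more edges
            g.add_edge(claim_id, qprop, qvalue)


g = Graph()
add_claim(g, "claim-1", "Q1", "P1", "Q2", {"P2": ["Q3", "Q4"]})
```

The cost is one extra hop per claim plus one edge per qualifier value, which is the performance concern raised above.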

Importing deprecated data

 * Although deprecated statements will probably not be queried that often, we should try to import and index all data.
 * We can import deprecated data, but if we want to avoid putting an "and exclude deprecated data" condition on every clause of every query, we should probably store it somewhere separate - e.g. with edges marked 'P31_deprecated' or something like that, so it won't be part of regular queries. Would that work? We can have DSL clauses that say "include deprecated" in the query language, but I don't think we want to make users explicitly exclude deprecated claims in default queries. Which is why I think that if we want to keep deprecated data, we should separate it from the regular data. --Smalyshev (WMF) (talk) 21:19, 5 December 2014 (UTC)
 * I agree these should not be seen by a query that does not explicitly select deprecated terms. So, extending this to the rest of the ranks, we have P31_deprecated, P31_normal, P31_preferred, P31 (same as preferred with fallback to normal; Wikibase source code uses the term best for this, and this is what is displayed by default in templates), P31_heuristically (what you described above as best value, if someone needs this). Does this fit with what you had in mind? --Jan Zerebecki 17:54, 8 December 2014 (UTC)
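
As a sketch of the labeling scheme listed above, the following groups claim values under rank-specific edge labels and adds the plain label for the best claims. The claim layout and the function name are assumptions; the P31_heuristically label is left out because its selection rule is not settled.

```python
from typing import Dict, List


def edge_labels(prop: str, claims: List[Dict]) -> Dict[str, List[str]]:
    """Map edge label -> values, e.g. P31_normal, P31_deprecated, P31.

    The unsuffixed label holds the best claims: preferred ones if any,
    otherwise the normal ones. Deprecated values never reach it.
    """
    edges: Dict[str, List[str]] = {}
    for claim in claims:
        edges.setdefault(f"{prop}_{claim['rank']}", []).append(claim["value"])

    best = edges.get(f"{prop}_preferred") or edges.get(f"{prop}_normal") or []
    if best:
        edges[prop] = list(best)
    return edges


# Hypothetical claims for one entity.
labels = edge_labels("P31", [
    {"value": "Q5", "rank": "normal"},
    {"value": "Q6", "rank": "deprecated"},
])
```

A default query then only follows `P31`, while a query that wants deprecated data asks for `P31_deprecated` explicitly.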

Representing inexact values

 * time and quantity are not exact values - both have a "main" value and an uncertainty interval. Without that interval, quantities would have to match to 127 decimal points, and times would have to match to the second. If we do not represent the uncertainty intervals, queries become impractical. For globe-coordinate this would be handled by a circular Geoshape with the diameter derived from the globe-coordinate's precision.
 * I think we will import all the value parts as additional value parts, and this way they can be queried against or otherwise used too. I'll update the spec accordingly soon. --Smalyshev (WMF) (talk) 18:38, 8 December 2014 (UTC)
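
To illustrate why the uncertainty interval matters for matching, here is a minimal sketch in which two quantities are treated as possibly equal when their intervals overlap. The class and field names are assumptions for the example and do not reflect how the value parts will actually be stored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quantity:
    """A quantity with its main value and uncertainty interval."""
    amount: float
    lower: float
    upper: float


def may_equal(a: Quantity, b: Quantity) -> bool:
    """True if the uncertainty intervals of a and b overlap."""
    return a.lower <= b.upper and b.lower <= a.upper


# Hypothetical measurements.
rough = Quantity(amount=100.0, lower=95.0, upper=105.0)
close = Quantity(amount=104.0, lower=103.0, upper=105.0)
far = Quantity(amount=110.0, lower=109.0, upper=111.0)
```

Comparing only `amount` would report `rough` and `close` as different, while the interval test treats them as compatible and still rejects `far`.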