Wikibase/Indexing/Data Model

This page describes the proposed data model for representing Wikibase data in a graph database, such as a Gremlin-compatible database.

Note: this is a draft, not a final solution, and is subject to change without any notice at any moment.

Terminology
The data model is based on the Wikibase Data Model, and terms such as "item", "claim" and "property" are derived from that data model. Please keep the terminology compatible with the Wikibase glossary.

Vertex names are listed in italic. Property names are listed in bold.

TBD: make sure to clearly distinguish between Titan properties (key/values) on edges and vertices, and Wikibase Property nodes, and Wikibase Statements about specific Wikibase Properties.

Vertices
Each data item (this pertains to both Q and P items) is represented as a vertex in the graph. Each item has a unique vertex, which has a property wikibaseId holding its Wikibase ID. It also has properties named labelLNG, where LNG is the language code (such as labelEn for the English label), and may have more edges as necessary.
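As a rough illustration, the vertex properties could be derived from an entity record as sketched below. The input layout follows the Wikibase JSON format (`id`, `labels`); the function name `vertex_properties` and the way language codes are capitalised are assumptions made for this example, not part of the proposal.

```python
def vertex_properties(entity):
    """Build the key/value properties for the vertex of one entity (Q or P)."""
    props = {"wikibaseId": entity["id"]}
    for lang, label in entity.get("labels", {}).items():
        # "en" -> "labelEn". How codes such as "zh-hans" should be spelled
        # is not defined by the draft; this rule is an assumption.
        props["label" + lang[:1].upper() + lang[1:]] = label["value"]
    return props


usa = {"id": "Q30", "labels": {"en": {"language": "en", "value": "USA"}}}
print(vertex_properties(usa))
```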

TBD: define which other properties we want on vertices. Descriptions? Aliases? Wikilinks?

Edges
Claims are represented by edges originating from the vertex owning the claim. The target vertex depends on the specific claim (see below). Additional data describing the claim, such as qualifiers, references and ranks, can be attached to the edge.

Representing link between items
A link between two items is represented as an edge going from the owning item to the claimed item, with the edge label being the property represented by the claim.

For example, if "USA" (Q30) is claimed to be "instance of" (P31) a "country" (Q6256) this produces an edge from vertex Q30 to vertex Q6256 labeled P31.
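The example above can be written down as a small sketch. The `Edge` type and the function `item_link_edge` are hypothetical names used only for illustration; a real import would create the edge through the graph database's own API.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    source: str  # wikibaseId of the vertex owning the claim
    label: str   # property represented by the claim
    target: str  # wikibaseId of the claimed item


def item_link_edge(owner_id, property_id, value_item_id):
    """Edge for a claim whose value is another item."""
    return Edge(source=owner_id, label=property_id, target=value_item_id)


# "USA" (Q30) is an "instance of" (P31) "country" (Q6256).
edge = item_link_edge("Q30", "P31", "Q6256")
print(edge)
```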

Representing property value
A property value, i.e. a claim having a non-item value such as a string, number, date or location, is represented by an edge from the owning item to the vertex representing the property, with the edge label being the property represented by the claim. The data value is stored in a property of the edge named after the claim property with the suffix value.

For example, if "USA" (Q30) is claimed to have a population (P1082) of 318,697,314 people, this produces an edge from vertex Q30 to vertex P1082 labeled P1082 and having P1082value property of 318,697,314.

Using a separate value property name for each Wikibase property makes it possible to create a typed index (such as a fulltext or geospatial index) on the values.
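A minimal sketch of such a value edge, using a plain dictionary in place of a real graph edge. The function name `value_edge` and the dictionary layout are assumptions for illustration only.

```python
def value_edge(owner_id, property_id, value):
    """Edge for a claim whose value is not an item."""
    return {
        "source": owner_id,
        "label": property_id,
        "target": property_id,  # the vertex representing the property itself
        # One key per Wikibase property, so each key can get its own typed index.
        "properties": {property_id + "value": value},
    }


# "USA" (Q30) has a population (P1082) of 318,697,314.
population = value_edge("Q30", "P1082", 318697314)
print(population)
```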

Representing data types
The data for the values is stored as presented in imported data, with the exception of the following types which are processed:
 * globe-coordinate - is represented in Titan as Geoshape object. Note that this is a Titan-specific representation which may need to be changed for other backends.
 * monolingualtext - is currently represented as "language:text" string. We may want to seek better representation.
 * time - is represented as Java Date value. TBD: Note that currently values not representable as Java Date are stored as "somevalue".
 * quantity - only the amount is stored

TBD: time and quantity are not exact values - both have a "main" value and an uncertainty interval. Without that interval, quantities would have to match to 127 decimal points, and times would have to match to the second. If we do not represent the uncertainty intervals, queries become impractical. For globe-coordinate this would be handled by a circular Geoshape with the diameter derived from the globe-coordinate's precision.
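The conversions listed above could look roughly like the following sketch. It assumes values arrive in the Wikibase JSON layout (`text`/`language`, `amount`, `time`), and it uses Python's `datetime` as a stand-in for Java Date, so the set of values that fall back to "somevalue" here is not necessarily the same as in a Java import. The globe-coordinate case is left out because it depends on Titan's Geoshape. The function name `stored_value` is hypothetical.

```python
from datetime import datetime
from decimal import Decimal


def stored_value(datatype, value):
    """Convert an imported value to the form stored on the edge."""
    if datatype == "monolingualtext":
        return value["language"] + ":" + value["text"]
    if datatype == "quantity":
        # Only the amount is kept; bounds and unit are dropped.
        return Decimal(value["amount"])
    if datatype == "time":
        try:
            return datetime.strptime(value["time"].lstrip("+"), "%Y-%m-%dT%H:%M:%SZ")
        except ValueError:
            # Not representable as a date in this sketch.
            return "somevalue"
    if datatype == "globe-coordinate":
        raise NotImplementedError("backend-specific (Geoshape in Titan)")
    return value  # all other types are stored as imported


print(stored_value("monolingualtext", {"language": "en", "text": "hello"}))
print(stored_value("quantity", {"amount": "+318697314"}))
```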

Representing pseudo-values
The pseudo-values "novalue" and "somevalue" are represented as edges to the special vertices novalue and unknown, respectively. The value property of the edge is set to null. This way the property can still have a typed index without mixing data values with placeholder values.
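A sketch of how a pseudo-value claim might be turned into an edge. The function name, the dictionary layout and the choice of Q30/P1082 in the usage line are purely illustrative assumptions; only the vertex names novalue and unknown come from the text above.

```python
# Pseudo-value -> name of the special target vertex.
SPECIAL_VERTICES = {"novalue": "novalue", "somevalue": "unknown"}


def pseudo_value_edge(owner_id, property_id, pseudo_value):
    """Edge for a claim whose value is "novalue" or "somevalue"."""
    return {
        "source": owner_id,
        "label": property_id,
        "target": SPECIAL_VERTICES[pseudo_value],
        # Null value: the typed index never sees a placeholder string.
        "properties": {property_id + "value": None},
    }


# Hypothetical claim, for illustration only.
edge = pseudo_value_edge("Q30", "P1082", "somevalue")
print(edge)
```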

Representing ranks
The rank of the claim is represented by the property rank of the edge. The claims with rank "deprecated" are ignored on import and are not represented in the indexing data set.
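In code, the import rule for ranks might look like this sketch. The claim layout (`property`, `rank`) and the function name `edges_with_rank` are assumptions; the rank values are the three defined by Wikibase.

```python
def edges_with_rank(claims):
    """Create one edge stub per claim, skipping deprecated claims."""
    edges = []
    for claim in claims:
        if claim["rank"] == "deprecated":
            continue  # not represented in the indexing data set
        edges.append({"label": claim["property"],
                      "properties": {"rank": claim["rank"]}})
    return edges


claims = [
    {"property": "P1082", "rank": "preferred"},
    {"property": "P1082", "rank": "normal"},
    {"property": "P1082", "rank": "deprecated"},
]
print(edges_with_rank(claims))
```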

Representing qualifiers
The qualifiers for the claim are represented by properties of the edge named after the qualifier property, with a "q" suffix.

For example, the qualifier "point in time" (P585) will be represented by the edge property P585q.
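A sketch of the naming rule, assuming qualifiers arrive as a mapping from property ID to a list of values (a simplification of the Wikibase JSON layout). The function name is hypothetical, the date in the usage line is made up for illustration, and the sketch simply refuses repeated qualifiers, since the draft does not yet say how to store them.

```python
from datetime import date


def qualifier_properties(qualifiers):
    """Map qualifiers to edge properties named <property ID> + "q"."""
    props = {}
    for property_id, values in qualifiers.items():
        if len(values) != 1:
            # The model assumes each qualifier is present only once.
            raise ValueError("repeated qualifier: " + property_id)
        props[property_id + "q"] = values[0]
    return props


# Illustrative value only.
print(qualifier_properties({"P585": [date(2014, 1, 1)]}))
```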

TBD: Note that this assumes each qualifier will be present only once. If this is not true, we need a different solution, but since it is preferable to have these data indexable, we should not be using complex structures here.

TBD: Qualifiers can reference other items. This should be modeled as an edge, but it's not possible to attach an edge to an edge. To allow this, qualifiers would need to be nodes in their own right.

Preferred/best value representation
For some properties, there can be multiple claims, but only one value (or a subset of values) is preferred. For example, for the property "population" on the "USA" there are a number of values, but only one represents the current US population. Or for a person, there could be a number of companies they have been employed at, but only a subset of those are the current employers.

It is proposed to have additional edges for such properties, to make queries against current values easier. These edges are named like the regular property edges but with the suffix _best appended. So, for the US population the edge would be named P1082_best.

Such edges would correspond to claims that meet one of the following criteria:
 * 1) Have the rank "preferred". If one of the claims has this rank, only ranking is considered and the following criteria are ignored.
 * 2) If the claims on the property have the "point in time" (P585) qualifier, the latest value is chosen as the "best".
 * 3) If the claims on the property have "start date" (P580) and "end date" (P582) qualifiers, the value that has a start date in the past and either has no end date or has one in the future is chosen.

If there are no preferred ranks and no suitable qualifiers on any of the claims, then no values are considered to be the "best". If some of the claims have the qualifiers and some do not, we consider the data with no time point defined as being one of the "best" values if there are any other "best" values; otherwise there are no "best" values. If there are values that fit both criteria 2 and 3, the start date is compared to the point in time value, and the latest one is considered the "best".

If there are no values that fit the "best" criteria, no additional edges are created.

Note that multiple values can be chosen as "best", and the "best" values are always a subset of all the property values represented by the regular edges.
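The selection rules above leave some room for interpretation, so the following is only one possible reading, written as a sketch. The claim layout (`rank`, `qualifiers` holding `datetime.date` values), the function name `heuristic_best`, and all sample values other than 318,697,314 are assumptions made for the example. In particular, when criteria 2 and 3 both match, this sketch keeps only the claims carrying the single most recent date.

```python
from datetime import date

TIME_QUALIFIERS = ("P585", "P580", "P582")


def heuristic_best(claims, today):
    """Return the claims that would get an extra <property>_best edge."""
    def q(claim, pid):
        return claim.get("qualifiers", {}).get(pid)

    # Deprecated claims are ignored on import.
    live = [c for c in claims if c["rank"] != "deprecated"]

    # Criterion 1: a preferred rank overrides everything else.
    preferred = [c for c in live if c["rank"] == "preferred"]
    if preferred:
        return preferred

    # Criterion 2: the latest "point in time" (P585).
    points = [q(c, "P585") for c in live if q(c, "P585") is not None]
    latest = max(points, default=None)
    by_point = [c for c in live if latest is not None and q(c, "P585") == latest]

    # Criterion 3: start date not in the future, no end date or a future one.
    current = [
        c for c in live
        if q(c, "P580") is not None and q(c, "P580") <= today
        and (q(c, "P582") is None or q(c, "P582") > today)
    ]

    picked = by_point + current
    if by_point and current:
        # Both criteria matched: compare start dates with the point in time.
        newest = max([latest] + [q(c, "P580") for c in current])
        picked = [c for c in by_point if latest == newest]
        picked += [c for c in current if q(c, "P580") == newest]

    if picked:
        # Claims without time qualifiers count only alongside other "best" values.
        picked += [c for c in live
                   if all(q(c, p) is None for p in TIME_QUALIFIERS)]

    # Keep source order and drop duplicates.
    return [c for c in live if any(c is p for p in picked)]


# Illustrative claims for P1082 on Q30; dates and the second value are made up.
population_claims = [
    {"value": 318697314, "rank": "normal", "qualifiers": {"P585": date(2014, 1, 1)}},
    {"value": 300000000, "rank": "normal", "qualifiers": {"P585": date(2006, 1, 1)}},
]
best = heuristic_best(population_claims, today=date(2015, 6, 1))
print([c["value"] for c in best])
```

Each claim returned for property P1082 would then get an additional edge labeled P1082_best next to its regular P1082 edge.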

TBD: this proposal is not strictly necessary for implementing all of the above; it is a performance-oriented enhancement and a means to simplify common queries. If the data were properly ranked for all relevant values, the additional logic might not be necessary and the rank property might be enough, but given the current state of the data we may want to keep it for now.

TBD: We may also opt to replace the two edge names with edges that are named the same but have an additional parameter, such as bestValue, that would encapsulate the logic described above.

Criticism (by Daniel)
I see several issues with the heuristics for "best" values described here.
 * 1) It doesn't match the spec. The Wikibase data model defines the semantics of ranks such that for a query, only the "preferred" claims for a given property should be considered, if there are any. If there are no "preferred" claims, only the "normal" claims shall be considered. The claims that are thus defined to be relevant to queries are referred to as the "best" claims for that property.
 * 2) It leads to surprises. The graph database is intended to be used for queries, not searches. Queries have a well defined result set, which should be clearly predictable to the author of the query. Predictability is important. A search index may use heuristics to follow the actual content. A query index should have clearly defined behavior, and allow content to be modeled accordingly.
 * 3) Applying such heuristics takes away one of the main incentives to actually rank statements manually (or by bot). Explicit ranking is extremely valuable, and useful for using values in infoboxes etc. One reason we don't see many "preferred" ranks on Wikidata is that they don't have much effect yet. Once people see how ranking affects query results, this will hopefully be used a lot more. The heuristics suggested here would obscure this effect.
 * 4) The heuristics may have adverse "political" consequences. When designing the Wikibase model, we took great care to allow for competing views and contradictions. Having e.g. census data ignored because it's a year older than information from another entity may lead to confusion and even animosity (yes, people get into fights about the population of China, or Israel, or India, because it very much depends on which regions you include as territory - this is highly political stuff).
 * 5) One of the wiki principles is: avoid magic, let the community edit content. This means here: leave it to the community if, when, and where they want to apply heuristics like "the newest value is the best". They can write a bot that changes the rank accordingly, with a record in the history, discussions on the wiki, etc.
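For comparison, the rank-only rule from point 1 needs no qualifier logic at all. A sketch, assuming each claim is a dictionary with a `rank` key (the function name is hypothetical):

```python
def best_claims_by_rank(claims):
    """"Best" claims as described in point 1: preferred if any, else normal."""
    preferred = [c for c in claims if c["rank"] == "preferred"]
    if preferred:
        return preferred
    return [c for c in claims if c["rank"] == "normal"]


claims = [
    {"value": "a", "rank": "normal"},
    {"value": "b", "rank": "preferred"},
    {"value": "c", "rank": "deprecated"},
]
print(best_claims_by_rank(claims))
```

With this rule the result depends only on ranks that editors or bots have set explicitly, which is what makes it predictable for query authors.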

Illustration
Here is an example (partial) representation of the vertex "USA" and its properties: