Wikibase/Indexing/Data Model

This page describes the proposed data model for representing Wikibase data in a graph database, such as a Gremlin-compatible database.

Note: this is a draft, not a final solution, and is subject to change at any moment without notice. Add this page to your watchlist if you want to follow the updates.

This is version 2 of the model; the previous version (v1) can be found in the page history.

Terminology
The data model is based on the Wikibase Data Model, and terms like "item", "claim", and "property" are derived from that data model. Please keep the terminology compatible with the Wikibase glossary.

To distinguish Wikibase Properties/Entities from Titan properties and other constructs, references to the Wikibase terms Property, Item, and Entity are capitalized.

Titan vertex names are listed as italic. Titan property (key) names are listed as bold.

Vertices
Each Wikibase Entity (both Q and P Entities) is represented as a vertex in the graph. Each Entity has a unique vertex, which has the property wikibaseId holding its Wikibase ID as is (e.g. Q42 or P31). The vertex also has properties named labelLNG, where LNG is the language code (such as labelEn for the English label), and may have more properties as necessary.
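As a minimal sketch of the vertex layout described above, the following Python snippet builds the property map for one Entity. The function name make_vertex and the simplified input shape (a flat language-to-label map) are assumptions made for illustration, as is the capitalization rule that turns "en" into labelEn; this is not the importer's actual code.

```python
def make_vertex(entity):
    """Build the property map for the vertex of one Wikibase Entity.

    `entity` is a simplified, hypothetical input shape:
    {"id": "Q30", "labels": {"en": "..."}}.
    """
    props = {"wikibaseId": entity["id"]}
    for lang, label in entity.get("labels", {}).items():
        # Assumed naming rule: "en" -> "labelEn"
        props["label" + lang[:1].upper() + lang[1:]] = label
    return props


usa = make_vertex({
    "id": "Q30",
    "labels": {"en": "United States of America", "de": "Vereinigte Staaten"},
})
```

Descriptions, aliases, and the other candidate properties are left out because they are not decided yet.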

TBD: define which other properties we want on vertices. Descriptions? Aliases? Sitelinks? Badges?

Claims
Claims on Entities are represented by edges. Each edge leads either to the vertex representing another Entity or to the vertex representing the Property, depending on whether the value is a link to another Entity or a scalar. The details of the representation are described below.

The claim edges have a wikibaseId property matching the Claim ID in the data set. The ID value is assumed to be constant for the lifetime of the claim, and any change to the claim is assumed to produce a different ID.

Representing link between Entities
A link between two Entities is represented as an edge going from the owning Entity to the claimed Entity, with the edge labeled by the Property of the claim.

For example, if "USA" (Q30) is claimed to be "instance of" (P31) a "country" (Q6256), this produces an edge from vertex Q30 to vertex Q6256 labeled P31.

Additionally, the claiming vertex has a set (multi-value) property named after the Property with the suffix link (in the example above, P31link), which contains the IDs of all Entities this vertex links to through that Property. This is done to speed up queries like "list of humans" or "list of countries". Note that this list does not distinguish between claims and ignores qualifiers; if you need this distinction, use the edges.
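The two structures produced by an Entity link (the labeled edge and the link set on the claiming vertex) can be sketched with a small in-memory graph. The Graph class, its method name, and the claim ID "Q30$example-1" are hypothetical and only stand in for the real graph backend.

```python
from collections import defaultdict


class Graph:
    """Toy in-memory stand-in for the graph database (illustration only)."""

    def __init__(self):
        self.vertices = defaultdict(dict)  # wikibaseId -> vertex properties
        self.edges = []                    # list of edge property maps

    def add_entity_link(self, claim_id, source, prop, target):
        # Edge from the owning Entity to the claimed Entity, labeled by the Property.
        self.edges.append(
            {"out": source, "in": target, "label": prop, "wikibaseId": claim_id}
        )
        # Multi-value "<Property>link" property on the claiming vertex.
        self.vertices[source].setdefault(prop + "link", set()).add(target)


g = Graph()
g.add_entity_link("Q30$example-1", "Q30", "P31", "Q6256")
```

A query such as "list of countries" can then filter vertices on P31link containing Q6256 without traversing any edges.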

Representing Property value
A Property value, i.e. a claim with a non-Entity value such as a string, number, date, or location, is represented by a property of the claim edge. The data value is stored in the edge property named after the claim Property with the suffix value.

For example, if "USA" (Q30) is claimed to have a population (P1082) of 318,697,314 people, this produces an edge from vertex Q30 to the vertex P1082, labeled P1082 and having a P1082value property of 318,697,314.

Giving the value properties separate per-Property names makes it possible to create a typed index (such as a full-text or geospatial index) on the values.
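The population example can be written out as an edge property map. This is only a sketch: the helper name value_claim_edge, the "out"/"in"/"label" keys, and the claim ID are illustrative assumptions, not the schema of any particular backend.

```python
def value_claim_edge(claim_id, source, prop, value):
    """Edge for a claim with a non-Entity value.

    The edge points at the Property's own vertex and carries the data
    value in the "<Property>value" edge property.
    """
    return {
        "out": source,
        "in": prop,
        "label": prop,
        "wikibaseId": claim_id,
        prop + "value": value,
    }


edge = value_claim_edge("Q30$example-2", "Q30", "P1082", 318697314)
```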

Representing data types
The data for the values is stored as presented in the imported data, with the exception of the following types, which are processed:
 * globe-coordinate - represented in Titan as a Geoshape object. Note that this is a Titan-specific representation which may need to be changed for other backends.
 * monolingualtext - currently represented as a "language:text" string. We may want to seek a better representation.
 * time - represented as a long value specifying the number of seconds since 1970-01-01 00:00 UTC. Values for years 1 AD to 291999999 AD (inclusive) are represented with per-second precision; values outside this range are represented as whole years, where a year is defined as 31557600 seconds (365.25 days).
 * quantity - only the amount is stored. TBD: currently stored as a string; we may want to find a better representation.

Along with the data value, for non-primitive types the accompanying data is stored in a separate property suffixed with _all, e.g. for Property P1082 the accompanying data is stored in P1082_all. The data is stored as a map, as it appears in the input.
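The time and monolingualtext conversions can be sketched as follows. Several details are assumptions that the text above does not specify: the function names, the input timestamp format (the "+YYYY-MM-DDThh:mm:ssZ" form used in Wikibase JSON), treating a month or day of 00 as 01, and counting out-of-range whole years relative to 1970. The day count uses a standard proleptic Gregorian calendar calculation.

```python
import re

SECONDS_PER_DAY = 86400
SECONDS_PER_YEAR = 31557600            # 365.25 days, as defined above
MIN_YEAR, MAX_YEAR = 1, 291999999      # range stored with per-second precision

_TIME_RE = re.compile(r"^([+-]?\d+)-(\d\d)-(\d\d)T(\d\d):(\d\d):(\d\d)Z$")


def days_from_civil(year, month, day):
    """Days between 1970-01-01 and a proleptic Gregorian date."""
    if month <= 2:
        year -= 1
    era = year // 400
    year_of_era = year - era * 400
    day_of_year = (153 * (month + (-3 if month > 2 else 9)) + 2) // 5 + day - 1
    day_of_era = year_of_era * 365 + year_of_era // 4 - year_of_era // 100 + day_of_year
    return era * 146097 + day_of_era - 719468


def encode_time(timestamp):
    """Convert a Wikibase-style timestamp string to seconds since the epoch."""
    match = _TIME_RE.match(timestamp)
    if match is None:
        raise ValueError("unparseable time value: %r" % timestamp)
    year, month, day, hour, minute, second = (int(g) for g in match.groups())
    if not MIN_YEAR <= year <= MAX_YEAR:
        # ASSUMPTION: whole years are counted from 1970.
        return (year - 1970) * SECONDS_PER_YEAR
    # ASSUMPTION: month/day 00 (year or month precision) is treated as 01.
    month, day = max(month, 1), max(day, 1)
    days = days_from_civil(year, month, day)
    return days * SECONDS_PER_DAY + hour * 3600 + minute * 60 + second


def encode_monolingual(value):
    """Represent a monolingualtext value as a "language:text" string."""
    return value["language"] + ":" + value["text"]
```

The sketch does not validate month or day ranges; a real importer would reject such values and fall back to the handling for unrepresentable data.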

If the data cannot be represented as described above (e.g. an invalid date or a value that cannot be parsed), it is represented as "unknown value" (see below).

Representing pseudo-values
The pseudo-values "no value" and "unknown value" are represented as edges to the special vertices novalue and unknown, respectively. The value property of the edge (see above) is left unset. This way the property can still have a typed index without mixing data values with placeholder values.
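A sketch of how an importer might route pseudo-values to the special vertices is shown below. The snak type names "novalue" and "somevalue" are those used in Wikibase JSON; the helper name claim_edge, the edge map layout, and the claim IDs are hypothetical, and the example claims are purely illustrative.

```python
# Wikibase snak type -> name of the special vertex described above
PSEUDO_VERTICES = {"novalue": "novalue", "somevalue": "unknown"}


def claim_edge(claim_id, source, prop, snaktype, value=None):
    """Build the edge for a scalar claim, handling pseudo-values."""
    edge = {"out": source, "label": prop, "wikibaseId": claim_id}
    if snaktype in PSEUDO_VERTICES:
        # Point at the special vertex and leave "<Property>value" unset,
        # so the typed index on that property only ever sees real data.
        edge["in"] = PSEUDO_VERTICES[snaktype]
    else:
        edge["in"] = prop
        edge[prop + "value"] = value
    return edge


unknown_edge = claim_edge("Q30$example-3", "Q30", "P1082", "somevalue")
regular_edge = claim_edge("Q30$example-4", "Q30", "P1082", "value", 318697314)
```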

Representing ranks
The rank of the claim is represented by the rank property of the claim edge. Claims with rank "deprecated" are currently ignored on import and are not represented in the indexing data set.

TBD: Although deprecated statements will probably not be queried that often, we should try to import and index all data.

Representing qualifiers
Qualifiers are modeled the same way as property values and are attached to the claim edge, but the qualifier value is stored with the suffix q and the accompanying data with the suffix q_all. For qualifiers linking to Entities, the Wikibase ID is stored.

For example, the qualifier "point in time" (P585) attached to the claim about the US population would produce the property P585q containing the date value and P585q_all containing the accompanying date data.

If a claim has multiple instances of the same qualifier, clones of the claim are created such that each clone has one value of the qualifier.

The exception is the pair of qualifiers P580 (start time) and P582 (end time), which are treated as a pair, i.e. for each pair of start/end times one clone is created, not two.
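The cloning rule can be sketched as below. Two points are assumptions not settled by the text: start and end times are paired by position, and clones for different qualifier Properties are combined as a cross product. The function names are hypothetical, and the q_all properties are omitted for brevity.

```python
from itertools import zip_longest


def expand(clones, extras):
    """Cross every clone with every map of extra edge properties."""
    return [{**clone, **extra} for clone in clones for extra in extras]


def clone_for_qualifiers(edge, qualifiers):
    """Return claim edge clones so that each holds one value per qualifier.

    `qualifiers` maps a Property ID to the list of its (already encoded) values.
    """
    quals = {prop: list(values) for prop, values in qualifiers.items()}
    clones = [dict(edge)]

    # P580/P582 are handled together: one clone per start/end pair.
    starts, ends = quals.pop("P580", []), quals.pop("P582", [])
    pairs = []
    for start, end in zip_longest(starts, ends):  # ASSUMPTION: paired by position
        extra = {}
        if start is not None:
            extra["P580q"] = start
        if end is not None:
            extra["P582q"] = end
        pairs.append(extra)
    if pairs:
        clones = expand(clones, pairs)

    # Every other qualifier: one clone per value.
    for prop, values in quals.items():
        if values:
            clones = expand(clones, [{prop + "q": v} for v in values])
    return clones


dated = clone_for_qualifiers({"label": "P1082"}, {"P585": [10, 20]})
ranged = clone_for_qualifiers({"label": "P108"}, {"P580": [1, 3], "P582": [2, 4]})
```

Here dated holds two clones, one per point in time, while ranged also holds two clones (one per start/end pair) rather than four.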

TBD: references (sources)

Preferred/best value representation
Note: this section proved to be controversial, so it is not yet implemented.

For some properties there can be multiple claims, but only one value (or a subset of values) is preferred. For example, for the property "population" on the "USA" there are a number of values, but only one represents the current US population. Likewise, for a person there could be a number of companies they have been employed at, but only a subset of those are the current employers.

It is proposed to have additional edges for such properties, to make queries against current values easier. These edges are named like the regular property edges but with the suffix _best appended. So for the US population the edge would be named P1082_best.

Such edges would correspond to claims that satisfy one of the following criteria:
 * 1) Have the rank "preferred". If one of the claims has this rank, only ranking is considered and the following criteria are ignored.
 * 2) If the claims on the property have the "point in time" (P585) qualifier, the latest value is chosen as the "best".
 * 3) If the claims on the property have "start date" (P580) and "end date" (P582) qualifiers, the value that has a start date in the past and either has no end date or has one in the future is chosen.

If there are no preferred ranks and no suitable qualifiers on any of the claims, then no values are considered to be the "best". If some of the claims have the qualifiers and some do not, the data with no time point defined is considered one of the "best" values if there are any other "best" values; otherwise there are no "best" values. If there are values that fit both criteria 2 and 3, the start date is compared to the point in time value, and the latest one is considered the "best".

If there are no values that fit the "best" criteria, no additional edges are created.

Note that multiple values can be chosen as "best", and the "best" values are always a subset of all the property values represented by the regular edges.
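Since this part of the proposal is not implemented, the following is only one possible reading of the selection rules, written as a sketch. Claims are plain maps using the edge property names from the sections above, time values are integers (seconds), and the function name best_claims and the now parameter are assumptions. Where a point-in-time claim and a current start/end claim both qualify, the sketch keeps whichever has the latest timestamp.

```python
TIME_QUALIFIER_KEYS = ("P585q", "P580q", "P582q")


def best_claims(claims, now):
    """Pick the "best" claims for one Property on one Entity."""
    # Criterion 1: preferred rank wins outright.
    preferred = [c for c in claims if c.get("rank") == "preferred"]
    if preferred:
        return preferred

    # Criterion 2: the claim(s) with the latest "point in time".
    dated = [c for c in claims if "P585q" in c]
    latest = max((c["P585q"] for c in dated), default=None)
    by_point = [(c["P585q"], c) for c in dated if c["P585q"] == latest]

    # Criterion 3: started in the past, no end date or one in the future.
    by_range = [
        (c["P580q"], c)
        for c in claims
        if "P580q" in c and c["P580q"] <= now
        and ("P582q" not in c or c["P582q"] > now)
    ]

    scored = by_point + by_range
    if by_point and by_range:
        # Both criteria matched: compare start date with point in time.
        newest = max(t for t, _ in scored)
        scored = [(t, c) for t, c in scored if t == newest]
    if not scored:
        return []

    best = []
    for _, claim in scored:
        if claim not in best:
            best.append(claim)
    # Claims without any time qualifier join the "best" set only if it is non-empty.
    best += [c for c in claims if not any(k in c for k in TIME_QUALIFIER_KEYS)]
    return best


population = [
    {"P1082value": 100, "P585q": 10, "rank": "normal"},
    {"P1082value": 200, "P585q": 20, "rank": "normal"},
]
current = best_claims(population, now=100)
```

Each claim returned would get an additional P1082_best edge next to its regular edge.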

TBD: this proposal is not strictly necessary for implementing all of the above; it is a performance-oriented enhancement and a means to simplify common queries. If the data were properly ranked for all relevant values, the additional logic might not be necessary and using the rank property might be enough, but given the current state of the data we may want to keep it.

TBD: We may also opt to replace the two edge names with edges of the same name that carry an additional property, such as bestValue, encapsulating the logic described above.

See also the discussion at Talk:Wikibase/Indexing/Data_Model

Illustration
Here is an example (partial) representation of the vertex "USA" and its properties:

Implementation
Current code implementing the model can be found at https://github.com/smalyshev/wikidata-gremlin/tree/titan_flat