Wikibase/Indexing/Data Model

This page describes the proposed data model for representing Wikibase data in a graph database, such as a Gremlin-compatible database.

Note: this is a draft, not a final solution, and is subject to change at any time without notice.

Terminology
Vertex names are listed as italic. Property names are listed as bold.

Vertices
Each data item (this pertains to both Q and P items) is represented as a vertex in the graph. Each item has a unique vertex, which has the property wikibaseId holding its Wikibase ID. It also has properties named labelLNG, where LNG is the language code (such as labelEn for the English label), and may have more edges as necessary.
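The vertex layout described above can be sketched with plain Python dictionaries standing in for graph vertices. This is only an illustration, not the importer's code: the function name make_vertex and the capitalization rule for language codes are assumptions.

```python
def make_vertex(wikibase_id, labels):
    """Build a vertex as a dict of properties (a stand-in for a graph vertex)."""
    vertex = {"wikibaseId": wikibase_id}
    for lang, text in labels.items():
        # Assumption: "en" becomes "labelEn", following the labelLNG convention.
        vertex["label" + lang.capitalize()] = text
    return vertex


# Hypothetical usage: the vertex for "USA" (Q30) with two labels.
usa = make_vertex("Q30", {"en": "United States of America", "de": "Vereinigte Staaten"})
```

How language codes containing a hyphen should map to property names is not specified on this page, so the sketch does not attempt to handle them specially.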

Edges
Claims are represented by edges originating from the vertex owning the claim. The target vertex depends on the specific claim (see below). Additional data describing the claim - such as qualifiers, references, ranks, etc. - can be attached to the edge.

Representing a link between items
A link between two items is represented as an edge going from the owning item to the claimed item, with the edge label being the property represented by the claim.

For example, if "USA" (Q30) is claimed to be an "instance of" (P31) "country" (Q6256), this produces an edge from vertex Q30 to vertex Q6256 labeled P31.
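As a rough sketch, the same edge can be written as a plain Python dictionary. The field names source, label, target and properties are illustrative assumptions, not part of the proposed model.

```python
def item_link_edge(source_id, property_id, target_id):
    """Edge for a claim whose value is another item."""
    return {
        "source": source_id,    # vertex owning the claim
        "label": property_id,   # the claim's property is the edge label
        "target": target_id,    # the claimed item
        "properties": {},       # room for rank, qualifiers, etc.
    }


# "USA" (Q30) is an "instance of" (P31) "country" (Q6256).
edge = item_link_edge("Q30", "P31", "Q6256")
```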

Representing property value
A property value - i.e., a claim having a non-item value, such as a string, number, date, location, etc. - is represented by an edge from the owning item to the vertex representing the property, with the edge label being the property represented by the claim. The data value is stored in an edge property named after the claim property with the suffix value.

For example, if "USA" (Q30) is claimed to have a population (P1082) of 318,697,314 people, this produces an edge from vertex Q30 to vertex P1082, labeled P1082 and having a P1082value property of 318,697,314.
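The population example can be sketched the same way, again with a plain dictionary whose field names are assumptions made for illustration. Note that the edge points at the property's own vertex and carries the value in a property named after it.

```python
def value_edge(source_id, property_id, value):
    """Edge for a claim with a non-item value (string, number, date, ...)."""
    return {
        "source": source_id,
        "label": property_id,
        "target": property_id,  # the vertex representing the property itself
        # The value lives in "<property>value", e.g. "P1082value".
        "properties": {property_id + "value": value},
    }


# "USA" (Q30) has a population (P1082) of 318,697,314.
edge = value_edge("Q30", "P1082", 318697314)
```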

Giving the value property a separate name for each claim property makes it possible to create a typed index (such as a fulltext or geospatial index) on the values.

Representing data types
The data for the values is stored as presented in the imported data, with the exception of the following types, which are processed:
 * globe-coordinate - is represented in Titan as a Geoshape object. Note that this is a Titan-specific representation which may need to be changed for other backends.
 * monolingualtext - is currently represented as a "language:text" string. We may want to seek a better representation.
 * time - is represented as a Java Date value. TBD: Note that currently values not representable as a Java Date are stored as "somevalue".
 * quantity - only the amount is stored.
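The conversions listed above might look roughly like the following Python sketch. It makes several assumptions: the input dictionaries follow the shape of the Wikibase JSON export (keys language, text, amount, time), Python's datetime stands in for Java Date (the two do not cover the same range of dates), and the amount is parsed into a Decimal. The Titan Geoshape conversion is backend-specific and is left out, so globe-coordinate values pass through unchanged here.

```python
from datetime import datetime
from decimal import Decimal


def convert_value(datatype, value):
    """Convert an imported value into the form stored on the edge."""
    if datatype == "monolingualtext":
        # Stored as a single "language:text" string.
        return value["language"] + ":" + value["text"]
    if datatype == "quantity":
        # Only the amount is kept; unit and bounds are dropped.
        return Decimal(value["amount"])
    if datatype == "time":
        try:
            # Wikibase timestamps carry a leading "+", e.g. "+2014-06-15T00:00:00Z".
            return datetime.strptime(value["time"].lstrip("+"), "%Y-%m-%dT%H:%M:%SZ")
        except ValueError:
            # Values that cannot be represented as a date fall back to "somevalue".
            return "somevalue"
    # All other types are stored as presented in the imported data.
    return value
```

For instance, a timestamp with a zero month and day such as "+2014-00-00T00:00:00Z" cannot be parsed as a date in this sketch and therefore becomes "somevalue".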

Representing pseudo-values
The pseudo-values "novalue" and "somevalue" are represented as edges to the special vertices novalue and unknown, respectively. The value property of the edge is set to null. This way the property can still have a typed index without mixing data values with placeholder values.
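A minimal sketch of this mapping, using a plain dictionary for the edge; the function name and the field names are assumptions made for illustration.

```python
# Special target vertices for the two pseudo-values.
PSEUDO_VALUE_TARGETS = {"novalue": "novalue", "somevalue": "unknown"}


def pseudo_value_edge(source_id, property_id, pseudo_value):
    """Edge for a claim whose value is "novalue" or "somevalue"."""
    return {
        "source": source_id,
        "label": property_id,
        "target": PSEUDO_VALUE_TARGETS[pseudo_value],
        # The typed value property stays null (None), so no placeholder
        # value ends up in a typed index.
        "properties": {property_id + "value": None},
    }


# Hypothetical example: an item Q1 whose P1082 value is unknown.
edge = pseudo_value_edge("Q1", "P1082", "somevalue")
```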

Representing ranks
The rank of the claim is represented by the property rank of the edge. Claims with rank "deprecated" are ignored on import and are not represented in the indexing data set.
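The import rule can be sketched as a simple filter. The claim dictionaries used here (keys property, target, rank) are a simplified, hypothetical shape rather than the actual import format.

```python
def edges_from_claims(source_id, claims):
    """Turn claims into edges, skipping deprecated ones."""
    edges = []
    for claim in claims:
        if claim["rank"] == "deprecated":
            continue  # deprecated claims are not imported at all
        edges.append({
            "source": source_id,
            "label": claim["property"],
            "target": claim["target"],
            "properties": {"rank": claim["rank"]},
        })
    return edges


claims = [
    {"property": "P31", "target": "Q6256", "rank": "normal"},
    {"property": "P31", "target": "Q3624078", "rank": "preferred"},
    {"property": "P31", "target": "Q1", "rank": "deprecated"},
]
edges = edges_from_claims("Q30", claims)
```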

Representing qualifiers
The qualifiers for the claim are represented by properties of the edge, each named after the qualifier property with a "q" suffix.

For example, the qualifier "point in time" (P585) will be represented by the edge property P585q.

TBD: Note that this assumes each qualifier is present only once per claim. If this is not true, we need a different solution, but since it is preferable to have this data indexable, we should not use complex structures here.
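The qualifier naming scheme, together with the one-value-per-qualifier assumption, can be sketched as follows. Raising an error on a repeated qualifier is only a way of making that assumption visible in the sketch; it is not behavior described on this page, and the function name is hypothetical.

```python
def add_qualifiers(edge_properties, qualifiers):
    """Attach qualifiers to an edge as flat "<property>q" properties."""
    for property_id, value in qualifiers:
        key = property_id + "q"  # e.g. "P585" becomes "P585q"
        if key in edge_properties:
            # The flat scheme has room for only one value per qualifier.
            raise ValueError("qualifier %s appears more than once" % property_id)
        edge_properties[key] = value
    return edge_properties


# A population claim qualified with "point in time" (P585).
props = add_qualifiers({"P1082value": 318697314}, [("P585", "2014-01-01")])
```

Keeping qualifiers as flat scalar properties is what allows them to be indexed in the same way as the claim values.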