Wikibase/Indexing/RDF Dump Format

Old version, for dumps currently being generated (until the new ontology fix lands in production), is here.

Introduction
This page describes the RDF dump and export format produced by Wikidata and used for export and indexing purposes. Note that while it is close to the format used by the Wikidata Toolkit, it is not the same code and not the same format. While we strive to keep divergence to a minimum, there may be differences, and one should rely only on the documentation for the format that is actually being consumed.

Data Model
The RDF format is based on the Wikibase data model and represents an export format for it. That means, in particular, that if/when the data model changes, the export format will be changed accordingly. This document will be updated for such changes. The following description assumes familiarity with the data model and the terminology used.

The following description uses prefixes to describe the IRIs of the RDF resources mentioned. See the Prefixes chapter for the full description. All examples below are given in Turtle format.

Header
For the RDF dump, there is a header node wikibase:Dump containing information about the license, the software version of the generator and the date the data was produced. In single-entity export, this data is attached to the data node (see below).

Example header: wikibase:Dump a schema:Dataset ; cc:license  ; schema:softwareVersion "0.0.1" ; schema:dateModified "2015-03-21T06:03:55Z"^^xsd:dateTime.

Entity representation
The entity is described in two nodes - data node and entity node. For entity Q1, the data node is wdata:Q1 and the entity node is wd:Q1.

The data node describes the metadata about the entity record in the Wikibase. It has type schema:Dataset and contains information about the entity revision, the last modification date and a link to the entity node with the schema:about predicate.

Example: wdata:Q2 schema:version "59"^^xsd:integer ; schema:dateModified "2015-03-18T22:38:36Z"^^xsd:dateTime ; a schema:Dataset ; schema:about wd:Q2.

The entity node describes the actual entity data and has type wikibase:Item or wikibase:Property depending on the kind of entity. Other entity types can be introduced in the future.

Entity description includes the following:
 * Entity labels - the main name of the entity. Labels are defined as rdfs:label, skos:prefLabel and schema:name predicates with objects being language-tagged string literals.
 * Entity aliases - the secondary names of the entity. Aliases are defined as skos:altLabel predicates with objects being language-tagged string literals.
 * Entity description - the longer description of the entity. Defined as schema:description predicates with objects being language-tagged string literals.
 * "Truthy" statements (see below)
 * Predicates linking it to full statements

Example of the entity definition: wd:Q3 a wikibase:Item ; rdfs:label "The Universe"@en ; skos:prefLabel "The Universe"@en ; schema:name "The Universe"@en ; schema:description "The Universe is big"@en ; skos:altLabel "everything"@en ; wdt:P2 wd:Q3 ; wdt:P7 "value1", "value2" ; p:P2 wds:Q3-24bf3704-4c5d-083a-9b59-1881f82b6b37, wds:Q3-45abf5ca-4ebf-eb52-ca26-811152eb067c ; p:P7 wds:Q3-4cc1f2d1-490e-c9c7-4560-46c3cce05bb7.

Properties
Entities that represent properties additionally state the property type using the wikibase:propertyType predicate.

Each property is also linked to the predicates that are derived from it. Example: wd:P22 a wikibase:Property ; rdfs:label "Item property"@en ; wikibase:propertyType wikibase:WikibaseItem ; wikibase:directClaim wdt:P22 ; wikibase:claim p:P22 ; wikibase:statementProperty ps:P22 ; wikibase:statementValue psv:P22 ; wikibase:qualifier pq:P22 ; wikibase:qualifierValue pqv:P22 ; wikibase:reference pr:P22 ; wikibase:referenceValue prv:P22.
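The mapping from a property to its derived predicates can be sketched in a few lines of Python. This is only an illustration based on the wd:P22 example above; the names DERIVED_PREFIXES and derived_predicates are hypothetical and are not part of Wikibase.

```python
# Prefix of each derived predicate family, keyed by the wikibase: predicate
# that links the property entity to it (taken from the wd:P22 example).
DERIVED_PREFIXES = {
    "wikibase:directClaim": "wdt",
    "wikibase:claim": "p",
    "wikibase:statementProperty": "ps",
    "wikibase:statementValue": "psv",
    "wikibase:qualifier": "pq",
    "wikibase:qualifierValue": "pqv",
    "wikibase:reference": "pr",
    "wikibase:referenceValue": "prv",
}


def derived_predicates(property_id):
    """Return the prefixed names of all predicates derived from a property.

    Hypothetical helper: it only concatenates prefix and property id.
    """
    return {
        link: f"{prefix}:{property_id}"
        for link, prefix in DERIVED_PREFIXES.items()
    }
```

For example, derived_predicates("P22")["wikibase:qualifier"] gives "pq:P22", matching the example triple for wd:P22.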

Statement types
The RDF format represents statements in two forms - "truthy" and full statements.

Truthy statements
Truthy statements represent values that have the best non-deprecated rank for given property. I.e., if there is a preferred statement for property P2, then only preferred statements for P2 will be considered truthy. Otherwise, all normal-rank statements for P2 are considered truthy.
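The selection rule above can be sketched as follows. This is a minimal illustration, not the code used by the dump generator; the rank strings and the tuple layout of the input are assumptions made for the example.

```python
from collections import defaultdict

# Illustrative rank labels; these are not identifiers from the dump itself.
PREFERRED, NORMAL, DEPRECATED = "preferred", "normal", "deprecated"


def truthy_statements(statements):
    """Pick the truthy (property, value) pairs.

    `statements` is an iterable of (property_id, rank, value) tuples.
    """
    by_property = defaultdict(list)
    for prop, rank, value in statements:
        by_property[prop].append((rank, value))

    truthy = []
    for prop, ranked in by_property.items():
        preferred = [value for rank, value in ranked if rank == PREFERRED]
        normal = [value for rank, value in ranked if rank == NORMAL]
        # Preferred statements win; normal ones are only a fallback.
        # Deprecated statements are never truthy.
        for value in preferred or normal:
            truthy.append((prop, value))
    return truthy


statements = [
    ("P2", NORMAL, "Q3"),
    ("P2", PREFERRED, "Q4"),
    ("P7", NORMAL, "value1"),
    ("P7", DEPRECATED, "value0"),
    ("P7", NORMAL, "value2"),
]
```

With this input, truthy_statements(statements) keeps only Q4 for P2 (the preferred statement hides the normal one) and both normal values for P7, while the deprecated value is dropped.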

Truthy statement predicates have the prefix wdt: with the property name (e.g. wdt:P2) and the object is the simple value (see below) for the statement. The qualifiers are ignored.

Full statements
Full statements represent all data about the statement in the system. A full statement is represented as a separate node, with the prefix wds: and the id of the statement (e.g. wds:Q3-24bf3704-4c5d-083a-9b59-1881f82b6b37). There is no guaranteed format or meaning to the statement id.

The statements are linked to the entity with the predicate with prefix p: and the name of the property (e.g. p:P2).

Statement representation
The statement node represents a single statement about the entity. It has type wikibase:Statement. The statement can contain the rank, the simple value (see below) of the statement, the link to the full value, the qualifiers and the references.

The statement rank is represented by the predicate wikibase:rank and the object being one of: wikibase:NormalRank, wikibase:PreferredRank, wikibase:DeprecatedRank.

The statement that has the best rank for the property (i.e., preferred if there are any preferred statements for the property, otherwise normal) is also marked with rdf:type of wikibase:BestRank.

The simple value is represented by the predicate with prefix ps: and the name of the property (e.g. ps:P2) and the object being the simple value.

The full value (if required by the type) is represented by the predicate with prefix psv: (e.g. psv:P2) and the object being the full value node.

The statement always has only one value, but can have multiple qualifiers and references.

Qualifiers
The qualifiers are represented by predicates with prefix pq: and the name of the property (e.g. pq:P2) and the object being the simple value of the qualifier.

The full value (if required by the type) is represented by the predicate with prefix pqv: (e.g. pqv:P2) and the object being the full value node.

Reference representation
References represent provenance information about statements.

A reference is represented as a node, with prefix wdref: and the label being the hash of the reference (e.g. wdref:d95dde070543a0e0115c8d5061fce6754bb82280). The same reference (i.e. a reference having the same properties with the same values) will usually be represented with one node, though duplicate reference nodes are possible in the data. The type of the node is wikibase:Reference.

The reference values are represented the same way as statement values, with simple values using predicates with the pr: prefix (e.g. pr:P7) and full values using predicates with the prv: prefix (e.g. prv:P8) and the object being the full value node. Unlike statements, references can have any number of values.

Example of the reference node: wdref:d95dde070543a0e0115c8d5061fce6754bb82280 a wikibase:Reference ; pr:P7 "Some data" ; pr:P8 "1976-01-12T00:00:00Z"^^xsd:dateTime ; prv:P8 wdv:b74072c03a5ced412a336ff213d69ef1.

Value representation
In the RDF format, values are represented in two forms - simple value and full value. The simple value is always a literal or IRI, and is used as a direct value that is convenient to search, index and match. The full value contains additional information about the value, such as ranges, precision and the calendar used. Note that while simple values will be enough for many queries, for other, more complex values, only full values will be adequate.

Full values are represented as nodes having the prefix wdv: and the label being the hash of the value contents (e.g. wdv:b74072c03a5ced412a336ff213d69ef1). There is no guarantee about the value of the hash except that different values will be represented by different hashes, and the same value mentioned in different places will have the same hash. The value node has type wikibase:Value. The content of the node is defined by the type of the value (see below).

Example of the value node: wdv:b74072c03a5ced412a336ff213d69ef1 a wikibase:Value ; wikibase:timeValue "+00000001976-01-12T00:00:00Z" ; wikibase:timePrecision "11"^^xsd:integer ; wikibase:timeTimezone "0"^^xsd:integer ; wikibase:timeCalendarModel .
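The only property the format promises for value node names is that equal contents share a name and different contents do not. The sketch below shows one way to obtain that property; the hash function (MD5) and the canonical serialization are assumptions chosen for the illustration and are not the algorithm used by the dump generator.

```python
import hashlib


def value_node_name(content):
    """Derive a content-addressed node name for a full value.

    `content` maps predicate names to literal strings. Both the MD5 hash
    and the "key=value|..." serialization are hypothetical choices.
    """
    # Sort the keys so that the same contents always serialize identically.
    canonical = "|".join(f"{key}={content[key]}" for key in sorted(content))
    digest = hashlib.md5(canonical.encode("utf-8")).hexdigest()
    return f"wdv:{digest}"


first = value_node_name({
    "wikibase:timeValue": "+00000001976-01-12T00:00:00Z",
    "wikibase:timePrecision": "11",
})
second = value_node_name({
    "wikibase:timePrecision": "11",
    "wikibase:timeValue": "+00000001976-01-12T00:00:00Z",
})
other = value_node_name({
    "wikibase:timeValue": "+00000001976-01-12T00:00:00Z",
    "wikibase:timePrecision": "9",
})
```

Here first and second are the same node name, because the contents are equal, while other differs because the precision differs. Consumers should treat the hash as opaque and not try to recompute it.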

String
String is represented as a string literal. Strings only have simple value.

CommonsMedia
Commons media is represented as an IRI with the full resource URL. It has only simple value.

URL
URL is represented as an IRI matching the URL. It has only simple value.

WikibaseEntityid
The entity is represented by the wd: prefixed link, e.g. wd:Q3. It has only simple value.

Monolingualtext
The text is represented as a string literal with language tag. It has only simple value.

Globecoordinate
The simple value of the coordinate is the WKT string with the coordinates, with type geo:wktLiteral.

The full value has latitude, longitude and precision as decimal, and the globe as IRI.

Example: wdv:a10564107110b2d5739b8fe235cddf73 a wikibase:Value ; wikibase:geoLatitude "12.933333333333"^^xsd:decimal ; wikibase:geoLongitude "35.3"^^xsd:decimal ; wikibase:geoPrecision "0.000277778"^^xsd:decimal ; wikibase:geoGlobe .

Quantity
The simple value of the quantity is the specified amount, as a decimal literal.

The full value includes amount, upper and lower bound, and unit (currently always "1").

Example: wdv:cb213eea7a0b90d1d7f65c6eabfab9da a wikibase:Value ; wikibase:quantityAmount "+123"^^xsd:decimal ; wikibase:quantityUpperBound "+124"^^xsd:decimal ; wikibase:quantityLowerBound "+122"^^xsd:decimal ; wikibase:quantityUnit "1".

Time
The simple value of the time value is either a datetime value of type xsd:dateTime, if the value can be converted to a Gregorian date, or a string as represented in the database, if not.

The full value includes original time string, precision and timezone as integers and calendar model as IRI.

Example: wdv:85374678f22bda99efb44a5617d76e51 a wikibase:Value ; wikibase:timeValue "+1948-04-12T00:00:00Z" ; wikibase:timePrecision "11"^^xsd:integer ; wikibase:timeTimezone "0"^^xsd:integer ; wikibase:timeCalendarModel .
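The fallback rule for the simple time value (typed datetime when conversion is possible, plain string otherwise) can be sketched as below. This is a simplified illustration under stated assumptions: it ignores the calendar model, treats only positive years from 1 to 9999 as convertible (a limit of Python's datetime, not of the format), and the function name simple_time_value is hypothetical.

```python
import re
from datetime import datetime

# Time string as stored by Wikibase, e.g. "+00000001976-01-12T00:00:00Z".
TIME_RE = re.compile(r"^([+-])(\d+)-(\d\d)-(\d\d)T(\d\d):(\d\d):(\d\d)Z$")


def simple_time_value(time_string):
    """Return the Turtle literal used as the simple value of a time string."""
    match = TIME_RE.match(time_string)
    if match and match.group(1) == "+":
        year, month, day, hour, minute, second = (
            int(field) for field in match.groups()[1:]
        )
        try:
            # Rejects month 0, day 0, year 0 and other impossible dates.
            datetime(year, month, day, hour, minute, second)
        except ValueError:
            pass
        else:
            iso = (f"{year:04d}-{month:02d}-{day:02d}"
                   f"T{hour:02d}:{minute:02d}:{second:02d}Z")
            return f'"{iso}"^^xsd:dateTime'
    # Not convertible: fall back to the string as stored.
    return f'"{time_string}"'
```

Under these assumptions, "+00000001976-01-12T00:00:00Z" becomes "1976-01-12T00:00:00Z"^^xsd:dateTime, as in the reference example earlier on this page, while a year-precision value such as "+00000001976-00-00T00:00:00Z" stays a plain string.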

Special Values
The Wikibase data model has two special values: somevalue (unknown), for a value that is known to exist but whose exact value is unknown, and novalue, for a value that is known not to exist.

Somevalue
Unknown value is represented as RDF blank node: wds:Q3-45abf5ca-4ebf-eb52-ca26-811152eb067c a wikibase:Statement ; ps:P2 _:genid1 ; wikibase:rank wikibase:NormalRank.

Novalue
Novalue representation is currently TBD and temporarily represented as. This may change in the future.

Sitelinks
Sitelinks are represented as a set of predicates describing the link URL. The type of the node is schema:Article and it is linked with the entity via the schema:about predicate.

Badges are described with wikibase:badge predicates.

Example:  a schema:Article ; schema:about wd:Q3 ; schema:inLanguage "en" ; wikibase:badge wd:Q5.

Prefixes used
The prefixes are used in RDF formats that allow short prefixes (such as Turtle and RDF/XML). For other formats, the full URL is used.

All prefix URLs that do not contain hostname are prefixed with the hostname of the generating wiki. All prefix URLs that contain hostname are fixed and do not depend on generating wiki.

Standard prefixes used:

Full list of prefixes
This list can be used for queries in SPARQL (a SPARQL PREFIX declaration takes no trailing period):
PREFIX rdf:
PREFIX xsd:
PREFIX rdfs:
PREFIX skos:
PREFIX schema:
PREFIX cc:
PREFIX geo:
PREFIX prov:
PREFIX wikibase:
PREFIX wdata:
PREFIX wd:
PREFIX wdt: <http://wikidata.org/prop/direct/>
PREFIX wds: <http://wikidata.org/entity/statement/>
PREFIX p: <http://wikidata.org/prop/>
PREFIX wdref: <http://wikidata.org/reference/>
PREFIX wdv: <http://wikidata.org/value/>
PREFIX ps: <http://wikidata.org/prop/statement/>
PREFIX psv: <http://wikidata.org/prop/statement/value/>
PREFIX pq: <http://wikidata.org/prop/qualifier/>
PREFIX pqv: <http://wikidata.org/prop/qualifier/value/>
PREFIX pr: <http://wikidata.org/prop/reference/>
PREFIX prv: <http://wikidata.org/prop/reference/value/>
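For formats that do not support short prefixes, a prefixed name has to be expanded to the full URL. The sketch below shows the expansion and the generation of a SPARQL header. It covers only the prefixes whose URLs are spelled out in the list above; the names PREFIXES, expand and sparql_header are hypothetical helpers, not part of any Wikibase tool.

```python
# Only the prefixes whose URLs are given in full in the list above.
PREFIXES = {
    "wdt": "http://wikidata.org/prop/direct/",
    "wds": "http://wikidata.org/entity/statement/",
    "p": "http://wikidata.org/prop/",
    "wdref": "http://wikidata.org/reference/",
    "wdv": "http://wikidata.org/value/",
    "ps": "http://wikidata.org/prop/statement/",
    "psv": "http://wikidata.org/prop/statement/value/",
    "pq": "http://wikidata.org/prop/qualifier/",
    "pqv": "http://wikidata.org/prop/qualifier/value/",
    "pr": "http://wikidata.org/prop/reference/",
    "prv": "http://wikidata.org/prop/reference/value/",
}


def expand(prefixed_name):
    """Expand a prefixed name such as "wdt:P2" to its full URL."""
    prefix, separator, local_name = prefixed_name.partition(":")
    if not separator or prefix not in PREFIXES:
        raise KeyError(f"unknown prefix in {prefixed_name!r}")
    return PREFIXES[prefix] + local_name


def sparql_header(prefixes=PREFIXES):
    """Build PREFIX declarations to put in front of a SPARQL query."""
    return "\n".join(
        f"PREFIX {name}: <{url}>" for name, url in prefixes.items()
    )
```

For example, expand("psv:P8") returns "http://wikidata.org/prop/statement/value/P8". A name with a prefix outside the table raises KeyError instead of guessing a URL.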

Ontology
This compiles the list of all objects and predicates that are internal to the format. For the meaning of the prefixes, see the prefixes list.

Predicates
Italicized names mean that any property name can be substituted for the example name P123.

The following predicates are used in full (deep) values for the values of specific types. All these predicates have the domain of wikibase:Value and the range depending on type below.