User:Henning (WMDE)/Wikibase/Data Model

From MediaWiki.org
Warning: This is a living document, describing the conceptual data model behind WIKIBASE. It is not a specification of any concrete data model implementation, mapping, or serialization.

Goals and requirements

The goal of the Data Model is to clarify which information is stored in WIKIBASE by providing:

  • Conceptual clarity: It should be clear what WIKIBASE can (and what it cannot) capture. It is not possible to capture all statements that could be made about the world (not even all that are important or reasonable). A balance must be found between expressive power and complexity/usability.
  • Technical documentation: Almost every layer of WIKIBASE has to work with the data that is captured. Therefore, it is essential to have a common understanding of what data is and what structure it should be captured in. Because the data can be represented quite differently internally (in objects, in a syntactic format, in a user interface, etc.), it is important that all such representations share a single, unambiguous reading.

The Data Model in a nutshell

The main purpose of WIKIBASE is to describe the nature of subjects using basic statements. An example of such a statement would be:

As of April 30 2014, the city of Hamburg has a population of 1,748,237 according to the web site of the Federal Statistics Office Germany, retrieved on March 10 2015.

By splitting this statement into segments, each segment can be mapped to a component of the WIKIBASE Data Model: Items represent particular subjects that statements should be aggregated about. Such an Item could represent the city of Hamburg. To make the Item representing Hamburg identifiable, a Label, a Description and Aliases may be assigned to it. One or more external pages containing information on Hamburg may be connected to the Item using Site Links, while Statements describe the nature of the Item’s subject.

As a statement’s most specific information, the claim “[…] has a population of 1,748,237 […]” is called a Main Snak. While a Snak is basically a simple pairing of a Property and a value, the Main Snak is the core information of a Claim. The information recorded by such a Claim can be limited or refined by applying Qualifiers. Just like the Main Snak, Qualifiers are Snaks; they specify the Main Snak’s circumstances. In the example, the statement’s circumstance is that the population figure was recorded on April 30 2014. The statement’s origin may be specified using a Reference that points to the respective publication by the Federal Statistics Office Germany. The resulting structure may be:
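One way to picture that structure is as nested data. The following is a minimal Python sketch of the example statement; the key names (`main_snak`, `qualifiers`, `references`) and the Property labels are illustrative assumptions, not the serialization format of any WIKIBASE implementation.

```python
# Illustrative only: key names and Property labels are assumptions,
# not the serialization used by any Wikibase implementation.
hamburg = {
    "label": "Hamburg",
    "statements": [
        {
            "claim": {
                # Core information: a Property paired with a value.
                "main_snak": {"property": "population", "value": 1_748_237},
                # Qualifiers are Snaks that describe the circumstances.
                "qualifiers": [
                    {"property": "point in time", "value": "2014-04-30"},
                ],
            },
            # Each Reference is itself a list of Snaks describing the source.
            "references": [
                [
                    {"property": "published by",
                     "value": "Federal Statistics Office Germany"},
                    {"property": "retrieved", "value": "2015-03-10"},
                ],
            ],
        }
    ],
}

statement = hamburg["statements"][0]
main_snak = statement["claim"]["main_snak"]
```

Reading `main_snak["value"]` yields the population figure, while the Qualifier and the Reference carry the "as of" date and the source, respectively.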

The Claim of a Statement may be formulated in different ways, the most common one being a Property-Value-Snak, which is a single assignment of a value (1,748,237) to a Property (population). Values may be of any nature defined by a Property’s Data Type: numbers, strings, dates and times, geographic coordinates and, for the full power of semantics, other Items. Consequently, the Item representing Hamburg may reference an Item representing the Federal Republic of Germany, in which the city is located. Just like Items, Properties may be created by the operator(s) of a WIKIBASE Repository.

Apart from the Property-Value-Snak, other Snak types allow explicitly specifying that a Property has no value, or that it has some, yet unknown, value. For example, it is very uncommon for a country not to have a capital, but Nauru does not, in fact, have one. A Property-No-Value-Snak may be used to express such an anomaly.

The concepts of WIKIBASE

Relationship between the different data model abstraction levels of Wikibase.

This document describes the logical data model of WIKIBASE, which is the representation of the concepts modelled from the real world. This logical data model does not mandate the class structure of the actual implementation, which manifests in a physical data model.

The data structures in the UML class diagram below are described using the following features:

  • Classes, represented as boxes.
  • Abstract classes (conceptual classes that are not directly instantiated in data), represented as classes with names in italics.
  • Class inheritance, represented by arrows with empty triangles as heads, pointing to the super class.
  • Class attributes, represented by “name: type” entries in classes.
  • Associations, represented by lines that may end with an arrow expressing directionality.
  • Compositions, represented by lines with filled diamonds on the side of the class that composes a particular number of objects of the other class.
UML diagram of the Wikibase Data Model.

Entity

Rendered as individual pages in a WIKIBASE Repository, Entities are the topmost concept of WIKIBASE. Each Entity has an individual numeric ID that is prefixed with a letter depending on the Entity type. These Entity IDs define the URIs of the Entity pages: <Base URL of the WIKIBASE Repository>/entity/<Entity ID>. An Entity may be implemented as Item or Property.
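The URI scheme can be sketched in a few lines of Python. The function names and the base URL `https://wikibase.example` are hypothetical, and the sketch assumes that the numeric part of an ID is a positive integer without leading zeros; the prefixes "Q" and "P" are the ones this document assigns to Items and Properties.

```python
import re

# Prefix-to-type mapping as described in this document.
ENTITY_TYPES = {"Q": "item", "P": "property"}


def entity_type(entity_id: str) -> str:
    """Return the Entity type encoded in the ID's letter prefix."""
    # Assumption: one uppercase letter followed by a positive integer.
    match = re.fullmatch(r"([A-Z])([1-9][0-9]*)", entity_id)
    if match is None or match.group(1) not in ENTITY_TYPES:
        raise ValueError(f"not a valid Entity ID: {entity_id!r}")
    return ENTITY_TYPES[match.group(1)]


def entity_uri(base_url: str, entity_id: str) -> str:
    """Build <base URL>/entity/<Entity ID> for a valid Entity ID."""
    entity_type(entity_id)  # raises ValueError for malformed IDs
    return f"{base_url.rstrip('/')}/entity/{entity_id}"
```

For instance, `entity_uri("https://wikibase.example", "Q42")` produces `https://wikibase.example/entity/Q42`.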

Item

Subjects that statements should be aggregated about are represented by Items. Each Item features a Label, an optional Description and optional Aliases that ease identifying and distinguishing Items. Apart from gathering Statements about its subject, Items may be connected to external pages using Site Links. Items may represent an individual subject as well as some class of subjects as demonstrated in the following examples:

  • Some location, e.g. the city of Hamburg, Germany.
  • Some person, e.g. the author George Orwell.
  • Some object, e.g. the book “Nineteen Eighty-Four” by George Orwell.
  • Some event, e.g. the Eurovision Song Contest 1993.
  • Some performance, e.g. the Greek entry at the Eurovision Song Contest 1993.

The scope of the concept of Items, in terms of what subjects shall be represented by Items, is to be defined by the operator(s) of a WIKIBASE Repository.

As different subjects may be labeled the same (e.g. there is more than one city in the world named “Hamburg”), an Item’s Label may act as an identifier only in combination with the Item’s Description. Therefore, across all Items, the combination of an Item’s Label and Description (unless the Description is empty) is constrained to be unique per language.
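This uniqueness rule can be sketched as a small index keyed by language, Label and Description. The class `TermIndex` and its method name are hypothetical and serve only to illustrate the constraint; they are not part of any WIKIBASE API.

```python
from typing import Dict, Tuple


class TermIndex:
    """Hypothetical index enforcing unique (language, Label, Description)."""

    def __init__(self) -> None:
        self._owner: Dict[Tuple[str, str, str], str] = {}

    def register(self, item_id: str, language: str, label: str,
                 description: str = "") -> None:
        if not description:
            # An empty Description is exempt from the constraint.
            return
        key = (language, label, description)
        owner = self._owner.get(key)
        if owner is not None and owner != item_id:
            raise ValueError(f"{key} is already used by {owner}")
        self._owner[key] = item_id


index = TermIndex()
index.register("Q1", "en", "Hamburg", "city in Germany")
index.register("Q2", "en", "Hamburg", "town in the United States")  # accepted
```

Registering a third Item with the Label "Hamburg" and the Description "city in Germany" in the same language would raise a `ValueError`, whereas the same pair in another language would be accepted.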

Since pages linked to an Item via Site Links are supposed to describe exactly the subject represented by the Item, and no more than one page on that subject should exist on the external Site, the combination of a Site and a page that forms a basic Site Link is constrained to be unique across all Items.

An Item’s Entity ID prefix is “Q”.

Property

By assigning values to Properties, these may be used to form Statements – on Items as well as on other Properties, or even on the same Property. In that sense, Properties are used to describe relationships between Entities as well as to capture values about a particular subject represented by an Item. Each Property is assigned a Data Type that constrains the kind of value users are allowed to enter for the particular Property. Each Property features a Label, an optional Description and optional Aliases that ease identifying and distinguishing Properties.

Examples for Properties:

  • height (featuring a Data Type constraining input to a number)
  • geographic location (featuring a Data Type constraining input to geographic coordinates)
  • date of birth (featuring a Data Type constraining input to specifying a date)
  • capital (featuring a Data Type constraining input to specifying a reference to an Item)
  • author of (featuring a Data Type constraining input to specifying a reference to an Item)

Each Property may be addressed by its Label or one of its Aliases. Consequently, across all Properties, each Property’s Label and each Alias are constrained to be unique per language.

A Property’s Entity ID prefix is “P”.

Site Link

A Site Link represents a link to an external page and, hence, directly connects an Item to that external page. It features a site ID referencing a particular Site, a page name and, optionally, one or more Badges pointing to other Items. An external page is supposed to capture exactly the subject represented by the corresponding Item, while no other page capturing the same subject should exist on the same Site. This assumption is enforced by constraining each combination of a site ID and a page name to be unique across all Items.
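As a rough sketch of this structure and its constraint, a Site Link can be modelled as a small immutable record, with an index that allows each (site ID, page name) pair to belong to one Item only. The names `SiteLink` and `SiteLinkIndex` and the site ID `enwiki` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class SiteLink:
    site_id: str
    page_name: str
    badges: Tuple[str, ...] = ()  # IDs of the Items used as Badges


class SiteLinkIndex:
    """Hypothetical index: one Item per (site ID, page name) pair."""

    def __init__(self) -> None:
        self._owner: Dict[Tuple[str, str], str] = {}

    def attach(self, item_id: str, link: SiteLink) -> None:
        key = (link.site_id, link.page_name)
        # setdefault stores item_id only if the pair is not yet taken.
        owner = self._owner.setdefault(key, item_id)
        if owner != item_id:
            raise ValueError(f"{key} is already linked to {owner}")


links = SiteLinkIndex()
links.attach("Q1", SiteLink("enwiki", "Hamburg"))
```

Badges are deliberately left out of the uniqueness key, since the constraint described above only covers the site ID and the page name.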

Statement

Statements describe the nature of the subject represented by an Item or Property by aggregating a Claim and optional References pointing to the source of the Claim. In addition, a Statement features a Rank that may be used to put emphasis on or diminish emphasis of the Statement.

Claim

Essentially, a Claim is a plain Property-value pair forming a Snak. The statement of this Main Snak may be refined or limited by applying additional Qualifier Snaks. Examples for Claims:

In each case, there are alternative ways to capture the respective information. Policies on how information is supposed to be captured should be defined by the operator(s) of a WIKIBASE Repository.

Rank

Ranks provide a simple filtering criterion when there is more than one Statement for some Property. Figuratively, Ranks apply weight to Statements. Their purpose is to focus on the most up-to-date and most correct Statements, both in Entity renderings and in the results of Queries that do not intentionally bypass the ranking mechanism. Contrary to what the term suggests, Ranks are not supposed to be used for rating or trying to capture a Statement’s quality, reliability or likelihood. All of these are expressed directly by a Statement’s References. Statements of any Rank may be valid and reliable according to their References.

Apart from the default normal Rank, Statements may be marked with preferred or deprecated Rank:

Normal rank

As the Rank that Statements are assigned by default, the normal Rank represents a neutral state: it neither adds weight to nor removes weight from a Statement. When issuing a Query on an Item or Property for Statements, the normal ranked Statements are returned for each queried Property that does not feature preferred ranked Statements.

Examples for applying a normal rank:

  • Coordinates of a particular location. As long as there is only one set of coordinates specified, there is no need to apply any Rank other than normal.
  • A football player’s past team membership. While the current team membership may be assigned the preferred Rank, the past membership should be assigned the normal Rank.
  • A person being parent to several children. All of these Statements may remain at normal Rank as long as each child’s parentage is equally well established (none is more or less “correct” than another). (Assigning the preferred Rank to all of those Statements would be correct as well.) Disputed parentage, however, may make ranking complicated in this example.

Preferred rank

When issuing a Query on an Item or Property for Statements, by default, only the preferred Statement(s) is/are returned, provided that the Properties queried feature preferred Statements. (If there are no preferred Statements for a Property, the normal ranked Statements of that Property are returned.) This mechanism is a convenience, since whoever issues the query does not need to figure out which value would most likely be expected. Consequently, the preferred Rank should be assigned to the most current Statements and/or Statements that represent scientific consensus.

Examples for applying a preferred Rank:

  • An Item representing a city may feature a list of its current and former mayors. The current mayor would receive the preferred rank.
  • There are several ways to measure the length of a river, and the result differs according to the method used. On an Item representing a river, the result of the most common method should probably receive the preferred rank.
  • A football player is currently playing in two teams of a football club, the top team and a youth team. While the player played for other teams before, the memberships in the current teams may receive the preferred Rank, in contrast to the memberships in former teams, which keep the normal Rank.

Just as Ranks are not for rating quality, they are not meant to determine right or wrong. There may be multiple preferred Statements when there is no consensus. A theoretical example of this is the politically disputed status of a geographic region.

Deprecated rank

The deprecated Rank is used to mark Statements that are known to include errors or that represent outdated knowledge that has been proven wrong. Marking Statements as deprecated instead of simply deleting them maintains integrity and makes users aware that they should not (re-)add the Statement with another Rank. When issuing a Query on an Item or Property for Statements, deprecated Statements are never returned unless they are requested specifically. While creating Statements without any Reference may, in general, be problematic, having no or no proper Reference does not by itself qualify a Statement for the deprecated Rank. The Rank applies to the Claim only, not to the combination of a Claim and its References.

Examples for applying a deprecated Rank:

  • The earth being the center of the cosmos once was subject of scientific discourse. Although that can be backed by historic sources from that time, the geocentric model is deprecated.
  • An Item representing a city may feature an incorrect population figure that was published in a historical document. Backed by the source, the Statement is not wrong since the figure is accurate according to the historical document. However, since the historical document is known to be erroneous, the deprecated Rank should be applied to the Statement.
  • Some literature suggests that a person was born in a specific state. However, that state did not yet exist when the person was born. A Statement ranked deprecated may be used to capture that information.

Examples for when to not apply a deprecated Rank:

  • A football player left a particular team. The Statement referencing the team membership is not deprecated as it once was true. Instead, the Statement’s Rank may be reset to normal and a Qualifier may be added specifying the date the player left the team.
  • A Statement not featuring any Reference may not automatically be regarded deprecated.

The concept of Ranks is intentionally left coarse and simple. The three levels translate to different treatments in data access, user interface (e.g., what is displayed by default), and export (one could, e.g., have an export with only the preferred and normal Statements). More fine-grained ranking would not allow such a clear interpretation and would thus increase the user interface complexity unnecessarily. Having only two Ranks (or no Ranks at all), on the other hand, would make it harder to cope with Statements that are not trusted or that are known to contain wrong claims.
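The default query behaviour described in this section (preferred Statements if any exist, otherwise normal ones, and never deprecated ones) can be sketched as follows. The class and function names and the sample values are hypothetical; this illustrates the rule and is not the query implementation of WIKIBASE.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, List


class Rank(Enum):
    PREFERRED = "preferred"
    NORMAL = "normal"
    DEPRECATED = "deprecated"


@dataclass
class Statement:
    property: str
    value: Any
    rank: Rank = Rank.NORMAL  # the default Rank


def best_statements(statements: List[Statement], prop: str) -> List[Statement]:
    """Return preferred Statements if any exist, otherwise the normal ones.

    Deprecated Statements are never part of this default result.
    """
    candidates = [s for s in statements if s.property == prop]
    preferred = [s for s in candidates if s.rank is Rank.PREFERRED]
    if preferred:
        return preferred
    return [s for s in candidates if s.rank is Rank.NORMAL]


mayors = [
    Statement("mayor", "former mayor"),
    Statement("mayor", "current mayor", Rank.PREFERRED),
    Statement("mayor", "erroneous entry", Rank.DEPRECATED),
]
print([s.value for s in best_statements(mayors, "mayor")])  # ['current mayor']
```

A caller that wants deprecated Statements as well would have to bypass `best_statements` and read the full list, which mirrors the rule that such Statements are only returned when requested specifically.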

Reference

Aggregated in Statements, References accompany Claims and capture information about the source the Claim originates from. A Reference consists of a list of Snaks describing the source. These Snaks may, for example, reference another Item representing a book, specify the page number, reference a URL or formulate the source of a Statement in some other way. WIKIBASE enforces neither a particular schema for a Reference’s structure nor any constraints or requirements for References. How References should be formulated is to be defined by the operator(s) of a WIKIBASE Repository. Examples:

The latter example demonstrates the complex structure that sourcing Claims may generate. A more meaningful alternative to specifying such a complex structure every time it is referenced is to create an Item for the subject containing the citations. This Item may then be referenced by the Reference:
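As a rough illustration of that alternative, the sketch below contrasts a Reference that spells out a source inline with one that points to an Item describing the source. The Property labels, the placeholder values and the Item ID `Q1000` are all hypothetical; WIKIBASE itself prescribes no such schema.

```python
# All Property labels, values and the Item ID "Q1000" are placeholders.

# A Reference that describes the source inline, Snak by Snak.
inline_reference = [
    {"property": "author", "value": "Example Author"},
    {"property": "title", "value": "Example Title"},
    {"property": "publication date", "value": "2001"},
    {"property": "page", "value": 17},
]

# The same source described once as an Item and then pointed to.
# Only the detail specific to this citation (the page) stays inline.
item_reference = [
    {"property": "stated in", "value": {"entity": "Q1000"}},
    {"property": "page", "value": 17},
]
```

With the second form, the bibliographic details live on the Item and do not have to be repeated on every Statement that cites the same source.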

Snak

Snaks are the basic information structures used to describe subjects represented by Entities. They are an integral part of each Claim and Statement alike. Different types of Snaks may be used to express various meanings. However, every Snak, whatever its type, refers to a Property. Since a Snak is essentially just a pair of a Property and a value, it does not refer to any subject by itself. The subject derives from a Snak’s context, e.g., from being encapsulated in a Statement on an Item.

Property-Value-Snak

A Property-Value-Snak captures a particular value of a Property, with the value being required to conform to the Property’s Data Type. Many basic kinds of data are naturally expressed by assigning values to Properties. Examples:

Property-No-Value-Snak

A Property-No-Value-Snak describes a Property explicitly featuring no value in contrast to having a specific value, some value or a value not being relevant. A Property-No-Value-Snak may be used to emphasize that a Property’s value has not just been left out (or not entered yet) but that it really does not exist. Examples:

  • children (Property) = no value (on the subject of an emperor).
  • current team membership (Property) = no value (on the subject of a sportsman).
  • capital (Property) = no value (on the subject of a country).
  • number of teeth (Property) = no value (on the subject of a species).
  • web site (Property) = no value (on the subject of a company).

Such Statements should only be created when an incompleteness could be expected otherwise. It is not intended to store irrelevant information (e.g., “The Pacific Ocean has no children”).

Property-Some-Value-Snak

A Property-Some-Value-Snak describes a Property featuring some, yet unknown, value. In a way, such a Snak acts like a placeholder, yet explicitly capturing that placeholder may be useful. Examples:

  • date of death (Property) = unknown (on the subject of a historic person that lived long ago but whose date of death is unknown, e.g. Ambrose Bierce).
  • mail address (Property) = unknown (on the subject of a company).
  • real name (Property) = unknown (on the subject of an artist whose real name is not known, stressing that the common name is a pseudonym).
  • location (Property) = unknown (on the subject of some lost painting, e.g. “Good Neighbours” by John William Waterhouse).
  • recipe (Property) = unknown (on the subject of some beverage, e.g. the French liqueur Chartreuse).

Such Statements should only be created if not even parts of the information that should be captured are known. E.g., if only the year of death is known, a Property-Value-Snak should be used to capture that information. While WIKIBASE does not support constraints on unknown values (e.g., “William of Ockham died in 1347 or 1348”), it does support precision on some types of values (“William of Ockham died in the 1340s”) as well as different (possibly conflicting) values from multiple sources.

Example

Given there is a Property date of death and an Item that represents some person, assigning different Snak types with the Property date of death as a Statement’s Main Snak results in different meanings:

  • Property-No-Value-Snak: The person definitely has no date of death (specifically stating / stressing the person is alive).
  • Property-Some-Value-Snak: The person is deceased, but neither the person’s exact date of death nor parts of it nor a time range are known.
  • Property-Value-Snak: The person’s exact date of death, parts of it or a time range the death occurred in are known and specified as the Snak’s value.

Note: Not specifying any Snak for date of death at all means that the Property's value is unknown or not relevant. A missing person who may be alive or dead may deliberately be left without a Snak capturing date of death, while a Snak with a Property distance to state border may simply not be relevant in the scope of an Item representing a person.
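The three Snak types and the "no Snak at all" case can be sketched as a small tagged union. The class names follow the section headings; the function `describe_date_of_death` and its return strings are hypothetical and only restate the meanings listed above.

```python
from dataclasses import dataclass
from typing import Any, Optional, Union


@dataclass(frozen=True)
class PropertyValueSnak:
    property: str
    value: Any


@dataclass(frozen=True)
class PropertySomeValueSnak:
    property: str


@dataclass(frozen=True)
class PropertyNoValueSnak:
    property: str


Snak = Union[PropertyValueSnak, PropertySomeValueSnak, PropertyNoValueSnak]


def describe_date_of_death(snak: Optional[Snak]) -> str:
    """Spell out what each Snak type means for 'date of death'."""
    if snak is None:
        return "unknown or not relevant"
    if isinstance(snak, PropertyNoValueSnak):
        return "no date of death: the person is alive"
    if isinstance(snak, PropertySomeValueSnak):
        return "deceased, date unknown"
    return f"died: {snak.value}"
```

Note that only the Property-Value-Snak carries a value; the other two types carry meaning purely through their type.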

WIKIBASE has no native support for distinguishing between unknown and irrelevant, nor is there a native way to specify probability, as demonstrated in the following theoretical example:

Several historical persons have been suggested as possibly having been Robin Hood. There is no method to reflect probability or uncertainty, for instance by applying some disputed flag to a Property was, as in “Robin Hood was Robin of Loxley”. Instead, a custom Property like might have been would need to be created to construct “Robin Hood might have been Robin of Loxley”.

In a similar sense, it is not supported to specify a set of value alternatives as demonstrated in the following theoretical example:

In a historical handwritten document, a person’s date of birth may be read as either July 1st 1089 or July 1st 1099. This information cannot be captured in a single Snak. Multiple alternatives originating from one source cannot be represented properly at all, since adding multiple Statements for a single Property (be it date of birth or probable date of birth) may suggest that there are two sources. With a general date of birth Property, this approach would moreover produce incorrect Query results, as the person would appear to have been born in both 1089 and 1099.

The matter of handling of uncertainty, probability and alternatives originating from a single source should be addressed by the operator(s) of a WIKIBASE Repository.

Data Type

Like a common data type, a WIKIBASE Data Type is a classification identifying one of various types of data, and it determines the possible values that may be assigned to a Property of that Data Type. Each Property is assigned a Data Type. A Data Type references a Data Value whose nature values of the Data Type must comply with. Depending on a Data Value’s complexity, not all Data Types are primitive in the sense that their values consist of only one single value of a type that is commonly found in programming languages. In addition to having to conform to a specific Data Value, a Data Type may define additional constraints. There are various common Data Types, and each must be handled specifically by the software since, for example, the different nature of Data Types requires different user interface integration. Therefore, the set of Data Types supported by WIKIBASE can only be extended by software developers, not by editors of a WIKIBASE Repository.
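The idea that a Property's Data Type constrains the accepted values can be sketched as a fixed table of validators. The Data Type names `string`, `quantity` and `item` and the checks themselves are simplified assumptions made for illustration, not the actual set of Data Types supported by WIKIBASE.

```python
from typing import Any, Callable, Dict

# Hypothetical, simplified Data Types. The table is fixed in code, which
# mirrors the rule that only developers can extend the set of Data Types.
DATA_TYPES: Dict[str, Callable[[Any], bool]] = {
    "string": lambda v: isinstance(v, str),
    "quantity": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "item": lambda v: isinstance(v, str) and v[:1] == "Q" and v[1:].isdigit(),
}


def check_value(data_type: str, value: Any) -> None:
    """Raise if the value does not conform to the given Data Type."""
    try:
        is_valid = DATA_TYPES[data_type]
    except KeyError:
        raise ValueError(f"unsupported Data Type: {data_type!r}") from None
    if not is_valid(value):
        raise TypeError(f"{value!r} is not a valid {data_type} value")


check_value("quantity", 1_748_237)  # a population figure is accepted
check_value("item", "Q64")          # a reference to another Item is accepted
```

A call such as `check_value("quantity", "tall")` raises a `TypeError`, and an unknown Data Type name raises a `ValueError`.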

More information about the Data Types supported by WIKIBASE.

Data Value

A Data Value is a container for a value conforming to the specific nature defined by the Data Value. This concept derives from the requirement to capture complex values: While there would be no need to encapsulate a plain string, values consisting of multiple aspects, like a coordinate location consisting of latitude, longitude and additional attributes, need to be captured by a dedicated optimized structure.
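A coordinate location is a convenient example of such a composite value. The sketch below is a minimal illustration: the class name `GlobeCoordinate`, the `precision` attribute (standing in for the "additional attributes") and the sample coordinates are assumptions, not the structure defined by WIKIBASE.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlobeCoordinate:
    """Hypothetical Data Value bundling the parts of a coordinate."""

    latitude: float          # degrees, -90 to 90
    longitude: float         # degrees, -180 to 180
    precision: float = 0.01  # degrees; an assumed additional attribute

    def __post_init__(self) -> None:
        # Validate on construction so that no invalid value can exist.
        if not -90.0 <= self.latitude <= 90.0:
            raise ValueError(f"latitude out of range: {self.latitude}")
        if not -180.0 <= self.longitude <= 180.0:
            raise ValueError(f"longitude out of range: {self.longitude}")
        if self.precision <= 0:
            raise ValueError("precision must be positive")


# Approximate location of Hamburg, for illustration only.
hamburg_location = GlobeCoordinate(latitude=53.55, longitude=10.0)
```

Keeping latitude, longitude and precision in one immutable object means a Snak can hold the coordinate as a single value, just as it would hold a plain string.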

More information about the Data Values supported by WIKIBASE.

See also