Wikidata query service/User Manual

Wikidata Query Service (WDQS) is a software package and public service designed to provide a SPARQL endpoint which allows you to query against the Wikidata data set.

Please note that the service is currently in beta mode, which means that details of the data set or the service provided can change without prior warning. This page or other relevant documentation pages will be updated accordingly; it is recommended that you watch them if you are using the service.

Example SPARQL queries are available on the SPARQL examples page.

Data set

The Wikidata Query Service operates on a data set from Wikidata.org, represented in RDF as described in the RDF dump format documentation. The service's data set does not exactly match the data set produced by RDF dumps, mainly for performance reasons; the documentation describes a small set of differences.

You can download a weekly dump of the same data from https://dumps.wikimedia.org/wikidatawiki/entities/

Basics - Understanding SPO (Subject, Predicate, Object) also known as a Semantic Triple

spo, or "subject, predicate, object", is known as a triple, and is commonly referred to in Wikidata as a statement about data.

The statement "The sky has the color blue" consists of a subject ("the sky"), a predicate ("has the color"), and an object ("blue").

spo is also used as a form of basic syntax layout for querying RDF data structures, or any graph database or triplestore, such as the Wikidata Query Service (WDQS), which is powered by Blazegraph, a high performance graph database.

Advanced uses of triples even include using triples as the objects or subjects of other triples.
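
To make the idea of querying by pattern more concrete, here is a minimal Python sketch. It is not how Blazegraph works internally; it only stores a few triples as tuples and matches them against a pattern in which None plays the role of a variable such as ?s. The sample data and the function name match are made up for illustration.

```python
# Toy triple store: each statement is a (subject, predicate, object) tuple.
# The data below is illustrative only, not real Wikidata content.
TRIPLES = [
    ("the sky", "has the color", "blue"),
    ("the sea", "has the color", "blue"),
    ("the sky", "is above", "the sea"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts like a variable."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# "Which subjects have the color blue?" (subject left as a variable)
blue_things = [t[0] for t in match(TRIPLES, p="has the color", o="blue")]
print(blue_things)
```

Leaving a position as None corresponds to putting a variable in that position of a SPARQL triple pattern.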

Basics - Understanding Prefixes

WDQS understands many shortcut abbreviations, known as prefixes. Some are internal to Wikidata, such as wd, wdt, p, ps and bd; many others are commonly used external prefixes, such as rdf, skos, owl and schema.

?s is not a prefix but a variable; here you can think of it as the subject in an spo triple.
In the following query, we are asking for items that have the statement "P279 = Q7725634" or, in fuller terms, selecting subjects that have the predicate "subclass of" with the object "literary work".

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wds: <http://www.wikidata.org/entity/statement/>
PREFIX wdv: <http://www.wikidata.org/value/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
PREFIX pq: <http://www.wikidata.org/prop/qualifier/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX bd: <http://www.bigdata.com/rdf#>

# The SELECT query below does the following:
# selects all items (?s, the subjects) that have a direct property (wdt:)
# P279 <subclass of> with the entity (wd:) Q7725634 <literary work> as its value,
# and optionally returns the English label (?label) of each item,
# read directly from rdfs:label rather than through the label service.

SELECT ?s ?label WHERE {
  ?s wdt:P279 wd:Q7725634 .
  OPTIONAL {
    ?s rdfs:label ?label .
    FILTER (lang(?label) = "en")
  }
}

Extensions

The service supports the following extensions to standard SPARQL capabilities:

Label service

You can fetch the label, alias, or description of entities you query, with language fallback, using the specialized service with the URI <http://wikiba.se/ontology#label>. The service is very helpful when you want to retrieve labels, as it reduces the complexity of SPARQL queries that you would otherwise need to achieve the same effect.

The service can be used in one of the two modes: manual and automatic.

In automatic mode, you only need to specify the service template, e.g.:

 PREFIX wikibase: <http://wikiba.se/ontology#>
 SERVICE wikibase:label {
     bd:serviceParam wikibase:language "en" .
 }

and WDQS will automatically generate labels as follows:

  • If an unbound variable in SELECT is named ?NAMELabel, then WDQS produces the label (rdfs:label) for the entity in variable ?NAME.
  • If an unbound variable in SELECT is named ?NAMEAltLabel, then WDQS produces the alias (skos:altLabel) for the entity in variable ?NAME.
  • If an unbound variable in SELECT is named ?NAMEDescription, then WDQS produces the description (schema:description) for the entity in variable ?NAME.

In each case, the variable ?NAME must be bound; otherwise the service fails.

You specify your preferred language(s) for the label with one or more of bd:serviceParam wikibase:language "language-code" triples. Each string can contain one or more language codes, separated by commas. WDQS considers languages in the order in which you specify them. If no label is available in any of the specified languages, the Q-id of the entity (without any prefix) is its label.
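
The fallback rule described above can be sketched in Python. This only illustrates the documented behavior and is not the WDQS implementation; the function resolve_label and the sample labels are hypothetical.

```python
def resolve_label(qid, labels, language_params):
    """Pick a label following the documented fallback order.

    labels: dict mapping language code -> label for one entity.
    language_params: list of strings, each holding one or more
    comma-separated language codes, in order of preference.
    """
    for param in language_params:
        for code in param.split(","):
            code = code.strip()
            if code in labels:
                return labels[code]
    # No label in any requested language: fall back to the bare Q-id.
    return qid


sample = {"de": "Beispiel", "en": "example"}  # made-up labels
first = resolve_label("Q123", sample, ["fr,de,en"])
second = resolve_label("Q123", sample, ["fr", "es"])
print(first)   # German label, since no French one exists
print(second)  # bare Q-id, since neither French nor Spanish exists
```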

Example, showing the list of US presidents and their spouses:

SELECT ?p ?pLabel ?w ?wLabel WHERE {
   wd:Q30 p:P6/ps:P6 ?p .
   ?p wdt:P26 ?w .
   SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
   }
 }

In this example, WDQS automatically creates the labels ?pLabel and ?wLabel for the entities bound to ?p and ?w.

In the manual mode, you explicitly bind the label variables within the service call, but WDQS will still provide language resolution and fallback. Example:

  SELECT *
   WHERE {
     SERVICE wikibase:label {
       bd:serviceParam wikibase:language "fr,de,en" .
       wd:Q123 rdfs:label ?q123Label .
       wd:Q123 skos:altLabel ?q123Alt .
       wd:Q123 schema:description ?q123Desc .
       wd:Q321 rdfs:label ?q321Label .
    }
  }

This will consider labels and descriptions in French, German and English, and if none are available, will use the Q-id as the label.

Geospatial search

The service allows you to search for items whose coordinates are located within a certain radius of a center point or within a certain bounding box.

Search around point

Example:

# Airports within 100km from Berlin
#defaultView:Map
SELECT ?place ?placeLabel ?location ?dist WHERE {
  # Berlin coordinates
  wd:Q64 wdt:P625 ?berlinLoc . 
  SERVICE wikibase:around { 
      ?place wdt:P625 ?location . 
      bd:serviceParam wikibase:center ?berlinLoc . 
      bd:serviceParam wikibase:radius "100" . 
      bd:serviceParam wikibase:distance ?dist.
  } 
  # Is an airport
  ?place wdt:P31/wdt:P279* wd:Q1248784 .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" . 
  }
} ORDER BY ASC(?dist)

The first line of the around service call must have the format ?item predicate ?location, where the result of the search will bind ?item to items within the specified location and ?location to their coordinates. The parameters supported are:

Predicate          Meaning
wikibase:center    The point around which the search is performed. Must be bound for the search to work.
wikibase:radius    Distance from the center. Currently the distance is always in kilometers; other units are not yet supported.
wikibase:globe     The globe which is being searched. Optional; the default is Earth (wd:Q2).
wikibase:distance  The variable receiving the distance information.

Search within box

Example of box search:

# Schools between San Jose, CA and Sacramento, CA
#defaultView:Map
SELECT * WHERE {
  wd:Q16553 wdt:P625 ?SJloc .
  wd:Q18013 wdt:P625 ?SCloc .
  SERVICE wikibase:box {
      ?place wdt:P625 ?location .
      bd:serviceParam wikibase:cornerSouthWest ?SJloc .
      bd:serviceParam wikibase:cornerNorthEast ?SCloc .
    }
  ?place wdt:P31/wdt:P279* wd:Q3914 .
}

or:

# Schools between San Jose, CA and San Francisco, CA
#defaultView:Map
SELECT ?place ?location WHERE {
  wd:Q62 wdt:P625 ?SFloc .
  wd:Q16553 wdt:P625 ?SJloc .
  SERVICE wikibase:box {
    ?place wdt:P625 ?location .
    bd:serviceParam wikibase:cornerWest ?SFloc .
    bd:serviceParam wikibase:cornerEast ?SJloc .
  }
  ?place wdt:P31/wdt:P279* wd:Q3914 .
}

Coordinates may be specified directly:

# Schools between San Jose, CA and Sacramento, CA
# Same area as the first box example, with the corner coordinates given directly
#defaultView:Map
SELECT * WHERE {
  SERVICE wikibase:box {
    ?place wdt:P625 ?location .
    bd:serviceParam wikibase:cornerWest "Point(-121.872777777 37.304166666)"^^geo:wktLiteral .
    bd:serviceParam wikibase:cornerEast "Point(-121.486111111 38.575277777)"^^geo:wktLiteral .
  }
  ?place wdt:P31/wdt:P279* wd:Q3914 .
}

The first line of the box service call must have the format ?item predicate ?location, where the result of the search will bind ?item to items within the specified location and ?location to their coordinates. The parameters supported are:

Predicate                 Meaning
wikibase:cornerSouthWest  The south-west corner of the box.
wikibase:cornerNorthEast  The north-east corner of the box.
wikibase:cornerWest       The western corner of the box.
wikibase:cornerEast       The eastern corner of the box.
wikibase:globe            The globe which is being searched. Optional; the default is Earth (wd:Q2).

wikibase:cornerSouthWest and wikibase:cornerNorthEast must be used together, as must wikibase:cornerWest and wikibase:cornerEast; the two pairs cannot be mixed. If the wikibase:cornerWest and wikibase:cornerEast predicates are used, then the points are assumed to be the end points of a diagonal of the box, and the corners are derived accordingly.
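
One plausible reading of "derived accordingly" is sketched below in Python. It assumes that the western point supplies the west longitude, the eastern point supplies the east longitude, and the two latitudes are sorted into south and north; the actual WDQS logic may differ. The helper names parse_point and corners_from_diagonal are hypothetical.

```python
import re


def parse_point(wkt):
    """Parse a WKT literal of the form 'Point(longitude latitude)'."""
    m = re.fullmatch(r"\s*Point\(\s*([-+.\deE]+)\s+([-+.\deE]+)\s*\)\s*", wkt)
    if m is None:
        raise ValueError(f"not a WKT point: {wkt!r}")
    return float(m.group(1)), float(m.group(2))


def corners_from_diagonal(west_wkt, east_wkt):
    """Derive (south_west, north_east) as (lon, lat) pairs.

    Assumption of this sketch: longitudes come from the west and east
    points as given, and the latitudes are sorted into south and north.
    """
    west_lon, lat_a = parse_point(west_wkt)
    east_lon, lat_b = parse_point(east_wkt)
    south, north = min(lat_a, lat_b), max(lat_a, lat_b)
    return (west_lon, south), (east_lon, north)


# Coordinates taken from the box example that gives the corners directly
sw, ne = corners_from_diagonal(
    "Point(-121.872777777 37.304166666)",
    "Point(-121.486111111 38.575277777)",
)
print(sw, ne)
```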

Distance function

The function geof:distance returns the distance between two points, in kilometers. Example usage:

# Airports within 100km from Berlin
SELECT ?place ?placeLabel ?location ?dist WHERE {

  # Berlin coordinates
  wd:Q64 wdt:P625 ?berlinLoc . 
  SERVICE wikibase:around { 
      ?place wdt:P625 ?location . 
      bd:serviceParam wikibase:center ?berlinLoc . 
      bd:serviceParam wikibase:radius "100" . 
  } 
  # Is an airport
  ?place wdt:P31/wdt:P279* wd:Q1248784 .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" . 
  }
  BIND(geof:distance(?berlinLoc, ?location) as ?dist) 
} ORDER BY ?dist

# Places around 0°,0° 
SELECT *
{
  SERVICE wikibase:around { 
      ?place wdt:P625 ?location . 
      bd:serviceParam wikibase:center "Point(0 0)"^^geo:wktLiteral .
      bd:serviceParam wikibase:radius "250" . 
  } 
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . ?place rdfs:label ?placeLabel }
  BIND(geof:distance("Point(0 0)"^^geo:wktLiteral, ?location) as ?dist) 
} 
ORDER BY ?dist
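
For a rough client-side cross-check of such distances, a great-circle distance can be computed with the haversine formula. This Python sketch assumes a spherical Earth with a mean radius of 6371 km; the formula that geof:distance uses internally is not stated here, so small differences are to be expected. The function name haversine_km is hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometers between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# One degree of longitude along the equator, starting at Point(0 0)
d = haversine_km(0.0, 0.0, 1.0, 0.0)
print(round(d, 1))
```

Note that the argument order follows the WKT convention used in the queries, with longitude first and latitude second.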

Automatic prefixes

Most prefixes that are used in common queries are supported by the engine without the need to explicitly specify them.

Extended dates

The service supports date values of type xsd:dateTime in a range of about 290 billion years into the past and into the future, with one-second resolution. WDQS stores dates as a 64-bit number of seconds since the Unix epoch.
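
A quick back-of-the-envelope check of that range, assuming the 64-bit count is a signed integer (the text above does not say so explicitly) and using the mean Gregorian year length:

```python
SECONDS_PER_YEAR = 365.2425 * 24 * 60 * 60  # mean Gregorian year

max_seconds = 2**63 - 1  # largest signed 64-bit value (assumed signed)
range_years = max_seconds / SECONDS_PER_YEAR

print(f"{range_years / 1e9:.0f} billion years")
```

The result is a little over 290 billion years in each direction, which is consistent with the figure quoted above.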

Wikimedia service

Wikimedia runs the public service instance of WDQS, which is available for use at http://query.wikidata.org/.

The runtime of the query on the public endpoint is limited to 30 seconds. That is true both for the GUI and the public SPARQL endpoint. If you need to run longer queries, please contact the Discovery team.

GUI

The GUI at the home page of http://query.wikidata.org/ allows you to edit and submit SPARQL queries to the query engine. The results are displayed as an HTML table. Note that every query has a unique URL which can be bookmarked for later use. Going to this URL will put the query in the edit window, but will not run it - you still have to click "Execute" for that.

One can also generate a short URL for the query via a URL shortening service by clicking the "Generate short URL" link on the right - this will produce the shortened URL for the current query.

The "Add prefixes" button generates the header containing standard prefixes for SPARQL queries. The full list of prefixes that can be useful is listed in the RDF format documentation. Note that most common prefixes work automatically, since WDQS supports them out of the box.

The GUI also features a simple entity explorer which can be activated by clicking on the "🔍" symbol next to the entity result. Clicking on the entity Q-id itself will take you to the entity page on wikidata.org.

Default views

If you run the query in the WDQS GUI, you can choose which view to present by specifying a comment: #defaultView:viewName at the beginning of the query. Supported views are:

  • Table - default view, displays the results as a table of values
  • Map - displays coordinate points if any present in the result
  • ImageGrid - displays images present in the result as a grid
  • BubbleChart - displays bubble chart for numbers found in the result
  • TreeMap - displays hierarchical tree map for numbers found in the result
  • Timeline - for results having dates, displays timeline placing each row at appropriate time
  • Dimensions - displays rows as lines between points on the scales representing each column
  • Graph - displays result as a connected graph, using linkTo column

SPARQL endpoint

SPARQL queries can be submitted directly to the SPARQL endpoint with a GET request to https://query.wikidata.org/sparql?query=SPARQL (POST and other method requests will be denied with a "403 Forbidden"). The result is returned as XML by default, or as JSON if either the query parameter format=json or the header Accept: application/sparql-results+json is provided.
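
As a minimal sketch, such a GET request could be assembled with Python's standard library as follows. The function name build_request is hypothetical, and the example only builds the request without sending it.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "https://query.wikidata.org/sparql"


def build_request(query, as_json=True):
    """Build (but do not send) a GET request for the SPARQL endpoint."""
    # urlencode takes care of escaping the query text for use in a URL.
    url = ENDPOINT + "?" + urlencode({"query": query})
    headers = {}
    if as_json:
        # Ask for JSON instead of the default XML result format.
        headers["Accept"] = "application/sparql-results+json"
    return Request(url, headers=headers, method="GET")


req = build_request("SELECT ?s WHERE { ?s wdt:P279 wd:Q7725634 } LIMIT 5")
print(req.full_url)
# To actually send it, pass req to urllib.request.urlopen().
```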

The JSON format is the standard SPARQL 1.1 Query Results JSON Format.
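
The following Python sketch shows one way to flatten a result document in that format into simple rows. The sample document is hand-written for illustration and is not real service output; the function name rows is hypothetical.

```python
import json

# A hand-written sample in the SPARQL 1.1 JSON results layout
SAMPLE = """
{
  "head": {"vars": ["s", "sLabel"]},
  "results": {"bindings": [
    {"s": {"type": "uri", "value": "http://www.wikidata.org/entity/Q123"},
     "sLabel": {"type": "literal", "xml:lang": "en", "value": "example"}},
    {"s": {"type": "uri", "value": "http://www.wikidata.org/entity/Q321"}}
  ]}
}
"""


def rows(doc):
    """Flatten bindings to dicts of variable -> value (None if unbound)."""
    variables = doc["head"]["vars"]
    return [
        {v: (b[v]["value"] if v in b else None) for v in variables}
        for b in doc["results"]["bindings"]
    ]


table = rows(json.loads(SAMPLE))
for row in table:
    print(row)
```

Variables that are unbound in a given solution are simply absent from that binding, which is why the sketch fills them in as None.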

Supported formats

The following output formats are currently supported by the SPARQL endpoint:

Format      HTTP header                                     Query parameter  Description
XML         Accept: application/sparql-results+xml          format=xml       XML result format, returned by default. As specified in https://www.w3.org/TR/rdf-sparql-XMLres/
JSON        Accept: application/sparql-results+json         format=json      JSON result format, as specified in https://www.w3.org/TR/sparql11-results-json/
TSV         Accept: text/tab-separated-values               -                As specified in https://www.w3.org/TR/sparql11-results-csv-tsv/
CSV         Accept: text/csv                                -                As specified in https://www.w3.org/TR/sparql11-results-csv-tsv/
Binary RDF  Accept: application/x-binary-rdf-results-table  -                -

Query timeout

The service enforces a hard query deadline of 30 seconds.

Any query that takes longer than this deadline to execute will time out.
If that happens, you may want to optimize the query, or report it on the Wikidata_query_service/Problematic_queries page.

Standalone service

As the service is open source software, it is also possible to run the service on any user's server, by using the instructions provided below.

The hardware recommendations can be found in the Blazegraph documentation.

Installing

To install the service, it is recommended that you download the full service package as a ZIP file, e.g. from Maven Central, with group ID org.wikidata.query.rdf and artifact ID "service", or clone the source distribution at https://github.com/wikimedia/wikidata-query-rdf/ and build it with "mvn package". The package ZIP will be in the dist/target directory, named service-VERSION-dist.zip.

The package contains the Blazegraph server as a .war application, the libraries needed to run the updater service to fetch fresh data from the wikidata site, scripts to make various tasks easier, and the GUI in the gui subdirectory. If you want to use the GUI, you will have to configure your HTTP server to serve it.

By default, only the SPARQL endpoint at http://localhost:9999/bigdata/namespace/wdq/sparql is configured, and the default Blazegraph GUI is available at http://localhost:9999/bigdata/. Note that in the default configuration, both are accessible only from localhost. You will need to provide external endpoints and an appropriate access control if you intend to access them from outside.

Loading data

The rest of the installation procedure is described in detail in the Getting Started document, which is part of the distribution, and involves the following steps:

  1. Download a recent RDF dump from https://dumps.wikimedia.org/wikidatawiki/entities/ (the RDF one is the one ending in .ttl.gz).
  2. Pre-process the data with the munge.sh script. This creates a set of TTL files with preprocessed data, with names like wikidump-000000001.ttl.gz, etc. See options for the script below.
  3. Start the Blazegraph service by running the runBlazegraph.sh script.
  4. Load the data into the service by using loadData.sh. Note that loading data is usually significantly slower than pre-processing, so you can start loading as soon as several preprocessed files are ready. Loading can be restarted from any file by using the options as described below.
  5. After all the data is loaded, start the Updater service by using runUpdate.sh.
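
The sequence of steps can be summarized as a list of command lines. The Python sketch below only assembles and prints them, using the script names and options listed in the Scripts section; it does not execute anything. The function name load_commands and the dump file name data/dump.ttl.gz are placeholders. In practice runBlazegraph.sh has to keep running while the later steps are carried out.

```python
import shlex


def load_commands(dump_file, data_dir, namespace="wdq", start=None):
    """Return the documented steps as argument lists (nothing is executed)."""
    load = ["./loadData.sh", "-n", namespace, "-d", data_dir]
    if start is not None:
        # Resume loading from the processed file with this number.
        load += ["-s", str(start)]
    return [
        ["./munge.sh", "-f", dump_file, "-d", data_dir],  # pre-process the dump
        ["./runBlazegraph.sh"],                           # start Blazegraph
        load,                                             # load processed files
        ["./runUpdate.sh", "-n", namespace],              # start the Updater
    ]


commands = load_commands("data/dump.ttl.gz", "data", start=3)
for cmd in commands:
    print(shlex.join(cmd))
```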

Scripts

The following useful scripts are part of the distribution:

munge.sh

Pre-process data from RDF dump for loading.

Option        Required?  Explanation
-f filename   Yes        Filename of the RDF dump
-d directory  No         Directory where the processed files will be written, default is current directory
-l language   No         If specified, only labels for the given language will be retained. Use this option if you need only one language, as it may improve performance, reduce the database size and simplify queries.
-s            No         If specified, the data about sitelinks is excluded. Use this option if you do not need to query sitelinks, as this may improve performance and reduce the database size.

Example:

./munge.sh -f data/wikidata-20150427-all-BETA.ttl.gz -d data -l en -s

loadData.sh

Load processed data into Blazegraph. Requires curl to be installed.

Option        Required?  Explanation
-n namespace  Yes        Specifies the graph namespace into which the data is loaded, which should be wdq for WDQS data
-d directory  No         Directory where processed files are stored, by default the current directory
-h host       No         Hostname of the SPARQL endpoint, by default localhost
-c context    No         Context URL of the SPARQL endpoint, by default bigdata - usually doesn't need to be changed for WDQS
-s start      No         Number of the processed file to start with, by default 1
-e end        No         Number of the processed file to end with

Example:

./loadData.sh -n wdq -d `pwd`/data

runBlazegraph.sh

Run the Blazegraph service.

Option        Required?  Explanation
-d directory  No         Home directory of the Blazegraph installation, by default the same directory where the script is
-c context    No         Context URL of the SPARQL endpoint, by default bigdata - usually doesn't need to be changed for WDQS
-p port       No         Port number of the SPARQL service, by default 9999

Example:

./runBlazegraph.sh

runUpdate.sh

Run the Updater service.

Option        Required?  Explanation
-n namespace  Yes        Specifies the graph namespace into which the data is loaded, should be wdq for WDQS data
-h host       No         Hostname of the SPARQL endpoint, by default localhost
-c context    No         Context URL of the SPARQL endpoint, by default bigdata - usually doesn't need to be changed for WDQS
-l language   No         If specified, only labels for the given language will be retained. Use this option if you need only one language, as it may improve performance, reduce the database size and simplify queries.
-s            No         If specified, the data about sitelinks is excluded. Use this option if you do not need to query sitelinks, as this may improve performance and reduce the database size.

It is recommended that the settings for the -l and -s options (or absence thereof) be the same for munge.sh and runUpdate.sh, otherwise data may not be updated properly.

Example:

./runUpdate.sh -n wdq

Missing features

Below are features which are currently not supported:

  • Redirects are only represented as an owl:sameAs triple; they do not express any equivalence in the data and have no special support.
  • SERVICE requests to outside URLs are not allowed in queries.

Contacts

If you notice anything wrong with the service, you can contact the Discovery team by email on the list wikimedia-search@lists.wikimedia.org or on the IRC channel #wikimedia-discovery.

Bugs can also be submitted to Phabricator and tracked on the Discovery Phabricator board.

See Also