Phabricator project: #WDQS

Wikidata Query Service/User Manual

A 5-minute tutorial on the Wikidata Query Service

Wikidata Query Service is a software package and public service designed to provide a SPARQL endpoint which allows you to query the data in Wikidata.

Please note that the service is currently in "beta" mode, which means that details of the data set or the service provided may change without prior notice.

This page and other related documentation pages will be updated accordingly; it is recommended that you watch them if you are using the service.

You can see example SPARQL queries on the SPARQL examples page.

Data

The Wikidata Query Service operates on a data set from Wikidata.org represented as RDF, as described in the RDF dump format documentation.

The service's data set does not exactly match the data produced by the RDF dumps, mainly for performance reasons; the documentation describes a small set of differences.

You can download a weekly dump of the same data from:

https://dumps.wikimedia.org/wikidatawiki/entities/

Basics - Understanding SPO (Subject, Predicate, Object), also known as a semantic triple

SPO o "Tema, predicado, objeto" es también conocido como un triple, o comúnmente referido en Wikidata como un "estado" de la información.

The statement "The capital of the United States is Washington, DC." consists of "United States" as the subject (Q30), "capital" as the predicate (P36), and "Washington, DC" as the object (Q61). This statement can be represented as three URIs:

<http://www.wikidata.org/entity/Q30>  <http://www.wikidata.org/prop/direct/P36>  <http://www.wikidata.org/entity/Q61> .

Thanks to prefixes (noted below), the same statement can be written in a more concise form. Note that the period at the end marks the end of the statement.

wd:Q30  wdt:P36  wd:Q61 .

The /entity/ (wd:) represents a Wikidata entity (Q-number values). The /prop/direct/ (wdt:) is a "truthy" property — the value we would expect most often when looking at the statement. Truthy properties are needed because some statements may be "more true" than others. For example, the statement "The capital of the US is New York City" is also true — but only when viewed in the context of US history. WDQS uses ranks to determine which statements should be used as "truthy".

In addition to the truthy statements, WDQS stores all statements (truthy or not), but they do not use the same wdt: prefix. The capital of the US has three values: DC, Philadelphia, and New York. Each of these values has "qualifiers": additional information, such as start and end dates, which narrows the scope of each statement. To store this data in the triplestore, WDQS introduces a "magic" statement subject, which is essentially a random number.

wd:Q30  p:P36  <random_URI_1> .         # the US "indirect capital" is <X>
<random_URI_1>  ps:P36  wd:Q61 .        # X's "real capital value" is Washington DC
<random_URI_1>  pq:P580  "1800-11-17" . # X's start date (a qualifier) is 1800-11-17

See the SPARQL tutorial on qualifiers for more information.

SPO is also used as the basic syntax form for querying RDF data structures, or any graph database or triplestore, such as the Wikidata Query Service (WDQS), which is powered by Blazegraph, a high-performance graph database.

Advanced use of triples (SPO) even includes using triples as the objects or subjects of other triples!

Basics - Understanding prefixes

Subjects and predicates (the first and second values of a triple) must always be stored as URIs. For example, if the subject is Universe (Q1), it is stored as <http://www.wikidata.org/entity/Q1>. Prefixes allow us to write that long URL in a shorter form: wd:Q1. Unlike subjects and predicates, the object (the third value of a triple) can be either a URI or a literal, for example a number or a string.

The Wikidata Query Service understands many shortcut abbreviations, known as prefixes. Some are internal to Wikidata, e.g. wd, wdt, p, ps, bd, and many others are commonly used external prefixes, like rdf, skos, owl, and schema.

In the following query, we are asking for items where there is a statement of "P279 = Q7725634" or, in plainer terms, selecting subjects that have a predicate of "subclass of" with an object of "literary work". The output variables are ?s and ?desc:

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wds: <http://www.wikidata.org/entity/statement/>
PREFIX wdv: <http://www.wikidata.org/value/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
PREFIX pq: <http://www.wikidata.org/prop/qualifier/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX bd: <http://www.bigdata.com/rdf#>

# The below SELECT query does the following:
# Selects all the items (?s subjects) and their descriptions (?desc)
# WHERE the subject (?s) has a direct property (wdt:) P279 <subclass of>
# with a value of entity (wd:) Q7725634 <literary work>,
# and optionally returns the English label as ?desc via rdfs:label

SELECT ?s ?desc WHERE {
  ?s wdt:P279 wd:Q7725634 .
  OPTIONAL {
     ?s rdfs:label ?desc filter (lang(?desc) = "en").
   }
 }

Extensions

The service supports the following extensions to standard SPARQL capabilities:

Label service

You can fetch the label, alias, or description of entities you query, with language fallback, using the specialized service with the URI <http://wikiba.se/ontology#label>.

The service can be used in one of two modes: manual and automatic.

In automatic mode, you only need to specify the service template. For example:
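
A minimal sketch of the automatic mode (modeled on the label-service calls used elsewhere on this page; Q6256 "country" and the ?countryLabel variable are illustrative): the service automatically binds ?countryLabel for the corresponding ?country variable.

# Automatic mode: the label service fills in ?countryLabel for each ?country
SELECT ?country ?countryLabel WHERE {
  ?country wdt:P31 wd:Q6256 .          # instance of (P31): country (Q6256)
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
  }
}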

Geospatial search

The service allows searching for items with coordinates located within a given radius of a center point, or within a given bounding box.

Search around a point

Example:

# Airports within 100km from Berlin
#defaultView:Map
SELECT ?place ?placeLabel ?location ?dist WHERE {
  # Berlin coordinates
  wd:Q64 wdt:P625 ?berlinLoc . 
  SERVICE wikibase:around { 
      ?place wdt:P625 ?location . 
      bd:serviceParam wikibase:center ?berlinLoc . 
      bd:serviceParam wikibase:radius "100" . 
      bd:serviceParam wikibase:distance ?dist.
  } 
  # Is an airport
  ?place wdt:P31/wdt:P279* wd:Q1248784 .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" . 
  }
} ORDER BY ASC(?dist)


The first line of the around service call must have the format ?item predicate ?location, where the search result will bind ?item to items within the specified area and ?location to their coordinates. The supported parameters are:

Predicate Meaning
wikibase:center The point around which the search is performed. Must be bound for the search to work.
wikibase:radius Distance from the center. Currently the distance is always in kilometers; other units are not yet supported.
wikibase:globe The globe being searched. Optional; defaults to Earth (wd:Q2).
wikibase:distance The variable receiving the distance information.

Search within a box

Example of a box search:


# Schools between San Jose, CA and Sacramento, CA
#defaultView:Map
SELECT * WHERE {
  wd:Q16553 wdt:P625 ?SJloc .
  wd:Q18013 wdt:P625 ?SCloc .
  SERVICE wikibase:box {
      ?place wdt:P625 ?location .
      bd:serviceParam wikibase:cornerSouthWest ?SJloc .
      bd:serviceParam wikibase:cornerNorthEast ?SCloc .
    }
  ?place wdt:P31/wdt:P279* wd:Q3914 .
}


or:


#Schools between San Jose, CA and San Francisco, CA
#defaultView:Map
SELECT ?place ?location WHERE {
  wd:Q62 wdt:P625 ?SFloc .
  wd:Q16553 wdt:P625 ?SJloc .
  SERVICE wikibase:box {
      ?place wdt:P625 ?location .
      bd:serviceParam wikibase:cornerWest ?SFloc .
      bd:serviceParam wikibase:cornerEast ?SJloc .
  }
  ?place wdt:P31/wdt:P279* wd:Q3914 .
}


Coordinates may be specified directly:


# Schools between San Jose, CA and Sacramento, CA
#same as previous
#defaultView:Map
SELECT * WHERE {
  SERVICE wikibase:box {
      ?place wdt:P625 ?location .
      bd:serviceParam wikibase:cornerWest "Point(-121.872777777 37.304166666)"^^geo:wktLiteral .
      bd:serviceParam wikibase:cornerEast "Point(-121.486111111 38.575277777)"^^geo:wktLiteral .
  }
  ?place wdt:P31/wdt:P279* wd:Q3914 .
}


The first line of the box service call must have the format ?item predicate ?location, where the search result will bind ?item to items within the specified area and ?location to their coordinates. The supported parameters are:

Predicate Meaning
wikibase:cornerSouthWest The south-west corner of the box.
wikibase:cornerNorthEast The north-east corner of the box.
wikibase:cornerWest The western corner of the box.
wikibase:cornerEast The eastern corner of the box.
wikibase:globe The globe being searched. Optional; defaults to Earth (wd:Q2).

wikibase:cornerSouthWest and wikibase:cornerNorthEast should be used together, as should wikibase:cornerWest and wikibase:cornerEast; they cannot be mixed. If the wikibase:cornerWest and wikibase:cornerEast predicates are used, the points are assumed to be the coordinates of the diagonal of the box, and the corners are derived accordingly.

Extended functions

Distance function

The geof:distance function returns the distance between two points on Earth, in kilometers. Usage example:


# Airports within 100km from Berlin
SELECT ?place ?placeLabel ?location ?dist WHERE {

  # Berlin coordinates
  wd:Q64 wdt:P625 ?berlinLoc . 
  SERVICE wikibase:around { 
      ?place wdt:P625 ?location . 
      bd:serviceParam wikibase:center ?berlinLoc . 
      bd:serviceParam wikibase:radius "100" . 
  } 
  # Is an airport
  ?place wdt:P31/wdt:P279* wd:Q1248784 .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" . 
  }
  BIND(geof:distance(?berlinLoc, ?location) as ?dist) 
} ORDER BY ?dist



# Places around 0°,0° 
SELECT *
{
  SERVICE wikibase:around { 
      ?place wdt:P625 ?location . 
      bd:serviceParam wikibase:center "Point(0 0)"^^geo:wktLiteral .
      bd:serviceParam wikibase:radius "250" . 
  } 
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . ?place rdfs:label ?placeLabel }
  BIND(geof:distance("Point(0 0)"^^geo:wktLiteral, ?location) as ?dist) 
} 
ORDER BY ?dist


Coordinate parts functions

The functions geof:globe, geof:latitude, and geof:longitude return the parts of a coordinate value: respectively, the globe URI, the latitude, and the longitude.
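
A minimal sketch of these functions, reusing Berlin's coordinate (wd:Q64, P625) from the examples above:

# Split Berlin's coordinate into globe, latitude and longitude
SELECT ?globe ?lat ?lon WHERE {
  wd:Q64 wdt:P625 ?loc .
  BIND(geof:globe(?loc) AS ?globe)
  BIND(geof:latitude(?loc) AS ?lat)
  BIND(geof:longitude(?loc) AS ?lon)
}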

URL decode function

The wikibase:decodeUri function decodes (i.e. reverses the percent-encoding of) a given URI string. This may be necessary when converting Wikipedia titles (which are encoded) into actual strings. This function is the opposite of SPARQL's encode_for_uri function.
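
A minimal sketch (the encoded string is an arbitrary illustration):

# Decode a percent-encoded title back into a plain string
SELECT ?decoded WHERE {
  BIND(wikibase:decodeUri("D%C3%BCsseldorf") AS ?decoded)   # yields "Düsseldorf"
}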

Automatic prefixes

Most prefixes that are used in common queries are supported by the engine without the need to explicitly specify them.
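
For example, the following query should run on the public endpoint as-is, with no PREFIX header (a sketch; wd: and wdt: are resolved automatically, and Q146 "house cat" is illustrative):

# No PREFIX declarations needed for the common Wikidata prefixes
SELECT ?cat WHERE {
  ?cat wdt:P31 wd:Q146 .   # instance of (P31): house cat (Q146)
} LIMIT 10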

Extended dates

The service supports date values of type xsd:dateTime in the range of about 290B years in the past and in the future, with one-second resolution. WDQS stores dates as the 64-bit number of seconds since the Unix epoch.
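
As an illustration, such dates can be compared like any other xsd:dateTime values (a sketch; P571, the "inception" property, and the cutoff date are illustrative):

# Items whose inception is before 4000 BCE
SELECT ?item ?date WHERE {
  ?item wdt:P571 ?date .
  FILTER(?date < "-4000-01-01T00:00:00Z"^^xsd:dateTime)
} LIMIT 10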

Blazegraph extensions

The Blazegraph platform, on top of which WDQS is implemented, has its own set of SPARQL extensions. Among them are several graph traversal algorithms, documented on the Blazegraph Wiki, including BFS, shortest-path, CC (connected components), and PageRank implementations.

Please also refer to the Blazegraph documentation on query hints for information about how to control query execution and various aspects of the engine.

Federation

We allow SPARQL Federated Queries to call out to a selected number of external databases. Supported endpoints are:

URL Owner (docs)
http://sparql.europeana.eu/ Europeana
http://data.cervantesvirtual.com/openrdf-sesame/repositories/data Biblioteca Virtual Miguel de Cervantes
http://datos.bne.es/sparql Biblioteca Nacional de España
http://edan.si.edu/saam/sparql Smithsonian American Art Museum
http://data.bnf.fr/sparql Bibliothèque nationale de France
http://dbpedia.org/sparql DBPedia
http://vocab.getty.edu/sparql.json Getty Vocabularies
http://rdf.insee.fr/sparql INSEE
http://dati.emilia-romagna.it/sparql Istituto per i beni artistici, culturali e naturali
http://dati.camera.it/sparql Italian Chamber of Deputies
http://nomisma.org/query Nomisma.org
http://data.plan4all.eu/sparql Smart Points of Interest
http://opendatacommunities.org/sparql UK Department for Communities and Local Government
http://statistics.data.gov.uk/sparql UK Office for National Statistics
http://data.ordnancesurvey.co.uk/datasets/os-linked-data/apis/sparql UK ordnance survey
http://linkeddata.uriburner.com/sparql URI Burner
http://sparql.wikipathways.org/ WikiPathways
http://tools.wmflabs.org/mw2sparql/sparql MW2SPARQL
http://collection.britishart.yale.edu/sparql/ Yale Center for British Art
http://linkedgeodata.org/sparql Linked Geodata
http://sisinflab.poliba.it/semanticweb/lod/losm/sparql Linked Open Street Map
http://etna.istc.cnr.it/framester2/sparql Framester
http://202.45.139.84:10035/catalogs/fao/repositories/agrovoc AGROVOC

Example federated query:

SELECT ?workLabel WHERE {
  wd:Q165257 wdt:P2799 ?id 
  BIND(uri(concat("http://data.cervantesvirtual.com/person/", ?id)) as ?bvmcID)
  SERVICE <http://data.cervantesvirtual.com/openrdf-sesame/repositories/data> {
    ?bvmcID <http://rdaregistry.info/Elements/a/otherPFCManifestationOf> ?work .
    ?work rdfs:label ?workLabel        
  }
}


Please note that the databases listed above use ontologies that may be very different from the Wikidata one. Please refer to the owner documentation links above to learn about the ontologies and data access for these databases.

Mediawiki API

Please see the full description on the MediaWiki API Service documentation page.

The MediaWiki API Service allows calling out to the MediaWiki API from SPARQL and receiving the results inside the SPARQL query. Example (finding category members):


SELECT * WHERE {
  wd:Q27119725 wdt:P910 ?category .
  ?link schema:about ?category; schema:isPartOf <https://en.wikipedia.org/>; schema:name ?title .
  SERVICE wikibase:mwapi {
     bd:serviceParam wikibase:api "Generator" .
     bd:serviceParam wikibase:endpoint "en.wikipedia.org" .
     bd:serviceParam mwapi:gcmtitle ?title .
     bd:serviceParam mwapi:generator "categorymembers" .
     bd:serviceParam mwapi:gcmprop "ids|title|type" .
     bd:serviceParam mwapi:gcmlimit "max" .
    # out
    ?subcat wikibase:apiOutput mwapi:title  .
    ?ns wikibase:apiOutput "@ns" .
    ?item wikibase:apiOutputItem mwapi:item .
  }
}


Wikimedia service

Wikimedia runs the public service instance of WDQS, which is available for use at http://query.wikidata.org/.

The runtime of the query on the public endpoint is limited to 60 seconds. That is true both for the GUI and the public SPARQL endpoint. If you need to run longer queries, please contact the Discovery team.

GUI

The GUI at the home page of http://query.wikidata.org/ allows you to edit and submit SPARQL queries to the query engine. The results are displayed as an HTML table. Note that every query has a unique URL which can be bookmarked for later use. Going to this URL will put the query in the edit window, but will not run it - you still have to click "Execute" for that.

One can also generate a short URL for the query via a URL shortening service by clicking the "Generate short URL" link on the right - this will produce the shortened URL for the current query.

The "Add prefixes" button generates the header containing standard prefixes for SPARQL queries. The full list of prefixes that can be useful is listed in the RDF format documentation. Note that most common prefixes work automatically, since WDQS supports them out of the box.

The GUI also features a simple entity explorer which can be activated by clicking on the "🔍" symbol next to the entity result. Clicking on the entity Q-id itself will take you to the entity page on wikidata.org.

Default views

Main article: wikidata:Special:MyLanguage/Wikidata:SPARQL query service/Wikidata Query Help/Result Views

If you run the query in the WDQS GUI, you can choose which view to present by specifying a comment: #defaultView:viewName at the beginning of the query.
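
For example, prefixing a query with #defaultView:Map, as the geospatial examples above do, renders the results on a map instead of the default table (a sketch; Q515 "city" and P625 "coordinate location" are illustrative):

#defaultView:Map
SELECT ?city ?location WHERE {
  ?city wdt:P31 wd:Q515 ;    # instance of (P31): city (Q515)
        wdt:P625 ?location . # coordinate location
} LIMIT 100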

SPARQL endpoint

SPARQL queries can be submitted directly to the SPARQL endpoint with a GET or POST request to https://query.wikidata.org/sparql?query=SPARQL. The result is returned as XML by default, or as JSON if either the query parameter format=json or the header Accept: application/sparql-results+json is provided. POST requests also accept the query in the body of the request instead of the URL, which allows running larger queries without hitting the URL length limit. (Note that the POST body must still be query=SPARQL, not just SPARQL, and the SPARQL query must still be URL-escaped.)

JSON format is standard SPARQL 1.1 Query Results JSON Format.

It is recommended to use GET for smaller queries and POST for larger queries, as POST queries are not cached.
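
For instance, a small query could be submitted from the command line like this (a sketch using curl; the query text is illustrative):

# GET request; JSON results selected via the Accept header
curl -G 'https://query.wikidata.org/sparql' \
     -H 'Accept: application/sparql-results+json' \
     --data-urlencode 'query=SELECT ?cat WHERE { ?cat wdt:P31 wd:Q146 } LIMIT 5'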

Supported formats

The following output formats are currently supported by the SPARQL endpoint:

Format HTTP Header Query parameter Description
XML Accept: application/sparql-results+xml format=xml XML result format, returned by default. As specified in https://www.w3.org/TR/rdf-sparql-XMLres/
JSON Accept: application/sparql-results+json format=json JSON result format, as in: https://www.w3.org/TR/sparql11-results-json/
TSV Accept: text/tab-separated-values As specified in https://www.w3.org/TR/sparql11-results-csv-tsv/
CSV Accept: text/csv As specified in https://www.w3.org/TR/sparql11-results-csv-tsv/
Binary RDF Accept: application/x-binary-rdf-results-table

Query timeout

There is a hard query deadline configured which is set to 60 seconds.

Every query will timeout when it takes more time to execute than this configured deadline. You may want to optimize the query or report a problematic query here.

Also note that currently access to the service is limited to 5 parallel queries per IP. These limits are subject to change depending on resources and usage patterns.

Namespaces

The data on Wikidata Query Service contains the main namespace, wdq, to which queries to the main SPARQL endpoint are directed, and other auxiliary namespaces, listed below. To query data from a different namespace, use the endpoint URL https://query.wikidata.org/bigdata/namespace/NAMESPACENAME/sparql.

Categories

DCAT-AP

The DCAT-AP data for Wikidata is available as SPARQL in namespace dcatap.

The SPARQL endpoint for accessing it is: https://query.wikidata.org/bigdata/namespace/dcatap/sparql

The source for the data is: https://dumps.wikimedia.org/wikidatawiki/entities/dcatap.rdf

Example query to retrieve data:


PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dct: <http://purl.org/dc/terms/>

SELECT ?url ?date ?size WHERE {
  <https://www.wikidata.org/about#catalog> dcat:dataset ?dump .
  ?dump dcat:distribution [
    dct:format "application/json" ;
    dcat:downloadURL ?url ;
    dct:issued ?date ;
    dcat:byteSize ?size 
  ] .
}


Linked Data Fragments endpoint

We also support querying the database using the Triple Pattern Fragments interface. This allows you to cheaply and efficiently browse triple data where one or two components of the triple are known and you need to retrieve all the triples that match this template. See more information at the Linked Data Fragments site.

The interface can be accessed at the URL https://query.wikidata.org/bigdata/ldf. Example requests:
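
For instance, a request for all triples with a given subject might look like this (an illustrative sketch built from the subject/predicate/object parameters described below; the entity is arbitrary):

https://query.wikidata.org/bigdata/ldf?subject=http%3A%2F%2Fwww.wikidata.org%2Fentity%2FQ42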

Note that only full URLs are currently supported for the subject, predicate and object parameters.

By default, the HTML interface is displayed; however, several data formats are available, selected by the Accept HTTP header.

Accept Format
text/html Default HTML browsing interface
text/turtle Turtle format
application/ld+json JSON-LD format
application/n-triples N-Triples format
application/rdf+xml RDF/XML format

The data is returned in pages of 100 triples each. Pages are numbered starting from 1, and the page number is set by the page parameter.

Standalone service

As the service is open source software, it is also possible to run the service on your own server, using the instructions provided below.

The hardware recommendations can be found in Blazegraph documentation.

If you plan to run the service against a non-Wikidata Wikibase instance, please see the further instructions.

Installing

In order to install the service, it is recommended that you download the full service package as a ZIP file, e.g. from Maven Central with group ID org.wikidata.query.rdf and artifact ID "service", or clone the source distribution at https://github.com/wikimedia/wikidata-query-rdf/ and build it with "mvn package". The package ZIP will be in the dist/target directory, named service-VERSION-dist.zip.

The package contains the Blazegraph server as a .war application, the libraries needed to run the updater service to fetch fresh data from the wikidata site, scripts to make various tasks easier, and the GUI in the gui subdirectory. If you want to use the GUI, you will have to configure your HTTP server to serve it.

By default, only the SPARQL endpoint at http://localhost:9999/bigdata/namespace/wdq/sparql is configured, and the default Blazegraph GUI is available at http://localhost:9999/bigdata/. Note that in the default configuration, both are accessible only from localhost. You will need to provide external endpoints and an appropriate access control if you intend to access them from outside.

Using snapshot versions

If you want to install an unreleased snapshot version (usually necessary when a released version has a bug that is already fixed but no new release is available yet) and do not want to compile your own binaries, you can use either:

Loading data

Further install procedure is described in detail in the Getting Started document which is part of the distribution, and involves the following steps:

  1. Download recent RDF dump from https://dumps.wikimedia.org/wikidatawiki/entities/ (the RDF one is the one ending in .ttl.gz).
  2. Pre-process data with the munge.sh script. This creates a set of TTL files with preprocessed data, with names like wikidump-000000001.ttl.gz, etc. See options for the script below.
  3. Start Blazegraph service by running the runBlazegraph.sh script.
  4. Load the data into the service by using loadData.sh. Note that loading data is usually significantly slower than pre-processing, so you can start loading as soon as several preprocessed files are ready. Loading can be restarted from any file by using the options as described below.
  5. After all the data is loaded, start the Updater service by using runUpdate.sh.

Loading categories

If you also want to load category data, please do the following:

  1. Create namespace, e.g. categories: createNamespace.sh categories
  2. Load data into it: forAllCategoryWikis.sh loadCategoryDump.sh categories

Note that these scripts only load data from Wikimedia wikis according to Wikimedia settings. If you need to work with another wiki, you may need to change some variables in the scripts.

Scripts

The following useful scripts are part of the distribution:

munge.sh

Pre-process data from RDF dump for loading.

Option Required? Explanation
-f filename Yes Filename of the RDF dump
-d directory No Directory where the processed files will be written, default is current directory
-l language No If specified, only labels for the given language will be retained. Use this option if you need only one language, as it may improve performance, reduce the database size and simplify queries.
-s No If specified, the data about sitelinks is excluded. Use this option if you do not need to query sitelinks, as this may improve performance and reduce the database size.

Example:

./munge.sh -f data/wikidata-20150427-all-BETA.ttl.gz -d data -l en -s

loadData.sh

Load processed data into Blazegraph. Requires curl to be installed.

Option Required? Explanation
-n namespace Yes Specifies the graph namespace into which the data is loaded, which should be wdq for WDQS data
-d directory No Directory where processed files are stored, by default the current directory
-h host No Hostname of the SPARQL endpoint, by default localhost
-c context No Context URL of the SPARQL endpoint, by default bigdata - usually doesn't need to be changed for WDQS
-s start No Number of the processed file to start with, by default 1
-e end No Number of the processed file to end with

Example:

./loadData.sh -n wdq -d `pwd`/data

runBlazegraph.sh

Run the Blazegraph service.

Option Required? Explanation
-d directory No Home directory of the Blazegraph installation, by default the same directory where the script is
-c context No Context URL of the SPARQL endpoint, by default bigdata - usually doesn't need to be changed for WDQS
-p port No Port number of the SPARQL service, by default 9999
-o options No Add options to the command line

Example:

./runBlazegraph.sh

Inside the script, there are two variables that one may want to edit:

# Q-id of the default globe
DEFAULT_GLOBE=2
# Blazegraph HTTP User Agent for federation
USER_AGENT="Wikidata Query Service; https://query.wikidata.org/";

Also, the following environment variables are checked by the script (all of them are optional):

Variable Default Explanation
HOST localhost Hostname for binding the Blazegraph service
PORT 9999 Port for binding the Blazegraph service
DIR directory where the script is located Directory where config files are stored
HEAP_SIZE 16g Java heap size for Blazegraph
MEMORY -Xms${HEAP_SIZE} -Xmx${HEAP_SIZE} Full Java memory settings for Blazegraph
GC_LOGS see the source GC logging settings
CONFIG_FILE RWStore.properties Blazegraph configuration file location
BLAZEGRAPH_OPTS empty Additional options, are passed as-is to the Java command line

runUpdate.sh

Run the Updater service.

Option Required? Explanation
-n namespace Yes Specifies the graph namespace into which the data is loaded, should be wdq for WDQS data
-h host No Hostname of the SPARQL endpoint, by default localhost
-c context No Context URL of the SPARQL endpoint, by default bigdata - usually doesn't need to be changed for WDQS
-l language No If specified, only labels for given language will be retained. Use this option if you need only one language, as it may improve performance, reduce the database size and simplify queries.
-s No If specified, the data about sitelinks is excluded. Use this option if you do not need to query sitelinks, as this may improve performance and reduce the database size.
-t secs No Timeout when communicating to Blazegraph, in seconds.

It is recommended that the settings for the -l and -s options (or absence thereof) be the same for munge.sh and runUpdate.sh, otherwise data may not be updated properly.

Example:

./runUpdate.sh -n wdq

Also, the following environment variables are checked by the script (all of them are optional):

Variable Default Explanation
UPDATER_OPTS empty Additional options, are passed as-is to the Java command line

Updater options

The following options work with the Updater app.

They should be given to the runUpdate.sh script as additional options after --, e.g.: runUpdate.sh -- -v.

Options for the Updater
Option Long option Meaning
-v --verbose Verbose mode
-s TIMESTAMP --start TIMESTAMP Start data collection from certain timestamp, in 2015-02-11T17:11:08Z or 20150211170100 format.
--keepTypes Keep all type statements
--ids ID1,ID2,... Update certain IDs and exit
--idrange ID1-ID2 Update range of IDs and exit
-d SECONDS --pollDelay SECONDS How long to sleep when no new data is available
-t NUMBER --threadCount NUMBER How many threads to use when fetching Wikibase data
-b NUMBER --batchSize NUMBER How many changes to fetch from the RecentChanges API
-V --verify Verify data after loading (SLOW! For debug use only)
-T SECONDS --tailPollerOffset SECONDS Use secondary trailing poller with given offset behind the main one
--entityNamespaces NUMBER,NUMBER,... List of Wikibase namespaces to check for changes
--wikibaseScheme SCHEME URL scheme (http, https) to use when talking to Wikibase
--wikibaseHost HOSTNAME Hostname to use when talking to Wikibase
-I --init If specified together with a start time, that time is marked in the database as the most recent change time, and future requests will use it as the starting point even if no newer data has been found.

Configurable properties

The following properties can be configured by adding them to the script run command in the scripts above:

Name Meaning Default
wikibaseServiceWhitelist Filename of remote service whitelist. Applies to Blazegraph. whitelist.txt
org.wikidata.query.rdf.blazegraph.mwapi.MWApiServiceFactory.config Config file for MWAPI integration mwservices.json
wikibaseHost Hostname of the wikibase instance. Applies to both Blazegraph and Updater. www.wikidata.org
org.wikidata.query.rdf.blazegraph.inline.literal.WKTSerializer.noGlobe Default globe value for coordinates that have no globe. "2" would mean that entity Q2 is the default globe. "0" means no default globe. Applies to Blazegraph. 0
org.wikidata.query.rdf.tool.rdf.RdfRepository.timeout Timeout when communicating with RDF repository, in seconds. Applies to Updater. -1
org.wikidata.query.rdf.tool.wikibase.WikibaseRepository.timeout Timeout when communicating with wikibase repository, in milliseconds. Applies to Updater. 5000
http.userAgent User agent that the service would use while calling other services
http.proxyHost http.proxyPort https.proxyHost https.proxyPort Proxy settings used while calling other services
wikibaseMaxDaysBack How many days back we can request recent-changes data for the Updater. If the database is older than this number of days, it should be reloaded from a more recent dump. 30
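
For example, a property from the table above could be set like this (a sketch: it assumes the documented -o option of runBlazegraph.sh passes the flag through to the Java command line, uses the standard -Dname=value syntax for Java system properties, and the hostname is a placeholder):

./runBlazegraph.sh -o "-DwikibaseHost=my.wikibase.example"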


Missing features

Below are features which are currently not supported:

  • Redirects are only represented as an owl:sameAs triple; they do not express any equivalence in the data and have no special support.

Contacts

If you notice anything wrong with the service, you can contact the Discovery team by email on the list discovery@lists.wikimedia.org or on the IRC channel #wikimedia-discovery.

Bugs can also be submitted to Phabricator and tracked on the Discovery Phabricator board.

See also