This page is obsolete. It is being retained for archival purposes. It may document extensions or features that are obsolete and/or no longer supported. Do not rely on the information here being up-to-date.
Wikidata Query Service Beta
Wikidata query service beta deployment
The purpose of this deployment is to provide a testing ground for the query service and to collect basic usage patterns. The service runs at http://wdqs-test.wmflabs.org/ (offline).
db01 is an internal host for experiments.
If you need access to it, ping any member of the wikidata-query project on Labs. Each is an xlarge instance with 160G storage.
The code comes from https://github.com/wikimedia/wikidata-query-rdf/. See https://github.com/wikimedia/wikidata-query-rdf/blob/master/docs/getting-started.md for a detailed description of how to build and set up the service. This is already done on the beta host, so the description is for information and disaster-recovery purposes only.
All necessary data except for the nginx configs (see below) is contained in the service-*-dist.zip deployment package, which is what is deployed at /srv/wdqs/blazegraph. Deployment can be done by the puppet role below. Note that the puppet role does not start the Updater service.
Puppet is using self-hosted puppetmaster at
Configuration for puppetmaster:
- set the puppetmaster to
- check role
Configuration for clients:
- set the puppetmaster to
- enable role
Blazegraph is deployed in /srv/wdqs/blazegraph, running under user blazegraph. If the service stops or crashes, restart it by running:
# ./runBlazegraph.sh | tee $(date +%s).log
Run this in /srv/blazegraph. Preserving logs for at least some time is recommended in case of an unexpected failure. There is no log rotation scheme in place so far, so just delete old logs once you are sure nobody needs them anymore.
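Since the restart command names each log after the epoch time, old logs can be cleaned up by age. The following is only a sketch: the function name `prune_blazegraph_logs` and the 30-day default are assumptions made for illustration, not part of the deployment.

```shell
# Hypothetical helper (not part of the deployment): delete Blazegraph logs
# older than a given number of days from a directory.
# Usage: prune_blazegraph_logs DIR [DAYS]    (DAYS defaults to 30, an assumed value)
prune_blazegraph_logs() {
    if [ -z "$1" ]; then
        echo "usage: prune_blazegraph_logs DIR [DAYS]" >&2
        return 1
    fi
    # Match only files named like <digits>.log, as produced by
    # "tee $(date +%s).log"; print each file name before removing it.
    find "$1" -maxdepth 1 -type f -name '[0-9]*.log' \
        -mtime +"${2:-30}" -print -exec rm -f -- {} +
}

# Example (run only once nobody needs the old logs):
#   prune_blazegraph_logs /srv/blazegraph 30
```

The helper prints the names of the files it removes, so the output can be reviewed or kept as a record.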
Some interesting settings may be found in /srv/wdqs/blazegraph/blazegraph/WEB-INF/web.xml, namely queryTimeout. Changing them probably requires a restart. Note that if you restart the Blazegraph service, you may also need to restart the updater, as it may give up if Blazegraph is offline for too long (see below).
The Blazegraph instance has a GUI workbench accessible at http://localhost:9999/. It is not for public access, as it allows full write access to the database. One can access it by configuring port forwarding when logging in to the host via ssh.
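The port forwarding could be set up roughly as follows. This is a sketch: the function name `wdqs_tunnel` and the `user@host` argument are placeholders, and only the port 9999 comes from this page.

```shell
# Hypothetical helper: forward local port 9999 to the workbench on the
# remote host. The login (user@host) is a placeholder supplied by the caller.
wdqs_tunnel() {
    if [ -z "$1" ]; then
        echo "usage: wdqs_tunnel user@host" >&2
        return 1
    fi
    # -N: do not run a remote command, only forward ports
    # -L: listen on local port 9999 and forward to localhost:9999 on the host
    ssh -N -L 9999:localhost:9999 "$1"
}

# Example:
#   wdqs_tunnel user@host
# then open http://localhost:9999/ in a local browser.
```

Because the workbench allows full write access, the tunnel should be kept open only for as long as it is needed.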
The updater is the service that constantly polls Wikidata and synchronizes it with the current database. If it stops, the query service is still functional but contains data only up to the last successful update. This service can be run under any user, as it communicates with Blazegraph only via the REST API and does not store any persistent data by itself; everything is stored in Blazegraph. Currently runs under
It can be run as:
# ./runUpdater -n wdq
in /srv/wdqs/blazegraph. However, running it as a service is recommended:
service wdqs-updater start
The updater log is configured by updater-logs.xml. The updater logs progress information like this:
20:32:55.850 [main] INFO org.wikidata.query.rdf.tool.Update - Polled up to 2015-05-19T09:11:50Z at (2.6, 2.7, 2.8) updates per second and (2085.5, 2096.2, 2202.2) milliseconds per second
The date is the point in the main database up to which the service has been updated. The first set of numbers is the number of entities updated per second; the second is how far the service advanced in catching up with the main data during one second. These numbers are relevant only if the service is behind the main DB.
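To see how far the updater has progressed, the "Polled up to" timestamp can be pulled out of such a log line. A minimal sketch, assuming the log line format shown above; the function name `polled_up_to` is made up for this example.

```shell
# Hypothetical helper: read updater log lines on stdin and print the
# "Polled up to" timestamp of every line that has one.
polled_up_to() {
    sed -n 's/.*Polled up to \([0-9T:Z-]*\) at .*/\1/p'
}

# Sample line copied from the log excerpt above.
line='20:32:55.850 [main] INFO org.wikidata.query.rdf.tool.Update - Polled up to 2015-05-19T09:11:50Z at (2.6, 2.7, 2.8) updates per second and (2085.5, 2096.2, 2202.2) milliseconds per second'

printf '%s\n' "$line" | polled_up_to
# prints 2015-05-19T09:11:50Z
```

Comparing the last printed timestamp with the current time gives a rough idea of how far the service is behind the main DB.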
If there are no updates, the updater sleeps and then re-checks the Wikidata site. It can also be stopped and restarted at any moment without affecting query service functionality. If the Blazegraph service is down, the updater retries for a short time and then exits.
External access to the service is provided at the URL http://wdqs-beta.wmflabs.org/.
Access goes through an nginx proxy; the configs are in /etc/nginx/sites-enabled/wdqs. Only GET requests to URLs starting with /bigdata/ are proxied to Blazegraph.
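Because only GET requests are proxied, a query from the command line has to be sent as a GET with the SPARQL text in the query string. A sketch of how that could look with curl, assuming the endpoint path /bigdata/namespace/wdq/sparql that appears in the access logs; the function name `wdqs_query`, the `WDQS_URL` variable and the JSON Accept header are assumptions for illustration.

```shell
# Base URL of the public service; can be overridden from the environment.
WDQS_URL="${WDQS_URL:-http://wdqs-beta.wmflabs.org}"

# Hypothetical helper: send a SPARQL query as a GET request.
# -G turns the --data-urlencode value into a URL query string, so the
# request stays a GET and passes the proxy rule described above.
wdqs_query() {
    curl -sS -G \
        -H 'Accept: application/sparql-results+json' \
        --data-urlencode "query=$1" \
        "$WDQS_URL/bigdata/namespace/wdq/sparql"
}

# Example:
#   wdqs_query 'SELECT * WHERE { ?s ?p ?o } LIMIT 1'
```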
The root document for http://wdqs-beta.wmflabs.org/ is the WDQS Beta GUI, which is served from /srv/wdqs/blazegraph/gui/. It is located in the gui/ subdirectory in the sources.
Access logs are in /var/log/nginx. Searching for "/bigdata/namespace/wdq/sparql" provides the list of queries that were attempted from the GUI.
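The search can be done with a fixed-string grep. A sketch only: the function name `sparql_requests` is made up, and the access.log file name in the example is the usual nginx default rather than something stated on this page.

```shell
# Hypothetical helper: print access-log entries that hit the SPARQL endpoint.
# Usage: sparql_requests [LOGFILE ...]    (reads stdin when no file is given)
sparql_requests() {
    # -F: treat the pattern as a literal string, not a regular expression
    grep -F '/bigdata/namespace/wdq/sparql' "$@"
}

# Example (file name is an assumption):
#   sparql_requests /var/log/nginx/access.log
```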
See also the sample nginx config that can be used for the service.
The logs for SPARQL requests are available at labs Logstash: https://logstash-beta.wmflabs.org/#/dashboard/elasticsearch/wdqs
This section should eventually find a better place; for now, this is the list of related tools: