Wikibase/Indexing/WDQS Beta

The WDQS service is now in production beta; this page describes the pre-production/testing setup. For information about the production beta, please see the WDQS User Manual.

The purpose of this deployment is to provide a testing ground for the query service and to collect basic usage patterns. The service runs at http://wdqs-test.wmflabs.org/ (offline).

Deployment hosts

wdq-beta.eqiad.wmflabs and db01.eqiad.wmflabs.

wdq-beta serves http://wdqs-test.wmflabs.org/; db01 is an internal host for experiments.

If you need access to either host, ping any member of the wikidata-query project on Labs. Each host is an xlarge instance with 160G of storage.

Source code

The code comes from https://github.com/wikimedia/wikidata-query-rdf/. See https://github.com/wikimedia/wikidata-query-rdf/blob/master/docs/getting-started.md for a detailed description of how to build and set up the service. This is already done on the beta host, so the guide is for information and disaster-recovery purposes only.

All necessary data except for the nginx configs (see below) is contained in the service-*-dist.zip deployment package, which is what is deployed at /srv/wdqs/blazegraph. Deployment can be done with the puppet role below. Note that the puppet role does not start the Updater service.

Puppet deployment

Puppet uses a self-hosted puppetmaster at wdqs-puppetmaster.

Configuration for puppetmaster:

  • check role::puppet::self
  • set the puppetmaster to wdqs-puppetmaster
  • check role puppetmaster::autosigner
  • set puppetmaster_autoupdate to true

Configuration for clients:

  • check role::puppet::self
  • set the puppetmaster to wdqs-puppetmaster
  • enable role role::wdqs

Blazegraph deployment

Blazegraph is deployed in /srv/wdqs/blazegraph and runs under the user blazegraph. If the service stops or crashes, restart it by running:

 # ./runBlazegraph.sh | tee $(date +%s).log

from /srv/wdqs/blazegraph. Keeping the logs for at least some time is recommended in case of an unexpected failure. There is no log rotation scheme in place so far, so delete old logs once you are sure nobody needs them anymore.
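Since there is no log rotation, old logs could be pruned with something like the following. This is a minimal sketch, assuming GNU find and that the logs are the *.log files written by the tee command above into the directory where Blazegraph is started. The function name prune_blazegraph_logs and the 14-day default are hypothetical choices, not part of the deployment.

```shell
# Hypothetical helper: delete *.log files from a directory (default: current
# directory) once they are more than N full days old (default: 14).
# Assumes GNU find; the retention period is an illustrative choice.
prune_blazegraph_logs() {
    local dir="${1:-.}" days="${2:-14}"
    # -maxdepth 1 keeps the search in this directory only;
    # -mtime +N matches files last modified more than N full days ago.
    find "$dir" -maxdepth 1 -type f -name '*.log' -mtime "+${days}" -print -delete
}
```

For example, prune_blazegraph_logs /srv/wdqs/blazegraph 30 would print and remove logs older than 30 days, provided the logs were written there.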

Some interesting settings can be found in /srv/wdqs/blazegraph/blazegraph/WEB-INF/web.xml, namely queryThreadPoolSize and queryTimeout. Changing them probably requires a restart. Note that if you restart the Blazegraph service, you may also need to restart the updater, as it may give up if Blazegraph is offline for too long (see below).
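To check the current values without opening the file, a small lookup like the one below may help. It is a sketch that assumes the usual servlet web.xml layout, in which each setting is a context-param whose param-value sits on the line directly after its param-name. The helper name webxml_param is hypothetical.

```shell
# Hypothetical helper: print the value of a web.xml context parameter.
# Assumes <param-value> is on the line immediately after <param-name>.
webxml_param() {
    local name="$1" file="${2:-/srv/wdqs/blazegraph/blazegraph/WEB-INF/web.xml}"
    # Take the matching <param-name> line plus the line after it,
    # then keep only the text inside <param-value>...</param-value>.
    grep -A1 "<param-name>${name}</param-name>" "$file" \
        | sed -n 's:.*<param-value>\(.*\)</param-value>.*:\1:p'
}
```

Usage would be webxml_param queryTimeout or webxml_param queryThreadPoolSize. If the file is laid out differently, the helper prints nothing.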

The Blazegraph instance has a GUI workbench accessible at http://localhost:9999/. It is not for public access, as it allows full write access to the database. One can access it by configuring port forwarding while logging in to the host via ssh.
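The port forwarding mentioned above can be sketched as follows. It assumes you can ssh to the host directly under the name wdq-beta.eqiad.wmflabs; any bastion or proxy setup your Labs access requires is not shown. The function name workbench_tunnel is hypothetical.

```shell
# Hypothetical helper: forward a local port to the Blazegraph workbench port
# on the beta host. Host and port defaults are taken from this page.
workbench_tunnel() {
    local host="${1:-wdq-beta.eqiad.wmflabs}" port="${2:-9999}"
    # -N: do not run a remote command, only forward ports.
    # -L: connections to local $port go to localhost:$port on the remote host.
    ssh -N -L "${port}:localhost:${port}" "$host"
}
```

While the tunnel is open, the workbench should be reachable at http://localhost:9999/ on your own machine. Stop the tunnel with Ctrl-C.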

Updater deployment

The updater is the service that constantly pulls changes from Wikidata and synchronizes them with the current database. If it stops, the query service remains functional but contains data only up to the last successful update. This service can be run under any user, as it communicates with Blazegraph only via the REST API and does not store any persistent data itself; everything is stored in Blazegraph. It currently runs under smalyshev.

It can be run as:

# ./runUpdater -n wdq

from /srv/wdqs/blazegraph. However, running it as a service (service wdqs-updater start) is recommended.

The updater log is configured by updater-logs.xml. The updater logs progress information like this:

20:32:55.850 [main] INFO  org.wikidata.query.rdf.tool.Update - Polled up to 2015-05-19T09:11:50Z at (2.6, 2.7, 2.8) 
updates per second and (2085.5, 2096.2, 2202.2) milliseconds per second 

The date is the point in the main database up to which the service has been updated. The first set of numbers is the number of entities updated per second; the second is how far the updater caught up with the main data in one second. These numbers are relevant only if the service is behind the main DB.
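To see how far behind the service is, the timestamp can be extracted from the most recent "Polled up to" line and compared with the current time. The following is a minimal sketch, assuming GNU date (for the -d option) and the log line format shown above. The function name updater_lag_seconds is hypothetical.

```shell
# Hypothetical helper: read updater log lines on stdin and print how many
# seconds the last "Polled up to" timestamp lags behind a reference time.
# The reference defaults to the current time; requires GNU date for -d.
updater_lag_seconds() {
    local now="${1:-$(date -u +%s)}" ts
    # Extract the ISO 8601 timestamp from the most recent "Polled up to" line.
    ts=$(sed -n 's/.*Polled up to \([0-9TZ:-]*\).*/\1/p' | tail -n 1)
    # Fail without output if no such line was found.
    [ -n "$ts" ] || return 1
    echo $(( now - $(date -u -d "$ts" +%s) ))
}
```

It would be used as updater_lag_seconds < updater.log, where updater.log is a placeholder for wherever updater-logs.xml sends the log output.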

If there are no updates, the updater sleeps and then re-checks the Wikidata site. It can also be stopped and restarted at any moment without affecting query service functionality. If the Blazegraph service is down, the updater retries for a short time and then exits.
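Because the updater exits when Blazegraph stays down, it can be useful to confirm that Blazegraph answers before starting the updater again. The sketch below assumes that an HTTP request to the workbench address http://localhost:9999/ succeeds whenever Blazegraph is up, and that it is run on the host with sufficient rights to start services. The function name is hypothetical.

```shell
# Hypothetical helper: start the updater service only if Blazegraph responds.
# Assumes a successful HTTP response from the workbench URL means Blazegraph is up.
start_updater_if_blazegraph_up() {
    local url="${1:-http://localhost:9999/}"
    # --fail makes curl return non-zero on HTTP errors (status 400 and above).
    if curl --silent --fail --max-time 10 --output /dev/null "$url"; then
        service wdqs-updater start
    else
        echo "Blazegraph not reachable at $url; updater not started" >&2
        return 1
    fi
}
```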

Web access

External access to the service is provided at the URL http://wdqs-beta.wmflabs.org/.

Access goes through an nginx proxy; its configs are in /etc/nginx/sites-enabled/wdqs. Only GET requests to URLs starting with /bigdata/ are proxied to Blazegraph.
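Since only GET requests are proxied, a query has to be sent as a URL query string. The sketch below uses curl for that, with the /bigdata/namespace/wdq/sparql path that also appears in the access logs. It assumes the endpoint accepts the standard SPARQL protocol query parameter and the application/sparql-results+json media type. The function name wdqs_query is hypothetical.

```shell
# Hypothetical helper: send a SPARQL query as a GET request through the proxy.
# The base URL and endpoint path are taken from this page.
wdqs_query() {
    local query="$1" base="${2:-http://wdqs-beta.wmflabs.org}"
    # --get appends the URL-encoded data to the URL as a query string,
    # so the request stays a GET instead of becoming a POST.
    curl --silent --get \
        --header 'Accept: application/sparql-results+json' \
        --data-urlencode "query=${query}" \
        "${base}/bigdata/namespace/wdq/sparql"
}
```

For example: wdqs_query 'SELECT * WHERE { ?s ?p ?o } LIMIT 1'.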

The root document for http://wdqs-beta.wmflabs.org/ is the WDQS Beta GUI, which is served from /srv/wdqs/blazegraph/gui/. In the sources, it is located in the gui/ subdirectory.

Access logs are in /var/log/nginx. Searching them for "/bigdata/namespace/wdq/sparql" yields the list of queries that were attempted from the GUI.
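The search can be narrowed to just the requested URLs. This sketch assumes the nginx "combined" log format, in which the request line is the first double-quoted field. The function name sparql_requests is hypothetical, and the query strings it prints are still URL-encoded.

```shell
# Hypothetical helper: print the requested path (including the query string)
# of every SPARQL request found in the given nginx access log files.
# Assumes the "combined" log format: the request line is the first quoted field.
sparql_requests() {
    grep -hF '/bigdata/namespace/wdq/sparql' "$@" \
        | awk -F'"' '{ split($2, req, " "); print req[2] }'
}
```

A typical call would be sparql_requests /var/log/nginx/access.log, where access.log is the nginx default file name and may differ on the host.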

See also the sample nginx config that can be used for the service.

Monitoring

The logs for SPARQL requests are available in the Labs Logstash: https://logstash-beta.wmflabs.org/#/dashboard/elasticsearch/wdqs

Graphite monitoring is available at http://graphite.wmflabs.org/, e.g.: http://graphite.wmflabs.org/dashboard/#wdq-beta

Other tools

This section should eventually find a better place; for now, this is the list of related tools: