Wikidata Query Service/Implementation

Dependencies
WDQS depends on two customized packages that are stored in Wikimedia Archiva:  and. Check main pom.xml for current versions.

If changes are required, the new packages have to be built and deployed to Archiva, then WDQS binaries should be rebuilt.

Rebuilding Blazegraph

 * 1) Use gerrit repo
 * 2) Commit fixes (watch for extra whitespace changes!)
 * 3) Run:
 * 4) Check on local install that fixes work
 * 5) Run to deploy:

Rebuilding LDFServer

 * 1) Check out http://github.com/smalyshev/Server.Java
 * 2) Make new branch and fixes
 * 3) Push the new branch to the origin
 * 4) Run
 * 5) Bump the version in WDQS main pom.xml to new version

Labs Deployment (beta)
Note that deployment is currently done via git-fat (see below), which may require some manual steps after checkout. See also Wikidata Query service beta. The manual steps can be done as follows:
 * 1) Check out   repository and update   submodule to current   branch.
 * 2) Run   to instantiate the binaries if necessary.
 * 3) rsync the files to deploy directory

Production Deployment
Production deployment is done via git deployment repository. The puppet role that needs to be enabled for the service is. The procedure is as follows:
 * 1)   the source repository.
 * 2)   in the source repository - this deploys the artifacts to archiva. Note that for this you will need repositories   and   configured in   with archiva username/password.
 * 3) Install new files (which will be also in  ) to deploy repo above. Commit them. Note that since git-fat uses archiva as primary storage, there can be a delay between files being deployed to archiva and them appearing on rsync and ready for git-fat deployment.
 * 4) Use   to deploy the new build.

It is recommended to test deployment checkout on beta (see above) before deploying it in production.

GUI deployment
GUI deployment files are in repository  branch. It is a submodule of  which is linked as   subdirectory.

To build deployment GUI version, use  in gui subdir. This will generate patch for deploy repo that needs to be merged in gerrit (currently manually). Then update submodule  on   to latest   head and commit/push the change. Deploy as per above.

Services
Service  runs the Blazegraph server.

Service  runs the updater. Depends on wdqs-blazegraph.

Maintenance mode
To put the server into maintenance mode, create file  - this will make all HTTP requests return 503, and the LB will take the server out of rotation. Note that Icinga monitoring will alert about the server being down, so take measures to prevent that alert before starting maintenance on the server.
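As a rough illustration of how a health check could interpret this behaviour, the sketch below maps an HTTP status code to a pooling decision. The function name `pool_decision` and the handling of statuses other than 503 are assumptions made for illustration; they do not describe the actual LB configuration.

```python
def pool_decision(status_code: int) -> str:
    """Map a health-check HTTP status to a hypothetical pooling decision."""
    if status_code == 503:
        # Maintenance mode: every request returns 503, so the server
        # should be taken out of rotation.
        return "depool"
    if 200 <= status_code < 400:
        # Normal responses: keep the server in rotation.
        return "keep"
    # Assumption: any other status is flagged for a human to look at.
    return "investigate"


if __name__ == "__main__":
    for code in (200, 503, 500):
        print(code, pool_decision(code))
```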

Non-Wikidata deployment
WDQS can be run as a service for any Wikibase instance, not just Wikidata. You can still follow the instructions in the documentation, but you may need to make some additional configurations. Please refer to Standalone Wikibase documentation for full description of the steps necessary.

Hardware
We're currently running on three servers in eqiad:,  ,   and three servers in codfw:  ,   and. Those two clusters are in active/active mode (traffic is sent to both), but due to how we route traffic with GeoDNS, the eqiad cluster sees most of the traffic.

Server specs are similar to the following:


 * CPU: dual Intel(R) Xeon(R) CPU E5-2620 v3
 * Disk: 800GB raw raided space SSD
 * RAM: 128GB

The internal cluster has,   and   in eqiad and  ,   and   in codfw. The hardware is the same as above.

Releasing to Maven
The release procedure is described here: http://central.sonatype.org/pages/ossrh-guide.html

Releasing new version

 * 1) Set the version with
 * 2) Commit the patch and merge it
 * 3) Tag the version:
 * 4) Deploy the files to OSS:  . You will need GPG key configured to sign the release.
 * 5) Proceed with the release as described in OSS guide.
 * 6) Set the version back to snapshot:
 * 7) Commit the patch and merge it

Updating specific ID
If there is a need to update specific ID data manually, this can be done using (for ID Q12345):

The runUpdate.sh script is located in the root of the WDQS deployment directory. Note that each server needs to be updated separately, because the servers do not share databases.

Resetting start time
By default, the Updater uses the timestamp of the last recorded update, or of the dump if no updates have happened yet. Use  option to reset start time. The start time is recorded when the first change is consumed, so if you are dealing with a wiki that does not update often, to explicitly reset the data at the start use  option to the updater.

Resetting Kafka offset
If the Updater uses Kafka as its change source, it records the Kafka offsets of the latest updates consumed and resumes from them when restarted. To reset these offsets, run this query:

DELETE { ?z rdf:first ?head ; rdf:rest ?tail. } WHERE { [] wikibase:kafka ?list. ?list rdf:rest* ?z. ?z rdf:first ?head ; rdf:rest ?tail. } ; DELETE WHERE {  wikibase:kafka ?o. };

Replace the Wikidata URL in the last query with your instance URL if your dataset is not based on Wikidata.
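For non-Wikidata instances, the URL substitution can be scripted. The following is a minimal sketch, not part of WDQS: the function name `kafka_offset_reset_update` and its `instance_url` parameter are hypothetical, and the code assumes that the instance URL appears as the subject of the `wikibase:kafka` triple and that the endpoint predefines the `rdf:` and `wikibase:` prefixes. It only builds the update text; sending it to the server is left out.

```python
def kafka_offset_reset_update(instance_url: str) -> str:
    """Build the two-part SPARQL update that clears recorded Kafka offsets.

    instance_url is a hypothetical parameter: the URL identifying the
    dataset (the Wikidata URL for Wikidata-based datasets).
    """
    # Reject values that would break out of the <...> IRI syntax.
    if not instance_url or any(c in instance_url for c in '<>" \n'):
        raise ValueError("instance_url must be a plain IRI")

    # Part 1: remove the nodes of the RDF list that stores the offsets.
    drop_list_nodes = (
        "DELETE { ?z rdf:first ?head ; rdf:rest ?tail . }\n"
        "WHERE {\n"
        "  [] wikibase:kafka ?list .\n"
        "  ?list rdf:rest* ?z .\n"
        "  ?z rdf:first ?head ; rdf:rest ?tail .\n"
        "}"
    )
    # Part 2: remove the pointer to the list.
    # Assumption: the instance URL is the subject of this triple.
    drop_pointer = f"DELETE WHERE {{ <{instance_url}> wikibase:kafka ?o . }}"
    return f"{drop_list_nodes} ;\n{drop_pointer}"


if __name__ == "__main__":
    print(kafka_offset_reset_update("https://wikibase.example.org"))
```

The host `wikibase.example.org` in the usage line is a placeholder.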

Units support
For support of the unit conversion, the configuration of unit conversion is stored in  This config is generated by script, e.g.: If the config is changed, after new config is in place, the database should be updated (unless new dump is going to be loaded) by running: This will generate an RDF file which then will need to be loaded into the database.

Monitoring
Icinga group

Grafana dashboard: https://grafana.wikimedia.org/dashboard/db/wikidata-query-service

Grafana frontend dashboard: https://grafana.wikimedia.org/dashboard/db/wikidata-query-service-frontend

WDQS dashboard: http://discovery.wmflabs.org/wdqs/

Prometheus metrics collected: https://github.com/wikimedia/operations-debs-prometheus-blazegraph-exporter/blob/master/prometheus-blazegraph-exporter#L79

Data reload procedure
Please see: https://wikitech.wikimedia.org/wiki/Wikidata_query_service#Data_reload_procedure

Usage constraints
Wikidata Query Service has a public endpoint available at https://query.wikidata.org. Because anyone is free to use this endpoint, traffic is highly variable, and the performance of the endpoint can vary considerably.

Current restrictions are:


 * Query timeout of 60 seconds
 * One client (user agent + IP) is allowed 60 seconds of processing time every 60 seconds
 * One client is allowed 30 error queries per minute
 * Clients exceeding the limits above are throttled
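To make the accounting behind these limits concrete, here is a small illustrative model of a per-client budget over a sliding 60-second window. It is a sketch under stated assumptions, not the service's actual throttling code: the class `ClientBudget`, its method names, and the sliding-window approach are all hypothetical.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60.0        # length of the accounting window
TIME_BUDGET_SECONDS = 60.0   # processing time allowed per window
ERROR_BUDGET = 30            # error queries allowed per window


class ClientBudget:
    """Hypothetical per-client accounting keyed by (user agent, IP)."""

    def __init__(self):
        # key -> deque of (timestamp, processing_seconds, is_error)
        self._events = defaultdict(deque)

    def record(self, user_agent, ip, now, seconds, is_error=False):
        """Record one finished query for the given client."""
        self._events[(user_agent, ip)].append((now, seconds, is_error))

    def is_throttled(self, user_agent, ip, now):
        """Return True if the client exceeded either limit in the window."""
        events = self._events[(user_agent, ip)]
        # Drop events that have fallen out of the window.
        while events and events[0][0] <= now - WINDOW_SECONDS:
            events.popleft()
        used = sum(seconds for _, seconds, _ in events)
        errors = sum(1 for _, _, is_error in events if is_error)
        return used > TIME_BUDGET_SECONDS or errors > ERROR_BUDGET


if __name__ == "__main__":
    budget = ClientBudget()
    budget.record("my-tool/1.0", "192.0.2.1", now=0.0, seconds=40.0)
    budget.record("my-tool/1.0", "192.0.2.1", now=10.0, seconds=30.0)
    print(budget.is_throttled("my-tool/1.0", "192.0.2.1", now=11.0))
```

Keying on the pair (user agent, IP) mirrors the definition of a client given in the list above; everything else is a modelling choice.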

We also have an internal endpoint, which serves WMF internal workloads. The endpoint is at http://wdqs-internal.discovery.wmnet/sparql. The new internal endpoint is subject to the following constraints:


 * Timeout of 30 seconds
 * A user agent must be set
 * Only internal access is allowed
 * Must be used only for synchronous user-facing traffic, not batch jobs
 * Requests are expected to be cheap
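A client that respects these constraints could look roughly like the sketch below. It uses only the Python standard library and the standard SPARQL protocol (a GET request with a `query` parameter). The function names, the `Accept` header, and the choice to mirror the 30-second limit as a client-side timeout are assumptions for illustration; the endpoint URL is taken from the text above with the stray space removed.

```python
import urllib.parse
import urllib.request

INTERNAL_ENDPOINT = "http://wdqs-internal.discovery.wmnet/sparql"
TIMEOUT_SECONDS = 30  # client-side timeout mirroring the server limit


def build_request(query: str, user_agent: str) -> urllib.request.Request:
    """Build a SPARQL GET request with an explicit User-Agent."""
    if not user_agent.strip():
        # The internal endpoint requires a user agent to be set.
        raise ValueError("a User-Agent is required by the internal endpoint")
    url = INTERNAL_ENDPOINT + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url,
        headers={
            "User-Agent": user_agent,
            # Assumption: the caller wants JSON results.
            "Accept": "application/sparql-results+json",
        },
    )


def run_query(query: str, user_agent: str) -> bytes:
    """Send the query; only works from inside the internal network."""
    request = build_request(query, user_agent)
    with urllib.request.urlopen(request, timeout=TIMEOUT_SECONDS) as response:
        return response.read()


if __name__ == "__main__":
    req = build_request("ASK { ?s ?p ?o }", "my-service/1.0 (team@example.org)")
    print(req.full_url)
```

The user agent string in the usage line is a placeholder; `run_query` is not called here because the endpoint is reachable only internally.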

The constraints are likely to evolve once we see what the actual use cases are and how the cluster behaves. If you have a question about how or whether to use the internal endpoint, please contact us.

Contacts
If you need more info, talk to User:Smalyshev_(WMF) or anybody from the Discovery team.