WDQS Production

The following is the list of items that need to be completed before putting Wikidata Query Service into production. Complexity is a (very) rough estimate of how many work days it may take to get an item working; the estimate covers only the actual work to be done and does not include waiting for resource allocation, bureaucracy, etc. Severity key:
 * MUST - we cannot go to production before this is done.
 * SHOULD - ideally we need this for a production-quality service; if we can't deliver it right now, we can proceed without it but should prioritize it right after the MUSTs are done.
 * MAY - we should have it eventually, but we can survive for some time without it.

Detailed description of each item follows.

Packaging to ops standards
Currently the packaging is a single ZIP which is built by Maven, uploaded to Maven Central, and is supposed to be downloaded and deployed manually. We want to keep the Maven option, but we also need something that matches what Ops usually works with. We need to figure out how to build this package and deploy it.

We should talk to ops about it. We could go with deb packaging but it might be simpler to deploy using git-fat.

Converting services to Debian-model services
WDQS has two Java services - Blazegraph and Updater - which right now are run by manual scripts, with no log rotation, no watchdog/restart, etc. This needs to change so that these become standard, properly supported services. The locations of the database, logs, configs, etc. should also be reviewed and possibly changed to match standards.

I (manybubbles) see two options:
 * 1) systemd has facilities to capture stdout and stderr and dump them into nicely rotated log files, I believe.
 * 2) use the standard Java loggers and enable things like a size-based rolling policy.

Whatever ops wants, though I suspect it'll be systemd.
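For option 1, a unit file could look roughly like this. This is only a sketch: the paths, service user, and jar name are made-up placeholders, and the real layout is whatever ops decides.

```ini
# Hypothetical unit file for the Blazegraph service -- all paths and
# names below are placeholders, not the actual deployment layout.
[Unit]
Description=Blazegraph triple store for WDQS
After=network.target

[Service]
User=blazegraph
ExecStart=/usr/bin/java -jar /srv/wdqs/blazegraph.jar
Restart=on-failure
# journald captures stdout/stderr and handles rotation for us
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target
```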

Preparing puppet scripts to ops standards
After the above is finished, the puppet scripts now living in a private repo should be updated and fixed according to ops standards and included in WMF's standard puppet.

This should probably be assigned to an opsen. It'd be way, way faster than having the Discovery team do it.

Automated initial loading from dumps
Currently the only way to load a dump into the query service is to follow a manual procedure. We may want a more automated way of doing it.

Maybe build it into the updater? If it can't find the data then it does the download, etc.? That'd be super duper convenient for external users. We have a script to do it already, so maybe just have puppet run the script if it can't see any data? You just have to make the script idempotent -or- add a check_if_required style script. Not super complex. It'd be interesting to test this using Vagrant, I think. We do this with Cirrus in Vagrant and that works, at least.
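A cheap way to make the "is there data already?" check work: send an ASK query to the store and only kick off the load when it comes back empty. A minimal sketch, assuming a local Blazegraph endpoint (the URL and namespace are placeholders):

```python
import json
import urllib.parse
import urllib.request

# Placeholder endpoint; the real URL depends on the deployment.
ENDPOINT = "http://localhost:9999/bigdata/namespace/wdq/sparql"

# ASK is cheap: the store can answer as soon as one triple matches.
ASK_QUERY = "ASK { ?s ?p ?o }"


def has_data(sparql_json: str) -> bool:
    """Parse the boolean out of a SPARQL JSON ASK result."""
    return bool(json.loads(sparql_json)["boolean"])


def needs_initial_load(endpoint: str = ENDPOINT) -> bool:
    """True when the store is empty, i.e. the dump load should run."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": ASK_QUERY})
    request = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"})
    with urllib.request.urlopen(request) as resp:
        return not has_data(resp.read().decode())
```

Puppet (or the load script itself) would call `needs_initial_load()` and skip the download when it returns False, which makes re-running the provisioning harmless.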

Automated version upgrade
Currently there is no way to move from one version of the service to another automatically - one needs to install the new version, shut down the old one manually, and transfer the DB by hand. We need to make this automatic.

On the other hand, Elasticsearch upgrades also require a manual shutdown and restart, and that is fine, even good. They don't require any manual fiddling with the files, though. We should make the updater restart itself automatically on upgrade. Certainly if it's deb-packaged.

Backup/disaster recovery story
There is no recovery option currently except for "set up completely new server and reload from scratch". We need to review this and decide if we want to have better options.

External hardening
The service should be properly firewalled: not modifiable from outside (i.e. SPARQL UPDATE requests blocked) and not able to issue SERVICE requests to outside sources. An external user should not be able to modify the data on the service or cause the service to call out to external processes or over the network.
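A keyword filter in front of the endpoint is one crude front-line check; it can false-positive on queries that merely mention these words in string literals, so the real enforcement still belongs in the firewall and in the endpoint configuration, where updates and federation should be disabled outright. A sketch:

```python
import re

# Keywords that indicate a SPARQL write operation or a federated
# call-out. This is deliberately coarse: it will also reject harmless
# queries that only contain these words inside literals.
FORBIDDEN = re.compile(
    r"\b(INSERT|DELETE|LOAD|CLEAR|DROP|CREATE|MOVE|COPY|SERVICE)\b",
    re.IGNORECASE,
)


def is_query_allowed(query: str) -> bool:
    """Allow only queries that contain no write/SERVICE keywords."""
    return FORBIDDEN.search(query) is None
```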

Internal hardening
A user of the service should not be able to consume more than a defined share of the resources and cause a DoS for other users. Queries should be limited in time and memory and should time out/abort when the limits are reached.

Security review completed
We need to complete a security review of the WDQS setup. See: https://phabricator.wikimedia.org/T90115

Size, request & obtain hardware
Determine which hardware we need, request it from ops and set it up.

Devise performance monitoring criteria
Devise a set of metrics that we need to collect from the running service in order to monitor its performance.

Set up service alerts
Set up a system that monitors the entry points for Blazegraph and the GUI and the health of the Updater service, and alerts when any of them goes down.
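The probing side can be as simple as an HTTP GET per entry point plus a rule for when to alert. A sketch with placeholder URLs (the real ones depend on where the services end up; the Updater has no HTTP endpoint, so its check would look at the process or its lag instead):

```python
import urllib.request

# Placeholder entry points -- hostnames and ports are assumptions.
CHECKS = {
    "blazegraph": "http://localhost:9999/bigdata/namespace/wdq/sparql",
    "gui": "http://localhost/",
}


def is_up(url: str, timeout: float = 5.0) -> bool:
    """True when the endpoint answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False


def failing(results: dict) -> list:
    """Names of services that should trigger an alert, given probe results."""
    return sorted(name for name, ok in results.items() if not ok)
```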

Connect to performance monitoring services
Create scripts to measure the metrics described above and send them to Graphite or another metric collection tool.
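Graphite's plaintext protocol makes the sending side trivial: one `path value timestamp` line per metric over TCP, port 2003 by default. A sketch (the `wdqs.` metric namespace is a placeholder):

```python
import socket
import time


def graphite_line(path: str, value: float, timestamp: int) -> str:
    """Format one metric in Graphite's plaintext protocol."""
    return f"{path} {value} {timestamp}\n"


def send_metric(path: str, value: float,
                host: str = "localhost", port: int = 2003) -> None:
    """Ship a single metric to a Carbon plaintext listener."""
    line = graphite_line(path, value, int(time.time()))
    with socket.create_connection((host, port)) as sock:
        sock.sendall(line.encode("ascii"))
```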

Connect to analytic log collection services
Set up log collection and connect it to the existing analytics systems.

Negative dates handling
Right now, year 0 and negative dates are not handled consistently by the WDQS service, because no custom logic is used for date calculations. We need to fix that. See: https://phabricator.wikimedia.org/T94539

Redirects handling
Some entities are redirects to other entities. The intended semantics are that these entity IDs are completely interchangeable. This does not currently work. See: https://phabricator.wikimedia.org/T96490

Geocoordinates handling
Right now we store geographic data but are unable to do any geographic searches at all. We need to be able to at least compute the distance between two points, and ideally have some index that allows us to do geographic searches.
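The "distance between two points" part is just the haversine formula over the stored coordinates; indexing for real geographic search is the harder, separate problem. A sketch:

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1: float, lon1: float,
                 lat2: float, lon2: float) -> float:
    """Great-circle distance between two (lat, lon) points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```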

Labels handling
In order to obtain the label for an item, users have to perform cumbersome SPARQL queries that are easy to mishandle. We should define a custom function which would produce labels in the preferred language, with fallback. See: https://phabricator.wikimedia.org/T97079
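For illustration, here is the kind of query users have to assemble by hand today, which a custom label function would replace. This version merely restricts labels to an accepted language list; proper preference-ordered fallback makes the hand-written query even more cumbersome. The prefixes and query shape are illustrative, not the exact WDQS namespace setup:

```python
def label_query(entity_id: str, languages: list) -> str:
    """Build a manual label lookup restricted to a set of languages.
    The wd:/rdfs: prefixes are assumed to be declared elsewhere."""
    langs = ", ".join(f'"{lang}"' for lang in languages)
    return (
        "SELECT ?label WHERE {\n"
        f"  wd:{entity_id} rdfs:label ?label .\n"
        f"  FILTER(lang(?label) IN ({langs}))\n"
        "}"
    )
```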

User-facing documentation
We may want to create a better service description and organize the existing documentation into a consistent documentation package which allows the user to quickly get up to speed with the service.