WDQS Production

From MediaWiki.org
To Production and Beyond!

The following items need to be completed before the Wikidata Query Service (WDQS) can be put into production:

| # | Item | Severity | Complexity | Who can do it | Depends on | Phab task |
| 1 | Productization | | | | | |
| 11 | Packaging to ops standards | MUST | 2 | Ops | 12 | T103897 |
| 12 | Converting services to Debian-model services | MUST | 2 | Stas+Ops | - | T103904 |
| 13 | Preparing puppet scripts to ops standards | MUST | 1 | Stas+Ops | 11, 12 | T95679 |
| 14 | Automated initial loading from dumps | MAY | | | - | |
| 15 | Automated version upgrade | MAY | 1 | | 12 | |
| 16 | Backup/disaster recovery story | SHOULD | | Ops | - | T103906 |
| 17 | External hardening | MUST | | | | T103907 |
| 18 | Internal hardening | SHOULD | | | | T103908 |
| 19 | Security review completed | MUST | | | 11, 12 | T90115 |
| 110 | Size, request & obtain hardware | MUST | 1 | | - | T86561 |
| 2 | Monitoring | | | | 1 | |
| 21 | Devise performance monitoring criteria | SHOULD | 1 | | - | T103922 |
| 22 | Set up service alerts | MUST | 1 | | 1 | T103911 |
| 23 | Connect to performance monitoring services | SHOULD | 1 | | 21 | T103931 |
| 24 | Connect to analytic log collection services | MUST | 1 | | 1 | T98030 |
| 3 | Features | | | | | |
| 31 | Negative dates handling | SHOULD | 4 | Stas | - | T94539 |
| 32 | Redirects handling | MAY | 10 | Stas | - | T96490 |
| 33 | Geocoordinates handling | SHOULD | 20 | Blazegraph? | - | |
| 34 | Labels handling | SHOULD | | | - | T97079 |
| 35 | User-facing documentation | SHOULD | 3 | | - | T103932 |

Severity key:

  • MUST - we cannot go to production before this is done.
  • SHOULD - ideally we need this for a production-quality service. If we can't deliver it right now we can proceed without it, but we should prioritize it right after the MUSTs are done.
  • MAY - we should have it eventually, but we can survive for some time without it.

Complexity is a (very) rough estimate of how many work days it may take to get the item working. The estimate covers only the actual work to be done and does not include waiting for resource allocation, bureaucracy, etc.

Detailed description of each item follows.

Productization

Packaging to ops standards

Currently the packaging is a single ZIP which is built by Maven, uploaded to Maven Central, and is supposed to be downloaded and deployed manually. We want to keep the Maven option, but we also need something that matches what Ops usually work with. We need to figure out how to build this package and deploy it.

We should talk to ops about it. We could go with deb packaging but it might be simpler to deploy using git-fat.

Converting services to Debian-model services

WDQS has two Java services, Blazegraph and Updater, which right now are run by manual scripts, with no log rotation, no watchdog/restart, etc. This needs to change so that they become standard supported services. The locations of the database, logs, configs, etc. should also be reviewed and possibly changed to match standards.

I (manybubbles) see two options:

  1. Use systemd, which I believe has facilities to take stdout and stderr and dump them into nicely rotated log files.
  2. Use the standard Java loggers and enable something like a size-based rolling policy.

Whatever ops wants, though I suspect it'll be systemd.

Preparing puppet scripts to ops standards

After the above is finished, the puppet scripts now living in a private repo should be updated and fixed according to ops standards and included in WMF's standard puppet.

This should probably be assigned to someone from ops. It'd be way, way faster for them than for the Discovery team.

Automated initial loading from dumps

Currently the only way to load a dump into the query service is to follow a manual procedure. We may want a more automated way of doing it.

Maybe build it into the updater? If it can't find the load then it does the download, etc.? That'd be super duper convenient for external users. We already have a script to do it, so maybe just have puppet run the script if it can't see any data or something? You just have to make the script idempotent, or write a check_if_required-style script. Not super complex. It'd be interesting to test this using Vagrant, I think. We do this with Cirrus in Vagrant and that works, at least.
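The "only load if there is no data yet" idea can be sketched roughly as below. This is only an illustration: the journal path, the function names and the "empty or missing file means not loaded" rule are all assumptions, not how the existing loading script works.

```python
from pathlib import Path
from typing import Callable


def needs_initial_load(journal: Path) -> bool:
    """Assumed rule: no journal file, or an empty one, means nothing was loaded."""
    return not journal.exists() or journal.stat().st_size == 0


def ensure_loaded(journal: Path, load: Callable[[], None]) -> bool:
    """Run the (hypothetical) load step only when required.

    Returns True if a load was performed, False if it was skipped,
    so calling this repeatedly is safe (idempotent).
    """
    if needs_initial_load(journal):
        load()
        return True
    return False
```

Puppet (or the updater itself) could then call something like ensure_loaded on every run without re-importing the dump each time.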

Automated version upgrade

Currently there is no way to move from one version of the service to another automatically: one needs to install the new version, manually shut down the old one and transfer the DB manually. We need to make it automatic.

On the other hand, Elasticsearch upgrades also require a manual shutdown and restart, and that is fine, even good. They don't require any manual fiddling with the files, though. We should make the updater restart itself automatically on upgrade. Certainly if it's deb packaged.

Backup/disaster recovery story

There is currently no recovery option except "set up a completely new server and reload from scratch". We need to review this and decide if we want to have better options.

External hardening

The service should be properly firewalled, should not be modifiable from outside (i.e. SPARQL UPDATE requests should be blocked), and should not be able to issue SERVICE requests to outside sources. An external user should not be able to modify the data on the service or cause the service to call out to external processes or over the network.
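As a rough illustration of the rules above, a request filter in front of the SPARQL endpoint might look like the sketch below. The function name and the "GET with a query parameter only" policy are assumptions. The keyword scan is deliberately naive (it also rejects queries that merely mention SERVICE inside a string literal) and is no substitute for firewalling or for disabling these features in the store itself.

```python
import re

# Naive detector for a SPARQL SERVICE clause anywhere in the query text.
SERVICE_CLAUSE = re.compile(r"\bSERVICE\b", re.IGNORECASE)


def is_request_allowed(method: str, params: dict) -> bool:
    """Return True only for read-only queries without federation."""
    # The SPARQL 1.1 Protocol sends updates in an "update" parameter.
    if "update" in params:
        return False
    # Assumed policy: public traffic may only use GET with a query.
    if method.upper() != "GET" or "query" not in params:
        return False
    # Block anything that looks like a call out to another endpoint.
    if SERVICE_CLAUSE.search(params["query"]):
        return False
    return True
```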

Internal hardening

A user of the service should not be able to consume more than a defined share of the resources and cause DoS for other users. Queries should be limited by time and memory and should time out or abort when the limits are reached.
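The "abort when the limit is reached" behaviour can be sketched as a per-query budget. This is a cooperative illustration only: the names are made up, the row cap stands in for a memory limit, and real enforcement would have to happen inside the query engine or a proxy in front of it.

```python
import time
from typing import Callable, Iterable, Iterator


class QueryLimitExceeded(Exception):
    """Raised when a query runs past its time or result budget."""


def limited(results: Iterable, max_seconds: float, max_rows: int,
            clock: Callable[[], float] = time.monotonic) -> Iterator:
    """Yield rows from `results` until a budget is exhausted, then abort."""
    deadline = clock() + max_seconds
    for count, row in enumerate(results, start=1):
        if clock() > deadline:
            raise QueryLimitExceeded("time limit reached")
        if count > max_rows:
            raise QueryLimitExceeded("result limit reached")
        yield row
```

The clock is injectable so the limits can be tested without actually waiting.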

Security review completed

We need to complete the security review of the WDQS setup. See: https://phabricator.wikimedia.org/T90115

Size, request & obtain hardware

Determine which hardware we need, request it from ops and set it up.

Monitoring

Devise performance monitoring criteria

Devise a set of metrics that we need to collect from the running service in order to monitor its performance.

Set up service alerts

Set up a system that monitors the entry points for Blazegraph and the GUI and the health of the Updater service, and alerts when any of them goes down.
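A minimal sketch of such a check is below, assuming each monitored component can be probed with a simple callable. The helper names are hypothetical and any URLs passed in are placeholders. In practice this would more likely be a check in whatever alerting system ops already runs.

```python
import urllib.request
from typing import Callable, Dict, List


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Probe an HTTP entry point; any error or non-200 counts as down."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # URLError and HTTPError are subclasses of OSError
        return False


def find_down(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every probe and return the names of components that failed."""
    down = []
    for name, probe in checks.items():
        try:
            healthy = probe()
        except Exception:
            healthy = False  # a crashing probe is treated as "down"
        if not healthy:
            down.append(name)
    return down
```

The list returned by find_down is what would be handed to the alerting side, for example one probe each for Blazegraph, the GUI and the Updater.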

Connect to performance monitoring services

Create scripts to measure the metrics described above and send them to Graphite or another metrics collection tool.
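If Graphite is the target, its plaintext protocol takes one "path value timestamp" line per data point over TCP (port 2003 by default). A sketch of a sender follows. The metric name used in the usage note and the host are placeholders, not agreed names.

```python
import socket
import time
from typing import Iterable, Optional


def format_graphite_line(path: str, value: float,
                         timestamp: Optional[float] = None) -> str:
    """Build one line of Graphite's plaintext protocol."""
    if any(ch.isspace() for ch in path):
        raise ValueError("metric path must not contain whitespace")
    if timestamp is None:
        timestamp = time.time()
    return f"{path} {value} {int(timestamp)}\n"


def send_lines(lines: Iterable[str], host: str, port: int = 2003,
               timeout: float = 5.0) -> None:
    """Send already formatted lines to a Graphite (carbon) listener."""
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)
```

For example, format_graphite_line("wdqs.updater.lag_seconds", 12.5) would produce one data point stamped with the current time (the metric name is made up).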

Connect to analytic log collection services

Set up log collection and connect it to the existing analytics systems.

Features

Negative dates handling

Right now year 0 and negative dates are not handled consistently by WDQS, because date calculations do not use custom logic. We need to fix that. See: https://phabricator.wikimedia.org/T94539
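One reason such dates are easy to get wrong is that two conventions exist: astronomical numbering (as in ISO 8601), where year 0 is 1 BCE, and the historical convention with no year 0, where -1 is 1 BCE. The sketch below only illustrates that difference. Which convention WDQS should settle on is not decided here, and the function names are made up.

```python
def historical_to_astronomical(year: int) -> int:
    """Convert a no-year-zero year (-1 = 1 BCE) to astronomical (0 = 1 BCE)."""
    if year == 0:
        raise ValueError("there is no year 0 in the historical convention")
    return year if year > 0 else year + 1


def years_between(start: int, end: int) -> int:
    """Difference in years between two historical-convention years."""
    return historical_to_astronomical(end) - historical_to_astronomical(start)
```

Plain subtraction would say that 1 BCE (-1) and 1 CE (1) are two years apart, while years_between(-1, 1) gives one.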

Redirects handling

Some entities are redirects to other entities. Semantically, the ID of a redirect and the ID of its target should be completely interchangeable. This currently does not work. See: https://phabricator.wikimedia.org/T96490
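"Interchangeable" here roughly means that every ID is normalized to its final target before it is used. A sketch, assuming the redirects are available as a simple mapping (the IDs in the usage note are made up, and this is not how the service stores redirects):

```python
from typing import Dict


def resolve(entity_id: str, redirects: Dict[str, str]) -> str:
    """Follow redirects until a non-redirect entity ID is reached."""
    seen = set()
    while entity_id in redirects:
        if entity_id in seen:
            raise ValueError(f"redirect loop involving {entity_id}")
        seen.add(entity_id)
        entity_id = redirects[entity_id]
    return entity_id
```

With redirects = {"Q100": "Q200", "Q200": "Q300"}, both resolve("Q100", redirects) and resolve("Q300", redirects) end up at "Q300", so a query using either ID would address the same entity.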

Geocoordinates handling

Right now we store geographic data but are unable to do any geographic searches at all. We need at least to be able to compute the distance between two points, and ideally to have some index that allows us to do geographic searches.
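For the "distance between two points" part, a common choice is the haversine great-circle formula. The sketch below assumes a spherical Earth with a mean radius of 6371 km. It is only an illustration of the calculation, not a proposal for how it would be exposed in the service.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to guard against rounding slightly above 1.0.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))
```

A per-pair function like this does not help with "find everything within N km", which is why an index is listed separately above.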

Labels handling

To obtain the label for an item, users currently have to write cumbersome SPARQL queries that are easy to get wrong. We should define a custom function that produces labels in the preferred language, with fallback. See: https://phabricator.wikimedia.org/T97079
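The fallback logic itself is simple. Here is a sketch of what such a function would compute, written outside SPARQL purely for illustration (the function name, the language chain and the sample labels are made up):

```python
from typing import Dict, Optional, Sequence


def pick_label(labels: Dict[str, str],
               preferred: Sequence[str]) -> Optional[str]:
    """Return the label in the first preferred language that has one."""
    for language in preferred:
        if language in labels:
            return labels[language]
    return None  # no label in any of the requested languages
```

For labels {"en": "Earth", "de": "Erde"} and a preference of ["fr", "de", "en"], this picks "Erde" because there is no French label.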

User-facing documentation

We may want to write a better service description and organize the existing documentation into a consistent documentation package that lets users quickly get up to speed with the service.