Wikimedia Developer Summit/2016/T114019/Minutes

Intro

 * Session name: Dumps 2.0 for realz (planning/architecture)
 * Meeting goal: Make headway on the question: "what should the xml/sql/other dumps infrastructure look like in order to meet current/future user needs, and how can we get there?"
 * Meeting style: Depending on participants, either Problem-solving or Strawman
 * Phabricator task link: https://phabricator.wikimedia.org/T114019

Topics for discussion

 * use cases for the dumps, known and desired
 * where we currently fall short or are expected to fall short in the future
 * an ideal architecture for dumps that would address the main issues would look like... what?

Hoo discussion notes

 * wants an API that lets users find out which dumps are in progress or complete, plus one-click download of the latest dump of a wiki, etc.
 * desired: a way to capture historical data that is not revision/page metadata and would only be present in a full dump from a specific date (e.g. category changes; we don't have a history of those)

Adam Wight discussion notes

 * I don't have anything architecturey yet, but from a risk management perspective, the migration path seems a bit unclear. I tried to imagine a low-investment way to experiment with Dumps 2.0, in.
 * Would symmetric nodes be better than a master scheduler node? How much complexity would that add? This would give us extra fault tolerance and make it simple to run on additional clusters when needed. One easy (but not high-availability) division of labor would be to have a master scheduler assign an entire dump to a worker scheduler, which can then subcontract chunks to other workers in turn. And pay them very little indeed... :S
 * We should use multiple brokers for high availability. Redis Cluster would give us that for free; otherwise, it seems to be supported by Celery somehow.
 * Can someone who has survived a Celery integration please vouch that it's easy and fun?
 * I like JCrespo's idea of using WMF Labs views as our sanitization, mostly because that lets us reuse existing logic. Dumping from a paused Labs slave seems ideal, if the network and filesystem considerations aren't too horrible.  Presumably, the Labs sanitization happens on the main cluster, so we should be able to replicate a slave from there, to avoid crossing network segments?
 * We need some statistics about the time it takes to suppress revisions in oversight situations. Rather than risk mirroring these bad changes, we might want to delay incremental (and full) dumps by a day or two. I guess that would mean dumping from an up-to-date, sanitized database, but daily incrementals, for example, would actually contain only revisions between 1 and 2 days old (2 days > age > 1 day).
 * The discussion about dump formats is really interesting, and IMO should take place independently of this framework rewrite. A lot of the dump consumption use cases sound badly broken; we should collect those stories and identify pain points. I agree with apergos that a diversity of formats is great, but at the same time we should choose a canonical set and then put energy into improving that to cover 90% of potential uses, include indexes and stuff, and provide reference consumer libraries.
 * gnosygnu is really onto something: historical dumps should only change when the formats are updated.
 * Just a tiny detail: I think it's important that our runner framework interfaces with chunk processing jobs at the command line, to minimize the amount of glue we need. Jobs will be written in multiple languages; let's not mess with bindings. On that note, the object store interface should be something simple to access from any language, maybe Redis?
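
Two of the points above (delaying daily incrementals so that they only contain revisions between 1 and 2 days old, and having the runner talk to chunk jobs purely at the command line) could be sketched roughly as follows. This is only an illustration under assumptions: the names `incremental_window`, `run_chunk_job` and `ECHO_JOB` are hypothetical, the one-day delay is just the lower end of the "day or two" mentioned, and the stand-in job merely echoes its arguments.

```python
import subprocess
import sys
from datetime import datetime, timedelta, timezone


def incremental_window(run_time, delay_days=1):
    """Return (start, end) for a daily incremental that lags by delay_days.

    With delay_days=1 the window covers revisions whose age at run_time
    is between 1 and 2 days (hypothetical helper, not existing code).
    """
    end = run_time - timedelta(days=delay_days)
    start = end - timedelta(days=1)
    return start, end


def run_chunk_job(argv, start, end):
    """Run a chunk job as a child process and return its stdout.

    The window is passed as two plain ISO 8601 arguments, so the job can
    be written in any language without bindings.
    """
    cmd = list(argv) + [start.isoformat(), end.isoformat()]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()


# Stand-in job (assumption): any executable that reads two arguments would do.
ECHO_JOB = [sys.executable, "-c", "import sys; print(sys.argv[1], sys.argv[2])"]

if __name__ == "__main__":
    now = datetime(2016, 1, 6, tzinfo=timezone.utc)
    start, end = incremental_window(now)
    print(run_chunk_job(ECHO_JOB, start, end))
```

Keeping the interface to argv and stdout means the runner needs no knowledge of the job's implementation language. The real framework would presumably also need retries, exit-code handling and an object store for results, none of which is shown here.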