Wikimedia Developer Summit/2016/T114019/Minutes/Questions

Questions to be resolved for moving forward on dumps 2.0.

I've done this in the robla style: list the questions without answers, even though some of these questions may have answers or the beginnings of answers already in the session notes.

These are the questions I think we need to answer in the first go-around. As we dig into them, some of the other topics discussed in the dev summit session will come up, and we can incorporate that information here.

In theory each of these questions would be converted into a task to be answered.

PLEASE ADD/EDIT/ARGUE FOR REMOVAL of questions in this list.

These are ordered not by priority but by the process, from input to output, so to speak. Feel free to tweak the list order.

How can we be sure that all projects get dumped in a timely fashion without unfair waits in a queue?
What already available software can we build on for a job scheduler?
How can we break up page content dumps into quick-running pieces, without manual intervention?
What sort of internal storage do we want to use for all dumps (filesystem and format), so that production of final output(s) is fast?
How can we produce a project dump which reflects a consistent state of the databases?
How can we avoid retrieval, rereading and rewriting of old revisions during project content dumps production?
How can we filter out private information from the projects' SQL tables so they can be dumped?
How can we get and use a stream of MediaWiki events for facilitating incrementals of the project content dumps?
How can we handle old events when they arrive in the stream?
How can we make sure that out-of-band changes to the databases are reflected in the dumps?
When changes are made to the database schemas, how can we easily update dump processing to handle this?
How can we provide incrementals for SQL table dumps from the projects?
What sort of user-available monitoring of dump production do we want, and how can we provide it?
Some folks stream the dump output into a script for processing, others process it in parallel by feeding it to Hadoop, and still others shovel it into a database; what output formats and converters can we provide to support all these cases?
Should we consider how to provide select subsets of a project dump dependent on user request?
What should we do about PSI that ends up in a project dump and is later removed from the source?
How can we provide good bandwidth for downloading to all comers without connection/bandwidth caps and without impacting dumps performance?
How can we easily incorporate dumps other than the standard project metadata/content dumps into our downloadable datasets, with the same monitoring and scheduling? Examples: Flow, Wikidata dumps, HTML dumps.
How can we speed up the Wikidata entity dumps?
How can we provide support for OpenZIM file production?
How much of our dumps infrastructure must be easily useable by third parties?

And finally:

How can the new system be developed in a timely way without adversely impacting maintenance and support of the current system?