Talk:Wikimedia Developer Summit/2016/T114019/Minutes/Questions

I tried to break the work into separate streams while leaving your order intact. Do you think this is a helpful way to look at the project?

Platform
 * What existing software can we build on for a job scheduler?
 * What sort of internal storage do we want to use for all dumps (filesystem and format), so that production of final output(s) is fast?
 * How can we get and use a stream of MediaWiki events to facilitate incrementals of the project content dumps?
 * What sort of user-available monitoring of dump production do we want, and how can we provide it?

Performance
 * How can we be sure that all projects get dumped in a timely fashion without unfair waits in a queue?
 * How can we break up page content dumps into quick-running pieces, without manual intervention?
 * How can we avoid retrieving, rereading and rewriting old revisions during production of project content dumps?
 * How can we provide good bandwidth for downloading to all comers without connection/bandwidth caps and without impacting dumps performance?
 * How can we speed up the Wikidata entity dumps?

Accuracy
 * How can we produce a project dump which reflects a consistent state of the databases?
 * How can we make sure that out-of-band changes to the databases are reflected in the dumps?
 * How can we filter out private information from the projects' SQL tables so they can be dumped?
 * How can we handle old events when they arrive in the stream?
 * When changes are made to the database schemas, how can we easily update dump processing to handle this?
 * What should we do about sensitive personal information (SPI) that ends up in a project dump and is later removed from the source?

Flexibility
 * How can we provide incrementals for SQL table dumps from the projects?
 * Some folks stream their output into a script for processing, others process it in parallel by feeding it to Hadoop, still others shovel it into a database; what output formats and converters can we provide to support all these cases?
 * Should we consider how to provide selected subsets of a project dump on user request?
 * How can we easily incorporate dumps other than the standard project metadata/content dumps into our downloadable datasets, with the same monitoring and scheduling (e.g. HTML dumps)?
 * How can we provide support for OpenZIM file production?
 * How much of our dumps infrastructure must be easily usable by third parties?

And finally:
 * How can the new system be developed in a timely way without adversely impacting maintenance and support of the current system?

Adamw (talk) 06:58, 1 March 2016 (UTC)