Talk:Wikimedia Developer Summit/2016/T114019/Minutes/Questions

I tried to break the work into separate streams while leaving your order intact. Do you think this is a helpful way to look at the project?

Platform

  • What existing software can we build on for a job scheduler?
  • What sort of internal storage do we want to use for all dumps (filesystem and format), so that production of final output(s) is fast?
  • How can we obtain and use a stream of MediaWiki events to facilitate incremental project content dumps? (A sketch of one approach follows this list.)
  • What sort of user-facing monitoring of dump production do we want, and how can we provide it?
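
To make the event-stream question above concrete, here is a minimal sketch that tails edit events from a server-sent-events feed in the style of Wikimedia's EventStreams. The endpoint URL and event field names are illustrative assumptions, not a decided design; whatever feed we actually settle on may look different.

```python
# Minimal sketch, assuming an SSE feed shaped like Wikimedia EventStreams;
# the URL and event fields below are illustrative, not a decided design.
import json
import requests

STREAM_URL = "https://stream.wikimedia.org/v2/stream/recentchange"  # assumed feed

def follow_edits(wiki="enwiki"):
    """Yield (title, new_rev_id) for each edit on one wiki, as candidates
    for the next incremental content dump."""
    with requests.get(STREAM_URL, stream=True) as resp:
        for line in resp.iter_lines():
            if not line.startswith(b"data: "):
                continue  # skip SSE comments, event names, and heartbeats
            event = json.loads(line[len(b"data: "):])
            if event.get("wiki") == wiki and event.get("type") == "edit":
                yield event["title"], event["revision"]["new"]
```

Each yielded revision id could then be fetched once and appended to the internal store, instead of re-reading all content at dump time.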

Performance

  • How can we be sure that all projects get dumped in a timely fashion without unfair waits in a queue?
  • How can we break page content dumps into quick-running pieces without manual intervention? (See the sketch after this list.)
  • How can we avoid retrieving, rereading, and rewriting old revisions during production of project content dumps?
  • How can we provide good download bandwidth to all comers, without connection or bandwidth caps and without impacting dump production performance?
  • How can we speed up the Wikidata entity dumps?
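
As a sketch of the "quick-running pieces" idea above: split the page-id space into fixed-size ranges, so each piece is an independent, retryable job. The chunk size is just an assumed tuning knob.

```python
# Minimal sketch: carve a wiki's page-id space into fixed-size ranges so each
# content-dump piece runs quickly and a failed piece can be rerun by itself.
# max_page_id would come from the wiki's page table; chunk_size is a knob.
def page_ranges(max_page_id, chunk_size=100_000):
    """Yield inclusive (start, end) page-id ranges covering the whole wiki."""
    start = 1
    while start <= max_page_id:
        end = min(start + chunk_size - 1, max_page_id)
        yield start, end
        start = end + 1
```

Equal-width ranges give uneven work, since revision counts per page vary wildly; a production splitter would more likely weight ranges by revision count. The point here is only that pieces become independently schedulable and retryable.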

Accuracy

  • How can we produce a project dump that reflects a consistent state of the databases?
  • How can we make sure that out-of-band changes to the databases are reflected in the dumps?
  • How can we filter private information out of the projects' SQL tables so that they can be dumped? (A column-allowlist sketch follows this list.)
  • How can we handle old (late-arriving) events in the stream?
  • When changes are made to the database schemas, how can we easily update dump processing to handle them?
  • What should we do about sensitive personal information (SPI) that ends up in a project dump and is later removed from the source?
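
For the private-information question above, one fail-closed approach is a per-table column allowlist: only explicitly reviewed columns are ever dumped, so a newly added (possibly private) column is excluded by default. The entries below are illustrative, not a vetted policy.

```python
# Minimal sketch of column-allowlist filtering; table/column entries are
# illustrative examples, not a reviewed privacy policy.
PUBLIC_COLUMNS = {
    "page": ["page_id", "page_namespace", "page_title", "page_is_redirect"],
    "category": ["cat_id", "cat_title", "cat_pages", "cat_subcats"],
}

def dump_query(table):
    """Build a SELECT that touches only reviewed-public columns."""
    cols = PUBLIC_COLUMNS.get(table)
    if cols is None:
        raise ValueError("no reviewed column allowlist for table %r" % table)
    return "SELECT %s FROM %s" % (", ".join(cols), table)
```

Failing closed also helps with the schema-change question: a schema change can add columns without silently leaking them into the dumps before anyone reviews them.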

Flexibility

  • How can we provide incrementals for the SQL table dumps from the projects?
  • Some folks stream the output into a script for processing, others feed it to Hadoop for parallel processing, and still others shovel it into a database; what output formats and converters can we provide to support all of these cases? (A converter sketch follows this list.)
  • Should we consider how to provide selected subsets of a project dump on user request?
  • How can we easily incorporate dumps other than the standard project metadata/content dumps (e.g. HTML dumps) into our downloadable datasets, with the same monitoring and scheduling?
  • How can we provide support for OpenZIM file production?
  • How much of our dumps infrastructure must be easily usable by third parties?
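
As one example of the converter idea above, here is a minimal sketch that stream-converts a pages XML dump into JSON Lines, one page per line, which suits shell pipelines, Hadoop jobs, and bulk database loads alike. The export schema namespace version varies between dumps, and error handling is elided.

```python
# Minimal sketch: stream a pages XML dump to JSON Lines. The export schema
# namespace version below is an example; match it to the actual dump.
import json
import sys
import xml.etree.ElementTree as ET

NS = "{http://www.mediawiki.org/xml/export-0.10/}"

def xml_to_jsonl(xml_file, out=sys.stdout):
    for _, elem in ET.iterparse(xml_file, events=("end",)):
        if elem.tag == NS + "page":
            rev = elem.find(NS + "revision")
            out.write(json.dumps({
                "title": elem.findtext(NS + "title"),
                "rev_id": rev.findtext(NS + "id") if rev is not None else None,
                "text": rev.findtext(NS + "text") if rev is not None else None,
            }) + "\n")
            elem.clear()  # release the subtree so memory use stays flat
```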

And finally:

  • How can the new system be developed in a timely way without adversely impacting maintenance and support of the current system?

Adamw (talk) 06:58, 1 March 2016 (UTC)

Sure. I'm trying to figure out how we would break these out into Phabricator tasks. Let me make an umbrella task for each of these sections and split out the questions into blocking tasks for each one. I'll announce it on the main ticket too. Then if people want to change/add/rearrange questions, they can do it right in Phabricator. -- ArielGlenn (talk) 21:34, 1 March 2016 (UTC)