Wikimedia Developer Summit/2017/A Monitoring API for the Dumps

Session Overview
Title: A Monitoring API for the Dumps

Day & Time: Tue Jan 10 1:10p

Room: Chapel Hill

Phabricator Task Link: https://phabricator.wikimedia.org/T147177

Facilitator(s): User:ArielGlenn

Note-Taker(s): Who was our wonderful note-taker? Please add your name and get credit!

Remote Moderator:

Advocate:

Detailed Summary

Purpose
Discuss monitoring for the current dumps and for the "Dumps rewrite" project.

Agenda

 * 15 minutes: Introduction and description of current dumps monitoring
 * 35 minutes: Brainstorm ideas for improvement
 * 15 minutes: Q&A on related dump issues
 * 5 minutes: Wrap-up

Style
Problem-solving/Brainstorm

Discussion Topics

 * There is really no API right now, though scattered data is available for folks who can scrape the HTML, parse one of the dump text files, check RSS feeds, etc.
 * Human readability of all dumps-related info is very important.
 * Related to the above, descriptions of dump content and information in multiple languages would be good to have.
 * How do we reach more dumps users? Outreach surveys are hard.
 * Several ideas for the dumps rewrite were proposed, apart from the monitoring discussion.

Chronology
These notes are copied here because etherpads are not meant to be used for permanent storage. (That was a message from your friendly ops team representative. Thank you.)

What are dumps:
 * BIG, slow to generate and to process
 * there are also so-called incremental dumps, not necessarily widely advertised
 * discussion happens on the xmldatadumps-l mailing list

What will the dumps become
 * Somehow be smaller?
 * Generate them more quickly
 * Break them up
 * Parallel jobs generating small files
 * Rerunnable

What does the current API look like?
 * There is none, except a flat file: "dumpruninfo.txt"
 * People parse the flat file because there is no alternative
 * an RSS feed carries some dump info
 * some stats appear in the main index.html
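
Since the flat file is all there is, downstream consumers end up writing ad-hoc parsers for it. A minimal sketch of such a parser, assuming the file uses semicolon-separated `key:value` fields per line (the real file's exact fields and layout may differ):

```python
def parse_dumpruninfo(text):
    """Parse dumpruninfo.txt-style status lines into a list of dicts.

    Assumes each line looks like
    'name:articlesdump; status:done; updated:2017-01-10 01:02:03';
    this format is an illustration, not a guaranteed schema.
    """
    jobs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        fields = {}
        for part in line.split(";"):
            # partition on the first colon so timestamps keep their colons
            key, _, value = part.strip().partition(":")
            fields[key.strip()] = value.strip()
        jobs.append(fields)
    return jobs

sample = ("name:xmlstubsdump; status:done; updated:2017-01-10 01:02:03\n"
          "name:articlesdump; status:in-progress; updated:2017-01-10 02:00:00")
for job in parse_dumpruninfo(sample):
    print(job["name"], job["status"])
```

The brittleness of this kind of scraping is exactly why a structured, versioned status format came up in the discussion below.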

What would people like to see?
 * not getting throttled
 * smaller is better
 * XML is easy to read
 * maybe statistics about dump usage, popularity, etc. could be used internally by ops
 * Wikidata has some support for such stats (apache logs)
 * multilingual interface with descriptions, but still machine readable
 * no mandate to keep an archive of all versions
 * consider PGP-signing dumps
 * dumps that span multiple wikis (e.g. Wikidata, and some other wiki) or selected subsets in vari
 * maybe on-demand or self-service dump?
 * compatible formats to mix-and-match
 * a different format for dumps, like a git repo? to facilitate mix-n-match
 * replace or add to dumpruninfo.txt with XML or JSON
 * Swift as a possible object store test? (in esams)
 * outreach surveys? survey on the dumps page?
 * a stable link to the LATEST dump
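
One of the wishlist items above is replacing or supplementing dumpruninfo.txt with XML or JSON. A hypothetical sketch of what a JSON status feed could look like; every field name here is illustrative only, not a committed schema:

```python
import json

# Hypothetical JSON equivalent of dumpruninfo.txt. The "wiki", "run",
# and "jobs" fields are made up for illustration; any real schema
# would come out of the mailing-list discussion.
run_status = {
    "wiki": "enwiki",
    "run": "20170101",
    "jobs": [
        {"name": "xmlstubsdump", "status": "done",
         "updated": "2017-01-10 01:02:03"},
        {"name": "articlesdump", "status": "in-progress",
         "updated": "2017-01-10 02:00:00"},
    ],
}

encoded = json.dumps(run_status, indent=2)
decoded = json.loads(encoded)
print(decoded["jobs"][1]["status"])
```

A structured format like this would let consumers stop scraping HTML or parsing the flat file, while staying human-readable.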

Contact info/followup:
 * email: ariel@wikimedia.org
 * irc: apergos
 * mailing list for discussion and announcements: xmldatadumps-l
 * phab workboard for dumps rewrite: Dumps-Rewrite
 * phab ticket for this session: T147177

Action Items
 * summarize the list of desired API features
 * solicit more input from folks on the mailing lists
 * add a JSON-formatted equivalent of the current dump run info and advertise it
 * follow up on T155060 (stats for downloaders)
 * talk with Bryan Davis about outreach survey to dumps users (labs? stats100* people? other?)
 * TBD