Wikimedia Developer Summit/2017/A Monitoring API for the Dumps

Session Overview

Title: A Monitoring API for the Dumps

Day & Time: Tue Jan 10 1:10p

Room: Chapel Hill

Phabricator Task Link: https://phabricator.wikimedia.org/T147177

Facilitator(s): User:ArielGlenn

Note-Taker(s): Who was our wonderful note-taker? Please add your name and get credit!

Remote Moderator:

Advocate:

Session Summary

Detailed Summary

Purpose

Talk about monitoring for the current dumps and for the planned "Dumps rewrite".

Agenda

  • 15 minutes: Introduction and description of current dumps monitoring
  • 35 minutes: Brainstorm ideas for improvement
  • 15 minutes: Q&A on related dump issues
  • 5 minutes: Wrap-up

Style

Problem-solving/Brainstorm

Discussion Topics

  • There's really no API right now, though scattered data is available for people who can scrape the HTML, parse one of the dump text files, check RSS feeds, etc.
  • Human readability of all dumps-related info is very important.
  • Related to the above, descriptions of dump content and information in multiple languages would be good to have.
  • How do we reach more dumps users? Outreach surveys are hard.
  • Several ideas for the dumps rewrite were proposed, apart from the monitoring discussion.

Chronology

I have added this here because ETHERPADS ARE NOT SUPPOSED TO BE USED FOR PERMANENT STORAGE. The preceding was a message from your friendly ops team representative. Thank you.

What are dumps:

  • BIG, slow to generate and process
  • there are so-called INCREMENTAL dumps, not necessarily advertised
  • discussion happens on the xmldatadumps-l mailing list

What will the dumps become?

  • Somehow be smaller?
  • Generate them more quickly
  • Break them up
  • Parallel jobs generating small files
  • Rerunnable

What does the current API look like?

  • There is none, except a flat file: "dumpruninfo.txt"
  • People are parsing the flat file because there is no other option (see the parsing sketch after this list)
  • RSS feed with some dump info
  • Some stats in "main index.html"
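
A minimal sketch of what scraping the current flat file involves, to make the above concrete. The URL layout and the "name:...; status:...; updated:..." field format shown here are my reading of a typical dumpruninfo.txt, and should be verified against a real dump run before relying on them.

    # Hedged sketch: parse dumpruninfo.txt for one wiki and one run date.
    # Assumes lines of the form "name:jobname; status:done; updated:timestamp".
    import urllib.request

    def fetch_run_info(wiki, date, base="https://dumps.wikimedia.org"):
        """Return a dict mapping job name -> {status, updated} for one dump run."""
        url = f"{base}/{wiki}/{date}/dumpruninfo.txt"
        jobs = {}
        with urllib.request.urlopen(url) as resp:
            for line in resp.read().decode("utf-8").splitlines():
                if not line.strip():
                    continue
                # Split "key:value" pairs separated by semicolons; only the first
                # colon separates key from value (timestamps contain colons too).
                fields = {k.strip(): v.strip()
                          for k, v in (item.split(":", 1)
                                       for item in line.split(";") if ":" in item)}
                if "name" in fields:
                    jobs[fields["name"]] = {"status": fields.get("status"),
                                            "updated": fields.get("updated")}
        return jobs

    if __name__ == "__main__":
        # Substitute a real wiki name and run date here.
        for name, info in sorted(fetch_run_info("enwiki", "20170101").items()):
            print(name, info["status"], info["updated"])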

What would people like to see?

  • not getting throttled
  • smaller is better
  • XML is easy to read
  • maybe statistics about dump usage, popularity, etc.; these could also be used internally by ops
  • Wikidata has some support for such stats (Apache logs)
  • multilingual interface with descriptions, but still machine readable
  • no mandate to keep an archive of all versions
  • consider PGP-signing the dumps
  • dumps that span multiple wikis (e.g. Wikidata plus some other wiki) or selected subsets, in various combinations
  • maybe on-demand or self-service dumps?
  • compatible formats to mix-and-match
  • a different format for dumps, like a git repo, to facilitate mix-and-match
  • replace dumpruninfo.txt with, or supplement it with, XML or JSON (see the JSON sketch after this list)
  • Swift in esams??? as a possible object store test?
  • outreach surveys? survey on the dumps page?
  • a stable link for the LATEST
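
To make the "XML or JSON instead of (or alongside) the flat file" idea concrete, here is one possible JSON shape. This is purely illustrative; the field names and structure are assumptions for discussion, not a spec.

    # Illustrative only: what a JSON equivalent of dumpruninfo.txt might look like.
    import json

    example_status = {
        "wiki": "enwiki",
        "run": "20170101",
        "jobs": {
            "xmlstubsdump": {"status": "done", "updated": "2017-01-09 10:23:45"},
            "articlesdump": {"status": "in-progress", "updated": "2017-01-09 11:02:10"},
        },
    }

    print(json.dumps(example_status, indent=2))

A per-job description field in such a structure could later carry the multilingual content descriptions mentioned above while staying machine-readable.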

Contact info/followup:

  • email: ariel@wikimedia.org
  • irc: apergos
  • mailing list for discussion and announcements: xmldatadumps-l
  • phab workboard for dumps rewrite: Dumps-Rewrite
  • phab ticket for this session: T147177

Action Items

  • summarize the list of desired API features
  • solicit more input from folks on the mailing lists
  • add a JSON-formatted equivalent to the current dump run info and advertise it (see the converter sketch after these items)
  • follow up on T155060 (stats for downloaders)
  • talk with Bryan Davis about outreach survey to dumps users (labs? stats100* people? other?)
  • TBD
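
As a sketch of the "JSON-formatted equivalent" action item, the following converts an existing dumpruninfo.txt into the illustrative JSON shape shown earlier. Again, both the flat-file layout it parses and the output field names are assumptions, not a spec.

    # Hedged sketch: convert dumpruninfo.txt (read from stdin) to an illustrative JSON form.
    import json
    import sys

    def runinfo_to_json(text, wiki, run_date):
        jobs = {}
        for line in text.splitlines():
            if not line.strip():
                continue
            # Same assumed "key:value; key:value" layout as the parsing sketch above.
            fields = {k.strip(): v.strip()
                      for k, v in (item.split(":", 1)
                                   for item in line.split(";") if ":" in item)}
            if "name" in fields:
                jobs[fields["name"]] = {"status": fields.get("status"),
                                        "updated": fields.get("updated")}
        return json.dumps({"wiki": wiki, "run": run_date, "jobs": jobs}, indent=2)

    if __name__ == "__main__":
        # usage: python runinfo_to_json.py <wiki> <rundate> < dumpruninfo.txt
        print(runinfo_to_json(sys.stdin.read(), sys.argv[1], sys.argv[2]))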