Wikimedia Developer Summit/2017/A Monitoring API for the Dumps
Session Overview
Title: A Monitoring API for the Dumps
Day & Time: Tue Jan 10 1:10p
Room: Chapel Hill
Phabricator Task Link: https://phabricator.wikimedia.org/T147177
Facilitator(s): User:ArielGlenn
Note-Taker(s): Who was our wonderful note-taker? Please add your name and get credit!
Remote Moderator:
Advocate:
Session Summary
Detailed Summary
Purpose
Discuss monitoring for the dumps and the planned "dumps rewrite".
Agenda
- 15 minutes: Introduction and description of current dumps monitoring
- 35 minutes: Brainstorm ideas for improvement
- 15 minutes: Q&A on related dump issues
- 5 minutes: Wrap-up
Style
Problem-solving/Brainstorm
Discussion Topics
- There is really no API now, though scattered data is available for people who can scrape the HTML, parse one of the dump text files, check RSS feeds, etc.
- Human readability of all dumps-related info is very important.
- Related to the above, descriptions of dump content and information in multiple languages would be good to have.
- How do we reach more dumps users? Outreach surveys are hard.
- Several ideas for the dumps rewrite were proposed, apart from the monitoring discussion.
Chronology
I have added this here because ETHERPADS ARE NOT SUPPOSED TO BE USED FOR PERMANENT STORAGE. The preceding was a message from your friendly ops team representative. Thank you.
What are dumps:
- BIG, slow to generate and process
- there are so-called INCREMENTAL dumps, which are not widely advertised
- discussed on the xmldatadumps-l mailing list
What will the dumps become:
- Somehow be smaller?
- Generate them more quickly
- Break them up
- Parallel jobs generating small files
- Rerunnable
What does the current API look like?
- There is none, except a flat file: "dumpruninfo.txt"
- People parse the flat file because there is no alternative (see the sketch after this list)
- an RSS feed with some dump info
- Some stats in "main index.html"
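To illustrate what consumers have to do today, below is a minimal sketch of fetching and parsing dumpruninfo.txt. It assumes (rather than documents) lines shaped like "name:xmlstubsdump; status:done; updated:2017-01-09 10:12:14", and the URL is likewise an assumed example.

    # Minimal sketch: parse a per-wiki dumpruninfo.txt flat file.
    # Assumes (not guaranteed) lines of the form:
    #   name:<job>; status:<status>; updated:<YYYY-MM-DD HH:MM:SS>
    from urllib.request import urlopen

    # Assumed location; the actual path may differ per wiki and run.
    URL = "https://dumps.wikimedia.org/enwiki/latest/dumpruninfo.txt"

    def parse_dumpruninfo(text):
        """Return {job_name: {"status": ..., "updated": ...}}."""
        jobs = {}
        for line in text.splitlines():
            if not line.strip():
                continue
            fields = {}
            for chunk in line.split(";"):
                # split on the FIRST colon only, since the "updated"
                # timestamp itself contains colons
                key, _, value = chunk.strip().partition(":")
                fields[key] = value.strip()
            name = fields.pop("name", None)
            if name:
                jobs[name] = fields
        return jobs

    with urlopen(URL) as resp:
        info = parse_dumpruninfo(resp.read().decode("utf-8"))
    for job, meta in sorted(info.items()):
        print(job, meta.get("status"), meta.get("updated"))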
What would people like to see?
- not getting throttled
- smaller is better
- XML is easy to read
- maybe statistics about dump usage, popularity, etc., which could be used internally by ops
- Wikidata has some support for such stats (Apache logs)
- multilingual interface with descriptions, but still machine readable
- no mandate to keep an archive of all versions
- consider PGP-signing the dumps
- dumps that span multiple wikis (e.g. Wikidata plus some other wiki) or selected subsets in various combinations
- maybe on-demand or self-service dumps?
- compatible formats to mix and match
- a different format for dumps (like a git repo?) to facilitate mix-and-match
- replace or supplement dumpruninfo.txt with XML or JSON (see the sketch after this list)
- Swift in esams (the Amsterdam data center) as a possible object-store test?
- outreach surveys? a survey on the dumps page?
- a stable link for the LATEST run
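To make the JSON idea above concrete, here is a small sketch of what a machine-readable equivalent of dumpruninfo.txt could look like; the shape, field names, and values are assumptions for discussion, not a settled format.

    # Hypothetical sketch of a dumpruninfo.json equivalent; the shape
    # and field names are assumptions for discussion, not a spec.
    import json

    run_info = {
        "version": "0.1",       # format version, to allow future changes
        "wiki": "enwiki",       # which wiki this run covers
        "rundate": "20170109",  # date of the dump run
        "jobs": {
            "xmlstubsdump": {"status": "done", "updated": "2017-01-09 10:12:14"},
            "articlesdump": {"status": "in-progress", "updated": "2017-01-10 02:45:00"},
        },
    }
    print(json.dumps(run_info, indent=2))

A versioned, keyed structure like this stays human-readable while sparing consumers the ad-hoc line parsing shown earlier; the same data could also be rendered as XML for those who prefer it.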
Contact info/followup:
- email: ariel@wikimedia.org
- irc: apergos
- mailing list for discussion and announcements: xmldatadumps-l
- phab workboard for dumps rewrite: Dumps-Rewrite
- phab ticket for this session: T147177
Action Items
- summarize the list of desired API features
- solicit more input from folks on the mailing lists
- add a JSON-formatted equivalent of the current dump run info and advertise it
- follow up on T155060 (stats for downloaders)
- talk with Bryan Davis about an outreach survey of dumps users (labs? stats100* people? other?)
- TBD