Wikimedia Developer Summit/2017/A Monitoring API for the Dumps

Session Overview

Title: A Monitoring API for the Dumps

Day & Time: Tue Jan 10 1:10p

Room: Chapel Hill

Phabricator Task Link: https://phabricator.wikimedia.org/T147177

Facilitator(s): User:ArielGlenn

Note-Taker(s): Who was our wonderful note-taker? Please add your name and get credit!

Remote Moderator:

Advocate:

Session Summary

Detailed Summary

Purpose

Talk about monitoring for the current dumps and for the planned "Dumps rewrite".

Agenda

  • 15 minutes: Introduction and description of current dumps monitoring
  • 35 minutes: Brainstorm ideas for improvement
  • 15 minutes: Q&A on related dump issues
  • 5 minutes: Wrap-up

Style

Problem-solving/Brainstorm

Discussion Topics

  • There's really no API right now, though scattered data is available for people who can scrape the HTML, parse one of the dump text files, check RSS feeds, etc.
  • Human readability of all dumps-related info is very important.
  • Related to the above, descriptions of dump content and information in multiple languages would be good to have.
  • How do we reach more dumps users? Outreach surveys are hard.
  • Several ideas for the dumps rewrite were proposed, apart from the monitoring discussion.

Chronology

I have added this here because ETHERPADS ARE NOT SUPPOSED TO BE USED FOR PERMANENT STORAGE. The preceding was a message from your friendly ops team representative. Thank you.

What are dumps:

  • BIG, slow to generate and process
  • there are so-called INCREMENTAL dumps, not necessarily advertised
  • discussion happens on the xmldatadumps-l mailing list

What will the dumps become?

  • Somehow be smaller?
  • Generate them more quickly
  • Break them up
  • Parallel jobs generating small files
  • Rerunnable

What does the current API look like?

  • There is none, except a flat file: "dumpruninfo.txt"
  • People are parsing the flat file because there is no other option (see the parsing sketch after this list)
  • RSS feed with some dump info
  • Some stats in "main index.html"
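
A minimal sketch of what scraping the current flat file involves, to make the above concrete. The URL layout and the "name:...; status:...; updated:..." field format shown here are my reading of a typical dumpruninfo.txt, and should be verified against a real dump run before relying on them.

    # Hedged sketch: parse dumpruninfo.txt for one wiki and one run date.
    # Assumes lines of the form "name:jobname; status:done; updated:timestamp".
    import urllib.request

    def fetch_run_info(wiki, date, base="https://dumps.wikimedia.org"):
        """Return a dict mapping job name -> {status, updated} for one dump run."""
        url = f"{base}/{wiki}/{date}/dumpruninfo.txt"
        jobs = {}
        with urllib.request.urlopen(url) as resp:
            for line in resp.read().decode("utf-8").splitlines():
                if not line.strip():
                    continue
                # Split "key:value" pairs separated by semicolons; only the first
                # colon separates key from value (timestamps contain colons too).
                fields = {k.strip(): v.strip()
                          for k, v in (item.split(":", 1)
                                       for item in line.split(";") if ":" in item)}
                if "name" in fields:
                    jobs[fields["name"]] = {"status": fields.get("status"),
                                            "updated": fields.get("updated")}
        return jobs

    if __name__ == "__main__":
        # Substitute a real wiki name and run date here.
        for name, info in sorted(fetch_run_info("enwiki", "20170101").items()):
            print(name, info["status"], info["updated"])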

What would people like to see?

  • not getting throttled
  • smaller is better
  • XML is easy to read
  • maybe statistics about dump usage, popularity, etc.; these could also be used internally by ops
  • Wikidata has some support for such stats (Apache logs)
  • multilingual interface with descriptions, but still machine readable
  • no mandate to keep an archive of all versions
  • consider PGP-signing the dumps
  • dumps that span multiple wikis (e.g. Wikidata plus some other wiki) or selected subsets, in various combinations
  • maybe on-demand or self-service dumps?
  • compatible formats to mix-and-match
  • a different format for dumps, like a git repo, to facilitate mix-and-match
  • replace dumpruninfo.txt with, or supplement it with, XML or JSON (see the JSON sketch after this list)
  • Swift in esams??? as a possible object store test?
  • outreach surveys? survey on the dumps page?
  • a stable link for the LATEST
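
To make the "XML or JSON instead of (or alongside) the flat file" idea concrete, here is one possible JSON shape. This is purely illustrative; the field names and structure are assumptions for discussion, not a spec.

    # Illustrative only: what a JSON equivalent of dumpruninfo.txt might look like.
    import json

    example_status = {
        "wiki": "enwiki",
        "run": "20170101",
        "jobs": {
            "xmlstubsdump": {"status": "done", "updated": "2017-01-09 10:23:45"},
            "articlesdump": {"status": "in-progress", "updated": "2017-01-09 11:02:10"},
        },
    }

    print(json.dumps(example_status, indent=2))

A per-job description field in such a structure could later carry the multilingual content descriptions mentioned above while staying machine-readable.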

Contact info/followup:

  • email: ariel@wikimedia.org
  • irc: apergos
  • mailing list for discussion and announcements: xmldatadumps-l
  • phab workboard for dumps rewrite: Dumps-Rewrite
  • phab ticket for this session: T147177

Action Items

  • summarize the list of desired API features
  • solicit more input from folks on the mailing lists
  • add a JSON-formatted equivalent to the current dump run info and advertise it (see the converter sketch after these items)
  • follow up on T155060 (stats for downloaders)
  • talk with Bryan Davis about outreach survey to dumps users (labs? stats100* people? other?)
  • TBD
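
As a sketch of the "JSON-formatted equivalent" action item, the following converts an existing dumpruninfo.txt into the illustrative JSON shape shown earlier. Again, both the flat-file layout it parses and the output field names are assumptions, not a spec.

    # Hedged sketch: convert dumpruninfo.txt (read from stdin) to an illustrative JSON form.
    import json
    import sys

    def runinfo_to_json(text, wiki, run_date):
        jobs = {}
        for line in text.splitlines():
            if not line.strip():
                continue
            # Same assumed "key:value; key:value" layout as the parsing sketch above.
            fields = {k.strip(): v.strip()
                      for k, v in (item.split(":", 1)
                                   for item in line.split(";") if ":" in item)}
            if "name" in fields:
                jobs[fields["name"]] = {"status": fields.get("status"),
                                        "updated": fields.get("updated")}
        return json.dumps({"wiki": wiki, "run": run_date, "jobs": jobs}, indent=2)

    if __name__ == "__main__":
        # usage: python runinfo_to_json.py <wiki> <rundate> < dumpruninfo.txt
        print(runinfo_to_json(sys.stdin.read(), sys.argv[1], sys.argv[2]))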