Analytics/Epics/Speed up Report Card Generation

Goals

We've been asked for two changes to how the Monthly Report data is updated:

  1. Finish faster, that is, have the monthly data updated sooner
  2. Update more often, that is, change the reporting cadence

Detailed Tracking Links

Development (Mingle)

Reports and Data Sources

The Report Card has the following reports:

Core Reports

Name | Source
Unique Visitors per Region | comScore + Erik's scripts
Pageviews for all Wikimedia Projects | Webstatscollector output + Erik's scripts
Pageviews to Mobile Site | Webstatscollector output + Erik's scripts
New Editors Per Month for All Wikimedia Projects | Dumps + Erik's scripts
Active Wikimedia Editors for All Wikimedia Projects (5+ edits per month) | Dumps + Erik's scripts

Secondary Reports

Name | Source
Wikimedia Projects Reach by Region | comScore + Erik's scripts
Commons Binaries (log scale) | Dumps + Erik's scripts
Wikipedia Articles | Dumps + Erik's scripts
New Wikipedia Articles per Day | Dumps + Erik's scripts
Wikipedia Edits per Month | Dumps + Erik's scripts
Very Active Editors for All Wikimedia Projects (100+ edits per month) | Dumps + Erik's scripts
Wikivoyage Editors | Dumps + Erik's scripts

Annual Plan YYYY-yy Targets

Name | Source
Mobile Page Views - Projections and Targets | Webstatscollector output + Erik's scripts + Howie's target data
Active Editors (5+) | Dumps + Erik's scripts + Howie's target data + Dan's manual data entry*
Active Uploaders Commons (1+) | ??? + Howie's target data + Dan's manual data entry*

* Dan's manual data entry is necessary because of a bug in bar graphs in Limn.


Users

User | Description
Developers | The people who write the software that produces the reports
Operators | The people who ensure the software is running and the data is updated
Communications | WMF personnel who use the data in monthly reports
Management | WMF personnel who make decisions based on the data
Community | The Wikipedians who look at the data to assess their success and the health of the community

Prioritized Use Cases

  1. As a Manager, I want the reports to be finished as quickly as possible
     Q: What does this mean? The different technologies used have different capabilities (e.g. page views can be updated daily; anything dump-related will take longer)
  2. As a Manager, I want the reports to be updated at a daily cadence when possible
  3. As a member of the Communications team, I want the reports to be updated in time for the monthly Foundation report
  4. As a Developer, I want an automated method to update the dashboards
  5. As an Operator, I want any automation to be monitored, with alerts on failure

Implementation Details

The data comes from the Webstatscollector output and the database XML dumps. The Webstatscollector output can be updated on a daily basis; this is just a matter of automation, and EZ thinks the effort is quite reasonable.
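
A minimal sketch of what that daily automation could look like, assuming the existing fetch/aggregate/publish steps can be invoked as standalone commands. Every path and script name below is a placeholder, not the real tooling; the only real point is that the wrapper exits non-zero on the first failed step, so a monitoring check can alert on it (see the Operator use case above).

 #!/usr/bin/env python
 """Hypothetical daily driver for the Webstatscollector-based reports.

 Placeholder commands only: substitute the real fetch/aggregate/publish
 scripts. Intended to run from cron once per day and to fail loudly so
 that an alert can be attached to a non-zero exit status.
 """
 import logging
 import subprocess
 import sys

 STEPS = [
     ["/srv/reportcard/fetch_projectcounts.sh"],   # placeholder: download yesterday's hourly files
     ["/srv/reportcard/aggregate_pageviews.py"],   # placeholder: roll hours up into daily CSVs
     ["/srv/reportcard/publish_to_limn.sh"],       # placeholder: copy CSVs where Limn reads them
 ]

 def main():
     logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
     for step in STEPS:
         logging.info("running %s", " ".join(step))
         returncode = subprocess.call(step)
         if returncode != 0:
             logging.error("step %s failed with exit code %d", step[0], returncode)
             sys.exit(returncode)
     logging.info("daily report card update finished")

 if __name__ == "__main__":
     main()

Scheduled from cron once a day, something like this would keep the Webstatscollector-based reports at a daily cadence without changing the underlying scripts.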

As of November 2013, Ken has let us know that ops will not be able to optimize the stub dumps at this time. They have a great deal of work moving the dump infrastructure out of Tampa, as well as work on the primary dumps. We're going to need to investigate using data sources other than the dumps.

The other reports are created from the database dumps (just the stubs) that ops provides. The reason these metrics take so long to run is that the dumps themselves take a long time to finish. EZ's post-processing takes a few days.

EZ has documented most of the dataflow and reports here: http://stats.wikimedia.org/wikistats_overview.html

comScore flow

comScore publishes new data around the 20th. A handful of reports are downloaded in CSV format. Erik's script parses two of those (UV/reach per region), updates master files (comScore only publishes the last 14 months), and generates new CSV files for Limn.
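
A rough sketch of that master-file merge, under assumed paths and column names (month, region, unique_visitors); Erik's scripts almost certainly use different layouts. The point is only that the trailing 14-month window from comScore has to be folded into a locally kept master file, with newer numbers overwriting older ones for the same month and region.

 #!/usr/bin/env python
 """Sketch of the comScore merge step, under assumed file layouts."""
 import csv

 MASTER = "comscore_uv_per_region_master.csv"   # assumed local master file
 LATEST = "comscore_uv_per_region_latest.csv"   # assumed monthly download (last 14 months)
 FIELDS = ["month", "region", "unique_visitors"]  # assumed column names

 def load(path):
     # Key each row by (month, region) so newer data can overwrite older data.
     with open(path, newline="") as f:
         return {(row["month"], row["region"]): row for row in csv.DictReader(f)}

 def merge(master_path, latest_path):
     rows = load(master_path)
     rows.update(load(latest_path))  # comScore's newest numbers win on overlap
     with open(master_path, "w", newline="") as f:
         writer = csv.DictWriter(f, fieldnames=FIELDS, extrasaction="ignore")
         writer.writeheader()
         for key in sorted(rows):
             writer.writerow(rows[key])

 if __name__ == "__main__":
     merge(MASTER, LATEST)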

Pageview flow

Webstatscollector produces two kinds of hourly files:

  1. pagecounts: hourly totals of page views per article (large files)
  2. projectcounts: hourly totals of page views per wiki (1 line for mobile / 1 line for non-mobile) (tiny files)
  • The project files are processed daily (including CSV files) and added to a yearly tar file (which differs from the original in that underreporting from several major outages has been patched using metadata). This could get us the page views faster (rather than monthly); see the sketch below.
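
A sketch of the daily roll-up for the projectcounts files. The line format and the ".m" mobile suffix are assumptions to verify against the real Webstatscollector output, and the file naming in the usage comment is illustrative only.

 #!/usr/bin/env python
 """Sketch: roll 24 hourly projectcounts files up into one daily CSV.

 Assumed format (verify before relying on it): whitespace-separated lines
 starting with the project code, with the hourly view count as the first
 numeric field; mobile traffic reported under codes ending in ".m".
 """
 import csv
 import glob
 import sys
 from collections import defaultdict

 def daily_totals(hourly_files):
     # (project, "mobile" or "desktop") -> summed page views for the day
     totals = defaultdict(int)
     for path in hourly_files:
         with open(path) as f:
             for line in f:
                 fields = line.split()
                 if len(fields) < 2:
                     continue
                 project = fields[0]
                 # first numeric field after the project code, so the parser
                 # copes with or without a "-" placeholder column
                 count = next((int(x) for x in fields[1:] if x.isdigit()), 0)
                 site = "mobile" if project.endswith(".m") else "desktop"
                 base = project[:-2] if project.endswith(".m") else project
                 totals[(base, site)] += count
     return totals

 def write_csv(totals, out_path, day):
     with open(out_path, "w", newline="") as f:
         writer = csv.writer(f)
         writer.writerow(["date", "project", "site", "views"])
         for (project, site), views in sorted(totals.items()):
             writer.writerow([day, project, site, views])

 if __name__ == "__main__":
     # usage (illustrative): daily_projectcounts.py 2013-11-01 'projectcounts-20131101-*' daily.csv
     day, pattern, out_path = sys.argv[1:4]
     write_csv(daily_totals(sorted(glob.glob(pattern))), out_path, day)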

Dump data flow

EZ believes the work that needs to be done on the dumps is to update all of the stubs first, not round robin. We'd need to follow up to understand exactly what this means and whether it is tractable.