Analytics/Epics/Speed up Report Card Generation
We've been asked to make two changes to how the Monthly Report Card is updated:
- Finish faster, that is, have the monthly data updated sooner
- Update more often, that is, change the reporting cadence
Detailed Tracking Links
Reports and Data Sources
The Report Card contains the following reports:
| Report | Data source |
|---|---|
| Unique Visitors per Region | comScore + Erik's scripts |
| Pageviews for all Wikimedia Projects | Webstatscollector output + Erik's scripts |
| Pageviews to Mobile Site | Webstatscollector output + Erik's scripts |
| New Editors Per Month for All Wikimedia Projects | Dumps + Erik's scripts |
| Active Wikimedia Editors for All Wikimedia Projects (5+ edits per month) | Dumps + Erik's scripts |
| Wikimedia Projects Reach by Region | comScore + Erik's scripts |
| Commons Binaries (log scale) | Dumps + Erik's scripts |
| Wikipedia Articles | Dumps + Erik's scripts |
| New Wikipedia Articles per Day | Dumps + Erik's scripts |
| Wikipedia Edits per Month | Dumps + Erik's scripts |
| Very Active Editors for All Wikimedia Projects (100+ edits per month) | Dumps + Erik's scripts |
| Wikivoyage Editors | Dumps + Erik's scripts |
Annual Plan YYYY-yy Targets
| Report | Data source |
|---|---|
| Mobile Page Views - Projections and Targets | Webstatscollector output + Erik's scripts + Howie's target data |
| Active Editors (5+) | Dumps + Erik's scripts + Howie's target data + Dan's manual data entry* |
| Active Uploaders Commons (1+) | ??? + Howie's target data + Dan's manual data entry* |
* Dan's manual data entry is necessary because of a bug in Limn's bar graphs.
Stakeholders
| Stakeholder | Description |
|---|---|
| Developers | The people who write the software that produces the reports |
| Operators | The people who ensure the software is running and the data is updated |
| Communications | WMF personnel who use the data in monthly reports |
| Management | WMF personnel who make decisions based on the data |
| Community | The Wikipedians who look at the data to assess their success and the health of the community |
Prioritized Use Cases
- As a Manager, I want the reports to be finished as quickly as possible
  - Q: What does this mean? The different technologies used have different capabilities (e.g. page views can be updated daily, while anything dump-related will take longer)
- As a Manager, I want the reports to be updated at a daily cadence when possible
- As a member of the Communications team, I want the reports to be updated in time for the monthly Foundation report
- As a Developer, I want an automated method to update the dashboards
- As an Operator, I want any automation to be monitored, with alerts on failure
The data comes from the Webstatscollector output and the database XML dumps. The Webstatscollector output can be updated on a daily basis -- this is just a matter of automation; EZ thinks the effort is quite reasonable.
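As a rough illustration of what that daily automation could look like, here is a minimal sketch that fetches one day's hourly projectcounts files from the pagecounts-raw archive linked below. The directory layout and filename pattern are assumptions inferred from the 2013 listing and should be verified against the actual archive; this is a sketch, not the production job.

```python
"""Minimal sketch of a daily fetch job for Webstatscollector projectcounts files.

Assumptions (verify against http://dumps.wikimedia.org/other/pagecounts-raw/):
  - directory layout:  /<year>/<year>-<month>/
  - filename pattern:  projectcounts-YYYYMMDD-HH0000
Neither is confirmed on this page; both are illustrative.
"""
import os
import urllib.request
from datetime import date, timedelta

BASE_URL = "http://dumps.wikimedia.org/other/pagecounts-raw"
OUT_DIR = "projectcounts"  # local staging area for the hourly files


def fetch_day(day: date, out_dir: str = OUT_DIR) -> list[str]:
    """Download the 24 hourly projectcounts files for one day."""
    os.makedirs(out_dir, exist_ok=True)
    fetched = []
    for hour in range(24):
        name = f"projectcounts-{day:%Y%m%d}-{hour:02d}0000"
        url = f"{BASE_URL}/{day:%Y}/{day:%Y-%m}/{name}"
        dest = os.path.join(out_dir, name)
        try:
            urllib.request.urlretrieve(url, dest)
            fetched.append(dest)
        except OSError as err:
            # A missing hour is logged rather than fatal: outages happen
            # (under-reported hours are patched later, as noted below).
            print(f"skipped {name}: {err}")
    return fetched


if __name__ == "__main__":
    # Run once a day (e.g. from cron) for the previous day's data.
    fetch_day(date.today() - timedelta(days=1))
```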
As of November 2013, Ken has let us know that ops will not be able to optimize the stub dumps at this time. They have a great deal of work moving the dump infrastructure out of Tampa as well as work on the primary dumps. We're going to need to investigate data sources other than the dumps.
The other reports are created from the database dumps (just the stubs) that ops provides. These metrics take so long to run because the dumps themselves take a long time to finish, and EZ's post-processing takes a few days on top of that.
EZ has documented most of the dataflow and reports here: http://stats.wikimedia.org/wikistats_overview.html
comScore publishes new data around the 20th of each month. A handful of reports are downloaded in CSV format. Erik's script parses two of those (UV/reach per region), updates master files (comScore only publishes the last 14 months), and generates new CSV files for Limn.
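As a hedged sketch of that merge step: because comScore only publishes a rolling 14-month window, each new download has to be folded into a locally kept master file before the Limn CSVs are regenerated. The file names and column names below ("month", "region", "unique_visitors") are illustrative assumptions; this page does not describe the real comScore file layout or Erik's actual script.

```python
"""Sketch of folding a 14-month comScore CSV into a running master file.

The column names ("month", "region", "unique_visitors") and file names are
illustrative only; the real comScore downloads and Erik's script may differ.
"""
import csv


def merge_comscore(master_path: str, new_path: str) -> None:
    """Merge the newest 14-month download into the master file.

    Rows are keyed by (month, region); newer data overwrites older rows,
    so corrections published by comScore are picked up.
    """
    rows: dict[tuple[str, str], dict] = {}

    for path in (master_path, new_path):
        try:
            with open(path, newline="") as f:
                for row in csv.DictReader(f):
                    rows[(row["month"], row["region"])] = row
        except FileNotFoundError:
            pass  # first run: no master file yet

    fieldnames = ["month", "region", "unique_visitors"]
    with open(master_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        for key in sorted(rows):
            writer.writerow(rows[key])


if __name__ == "__main__":
    merge_comscore("comscore_uv_master.csv", "comscore_uv_2013-11.csv")
```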
- Erik Z takes the output of Webstatscollector: http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-11/; documentation is at http://dumps.wikimedia.org/other/pagecounts-raw/
  - pagecounts: hourly totals of page views per article (large files)
  - projectcounts: hourly totals of page views per wiki (1 line for mobile, 1 line for non-mobile) (tiny files)
- The projectcounts files are processed daily (including CSV files) and added to a yearly tar file, which differs from the original in that under-reporting from several major outages has been patched using metadata. This could get us page views faster than the current monthly cadence (see the sketch below).
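To make the daily processing step more concrete, the sketch below rolls the hourly projectcounts files up into one daily row per wiki, keeping mobile and non-mobile counts separate, and appends the result to a CSV that a Limn dashboard could read. The projectcounts line layout and the ".m" mobile suffix are assumptions, not confirmed on this page; check a sample file and adjust COUNT_FIELD before relying on this.

```python
"""Sketch: roll hourly projectcounts files up to daily per-wiki totals.

Assumed line layout (NOT confirmed here, check a sample file first):
  whitespace-separated fields, wiki/project code in field 0,
  hourly view total in the field indexed by COUNT_FIELD below.
Mobile wikis are assumed to carry a ".m" suffix on the code (the page
above only says there is one mobile and one non-mobile line per wiki).
"""
import csv
import glob
from collections import defaultdict

COUNT_FIELD = 1  # assumed position of the hourly view count; verify!


def daily_totals(hourly_files: list[str]) -> dict[str, dict[str, int]]:
    """Sum hourly per-wiki counts into {wiki: {"desktop": n, "mobile": n}}."""
    totals: dict[str, dict[str, int]] = defaultdict(
        lambda: {"desktop": 0, "mobile": 0}
    )
    for path in hourly_files:
        with open(path) as f:
            for line in f:
                fields = line.split()
                if len(fields) <= COUNT_FIELD or not fields[COUNT_FIELD].isdigit():
                    continue  # skip malformed lines
                code, count = fields[0], int(fields[COUNT_FIELD])
                if code.endswith(".m"):
                    totals[code[:-2]]["mobile"] += count
                else:
                    totals[code]["desktop"] += count
    return totals


def write_limn_csv(day: str, totals: dict, out_path: str) -> None:
    """Append one row per wiki to a Limn-style CSV (date,wiki,desktop,mobile)."""
    with open(out_path, "a", newline="") as f:
        writer = csv.writer(f)
        for wiki in sorted(totals):
            t = totals[wiki]
            writer.writerow([day, wiki, t["desktop"], t["mobile"]])


if __name__ == "__main__":
    files = sorted(glob.glob("projectcounts/projectcounts-20131101-*"))
    write_limn_csv("2013-11-01", daily_totals(files), "daily_pageviews.csv")
```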
Dump data flow
EZ believes the work that needs to be done on the dumps is to update all of the stubs first, rather than in round-robin order. We'd need to follow up to understand exactly what this means and whether it is tractable.