Analytics/Epics/Speed up Report Card Generation

= Goals =

We've been asked for two changes to how the Monthly Report is updated:
 * 1) Finish faster, i.e. have the monthly data updated sooner
 * 2) Update more often, i.e. change the reporting cadence

= Detailed Tracking Links =

Development (Mingle)

= Reports and Data Sources =

The Report Card has the following reports

Annual Plan YYYY-yy Targets
* Dan's manual data entry is necessary because of a bug in bar graphs in Limn.

= Users =

= Prioritized Use Cases =


 * 1) As a Manager, I want the reports to be finished as quickly as possible
 * Q: What does "as quickly as possible" mean in practice? The different technologies used have different capabilities (e.g. page views can be updated daily, while anything dump-related will take longer)


 * 1) As a Manager, I want the reports to be updated at a daily cadence when possible
 * 2) As a member of the Communications team, I want the reports to be updated in time for the monthly Foundation report
 * 3) As a Developer, I want an automated method to update the dashboards
 * 4) As an Operator, I want any automation to be monitored and alerted on

= Implementation Details =

The data used comes from the webstats collector output and the database XML dumps. The webstats collector output can be updated on a daily basis -- this is just a matter of automation; EZ thinks the effort is quite reasonable.
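Since the daily update of the webstats collector output is described as "just a matter of automation", a minimal sketch of one piece of that automation is shown below: building the 24 hourly file URLs for a given day under the pagecounts-raw layout. The `projectcounts_urls` helper name is hypothetical, and the filename pattern is an assumption based on the published directory layout; it should be checked against the live files.

```python
from datetime import date

# Base of the published webstats collector output (from the links below).
BASE = "http://dumps.wikimedia.org/other/pagecounts-raw"

def projectcounts_urls(day: date) -> list[str]:
    """Hypothetical helper: the 24 hourly projectcounts URLs for one day.

    Assumes files are named projectcounts-YYYYMMDD-HH0000 inside a
    year/year-month directory; verify against the live listing.
    """
    return [
        f"{BASE}/{day.year}/{day.year}-{day.month:02d}/"
        f"projectcounts-{day:%Y%m%d}-{hour:02d}0000"
        for hour in range(24)
    ]
```

A daily cron job could fetch these URLs and feed the results into the existing report pipeline.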

As of November 2013, Ken has let us know that ops will not be able to optimize the stub dumps at this time. They have a great deal of work in moving the dump infrastructure out of Tampa, as well as work on the primary dumps. We're going to need to investigate using data sources other than the dumps.

The other reports are created from the database dumps (just the stubs) that ops provides. The reason these metrics take so long to run is that the dumps themselves take a long time to finish. EZ's post-processing takes a few days.

EZ has documented most of the dataflow and reports here: http://stats.wikimedia.org/wikistats_overview.html

comScore flow
comScore publishes new data around the 20th of each month. A handful of reports are downloaded in CSV format. Erik's script parses two of those (UV and reach per region), updates the master files (comScore only publishes the last 14 months), and generates new CSV files for Limn.
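Because comScore only publishes the last 14 months, the master files are what preserve the older history. The sketch below (not Erik's actual script; the `merge_comscore` name, column names, and keying are illustrative assumptions) shows the core merge: newer comScore rows overwrite matching (region, month) entries while older history is kept.

```python
import csv
import io

def merge_comscore(master_rows, latest_csv_text):
    """Illustrative sketch: merge the latest comScore CSV into the master data.

    Assumes rows are dicts with at least "region" and "month" columns
    (hypothetical names). Rows from the new CSV win on collisions; master
    rows older than comScore's 14-month window survive untouched.
    """
    merged = {(r["region"], r["month"]): r for r in master_rows}
    for row in csv.DictReader(io.StringIO(latest_csv_text)):
        merged[(row["region"], row["month"])] = row  # newer data wins
    return [merged[key] for key in sorted(merged)]
```

The merged rows can then be written back out as the master file and as the per-report CSVs Limn consumes.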

Pageview flow

 * Erik Z takes the output of webstats collectors: http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-11/; documentation is http://dumps.wikimedia.org/other/pagecounts-raw/
 * pagecounts: hourly totals of page views per article (large files)
 * projectcounts: hourly totals of page views per wiki (1 line for mobile/1 line for non-mobile) (tiny files)


 * The project files are processed daily (including CSV files) and added to a yearly tar file (which differs from the original in that underreporting from several major outages has been patched from metadata). This could get us the page views faster (daily rather than monthly)
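The daily processing of the project files amounts to summing 24 hourly per-wiki totals into one daily total. A minimal sketch, assuming each hourly projectcounts line has the form `project - count bytes` (check the live files for the exact format; the `daily_totals` name is hypothetical):

```python
from collections import Counter

def daily_totals(hourly_file_texts):
    """Sum hourly projectcounts files into per-wiki daily page view totals.

    Assumed line format (verify against the published files):
        project - count bytes
    e.g. "en - 123456 0". Malformed lines are skipped.
    """
    totals = Counter()
    for text in hourly_file_texts:
        for line in text.splitlines():
            parts = line.split()
            if len(parts) >= 3:
                project, count = parts[0], int(parts[2])
                totals[project] += count
    return dict(totals)
```

Run daily over the previous day's 24 hourly files, this would yield the per-wiki page view numbers without waiting for the monthly cycle.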

Dump data flow
EZ believes the work that needs to be done on the dumps is to update all of the stubs first, rather than round-robin. We'd need to follow up with ops to understand exactly what this means and whether it is tractable.