Analytics/Epics/Speed up Report Card Generation

= Monthly Report Card Project =

Goals
We've been asked for two changes to how the Monthly Report Card is updated:
 * 1) Finish faster, that is, have the monthly data updated sooner
 * 2) Update more often, that is, change the reporting cadence

Reports and Data Sources
The Report Card has the following reports:

Annual Plan YYYY-yy Targets
* Dan's manual data entry is necessary because of a bug in bar graphs in Limn.

Prioritized Use Cases

 * 1) As a Manager, I want the reports to be finished as quickly as possible
 * 2) As a Manager, I want the reports to be updated at a daily cadence when possible
 * 3) As a member of the Communications team, I want the reports to be updated in time for the monthly Foundation report
 * 4) As a Developer, I want an automated method to update the dashboards
 * 5) As an Operator, I want any automation to be monitored, with alerts when it fails
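
For use case 5, monitoring could start from something as simple as a freshness check on each report. The following is a minimal sketch, not an existing tool: the function name `is_stale` and the 36-hour threshold are assumptions chosen for illustration, and a real threshold would depend on the cadence agreed for each report.

```python
from datetime import datetime, timedelta


def is_stale(last_update, now, max_age=timedelta(hours=36)):
    """Return True when a report has not been updated within max_age.

    max_age defaults to 36 hours, an assumed tolerance for a daily report
    (one day plus slack); it is not a value taken from this page.
    """
    return now - last_update > max_age


# Example: a report last updated two days ago should raise an alert.
if is_stale(datetime(2013, 11, 1), datetime(2013, 11, 3)):
    print("ALERT: report has not been updated within the allowed window")
```

An alerting job would call this once per report and notify the operator for every report that comes back stale.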

Implementation Details
The data used comes from the webstats collector output and the database XML dumps. The webstats collector output can be updated on a daily basis -- this is just a matter of automation; EZ thinks the effort is quite reasonable.

The other reports are created from the database dumps (just the stubs) that ops provides. These metrics take a long time to run because the dumps themselves take a long time to finish. EZ's post-processing then takes a few more days.

EZ has documented most of the dataflow and reports here: http://stats.wikimedia.org/wikistats_overview.html

Pageview flow

 * 1) Erik Z takes the output of webstats collectors: http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-11/; documentation is http://dumps.wikimedia.org/other/pagecounts-raw/
 * pagecounts: hourly totals of page views (large files)
 * projectcounts: hourly totals of page views per wiki (1 line for mobile/1 line for non-mobile) (large files)


 * 2) The project files are processed daily (including CSV files), so this could get us the page views faster (daily rather than monthly)
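
As an illustration of how the hourly projectcounts files described above could be rolled up into daily page views, here is a minimal sketch. It assumes each line has the form `<project> - <view_count> <bytes>`; that layout, the function names, and the sample rows are assumptions for illustration and should be checked against the pagecounts-raw documentation linked above. It is not EZ's actual processing.

```python
from collections import defaultdict


def parse_projectcounts(text):
    """Parse one hourly projectcounts file into {project: views}.

    Assumed line format (not confirmed by this page):
        <project> - <view_count> <bytes>
    Lines that do not match this shape are skipped.
    """
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4 or not fields[2].isdigit():
            continue
        project = fields[0]
        counts[project] = counts.get(project, 0) + int(fields[2])
    return counts


def daily_totals(hourly_texts):
    """Sum per-project views over the hourly files of one day."""
    totals = defaultdict(int)
    for text in hourly_texts:
        for project, views in parse_projectcounts(text).items():
            totals[project] += views
    return dict(totals)


# Made-up sample rows for two hours; real files hold one row per project.
hour0 = "en - 100 5000\nde - 40 900\n"
hour1 = "en - 50 2500\n"
print(daily_totals([hour0, hour1]))
```

Running this once per day over the 24 hourly files would yield a daily page view total per project, which is the cadence change requested in the goals.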

Other flows
EZ believes the work that needs to be done on the dumps is to update all of the stubs first, rather than round robin. We'd need to follow up with them to understand exactly what this means and whether it is tractable.