Analytics/EventLogging

= Transition Plan =

EventLogging is a widely used library in the Foundation. The Analytics team and Ori have discussed the details of the Analytics team taking over responsibility for this extension. This document is the resulting proposal.

== Administration ==

 * Formalize agreement with Ori, Ops
 * Talk to RobLa/Platform
 * Decide what regular commitment to ask of Ori
 * Discuss this document
 * Send out support email
 * Target handover start 4/2 4/16 (Needs agreement from Analytics, Ops, Platform teams)

== Schema Support ==
Probably the most common EventLogging support task is schema review. We'd like to make this a rotating responsibility among EventLogging users.


 * Create EventLogging review group in Gerrit
 * Ask people for consent before adding them
 * Announce and request a social convention of adding people to the review group once they've successfully instrumented something

== Data Validation/Support ==
We'd also like users to take responsibility for the data their own schemas generate through EventLogging. The Analytics team isn't staffed to follow up on invalid data from a single schema, but we will invest in automated tools and notifications.


 * Announce that generating invalid data is a software bug and that the schema owner is expected to fix it promptly.
 * Invite people to subscribe to eventlogging-alert
 * Provide information about notification and debugging tools

== Development Support ==

 * Bugs reported in Bugzilla should be acknowledged and resolved.

== Development/Operations Tasks ==
 * Ganglia CPU report for vanadium: http://ganglia.wikimedia.org/latest/?c=Miscellaneous%20eqiad&h=vanadium.eqiad.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2
 * Create a Graphite script that shows valid and invalid events for each schema, thereby satisfying the requirement that EventLogging be, in principle, self-service
 * Add alert for number of events
 * The Ganglia scripts need to be fixed.
 * A daily report should go out reporting the number of valid and invalid events logged, broken down by schema.
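
The per-schema counts behind the graph, the alert, and the daily report could be produced along these lines. This is only a sketch under assumptions: the input shape (pairs of schema name and a validity flag), the function names `tally_events` and `format_report`, and the 5% threshold are all hypothetical and do not reflect EventLogging's actual log format or any agreed alerting policy.

```python
from collections import Counter


def tally_events(events):
    """Count valid and invalid events per schema.

    `events` is an iterable of (schema_name, is_valid) pairs. This input
    shape is an assumption for illustration, not EventLogging's log format.
    """
    valid, invalid = Counter(), Counter()
    for schema, is_valid in events:
        (valid if is_valid else invalid)[schema] += 1
    return valid, invalid


def format_report(valid, invalid, max_invalid_ratio=0.05):
    """Build one report line per schema, flagging high invalid ratios.

    The default threshold is a placeholder, not an agreed value.
    """
    lines = []
    for schema in sorted(set(valid) | set(invalid)):
        ok, bad = valid[schema], invalid[schema]
        ratio = bad / (ok + bad)  # every listed schema has at least one event
        flag = "  <-- above threshold" if ratio > max_invalid_ratio else ""
        lines.append(f"{schema}: valid={ok} invalid={bad}{flag}")
    return "\n".join(lines)


# Made-up sample data, for illustration only.
sample = [
    ("SchemaA", True),
    ("SchemaA", True),
    ("SchemaB", True),
    ("SchemaB", False),
]
valid, invalid = tally_events(sample)
report = format_report(valid, invalid)
print(report)
```

The same tallies could feed both the daily email and the alert, so that schema owners see the numbers the Analytics team sees.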

== Operations ==

 * Data recovery plan - Ori thinks we shouldn't hack the EventCapsule or validation model again
 * Dario thinks we will have some high-priority data recovery needs related to DB outages
 * Create and respond to alerts
 * Once a month, the backup process (vanadium -> stat1001 -> tridge) should get a quick review to ensure that it is functioning.
 * Once every six months, a drill should be conducted to test system failover and recovery procedures.
 * Sean Pringle is supporting db replication
 * Task: https://rt.wikimedia.org/Ticket/Display.html?id=7081
 * Failover for Vanadium