Background

EventLogging is an extension to MediaWiki. There is a useful guide here: Extension:EventLogging/Guide.

Backlog (Draft)

Draft features/stories as of 2014-05-21. This is an attempt to start articulating the work that needs to be done from a product development perspective.

Title	Description
Monitoring system fires alert when event volume is high	5 points; tasked on etherpad; https://bugzilla.wikimedia.org/show_bug.cgi?id=65482; when throughput goes above 200, send an email; this is not meant for immediate support (low SLA); can use Icinga; the content of the email says what caused the alert (includes a graph); use Ori's script to monitor rates
Product manager specifies sampling rate for his EL schema	https://bugzilla.wikimedia.org/show_bug.cgi?id=65500
Product manager specifies schema ownership	We need to know who owns a schema so we can send alerts to them if the volume exceeds what the database can handle. Ownership information doesn't belong in the schema itself.
Automated process handles old data	This is a large task that needs to be better defined and then broken down. Some features related to this are: clean the data in JSON files and in the databases; by default, delete events after 90 days; aggregate what needs to be aggregated; keep performance/timing data but strip pageId and userId.
User has old data for ServerSideAccountCreation	Scrub or aggregate it so it remains available beyond 90 days. It is used by others.
User has old data for NavigationTiming	Scrub or aggregate it so it remains available beyond 90 days. It is used by others.
Product manager extends persistence of events	Suppose we're two months into a data collection job and the researcher realizes he needs the data for 180 days. Provide a mechanism to extend the persistence of a set of events. At the very least, have a mechanism to aggregate or anonymize the data so the researcher can keep it for a longer time period.
User suppresses EventLogging for his actions	Define a mechanism for user to opt-out of the EventLogging process.
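
The volume alert in the first story could be sketched roughly as follows. This is only an illustration, not the actual monitoring setup: the function name `build_alert`, the per-schema rate dictionary, and the message layout are assumptions, and the backlog does not state the unit of the 200 threshold. Delivery (for example through Icinga) and the graph attachment are left out.

```python
from typing import Dict, Optional

# Threshold taken from the backlog item. Its unit is not stated there;
# this sketch simply compares against the summed per-schema rates.
THROUGHPUT_THRESHOLD = 200


def build_alert(rates_by_schema: Dict[str, float],
                threshold: float = THROUGHPUT_THRESHOLD) -> Optional[str]:
    """Return an alert message if total throughput exceeds the threshold.

    Returns None when no alert is needed. `rates_by_schema` is a
    hypothetical mapping of schema name to its current event rate.
    """
    total = sum(rates_by_schema.values())
    if total <= threshold:
        return None
    # Rank schemas by rate so the likely cause appears first in the email.
    ranked = sorted(rates_by_schema.items(), key=lambda kv: kv[1], reverse=True)
    lines = ["EventLogging throughput %.1f exceeds threshold %d" % (total, threshold)]
    lines += ["  %s: %.1f" % (name, rate) for name, rate in ranked]
    return "\n".join(lines)
```

Ranking the schemas in the message body is one way to satisfy the requirement that the email says what caused the alert.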

Transition Plan

Group:	Analytics/Engineering
Start:	2014-03-31
End:	2014-09

EventLogging is a widely used library in the Foundation. The Analytics team and Ori have discussed the details of the Analytics team taking over responsibility for this extension. This document is the proposal for that handover.

Administration

~~Formalize agreement with Ori, Ops~~
~~Talk to RobLa/Platform~~
- ~~Figure out what ask to make of Ori in terms of regular commitment~~
- ~~Discuss this document~~
Send out support email
Target handover start ~~4/2~~ 4/16 (Needs agreement from Analytics, Ops, Platform teams)

Schema Support

Probably the most common EventLogging support task is schema review. We'd like to make this a rotating responsibility among the users.

Create EventLogging review group in Gerrit
- Ask people for consent before adding them
Announce / request social convention of adding people to the review group once they've successfully instrumented something

Data Validation/Support

We'd also like users to take responsibility for their own data generated by EventLogging. The Analytics team isn't staffed to follow up on invalid data from a single schema but we will invest in automated tools and notifications.

Announce that generating invalid data is a software bug and that you are expected to fix it promptly.
Invite people to subscribe to eventlogging-alert
Provide information about notification and debugging tools

Development Support

Bugs reported in Bugzilla should be acknowledged and resolved.
Automatic purging or anonymizing of data, in line with the Privacy Policy, needs to be implemented.
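
A purge-or-anonymize pass of this kind might look like the sketch below. The 90-day default and the pageId/userId fields come from the backlog; everything else is an assumption made for illustration: the record layout (a dict with `schema`, `timestamp`, and `event` keys), the function name `apply_retention`, and the `KEEP_SCRUBBED` set of schemas that are retained in scrubbed form.

```python
from datetime import datetime, timedelta
from typing import Optional

RETENTION_DAYS = 90                       # default from the backlog
SENSITIVE_FIELDS = ("pageId", "userId")   # fields the backlog says to strip
# Hypothetical list of schemas whose old data is kept after scrubbing.
KEEP_SCRUBBED = {"NavigationTiming"}


def apply_retention(record: dict, now: datetime) -> Optional[dict]:
    """Return the record to keep, or None if it should be deleted.

    Records within the retention window are returned unchanged. Older
    records are deleted unless their schema is in KEEP_SCRUBBED, in which
    case a copy without the sensitive fields is returned.
    """
    age = now - record["timestamp"]
    if age <= timedelta(days=RETENTION_DAYS):
        return record
    if record["schema"] in KEEP_SCRUBBED:
        payload = {k: v for k, v in record["event"].items()
                   if k not in SENSITIVE_FIELDS}
        return {**record, "event": payload}
    return None
```

Extending persistence for a particular data collection job would then amount to a per-schema override of `RETENTION_DAYS`, which this sketch does not attempt.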

Development/Operations Tasks

Create a Graphite script that shows valid and invalid events for each schema, thereby satisfying the requirement that EventLogging be, in principle, self-service.
Add alert for number of events
A daily report should go out reporting the number of valid and invalid events logged, broken down by schema.
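
The per-schema breakdown for the daily report could be computed along these lines. This is a sketch under assumptions: the input is taken to be an iterable of `(schema_name, is_valid)` pairs, and the function names and the tab-separated report layout are invented for illustration rather than taken from any existing EventLogging tool.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def summarize(events: Iterable[Tuple[str, bool]]) -> Dict[str, Dict[str, int]]:
    """Count valid and invalid events per schema."""
    counts = defaultdict(lambda: {"valid": 0, "invalid": 0})
    for schema, is_valid in events:
        counts[schema]["valid" if is_valid else "invalid"] += 1
    return dict(counts)


def format_report(counts: Dict[str, Dict[str, int]]) -> str:
    """Render the counts as a tab-separated table, one schema per line."""
    lines = ["schema\tvalid\tinvalid"]
    for schema in sorted(counts):
        c = counts[schema]
        lines.append("%s\t%d\t%d" % (schema, c["valid"], c["invalid"]))
    return "\n".join(lines)
```

The same counts could feed both the report and the event-count alert, so the two tasks would not need separate data collection.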

Operations

Operational support by analytics team: Event_logging/OperationalSupport
Data recovery plan - Ori thinks we shouldn't hack the EventCapsule or validation model again
- Dario thinks we will have some high-priority data recovery needs related to DB outages
Create and respond to alerts
Once a month, the backup process (vanadium -> stat1001 -> tridge) should get a quick lookover to ensure that it is functioning.
- Once every six months, a drill should be conducted to test system failover and recovery procedures.
Sean Pringle is supporting db replication
- Task: https://rt.wikimedia.org/Ticket/Display.html?id=7081
Failover for Vanadium