Analytics/EventLogging

From MediaWiki.org

Background

EventLogging is an extension to MediaWiki. There is a useful guide here: Extension:EventLogging/Guide.

Backlog (Draft)

Draft features/stories as of 2014-05-21. This is an attempt to start articulating the work that needs to be done from a product development perspective.

Title: Monitoring system fires alert when event volume is high
Description: 5 points; tasked on etherpad. https://bugzilla.wikimedia.org/show_bug.cgi?id=65482
  • When throughput goes above 200, send email
  • This is not meant for immediate support (low SLA)
  • Can use Icinga
  • Content of email says what caused the alert (includes graph)
  • Use Ori's script to monitor rates

Title: Product manager specifies sampling rate for his EL schema
Description: https://bugzilla.wikimedia.org/show_bug.cgi?id=65500

Title: Product manager specifies schema ownership
Description: We need to know who owns a schema so we can fire alerts to them if the volume exceeds what the database can handle.
  • It doesn't belong in the schema

Title: Automated process handles old data
Description: This is a large task that needs to be better defined and then broken down. Some features related to this are:
  • Clean the data in JSON files and in the databases
  • By default, delete events after 90 days
  • Aggregate what needs to be aggregated
  • Keep performance/timing data but strip pageId and userId

Title: User has old data for ServerSideAccountCreation
Description: Scrub or aggregate it so it is available beyond 90 days. It is used by others.

Title: User has old data for NavigationTiming
Description: Scrub or aggregate it so it is available beyond 90 days. It is used by others.

Title: Product manager extends persistence of events
Description: Suppose we're two months into a data collection job and the researcher realizes he needs the data for 180 days. Provide a mechanism to extend the persistence of a set of events. At the very least, have a mechanism to aggregate or anonymize the data so the researcher can have a longer time period for his data.

Title: User suppresses EventLogging for his actions
Description: Define a mechanism for a user to opt out of the EventLogging process.
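
The old-data stories above (delete after 90 days by default, but keep timing data with pageId and userId stripped) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual purging implementation: the record layout (a dict with "schema", "timestamp", and "event" keys), the function name apply_retention, and the KEEP_SCRUBBED set are all hypothetical.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=90)            # default retention named in the backlog
SENSITIVE_FIELDS = ("pageId", "userId")   # fields the backlog says to strip

# Assumption: schemas whose old events are kept in scrubbed form
# instead of being deleted.
KEEP_SCRUBBED = {"NavigationTiming", "ServerSideAccountCreation"}


def apply_retention(events, now):
    """Return the events that survive the retention policy.

    Events newer than the cutoff pass through unchanged. Older events
    are dropped, unless their schema is in KEEP_SCRUBBED, in which case
    a copy without the sensitive fields is kept.
    """
    cutoff = now - RETENTION
    kept = []
    for event in events:
        if event["timestamp"] >= cutoff:
            kept.append(event)
        elif event["schema"] in KEEP_SCRUBBED:
            scrubbed = dict(event)  # shallow copy; the input is not modified
            scrubbed["event"] = {
                key: value
                for key, value in event["event"].items()
                if key not in SENSITIVE_FIELDS
            }
            kept.append(scrubbed)
    return kept
```

Extending persistence for a particular set of events, as in the 180-day story, would then amount to a per-schema override of RETENTION; that part is left out here because the mechanism is not yet defined.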

Transition Plan

EventLogging is a widely used library in the Foundation. The Analytics team and Ori have discussed the details of the Analytics team taking over responsibility for this extension. This document is that proposal.

Administration

  • Formalize agreement with Ori, Ops
  • Talk to RobLa/Platform
    • Figure out what regular commitment to ask of Ori
    • Discuss this document
  • Send out support email
  • Target handover start: 4/16 (originally 4/2; needs agreement from the Analytics, Ops, and Platform teams)

Schema Support

Probably the most common EventLogging support task is schema review. We'd like to make this a rotating responsibility among the users.

  • Create EventLogging review group in Gerrit
    • Ask people for consent before adding them
  • Announce / request social convention of adding people to the review group once they've successfully instrumented something

Data Validation/Support

We'd also like users to take responsibility for their own data generated by EventLogging. The Analytics team isn't staffed to follow up on invalid data from a single schema, but we will invest in automated tools and notifications.

  • Announce that generating invalid data is a software bug and that users are expected to fix it promptly
  • Invite people to subscribe to eventlogging-alert
  • Provide information about notification and debugging tools

Development Support

  • Bugs reported in Bugzilla should be acknowledged and resolved
  • Automatic purging or anonymization of data, in line with the Privacy Policy, needs to be implemented

Development/Operations Tasks

  • Create a Graphite script that shows valid and invalid events for each schema, satisfying the requirement that EventLogging be, in principle, self-service
  • Add an alert for the number of events
  • Send out a daily report of the number of valid and invalid events logged, broken down by schema
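
As a rough sketch of the counting behind these tasks, the snippet below tallies valid and invalid events per schema, formats a daily report, and lists schemas whose volume exceeds a threshold. It assumes validation results arrive as (schema name, is_valid) pairs; the function names are hypothetical, and it does not talk to Graphite, Icinga, or a mail server.

```python
from collections import Counter


def tally(results):
    """Count valid and invalid events per schema.

    `results` is an iterable of (schema_name, is_valid) pairs.
    """
    valid, invalid = Counter(), Counter()
    for schema, is_valid in results:
        (valid if is_valid else invalid)[schema] += 1
    return valid, invalid


def daily_report(valid, invalid):
    """Format one report line per schema, sorted by schema name."""
    lines = []
    for schema in sorted(set(valid) | set(invalid)):
        lines.append(
            f"{schema}: {valid[schema]} valid, {invalid[schema]} invalid"
        )
    return "\n".join(lines)


def over_threshold(valid, invalid, threshold):
    """Return the schemas whose total event count exceeds `threshold`."""
    totals = valid + invalid
    return sorted(name for name, count in totals.items() if count > threshold)
```

The threshold is left as a parameter because the appropriate value and its time window depend on what the database can handle. A report line also shows an alert recipient which schema caused the alert.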

Operations

  • Operational support by analytics team: Event_logging/OperationalSupport
  • Data recovery plan - Ori thinks we shouldn't hack the EventCapsule or validation model again
    • Dario thinks we will have some high-priority data recovery needs related to DB outages
  • Create and respond to alerts
  • Once a month, the backup process (vanadium -> stat1001 -> tridge) should get a quick lookover to ensure that it is functioning.
    • Once every six months, a drill should be conducted to test system failover and recovery procedures.
  • Sean Pringle is supporting db replication
  • Failover for Vanadium