Wikimedia Product/Wikimedia Product Infrastructure team/Action API request analytics

Action API request analytics will be reports and/or dashboards to track usage of the MediaWiki Action API for Wikimedia production websites. This tracking is intended to be similar to the Pageviews tracking that is currently done by the Analytics team for articles in the main namespace.

Desired outcome
Data sets providing:
 * Number of user agents coming from Labs or third party services, on a monthly basis
 * Volume of API requests coming from Labs or third party services, on a monthly basis
 * Ranking of user agents coming from Labs or third party services with the highest activity, on a monthly basis
 * Ranking of most requested actions/parameters, on a monthly basis
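
As a rough illustration of the first three data sets, the sketch below groups request records by month and reports the distinct user agent count, the request volume, and a user agent ranking. It is only a sketch under stated assumptions: the `origin` key (with values `labs`, `external`, `internal`) and the function name `monthly_summary` are hypothetical and are not part of the logged schema, and the production pipeline is expected to do this work in Hive rather than Python.

```python
from collections import Counter, defaultdict


def monthly_summary(requests, top_n=10):
    """Summarize API requests per month.

    `requests` is an iterable of dicts with the keys 'dt' (ISO 8601 string),
    'user_agent', and 'origin'. The 'origin' key is a hypothetical
    classification of the client: 'labs', 'external', or 'internal'.
    """
    per_month = defaultdict(Counter)
    for req in requests:
        # Only Labs and third party (external) traffic is reported.
        if req["origin"] not in ("labs", "external"):
            continue
        month = req["dt"][:7]  # 'YYYY-MM' prefix of the timestamp
        per_month[month][req["user_agent"]] += 1

    return {
        month: {
            "distinct_user_agents": len(counts),
            "request_volume": sum(counts.values()),
            "top_user_agents": counts.most_common(top_n),
        }
        for month, counts in per_month.items()
    }


if __name__ == "__main__":
    sample = [
        {"dt": "2015-11-01T00:00:00Z", "user_agent": "BotA", "origin": "labs"},
        {"dt": "2015-11-02T00:00:00Z", "user_agent": "BotA", "origin": "labs"},
        {"dt": "2015-11-03T00:00:00Z", "user_agent": "BotB", "origin": "external"},
        {"dt": "2015-11-03T00:00:00Z", "user_agent": "Internal", "origin": "internal"},
    ]
    print(monthly_summary(sample))
```

Keeping one counter per month means the same pass yields all three figures, so the ranking and the totals cannot drift apart.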

Data acquisition
Raw Action API requests will be tracked using MediaWiki structured logging, Kafka and Hive.


 * 1) ✅ Log events will be emitted by MediaWiki for each Action API request using a structured logging context that contains the data needed to populate the Hive tables.
 * 2) Monolog will be configured to route these log events to a Kafka topic.
 * 3) Camus will process events from the Kafka topic and load them into a raw data table in Hive.
 * 4) Oozie will run a (daily?) Hive script to summarize the raw data table into various aggregate tables designed for specific reporting needs via ETL processing.
 * 5) Oozie will run a Hive script to discard the raw request data after processing, to reduce the risk of leaking sensitive data through a network breach or a malicious actor.
 * 6) Oozie will run a Hive script to generate monthly summary data from the aggregate tables for export to interested parties.

Avro schema
 {
     "type": "record",
     "name": "ApiRequest",
     "namespace": "org.wikimedia.mediawiki.api",
     "doc": "Describes an API request made via mediawiki ApiMain",
     "fields": [
         { "name": "dt",              "type": "string" },
         { "name": "client_ip",       "type": "string" },
         { "name": "user_agent",      "type": "string" },
         { "name": "wiki",            "type": "string" },
         { "name": "time_backend_ms", "type": "int" },
         { "name": "params",          "type": { "type": "map", "values": "string" } }
     ]
 }


 * Implementation of matching 'ApiRequest' log channel for MediaWiki core
 * TODO Configuration patch to send 'ApiRequest' channel to Kafka
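
To make the shape of a single record concrete, here is a minimal Python sketch that builds an event with the fields of the ApiRequest schema and checks the field types before it would be handed to a logger. The helper names `build_api_request_event` and `validate_api_request_event` are hypothetical, and the timestamp format for `dt` is an assumption (the schema only says `string`). The real events are emitted by MediaWiki in PHP, so this only illustrates the data layout.

```python
import json
from datetime import datetime, timezone

# Field names and types copied from the ApiRequest Avro schema above.
API_REQUEST_FIELDS = {
    "dt": str,
    "client_ip": str,
    "user_agent": str,
    "wiki": str,
    "time_backend_ms": int,
    "params": dict,
}


def build_api_request_event(client_ip, user_agent, wiki, time_backend_ms,
                            params, now=None):
    """Build a dict shaped like an ApiRequest record (hypothetical helper)."""
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return {
        # Assumption: ISO 8601 in UTC; the schema does not fix a format.
        "dt": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "client_ip": client_ip,
        "user_agent": user_agent,
        "wiki": wiki,
        "time_backend_ms": int(time_backend_ms),
        # The schema declares params as a map of string values.
        "params": {str(k): str(v) for k, v in params.items()},
    }


def validate_api_request_event(event):
    """Return a list of problems; an empty list means the event fits."""
    errors = []
    for name, expected in API_REQUEST_FIELDS.items():
        if name not in event:
            errors.append(f"missing field: {name}")
            continue
        value = event[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            errors.append(f"wrong type for {name}: {type(value).__name__}")
    params = event.get("params")
    if isinstance(params, dict):
        for key, value in params.items():
            if not isinstance(key, str) or not isinstance(value, str):
                errors.append(f"params entry {key!r} is not string to string")
    return errors


if __name__ == "__main__":
    event = build_api_request_event(
        "192.0.2.1", "ExampleBot/1.0", "enwiki", 42,
        {"action": "query", "prop": "info|revisions"},
    )
    assert validate_api_request_event(event) == []
    print(json.dumps(event, sort_keys=True))
```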

(action, param, value) tuples
We do not want the aggregation tables to count every distinct (action, param, value) tuple that is seen. For some params we will also want to expand a list of values embedded in a single parameter into separate (action, param, value) tuples that are counted individually.

For the initial ETL process we will count these tuples:
 * action=query
   * param=prop, value from exploding the '|' delimited value
   * param=list, value from exploding the '|' delimited value
   * param=meta, value from exploding the '|' delimited value
   * param=generator
 * action=flow
   * param=submodule
Monthly reports
Number of user agents coming from Labs or third party services, on a monthly basis

Volume of API requests coming from Labs or third party services, on a monthly basis

Ranking of user agents coming from Labs or third party services with the highest activity, on a monthly basis

Ranking of most requested actions/parameters, on a monthly basis

Magnitude estimates from existing data
Some figures on the magnitude of the data set, taken from the existing webrequests data for 2015-11-01:
 * Requests per day: 464,794,956
 * Distinct user agents: 337,360
 * Distinct user agents with >1,000,000 requests: 65
 * Distinct user agents with >100,000 requests: 446
 * Distinct user agents with >10,000 requests: 2,118
 * Distinct user agents with >1,000 requests: 9,495
 * 50% of requests made by top 48 user agents
 * 75% of requests made by top 256 user agents
 * 95% of requests made by top 4,228 user agents
 * Top user agent: "-" (unspecified) 38,342,930 requests
 * Top user agent that is not a common web browser: "Peachy MediaWiki Bot API Version 2.0 (alpha 8)" 8,674,297 requests
 * 5 of the top 10 user agents are web browsers (presumably Ajax requests for API data)
 * Traffic percentages: 90% external, 9% labs, 1% internal
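
Figures of the kind listed above (user agents over a request threshold, and the number of top user agents that account for a given share of traffic) could be computed from per user agent request counts roughly as follows. This is a sketch only: the function names are hypothetical, the sample counts are made up for illustration, and it does not reproduce the query that produced the numbers above.

```python
def agents_above(request_counts, threshold):
    """Count user agents with strictly more than `threshold` requests."""
    return sum(1 for count in request_counts if count > threshold)


def agents_needed_for_share(request_counts, percent):
    """Return how many of the busiest user agents are needed to reach
    `percent` percent of all requests.

    `request_counts` holds one request total per distinct user agent.
    Integer arithmetic is used to avoid floating point rounding issues.
    """
    total = sum(request_counts)
    if total == 0:
        return 0
    running = 0
    ranked = sorted(request_counts, reverse=True)
    for rank, count in enumerate(ranked, start=1):
        running += count
        if running * 100 >= percent * total:
            return rank
    return len(ranked)


if __name__ == "__main__":
    # Made-up counts for four user agents, for illustration only.
    counts = [10, 40, 20, 30]
    print(agents_above(counts, 15))
    for percent in (50, 75, 95):
        print(percent, agents_needed_for_share(counts, percent))
```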