Requests for comment/Automata tagging in x analytics

An ongoing need from a data perspective is a consistent way of distinguishing different classes of automated requests: namely, automated requests that result in human interaction with our content versus those that do not. This proposal evaluates the options available for achieving this, then recommends that a new HTTP header field be added to Wikimedia servers and recommended for implementation within client libraries.

Requests to Wikimedia servers
Wikimedia's servers receive ~7-8B requests for assets a day, from all over the world. Many of these requests come directly from human users through their browsers, whether mobile or desktop, but many more come from automated systems. These automated requests sometimes have humans behind them, and they cover a variety of use cases, including:


 * 1) Automated requests made through semi-automated editing tools, where there is a single user behind each unique source of requests;
 * 2) Automated requests made through fully-automated tools, such as research code or "bots", where no human review is involved;
 * 3) Automated requests made through fully-automated caching systems, such as third-party reusers that request content they are missing when a user asks for it, then cache the result for future reuse.

From the perspective of the access logs, however, all successful requests - from humans or from automata - look exactly the same. We have no easy way of identifying which requests come from automata or, among those, what purpose each one serves. This makes it difficult to gauge the scale of our human traffic, or to pinpoint whether changes in reader behaviour are real or are caused by shifts in automata traffic. Such shifts are still noteworthy and worth investigating, but they require a very different approach.

Automata tagging solutions
There are a few solutions that might allow us to tag automata. The first, and most obvious, is centred on the open-source ua-parser tool, which is already applied to the request logs as they come into our storage systems. It is capable of identifying a vast number of common automata sources - namely, crawlers. However, the project is regular-expression based, which makes it expensive to run across user agents en masse, and it only identifies web crawlers, not request frameworks within programming languages. Moreover, ua-parser is designed as a generalised parser, not one specific to Wikipedia, and so does nothing (and can do nothing) for the large number of Wikipedia-specific crawlers.

An alternative would be to develop our own regex framework for tagging spiders and automated traffic. This has the advantage of being more accurate, but it could become computationally expensive and would certainly consume a lot of human time: it would require constant, consistent maintenance to ensure we are tagging as many automata as possible, as often as possible. It would also not let us easily distinguish different types of traffic - a "Python urllib" user agent could belong to a research project just as easily as to an actual service surfacing content to people.
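The regex approach, and the limitation just described, can be sketched in a few lines of Python. This is only an illustration: the patterns, tag names and function name below are hypothetical and are not taken from ua-parser or from any existing Wikimedia code.

```python
import re

# Hypothetical patterns, for illustration only. A real list would be far
# longer and would need the ongoing maintenance described above.
AUTOMATA_PATTERNS = [
    ("crawler", re.compile(r"bot|spider|crawl", re.IGNORECASE)),
    ("library", re.compile(r"python-urllib|python-requests|curl|java/", re.IGNORECASE)),
]


def tag_user_agent(user_agent):
    """Return the first matching tag for a user agent, or "untagged"."""
    for tag, pattern in AUTOMATA_PATTERNS:
        if pattern.search(user_agent):
            return tag
    return "untagged"


# Two very different systems can send exactly the same user agent string,
# so a regex can only ever give them the same tag.
research_script_ua = "Python-urllib/3.11"
reader_facing_service_ua = "Python-urllib/3.11"
same_tag = tag_user_agent(research_script_ua) == tag_user_agent(reader_facing_service_ua)
```

Here `same_tag` is necessarily true: the user agent carries no information about what the requester is doing with the content, however carefully the patterns are maintained.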

Finally, we could rely on client-side information: something like a field, supplied by the client and passed on to x_analytics once sanitised, that identifies a particular class of user request. This would let us granularly tag requests in a consistent format. The disadvantage of this approach is that clients may not follow a guideline requesting it. But a certain proportion of clients already fail to follow our user agent guidelines, which makes granular tagging impossible even with regex frameworks, so we are not necessarily losing any data that we aren't already losing. We would simply be getting the data clients already send us in a more nuanced form.

Proposal
In order to enable our analysts, researchers and analytics engineers to better understand the nature of the automated traffic hitting our servers, we propose amending the API guidelines to mandate adding a new HTTP header to any automated requests from a system operating above a certain scale (say, 1,000 requests/day). This header would contain one of several values:

The value passed in by the client would then be included in the x_analytics field, a semicolon-delimited field of key-value pairs that contains information such as HTTPS status or Wikimedia Zero provider. To avoid bad data being intentionally passed in, this inclusion would be handled in our Varnish caching layer, using VCL (the Varnish Configuration Language), which would compare the provided value to the list of permitted values so that invalid values are not included.
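The validation step can be sketched as follows. The real check would live in VCL; Python is used here purely to show the logic. The key name `automata`, the accepted values and the sample field contents are all hypothetical placeholders, not the names this proposal defines.

```python
# Hypothetical key and accepted values, for illustration only.
TAG_KEY = "automata"
ACCEPTED_VALUES = {"semi-automated", "fully-automated", "caching"}


def parse_x_analytics(field):
    """Split a 'k1=v1;k2=v2' string into a dict, preserving pair order."""
    pairs = {}
    for item in field.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            pairs[key.strip()] = value.strip()
    return pairs


def add_automata_tag(field, client_value):
    """Append the client's value to the field only if it is on the allowlist."""
    pairs = parse_x_analytics(field)
    value = (client_value or "").strip().lower()
    if value in ACCEPTED_VALUES:
        pairs[TAG_KEY] = value
    # Anything not on the allowlist is silently dropped, so arbitrary
    # client-supplied strings never reach the logged field.
    return ";".join(f"{key}={val}" for key, val in pairs.items())
```

For example, `add_automata_tag("https=1", "caching")` yields `https=1;automata=caching`, while an unrecognised value leaves the field as `https=1`.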

The inclusion of this header would not be strongly enforced. As described, an existing population is already unable to follow (or unaware of) our API guidelines around user agents, so full adherence isn't expected, and it quickly becomes a losing game to chase after each and every Python user. Instead, we would:


 * 1) Include it within the API guidelines and documentation;
 * 2) Document examples of how to incorporate this header using common programming languages for interfacing with Wikimedia sites;
 * 3) Reach out to existing, identifiable API users and ask them to switch over.
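As an indication of what the documented examples in item 2 might look like, here is a minimal sketch using Python's standard `urllib.request`. The header name `X-Example-Automata-Class`, its value and the user agent string are hypothetical placeholders; the real header name and permitted values would come from the amended API guidelines.

```python
import urllib.request

# Hypothetical header name, standing in for whatever the guidelines specify.
AUTOMATA_HEADER = "X-Example-Automata-Class"


def build_request(url, automata_class, user_agent):
    """Build (but do not send) a request carrying the classification header."""
    return urllib.request.Request(
        url,
        headers={
            "User-Agent": user_agent,
            AUTOMATA_HEADER: automata_class,
        },
    )


request = build_request(
    "https://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&format=json",
    "fully-automated",  # hypothetical value
    "ExampleResearchBot/0.1 (researcher@example.org)",
)
# To actually send it: urllib.request.urlopen(request)
```

Setting the header once, where requests are constructed, keeps the classification next to the descriptive user agent that the guidelines already ask for.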