Requests for comment/Automata tagging in x analytics

Requests to Wikimedia servers
Wikimedia's servers receive ~7-8B requests for assets a day, from all over the world. Many of these requests come from human users directly through their browsers, be they mobile or desktop, but many more come from automated systems. These automated requests do sometimes have humans behind them, and they cover a variety of use cases, including:

 * 1) Automated requests made through semi-automated editing tools, where there is a single human user behind each unique source of requests;
 * 2) Automated requests made through fully-automated tools, such as research code or "bots", where no human review is involved;
 * 3) Automated requests made through fully-automated caching systems, such as third-party reusers that request content they are missing when a user asks for it, but then cache the result for future reuse.

From the perspective of the access logs, however, all successful requests - from humans or from automata - look exactly the same. We have no easy way of identifying which come from automata or, within that, what purpose each automated request serves. This makes it difficult to gauge the scale of our human traffic, or to pinpoint whether changes in reader behaviour are real or caused by shifts in automata traffic - shifts that are still noteworthy and worth investigating, but that require a very different approach.

Automata tagging solutions
There are a few solutions here that might allow us to tag automata. The first, and most obvious, is centred on the open-source ua-parser tool, which is already applied to the request logs as they come into our storage systems. It is capable of identifying a vast number of common automata sources - namely, web crawlers. However, the project is regular-expression based, expensive to run across user agents en masse, and only identifies web crawlers, not request frameworks within programming languages. Moreover, ua-parser is designed as a generalised parser, not one specific to Wikipedia, and so does nothing (and can do nothing) about the large number of Wikipedia-specific crawlers.
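To make the matching model concrete, the sketch below mimics - in spirit, not in ua-parser's actual API - how a generalised regex parser classifies user agents: each agent string is tried against an ordered list of patterns until one matches. The patterns here are illustrative only; the real project ships hundreds of regexes, which is where the per-request cost comes from.

```python
import re

# Illustrative patterns only - a generalised parser like ua-parser ships
# hundreds of these, which is why running it en masse is expensive.
CRAWLER_PATTERNS = [
    (re.compile(r"Googlebot"), "Googlebot"),
    (re.compile(r"bingbot", re.IGNORECASE), "bingbot"),
    (re.compile(r"Twitterbot"), "Twitterbot"),
]

def classify_user_agent(ua: str) -> str:
    """Try each pattern in order; return the first matching crawler family.

    Anything unmatched - including request frameworks such as
    Python-urllib, and Wikipedia-specific crawlers - falls through to
    'Other', which is exactly the gap described above.
    """
    for pattern, family in CRAWLER_PATTERNS:
        if pattern.search(ua):
            return family
    return "Other"
```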

An alternative would be developing our own regex framework for tagging spiders and automated traffic. This has the advantage of being more accurate, but could become computationally complex and would certainly consume a lot of human time - it would require constant and consistent maintenance to ensure we are tagging as many automata as possible, as often as possible. It would also not allow us to easily distinguish between different types of traffic - "Python urllib" could be a research project just as easily as it could be an actual service surfacing requests to people.
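A home-grown framework would extend the same idea with patterns for request libraries and Wikipedia-specific automata. The sketch below - with hypothetical patterns and tag names - shows both the gain and the limit: we can tag "Python-urllib" traffic, but the tag itself cannot say whether that traffic is a research script or a production service.

```python
import re

# Hypothetical patterns for a Wikimedia-maintained tagger; a production
# version would need constant curation to keep up with new automata.
AUTOMATA_PATTERNS = [
    (re.compile(r"pywikibot", re.IGNORECASE), "framework:pywikibot"),
    (re.compile(r"Python-urllib"), "framework:python-urllib"),
    (re.compile(r"\bwget\b|\bcurl\b", re.IGNORECASE), "tool:cli-fetcher"),
]

def tag_automata(ua: str) -> str:
    """Return an automata tag for a user agent, or 'untagged'.

    Note the ambiguity the text describes: 'framework:python-urllib'
    covers a one-off research project and a high-volume service equally.
    """
    for pattern, tag in AUTOMATA_PATTERNS:
        if pattern.search(ua):
            return tag
    return "untagged"
```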

Finally, we could rely on client-side information: something like a field passed in x_analytics (and sanitised) that identifies a particular class of user request. This would let us tag requests granularly and in a consistent format. The disadvantage of this approach is that clients may not follow a guideline requesting this - but since a certain proportion of clients already fail to follow our user agent guidelines, and in doing so ensure that granular tagging is impossible even with regex frameworks, we're not necessarily losing any data we aren't already losing. We're just getting the data clients already send us in a more nuanced fashion.
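As a sketch of the server side: x_analytics is a semicolon-delimited list of key=value pairs, so reading and sanitising a client-supplied tag is cheap. The field name (`client_type`) and its allowed values below are hypothetical, chosen to mirror the three request classes listed earlier; anything outside the allowlist is discarded during sanitisation.

```python
# 'client_type' and its allowed values are hypothetical, mirroring the
# three automated-request classes described at the top of this RfC.
ALLOWED_CLIENT_TYPES = {"semi-automated", "bot", "cache"}

def parse_x_analytics(header: str) -> dict:
    """Split a semicolon-delimited x_analytics string into a key/value dict."""
    pairs = {}
    for part in header.split(";"):
        key, sep, value = part.partition("=")
        if sep:  # ignore malformed fragments with no '='
            pairs[key.strip()] = value.strip()
    return pairs

def client_type(header: str) -> str:
    """Return the sanitised client_type tag, or 'unknown' if absent or invalid."""
    value = parse_x_analytics(header).get("client_type", "")
    return value if value in ALLOWED_CLIENT_TYPES else "unknown"
```

The allowlist is the sanitisation step: it means a misbehaving client can at worst be counted as "unknown", never inject arbitrary values into the analytics pipeline.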