Analytics/Archive/Logging infrastructure/status

Last update on: 2012-05-monthly

2012-05-22
A new version of udp-filter that accepts a variable number of fields will be deployed soon. This will allow us to migrate more custom C filters to udp-filter. udp-filter can now also filter by HTTP response status and geocode alongside the IP address.
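
As a rough illustration of status filtering on lines with a varying number of fields, here is a minimal Python sketch. It is not udp-filter's actual code (udp-filter is written in C). The space-delimited layout, the position of the status field (`status_index=5`), and the `TCP_MISS/200` style value are assumptions made for the example, and the function names are hypothetical.

```python
def status_of(line, status_index=5):
    """Return the HTTP status of a log line, or None if the line is too short.

    Assumes space-delimited fields with the status at status_index (an
    assumption for this sketch), written either as "200" or "TCP_MISS/200".
    """
    fields = line.split(" ")
    if len(fields) <= status_index:
        return None
    # Keep only the part after the last "/", so "TCP_MISS/200" becomes "200".
    return fields[status_index].rpartition("/")[2]


def filter_by_status(lines, wanted, status_index=5):
    """Keep lines whose status is in wanted, however many fields they have."""
    return [line for line in lines if status_of(line, status_index) in wanted]
```

Because the status is looked up by position and lines are only required to be long enough to contain it, extra trailing fields do not affect the match.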

2012-05-10
We have added a third log collector machine (oxygen) to supplement our current collectors (locke and emery). Andrew is working out a strategy for dealing with errant spaces in nginx logs that throw off our logging scripts. We are also figuring out how to better match wikipedia-zero traffic, and will probably add a custom response header.

2012-05-monthly
We've got a plan in the works to improve logging sources (squid, varnish, nginx, etc.). The plan includes adding more fields, and also allowing us to add arbitrary fields in the future without breaking features. You can see the outline and status of the plan here: https://www.mediawiki.org/wiki/Analytics/Pageview_logging

Changing the field formats of the logging sources requires some babysitting and coordination with ops; we hope to do this soon. The format changes have been committed but not yet deployed (see https://gerrit.wikimedia.org/r/#/c/6526/).

udp-filter has been made more flexible and has gained some nifty features. It can now geocode and anonymize inline, in the same field as the IP address, so that later log parsers don't have to detect a new field. (We don't want udp-filter itself to add new fields.) During the Berlin Hackathon, a patch was submitted that allows udp-filter to filter IPv6 addresses. We hope to incorporate this soon as well.
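
The idea of rewriting the IP field in place, rather than appending a field, can be sketched in Python as follows. This is an illustration only, not udp-filter's implementation. The `ip|country` output form, the `0.0.0.0` placeholder for anonymized addresses, the field position (`ip_index=4`), and the tiny lookup table of documentation-range networks are all assumptions invented for the example.

```python
import ipaddress

# Hypothetical stand-in for a real GeoIP database (documentation address ranges).
HYPOTHETICAL_GEO = {
    ipaddress.ip_network("192.0.2.0/24"): "US",
    ipaddress.ip_network("198.51.100.0/24"): "DE",
}


def country_for(ip_text):
    """Return a country code for ip_text, or "--" if unknown or unparsable."""
    try:
        ip = ipaddress.ip_address(ip_text)
    except ValueError:
        return "--"
    for network, country in HYPOTHETICAL_GEO.items():
        if ip in network:
            return country
    return "--"


def rewrite_ip_field(line, ip_index=4, anonymize=True):
    """Geocode (and optionally anonymize) inside the IP field itself.

    The number of space-delimited fields is unchanged, so downstream
    parsers that index fields by position keep working.
    """
    fields = line.split(" ")
    if len(fields) <= ip_index:
        return line  # too short to contain an IP field; pass through untouched
    ip_text = fields[ip_index]
    shown = "0.0.0.0" if anonymize else ip_text
    fields[ip_index] = shown + "|" + country_for(ip_text)
    return " ".join(fields)
```

A parser that reads the IP field by position still finds it in the same place and can split on `|` if it wants the country code.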