Analytics/Archive/Logging infrastructure/status

Last update on: 2012-09-monthly

2012-05-22
We will soon deploy a new version of udp-filter that accepts a variable number of fields. This will allow us to migrate more custom C filters to udp-filter. udp-filter can now also filter by HTTP response status and geocode alongside the IP address.
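
As a rough illustration of status filtering that tolerates a variable number of fields, here is a minimal Python sketch. It is not the udp-filter implementation (which is written in C). The space-separated layout, the `status_index` default, and the Squid-style `TCP_MISS/200` status field are assumptions made for the example.

```python
def status_of(line, status_index=5):
    """Return the HTTP status code found in a log line, or None.

    Assumes space-separated fields with a Squid-style "RESULT/STATUS"
    value at status_index (a hypothetical position). Extra trailing
    fields are ignored, so lines with more fields still parse.
    """
    fields = line.rstrip("\n").split(" ")
    if len(fields) <= status_index:
        return None  # line is too short to contain a status field
    # "TCP_MISS/200" -> "200"; a bare "200" is also accepted
    _, _, code = fields[status_index].rpartition("/")
    return code if code.isdigit() else None


def filter_by_status(lines, wanted):
    """Keep only the lines whose status code is in the set `wanted`."""
    return [line for line in lines if status_of(line) in wanted]
```

Because only the status position is fixed, adding fields at the end of a line does not change which lines pass the filter.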

2012-05-10
We have added a third log collector machine (oxygen) to supplement our current collectors (locke and emery). Andrew is working out a strategy for dealing with errant spaces in nginx logs that throw off our logging scripts. We are also figuring out how to better match wikipedia-zero traffic, and will probably add a custom response header.

2012-05-monthly
Our plan to improve logging sources (Squid, Varnish, nginx, etc.) includes adding more fields, and also allowing us to add arbitrary fields in the future without breaking features. Changing the field formats of the logging sources requires coordination with the Operations team. The format changes have been committed, but not yet deployed. udp-filter has been modified so that it is more flexible, and a few features have been added as well: it can now geocode and anonymize inline in the same field as the IP address, so that later log parsers don't have to try to detect a new field.
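
The idea of geocoding and anonymizing inline, without changing the field count, can be sketched in Python as follows. This is an illustration under stated assumptions, not udp-filter's actual output format: the `|` separator, the `0.0.0.0` placeholder, the `ip_index` default, and the `GEO_TABLE` lookup (a stand-in for a real GeoIP database) are all hypothetical.

```python
import ipaddress

# Hypothetical lookup table standing in for a real GeoIP database.
GEO_TABLE = {
    ipaddress.ip_network("192.0.2.0/24"): "US",
    ipaddress.ip_network("198.51.100.0/24"): "DE",
}


def geocode(ip):
    """Return a country code for ip, or "--" if it is not in the table."""
    addr = ipaddress.ip_address(ip)
    for net, country in GEO_TABLE.items():
        if addr in net:
            return country
    return "--"


def rewrite_ip_field(line, ip_index=4, anonymize=True):
    """Replace the IP field with "<ip>|<country>" in place.

    The number of space-separated fields is unchanged, so downstream
    parsers that index fields by position keep working.
    """
    fields = line.split(" ")
    ip = fields[ip_index]
    country = geocode(ip)  # look up before the address is discarded
    shown = "0.0.0.0" if anonymize else ip
    fields[ip_index] = f"{shown}|{country}"
    return " ".join(fields)
```

The lookup has to happen before the address is replaced, which is one reason to do both steps in the same pass over the field.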

2012-06-03
During the Berlin Hackathon, a patch was submitted that allows udp-filter to do IPv6 address filtering. We hope to incorporate this soon.

2012-06-monthly
A change to add two new headers to the logging fields has been submitted. We are waiting on the go-ahead from consumers to merge and deploy this.



2012-07-monthly
Modified the Lucene lsearchd code to use a log4j appender for udp2log rather than manually editing the codebase. Also built scribe and the scribe log4j appenders for sending arbitrary logs to scribe. No movement on log format changes.

2012-09-monthly
* Augmented udp-filter to take CIDR ranges, and to consistently anonymize IP addresses.
* Worked with Zero team to make sure incoming log filters are consistent.
* Puppetized rsync module to allow for easy syncing of data between udp2log machines and stat1.
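
For readers unfamiliar with CIDR matching, the following Python sketch shows the general technique of filtering addresses against a list of ranges and replacing them with a fixed placeholder. It uses the standard `ipaddress` module and is not udp-filter's C implementation. The function names and the placeholder values (`0.0.0.0` and `::`) are assumptions for illustration.

```python
import ipaddress


def parse_ranges(specs):
    """Turn strings such as "10.0.0.0/8" into network objects."""
    # strict=False tolerates host bits being set, e.g. "10.1.2.3/8"
    return [ipaddress.ip_network(spec, strict=False) for spec in specs]


def matches(ip, ranges):
    """Return True if ip falls inside any of the given CIDR ranges."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        return False  # not a parseable IPv4 or IPv6 address
    return any(
        addr.version == net.version and addr in net for net in ranges
    )


def anonymize(ip):
    """Replace an address with a fixed placeholder of the same family."""
    addr = ipaddress.ip_address(ip)
    return "0.0.0.0" if addr.version == 4 else "::"
```

Usage: `matches("10.1.2.3", parse_ranges(["10.0.0.0/8"]))` evaluates to `True`. Applying the same `anonymize` function in every filter is one way to keep the anonymized output consistent across filters.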