Analytics/Archive/Logging infrastructure/status


Last update on: 2014-04-monthly


We will soon deploy a new version of udp-filter that accepts a variable number of fields. This will allow us to migrate more custom C filters to udp-filter. udp-filter can now also filter by HTTP response status and geocode alongside the IP address.


We have added a third log collector machine (oxygen) to supplement our current collectors (locke and emery). Andrew is working out a strategy for dealing with errant spaces in nginx logs that throw off our logging scripts. We are also figuring out how to better match Wikipedia Zero traffic, and will probably add a custom response header.


Our plan to improve logging sources (Squid, Varnish, nginx, etc.) includes adding more fields and making it possible to add arbitrary fields in the future without breaking features. Changing the field formats of the logging sources requires coordination with the Operations team. The format changes have been committed, but not yet deployed. udp-filter has been made more flexible and has gained a few features: it can now geocode and anonymize inline, in the same field as the IP address, so that downstream log parsers do not have to detect a new field.
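As an illustration of why inline geocoding keeps downstream parsers simple, here is a minimal Python sketch. It is not udp-filter's actual code (udp-filter is a separate tool); the space-delimited line layout, the `|` separator, the function name `annotate_ip_field`, and the toy lookup table are all assumptions made for the example.

```python
from typing import Callable

# Hypothetical lookup table standing in for a real GeoIP database.
COUNTRY_BY_IP = {"192.0.2.10": "US"}


def toy_geocode(ip: str) -> str:
    """Return a country code for ip, or 'XX' when unknown (assumed convention)."""
    return COUNTRY_BY_IP.get(ip, "XX")


def annotate_ip_field(line: str, ip_index: int,
                      geocode: Callable[[str], str], sep: str = "|") -> str:
    """Append the country code to the IP field itself.

    The number of space-delimited fields is unchanged, so a parser that
    indexes fields by position keeps working regardless of how many
    fields the line carries.
    """
    fields = line.rstrip("\n").split(" ")
    if ip_index >= len(fields):
        # Line is shorter than expected; pass it through untouched.
        return " ".join(fields)
    ip = fields[ip_index]
    fields[ip_index] = f"{ip}{sep}{geocode(ip)}"
    return " ".join(fields)


if __name__ == "__main__":
    print(annotate_ip_field("host1 42 192.0.2.10 GET", 2, toy_geocode))
```

Because the annotation lives inside the existing field, the field count before and after is identical, which is the property the status note relies on.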


During the Berlin Hackathon, a patch was submitted that allows udp-filter to do IPv6 address filtering. We hope to incorporate this soon.


A change adding two new headers to the logging fields has been submitted. We are waiting on the go-ahead from consumers to merge and deploy it.


Modified the lucene lsearchd code to use a log4j appender for udp2log rather than manually editing the codebase. Also built scribe and scribe log4j appenders for sending arbitrary logs to scribe. No movement on log format changes.


    • Augmented udp-filter to take CIDR ranges, and to consistently anonymize IP addresses.
    • Worked with Zero team to make sure incoming log filters are consistent.
    • Puppetized rsync module to allow for easy syncing of data between udp2log machines and stat1.
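The CIDR filtering and consistent anonymization mentioned in the list above can be sketched as follows. This is only an illustration in Python using the standard `ipaddress` and `hmac` modules; it is not how udp-filter implements these features, and the function names, the keyed-hash approach, and the 16-character token length are assumptions.

```python
import hashlib
import hmac
import ipaddress
from typing import Iterable


def in_any_range(ip: str, cidr_ranges: Iterable[str]) -> bool:
    """Return True if ip falls inside at least one CIDR range (IPv4 or IPv6)."""
    addr = ipaddress.ip_address(ip)
    for cidr in cidr_ranges:
        net = ipaddress.ip_network(cidr, strict=False)
        # Only compare addresses and networks of the same IP version.
        if addr.version == net.version and addr in net:
            return True
    return False


def anonymize(ip: str, key: bytes) -> str:
    """Map an IP address to a stable pseudonymous token.

    'Consistent' here means the same address always yields the same token
    for a given key, so per-client counts remain possible without the raw IP.
    The address is normalized first so equivalent spellings agree.
    """
    canonical = str(ipaddress.ip_address(ip))
    digest = hmac.new(key, canonical.encode("ascii"), hashlib.sha256)
    return digest.hexdigest()[:16]


if __name__ == "__main__":
    key = b"example-secret"  # hypothetical key, for demonstration only
    print(in_any_range("10.1.2.3", ["10.0.0.0/8"]))
    print(anonymize("10.1.2.3", key))
```

A keyed hash is one of several possible designs; truncating or zeroing the low-order bits of the address is another common option with different trade-offs.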

Ongoing admin work with users of stat1 and stat1001.

  • Moved hosting from spence over to stat1001.
  • Set up easy deployment of data generated on stat1 to stat1001.


Contractor Stefan Petrea has worked on bug fixes in wikistats, which we also migrated to git. On the udp2log front, we added features in udp-filters and webstatscollector, and deployed a new banner impression filter.


It was a quiet month for the logging infrastructure; things were running fine. We have been working on a patch to fix bug 45178, which we will try to deploy in March.


  • We have started testing Kafka and its failover behavior in a multi-datacenter setup. So far, the results have been very encouraging: the failure of a broker is detected very quickly, and the producers start sending data to the backup broker with almost no data loss. We have decided to use JSON as the new message format, in combination with Snappy compression, for sending data from the Kafka producers to the Kafka brokers.


We've increased the throughput on Kafka from 6K Requests Per Second (RPS) to 50K RPS to test stability under higher loads.


We continue to investigate network issues between our data centers that are causing occasional delivery issues. As noted above, we are currently deploying Camus, our software for transferring data between Kafka and Hadoop.


Data from text Varnishes is now being consumed through varnishkafka -> kafka -> camus into hdfs. Kafka now processes Bits, Images and Text data.