Analytics/Logging infrastructure

Documents

 * User requirements:
 * Specifications:
 * Software design document:
 * Test plan:
 * Documentation plan:
 * User interface design docs:
 * Schedule:
 * Task management:
 * Release management plan:
 * Communications plan:

Log Format Changes
The Analytics Team wants to make a number of changes to the web server logs to collect more data and to fix some issues with the output format. We propose the following changes:


 * 1) Add the X-Carrier header to be able to identify Wikipedia Zero traffic. I have sent a proposal for shortening country names and I have asked Amit to supply abbreviations for mobile carrier names.
 * 2) Add the Accept-Language header
 * 3) Use the tab character as the field delimiter instead of the space. This is probably the biggest change and it will affect everyone who uses the server logs. See the Overall Plan for how we want to make this transition as smooth as possible.

These log format changes will need to be made for squid, varnishncsa, and nginx.

Overall Plan
We suggest the following approach to introduce these changes without disrupting the existing workflow:


 * 1) Andrew has built in Labs an nginx/varnish/squid mediawiki configuration where we can extensively test the new configuration of the server logs.
 * 2) We will generate test data and supply that to Erik Zachte and give him ample time to adjust his scripts.
 * 3) Once we receive a thumbs up from Erik Zachte, we will tell all the other log file consumers when we are going to deploy the changes on the servers. In particular, the fundraising team is an important consumer of log data that will be affected by this change.
 * 4) Deploy changes. See the Deployment Plan below.

Software Changes
We will have to make changes to the following programs:

Webstatscollector
TODO: What needs to be done here?

udp-filter
 * Remove exact field count requirement.
 * Add ability to filter by HTTP response code.
 * Use \t as field delimiter.

The first two changes should be deployed before we finally switch to \t.
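
A rough Python sketch of the first two udp-filter changes. udp-filter itself is a C program, so this only illustrates the intended behaviour. The field positions are assumptions: the status is taken to be the sixth field in the form `TCP_MISS/200`, and `MIN_FIELDS` is a hypothetical lower bound that replaces the exact field count check.

```python
MIN_FIELDS = 14      # hypothetical: field count of the current format; extra fields are allowed
STATUS_INDEX = 5     # assumption: sixth field, e.g. "TCP_MISS/503"


def keep_line(line, status_prefix, delimiter=" "):
    """Return True if the line has enough fields and its HTTP status starts with status_prefix."""
    fields = line.rstrip("\n").split(delimiter)
    if len(fields) < MIN_FIELDS:          # "at least", not "exactly", MIN_FIELDS
        return False
    # Drop the cache result ("TCP_MISS/") and keep only the numeric status.
    status = fields[STATUS_INDEX].rsplit("/", 1)[-1]
    return status.startswith(status_prefix)
```

Because the check is a lower bound, lines that gain the X-Carrier and Accept-Language fields would still pass, and the same function works before and after the delimiter switch by passing `delimiter="\t"`.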

AWK scripts
These do not need to be changed, as they currently split on any whitespace character and will behave as they do now either way. However, to make them more accurate than they currently are, we should change them to split on \t rather than on any whitespace.
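
To illustrate the accuracy point, here is a small Python sketch of the two splitting behaviours. `str.split()` with no argument behaves roughly like awk's default field splitting (runs of whitespace), while `split("\t")` corresponds to `awk -F'\t'`. The sample line is made up and much shorter than a real log line.

```python
# Made-up, shortened log line: the last field (a user agent) contains spaces.
sample = "\t".join([
    "sq18.wikimedia.org",
    "200",
    "http://en.wikipedia.org/wiki/Main_Page",
    "Mozilla/5.0 (X11; Linux x86_64)",
])

by_whitespace = sample.split()     # roughly awk's default: any run of whitespace
by_tab = sample.split("\t")        # roughly awk -F'\t'

print(len(by_whitespace))  # 7: the user agent is broken into 4 pieces
print(len(by_tab))         # 4: one entry per logged field
```

Fields that come before the first space-containing value land at the same index either way, which is why the existing scripts keep working; only a tab split keeps the later fields intact.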

C-based filter scripts
Migrate these to use udp-filter.

latlongCountry-writer
This currently prepends CountryCode lat,lon to log lines. Will it be ok if we change the format to what udp-filter does with -g -b everything?

TODO: talk to someone about latlongCountry-writer.

India
-pipe 10 /a/squid/india-filter >> /a/squid/india.log
+pipe 10 /usr/bin/udp-filter -c IN -g -b country -m /usr/share/GeoIP/GeoIP.dat >> /var/log/squid/india.log

Mobile
-pipe 100 /a/squid/m-filter >> /a/squid/mobile.log
+pipe 100 /usr/bin/udp-filter -d m.wikipedia.org >> /a/squid/mobile.log

India
''Do we really need two India filters? One is already on emery.''
-pipe 10 /a/squid/india-filter >> /a/squid/india.log
+pipe 10 /usr/bin/udp-filter -c IN -g -b country -m /usr/share/GeoIP/GeoIP.dat >> /var/log/squid/india.log

Edits
-pipe 1 /a/squid/edits-filter >> /a/squid/edits.log
+pipe 1 /usr/bin/udp-filter -p "action=edit,action=submit" >> /a/squid/edits.log

5xx errors
-pipe 1 /a/squid/5xx-filter | awk -W interactive '$9 !~ "upload.wikimedia.org|query.php"' >> /a/squid/5xx.log
+pipe 1 /usr/bin/udp-filter --http-status='50' | awk -W interactive '$9 !~ "upload.wikimedia.org|query.php"' >> /a/squid/5xx.log

Fundraising Landing Pages
-pipe 1 /a/squid/fundraising/lp-filter >> /a/squid/fundraising/logs/landingpages.log
+pipe 1 udp-filter -d wikimediafoundation.org,donate.wikimedia.org >> /a/squid/fundraising/logs/landingpages.log

Fundraising Banner Impressions
-pipe 100 /a/squid/fundraising/bi-filter >> /a/squid/fundraising/logs/bannerImpressions-sampled100.log
+pipe 100 /usr/bin/udp-filter -p 'Special:BannerLoader' >> /a/squid/fundraising/logs/bannerImpressions-sampled100.log

packet-loss filter

 * Use \t as field delimiter.

Wikipedia Zero Filters
Modify the IP range filters to a single udp-filter that matches on X-Carrier header. This can be done after the additional fields are added to log sources.
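
The difference between the two approaches can be sketched in Python as follows. Everything concrete here is hypothetical: the IP ranges (taken from the documentation-only address blocks), the carrier names, and the position of the X-Carrier field, which depends on where the header ends up in the final log format.

```python
import ipaddress

# Hypothetical carrier ranges (documentation-only addresses); today each carrier needs its own filter.
CARRIER_RANGES = {
    "carrier-a": [ipaddress.ip_network("192.0.2.0/24")],
    "carrier-b": [ipaddress.ip_network("198.51.100.0/24")],
}


def carrier_by_ip(client_ip):
    """Current approach: look the client IP up in the per-carrier ranges."""
    addr = ipaddress.ip_address(client_ip)
    for carrier, networks in CARRIER_RANGES.items():
        if any(addr in net for net in networks):
            return carrier
    return None


def carrier_by_header(fields, x_carrier_index):
    """Proposed approach: read the carrier from the logged X-Carrier field.

    x_carrier_index is an assumption; "-" is assumed to mean the header was absent.
    """
    value = fields[x_carrier_index]
    return None if value in ("", "-") else value
```

With the header approach there is no range table to keep in sync with each carrier's network, which is what allows a single filter to cover all carriers.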

sqstat.pl
Need to talk to Asher about this.
 * Use \t as field delimiter.

varnishncsa, nginx, and squid log formats

 * varnishncsa.default
 * nginx.conf.erb
 * frontend generate squid .php template

Revert the nginx patch where we escape spaces
This can be done last. Is there a similar patch for varnishncsa?

Update Wikitech documentation
We need to update the Wikitech documentation with the new HTTP headers. I have requested access for us to edit the Wikitech wiki.

Deployment Plan

 * 1) Deploy initial changes to udp-filter.  udp-filter needs to be able to accept a variable number of fields.  As of May 16 2012, this change has been committed and needs to be deployed.
 * 2) Verify that everything works exactly as it did before this change.  Wait a few days to catch any potential problems.
 * 3) Migrate existing custom C scripts to use udp-filter.  This needs to be done after the above udp-filter change has been deployed, to avoid losing log lines that have spaces in some of the fields.
 * 4) Verify that all migrated filters still work properly.  Wait for at least a few days after all filters have been migrated to ensure that things are ok.  It'd be good to get verification from the filter owners before we proceed as well.
 * 5) Deploy log sources change that adds additional fields in log format.  Do not yet deploy the \t change.
 * 6) Verify that all filters continue to work as before, even with the addition of extra fields.
 * 7) TODO: Work out \t deployment plan
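
The verification steps (2, 4, and 6) amount to checking that an old filter and its replacement select the same lines from the same input. A minimal Python sketch of such a check, assuming both filters read log lines on stdin and write matching lines to stdout; the command lines passed in are placeholders, not the production configuration.

```python
import subprocess


def filter_output(command, log_lines):
    """Run a filter command, feed it log_lines on stdin, and return the lines it emits."""
    result = subprocess.run(
        command, input="".join(log_lines), capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines(keepends=True)


def compare_filters(old_command, new_command, log_lines):
    """Return (only_old, only_new): lines emitted by one filter but not by the other."""
    old = set(filter_output(old_command, log_lines))
    new = set(filter_output(new_command, log_lines))
    return sorted(old - new), sorted(new - old)
```

Two empty lists mean the filters agree on that sample. Any line that shows up in only one list is a candidate to raise with the filter owner before moving on to the next step.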