Analytics/Legacy Logging

From mediawiki.org

Rationale

Timeline

Documents

  • User requirements:
  • Specifications:
  • Software design document:
  • Test plan:
  • Documentation plan:
  • User interface design docs:
  • Schedule:
  • Task management:
  • Release management plan:
  • Communications plan:

Communications

Proposal for changes to the format and content of the web server logs

Log Format Changes

The Analytics Team wants to make a number of changes to the web server logs to collect more data and to fix some issues with the output format. We propose the following changes:

  1. Add the X-Carrier header to be able to identify Wikipedia Zero traffic. I have sent a proposal for shortening country names and I have asked Amit to supply abbreviations for mobile carrier names.
  2. Add the Accept-Language header.
  3. Use the tab character as the field delimiter instead of the space. This is probably the biggest change, and it will affect everyone who uses the server logs. See the Overall Plan for how we want to make this transition as smooth as possible.

These log format changes will need to be made for squid, varnishncsa, and nginx.
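
To illustrate why the delimiter change matters, here is a minimal sketch. The field values and their order are hypothetical and do not reflect the actual log layout, and the use of "-" for a missing X-Carrier value is an assumption. It only shows that a field containing spaces (such as a user agent) is broken apart when a line is split on spaces, but survives when the line is split on tabs.

```python
# Hypothetical sample values; the real log lines carry more fields than this.
fields = [
    "200",                                     # HTTP status
    "http://en.wikipedia.org/wiki/Main_Page",  # requested URL
    "Mozilla/5.0 (X11; Linux x86_64)",         # user agent, contains spaces
    "en-US,en;q=0.8",                          # proposed Accept-Language field
    "-",                                       # proposed X-Carrier field ("-" when absent is an assumption)
]

space_line = " ".join(fields)
tab_line = "\t".join(fields)

# Splitting on spaces also splits inside the user agent.
space_parts = space_line.split(" ")
# Splitting on tabs recovers exactly the fields that were written.
tab_parts = tab_line.split("\t")

print("fields written:     ", len(fields))
print("fields read (space):", len(space_parts))
print("fields read (tab):  ", len(tab_parts))
```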

Overall Plan

We suggest the following approach to introduce these changes without disrupting the existing workflow:

  1. Andrew has built in Labs an nginx/varnish/squid mediawiki configuration where we can extensively test the new configuration of the server logs.
  2. We will generate test data and supply that to Erik Zachte and give him ample time to adjust his scripts.
  3. Once we receive a thumbs-up from Erik Zachte, we will tell all the other log file consumers when we are going to deploy the changes on the servers. In particular, the fundraising team is an important consumer of log data that will also be affected by this change.
  4. Deploy the changes. See the Deployment Plan section below.

Progress

Summary of Progress

Task                                     Status
Webstatscollector                        Finished
udp-filter                               Finished
AWK scripts                              Finished
C-based filter scripts                   Finished
Wikipedia Zero filters                   Finished
Varnish / squid / nginx config changes   Finished
Update wikitech documentation            Finished

Software Changes

We will have to make changes to the following programs:

Webstatscollector

TODO: What needs to be done here?

udp-filter

  • Remove exact field count requirement.
  • Add ability to filter by HTTP response code.

The above two changes should be deployed before we finally switch to \t.

  • Use \t as field delimiter.
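
The three udp-filter changes above can be sketched as follows. This is a Python illustration of the intended behaviour, not udp-filter's actual implementation. The position of the status field (`STATUS_INDEX`) and its `TCP_MISS/200` shape are assumptions, and `make_line` is a hypothetical helper that only produces sample data.

```python
STATUS_INDEX = 5  # assumption: 0-based position of the "cache_result/http_status" field


def http_status(line, delimiter="\t"):
    """Return the HTTP status of a log line, or None if the line is too short."""
    parts = line.rstrip("\n").split(delimiter)
    # No exact field count is required: any line long enough to contain
    # the status field is accepted, so extra trailing fields are harmless.
    if len(parts) <= STATUS_INDEX:
        return None
    # "TCP_MISS/503" -> "503"; a bare "503" is returned unchanged.
    return parts[STATUS_INDEX].rsplit("/", 1)[-1]


def filter_by_status(lines, prefix, delimiter="\t"):
    """Keep the lines whose HTTP status starts with the given prefix."""
    kept = []
    for line in lines:
        status = http_status(line, delimiter)
        if status is not None and status.startswith(prefix):
            kept.append(line)
    return kept


def make_line(status, extra_fields=()):
    """Hypothetical helper: build a tab-delimited sample line."""
    base = ["host", "1", "2012-05-16T00:00:00", "0.001", "127.0.0.1",
            "TCP_MISS/" + status, "1234", "GET", "http://example.org/"]
    return "\t".join(base + list(extra_fields))


sample = [
    make_line("200"),
    make_line("503"),
    make_line("500", ["en-US", "-"]),  # same request shape with two extra fields
    "too\tshort",                      # malformed line, silently skipped
]

for line in filter_by_status(sample, "50"):
    print(line)
```

Matching on a status prefix such as "50" mirrors the `--http-status='50'` usage shown in the 5xx filter further down; whether udp-filter treats that value as a prefix is not confirmed here.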

AWK scripts

These do not need to be changed, because they currently split on any whitespace character and will behave the same way with either delimiter. However, splitting on any whitespace miscounts fields whose values contain spaces, so to make the scripts more accurate than they are now we should change them to split on \t only.
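
A small sketch of that difference, written in Python rather than AWK. `str.split()` with no argument only approximates AWK's default field splitting, and the sample fields are hypothetical.

```python
def whitespace_split(line):
    """Approximate awk's default field splitting: runs of whitespace separate fields."""
    return line.split()


def tab_split(line):
    """Approximate awk -F'\\t': only a tab separates fields."""
    return line.rstrip("\n").split("\t")


# Hypothetical three-field record; the middle field contains spaces.
fields = ["200", "Mozilla/5.0 (compatible; ExampleBot)", "en-US"]
space_line = " ".join(fields)
tab_line = "\t".join(fields)

# Whitespace splitting gives the same result for both delimiters,
# which is why the existing scripts keep behaving as they do today.
same_either_way = whitespace_split(space_line) == whitespace_split(tab_line)

# But the third field is only found at the expected position with tab splitting.
third_by_whitespace = whitespace_split(tab_line)[2]
third_by_tab = tab_split(tab_line)[2]

print(same_either_way, repr(third_by_whitespace), repr(third_by_tab))
```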

C-based filter scripts

Migrate these to use udp-filter.

emery
latlongCountry-writer

This currently prepends CountryCode lat,lon to log lines. Will it be ok if we change the format to what udp-filter does with -g -b everything?

TODO: talk to someone about latlongCountry-writer.

India
 -pipe 10 /a/squid/india-filter >> /a/squid/india.log
 +pipe 10 /usr/bin/udp-filter -c IN -g -b country -m /usr/share/GeoIP/GeoIP.dat >> /var/log/squid/india.log
locke
Mobile
 -pipe 100 /a/squid/m-filter >> /a/squid/mobile.log
 +pipe 100 /usr/bin/udp-filter -d m.wikipedia.org >> /a/squid/mobile.log
India

Do we really need two India filters? One is already on emery.

 -pipe 10 /a/squid/india-filter >> /a/squid/india.log
 +pipe 10 /usr/bin/udp-filter -c IN -g -b country -m /usr/share/GeoIP/GeoIP.dat >> /var/log/squid/india.log
Edits
 -pipe 1 /a/squid/edits-filter >> /a/squid/edits.log
 +pipe 1 /usr/bin/udp-filter -p "action=edit,action=submit" >> /a/squid/edits.log
5xx errors
 -pipe 1 /a/squid/5xx-filter | awk -W interactive '$9 !~ "upload.wikimedia.org|query.php"' >> /a/squid/5xx.log
 +pipe 1 /usr/bin/udp-filter --http-status='50' | awk -W interactive '$9 !~ "upload.wikimedia.org|query.php"' >> /a/squid/5xx.log
Fundraising Landing Pages
 -pipe 1 /a/squid/fundraising/lp-filter >> /a/squid/fundraising/logs/landingpages.log
 +pipe 1 udp-filter -d wikimediafoundation.org,donate.wikimedia.org >> /a/squid/fundraising/logs/landingpages.log
Fundraising Banner Impressions
 -pipe 100 /a/squid/fundraising/bi-filter >> /a/squid/fundraising/logs/bannerImpressions-sampled100.log
 +pipe 100 /usr/bin/udp-filter -p 'Special:BannerLoader' >> /a/squid/fundraising/logs/bannerImpressions-sampled100.log

packet-loss filter

  • Use \t as field delimiter.

Wikipedia Zero Filters

Replace the IP range filters with a single udp-filter that matches on the X-Carrier header. This can be done after the additional fields have been added to the log sources.
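
As a rough illustration of matching on the carrier field instead of on IP ranges, here is a minimal sketch. The position of the X-Carrier field (`X_CARRIER_INDEX`), the "-" placeholder for a missing value, and the carrier names are all assumptions made for the example. The real filter would be a udp-filter invocation, not Python.

```python
X_CARRIER_INDEX = -1  # assumption: X-Carrier is logged as the last field
NO_VALUE = "-"        # assumption: placeholder written when the header is absent


def carrier_of(line):
    """Return the X-Carrier value of a tab-delimited log line, or None if absent."""
    parts = line.rstrip("\n").split("\t")
    carrier = parts[X_CARRIER_INDEX]
    return None if carrier in ("", NO_VALUE) else carrier


def is_zero_request(line, carriers=None):
    """True if the line carries an X-Carrier value (optionally one of `carriers`)."""
    carrier = carrier_of(line)
    if carrier is None:
        return False
    return carriers is None or carrier in carriers


# Hypothetical sample lines: URL, Accept-Language, X-Carrier.
sample = [
    "http://en.m.wikipedia.org/wiki/A\ten-US\tcarrier-a",
    "http://en.m.wikipedia.org/wiki/B\tfr\t-",
    "http://en.m.wikipedia.org/wiki/C\ten-GB\tcarrier-b",
]

zero_lines = [line for line in sample if is_zero_request(line)]
print(len(zero_lines), "of", len(sample), "sample lines carry an X-Carrier value")
```

One filter keyed on the header value avoids maintaining a separate IP range list per carrier, which is the motivation given above.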

sqstat.pl

  • Use \t as field delimiter.

Need to talk to Asher about this.

varnishncsa, nginx, and squid log formats

  • varnishncsa.default
  • nginx.conf.erb
  • frontend generate squid .php template

Reverse the nginx patch where we escape spaces

This can be done last. Is there a similar patch for varnishncsa?

Update Wikitech documentation

We need to update the wikitech documentation with the new HTTP headers. I have requested access for us to edit the wikitech wiki.

Deployment Plan

  1. Deploy the initial changes to udp-filter. udp-filter needs to be able to accept a variable number of fields. As of May 16, 2012, this change has been committed and needs to be deployed.
  2. Verify that everything works exactly as it did before this change. Wait a few days to catch any potential problems.
  3. Migrate the existing custom C scripts to udp-filter. This needs to be done after the above udp-filter change has been deployed, to avoid losing log lines that have spaces in some of their fields.
  4. Verify that all migrated filters still work properly. Wait for at least a few days after all filters have been migrated to ensure that things are ok. It'd be good to get verification from the filter owners before we proceed as well.
  5. Deploy the log source change that adds the additional fields to the log format. Do not yet deploy the \t change.
  6. Verify that all filters continue to work as before, even with the addition of extra fields.
  7. TODO: Work out \t deployment plan
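
The verification in step 6 could be approximated with a check like the one below. It is a sketch under stated assumptions, not the team's actual procedure. `edits_filter` is a hypothetical stand-in for a deployed filter, `URL_INDEX` and `ORIGINAL_FIELD_COUNT` are assumed values, and the sample lines are made up. The idea is to run the same filter over old-format and new-format copies of the same requests and compare the original fields of what it keeps.

```python
ORIGINAL_FIELD_COUNT = 14  # assumption: number of fields before the format change
URL_INDEX = 8              # assumption: 0-based URL position (awk's $9)


def edits_filter(lines):
    """Hypothetical stand-in for a deployed filter: keep edit and submit requests."""
    kept = []
    for line in lines:
        parts = line.split(" ")
        if len(parts) > URL_INDEX and any(
            pattern in parts[URL_INDEX] for pattern in ("action=edit", "action=submit")
        ):
            kept.append(line)
    return kept


def original_fields(line):
    """Return only the fields that existed before the extra ones were appended."""
    return line.split(" ")[:ORIGINAL_FIELD_COUNT]


def same_selection(old_lines, new_lines):
    """True if the filter keeps the same requests in both formats."""
    old_kept = [original_fields(line) for line in edits_filter(old_lines)]
    new_kept = [original_fields(line) for line in edits_filter(new_lines)]
    return old_kept == new_kept


def old_line(url):
    """Hypothetical helper: a space-delimited sample line with 14 fields."""
    fields = ["host", "1", "2012-05-16T00:00:00", "0.001", "127.0.0.1",
              "TCP_MISS/200", "1234", "GET", url, "NONE/-", "text/html",
              "-", "-", "ExampleAgent/1.0"]
    return " ".join(fields)


urls = [
    "http://en.wikipedia.org/w/index.php?title=A&action=edit",
    "http://en.wikipedia.org/wiki/B",
    "http://en.wikipedia.org/w/index.php?title=C&action=submit",
]
old_lines = [old_line(url) for url in urls]
# New format: the same requests with two extra fields appended (space delimiter kept).
new_lines = [line + " en-US -" for line in old_lines]

print("filter output unchanged:", same_selection(old_lines, new_lines))
```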