
Migrate / Replace

In 2014 the WMF Analytics Team is working on a vastly improved data infrastructure, based on message collection via Hadoop.

Migration/replacement of the traffic reports is not a minor task. All in all there are around 15 different squid-based reports, some in several varieties (views/edits), around 20 in total. And there are two different page view reports, one of which comes in around 60 varieties; the other is a global overview.

There are several scenarios for how to modernize these reports. Note these are unvetted suggestions, and may change after discussion within the team and elsewhere.

Remove, no replacement

Some squid-based non-geo reports have marginal usefulness: no one ever mentions them, no mails are ever received, and the content has deteriorated or even disappeared due to lack of maintenance in a changing environment. Possible candidates for removal:

Replace or swap input (feed from Hadoop)

Some squid-based non-geo reports are used widely, inside and outside Wikimedia/WMF, and we receive mails about bugs or general questions from anywhere. Some are not so popular but are very useful intermittently, and it seems wise to keep them (e.g. traffic by mime type, breakdown of Google traffic). They have however not been maintained for a year or two and there is a long list of open bugs. These reports have an orthogonal layout and could be replaced by a successor fairly simply. There is some usage of javascript to allow personalized figures (by day or by month, raw or normalized, full or shortened numbers). A more dynamic query front-end could add value, and maybe produce charts on the fly.

In order of popularity:

Keep for the coming year(s), swap input when time allows (1)

All webstatscollector-based page view reports have an established user base and a highly functional layout. They are fully maintained, with no known bugs. They are updated daily from very simple input files. Generating those input files (a yearly tar of hourly projectcount files) or a new intermediary file from a Hadoop environment seems not too complicated. Not urgent.

Keep for the coming year(s), swap input when time allows (2)

Squid-based geo-coded reports are used widely, inside and outside Wikimedia/WMF, and we receive mails about bugs or general questions from anywhere. Especially the overview reports (views and edits) have a polished layout, with among other things metadata periodically harvested from Wikipedia, and are concise, flexible and information-rich.

Monthly wikistats page view trend reports, indirectly based on squid logs

Wikistats server: stat1002
Update frequency: daily
Input: hourly per-wiki page view totals (files named projectcounts-yyyymmdd-hhnnss), produced by webstatscollector, downloadable from the dump server
Output:
  • csv file for the monthly Report Card
  • daily updated monthly page view stats. These reports are available in 4 variations: for desktop or mobile traffic, normalized (all months recalculated to 30 days) or original. There are reports for every project and a report for all projects combined. See sitemap.
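As a sketch of how such input files might be aggregated: the Python fragment below (a hedged illustration, not the actual wikistats code) sums hourly projectcounts totals per wiki. It assumes the usual webstatscollector line layout '&lt;project&gt; - &lt;views&gt; &lt;bytes&gt;'; verify against actual projectcounts files before relying on it.

```python
from collections import defaultdict

def parse_projectcounts(lines):
    """Sum hourly per-wiki view totals into one dict.

    Assumed line layout (hedged): '<project> - <views> <bytes>',
    where '-' is the placeholder used in projectcounts files.
    """
    totals = defaultdict(int)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip malformed lines
        project, _placeholder, views, _bytes = parts
        totals[project] += int(views)
    return dict(totals)

hourly = [
    "en - 120000 900000000",
    "de - 30000 200000000",
    "en - 110000 850000000",
]
print(parse_projectcounts(hourly))  # {'en': 230000, 'de': 30000}
```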

Monthly wikistats traffic reports (geo & traffic type), based on (un)sampled squid logs

Update frequency

Monthly (manual run)

Scripts location

github, dev/beta on squid1002:/home/ezachte/wikistats/squids


stat1002:/a/squid/archive/sampled

Collects a large variety of data from a/squids/archive/sampled in two passes:

  • updates a few monthly files (in folder /a/wikistats_git/squids/csv/yyyy-mm), and
  • creates a large set of daily csv files in folder (../yyyy-mm/yyyy-mm-dd), subfolders ../private and ../public.

Modules

  • (now at EZ ~/lib)
  • ../
  • ../
  • ../

Phases

  • Pass/Phase 1: collect frequencies for all IP addresses, needed in phase 2 to filter addresses which are most likely from bots (a frequency > 1 in the 1:1000 sampled squid log stands for 1000+ real views/edits; especially for edits this most likely means a bot; a few false positives are accepted)
  • Pass/Phase 2: collect all other counts
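The two-pass idea can be sketched as follows (a hedged illustration, not the actual Perl script; record layout is a hypothetical dict with an "ip" key):

```python
from collections import Counter

def pass1_ip_frequencies(records):
    """Phase 1: count how often each client IP occurs in one day's sampled log."""
    return Counter(r["ip"] for r in records)

def pass2_is_bot_edit(record, ip_freq):
    """Phase 2 filter: in a 1:1000 sampled log, an IP seen more than once
    in one day almost certainly belongs to a bot (a few false positives
    are accepted, as the source notes)."""
    return ip_freq[record["ip"]] > 1

records = [{"ip": "10.0.0.1"}, {"ip": "10.0.0.2"}, {"ip": "10.0.0.1"}]
freq = pass1_ip_frequencies(records)
print([pass2_is_bot_edit(r, freq) for r in records])  # [True, False, True]
```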

Arguments

  • -t for test mode (run counts for short timespan)
  • -d date range, can be absolute or relative, specify one value for a single day, two values for a range of days
    • -d yyyy/mm/dd[-yyyy/mm/dd] (slashes are optional). This is best for generating counts for a specific day or period.
    • -d -m[-n] where m and n are number of days before today. This is best for daily cron scheduling (e.g. -d 1-7 => run for last 7 days before today, skip days for which output already exists).
  • -f [1|2|12|21], to force a rerun of phase 1 and/or 2, even when that phase completed successfully earlier
    • phase 1: check for existence of SquidDataIpFrequencies.csv.bz2 in ../yyyy-mm/yyyy-mm-dd
    • phase 2: check for existence of #Ready in ../yyyy-mm/yyyy-mm-dd
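The -d date range logic above can be sketched like this (a hedged re-implementation of the documented behavior, not the script's own parser):

```python
import re
from datetime import date, timedelta

def parse_date_range(arg, today=None):
    """Resolve a -d argument into (first_day, last_day).

    Two forms, mirroring the documentation above:
      absolute: 'yyyy/mm/dd' or 'yyyy/mm/dd-yyyy/mm/dd' (slashes optional)
      relative: 'm' or 'm-n', days before today (e.g. '1-7' = the last
      7 days before today)
    """
    today = today or date.today()
    m = re.fullmatch(r"(\d{4})/?(\d{2})/?(\d{2})(?:-(\d{4})/?(\d{2})/?(\d{2}))?", arg)
    if m:
        first = date(int(m[1]), int(m[2]), int(m[3]))
        last = date(int(m[4]), int(m[5]), int(m[6])) if m[4] else first
        return first, last
    m = re.fullmatch(r"(\d+)(?:-(\d+))?", arg)
    if m:
        a = int(m[1])
        b = int(m[2]) if m[2] else a
        lo, hi = min(a, b), max(a, b)
        return today - timedelta(days=hi), today - timedelta(days=lo)
    raise ValueError(f"unrecognized date range: {arg}")

print(parse_date_range("2014/03/01-2014/03/07"))
```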

Input

1:1000 sampled squid log files from stat1002:/srv/squid/archive/sampled (see also wikitech)
  1. Hostname
  2. Sequence number
  3. Current time in ISO 8601 format (plus milliseconds), according to the squid server's clock
  4. Request time in ms
  5. Client IP
  6. Squid request status, HTTP status code
  7. Reply size including HTTP headers
  8. Request method (GET/POST etc.)
  9. URL
  10. Squid hierarchy status, peer IP
  11. MIME content type
  12. Referer header
  13. X-Forwarded-For header
  14. User-Agent header
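For illustration, a log line with these 14 fields can be split into a named record; this is a hedged sketch (the field names are descriptive labels chosen here, not official names, and the sample line is fabricated):

```python
# Descriptive labels for the 14 whitespace-delimited squid log fields
FIELDS = [
    "hostname", "sequence", "timestamp", "request_time_ms", "client_ip",
    "status", "reply_size", "method", "url", "hierarchy",
    "mime_type", "referer", "x_forwarded_for", "user_agent",
]

def parse_squid_line(line):
    # maxsplit keeps any spaces inside the trailing User-Agent field intact
    parts = line.split(None, len(FIELDS) - 1)
    return dict(zip(FIELDS, parts))

line = ("sq18.wikimedia.org 1715898 2014-03-01T00:00:01.000 0 10.0.0.1 "
        "TCP_MISS/200 13208 GET http://en.wikipedia.org/wiki/Main_Page "
        "NONE/- text/html http://en.wikipedia.org/ - Mozilla/5.0")
rec = parse_squid_line(line)
print(rec["client_ip"], rec["mime_type"])  # 10.0.0.1 text/html
```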

On file read, country codes are added from a local copy of MaxMind's free GeoLite Country service as follows:

  open IN, "-|", "gzip -dc $file_in | /usr/local/bin/geoiplogtag 5";

(geoiplogtag was written by Mark Bergsma; the argument 5 means: the 5th field contains the IP address)
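The same decompress-and-tag pipeline can be sketched in Python (a hedged illustration; the geoiplogtag path and argument are taken from the Perl one-liner above, and the filter command is parameterized here purely so the pattern can be tested without that binary):

```python
import subprocess

def read_geocoded(file_in, filter_cmd="/usr/local/bin/geoiplogtag 5"):
    """Stream a gzipped log through an external filter, mirroring the
    Perl pipe-open above. With the default filter_cmd, geoiplogtag
    appends a country code based on the IP in field 5."""
    proc = subprocess.Popen(
        f"gzip -dc {file_in} | {filter_cmd}",
        shell=True, stdout=subprocess.PIPE, text=True,
    )
    for line in proc.stdout:
        yield line.rstrip("\n")
    proc.wait()
```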

Output

Most files are written to sub folder ../yyyy-mm/yyyy-mm-dd
SquidDataLogfileHeadTail.csv and SquidDataSequenceNumbersAllSquids.csv are reused between runs and updated in sub folder ../yyyy-mm
  • Files marked as private can contain data which are covered by the Wikimedia Privacy Policy, such as IP addresses and/or referers. These files should never be published! They should also be deleted after the maximum retention period set by our privacy policy.
  • Count is the number of occurrences in one day's sampled log; for actual traffic counts, scale by a factor of 1000
  • Country codes are as used by MaxMind's free GeoLite Country service
  • There are several reasons why a requester is classified as bot
    • agent string contains a URL, which by convention should only occur for crawlers (exception: Embedded Web Browser from:
    • name contains bot/crawler/parser/spider
    • for edits only: ip address occurs more than once in sampled log on one day (few false positives accepted)
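The three classification rules above can be sketched as one function (a hedged illustration, not the actual wikistats code; note the real script also makes an exception for "Embedded Web Browser" agents, which is omitted here):

```python
import re

# rule 2: agent name contains bot/crawler/parser/spider (case-insensitive)
BOT_NAME = re.compile(r"bot|crawler|parser|spider", re.I)

def is_bot(user_agent, is_edit=False, ip_day_count=1):
    """Apply the three bot classification rules from the list above.

    ip_day_count is the IP's frequency in one day's 1:1000 sampled log
    (from the phase-1 frequency file); it only matters for edits.
    """
    # rule 1: agent string contains a URL (by convention only crawlers do)
    if "http://" in user_agent or "https://" in user_agent:
        return True
    if BOT_NAME.search(user_agent):
        return True
    # rule 3, edits only: IP seen more than once in one day's sampled log
    if is_edit and ip_day_count > 1:
        return True
    return False

print(is_bot("Googlebot/2.1 (+http://www.google.com/bot.html)"))  # True
print(is_bot("Mozilla/5.0 (Windows NT 6.1)"))  # False
```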
This file signals successful completion of the script's phase 2: collecting all counts except IP addresses.
On rerun, phase 2 will not be redone when this file exists, except when argument -f 2 (force) is specified.
Free-format agent strings sent by browsers
  • Agent string
  • Count
  • Image file name
  • Count
  • Category
    • - Desktop client
    • M Mobile client
    • E Engine (if Gecko or AppleWebKit)
    • G,- Group (higher aggregation level of desktop clients)
    • G,M Group (higher aggregation level of mobile clients)
  • Browser (client) brand and version
  • Share of total within category
  • Category
    • - Desktop client
    • M Mobile client
  • Client (~ browser) brand and version
  • Project (wikipedia, etc)
  • Language code
  • Share of total within category
  • Bot (Y/N) (see notes)
  • Project/Language (e.g. Wikipedia English = 'wp:en')
  • Country code (as used by MaxMind)
  • Count
  • Bot (Y/N) (see notes)
  • Project/Language (e.g. Wikipedia English = 'wp:en')
  • Country code (as used by MaxMind)
  • Count
  • Bot (Y/N) (see notes)
  • Project/Language (e.g. Wikipedia English = 'wp:en')
  • Country code
  • Count
SquidDataEditsSaves[..etc..].txt.bz2 (private file)
Captures all records describing edit/submit events for later analysis (will become obsolete when 1:1 non-sampled capture is reactivated)
  • File extension
  • Count
Visits by googlebot from known Google IP addresses (note: this list needs a manual update every now and then!)
  • Date
  • IP range (3 triplets)
  • Count
  • File size range (per 20 bytes)
  • Count
Counts for edits and submits
  • Bot (Y/N) (see notes)
  • Project/Language (e.g. Wikipedia English = 'wp:en') target wiki
  • Project/Language (e.g. Wikipedia English = 'wp:en') referring wiki
  • Squid status code e.g. 'TCP_MISS/302'
  • Mime type e.g. 'text/html'
  • Arguments (with values for some parameters blocked out) e.g. 'action=edit&title=..'
  • Count
SquidDataIpFrequencies[..etc..].csv (private file)
Generated in pass/phase 1, used among other things to flag multiple edits from the same IP address on one day as bot edits (input is the 1:1000 sampled file, a few false positives accepted).
Also some distribution stats: x addresses occur y times, x addresses occur y+ times
  • Count
  • IP address
For Chrome, Firefox, Opera, Safari
  • Browser name
  • Language code
  • Count
Caches timestamps found in the first and last record of sampled log files /a/squid/archive/sampled-1000.log-yyyymmdd.gz
Collecting these from a compressed log file is slow
  • Action e.g. GET
  • Status e.g. TCP_MISS/200
  • Count
  • Category
    • - Desktop client
    • M Mobile client
    • G Group (higher aggregation level)
  • OS name + version
  • Count
  • Share of total (- and M together 100%, G separately)
  • source (internal or external)
  • domain top level (external e.g. 'yahoo', internal e.g. 'wp:en')
  • domain e.g. '.com'
  • file type: image, (html) page, other (e.g. css)
SquidDataReferers[..etc..].txt (private file)
mainly for debugging (see comments in file)
(see also comments in file)
  • Project e.g. '@wp:en' for Mobile English Wikipedia
  • Referer e.g. ''
  • Extension e.g. 'gif'
  • Mime type e.g. 'text/html'
  • Arguments (values for most arguments omitted, arguments sorted alphabetically for better grouping), e.g. 'query=..&search=..&x=..&y=..'
  • Count
  • Type: css, js or php
  • Script name e.g. 'print.css'
  • Arguments (values for most arguments omitted, arguments sorted alphabetically for better grouping), e.g. 'query=..&search=..&x=..&y=..'
  • Count
Ideally, if no UDP messages are lost, the average distance between squid sequence numbers should be 1000 (1:1000 sampled log).
This file helps detect mass message loss and can help to autocorrect counts
  • Date
  • Hour
  • Events
  • Avg distance between sequence numbers
Ideally, if no UDP messages are lost, average distance between squid sequence numbers should be 1000 (1:1000 sampled log)
  • Squid id e.g. ''
  • Hour
  • Events
  • Tot distance between sequence numbers
  • Avg distance between sequence numbers
  • Skin file path e.g. 'skins-1.5/common/ajax.js'
  • Count

Collects per-country page view stats for all months

  • Views:
    • starting in July 2009 for page views
    • input in stat1002:/wikistats_git/squids/csv
  • Edits:
    • starting in November 2011 for page edits
    • input in stat1002:/wikistats_git/squids/csv_edits
Input used to produce a breakdown of Wikipedia views/edits by geo info
SquidDataCountries2.csv

Arguments

Input

Csv files generated by

Output

Once a month, generates a large set of reports from the csv files generated by
Page view reports are based on a 1:1000 sampled server log (squids), so all counts are multiplied by 1000.
Page edit reports are identical to page view reports, but based on the 1:1 unsampled squid log!

Reports come in two sets

Quick script to collect browser stats for Excel chart, see example Excel chart

Reads all files on the squid log aggregator with hourly counts for

  • number of events received per squid
  • average gap in sequence numbers (this should be 1000 ideally on a 1:1000 sampled log)

Write several aggregations of these data:

  • Hourly average delta sequence numbers per squid (should be 1000 in 1:1000 sampled data set, if no UDP loss occurred)
  • Monthly averages of events per squid per hour
  • Monthly metric per squid set (set defined by location and server role)
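The gap check behind these aggregations can be sketched as follows (a hedged illustration, not the actual script; it assumes sequence numbers that did not wrap around during the interval):

```python
def avg_sequence_gap(seq_numbers):
    """Average distance between consecutive squid sequence numbers.

    In a 1:1000 sampled log with no UDP loss this is ~1000; a larger
    average gap indicates lost udp2log messages."""
    seqs = sorted(seq_numbers)
    if len(seqs) < 2:
        return None  # not enough data to measure a gap
    return (seqs[-1] - seqs[0]) / (len(seqs) - 1)

def loss_correction_factor(avg_gap, sampling=1000):
    """Factor by which counts could be scaled up to autocorrect for loss."""
    return avg_gap / sampling

# one message lost between 2000 and 4000, so the average gap exceeds 1000
gap = avg_sequence_gap([1000, 2000, 4000, 5000])
print(gap, loss_correction_factor(gap))  # gap ~1333, correction ~1.33
```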

This leads to the following report: udp2log message loss, and traffic volume, based on 1:1000 sampled squid logs

See also