Analytics/Wikistats/TrafficReports

Monthly wikistats trend reports, indirectly based on squid logs

Data source: Hourly page view files, per article title and per project, indirectly based on squid logs after aggregation with Domas Mituzas' scripts.
Platform: wikistats server bayes
Update frequency: daily
Input: udp2log on locke collects UDP messages from all squids and writes sampled data to disk (3 months of storage). It produces two hourly files, both downloadable from dammit.lt (in fact these are now produced on a WMF server, so the dammit.lt server can be taken out of the process some day).
Hourly files with total page views per wiki are used for the report card and the monthly page view stats. The reports are available in 4 variations: for desktop and mobile traffic, each normalized (to months of 30 days) or original.
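
As an illustration of how these hourly files are consumed, here is a minimal Perl sketch that tallies page views per wiki from one hourly per-title file. It assumes a 'project title count bytes' line format for those files; the file name is an example only.

 #!/usr/bin/perl
 # Minimal sketch: sum hourly per-title page views per wiki.
 # Assumed line format: "project title count bytes" (hedged).
 use strict;
 use warnings;
 my %views_per_wiki;
 open my $in, "-|", "gzip -dc pagecounts-20120101-000000.gz"
     or die "Cannot open hourly file: $!";
 while (my $line = <$in>) {
     chomp $line;
     my ($project, $title, $count, $bytes) = split ' ', $line;
     next unless defined $count && $count =~ /^\d+$/;
     $views_per_wiki{$project} += $count;   # aggregate per wiki
 }
 close $in;
 printf "%-12s %12d\n", $_, $views_per_wiki{$_}
     for sort keys %views_per_wiki;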

Monthly wikistats reports, providing breakdowns of traffic directly based on (sampled) squid logs

Platform: squid log aggregation server locke
Update frequency: manually, ideally monthly (automation is in progress)
Scripts location: locke, in folder /a/ezachte

SquidCountArchive.pl

Collects a host of data from /a/squid/archive in two passes (soon to be scheduled daily), updates a few monthly files (in folder ../yyyy-mm), and creates a host of daily csv files in folder ../yyyy-mm/yyyy-mm-dd.

Modules

  • EzLib.pm (now at EZ ~/lib)
  • ../SquidCountArchiveProcessLogRecord.pm
  • ../SquidCountArchiveReadInput.pm
  • ../SquidCountArchiveWriteOutput.pm

Phases

  • Pass/phase 1: collect frequencies for all IP addresses; these are needed in phase 2 to filter addresses which are most likely from bots. A frequency > 1 in the 1:1000 sampled squid log stands for more than 1000 real views/edits; especially for edits this most likely means a bot (a few false positives are accepted).
  • Pass/phase 2: collect all other counts (a two-pass sketch follows this list)
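
A minimal Perl sketch of the two passes, assuming the field layout described under Input below (field 5 = client IP, field 9 = URL) and an example file name; the actual script is more elaborate.

 # Hedged sketch of the two passes over one day's sampled log.
 use strict;
 use warnings;
 my $log = "sampled-1000.log-20120101.gz";   # example file name
 # Pass 1: frequency of each client IP (field 5).
 my %ip_freq;
 open my $in1, "-|", "gzip -dc $log" or die $!;
 while (<$in1>) {
     my @f = split ' ';
     $ip_freq{ $f[4] }++ if defined $f[4];
 }
 close $in1;
 # Pass 2: count edit/submit events, split by suspected bots
 # (IP seen more than once => likely bot, few false positives).
 my ($edits_human, $edits_bot) = (0, 0);
 open my $in2, "-|", "gzip -dc $log" or die $!;
 while (<$in2>) {
     my @f = split ' ';
     my ($ip, $url) = @f[4, 8];
     next unless defined $url && $url =~ /action=(edit|submit)/;
     $ip_freq{$ip} > 1 ? $edits_bot++ : $edits_human++;
 }
 close $in2;
 print "edits: human $edits_human, bot $edits_bot (x 1000 for actual counts)\n";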

Arguments

  • -t for test mode (run counts for short timespan)
  • -d date range, absolute or relative; specify one value for a single day, two values for a range of days
    • -d yyyy/mm/dd[-yyyy/mm/dd] (slashes are optional). This is best for generating counts for a specific day or period.
    • -d -m[-n], where m and n are numbers of days before today. This is best for daily cron scheduling (e.g. -d 1-7 => run for the last 7 days before today, skipping days for which output already exists). A sketch of resolving both forms follows this list.
  • -f [1|2|12|21], to force a rerun of phase 1 and/or 2, even when that phase completed successfully earlier:
    • phase 1: checks for the existence of SquidDataIpFrequencies.csv.bz2 in ../yyyy-mm/yyyy-mm-dd
    • phase 2: checks for the existence of #Ready in ../yyyy-mm/yyyy-mm-dd
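
A hedged sketch of how the -d argument could be resolved; the actual parsing in SquidCountArchive.pl may differ.

 use strict;
 use warnings;
 use POSIX qw(strftime);
 sub resolve_date_range {
     my ($arg) = @_;
     # Absolute: yyyy/mm/dd[-yyyy/mm/dd], slashes optional.
     if ($arg =~ m{^(\d{4})/?(\d\d)/?(\d\d)(?:-(\d{4})/?(\d\d)/?(\d\d))?$}) {
         my $from = "$1-$2-$3";
         my $till = defined $4 ? "$4-$5-$6" : $from;
         return ($from, $till);
     }
     # Relative: m[-n] days before today (leading '-' tolerated).
     if ($arg =~ m{^-?(\d+)(?:-(\d+))?$}) {
         my ($m, $n) = ($1, defined $2 ? $2 : $1);
         my @days = map { strftime "%Y-%m-%d", localtime(time - $_ * 86400) } ($m, $n);
         return sort @days;   # oldest first
     }
     die "Invalid -d argument: $arg\n";
 }
 my ($from, $till) = resolve_date_range($ARGV[0] // "1-7");
 print "Counting from $from till $till\n";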

Input

1:1000 sampled squid log files from locke at /a/squid/archive
Format (see also wikitech):
  1. Hostname
  2. Sequence number
  3. Current time in ISO 8601 format (plus milliseconds), according to the squid server's clock
  4. Request time in ms
  5. Client IP
  6. Squid request status, HTTP status code
  7. Reply size including HTTP headers
  8. Request method (GET/POST etc.)
  9. URL
  10. Squid hierarchy status, peer IP
  11. MIME content type
  12. Referer header
  13. X-Forwarded-For header
  14. User-Agent header

On file read, country codes are added from a local copy of MaxMind's free GeoLite Country service as follows:

 open IN, "-|", "gzip -dc $file_in | /usr/local/bin/geoiplogtag 5";
(geoiplogtag by Mark Bergsma; the argument 5 means: the 5th field contains the IP address)
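
A sketch of consuming this pipeline and splitting each record into the 14 fields listed above; exactly how geoiplogtag embeds the country code in field 5 is an assumption here, not verified, and the file name is an example.

 use strict;
 use warnings;
 my $file_in = "sampled-1000.log-20120101.gz";
 open my $in, "-|", "gzip -dc $file_in | /usr/local/bin/geoiplogtag 5"
     or die "Cannot open pipeline: $!";
 while (my $line = <$in>) {
     chomp $line;
     my ($host, $seqno, $time, $reqtime, $client,
         $status, $size, $method, $url, $hierarchy,
         $mime, $referer, $xff, $agent) = split ' ', $line, 14;
     # $client now carries the country tag added by geoiplogtag
     # (format assumed); downstream counts key on it.
 }
 close $in;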

Output

Most files are written to sub folder ../yyyy-mm/yyyy-mm-dd
SquidDataLogfileHeadTail.csv and SquidDataSequenceNumbersAllSquids.csv are reused between runs and updated in sub folder ../yyyy-mm
Notes
  • Files marked as private can contain data covered by the Wikimedia Privacy Policy, such as IP addresses and/or referers. These files should never be published! They will be automatically deleted after half a year (to be implemented!).
  • Count is the number of occurrences in one day's sampled log; for actual traffic counts, scale by a factor of 1000
  • Country codes are as used by MaxMind's free GeoLite Country service
  • There are several reasons why a requester is classified as a bot (a sketch follows these notes)
    • the agent string contains a URL, which by convention should only occur for crawlers (exception: Embedded Web Browser from: http://bsalsa.com/)
    • the name contains bot/crawler/parser/spider
    • for edits only: the IP address occurs more than once in the sampled log on one day (a few false positives are accepted)
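
A minimal sketch of these three rules; the exact patterns in SquidCountArchiveProcessLogRecord.pm may differ.

 use strict;
 use warnings;
 sub is_bot {
     my ($agent, $ip_freq, $is_edit) = @_;
     # Rule 1: agent string contains a URL (crawler convention),
     # except the Embedded Web Browser from bsalsa.com.
     return 1 if $agent =~ m{https?://} && $agent !~ m{bsalsa\.com};
     # Rule 2: name contains bot/crawler/parser/spider.
     return 1 if $agent =~ /bot|crawler|parser|spider/i;
     # Rule 3 (edits only): IP seen more than once in one day's
     # 1:1000 sampled log.
     return 1 if $is_edit && $ip_freq > 1;
     return 0;
 }
 print is_bot("Googlebot/2.1 (+http://www.google.com/bot.html)", 1, 0)
     ? "bot\n" : "human\n";
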
#Ready
This file signals successful completion of the script's phase 2: collecting all counts except IP addresses.
On rerun, phase 2 will not be redone when this file exists, except when argument -f 2 (force) is specified.
SquidDataAgents.csv
Free-format agent strings as sent by the browser
  • Agent string
  • Count
SquidDataBinaries.csv
  • Image file name
  • Count
SquidDataClients.csv
  • Category
    • - Desktop client
    • M Mobile client
    • E Engine (if Gecko or AppleWebKit)
    • G,- Group (higher aggregation level of desktop clients)
    • G,M Group (higher aggregation level of mobile clients)
  • Browser (client) brand and version
  • Share of total within category
SquidDataClientsByWiki.csv
  • Category
    • - Desktop client
    • M Mobile client
  • Client (~ browser) brand and version
  • Project (wikipedia, etc)
  • Language code
  • Share of total within category
SquidDataCountriesViews.csv
  • Bot (Y/N) (see notes)
  • Project/Language (e.g. Wikipedia English = 'wp:en')
  • Country code (as used by MaxMind)
  • Count
SquidDataCountriesSaves.csv
  • Bot (Y/N) (see notes)
  • Project/Language (e.g. Wikipedia English = 'wp:en')
  • Country code (as used by MaxMind)
  • Count
SquidDataCountriesViewsTimed.csv
  • Bot (Y/N) (see notes)
  • Project/Language (e.g. Wikipedia English = 'wp:en')
  • Country code
  • Count
SquidDataCrawlers.csv
SquidDataEditsSaves[..etc..].txt.bz2 (private file)
Captures all records describing edit/submit events for later analysis (will become obsolete when 1:1 non-sampled capture is reactivated)
SquidDataExtensions.csv
  • File extension
  • Count
SquidDataGoogleBots.csv
Visits by googlebot from known Google IP addresses (note: the list needs a manual update every now and then!)
  • Date
  • IP range (3 triplets)
  • Count
SquidDataImages.csv
  • File size range (per 20 bytes)
  • Count
SquidDataIndexPhp.csv
Counts for edits and submits
  • Bot (Y/N) (see notes)
  • Project/Language (e.g. Wikipedia English = 'wp:en') target wiki
  • Project/Language (e.g. Wikipedia English = 'wp:en') referring wiki
  • Squid status code e.g. 'TCP_MISS/302'
  • Mime type e.g. 'text/html'
  • Arguments (with values for some parameters blocked out) e.g. 'action=edit&title=..'
  • Count
SquidDataIpFrequencies[..etc..].csv (private file)
Generated in pass/phase 1; used, among other things, to flag multiple edits from the same IP address on one day as bot edits (input is the 1:1000 sampled file, a few false positives are accepted).
Also contains some distribution stats: x addresses occur y times, x addresses occur y+ times
  • Count
  • IP address
SquidDataLanguages.csv
For Chrome, Firefox, Opera, Safari
  • Browser name
  • Language code
  • Count
SquidDataLogFilesHeadTail.csv
Caches the timestamps found in the first and last record of the sampled log files /a/squid/archive/sampled-1000.log-yyyymmdd.gz
Collecting these from a compressed log file is slow
SquidDataMethods.csv
  • Action e.g. GET
  • Status e.g. TCP_MISS/200
  • Count
SquidDataOpSys.csv
  • Category
    • - Desktop client
    • M Mobile client
    • G Group (higher aggregation level)
  • OS name + version
  • Count
  • Share of total (- and M together 100%, G separately)
SquidDataOrigins.csv
  • Source (internal or external)
  • Domain top level (external e.g. 'yahoo', internal e.g. 'wp:en')
  • Domain e.g. '.com'
  • File type: image, (html) page, other (e.g. css)
SquidDataReferers[..etc..].txt (private file)
Mainly for debugging (see comments in file)
SquidDataRequests.csv
(see also comments in file)
  • Project e.g. '@wp:en' for Mobile English Wikipedia
  • Referer e.g. 'www.google.ca'
  • Extension e.g. 'gif'
  • Mime type e.g. 'text/html'
  • Arguments (values for most arguments omitted, arguments sorted alphabetically for better grouping), e.g. 'query=..&search=..&x=..&y=..' (a normalization sketch follows this list)
  • Count
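
A minimal sketch of this normalization: values are blanked to '..' and parameters sorted so that equivalent requests group together (which parameters, if any, keep their values is not specified here, so all are blanked).

 use strict;
 use warnings;
 sub normalize_args {
     my ($query) = @_;
     # Blank each value, then sort parameters alphabetically.
     my @pairs = map { my ($k) = split /=/; "$k=.." } split /&/, $query;
     return join '&', sort @pairs;
 }
 print normalize_args("search=wiki&x=10&query=test&y=20"), "\n";
 # -> query=..&search=..&x=..&y=..
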
SquidDataScripts.csv
  • Type: css, js or php
  • Script name e.g. 'print.css'
  • Arguments (values for most arguments omitted, arguments sorted alphabetically for better grouping), e.g. 'query=..&search=..&x=..&y=..'
  • Count
SquidDataSearch.csv
tbd
SquidDataSequenceNumbersAllSquids.csv
Ideally, if no UDP messages are lost, the average distance between squid sequence numbers should be 1000 (1:1000 sampled log)
This file can detect mass message loss and help to autocorrect counts (a sketch of the check follows the column list)
  • Date
  • Hour
  • Events
  • Avg distance between sequence numbers
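
A hedged sketch of this check: per squid and hour, the gap between the first and last sequence number, divided by the number of events, approximates the sampling distance; at 1:1000 sampling an average well above 1000 suggests lost UDP messages. The input lines ("squid_id hour seqno") are illustrative.

 use strict;
 use warnings;
 my (%first, %last, %events);
 while (<STDIN>) {
     my ($squid, $hour, $seq) = split ' ';
     my $key = "$squid $hour";
     $first{$key} //= $seq;
     $last{$key}    = $seq;
     $events{$key}++;
 }
 for my $key (sort keys %events) {
     next if $events{$key} < 2;
     my $avg = ($last{$key} - $first{$key}) / ($events{$key} - 1);
     printf "%s: %d events, avg distance %.0f%s\n", $key, $events{$key},
         $avg, $avg > 1100 ? "  <- possible message loss" : "";
 }
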
SquidDataSequenceNumbersPerSquidHour.csv
Ideally, if no UDP messages are lost, the average distance between squid sequence numbers should be 1000 (1:1000 sampled log)
  • Squid id e.g. 'amssq31.esams.wikimedia.org'
  • Hour
  • Events
  • Tot distance between sequence numbers
  • Avg distance between sequence numbers
SquidDataSkins.csv
  • Skin file path e.g. 'skins-1.5/common/ajax.js'
  • Count

SquidReportArchive.pl

Arguments

Input

Csv files generated by SquidCountArchive.pl.

Output

Once a month this script generates a large set of reports from the csv files generated by SquidCountArchive.pl. These reports are based on a 1:1000 sampled server log (squids) ⇒ all counts × 1000.