Analytics/Wikistats/TrafficReports

Monthly wikistats trend reports, indirectly based on squid logs

 * Data source: Hourly page view files, per article title and per project, indirectly based on squid logs after aggregation with Domas Mituzas' scripts (see the parsing sketch after this list).
 * Platform: wikistats server bayes
 * Update frequency: daily
 * Input: udp2log on locke collects UDP messages from all squids and writes sampled data to disk (kept for 3 months). It also produces two hourly files, both downloadable from dammit.lt (in fact these are now produced on a WMF server, so the dammit.lt server can be taken out of the process some day).
 * Hourly files with total page views per wiki are used for the report card and for monthly page view stats. The reports are available in 4 variations: for desktop and mobile traffic, normalized (to months of 30 days) or original.
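
A minimal sketch of reading one of the hourly per-article page view files and summing views per project. The line format used here (project code, page title, view count, bytes transferred) reflects the published pagecounts files; treat the exact field layout as an assumption, not a spec.

 #!/usr/bin/perl
 # Sketch: sum hourly page views per project from an hourly page view file.
 # Assumed line format: "<project> <page_title> <views> <bytes>", e.g.
 #   en Main_Page 242332 4737756101
 use strict;
 use warnings;
 
 my $file = shift @ARGV or die "Usage: $0 <hourly page view file>\n";
 open my $in, '<', $file or die "Cannot open $file: $!";
 
 my %views_per_project;
 while (my $line = <$in>) {
     chomp $line;
     my ($project, $title, $views, $bytes) = split / /, $line;
     next unless defined $views && $views =~ /^\d+$/;   # skip malformed lines
     $views_per_project{$project} += $views;
 }
 close $in;
 
 print "$_: $views_per_project{$_}\n" for sort keys %views_per_project;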

Monthly wikistats reports, providing breakdowns of traffic directly based on (sampled) squid logs
 * Platform: squid log aggregation server locke
 * Update frequency: manually, ideally monthly (automation is in progress)
 * Scripts location: locke, in folder /a/ezachte

SquidCountArchive.pl
Collects a host of data from /a/squid/archive in two passes (soon to be scheduled daily), updates a few monthly files (in folder ../yyyy-mm), and creates a host of daily csv files (in folder ../yyyy-mm/yyyy-mm-dd).
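
A minimal sketch of the folder layout implied above; only the yyyy-mm/yyyy-mm-dd naming is documented, the choice of "yesterday" as the default day is an assumption.

 #!/usr/bin/perl
 # Sketch: derive the monthly (../yyyy-mm) and daily (../yyyy-mm/yyyy-mm-dd)
 # output folders for yesterday's date.
 use strict;
 use warnings;
 use POSIX qw(strftime);
 
 my @yesterday = localtime(time - 24 * 60 * 60);
 my $month_dir = strftime('%Y-%m', @yesterday);                      # e.g. 2011-07
 my $day_dir   = "$month_dir/" . strftime('%Y-%m-%d', @yesterday);   # e.g. 2011-07/2011-07-14
 
 print "monthly files: ../$month_dir\n";
 print "daily csv files: ../$day_dir\n";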

Modules

 * EzLib.pm
 * SquidCountArchiveProcessLogRecord.pm
 * SquidCountArchiveReadInput.pm
 * SquidCountArchiveWriteOutput.pm

Phases

 * Pass/Phase 1: collect frequencies for all IP addresses, needed in phase 2 to filter addresses which are most likely from bots. A frequency > 1 in the 1:1000 sampled squid log stands for over 1000 real views/edits; especially for edits this most likely means a bot (a few false positives are accepted). See the sketch after this list.
 * Pass/Phase 2: collect all other counts
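
A minimal sketch of this two-pass idea, not the actual SquidCountArchive.pl code; the squid log field order used here (hostname, sequence number, timestamp, request time, client IP, ...) is an assumption.

 #!/usr/bin/perl
 # Sketch: pass 1 counts how often each client IP occurs in the 1:1000 sampled
 # log; pass 2 flags requests from high-frequency IPs as probable bot traffic.
 use strict;
 use warnings;
 
 my $log = shift @ARGV or die "Usage: $0 <sampled squid log>\n";
 my $threshold = 1;   # freq > 1 in a 1:1000 sample ~ more than 1000 real requests
 my %ip_freq;
 
 # Pass 1: frequency per IP address
 open my $in, '<', $log or die "Cannot open $log: $!";
 while (<$in>) {
     my (undef, undef, undef, undef, $ip) = split ' ';   # assumed field order
     $ip_freq{$ip}++ if defined $ip;
 }
 close $in;
 
 # Pass 2: all other counts, split into probable bot vs. non-bot traffic
 my %requests;
 open $in, '<', $log or die "Cannot open $log: $!";
 while (<$in>) {
     my (undef, undef, undef, undef, $ip) = split ' ';
     next unless defined $ip;
     my $bot = $ip_freq{$ip} > $threshold ? 'Y' : 'N';
     $requests{$bot}++;
 }
 close $in;
 
 printf "bot=%s sampled=%d estimated=%d\n", $_, $requests{$_}, $requests{$_} * 1000
     for sort keys %requests;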

Arguments

 * -t for test mode (run counts for a short timespan)
 * -d date range, absolute or relative; specify one value for a single day, two values for a range of days (see the parsing sketch after this list)
 * -d yyyy/mm/dd[-yyyy/mm/dd] (slashes are optional). This is best for generating counts for a specific day or period.
 * -d -m[-n] where m and n are numbers of days before today. This is best for daily cron scheduling (e.g. -d 1-7 => run for the last 7 days before today, skipping days for which output already exists).
 * -f [1|2|12|21], to force a rerun of phase 1 and/or 2, even when that phase completed successfully earlier. Completion is detected as follows:
 ** phase 1: check for existence of SquidDataIpFrequencies.csv.bz2 in ../yyyy-mm/yyyy-mm-dd
 ** phase 2: check for existence of #Ready in ../yyyy-mm/yyyy-mm-dd
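
A minimal sketch of how such a -d value could be interpreted, as an illustration of the syntax above rather than the actual option handling in SquidCountArchive.pl.

 #!/usr/bin/perl
 # Sketch: parse -d as either absolute "yyyy/mm/dd[-yyyy/mm/dd]" (slashes
 # optional) or relative "m[-n]" days before today.
 use strict;
 use warnings;
 use POSIX qw(strftime);
 
 sub parse_date_range {
     my ($arg) = @_;
     # Absolute: one date or a date range, slashes optional
     if ($arg =~ m{^(\d{4})/?(\d{2})/?(\d{2})(?:-(\d{4})/?(\d{2})/?(\d{2}))?$}) {
         my $from = "$1-$2-$3";
         my $to   = defined $4 ? "$4-$5-$6" : $from;
         return ($from, $to);
     }
     # Relative: m[-n] days before today
     if ($arg =~ m{^-?(\d+)(?:-(\d+))?$}) {
         my ($m, $n) = ($1, defined $2 ? $2 : $1);
         my @days = map { strftime('%Y-%m-%d', localtime(time - $_ * 86400)) } ($m, $n);
         return sort @days;
     }
     die "Cannot parse date range '$arg'\n";
 }
 
 my ($from, $to) = parse_date_range($ARGV[0] // '1');
 print "process days $from .. $to\n";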

Input
1:1000 sampled squid log files from locke at /a/squid/archive

Output

 * Notes
 ** Count is the number of occurrences in one day's sampled log; for actual traffic counts, scale by a factor of 1000.
 ** Country codes are as used by MaxMind's free GeoLite Country service (see the lookup sketch below).
 * #Ready
 ** This file signals successful completion of the script's phase 2: collecting all counts except IP addresses.
 ** On rerun, phase 2 will not be redone when this file exists, except when argument -f 2 (force) is specified.
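
The country columns in the files below come from a GeoIP lookup of the client IP address. A minimal sketch using the Geo::IP Perl binding and the free GeoLite Country database; the binding and the database path are assumptions, not necessarily what the script uses.

 #!/usr/bin/perl
 # Sketch: map a client IP address to a MaxMind GeoLite country code.
 use strict;
 use warnings;
 use Geo::IP;
 
 # Path to the free GeoLite Country database is an assumption.
 my $gi = Geo::IP->open('/usr/share/GeoIP/GeoIP.dat', GEOIP_MEMORY_CACHE);
 
 my $ip = shift @ARGV or die "Usage: $0 <ip address>\n";
 my $country_code = $gi->country_code_by_addr($ip) // '--';
 print "$ip => $country_code\n";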


 * DebugSquidDataErr.txt : errors and warnings
 * DebugSquidDataOut.txt : content of some hashes
 * DebugSquidDataOut2.txt : content of other hashes
 * SquidDataAgents.csv : free format agent strings sent by browsers
 ** Agent string
 ** Count
 * SquidDataBinaries.csv
 ** Image file name
 ** Count
 * SquidDataClients.csv
 ** Category
 *** - Desktop client
 *** M Mobile client
 *** E Engine (if Gecko or AppleWebKit)
 *** G,- Group (higher aggregation level of desktop clients)
 *** G,M Group (higher aggregation level of mobile clients)
 ** Browser (client) brand and version
 ** Share of total within category
 * SquidDataClientsByWiki.csv
 ** Category
 *** - Desktop client
 *** M Mobile client
 ** Client (~ browser) brand and version
 ** Project (wikipedia, etc.)
 ** Language code
 ** Share of total within category
 * SquidDataCountriesViews.csv
 ** Bot (Y/N) (see notes)
 ** Project/Language (e.g. Wikipedia English = wp:en)
 ** Country code (as used by MaxMind)
 ** Count
 * SquidDataCountriesSaves.csv
 ** Bot (Y/N) (see notes)
 ** Project/Language (e.g. Wikipedia English = wp:en)
 ** Country code (as used by MaxMind)
 ** Count
 * SquidDataCountriesViewsTimed.csv
 ** Bot (Y/N) (see notes)
 ** Project/Language (e.g. Wikipedia English = wp:en)
 ** Country code
 ** Count
 * SquidDataCrawlers.csv
 * SquidDataErr.txt
 * SquidDataExtensions.csv
 * SquidDataGoogleBots.csv
 * SquidDataImages.csv
 * SquidDataIndexPhp.csv
 * SquidDataIpFrequencies.csv
 * SquidDataLanguages.csv
 * SquidDataMethods.csv
 * SquidDataOpSys.csv
 * SquidDataOrigins.csv
 * SquidDataReferers[..etc..].txt : not to be published, as that would violate the Wikimedia privacy policy
 * SquidDataRequests.csv
 * SquidDataScripts.csv
 * SquidDataSearch.csv
 * SquidDataSequenceNumbersAllSquids.csv
 * SquidDataSequenceNumbersPerSquidHour.csv
 * SquidDataSkins.csv

SquidReportArchive.pl
Once a month, generates a large number of reports from the csv files generated by SquidCountArchive.pl. These reports are based on a 1:1000 sampled server log (squids), so all counts are multiplied by 1000.
 * Arguments
 * Input: csv files generated by SquidCountArchive.pl
 * Output: reports answering questions such as the following (see the aggregation sketch after this list)
 ** How many files are requested each day? Breakdown by file type and target (sub)project
 ** Where do those requests originate? Breakdown by file category and origin
 ** Where do those requests land? Breakdown by file category and destination wiki
 ** Which HTTP requests are issued? Breakdown by type and result
 ** Which scripts are invoked? Breakdown by type (css, javascript, php), name and parameters
 ** Which skin files are downloaded, and how often?
 ** Which crawlers access our servers? Breakdown by host, file types requested and agent string
 ** Which operating systems do our clients use? Breakdown by platform, mobile or not, and release
 ** How popular is each browser? Breakdown by brand, revision level, mobile or not
 ** Breakdown of all traffic that Google sends us, directly (crawlers) and indirectly (search results)
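
A minimal sketch of the kind of monthly aggregation behind such reports, not the actual SquidReportArchive.pl code: it sums the daily SquidDataCountriesViews.csv files of one month per country. The column order (bot, project/language, country code, count) follows the file description above; the comma separator is an assumption.

 #!/usr/bin/perl
 # Sketch: aggregate one month of daily per-country view counts.
 use strict;
 use warnings;
 
 my $month_dir = shift @ARGV or die "Usage: $0 <yyyy-mm folder>\n";
 my %views_per_country;
 
 for my $day_dir (glob "$month_dir/????-??-??") {
     my $csv = "$day_dir/SquidDataCountriesViews.csv";
     next unless -e $csv;
     open my $in, '<', $csv or die "Cannot open $csv: $!";
     while (<$in>) {
         chomp;
         my ($bot, $project, $country, $count) = split /,/;
         next unless defined $count && $count =~ /^\d+$/;   # skip malformed/header lines
         next if $bot eq 'Y';                               # skip probable bot traffic
         $views_per_country{$country} += $count;
     }
     close $in;
 }
 
 # Counts come from a 1:1000 sampled log, so scale by 1000 for estimated real views
 printf "%s %d\n", $_, $views_per_country{$_} * 1000
     for sort { $views_per_country{$b} <=> $views_per_country{$a} } keys %views_per_country;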