Analytics/Wikistats/TrafficReports

Monthly wikistats trend reports, indirectly based on squid logs

 * Platform: wikistats server stat1002
 * Update frequency: daily
 * Input: hourly per-wiki page view totals (files named projectcounts-yyyymmdd-hhnnss), produced by webstatscollector and downloadable from the dump server
 * Output:
   * csv file for the monthly Report Card
   * daily updated monthly page view stats. These reports are available in 4 variations: for desktop and mobile traffic, normalized (all months recalculated to 30 days) or original. There are reports for every project and a report for all projects combined. See sitemap.
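The 30-day normalization mentioned above amounts to a linear rescale of each month's total; a minimal sketch (function name hypothetical, the rule itself is as stated in the report description):

```python
import calendar

def normalize_to_30_days(monthly_views, year, month):
    """Rescale a month's page view total to a 30-day month, so months of
    different lengths can be compared (hypothetical helper mirroring the
    'all months recalculated to 30 days' rule)."""
    days_in_month = calendar.monthrange(year, month)[1]
    return monthly_views * 30 / days_in_month
```

For example, a 31-day January with 31,000 views normalizes to 30,000.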

Monthly wikistats reports, providing breakdowns of traffic directly based on (sampled) squid logs
 * Update frequency: monthly (manual run)
 * Scripts location: github; dev/beta on squid1002:/home/ezachte/wikistats/squids
 * Input: stat1002:/a/squid/archive/sampled

SquidCountArchive.pl
Collects a large variety of data from /a/squids/archive/sampled in two passes:
 * updates a few monthly files (in folder /a/wikistats_git/squids/csv/yyyy-mm), and
 * creates a large set of daily csv files in folder ../yyyy-mm/yyyy-mm-dd, with subfolders ../private and ../public.

Modules

 * EzLib.pm (now at EZ ~/lib)
 * ../SquidCountArchiveProcessLogRecord.pm
 * ../SquidCountArchiveReadInput.pm
 * ../SquidCountArchiveWriteOutput.pm

Phases

 * Pass/Phase 1: collect frequencies for all ip addresses, needed in phase 2 to filter addresses which are most likely from bots. A frequency > 1 in the 1:1000 sampled squid log stands for 1000+ real views/edits; especially for edits this most likely means a bot (a few false positives are accepted).
 * Pass/Phase 2: collect all other counts
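The two passes can be sketched as follows (a minimal illustration with hypothetical record and field names, not the script's actual Perl code):

```python
from collections import Counter

def pass1_ip_frequencies(records):
    """Pass 1: count how often each client ip occurs in one day's sampled log."""
    return Counter(r["client_ip"] for r in records)

def is_probable_bot_edit(record, ip_freq):
    """Pass 2 helper: an ip seen more than once in the 1:1000 sampled log
    stands for 1000+ real requests; for edits this most likely means a bot
    (a few false positives are accepted)."""
    return ip_freq[record["client_ip"]] > 1

# hypothetical sampled edit records for illustration
records = [
    {"client_ip": "10.0.0.1"},
    {"client_ip": "10.0.0.1"},
    {"client_ip": "10.0.0.2"},
]
freq = pass1_ip_frequencies(records)
```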

Arguments

 * -t for test mode (run counts for a short timespan)
 * -d date range, absolute or relative; specify one value for a single day, two values for a range of days
   * -d yyyy/mm/dd[-yyyy/mm/dd] (slashes are optional). Best for generating counts for a specific day or period.
   * -d -m[-n], where m and n are numbers of days before today. Best for daily cron scheduling (e.g. -d 1-7 => run for the last 7 days before today, skipping days for which output already exists).
 * -f [1|2|12|21] to force a rerun of phase 1 and/or 2, even when that phase completed successfully earlier
   * phase 1 is considered complete when SquidDataIpFrequencies.csv.bz2 exists in ../yyyy-mm/yyyy-mm-dd
   * phase 2 is considered complete when #Ready exists in ../yyyy-mm/yyyy-mm-dd
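A sketch of how the -d argument could be interpreted (a hypothetical Python rendering; the exact syntax accepted by SquidCountArchive.pl may differ, e.g. in how relative ranges are written):

```python
import re
from datetime import date, timedelta

def parse_date_range(arg, today=None):
    """Parse a -d value: absolute 'yyyy/mm/dd[-yyyy/mm/dd]' (slashes
    optional) or relative 'm[-n]' days before today. Returns (start, end).
    Sketch only; the real Perl argument handling may differ in detail."""
    today = today or date.today()

    def absolute(s):
        digits = re.sub(r"\D", "", s)
        return date(int(digits[:4]), int(digits[4:6]), int(digits[6:8]))

    parts = arg.strip().lstrip("-").split("-")
    if all(len(re.sub(r"\D", "", p)) < 4 for p in parts):
        # relative: m[-n] days before today
        days = sorted(int(p) for p in parts)
        return today - timedelta(days=days[-1]), today - timedelta(days=days[0])
    # absolute: one date, or a 'first-last' range
    return absolute(parts[0]), absolute(parts[-1])
```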

Input

 * 1:1000 sampled squid log files from stat1002:/srv/squid/archive/sampled


 * Format (see also wikitech)
   * 1. Hostname
   * 2. Sequence number
   * 3. Current time in ISO 8601 format (plus milliseconds), according to the squid server's clock
   * 4. Request time in ms
   * 5. Client IP
   * 6. Squid request status, HTTP status code
   * 7. Reply size including HTTP headers
   * 8. Request method (GET/POST etc.)
   * 9. URL
   * 10. Squid hierarchy status, peer IP
   * 11. MIME content type
   * 12. Referer header
   * 13. X-Forwarded-For header
   * 14. User-Agent header

On file read, country codes are added from a local copy of MaxMind's free GeoLite Country database as follows: open IN, "-|", "gzip -dc $file_in | /usr/local/bin/geoiplogtag 5"; (author of geoiplogtag: Mark Bergsma); the argument 5 means the 5th field contains the ip address.
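Putting the field list together, a log line can be split into named fields like this (a sketch; the field names are mine, and the way geoiplogtag marks the country on field 5 is not modeled here):

```python
# Hedged sketch: split one sampled squid log line into the 14 documented
# fields. The user-agent may contain spaces, so it is kept as the remainder.
FIELDS = [
    "hostname", "sequence", "timestamp", "request_time_ms", "client_ip",
    "squid_status", "reply_size", "method", "url", "hierarchy_status",
    "mime_type", "referer", "x_forwarded_for", "user_agent",
]

def parse_log_line(line):
    parts = line.rstrip("\n").split(" ", len(FIELDS) - 1)
    return dict(zip(FIELDS, parts))
```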

Output

 * Most files are written to sub folder ../yyyy-mm/yyyy-mm-dd
 * SquidDataLogfileHeadTail.csv and SquidDataSequenceNumbersAllSquids.csv are reused between runs and updated in sub folder ../yyyy-mm


 * Notes
 * Files marked as private can contain data covered by the Wikimedia Privacy Policy, such as ip addresses and/or referers. These files should never be published! They should also be deleted after the maximum retention period set by our privacy policy.
 * Count is the number of occurrences in one day's sampled log; for actual traffic counts, scale by a factor of 1000
 * Country codes are as used by MaxMind's free GeoLite Country service
 * There are several reasons why a requester is classified as a bot:
   * the agent string contains a url, which by convention should only occur for crawlers (exception: Embedded Web Browser from: http://bsalsa.com/)
   * the name contains bot/crawler/parser/spider
   * for edits only: the ip address occurs more than once in the sampled log on one day (a few false positives are accepted)
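The three bot heuristics above can be sketched as follows (illustrative names; the real checks live in the Perl modules and may differ in detail):

```python
import re

# names containing these substrings are treated as crawlers by convention
BOT_NAME_PATTERN = re.compile(r"bot|crawler|parser|spider", re.IGNORECASE)

def classify_as_bot(user_agent, is_edit=False, ip_seen_count=1):
    """Sketch of the documented bot heuristics (function name hypothetical):
    1. agent string contains a url (exception: bsalsa.com embedded browser)
    2. agent name contains bot/crawler/parser/spider
    3. for edits only: ip occurred more than once in one day's sampled log
    """
    if "bsalsa.com" not in user_agent and re.search(r"https?://", user_agent):
        return True
    if BOT_NAME_PATTERN.search(user_agent):
        return True
    if is_edit and ip_seen_count > 1:
        return True
    return False
```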


 * # Ready
 * This file signals successful completion of the script's phase 2: collecting all counts except ip addresses.
 * On rerun, phase 2 will not be redone when this file exists, unless argument -f 2 (force) is specified.


 * SquidDataAgents.csv
 * Free-format agent strings as sent by browsers
 * Agent string
 * Count


 * SquidDataBinaries.csv
 * Image file name
 * Count


 * SquidDataClients.csv
 * Category
 * - Desktop client
 * M Mobile client
 * E Engine (if Gecko or AppleWebKit)
 * G,- Group (higher aggregation level of desktop clients)
 * G,M Group (higher aggregation level of mobile clients)
 * Browser (client) brand and version
 * Share of total within category


 * SquidDataClientsByWiki.csv
 * Category
 * - Desktop client
 * M Mobile client
 * Client (~ browser) brand and version
 * Project (wikipedia, etc)
 * Language code
 * Share of total within category


 * SquidDataCountriesViews.csv
 * Bot (Y/N) (see notes)
 * Project/Language (e.g. Wikipedia English = 'wp:en')
 * Country code (as used by MaxMind)
 * Count


 * SquidDataCountriesSaves.csv
 * Bot (Y/N) (see notes)
 * Project/Language (e.g. Wikipedia English = 'wp:en')
 * Country code (as used by MaxMind)
 * Count


 * SquidDataCountriesViewsTimed.csv
 * Bot (Y/N) (see notes)
 * Project/Language (e.g. Wikipedia English = 'wp:en')
 * Country code
 * Count


 * SquidDataCrawlers.csv


 * SquidDataEditsSaves[..etc..].txt.bz2 (private file)
 * Captures all records describing edit/submit events for later analysis (will become obsolete when 1:1 non-sampled capture is reactivated)


 * SquidDataExtensions.csv
 * File extension
 * Count


 * SquidDataGoogleBots.csv
 * Visits by googlebot from known Google ip addresses (note: this list needs a manual update every now and then!)
 * Date
 * Ip range (3 triplets)
 * Count


 * SquidDataImages.csv
 * File size range (per 20 bytes)
 * Count


 * SquidDataIndexPhp.csv
 * Counts for edits and submits
 * Bot (Y/N) (see notes)
 * Project/Language (e.g. Wikipedia English = 'wp:en') target wiki
 * Project/Language (e.g. Wikipedia English = 'wp:en') referring wiki
 * Squid status code e.g. 'TCP_MISS/302'
 * Mime type e.g. 'text/html'
 * Arguments (with values for some parameters blocked out) e.g. 'action=edit&title=..'
 * Count


 * SquidDataIpFrequencies[..etc..].csv (private file)
 * Generated in pass/phase 1; used among other things to flag multiple edits from the same ip address on one day as bot edits (input is a 1:1000 sampled file, a few false positives are accepted). Also contains some distribution stats: x addresses occur y times, x addresses occur y+ times.
 * Count
 * Ip address


 * SquidDataLanguages.csv
 * For Chrome, Firefox, Opera, Safari
 * Browser name
 * Language code
 * Count


 * SquidDataLogFilesHeadTail.csv
 * Caches timestamps found in the first and last record of sampled log files /a/squids/archive/sampled-1000.log-yyyymmdd.gz
 * Collecting these from a compressed log file is slow


 * SquidDataMethods.csv
 * Action e.g. GET
 * Status e.g. TCP_MISS/200
 * Count


 * SquidDataOpSys.csv
 * Category
 * - Desktop client
 * M Mobile client
 * G Group (higher aggregation level)
 * OS name + version
 * Count
 * Share of total (- and M together 100%, G separately)


 * SquidDataOrigins.csv
 * source (internal or external)
 * domain top level (external e.g. 'yahoo', internal e.g. 'wp:en')
 * domain e.g. '.com'
 * file type: image, (html) page, other (e.g. css)


 * SquidDataReferers[..etc..].txt (private file)
 * mainly for debugging (see comments in file)


 * SquidDataRequests.csv
 * (see also comments in file)
 * Project e.g. '@wp:en' for Mobile English Wikipedia
 * Referer e.g. 'www.google.ca'
 * Extension e.g. 'gif'
 * Mime type e.g. 'text/html'
 * Arguments (values for most arguments omitted; arguments sorted alphabetically for better grouping), e.g. 'query=..&search=..&x=..&y=..'
 * Count


 * SquidDataScripts.csv
 * Type: css, js or php
 * Script name e.g. 'print.css'
 * Arguments (values for most arguments omitted; arguments sorted alphabetically for better grouping), e.g. 'query=..&search=..&x=..&y=..'
 * Count


 * SquidDataSequenceNumbersAllSquids.csv
 * Ideally, if no UDP messages are lost, the average distance between squid sequence numbers should be 1000 (1:1000 sampled log)
 * This file helps to detect mass message loss and to autocorrect counts
 * Date
 * Hour
 * Events
 * Avg distance between sequence numbers
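The averaging, and the kind of correction this file enables, can be sketched as follows (the correction formula is my assumption, mirroring the file's stated purpose):

```python
def avg_sequence_distance(seq_numbers):
    """Average gap between consecutive sequence numbers from one squid.
    With a 1:1000 sampled log and no UDP loss this should be close to 1000."""
    gaps = [b - a for a, b in zip(seq_numbers, seq_numbers[1:])]
    return sum(gaps) / len(gaps)

def corrected_count(events, avg_distance):
    """If the average gap is e.g. 2000 instead of 1000, roughly half the UDP
    messages were lost; each logged event then stands for about avg_distance
    real requests (hypothetical autocorrection)."""
    return round(events * avg_distance)
```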


 * SquidDataSequenceNumbersPerSquidHour.csv
 * Ideally, if no UDP messages are lost, average distance between squid sequence numbers should be 1000 (1:1000 sampled log)
 * Squid id e.g. 'amssq31.esams.wikimedia.org'
 * Hour
 * Events
 * Total distance between sequence numbers
 * Avg distance between sequence numbers


 * SquidDataSkins.csv
 * Skin file path e.g. 'skins-1.5/common/ajax.js'
 * Count

Input
Csv files generated by SquidCountArchive.pl.

Output
Once a month, generates a large number of reports from the csv files generated by SquidCountArchive.pl. These reports are based on a 1:1000 sampled server log (squids), so all counts are multiplied by 1000. For an overview see this portal section.
 * How many files are requested each day? Breakdown by file type and target (sub)project
 * Where do those requests originate? Breakdown by file category and origin
 * Where do those requests land? Breakdown by file category and destination wiki
 * Which HTTP requests are issued? Breakdown by type and results
 * Which scripts are invoked? Breakdown by type (css, javascript, php), name and parameters
 * Which skin files are downloaded and how often?
 * Which crawlers access our servers? Breakdown by host, file types requested, agent string
 * Which operating systems do our clients use? Breakdown by platform, mobile or not, release
 * Which browsers are how popular? Breakdown by brand, revision level, mobile or not
 * Breakdown of all traffic that Google sends us, directly (crawlers) and indirectly (search results)