User:Erik Zachte (WMF)/Progress

For earlier history see User:Erik_Zachte/progress
 * to do: remove the last config files from squid reports, and replace them with command-line parameters
 * to do: investigate duplication of page histories due to import of translated articles on other wiki (reported by Phoebe, Dec 6 2013)
 * to do: look into page view forecast algorithm, no longer sure how that works (and add some comments in the code)


 * week 25
 * https://phabricator.wikimedia.org/T117221 Update official Wikimedia press kit with accurate numbers
 * https://phabricator.wikimedia.org/T137984 Determine total number of external links in all Wikipedias (ball-park count found)
 * published Wikistats reports

(6 hrs)
 * week 24
 * https://phabricator.wikimedia.org/T136084#2356118 Unexpected increase in traffic for 4 languages in same region, on smaller projects (analysis ready, fix on hold)
 * https://phabricator.wikimedia.org/T137984 Determine total number of external links in all Wikipedias (ongoing)

(12 3/4 hrs)
 * week 23
 * https://phabricator.wikimedia.org/T136084#2356118 Unexpected increase in traffic for 4 languages in same region, on smaller projects

(4 1/2 hrs)
 * week 22
 * published Wikistats reports

(14 3/4 hrs)
 * week 21
 * https://phabricator.wikimedia.org/T126579 (resolved)

(7 1/4 hrs)
 * week 19
 * https://phabricator.wikimedia.org/T126579

(12 hrs)
 * week 18
 * https://phabricator.wikimedia.org/T126579

(12 3/4 hrs)
 * week 17
 * https://phabricator.wikimedia.org/T126579 Total page view numbers on Wikistats do not match new page view definition (ongoing)

(14 1/4 hrs)
 * week 16

(11 hrs)
 * week 15
 * server maintenance, backup scripts updated, lots of pruning of backups

(18 1/2 hrs)
 * week 14
 * published traffic by country reports for Mar 2016 and 2016 Q1
 * investigated why so many inactive Wikipedias seem active all of a sudden (turns out these are rather trivial seeding activities from at least two ip addresses: 73.182.28.179 130.254.150.79)
 * extra metrics on Sitemap page + set default sort order via url, see [1]
 * sent update on video view stats for Khan Academy

( 8 hrs)
 * week 13
 * expanded sitemap pages with extra metrics (e.g. ), taken from another lesser known table, as they deserve more prominence, and can help to set sane limits for inclusion criteria of active wikis

(11 1/4 hrs)
 * week 12
 * worked on sane limits for inclusion criteria of active wikis
 * feedback on new traffic reports

(13 1/4 hrs)
 * week 11
 * published Wikistats reports for February
 * published regional traffic reports for February
 * (only traffic reports for which Wikistats is still responsible, but which have been migrated to hadoop (aka webstatscollector 3.0))


 * pushed https://gerrit.wikimedia.org/r/#/c/278917/, but updated reports in http://stats.wikimedia.org/wikimedia/squids/ manually (global sed), to avoid unexpected side effects
 * committed many files to git (long overdue)
 * working on new survey, this time for dump based stats

(4 1/4 hrs)
 * week 10

(1 hr)
 * week 9

(8 3/4 hrs)
 * week 8
 * analyzed anomaly where incomplete dumps were parsed
 * analyzed T127359 Problems with Erik Zachte's Wikipedia Statistics (legend text is out of date)

(15 1/4 hrs)
 * week 7
 * several minor issues
 * fixed: Wikistats didn't yet know of all content namespaces on Wikisource
 * explained earlier glitch in previous release of Wikistats reports, see Facebook thread started by Jonathan
 * several more threads on Wikistats via Facebook

(10 3/4 hrs)
 * week 6
 * mediacounts for webm files in category Videos_from_Osmosis for Rishi Desai and James Heilman
 * custom data for WMNL for one-time mass mailing to recently very active users
 * Wikistats monthly reports
 * investigating glitch in previous release of Wikistats reports
 * press inquiry about total size of Wikipedia (Awuku, Yaw Boateng from German website t-online.de)

(9 1/4 hrs)
 * week 5
 * see week 5

(19 1/4 hrs)
 * week 4
 * working on script to collect bot free counts for all months before May 2015, so that we can patch our PV history, using a close approximation of new pageview definition

(3 1/4 hrs)
 * week 3
 * T124340: LIMN input file wikilytics_in_pageviews.csv no longer updated

(9 1/4 hrs)
 * week 2
 * WLA stats

(9 1/4 hrs)
 * week 1
 * T122864: Mediacounts missing top1000 files after 2016-01-01: rsync fails
 * T123477: Daily/monthly aggregation of hourly page view files halted

(5 1/2 hrs)
 * week 53

(5 hrs)
 * week 52
 * administrative

(11 1/2 hrs)
 * week 51
 * started to revive regional reports using new hourly hadoop-based csv files

(11 3/4 hrs)
 * week 50
 * migrated mail list stats from private server to WMF server, now at http://stats.wikimedia.org/mail-lists/ (+ minor fixes)

(13 1/4 hrs)
 * week 49
 * more on this (major) upgrade to Wikistats on infodisiac blog
 * upgrading scripts for webstatscollector 3.0 data feeds, subproject of T114379 Feed Wikistats traffic reports with aggregated hive data

(22 3/4 hrs)
 * week 48
 * upgrading scripts for webstatscollector 3.0 data feeds, subproject of T114379 Feed Wikistats traffic reports with aggregated hive data

(19 1/4 hrs) (38 3/4 hrs)
 * week 47
 * counts for Erasmus prize ceremony
 * vetting of stats.grok.se 2014 counts for Institute of War Documentation
 * upgrading scripts for webstatscollector 3.0 data feeds, subproject of T114379 Feed Wikistats traffic reports with aggregated hive data
 * week 46
 * new Q&D script to collect Wiki Loves Africa uploader stats via API instead of commons dump.
 * upgrading scripts for webstatscollector 3.0 data feeds, subproject of T114379 Feed Wikistats traffic reports with aggregated hive data

(29 3/4 hrs)
 * week 45
 * upgrading scripts for webstatscollector 3.0 data feeds, subproject of T114379 Feed Wikistats traffic reports with aggregated hive data

(21 1/2 hrs)
 * week 44
 * analyzed major anomalies in traffic reporting for wikibooks, wikinews, wikiquote, wikisource, wikiversity for July 2015
 * created https://phabricator.wikimedia.org/T116609 which led to
 * created https://phabricator.wikimedia.org/T116531 'view counts in squid logs, webstatscollector 2.0 and hive are very dissimilar for several projects'

(11 hrs)
 * week 43
 * partially answered T113406 Quantifying the "sum of all contributors"

(17 3/4 hrs) (9 3/4 hrs)
 * week 42
 * published monthly dump reports
 * working on data files for geo reports, see T114379
 * working on assessing reliability of US per state breakdown of page views
 * week 41

(24 hrs) (8.5 hrs)
 * week 40
 * created process flow diagrams + proposed changes for Wikistats Pageview Reports T114379
 * lots of discussions on this
 * week 39
 * server cleanup: deleted 877G of daily dammit.lt aggregates (keeping monthly ones as backup for
 * collected ad hoc metrics for T113683 English Wikipedia stats for 5 millionth article

(18 hrs)
 * week 38
 * getting to the point on 'Percentage pageviews from Russia is too low in recent geographical breakdowns in Wikistats' (to be documented online)
 * looking into extreme page view stats for China
 * discussed/contributed to
 * reliability of historic page views
 * making all page views reports from hive use same definitions
 * syndication
 * GSMA initiative to develop common stats for all mobile operators (Kalvin Bahia)
 * new forecasting report

(5 1/4 hrs)
 * week 37
 * data for Dario, for Lila
 * published monthly Wikistats reports
 * discussed/contributed to meaning of editor activity counts

(8 1/4 hrs)
 * week 36
 * added Goan Konkana Wikipedia (gomwiki)
 * further investigating Percentage pageviews from Russia is too low in recent geographical breakdowns in Wikistats
 * discussed/contributed to bot selection criteria

(10 hrs) (17 hrs)
 * week 35
 * published monthly Wikistats reports
 * collected some data for audit report, see
 * investigating way too low comScore stats
 * discussed/contributed to upcoming signpost article on editor trends, lots of miss/500 in recent days, WLM retention and new signups
 * week 34
 * debugged documentation for new hourly pageview files
 * investigating Percentage pageviews from Russia is too low in recent geographical breakdowns in Wikistats
 * added orwikisource to Wikistats
 * discussed/contributed to stats.grok.se upgrade, upcoming signpost blog on growth in monthly editors, GA stats using Wikimedia Stats (wikimedia-l), protocol independence on stats.wikimedia.org portal, missing browser reports, detailed geohack tools (wikitech-l), quarterly goals

(5 hrs)
 * week 33
 * further analysis of page views underreporting in June/July 2015

(11 1/4 hrs)
 * week 32
 * discussions on Wikistats 2.0
 * +4 entries in stats portal

(25 1/2 hrs)
 * week 31
 * JIT (?) workaround for corrupt stub dumps, needed soon for Quarterly Report Card
 * opened discussion (mail and wiki) on future of Wikistats traffic reports, see also and
 * updated all traffic reports with error notice for geo reports and call to vote to all reports
 * prepared data for Quarterly Report Card and Monthly Report Card

(9 3/4 hrs)
 * week 30
 * further investigating page views underreporting in June/July 2015

(6 3/4 hrs)
 * week 29
 * published squid log based view/edit reports for 2015-04/05/06
 * N/S report still suffers from a large percentage (~5%) not attributed to North or South (so didn't publish update)
 * investigating page views underreporting in June/July 2015

(1 1/4 hrs)
 * week 28

(3/4 hr)
 * week 27

(8.5 hrs)
 * week 26

(2 3/4 hrs)
 * week 25

(11 hrs)
 * week 24

(3 hrs)
 * week 23
 * analyzed bug T87738: Discrepancies in historical total active editor numbers

(7:30 hrs)
 * week 22
 * analyzed bug T101519: Wikistats undercounted redirects on non-English wikis since January (and hence overcounted normal articles)

(11 3/4 hrs)
 * week 21
 * wikistats script maint for stub dumps

(21:30 hrs)
 * week 20
 * stub dump woes, see also T89273
 * quarterly report JIT
 * editor trends as YoY, see blog post
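A year-over-year view compares each month with the same month one year earlier, which filters out seasonal swings. A minimal sketch of that calculation, assuming monthly counts keyed by "YYYY-MM" (the function name and the sample numbers are made up for illustration, this is not the Wikistats code):

```python
def year_over_year(monthly_counts):
    """Return {month: relative change vs. the same month one year earlier}.

    monthly_counts maps "YYYY-MM" strings to counts (hypothetical input format).
    """
    changes = {}
    for month, count in monthly_counts.items():
        year, mm = month.split("-")
        previous = f"{int(year) - 1:04d}-{mm}"
        base = monthly_counts.get(previous)
        if base:  # skip months that have no (non-zero) baseline a year earlier
            changes[month] = (count - base) / base
    return changes


# Made-up numbers, for illustration only.
sample = {"2014-04": 80000, "2015-04": 76000}
print(year_over_year(sample))
```

Months without a baseline are simply left out, so the first year of any series yields no YoY values.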

(14 3/4 hrs)
 * week 19
 * last tweaks on analyzing traffic to wikistats site
 * tweaked script to follow dump progress (woes with timely delivery continue)
 * report card

(12 3/4 hrs)
 * week 18
 * analyzing traffic to wikistats site
 * extra meetings, about reorganization

(22.15 hrs) (11.15 hrs)
 * week 17
 * reviving stalled squid reports
 * fixing maintenance scripts (backups)
 * week 16
 * Attended GLAM-WIKI conference (day 3)
 * Analysis of usage of Khan videos for James Heilman
 * reviving stalled squid reports (ongoing)

(24.5 hrs)
 * week 15
 * Attended GLAM-WIKI conference (day 1,2)
 * Input for Nuria's/Andrew's talk proposal at NYC

(2.5 hrs) (14 1/4 hrs) (13 3/4 hrs)
 * week 14
 * Looked at how many images on commons are linked to or requested on a single day (for Jaime)
 * week 13
 * Analyzed unsampled edits logs, both old (webstatscollector) and new (hadoop), see
 * Published and announced new media file request count dumps
 * Collected metrics and analyzed anomaly (bug) for Swiss TV Request
 * Collected metrics for WHYY, the NPR affiliate station in Philadelphia
 * week 12
 * Analyzed unsampled edits logs, both old (webstatscollector) and new (hadoop), see

(9 hrs)
 * week 11
 * Report Card for March (delayed because of missing dumps)

(17 hrs)
 * week 10
 * After fixing T90230 last week, rsync of daily aggregates of page views still didn't happen. Turns out rsync now needs the --ipv4 parameter.
 * prerelease Wikistats (dumps for February not all in yet)
 * Report Card (ongoing) (data for comScore not yet accessible, subscription expired)
 * Data fact check for Communications

(20 hrs)
 * week 9
 * Removed translations for namespace 'User' from wikistats (some translations were incomplete, and buggy, and not really needed) per Amir's request
 * Fixed 'T90230: Daily aggregation of page view dumps stalled'. Filed a new bug for 'T90629: repairing the underlying hourly dumps'.
 * Analyzed 'T90240: Could it be that the geo IP matching is not accurate for Africa?' The answer is a big YES: ip->geo is faulty for squid logs processing since we changed to https. The real ip address is only available in the secure version of the message, thus edits are mostly assigned to WMF server locations. Full impact and fix to be determined.

(17 1/2 hrs) (25 1/4 hrs)
 * week 8
 * testing of new media file request dump
 * user requests (new log item):
 * [question] Raw file stats vs pageview API stats: (Jason Bub)
 * [question] [data] monthly per country view stats (Rütger Egolf, Research Assistant at Centre for European Economic Research)
 * [question] Explain how wikilinks are counted in wikistats (explained perl code) by
 * week 7
 * derive estimates for new quarterly report card from incomplete data (dumps have stalled) by extrapolation
 * adapt wikistats scripts to allow merge of total active editors for only those wikis which have data for latest month
 * Provide total active editors (TAE) for December 2014
 * Report Edits for 2014 Oct-Dec

(22 1/2 hrs)
 * week 6
 * partial publishing of RC input (dumps are lagging)
 * analyze progress of dump generation (by parsing index.html for 900+ wikis, for all available dump dates),
 * autonomous growth in dump sizes and job length can be shown
 * with a few further tweaks this scan can be run, say, every half hour, and also report on stalled dump jobs
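A scan like this boils down to reading per-job status lines and flagging jobs that are still running but have not been updated for a while. A rough sketch of the stalled-job check, assuming a simplified status-line layout (the HTML pattern, the function name and the 12-hour limit are assumptions for illustration, the real index.html may look different):

```python
import re
from datetime import datetime, timedelta

# Assumed line layout, for illustration only; the real index.html may differ.
STATUS_LINE = re.compile(
    r"<li class='(?P<status>[\w-]+)'>"
    r"<span class='updates'>(?P<updated>[\d-]+ [\d:]+)</span>"
    r".*?<span class='title'>(?P<job>[^<]+)</span>"
)


def stalled_jobs(html, now, max_idle=timedelta(hours=12)):
    """Return titles of in-progress jobs whose last update is older than max_idle."""
    stalled = []
    for match in STATUS_LINE.finditer(html):
        updated = datetime.strptime(match["updated"], "%Y-%m-%d %H:%M:%S")
        if match["status"] == "in-progress" and now - updated > max_idle:
            stalled.append(match["job"])
    return stalled


# Made-up sample page with one finished and one long-running job.
sample = (
    "<li class='done'><span class='updates'>2015-02-01 03:00:00</span> "
    "<span class='title'>stub dump</span></li>\n"
    "<li class='in-progress'><span class='updates'>2015-02-01 04:00:00</span> "
    "<span class='title'>full history dump</span></li>\n"
)
print(stalled_jobs(sample, now=datetime(2015, 2, 3)))
```

Run periodically over all wikis, the same check would give the list of stalled dump jobs mentioned above.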

(16 1/4 hrs) (8 3/4)
 * week 5
 * fixed 2 issues (coding & config glitch) which made Summary charts not update since Sep 2014, see e.g.
 * final tweaks (hopefully) for Wiki Loves Africa reporting
 * investigating 5 percent of page views/edits from sampled squid logs which don't have country info (ongoing)
 * issues with dumps (lagging behind, ongoing)
 * reassessment of where we are with issues with media file request counts RFC
 * week 4
 * fixed wikivoyage report showing wikipedia counts for el/fa
 * rerun Wiki Loves Africa reporting (now using categories *and* templates to find all images)

(17 hrs)
 * week 3
 * analysis of maintenance categories on wp:en (req. Lila), first release published
 * finalized analysis of wp:en maintenance categories (req. Lila), see
 * adapted several scripts to use proxy on stat1002 from now on, see
 * added Persian and Greek wikivoyage and looked into extraordinarily large page counts for those two wikis

(22 1/4 hrs)
 * week 2
 * Wiki Loves Africa reporting (ongoing, looking into discrepancies)
 * analysis of maintenance categories on wp:en (req. Lila), ongoing
 * most wikistats reporting broken due to recent config changes, several issues
 * stat1001 changed to private IP (Putty config fixed)
 * updated all bash files for new access to stat1001
 * daily aggregation of page views aborted due to trivial error -> Q&D fix

(1 hr)
 * week 53/2014 1/2015

(8 hrs)
 * week 52
 * misc maint.

(9 3/4 hrs)
 * week 51
 * end of year administrative housekeeping / reorg.

(13 3/4)
 * week 50
 * meetup with Europeana on how to proceed once media file requests counts are produced daily
 * looked into overnight sudden drop in article count on no.wikipedia.org of 30k articles (seems Mediawiki counter issue, not Wikistats)
 * mails

(18.5 hrs)
 * week 49
 * published traffic reports
 * adapted code for Medicine Translation Taskforce (which moved to google spreadsheet) (ongoing)
 * started to do daily/monthly aggregation of new hourly pageviews files from Hive successor of webstatscollector script (adapting existing script)
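Rolling hourly counts up into daily and monthly totals is a matter of summing per date prefix. A minimal sketch, assuming each hourly record is a (timestamp, project, views) tuple (this record layout and the function name are assumptions for illustration, not the format of the actual hourly files or of the existing script):

```python
from collections import Counter
from datetime import datetime


def aggregate(hourly_rows):
    """Sum hourly view counts into per-day and per-month totals per project."""
    daily, monthly = Counter(), Counter()
    for hour, project, views in hourly_rows:
        daily[(hour.strftime("%Y-%m-%d"), project)] += views
        monthly[(hour.strftime("%Y-%m"), project)] += views
    return daily, monthly


# Made-up rows, for illustration only.
rows = [
    (datetime(2014, 12, 1, 0), "en", 100),
    (datetime(2014, 12, 1, 1), "en", 150),
    (datetime(2014, 12, 2, 0), "en", 50),
    (datetime(2014, 12, 1, 0), "nl", 10),
]
daily, monthly = aggregate(rows)
print(daily[("2014-12-01", "en")], monthly[("2014-12", "en")])
```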

(12.5 hrs)
 * week 48
 * WLM reprise (as contest continued in Oct)
 * comScore rank reassessment for
 * GLAM media file stats
 * data/config maintenance

(10 1/4 hrs)
 * week 47
 * GLAM media file stats
 * data/config maintenance

(29 1/4 hrs)
 * week 46
 * preparing for GLAM hackathon: RFC media file requests dump
 * GLAM hackathon

(3.5 hrs)
 * week 45

(17 3/4 hrs)
 * week 44
 * WLM 2014 stats (partial, will complete after Nov data are available)
 * Report Card prep
 * traffic reports
 * many mail threads

(22 hrs)
 * week 43
 * GLAM media file stats

(17 hrs)
 * week 42
 * GLAM media file stats

(31 3/4 hrs) (9 3/4 hrs)
 * week 41
 * started to look into hive (a bit)
 * studied new hive implementation of webstatscollector:
 * convert webrequests to pagecounts
 * render the pagecounts files
 * render the projectcounts files
 * commented on new pageview defs
 * generalised filters
 * week 40
 * updated PediaPress stats (adding 22 months till Nov 2013)
 * updated mailing list scanner (new aliases)
 * investigate source of implausible rise in monthly page views, see Trello card
 * prep squid reports (ongoing)

(11.5 hrs)
 * week 39
 * some page view stuff
 * prep report card

(11.5 hrs) (18 3/4 hrs)
 * week 38
 * helped define functionality for webstatscollector 2.0
 * fixed bug 57376 missing country names on this squid report
 * week 37
 * published squid based reports
 * worked on mobile stats (perc mobile per country), see also blog post
 * added support for new MSIE user agent string format to squid scripts 64125
 * investigated bug 70721, proving it's a non-fix issue
 * investigated millions of pageviews for same article by one ip address (stuck F5 key)
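A pattern like a stuck F5 key shows up as one (ip, article) pair with an implausibly high request count. A small sketch of how such pairs could be surfaced, assuming the log has already been reduced to (ip, title) tuples (the function name, the threshold and the sample data are made up for illustration; the addresses are from the documentation range):

```python
from collections import Counter


def suspicious_pairs(requests, threshold):
    """Return {(ip, title): count} for pairs requested at least `threshold` times."""
    counts = Counter(requests)
    return {pair: n for pair, n in counts.items() if n >= threshold}


# Made-up sample: one client hammering the same article.
requests = [("192.0.2.1", "Main_Page")] * 5 + [
    ("192.0.2.2", "Main_Page"),
    ("192.0.2.1", "Other_Page"),
]
print(suspicious_pairs(requests, threshold=5))
```

A sensible threshold depends on sampling rate and time window, so it is left as a parameter here.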

(18 3/4 hrs)
 * week 36
 * cleanup on stats1001/2/3: many old files removed, triggered by Ariel's inventory

(19 3/4 hrs)
 * week 35
 * further research on pageviews from Africa, page views per country per language, see Google doc with charts
 * encoding issues in webstatscollector