User:Erik Zachte (WMF)/Progress

For earlier history see User:Erik_Zachte/progress
 * to do: remove the last config files from squid reports and replace them with command-line parameters
 * to do: investigate duplication of page histories due to import of translated articles on other wiki (reported by Phoebe, Dec 6 2013)
 * to do: look into page view forecast algorithm, no longer sure how that works (and add some comments in the code)


 * week 16
 * Attended GLAM-WIKI conference (day 3)

(24.5 hrs)
 * week 15
 * Attended GLAM-WIKI conference (day 1,2)
 * Input for Nuria's/Andrew's talk proposal at NYC

(2.5 hrs) (14 1/4 hrs) (13 3/4 hrs)
 * week 14
 * Looked at how many images on commons are linked to or requested on a single day (for Jaime)
 * week 13
 * Analyzed unsampled edit logs, both old (webstatscollector) and new (hadoop), see
 * Published and announced new media file request count dumps
 * Collected metrics and analyzed anomaly (bug) for Swiss TV Request
 * Collected metrics for WHYY, the NPR affiliate station in Philadelphia
 * week 12
 * Analyzed unsampled edit logs, both old (webstatscollector) and new (hadoop), see

(9 hrs)
 * week 11
 * Report Card for March (delayed because of missing dumps)

(17 hrs)
 * week 10
 * After fixing T90230 last week, rsync of daily aggregates of page views still didn't happen. Turns out rsync now needs the --ipv4 parameter.
 * prerelease Wikistats (dumps for February not all in yet)
 * Report Card (ongoing) (data for comScore not yet accessible, subscription expired)
 * Data fact check for Communications

(20 hrs)
 * week 9
 * Removed translations for namespace 'User' from wikistats (some translations were incomplete, and buggy, and not really needed) per Amir's request
 * Fixed 'T90230: Daily aggregation of page view dumps stalled'. Filed a new bug for 'T90629: repairing the underlying hourly dumps'.
 * Analyzed 'T90240: Could it be that the geo IP matching is not accurate for Africa?' The answer is a big YES: ip->geo is faulty for squid logs processing since we changed to https. The real ip address is only available in the secure version of the message, thus edits are mostly assigned to WMF server locations. Full impact and fix to be determined.

(17 1/2 hrs) (25 1/4 hrs)
 * week 8
 * testing of new media file request dump
 * user requests (new log item):
 * [question] Raw file stats vs pageview API stats: (Jason Bub)
 * [question] [data] monthly per country view stats (Rütger Egolf, Research Assistant at Centre for European Economic Research)
 * [question] Explain how wikilinks are counted in wikistats (explained perl code) by
 * week 7
 * derive estimates for new quarterly report card from incomplete data (dumps have stalled) by extrapolation
 * adapt wikistats scripts to allow merge of total active editors for only those wikis which have data for the latest month
 * Provide total active editors (TAE) for December 2014
 * Report Edits for 2014 Oct-Dec
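
A minimal sketch of how such a restricted merge could work, written in Python for illustration (the wikistats scripts themselves are Perl and are not reproduced here). The input layout, the function name `merged_active_editors` and the idea of merging by set union of editor names are all assumptions, not taken from the actual scripts.

```python
from typing import Dict, Set

# Hypothetical input: wiki code -> month ("YYYY-MM") -> set of active editor names.
EditorsByMonth = Dict[str, Dict[str, Set[str]]]


def merged_active_editors(data: EditorsByMonth, latest_month: str) -> Dict[str, int]:
    """Count distinct active editors per month, using only wikis that
    already have data for latest_month (wikis with stalled dumps are skipped)."""
    complete = {wiki: months for wiki, months in data.items() if latest_month in months}
    merged: Dict[str, Set[str]] = {}
    for months in complete.values():
        for month, editors in months.items():
            # Union of names, so an editor active on two wikis counts once.
            merged.setdefault(month, set()).update(editors)
    return {month: len(editors) for month, editors in sorted(merged.items())}
```

Dropping incomplete wikis from every month, not just the latest one, keeps the month-to-month trend comparable, at the cost of undercounting the true total.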

(22 1/2 hrs)
 * week 6
 * partial publishing of RC input (dumps are lagging)
 * analyze progress of dump generation (by parsing index.html for 900+ wikis, for all available dump dates),
 * autonomous growth in dump sizes and job length can be shown
 * with a few further tweaks this scan could be run, say, every half hour, and also report on stalled dump jobs

(16 1/4 hrs) (8 3/4)
 * week 5
 * fixed 2 issues (coding & config glitch) which made Summary charts not update since Sep 2014, see e.g.
 * final tweaks (hopefully) for Wiki Loves Africa reporting
 * investigating the 5 percent of page views/edits from sampled squid logs which don't have country info (ongoing)
 * issues with dumps (lagging behind, ongoing)
 * reassessment of where we are with issues with media file request counts RFC
 * week 4
 * fixed wikivoyage report showing wikipedia counts for el/fa
 * rerun Wiki Loves Africa reporting (now using categories *and* templates to find all images)

(17 hrs)
 * week 3
 * analysis of maintenance categories on wp:en (req. Lila), first release published
 * finalized analysis of wp:en maintenance categories (req. Lila), see
 * adapted several scripts to use proxy on stat1002 from now on, see
 * added Persian and Greek wikivoyage and looked into extraordinary large page counts for those two wikis

(22 1/4 hrs)
 * week 2
 * Wiki Loves Africa reporting (ongoing, looking into discrepancies)
 * analysis of maintenance categories on wp:en (req. Lila), ongoing
 * most wikistats reporting broken due to recent config changes, several issues
 * stat1001 changed to private IP (Putty config fixed)
 * updated all bash files for new access to stat1001
 * daily aggregation of page views aborted due to trivial error -> Q&D fix

(1 hr)
 * week 53/2014 1/2015

(8 hrs)
 * week 52
 * misc maint.

(9 3/4 hrs)
 * week 51
 * end of year administrative housekeeping / reorg.

(13 3/4)
 * week 50
 * meetup with Europeana on how to proceed once media file requests counts are produced daily
 * looked into sudden overnight drop of 30k articles in the article count on no.wikipedia.org (seems to be a MediaWiki counter issue, not Wikistats)
 * mails

(18.5 hrs)
 * week 49
 * published traffic reports
 * adapted code for Medicine Translation Taskforce (which moved to google spreadsheet) (ongoing)
 * started to do daily/monthly aggregation of new hourly pageviews files from Hive successor of webstatscollector script (adapting existing script)
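
The aggregation step can be sketched roughly as follows, in Python for illustration only. The line layout `project title count bytes` is assumed from the older webstatscollector pagecounts files, and the function name `aggregate_hourly` is hypothetical; the existing script being adapted is not shown here.

```python
from collections import Counter
from typing import Iterable


def aggregate_hourly(hourly_files: Iterable[Iterable[str]]) -> Counter:
    """Sum view counts per (project, title) over all hourly inputs.

    Each line is assumed to look like 'project title count bytes'.
    """
    totals: Counter = Counter()
    for lines in hourly_files:
        for line in lines:
            parts = line.split()
            if len(parts) < 3 or not parts[2].isdigit():
                continue  # skip malformed lines instead of aborting the run
            totals[(parts[0], parts[1])] += int(parts[2])
    return totals
```

Feeding it the 24 hourly files of a day gives daily totals; feeding it the daily outputs of a month gives monthly totals in the same way.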

(12.5 hrs)
 * week 48
 * WLM reprise (as contest continued in Oct)
 * comScore rank reassessment for
 * GLAM media file stats
 * data/config maintenance

(10 1/4 hrs)
 * week 47
 * GLAM media file stats
 * data/config maintenance

(29 1/4 hrs)
 * week 46
 * preparing for GLAM hackathon: RFC media file requests dump
 * GLAM hackathon

(3.5 hrs)
 * week 45

(17 3/4 hrs)
 * week 44
 * WLM 2014 stats (partial, will complete after Nov data are available)
 * Report Card prep
 * traffic reports
 * many mail threads

(22 hrs)
 * week 43
 * GLAM media file stats

(17 hrs)
 * week 42
 * GLAM media file stats

(31 3/4 hrs) (9 3/4 hrs)
 * week 41
 * started to look into hive (a bit)
 * studied new hive implementation of webstatscollector:
 * convert webrequests to pagecounts
 * render the pagecounts files
 * render the projectcounts files
 * commented on new pageview defs
 * generalised filters
 * week 40
 * updated PediaPress stats (adding 22 months till Nov 2013)
 * updated mailing list scanner (new aliases)
 * investigate source of implausible rise in monthly page views, see Trello card
 * prep squid reports (ongoing)

(11.5 hrs)
 * week 39
 * some page view stuff
 * prep report card

(11.5 hrs) (18 3/4 hrs)
 * week 38
 * helped define functionality for webstatscollector 2.0
 * fixed bug 57376 missing country names on this squid report
 * week 37
 * published squid based reports
 * worked on mobile stats (perc mobile per country), see also blog post
 * added support for new MSIE user agent string format to squid scripts 64125
 * investigated bug 70721, proving it's a non-fix issue
 * investigated millions of pageviews for same article by one ip address (stuck F5 key)
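
A rough sketch of how such single-address floods could be spotted, in Python for illustration. It assumes the log has already been reduced to (ip, title) pairs; the function name `find_floods` and the threshold are hypothetical and not part of the squid scripts.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def find_floods(requests: Iterable[Tuple[str, str]], threshold: int) -> List[Tuple[str, str, int]]:
    """Return (ip, title, count) for every pair seen at least `threshold` times,
    largest count first."""
    counts = Counter(requests)
    return sorted(
        ((ip, title, n) for (ip, title), n in counts.items() if n >= threshold),
        key=lambda row: -row[2],
    )
```

With sampled logs the threshold would need to be scaled down by the sampling rate.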

(18 3/4 hrs)
 * week 36
 * cleanup on stat1001/2/3: many old files removed, triggered by Ariel's inventory

(19 3/4 hrs)
 * week 35
 * further research on pageviews from Africa, page views per country per language, see Google doc with charts
 * encoding issues in webstatscollector