User:Erik Zachte/progress


 * to do: start wikistats run for wp:de based on full archive dump to help address heavy debate on wp:de (needs code fix, as it needs to process 'new' partial dump files)
 * to do: remove last config files from squid reports and replace them with command-line parameters
 * to do: finalize monthly reporting of unsampled edits from squid log (rather than yearly avg)
 * to be resumed: analyze effects of worldwide switch to https on 28 August on squid log stats
 * to do: investigate duplication of page histories due to import of translated articles on other wiki (reported by Phoebe, Dec 6 2013)
 * to do: look into page view forecast algorithm, no longer sure how that works (and add some comments in the code)
 * to do: comment on metrics definitions
 * to do: add comment to page view report https://bugzilla.wikimedia.org/show_bug.cgi?id=57980#c6


 * week 17
 * collect edits per editor per wiki per month per namespace for 800+ wikis
 * prototype detection of editor migration patterns
 * with extensive help from Andrew Otto, fixed connection problems that resulted from a server switch

(20 3/4 hrs)
 * week 16
 * finalized portal search and blogged about it
 * analyzed how many usernames with 'bot' in name are still on purpose not regarded as bots by Wikistats
 * with great help from Christian got git working, reordered structure, committed many changes (more to do)

(7 3/4 hrs)
 * week 15
 * fixed bug 63879: Incomplete monthly aggregated page view files
 * fixed bug 62230: Total edits on wikidata seems too low

(20 1/4 hrs)
 * week 14
 * generated data files for monthly Report Card
 * generated new squid based reports (some URLs in the portal need to be updated to quarterly versions)
 * last review of UN report for David Souter unearthed serious data errors, to be discussed

(19 hrs)
 * week 13
 * further consultation with David Souter for report commissioned by UN
 * finalized: bug 60826 Enable parallel processing of stub dump and full archive dump for same wiki.
 * file StatisticsMonthly.csv is copied hourly from stat1 to stat1001 under new name
 * WikiReportsOutputTables.pm now reads both StatisticsMonthly.csv and StatisticsMonthlyFullArchive.csv and uses different columns from each file
 * maintenance of wikistats portal to enhance upcoming search, also added several sections
 * submission of presentation proposal for Wikimania
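
The week 13 item on reading both StatisticsMonthly.csv and StatisticsMonthlyFullArchive.csv can be illustrated with a minimal sketch. It is written in Python rather than the Perl of WikiReportsOutputTables.pm and is not the actual implementation. The header row, the key fields (`wiki`, `month`), and the column names (`editors`, `words`) are assumptions made for illustration only; the real file layout is not described here.

```python
import csv
import io


def read_rows(csv_text, key_fields):
    """Index CSV rows by a tuple of key fields (assumes a header row)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {tuple(row[k] for k in key_fields): row for row in reader}


def merge_columns(stub_text, full_text, key_fields, stub_cols, full_cols):
    """Take stub_cols from the stub-dump file and full_cols from the
    full-archive file, matching rows on key_fields."""
    stub = read_rows(stub_text, key_fields)
    full = read_rows(full_text, key_fields)
    merged = {}
    for key, row in stub.items():
        out = {col: row[col] for col in stub_cols}
        full_row = full.get(key)
        for col in full_cols:
            # None marks months the full-archive run has not covered yet
            out[col] = full_row[col] if full_row is not None else None
        merged[key] = out
    return merged


# Hypothetical sample data, not taken from the real files
STUB_CSV = "wiki,month,editors,words\nde,2014-03,100,0\n"
FULL_CSV = "wiki,month,editors,words\nde,2014-03,0,5000\n"

result = merge_columns(
    STUB_CSV, FULL_CSV, ("wiki", "month"), ["editors"], ["words"]
)
print(result)
```

Keying both files on the same fields lets the stub-dump run supply the columns it can compute, while columns that need full article text come from the full-archive run.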

(19 1/4 hrs) (23 3/4 hrs) (18 1/4 hrs)
 * week 12
 * prioritized 46 bugs
 * in progress: bug 60826 Enable parallel processing of stub dump and full archive dump for same wiki.
 * new argument -F to force processing full archive dumps (regardless of dump size)
 * Wikistats can now handle segmented dumps (which, by the way, differ in file name for wp:de and wp:en), see first 100 or so lines in about meta-history dump files
 * Wikistats can detect error messages in index.html (where a msg about phase completion contains 'failed')
 * in progress: bringing all scripts up to date for changed config (wikistats portal from dataset2 -> dataset1001)
 * kick-off follow-up study deduplicated editors
 * week 11
 * finalizing search facility
 * added comments (some boilerplate), impact assessment, and proposed priorities for 46 wikistats bugs; no in the list are three marked as resolved
 * published squid log reports for February
 * fixed bug 61420: Missing stats for zh.wikivoyage
 * processed mail backlog
 * week 10
 * generated deduplicated editor counts for several sets of wikis
 * built search facility for wikistats portal

(11 3/4 hrs) (7.5 hrs)
 * week 9
 * fixed bug https://bugzilla.wikimedia.org/show_bug.cgi?id=61929
 * updated unique active editors for all Wikimedia wikis report for Economist
 * fixed bug https://bugzilla.wikimedia.org/show_bug.cgi?id=62044
 * generated data files for monthly Report Card
 * week 8
 * consulted David Souter on how to use wikistats for report commissioned by UN (on developments in Internet content and language issues since the World Summit on the Information Society (WSIS) in 2003)
 * worked with Magnus Manske to assess usability of monthly aggregated page view files for his scripts

(1 3/4 hrs)
 * week 7

(14 hrs)
 * week 6
 * produced trend charts for google traffic by country
 * produced reports for smallest Wikipedias on request (normally not generated when articles and monthly edits are below threshold of 10)
 * final analysis of page view trends
 * provided data for NYT on page view trends

(18 3/4 hrs)
 * week 5
 * produced trend charts for crawler patterns
 * ongoing analysis of page view trends

(18 3/4 hrs) (23 3/4 hrs)
 * week 4
 * produced long term browser trend charts (mobile/non mobile as well as absolute/relative) from squid log based csv files
 * reran squid log reports with bogus traffic filtered out (Jul-Dec 2013)
 * looking into doing the same for crawler patterns
 * generated input for monthly report card (incl minor bug fixing)
 * week 3
 * continued to analyze low page view counts, also from squid logs
 * produced breakdowns of article traffic by directly analyzing squid logs with grep
 * see last pages of

(24.5 hrs) (3 1/ hrs) (5 hrs) (12 1/4 hrs) -> caused by wikistats skipping pages where checksum is missing in dumps -> rerunning all dumps (13 1/2 hrs) (24 1/4 hrs) (18 1/4 hrs) (12 1/4 hrs)
 * week 2
 * prepared files for Limn
 * incl. fix to circumvent a Limn bug, where Limn does not know how to handle empty values for Wikidata
 * incl. fix to accept new standardized file names for comScore csv files
 * fixed missing wikis from dump reports (complaint by language team)
 * there had been a design flaw since API querying was added in July 2013: a circular dependency prevented new codes added to dblist files from being incorporated; after the fix, two new wikis finally got coverage:
 * Vietnamese Wikivoyage, e.g.
 * Minangkabau Wikipedia, e.g.
 * updated monthly merged page view files + prepped top views reports, e.g. wp:en
 * instructed scripts to ignore input for Jan 5/6 2014 (totals will be extrapolated from remainder)
 * done WikiCountsSummarizeProjectCounts.pl, collects counts for page view reports, reran reports
 * done SquidCollectBrowserStatsExcel.pl
 * other scripts (daily/monthly merge of dammit.lt files) are automatically doing that with hourly precision
 * any other scripts to do? hmm, pondering
 * fixed page view counts shown in Summary reports (for Sara Lasner), e.g. Greek Wikipedia now shows page view count for same month as other data in the report
 * added trend line for mobile page views and combined mobile+non-mobile to Summary reports, e.g. Japanese Wikipedia
 * fixed publication of patched projectcount files
 * started to analyze low page view counts, also from squid logs
 * week 1
 * mostly vacation
 * published squid based page view/edit reports
 * published monthly wikistats dump based reports
 * week 52
 * mostly vacation
 * transforming yearly page view/edit reports with yearly averages into monthly reports, last month only
 * week 51
 * solved bug: Italian Wikivoyage page count in Wikistats seems too low
 * week 50
 * finalized patch (see week 49)
 * published squid reports
 * contributed to new metric definitions
 * week 49
 * built script to patch project files from pagecount files (per wiki, since June 1 2013) to subtract counts for bogus page views
 * quick charts on total (very) active editors and how those metrics drop on Wikipedia faster than on other projects
 * patched project files
 * assessment of download size for full wikipedia for journalist
 * in depth analysis of impact of patch
 * week 48
 * investigated with Christian the issue of inflated page views by webstatscollector bug
 * prep comScore files for RC
 * file name normalization of hundreds of inconsistently named historic comScore files
 * week 47
 * published monthly Wikistats reports
 * prepped data for Limn except comScore data (subscription stalled again)
 * ongoing: discussions on metrics definitions
 * marked bug 46289 as resolved (see wk 46)
 * deactivated squid based report Devices and removed links to it
 * several minor fixes on squid reports (layout, update time)

(15 hrs) (14.5 hrs)
 * week 46
 * published new geo breakdown reports based on unsampled squid logs
 * also updated chart on edits breakdown global N vs S (added to squid report portal)
 * urgent: updated chart for Sue on UV trends for news sites vs Wikimedia (based on a patchwork of yearly comScore data)
 * also created new charts on top reference sites
 * got mailing list stats running again (stalled since Feb 13)
 * (two open issues: look at gap in summer 13, apply Nemo's patch after gerrit sync issue has been fixed)
 * week 45
 * prepared input for Monthly Report Card (minus comScore data, subscription renewal is ongoing)
 * updated input for Monthly Report Card (after comScore subscription renewal)
 * minor: Wikistats Overview diagram is now public (linked from Wikistats portal About page)
 * analyzed drop in mobile page views in recent months on English Wikipedia (and others) vs steep rise in non-mobile page views (it turns out the rise in non-mobile is far too large for any possible underreporting on mobile)
 * ongoing: analyze effects of worldwide switch to https on 28 August on squid log stats
 * published squid based reports

(28 1/4 hrs)
 * week 44
 * published monthly wikistats reports
 * helped analyze drop in total active editors for Sep 2013 (probably seasonal (=within normal range) after all)
 * ongoing: analyze effects of worldwide switch to https on 28 August on squid log stats
 * made WLM data more visible in Commons report
 * fixed bugzilla bug 55558: new Wikivoyage logo on wikistats portal
 * ongoing: publish input for report card

(11.5 hrs) (11 hrs)
 * week 43
 * input for cohort analysis to Daimee
 * adapted squid based edit reports: explanatory texts and final counts now depend on a new argument, sample rate
 * reran squid based edit reports from 1:1 unsampled edit log
 * reran edit(or) counts for Sarikas, with updated title list
 * week 42
 * collect data from German dump for external researcher (Dr Sarikas)
 * new script to build new filtered full archive dump based on discrete list of article titles
 * new script to collect edits/editors count (registered,anon,bot) from full archive dump

(21 hrs)
 * week 41
 * updated (overdue) monthly squid based reports
 * large cluster of reports on page views/edits per country are back after 6 months (more to do, see week 43)
 * squid based data collection now based on 1:1 instead of 1:1000 log files for page edits
 * fixed https://bugzilla.wikimedia.org/show_bug.cgi?id=55528
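
The move from 1:1000 sampled logs to 1:1 logs for page edits changes how raw counts turn into reported totals. A minimal sketch of that arithmetic follows, in Python rather than the Perl of the actual scripts; the function name and the error handling are hypothetical and only illustrate the idea of a sample rate argument.

```python
def estimate_total(counted_events, sample_rate):
    """Scale a count taken from a 1:sample_rate log to an estimated total.

    sample_rate=1000 means one in every 1000 requests was logged;
    sample_rate=1 means the log is unsampled and the count is exact.
    """
    if sample_rate < 1:
        raise ValueError("sample_rate must be 1 or greater")
    return counted_events * sample_rate


# 42 edits seen in a 1:1000 log suggest roughly 42,000 edits overall
print(estimate_total(42, 1000))
# the same month counted from a 1:1 log needs no scaling
print(estimate_total(41873, 1))
```

With a 1:1000 log every logged edit stands for about a thousand real ones, so small wikis get very coarse estimates; the unsampled log removes that rounding.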

(11 3/4 hrs)
 * week 40
 * worked on squid reports (ongoing)