Wikistats/WLM stats

This is the procedure to collect information about editors who contributed to the Wiki Loves Monuments (WLM) contest(s).

The procedure is rather hackish, as it is run only once a year and had to be implemented in limited time.

First step: collect usernames of contributing users

This is done by user Platonides, as follows:

 time sql commonswiki_p "SELECT DISTINCT user_name FROM u_platonides_wlm_p.wlm2013 JOIN user ON (user_id=wlm_author) WHERE wlm_source = 'commons';">  wlmUsernames.txt

result: wlmUsernames.txt


Or, if we restrict to valid submissions (182 fewer users):

 time sql commonswiki_p "SELECT DISTINCT user_name FROM u_platonides_wlm_p.wlm2013 JOIN user ON (user_id=wlm_author) WHERE wlm_source = 'commons' AND wlm_status='Participating';">  wlmParticipatingUsernames.txt

result: wlmParticipatingUsernames.txt
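
The difference between the two lists can be inspected with a short script. This is only a minimal sketch, assuming each result file holds one username per line and that the `sql` wrapper may write a `user_name` header line (which the sketch skips). The function names are illustrative and not part of the original procedure.

```python
from pathlib import Path


def read_usernames(path):
    """Return the set of usernames in a file with one name per line."""
    names = set()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        name = line.strip()
        # Skip blank lines and the column header, if one was written.
        # Assumption: no real account is literally called "user_name".
        if name and name != "user_name":
            names.add(name)
    return names


def non_participating(all_path, participating_path):
    """Usernames in the full list that are absent from the 'Participating' list."""
    return read_usernames(all_path) - read_usernames(participating_path)
```

Called as `non_participating("wlmUsernames.txt", "wlmParticipatingUsernames.txt")`, this returns the users who uploaded but have no submission with status 'Participating'.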

Second step: collect edits for all contributing users

This is done as a side task during the editor deduplication step (where edits per user, per wiki, per month, per namespace are merged). There are no run-time arguments: this filtering step, and the export to WLM-specific csv files, is always performed. Even the selection pattern for the input files is hard-coded. The script (WikiCounts.pl -y ...) reads names from the files WLM_uploaders_2010.txt, WLM_uploaders_2011.txt, and so on for later years. To collect all edits for editors in one WLM year only, replace the other input files with empty files. (Yes, this is quick and dirty.)
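
Replacing the other years' input files by empty files can be scripted. The following is a rough sketch, not part of the existing tooling: the function name, the `backup_dir` argument, and the idea of copying the originals aside first are all assumptions made for illustration. Only the `WLM_uploaders_yyyy.txt` file name pattern comes from this page.

```python
import shutil
from pathlib import Path


def blank_other_years(folder, backup_dir, keep_year, years):
    """Empty every WLM_uploaders_<year>.txt in folder except the one for keep_year.

    Existing files are copied to backup_dir first (hypothetical location),
    so the full lists can be restored after the run.
    """
    folder, backup_dir = Path(folder), Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    blanked = []
    for year in years:
        if year == keep_year:
            continue
        path = folder / f"WLM_uploaders_{year}.txt"
        if path.exists():
            shutil.copy2(path, backup_dir / path.name)  # keep the original list
        path.write_bytes(b"")  # truncate, or create an empty file if missing
        blanked.append(path.name)
    return blanked
```

The backup directory should lie outside the csv folder, so that the hard-coded input file pattern cannot pick up the copies.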

 bash folder: 
 stat1002:/a/wikistats_git/dumps/bash or  
 stat1002:/a/home/ezachte/wikistats/dumps/bash (beta env)

 bash file: count_merge_editors.sh
 input: 
 stat1002:/a/wikistats_git/dumps/csv/csv_mw/WLM_uploaders_yyyy.txt (where yyyy is '2010', '2011', etc.) 
 plus usual input for deduplication process: 
 stat1002:/a/wikistats_git/dumps/csv/csv_[xx]/EditsBreakdownPerUserPerMonth[lang].csv 
   where xx is project code, and lang is language code (e.g. EN,DE,FR) or COMMONS, META etc.
 output:
 folder: stat1002:/a/wikistats_git/dumps/csv/csv_mw 
 files: 
 WLM_Uploaders_EditsBreakdownPerUserPerMonth.csv
   layout:  username,yyyy-mm,project-language,namespace,edits
   example: 1971markus,2008-08,wp-de,0,51
 WLM_Uploaders_EditsFirstLast.csv
   layout: first month,last month,user,edits,wikis
   example: 2001-03,2014-03,Cdani,779,wp-ca|wp-en|wp-es|wx-commons|wx-meta|wx-wikidata
 WLM_Uploaders_EditsFirstLast2.csv
   layout: month,first edit,last edit (?? this layout seems wrong)
   example: 2003-02,1,0
 WLM_Uploaders_EditsFirstLastRetention.csv
   layout: first month,last month,users,total edits,average edits per user
   example: 2004-04,2014-03,7,296548,42364
 input and output are archived in:
 stat1002:/a/wikistats_git/dumps/csv/csv_mw/yyyy
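
The output layouts listed above can be read back with a few lines of code. This is a minimal sketch under stated assumptions: the files have no header line, fields are separated by plain commas, and usernames contain no commas (rows that do not split into exactly five fields are skipped). The function names are illustrative only.

```python
import csv
from collections import defaultdict


def edits_per_user(lines):
    """Sum edits per username from rows laid out as
    username,yyyy-mm,project-language,namespace,edits."""
    totals = defaultdict(int)
    for row in csv.reader(lines):
        # Skip rows that do not match the five-column layout.
        if len(row) != 5 or not row[4].isdecimal():
            continue
        username, _month, _wiki, _namespace, edits = row
        totals[username] += int(edits)
    return dict(totals)


def parse_first_last(line):
    """Parse one row laid out as first month,last month,user,edits,wikis,
    where wikis is a '|'-separated list of project-language codes."""
    first_month, last_month, user, edits, wikis = line.strip().split(",")
    return {
        "first_month": first_month,
        "last_month": last_month,
        "user": user,
        "edits": int(edits),
        "wikis": wikis.split("|"),
    }
```

For the retention file, the example row is consistent with the last column being total edits divided by users: 296548 / 7 = 42364.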

Collect images/uploaders per country

 bash folder: 
 stat1002:/a/wikistats_git/dumps/bash or  
 stat1002:/a/home/ezachte/wikistats/dumps/bash (beta env)

 bash file: count_commons_images_wlm.sh
 input:
 commons dump:  /mnt/data/xmldatadumps/public/commonswiki/latest/commonswiki-latest-pages-meta-history.xml.7z
 bot names:     /a/wikistats_git/dumps/csv/csv_wx/BotsAll.csv
 country names: /a/wikistats_git/squids/csv/meta/CountryCodes.csv
 output:
 WLM_images_by_country_by_year.csv
   contains images per year, (well formed tags + anomalies)
   contains images and uploaders per year per country
   contains users and their upload count per country
 WLM_images_by_country_by_year_edits.txt
   contains page id,file,timestamp,usertype (R=registered user B=bot A=anonymous),year,country
 WLM_images_by_country_by_year_errors.txt
   anomalous tags and how they were fixed
 WLM_images_by_country_by_year_inspect.html
   anomalous tags and how they were fixed (as clickable html document) 
 WLM_images_by_country_by_year_trace.txt
   for debugging only
 WLM_images_by_country_by_year_uploads.txt
   year,country,file,usertype (R=registered user B=bot A=anonymous),user,timestamp,unflagged
 charts derived from this:
 http://commons.wikimedia.org/wiki/File:WLM_uploaders_2010-2012_linear.png
   This chart shows how the three WLM events led to increasingly large peaks in new editors (on any project).
 http://commons.wikimedia.org/wiki/File:WLM_uploaders_2010-2012_log.png
   Same chart with a logarithmic y axis. This shows that the number of old hands contributing to WLM is non-negligible.
 http://commons.wikimedia.org/wiki/File:WLM_uploaders_2012_linear.png
 http://commons.wikimedia.org/wiki/File:WLM_uploaders_2012_log.png
   For completeness' sake, similar linear and log charts where only WLM 2012 is taken into account. 
 http://commons.wikimedia.org/wiki/File:WLM_uploaders_2010-2012_bar_chart_corrected.png
   Shows experienced vs new uploaders to each WLM event.
 http://commons.wikimedia.org/wiki/File:WLM_uploaders_2010-2012_vs_other_NS6_editors_linear.png
   Similar to the first chart, now with the remaining Commons contributors to namespace 6 plotted as a second line.
   Surprisingly, there is still a big leap in non-WLM editors in Sep 2012.
   ==> (Hmm, this seems too coincidental. Are we missing WLM participants? In other words, should some users still move from the red to the blue line?)
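
Counts such as uploaders per year per country can be recomputed from WLM_images_by_country_by_year_uploads.txt. The sketch below is an illustration only, assuming the file is comma-separated with exactly the seven fields listed above and that file names and usernames contain no commas (other rows are skipped). The function name is hypothetical.

```python
import csv
from collections import defaultdict


def uploaders_per_year_country(lines):
    """Count distinct registered uploaders per (year, country) from rows laid out as
    year,country,file,usertype,user,timestamp,unflagged."""
    users = defaultdict(set)
    for row in csv.reader(lines):
        if len(row) != 7:
            continue  # row does not match the assumed layout
        year, country, _file, usertype, user, _timestamp, _unflagged = row
        if usertype == "R":  # R=registered user; bots (B) and anonymous (A) are ignored
            users[(year, country)].add(user)
    return {key: len(names) for key, names in users.items()}
```

Passing an open file handle works as well as a list of lines, since `csv.reader` accepts any iterable of strings.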

See also