User:Embr/continuity

Identities

 * Mediawiki: Embr
 * Wikitech: Erosen
 * Office: Erosen
 * Github: embr

Services
These are things that other people use from time to time and expect me to maintain.

Geowiki
 * Geocoded active editor numbers by country (and sort of by city)
 * Database lives on s1 cluster within 'staging' database as erosen_geowiki* tables
 * Code to populate database: https://gerrit.wikimedia.org/r/#/admin/projects/analytics/editor-geocoding
 * Code to generate dashboard files: https://gerrit.wikimedia.org/r/#/admin/projects/analytics/global-dev/dashboard
 * Cron jobs to update the database, generate limn files, and deploy the data to the limn server:
   # m h  dom mon dow   command
   0 0 * * * python /home/erosen/src/geowiki/geowiki/process_data.py -o /home/erosen/data/editor-geocoding/ --wpfiles /home/erosen/src/geowiki/geowiki/data/all_ids.tsv --daily --start=`date --date='-1 day' +\%Y-\%m-\%d` --end=`date --date='1 day' +\%Y-\%m-\%d`
   0 2 * * * python /home/erosen/src/dashboard/geowiki/format_geo_editors.py
   0 4 * * * bash /home/erosen/src/dashboard/deploy_git.sh
 * Repo holding datafiles: https://gerrit.wikimedia.org/r/#/admin/projects/analytics/global-dev/dashboard-data
 * People:
   * Fabian Kaelin: former contractor; started with Summer of Research; wrote original codebase
   * Henrique Andrade: current contractor in Brazil who is familiar with the database and has an interest in improving city level info
   * Sajjad Anwar: CIS employee working with India team; also interested in improving city level info
 * Known Issues:
   * Using erosen_geowiki_city_edit_fraction and erosen_geowiki_country_counts to reconstruct actual edit counts does not return whole numbers (or anything that looks like it could be the result of a rounding error).
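
The mismatch in the known issue above can be checked mechanically. This is a minimal sketch, assuming each city row carries a fraction of its country's edits and each country row carries a total count. The sample values and the helper name `reconstructed_counts` are hypothetical and do not reflect the real schema of the erosen_geowiki tables.

```python
def reconstructed_counts(country_total, city_fractions, tol=1e-6):
    """Multiply each city fraction by the country total and flag non-integers."""
    results = {}
    for city, fraction in city_fractions.items():
        estimate = fraction * country_total
        # A value within `tol` of a whole number is treated as a rounding artifact.
        is_whole = abs(estimate - round(estimate)) <= tol
        results[city] = (estimate, is_whole)
    return results


# Hypothetical sample: 200 edits in a country, split across two cities.
sample = reconstructed_counts(200, {"City A": 0.25, "City B": 0.333})
for city, (estimate, is_whole) in sample.items():
    print(city, estimate, "whole" if is_whole else "not whole")
```

Rows flagged as "not whole" are the ones worth inspecting first when tracing where the fractions and counts diverge.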

Grantmaking & Programs Dashboards

 * Dashboard which serves Grantmaking & Programs related graphs, including Wikipedia Zero partner dashboards


 * Code to generate dashboard files: https://gerrit.wikimedia.org/r/#/admin/projects/analytics/global-dev/dashboard
 * Repo holding datafiles: https://gerrit.wikimedia.org/r/#/admin/projects/analytics/global-dev/dashboard-data
 * People:
   * Jessie Wild: main customer for strategic indicators and grants charts
   * Asaf Bartov: very interested in country-language charts for use when doing grant proposal evaluation
 * Known Issues:
   * Datasources list is very slow to load
   * Graphs created with one set of datafiles suddenly present the wrong column from the csv because the csv file changes
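
The wrong-column issue can be illustrated with a small sketch. It assumes, without confirmation from the dashboard code, that a graph refers to its column by position. The CSV contents and the helper names `by_position` and `by_name` are hypothetical.

```python
import csv
import io

# Two hypothetical versions of the same datafile; the second reorders columns.
old_csv = "date,active_editors,new_editors\n2013-03-01,120,15\n"
new_csv = "date,new_editors,active_editors\n2013-03-01,15,120\n"


def by_position(text, index):
    """Read one column by position, as a position-based graph definition would."""
    rows = list(csv.reader(io.StringIO(text)))
    return [row[index] for row in rows[1:]]


def by_name(text, name):
    """Read one column by header name, which survives column reordering."""
    return [row[name] for row in csv.DictReader(io.StringIO(text))]


# Positional lookup returns a different series after the reorder.
print(by_position(old_csv, 1), by_position(new_csv, 1))
# Lookup by header name returns the same series for both files.
print(by_name(old_csv, "active_editors"), by_name(new_csv, "active_editors"))
```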

Wikipedia Zero Dashboards

 * Code to generate dashboards: https://github.com/embr/wp-zero
 * Repo holding datafiles: https://github.com/embr/wp-zero-data
 * List of "live" carriers can be found here
 * wp-zero repo also holds the script used for generating one-off reports for Amit using sampled files
 * Metadata for carriers is scraped from mcc-mnc.com using mcc-mnc python library
 * However, metadata should be scraped from meta under Zero namespace (e.g. vimpelcom-pakistan)
 * People:
   * Amit Kapoor

Wikipedia Education Program (WEP) Plagiarism Study

 * Code: https://github.com/embr/wep-plagiarism
 * Also relies heavily upon (and was the main impetus for developing) mwstats
 * People:
   * LiAnna Davis
   * Sage Ross
   * Rodney Duncan
   * Frank Schulenburg

Monthly Report GP Strategic Goals Metrics

 * See March report for an example
 * Involves reading off the Global South Active Editors chart and computing MoM numbers
 * Previously also included the MoM change in Monthly Free Wikipedia Zero Page Views, but this has been difficult given the volatility of the methods for determining those numbers.
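
The MoM (month-over-month) figure is a relative change between two consecutive monthly readings. This is a minimal sketch, assuming the two values have already been read off the chart by hand. The numbers and the function name `month_over_month` are hypothetical.

```python
def month_over_month(previous, current):
    """Return the relative change from the previous month to the current one."""
    if previous == 0:
        raise ValueError("previous month's value must be non-zero")
    return (current - previous) / previous


# Hypothetical readings taken off the chart for two consecutive months.
february, march = 20000, 20500
print("MoM change: {:+.1%}".format(month_over_month(february, march)))
```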

Tools
Exhaustive records of my code-related activities can be found on GitHub and Gerrit. Sadly, I was the only person to use most of these tools, so little maintenance is expected, but many are relatively stable and quite useful for common analyst tasks.

gcat

 * https://github.com/embr/gcat
 * Tool for turning Google Drive spreadsheets into pandas.DataFrame objects
 * Uses OAuth and should eventually be merged into pandas

limnpy

 * https://github.com/wikimedia/limnpy
 * tool for generating limn dashboards programmatically

squidpy

 * https://github.com/embr/squidpy
 * A library for rapidly prototyping squid log parsing

python-iso3166

 * https://github.com/embr/python-iso3166
 * Reliable mapping between country names and ISO-3166 codes
 * Tool for disambiguating country names for use in limn maps
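
The kind of disambiguation this tool addresses can be sketched as follows. This is not the python-iso3166 API. The tables `ISO_ALPHA2` and `ALIASES` and the function `to_alpha2` are hypothetical names, and the mapping covers only a handful of countries for illustration.

```python
# Canonical (lower-cased) names mapped to ISO 3166-1 alpha-2 codes.
ISO_ALPHA2 = {
    "united states": "US",
    "russian federation": "RU",
    "south korea": "KR",
}
# Hypothetical alias table: variant spellings pointing at a canonical key above.
ALIASES = {
    "usa": "united states",
    "united states of america": "united states",
    "russia": "russian federation",
    "korea, republic of": "south korea",
}


def to_alpha2(name):
    """Normalize a free-form country name and return its alpha-2 code, or None."""
    key = " ".join(name.lower().split())  # collapse case and stray whitespace
    key = ALIASES.get(key, key)
    return ISO_ALPHA2.get(key)


print(to_alpha2("  Russia "), to_alpha2("USA"), to_alpha2("Atlantis"))
```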

wikipandas

 * https://github.com/embr/wikipandas
 * Tool for converting MediaWiki tables into pandas.DataFrame objects

mwstats

 * https://github.com/embr/mwstats
 * OOP Library for programmatically modeling Mediawiki users
 * Uses database and API

mwgit

 * https://github.com/embr/mwgit
 * Library for translating mediawiki articles into git repositories
 * Supports git blame for word-level attribution
 * Known Issues:
   * uses git-python which has a memory leak

mwcat

 * https://github.com/embr/mwcat
 * Tool for interrogating the MediaWiki category graph
 * Relies heavily on a cached version of the categorylinks table
 * Uses NetworkX graph library for graph representation
 * Provides CLI and python library
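
A category-graph query of this kind can be sketched with the standard library alone. This sketch swaps NetworkX for a plain dictionary so it has no dependencies, and it is not mwcat's API. The (child, parent) pairs are hypothetical and only loosely modelled on rows of the categorylinks table.

```python
from collections import defaultdict, deque


def build_graph(categorylinks):
    """Map each parent category to the set of its direct subcategories."""
    children = defaultdict(set)
    for child, parent in categorylinks:
        children[parent].add(child)
    return children


def subcategories(children, root):
    """Breadth-first walk; the `seen` set guards against category cycles."""
    seen, queue = set(), deque([root])
    while queue:
        for child in children.get(queue.popleft(), ()):
            if child not in seen and child != root:
                seen.add(child)
                queue.append(child)
    return seen


# Hypothetical (child, parent) rows, including a cycle between B and C.
links = [("B", "A"), ("C", "B"), ("B", "C"), ("D", "A")]
print(sorted(subcategories(build_graph(links), "A")))
```

The cycle guard matters because category graphs are not guaranteed to be trees, so a naive recursive walk may never terminate.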

mcc-mnc

 * https://github.com/embr/mcc-mnc
 * Scrapes mcc-mnc.com and creates the mcc-mnc mapping used for decoding the X-CS field
 * As noted in the Wikipedia Zero Dashboards section, metadata should instead be scraped from meta under the Zero namespace
   * e.g. vimpelcom-pakistan json schema

userstats

 * https://github.com/embr/userstats
 * Predecessor to wikimetrics
 * Framework for computing various wikipedia user metrics
 * Supports aggregation across cohorts
 * Uses API exclusively
 * CSV / terminal based workflow

wikisax

 * https://github.com/embr/wikisax
 * SAX XML parser for wikipedia dumps
 * Written for an interview task; not well maintained
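
The SAX approach streams through the XML and reacts to element events instead of loading the whole file. This is a minimal sketch using Python's standard `xml.sax` module, not the wikisax code itself. The class name `TitleHandler` and the tiny inline document are hypothetical stand-ins for a real dump.

```python
import xml.sax


class TitleHandler(xml.sax.ContentHandler):
    """Collect the text of every <title> element as the parser streams by."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        # SAX may deliver one text node in several chunks, so accumulate them.
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buffer))
            self._in_title = False


# Tiny hypothetical stand-in for a dump; real dumps are far larger.
dump = b"<mediawiki><page><title>Foo</title></page><page><title>Bar</title></page></mediawiki>"
handler = TitleHandler()
xml.sax.parseString(dump, handler)
print(handler.titles)
```

For a real dump file, `xml.sax.parse` accepts a filename or file object in place of the in-memory bytes, so memory use stays flat regardless of dump size.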