User:Embr/continuity

Identities

 * MediaWiki: Embr
 * Wikitech: Erosen
 * Office: Erosen
 * Github: embr

Services
These are things that other people use from time to time and have expected me to maintain.

Geowiki

 * Geocoded active editor numbers by country (and sort of by city)
 * Database lives on the s1 cluster, in the 'staging' database, as erosen_geowiki* tables
 * Code to populate database: https://gerrit.wikimedia.org/r/#/admin/projects/analytics/editor-geocoding
 * Code to generate dashboard files: https://gerrit.wikimedia.org/r/#/admin/projects/analytics/global-dev/dashboard
 * Repo holding datafiles: https://gerrit.wikimedia.org/r/#/admin/projects/analytics/global-dev/dashboard-data
 * People:
   * Fabian Kaelin: former contractor; started with Summer of Research; wrote the original codebase
   * Henrique Andrade: current contractor in Brazil who has used the code and is interested in improving the city-level info
   * Sajjad Anwar: CIS employee working with the India team; also interested in improving the city-level info
 * Known Issues:
   * Using erosen_geowiki_city_edit_fraction and erosen_geowiki_country_counts to reconstruct actual edit counts does not return whole numbers (or anything that looks like the result of a rounding error); see the sketch below
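
A minimal sketch of that check (the connection host and column names are assumptions; the real schemas are the erosen_geowiki* tables in 'staging' on s1):

    import pymysql
    import pandas as pd

    # Host is a placeholder; point it at the s1 analytics replica
    conn = pymysql.connect(host='PLACEHOLDER_S1_HOST', db='staging',
                           read_default_file='~/.my.cnf')

    # Column names are guesses at the actual schemas
    frac = pd.read_sql('SELECT country, city, edit_fraction '
                       'FROM erosen_geowiki_city_edit_fraction', conn)
    totals = pd.read_sql('SELECT country, total_edits '
                         'FROM erosen_geowiki_country_counts', conn)

    merged = frac.merge(totals, on='country')
    # If the published fractions were exact, every product would be
    # (close to) an integer; large residuals are the reported symptom
    merged['reconstructed'] = merged['edit_fraction'] * merged['total_edits']
    merged['residual'] = (merged['reconstructed'] - merged['reconstructed'].round()).abs()
    print(merged.sort_values('residual', ascending=False).head())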

Grantmaking & Programs Dashboards

 * Dashboard which serves Grantmaking & Programs related graphs, including Wikipedia Zero partner dashboards
 * Code to generate dashboard files: https://gerrit.wikimedia.org/r/#/admin/projects/analytics/global-dev/dashboard
 * Repo holding datafiles: https://gerrit.wikimedia.org/r/#/admin/projects/analytics/global-dev/dashboard-data
 * People:
   * Jessie Wild: main customer for strategic indicators and grants charts
   * Asaf Bartov: very interested in country x language charts for use when doing grant evaluation
 * Known Issues:
   * The datasources list is very slow to load
   * Graphs created against one set of datafiles can suddenly present the wrong column because the underlying CSV file has changed

Wikipedia Zero Dashboards

 * Code to generate dashboards: https://github.com/embr/wp-zero
 * Repo holding datafiles: https://github.com/embr/wp-zero-data
 * List of "live" carriers can be found here
 * The wp-zero repo also holds the script used for generating one-off reports for Amit from sampled files
 * Metadata for carriers is scraped from mcc-mnc.com using the mcc-mnc python library
 * However, metadata should instead be scraped from meta under the Zero: namespace (e.g. vimpelcom-pakistan); see the sketch after this list
 * People:
   * Amit Kapoor
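
A hedged sketch of what scraping from Meta might look like; the page title and the assumption that the page body is raw JSON are both guesses based on the vimpelcom-pakistan example:

    import json
    import requests

    API = 'https://meta.wikimedia.org/w/api.php'

    def fetch_page_text(title):
        """Fetch the raw content of a page via the MediaWiki API."""
        params = {'action': 'query', 'prop': 'revisions', 'rvprop': 'content',
                  'titles': title, 'format': 'json'}
        data = requests.get(API, params=params).json()
        page = next(iter(data['query']['pages'].values()))
        return page['revisions'][0]['*']

    raw = fetch_page_text('Zero:vimpelcom-pakistan')  # title is a guess
    carrier = json.loads(raw)                         # assumes a plain-JSON body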

Wikipedia Education Program (WEP) Plagiarism Study

 * TODO

Monthly Report GP Strategic Goals Metrics

 * See the March report for an example
 * Involves reading the numbers off the Global South Active Editors chart and computing month-over-month (MoM) changes; a sketch of the arithmetic follows this list
 * Previously had been including the MoM change in Monthly Free Wikipedia Zero Page Views
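
The MoM computation itself is just arithmetic; a minimal example with made-up numbers:

    def mom_change(current, previous):
        """Fractional month-over-month change."""
        return (current - previous) / previous

    # e.g. 1200 -> 1250 active editors is a +4.2% MoM change
    print('{:+.1%}'.format(mom_change(1250, 1200)))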

Tools
Exhaustive records of my code-related activities can be found on GitHub and Gerrit. Sadly, I was the only person using most of these tools, so little maintenance should be expected, but many are relatively stable and quite useful for common analyst tasks.

gcat

 * https://github.com/embr/gcat
 * Tool for turning Google Drive spreadsheets into pandas DataFrame objects (usage sketch below)
 * Uses OAuth; should eventually be merged into pandas
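
A usage sketch from memory; the get_file entry point and its fmt argument are unverified and may not match the current gcat API:

    import gcat

    # First call walks through the OAuth flow and caches credentials;
    # the spreadsheet title and fmt value are illustrative
    df = gcat.get_file('My Spreadsheet Title', fmt='pandas')
    print(df.head())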

limnpy

 * https://github.com/wikimedia/limnpy
 * Tool for generating Limn dashboards programmatically (usage sketch below)
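
Roughly how it is used, as best I recall from the README (the DataSource signature may have drifted):

    import datetime
    import limnpy

    rows = [[datetime.date(2013, 1, 1), 10, 20],
            [datetime.date(2013, 2, 1), 11, 25]]
    ds = limnpy.DataSource(limn_id='test', limn_name='Test Source',
                           data=rows, labels=['date', 'x', 'y'])
    ds.write(basedir='.')  # emits the datasource/datafile pair Limn expects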

squidpy

 * https://github.com/embr/squidpy
 * A library for rapidly prototyping squid log parsing; a field-splitting sketch follows
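
squidpy's own API isn't documented here; the following shows only the kind of field splitting it prototypes, with an approximate layout for the space-delimited Wikimedia squid log format that should be checked against a real sample line:

    FIELDS = ['hostname', 'sequence', 'timestamp', 'service_time',
              'client_ip', 'status', 'reply_size', 'method', 'url',
              'hierarchy', 'mime_type', 'referer', 'x_forwarded_for',
              'user_agent']

    def parse_line(line):
        """Split one squid log line into a dict; the trailing
        user-agent field may itself contain spaces."""
        parts = line.rstrip('\n').split(' ', len(FIELDS) - 1)
        return dict(zip(FIELDS, parts))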

python-iso3166

 * https://github.com/embr/python-iso3166
 * Reliable mapping between country names and ISO-3166 codes
 * Tool for disambiguating country names for use in Limn maps (usage sketch below)
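
Quick usage sketch, assuming the fork keeps the upstream iso3166 interface (a countries object supporting lookup by name or code):

    from iso3166 import countries

    print(countries.get('de'))              # lookup by alpha-2 code
    print(countries.get('Germany').alpha3)  # lookup by name -> 'DEU'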

wikipandas

 * https://github.com/embr/wikipandas
 * Tool for converting MediaWiki tables into pandas DataFrames (illustrated below)
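
I don't have wikipandas's exact API at hand, but as an illustration of the same idea (wiki table -> DataFrame), plain pandas can already do this against a rendered page (needs lxml or BeautifulSoup installed):

    import pandas as pd

    url = 'https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes'
    tables = pd.read_html(url)  # one DataFrame per wikitable on the page
    print(tables[0].head())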

mwstats

 * https://github.com/embr/mwstats
 * OOP library for programmatically modeling MediaWiki users
 * Uses both the database and the API

mwgit

 * https://github.com/embr/mwgit
 * Library for translating MediaWiki articles into git repositories (sketched below)
 * Supports git blame for word-level attribution
 * Known Issues:
   * Uses git-python, which has a memory leak
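
A rough sketch of the translation, assuming (not verified against the mwgit source) that each revision becomes a commit and that text is written one token per line so line-level git blame yields word-level attribution:

    import os
    import git  # git-python; see the memory-leak caveat above

    def build_repo(path, revisions):
        """revisions: iterable of (username, comment, text), oldest first."""
        repo = git.Repo.init(path)
        for user, comment, text in revisions:
            with open(os.path.join(path, 'article.txt'), 'w') as f:
                f.write('\n'.join(text.split()))  # one token per line
            repo.index.add(['article.txt'])
            repo.index.commit(comment or '(no comment)',
                              author=git.Actor(user, user + '@example.invalid'))
        return repo

    # repo.blame('HEAD', 'article.txt') then attributes each token
    # to the revision that introduced it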

mwcat

 * https://github.com/embr/mwcat
 * Tool for interrogating the MediaWiki category graph
 * Relies heavily on a cached version of the categorylinks table
 * Uses NetworkX graph library for graph representation
 * Provides a CLI and a Python library (a minimal graph sketch follows)
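
A minimal sketch of the graph representation, assuming rows shaped like (page_title, parent_category) pulled from the cached categorylinks table:

    import networkx as nx

    def build_category_graph(rows):
        """Directed edges point from parent category to member page."""
        g = nx.DiGraph()
        g.add_edges_from((parent, child) for child, parent in rows)
        return g

    g = build_category_graph([('Physics', 'Science'),
                              ('Chemistry', 'Science'),
                              ('Quantum mechanics', 'Physics')])
    print(sorted(nx.descendants(g, 'Science')))  # everything under Science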

mcc-mnc

 * https://github.com/embr/mcc-mnc
 * Scrapes mcc-mnc.com and creates the MCC-MNC mapping used for decoding the X-CS field (decoding sketched below)
 * As noted in the Wikipedia Zero dashboards section, metadata should instead be scraped from meta under the Zero: namespace
   * e.g. the vimpelcom-pakistan JSON schema
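
For context, the decoding step the mapping exists for looks roughly like this; the X-CS header carries an MCC-MNC pair such as '410-01', and the mapping's exact shape is assumed here:

    MCC_MNC = {
        ('410', '01'): 'Mobilink (Pakistan)',  # illustrative entry only
    }

    def decode_x_cs(value):
        """Turn an X-CS value like '410-01' back into a carrier name."""
        mcc, mnc = value.strip().split('-')
        return MCC_MNC.get((mcc, mnc), 'unknown carrier')

    print(decode_x_cs('410-01'))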

userstats

 * https://github.com/embr/userstats
 * Predecessor to wikimetrics
 * Framework for computing various Wikipedia user metrics (workflow sketched below)
 * Supports aggregation across cohorts
 * Uses the API exclusively
 * CSV / terminal based workflow
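
Not userstats's actual code, but a hedged sketch of the workflow it wraps (API-only metrics for a cohort, emitted as CSV):

    import csv
    import requests

    API = 'https://en.wikipedia.org/w/api.php'

    def edit_counts(usernames):
        """Fetch per-user edit counts via list=users."""
        params = {'action': 'query', 'list': 'users',
                  'ususers': '|'.join(usernames),
                  'usprop': 'editcount', 'format': 'json'}
        data = requests.get(API, params=params).json()
        return {u['name']: u.get('editcount', 0) for u in data['query']['users']}

    with open('cohort_editcounts.csv', 'w') as f:
        writer = csv.writer(f)
        writer.writerow(['user', 'editcount'])
        for user, count in sorted(edit_counts(['Jimbo Wales']).items()):
            writer.writerow([user, count])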

wikisax

 * https://github.com/embr/wikisax
 * SAX XML parser for Wikipedia dumps (minimal example below)
 * Written for an interview task; not well maintained
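
A minimal example of the approach: a streaming SAX handler that pulls page titles out of a pages-articles dump without loading it into memory (the dump filename is illustrative):

    import xml.sax

    class TitleHandler(xml.sax.ContentHandler):
        def __init__(self):
            xml.sax.ContentHandler.__init__(self)
            self.in_title = False
            self.buf = []

        def startElement(self, name, attrs):
            if name == 'title':
                self.in_title, self.buf = True, []

        def characters(self, content):
            if self.in_title:
                self.buf.append(content)

        def endElement(self, name):
            if name == 'title':
                self.in_title = False
                print(''.join(self.buf))

    xml.sax.parse('enwiki-latest-pages-articles.xml', TitleHandler())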