Community metrics

How is the MediaWiki / Wikimedia tech community doing? Let's analyze the data available in order to highlight the contributors and areas setting an example, and also the bottlenecks or inactive corners requiring our attention.

Your feedback and requests are welcome in Phabricator (project Analytics-Tech-community-metrics). You can also comment on the discussion page and on the Analytics mailing list.

Median age of open changesets
"Time from last patchset" at "Age of open changesets (monthly snapshots)".

Open changesets waiting for review
"Waiting for review" at "Backlog of open changesets (monthly snapshots)"

New changesets submitted per month
The "submitted" line in "submitted vs. Merged changes vs. Abandoned".

Active Gerrit code review users per month
"Code review users"

Other reports
Monthly active users in Bugzilla (from 2013-02 to 2014-10) and in Phabricator (from 2014-12 to last month).

{ "width": 1000, "height": 350, "padding": {"top": 10, "left": 30, "bottom": 30, "right": 10}, "data": [ { "name": "table", "values": [ {"x": 1302, "y": 450}, {"x": 1303, "y": 495}, {"x": 1304, "y": 511}, {"x": 1305, "y": 532}, {"x": 1306, "y": 505}, {"x": 1307, "y": 542}, {"x": 1308, "y": 530}, {"x": 1309, "y": 492}, {"x": 1310, "y": 539}, {"x": 1311, "y": 508}, {"x": 1312, "y": 540}, {"x": 1401, "y": 537}, {"x": 1402, "y": 527}, {"x": 1403, "y": 533}, {"x": 1404, "y": 478}, {"x": 1405, "y": 470}, {"x": 1406, "y": 509}, {"x": 1407, "y": 501}, {"x": 1408, "y": 515}, {"x": 1409, "y": 475}, {"x": 1410, "y": 491}, {"x": 1411, "y": 0}, {"x": 1412, "y": 619}, {"x": 1501, "y": 669}, {"x": 1502, "y": 733}, {"x": 1503, "y": 838}, {"x": 1504, "y": 766}, {"x": 1505, "y": 775}, {"x": 1506, "y": 794}, {"x": 1507, "y": 797} ] } ], "scales": [ { "name": "x", "type": "ordinal", "range": "width", "domain": {"data": "table", "field": "data.x"} }, { "name": "y", "range": "height", "nice": true, "domain": {"data": "table", "field": "data.y"} } ], "axes": [ {"type": "x", "scale": "x"}, {"type": "y", "scale": "y"} ], "marks": [ { "type": "rect", "from": {"data": "table"}, "properties": { "enter": { "x": {"scale": "x", "field": "data.x"}, "width": {"scale": "x", "band": true, "offset": -1}, "y": {"scale": "y", "field": "data.y"}, "y2": {"scale": "y", "value": 0} }, "update": { "fill": {"value": "steelblue"} }, "hover": { "fill": {"value": "red"} } } } ] }
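The x values in the chart data appear to encode months as YYMM (for example 1302 for 2013-02), which matches the date ranges given for Bugzilla and Phabricator. That reading is an inference, not something this page states. Under that assumption, a minimal Python sketch for turning the values into readable months and spotting months with no data could look like this (the function names are hypothetical):

```python
def decode_month(x):
    """Turn a YYMM integer such as 1302 into '2013-02'.

    Assumes the YYMM encoding and years in the range 2000-2099.
    """
    year, month = divmod(x, 100)
    if not 1 <= month <= 12:
        raise ValueError(f"not a YYMM value: {x}")
    return f"{2000 + year:04d}-{month:02d}"


def months_without_data(values):
    """Return the decoded months whose y value is zero."""
    return [decode_month(point["x"]) for point in values if point["y"] == 0]
```

Applied to the "values" list of the chart, this would flag the single zero entry, which falls between the end of the Bugzilla range and the start of the Phabricator range.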

korma.wmflabs.org
korma.wmflabs.org is the Wikimedia Tech community metrics dashboard. It has been under development since June 2013.

The data is refreshed on a daily basis (check FIXME, link not working?). The sources include Git and Gerrit repositories, Phabricator's Maniphest (and Bugzilla before), mediawiki.org, mailing lists, and IRC.

Korma is powered by the open source projects Metrics Grimoire and Viz Grimoire. You can find the development specific to the Wikimedia tech dashboard on GitHub.

Bugs and feature requests can be reported under the Analytics-Tech-community-metrics project.

User data
Our goal is to provide a tool that allows users to edit their own data directly (T60585). Meanwhile, users can request updates to their personal data by creating a Phabricator task that includes:
 * real name
 * username(s) and email address(es) used for your contributions
 * current and previous affiliations, with the dates of each change of affiliation
 * current location (country)

At the moment we can only process single affiliations (T95238). If you are contributing under different affiliations (e.g. Wikimedia Foundation as part of your work, Independent in your free time), we recommend using different usernames and email addresses for each.

Git
 * The source code repos analyzed are mediawiki/core and all the mediawiki extensions: ssh -p 29418 gerrit.wikimedia.org gerrit ls-projects | grep "mediawiki/extensions"
 * FIXME: This is only a portion (a big one, yes) of all the repositories we need to scan. The default is everything at gerrit.wikimedia.org, but let's look at every repo before adding it, just in case.

Code review
Korma offers code review data based on a selection of repositories scanned on a daily basis at gerrit.wikimedia.org. Specifically:
 * gerrit_trackers.conf contains a list of repositories retrieved using a command similar to the gerrit ls-projects query listed under Git. We are planning to maintain this list automatically (T104845).
 * gerrit_trackers_blacklist.conf contains a list of blacklisted repositories that is maintained manually. These are repositories that we don't want to count in our metrics, and they are ignored as soon as they are added to the blacklist: upstream projects, empty or deprecated repositories, and other exceptional cases.

In addition, if a project is removed from the Gerrit list, the script detects the change and the project is removed from Korma's database automatically.
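The selection logic described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the script Korma actually runs: the function names and the repository names in the usage note are hypothetical, and the two lists are assumed to have been read already from gerrit_trackers.conf and gerrit_trackers_blacklist.conf, whose file format is not described here.

```python
def select_repositories(all_projects, blacklist):
    """Return the projects to scan: every listed project that is not blacklisted."""
    blocked = set(blacklist)
    return [name for name in all_projects if name not in blocked]


def removed_projects(previously_tracked, currently_listed):
    """Return projects that were tracked before but no longer appear in the list."""
    return sorted(set(previously_tracked) - set(currently_listed))
```

For example, with a current list of "mediawiki/core" and "mediawiki/extensions/ExampleA", a blacklist containing "mediawiki/extensions/ExampleA", and a previous run that also tracked "mediawiki/extensions/ExampleB", the first function keeps only "mediawiki/core" and the second reports "mediawiki/extensions/ExampleB" as a project to drop from the database.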

phabricator.wikimedia.org

 * (Since the move from Bugzilla to Phabricator there is currently (May 2015) no data available in the Metrics dashboard. This will change once T96238 and T28 get fixed.)

mediawiki.org

 * The wiki activity (edits, editors and pages) is analyzed using an analyzer tool. Results and discussion.

lists.wikimedia.org

 * FIXME - mailing lists missing: mediawiki-l, ee, qa... more?
 * FIXME - Is it possible to specify the number of subscribers?

IRC

 * The channels to be added to the dashboard are still to be defined.

Contributors
SortingHat is the tool used to manage identities. It helps in the following ways:
 * To centralize all information in a database.
 * To deal with several identities: a developer may have several identities depending on the data source she is working on. The tool records, for each identity of a developer, which data source it came from.
 * To avoid direct database access: a command line interface handles it.
 * To manage extra developer attributes: it supports affiliations and other developer attributes such as nationality or bot activity.
 * To manage blacklists: these are typically used for bots that commit changes, or for overly generic names or email addresses such as "root".

The identities can be merged into one database in two ways: a more detailed process or an incremental one. The detailed process relies on extra scripts to parse the identity information. It is time-consuming, so it is typically used only when the identities database is first created. Later updates of the database typically follow the incremental process.

SortingHat also provides a way to export all of this data. This helps to look for other developer identities and merge them through the command line.

These exported JSON files all follow the same structure.

As an example, if an identity needs to be merged with another identity, the command "sortinghat merge" is used and the original identity (its ".identities.id" entry in the exported data) is merged into the specified target identity.
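To make the effect of such a merge concrete, here is a minimal in-memory sketch in Python. It is not SortingHat's implementation or data model: the UniqueIdentity class, the registry dictionary and the merge function are hypothetical names introduced only for illustration, and identities are simplified to (source, name, email) tuples.

```python
from dataclasses import dataclass, field


@dataclass
class UniqueIdentity:
    """One person, holding the identities found in the different data sources."""
    uuid: str
    identities: list = field(default_factory=list)  # (source, name, email) tuples


def merge(registry, from_uuid, to_uuid):
    """Move every identity of from_uuid into to_uuid, then drop from_uuid."""
    target = registry[to_uuid]  # look up the target first so a bad id changes nothing
    if from_uuid == to_uuid:
        return target
    source = registry.pop(from_uuid)
    for identity in source.identities:
        if identity not in target.identities:
            target.identities.append(identity)
    return target
```

After the merge, only the target remains in the registry and it carries the identities of both entries, which is the outcome the merge command is described as producing.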

The most useful SortingHat commands to deal with identities are the following ones:
 * sortinghat merge: to merge unique identities
 * sortinghat affiliate: to affiliate an identity to some organization
 * sortinghat show: to show information about an identity
 * sortinghat profile: to show profile information of that unique identity

The user pages for top contributors are linked from the top tables in the metrics browser. For example, in SCM the third global committer has a personal page.

Once unique people exist, other categories are built on top of them. For example, the classification by company is done initially with a script that uses email domains when available. The classification supports periods of time, to cover the case where one person has worked for several companies. There is also some experimental support for countries.
 * FIXME - Contributors must be linked to WMF and other orgs.
 * FIXME - Is By country relevant? Do we want to gather that data?
 * FIXME - Plan for linking this data to user profiles? Where?
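The company classification described above (an initial guess from the email domain, then affiliations bounded by periods of time) can be sketched in Python as follows. This is an assumption-laden illustration, not the script used for the dashboard: the domain table, the function names and the "Unknown" default are all hypothetical.

```python
from datetime import date

# Illustrative mapping only; a real table would be maintained separately.
DOMAIN_TO_ORG = {"wikimedia.org": "Wikimedia Foundation"}


def org_from_email(email, domain_map, default="Unknown"):
    """Guess an organization from the domain part of an email address."""
    domain = email.rsplit("@", 1)[-1].lower()
    return domain_map.get(domain, default)


def org_on(enrollments, day, default="Unknown"):
    """Return the organization active on a given day.

    enrollments is a list of (org, start, end) tuples; start is inclusive
    and end is exclusive, so consecutive periods do not overlap.
    """
    for org, start, end in enrollments:
        if start <= day < end:
            return org
    return default
```

With a history such as ("Org A", 2012-01-01 to 2014-01-01) followed by ("Org B", from 2014-01-01), a contribution dated 2013-06-01 would be attributed to Org A and one dated 2014-01-01 to Org B.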

Other data sources and tools
Git
 * Wikimedia stats in Ohloh including many projects.
 * "How many unique contributors submitted unique pull requests to a https://github.com/wikimedia/ repo" - Python script by marktraceur.

Gerrit
 * Gerrit/Navigation
 * MediaWiki Gerrit stats (Is it working? 2013-06-28) and how to query Gerrit data.
 * Number of gerrit committers (marktraceur's bash script)
 * cmd-query for Gerrit.

Phabricator
 * "Phabricator monthly statistics" emails on the wikitech-l mailing list - see its archives.

mediawiki.org
 * Monthly statistics of page views and how the data is gathered.

Mailman
 * Wikimedia Mail Stats: PowerPosters.

Team
Quim Gil and Andre Klapper from the Wikimedia Engineering Community team are coordinating the Metrics Dashboard project, which is being implemented by Bitergia as contractors.

The Bitergia team working on the MediaWiki dashboard is formed by Daniel Izquierdo, Luis Cañas and Jesus Gonzalez Barahona, with Alvaro del Castillo as project manager.

The ownership of this project might be transferred to the Wikimedia Analytics team at some point.