Community metrics

How is the MediaWiki / Wikimedia tech community doing? Let's analyze the data available in order to highlight the contributors and areas setting an example, and also the bottlenecks or inactive corners requiring our attention.

Your feedback and requests are welcome in Phabricator (project Analytics-Tech-community-metrics). You can also comment at the discussion page and at the Analytics mailing list.

Reports
Monthly active users in Bugzilla (from 2013-02 to 2014-10) and in Phabricator (from 2014-12 to last month).

{ "width": 1000, "height": 350, "padding": {"top": 10, "left": 30, "bottom": 30, "right": 10}, "data": [ { "name": "table", "values": [ {"x": 1302, "y": 450}, {"x": 1303, "y": 495}, {"x": 1304, "y": 511}, {"x": 1305, "y": 532}, {"x": 1306, "y": 505}, {"x": 1307, "y": 542}, {"x": 1308, "y": 530}, {"x": 1309, "y": 492}, {"x": 1310, "y": 539}, {"x": 1311, "y": 508}, {"x": 1312, "y": 540}, {"x": 1401, "y": 537}, {"x": 1402, "y": 527}, {"x": 1403, "y": 533}, {"x": 1404, "y": 478}, {"x": 1405, "y": 470}, {"x": 1406, "y": 509}, {"x": 1407, "y": 501}, {"x": 1408, "y": 515}, {"x": 1409, "y": 475}, {"x": 1410, "y": 491}, {"x": 1411, "y": 0}, {"x": 1412, "y": 619}, {"x": 1501, "y": 669}, {"x": 1502, "y": 733}, {"x": 1503, "y": 838}, {"x": 1504, "y": 766}, {"x": 1505, "y": 775}, {"x": 1506, "y": 794}, {"x": 1507, "y": 797} ] } ], "scales": [ { "name": "x", "type": "ordinal", "range": "width", "domain": {"data": "table", "field": "data.x"} }, { "name": "y", "range": "height", "nice": true, "domain": {"data": "table", "field": "data.y"} } ], "axes": [ {"type": "x", "scale": "x"}, {"type": "y", "scale": "y"} ], "marks": [ { "type": "rect", "from": {"data": "table"}, "properties": { "enter": { "x": {"scale": "x", "field": "data.x"}, "width": {"scale": "x", "band": true, "offset": -1}, "y": {"scale": "y", "field": "data.y"}, "y2": {"scale": "y", "value": 0} }, "update": { "fill": {"value": "steelblue"} }, "hover": { "fill": {"value": "red"} } } } ] }

Metrics dashboard
Under development since June 2013: Updated daily (check), this dashboard provides data about our Git repositories, Bugzilla, Mailing Lists, Gerrit and IRC. Below you can find the details about what sources are being scanned.
 * http://korma.wmflabs.org/browser/

Korma is powered by open source projects Metrics Grimoire and Viz Grimoire. See also the development specific to this dashboard in GitHub.

Bugs and feature requests can be reported under the Analytics-Tech-community-metrics project. You can provide the data related to your contributors, also through a Phabricator task, including:
 * real name
 * username(s) and email address(es)
 * current and previous affiliations, with the dates of change of affiliation
 * current location (country)

At the moment we can only process single affiliations. If you are contributing under different affiliations (e.g. Wikimedia Foundation as part of your work, Independent in your free time), then we recommend using different usernames and email addresses.

Git
 * The source code repos analyzed are mediawiki/core and all the mediawiki extensions:
ssh -p 29418 gerrit.wikimedia.org gerrit ls-projects | grep "mediawiki/extensions"
 * FIXME: This is only a portion (a big one, yes) of all the repositories we need to scan. The default is everything at gerrit.wikimedia.org but let's look at every repo before adding it just in case.

gerrit.wikimedia.org

 * The source code repos analyzed are mediawiki/core and all the mediawiki extensions

phabricator.wikimedia.org

 * (Since the move from Bugzilla to Phabricator there is currently (May 2015) no data available in the Metrics dashboard. This will change once T96238 and T28 get fixed.)

mediawiki.org

 * Wiki activity (edits, editors and pages) is analyzed using an analyzer tool. Results and discussion.

lists.wikimedia.org

 * FIXME - mailing lists missing: mediawiki-l, ee, qa... more?
 * FIXME - Is it possible to specify the number of subscribers?

IRC

 * Pending to define the channels to be added to the dashboard.

Contributors
SortingHat is the tool to manage identities. This helps in the following way:
 * To centralize all information in a database.
 * To deal with several identities: a developer may have a different identity in each data source she works on. For each identity of a developer, the tool records which data source it came from.
 * To avoid direct access to the database: a command line interface takes care of it.
 * To manage extra developer attributes: it has support for managing affiliations and other developer attributes such as nationalities or bot activity.
 * To manage blacklists: these are typically used for bots committing changes, or for names or email addresses that are too generic, such as "root".
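The identity handling described above can be pictured with a minimal sketch. It is not SortingHat's implementation: the sample identities, the `BLACKLIST` set and the `merge_identities` helper are all hypothetical, and matching on a shared email address is only one possible heuristic.

```python
from collections import defaultdict

# Hypothetical sample identities: (data source, name, email).
IDENTITIES = [
    ("git", "Jane Doe", "jane@example.org"),
    ("gerrit", "jdoe", "jane@example.org"),
    ("mailing list", "Jane D.", "jane@example.org"),
    ("git", "root", "root@localhost"),
    ("git", "Sam Roe", "sam@example.com"),
]

# Hypothetical blacklist of names too generic to identify a person.
BLACKLIST = {"root"}


def merge_identities(identities, blacklist=BLACKLIST):
    """Group identities sharing an email address into one unique identity."""
    unique = defaultdict(list)
    for source, name, email in identities:
        if name.lower() in blacklist:
            continue  # skip blacklisted (bot or generic) accounts
        # Keep track of the data source each identity came from.
        unique[email.lower()].append({"source": source, "name": name})
    return dict(unique)


merged = merge_identities(IDENTITIES)
for email, ids in sorted(merged.items()):
    print(email, [identity["source"] for identity in ids])
```

Each unique identity keeps the list of sources it was seen in, which is the information needed to review and undo a wrong merge later.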

The process of merging all of the identities into one database can be done in two ways: a more detailed one, or an incremental one. The first relies on extra scripts to parse the identity information. It is time-consuming, so it is typically used only when the identities database is first created. Later updates of the database typically follow the second, incremental way.

SortingHat also provides a way to export all of this data. This helps to look for other developer identities and merge them through the command line.

The exported JSON files follow the same structure:

As an example, if an identity is required to be merged with another identity, the command "sortinghat merge" is used and the original ".identities.id" is merged into the specified "".

The most useful SortingHat commands to deal with identities are the following ones:
 * sortinghat merge: to merge unique identities
 * sortinghat affiliate: to affiliate an identity to some organization
 * sortinghat show: to show information about an identity
 * sortinghat profile: to show profile information of that unique identity

The user pages for Top contributors are linked in the top tables in the metrics browser. For example for SCM the third global committer has his own personal page.

Once unique people exist, other categories are built on top of them. For example, the classification by company is initially done with a script that uses email domains, if available. The classification supports periods of time, to cover the case of a person who has worked for several companies. There is also some experimental support for countries.
 * FIXME - Contributors must be linked to WMF and other orgs.
 * FIXME - Is By country relevant? Do we want to gather that data?
 * FIXME - Plan for linking this data to user profiles? Where?
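A minimal sketch of the classification described above, assuming a hypothetical domain map and hypothetical affiliation periods. The actual script and its rules are not shown on this page, so `DOMAIN_TO_ORG`, `AFFILIATIONS`, `org_from_email` and `org_on` are illustrative names only.

```python
from datetime import date

# Hypothetical domain-to-organization map.
DOMAIN_TO_ORG = {
    "wikimedia.org": "Wikimedia",
    "wikimedia.de": "Wikimedia",
}

# Hypothetical affiliation periods: person -> list of (org, start, end).
# end=None means the affiliation is still current.
AFFILIATIONS = {
    "jane": [
        ("Independent", date(2012, 1, 1), date(2013, 3, 31)),
        ("Wikimedia", date(2013, 4, 1), None),
    ],
}


def org_from_email(email, default="Unknown"):
    """Initial guess of the organization from the email domain."""
    domain = email.rsplit("@", 1)[-1].lower()
    return DOMAIN_TO_ORG.get(domain, default)


def org_on(person, day, default="Unknown"):
    """Organization a person was affiliated with on a given day."""
    for org, start, end in AFFILIATIONS.get(person, []):
        if start <= day and (end is None or day <= end):
            return org
    return default


print(org_from_email("dev@wikimedia.de"))
print(org_on("jane", date(2012, 6, 1)), org_on("jane", date(2014, 1, 1)))
```

Storing periods rather than a single value lets a contribution be attributed to the organization the person belonged to when it was made.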

Other data sources and tools
Git
 * Wikimedia stats in Ohloh including many projects.
 * "How many unique contributors submitted unique pull requests to a https://github.com/wikimedia/ repo" - Python script by marktraceur.

Gerrit
 * Gerrit/Navigation
 * MediaWiki Gerrit stats  (Is it working? 2013-06-28)  and how to query Gerrit data.
 * Number of gerrit committers (marktraceur's bash script)
 * cmd-query for Gerrit.

Phabricator
 * "Phabricator monthly statistics" emails on the wikitech-l mailing list - see its archives.

mediawiki.org
 * monthly Statistics of page views and how the data is gathered.

Mailman
 * Wikimedia Mail Stats: PowerPosters.

Key performance indicators
Key factors to watch, in the scope of projects deployed on Wikimedia servers. All these indicators are computed using the databases updated daily.
 * Are the teams more efficient processing contributions?
 * Is the share of non-WMF contributions growing?
 * Are WMF and non-WMF contributions treated equally?
 * Are the attraction and retention of new contributors improving?
 * Are we improving the sustainability of our community?

Who contributes code
Who is contributing merged code each quarter? How is the weight of the WMF evolving? What regions have a higher density of contributors? The evolution of the total amount of merged commits should be visible too. Two charts? What type?
 * Number of developers and commits by organization: Wikimedia (WMF, WMDE...), known companies, OSS projects (if relevant) and independents.
 * Number of developers and commits by country, based on the data provided.

Queries
Reviews Database
 * Basic:


 * Email domains:

Metrics

 * DB updated on 2013-08-22
 * Total revisions: 56127
 * Total abandoned: 3015
 * Total contributors with merged code: 319
 * Total contributors with abandoned code: 263
 * Total organizations with merged code: 108
 * Total organizations with abandoned code: 82
 * Number of revisions merged per contributor
 * Number of revisions abandoned per contributor
 * Number of revisions merged per organization
 * Number of revisions abandoned per organization
 * Ratios merged/abandoned
 * Tops

Analysis

 * Top 10 for contributors: 7 WMF, 3 WMDE.
 * Only one organization has more than 500 merged: Wikimedia (@wikimedia.org + @wikimedia.de): 13593 + 2132
 * Only one organization has more than 300 abandoned: Wikimedia.

Gerrit review queue
Change dates from the Gerrit SSH API are wrong until 2013-05, so time to review is only available after that. Time zones are not covered yet.

http://korma.wmflabs.org/browser/gerrit_review_queue.html

Backlog of open changesets (monthly snapshots)

 * "Open changesets" (blue line): number of changesets that are open (not merged nor abandoned) at the end of the specified month. Example: number of open changesets on 31st January 2014 at 23:59.
 * "Waiting for review" (green line): subset of open changesets that are waiting for review (not "WIP", with no review or just +1) at the end of the specified month. Open changesets with "WIP" in the subject or with -1/-2 are not counted. Example: number of changesets waiting for review on 31st January 2014 at 23:59.
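The two definitions above can be sketched as follows. The changeset records are hypothetical, and the sketch simplifies by looking at the current votes of each changeset rather than the votes as they stood at the snapshot time.

```python
from datetime import datetime

# Hypothetical changesets. "closed" is the merge/abandon time (None if open).
CHANGESETS = [
    {"subject": "Fix parser bug", "opened": datetime(2014, 1, 5), "closed": None, "votes": []},
    {"subject": "WIP: new API", "opened": datetime(2014, 1, 10), "closed": None, "votes": []},
    {"subject": "Add tests", "opened": datetime(2014, 1, 12), "closed": None, "votes": [1]},
    {"subject": "Refactor skin", "opened": datetime(2014, 1, 15), "closed": None, "votes": [-1]},
    {"subject": "Typo fix", "opened": datetime(2014, 1, 20), "closed": datetime(2014, 1, 25), "votes": [2]},
    {"subject": "Later change", "opened": datetime(2014, 2, 3), "closed": None, "votes": []},
]


def is_open_at(change, snapshot):
    """Uploaded before the snapshot and neither merged nor abandoned by then."""
    return change["opened"] <= snapshot and (
        change["closed"] is None or change["closed"] > snapshot
    )


def is_waiting_for_review(change):
    """Not WIP, and either no review at all or only +1 votes."""
    if "WIP" in change["subject"]:
        return False
    return all(vote == 1 for vote in change["votes"])


def backlog(changes, snapshot):
    """Return (open changesets, changesets waiting for review) at the snapshot."""
    open_changes = [c for c in changes if is_open_at(c, snapshot)]
    waiting = [c for c in open_changes if is_waiting_for_review(c)]
    return len(open_changes), len(waiting)


snapshot = datetime(2014, 1, 31, 23, 59)
print(backlog(CHANGESETS, snapshot))
```

With this sample data, four changesets are open at the end of January 2014 and two of them are waiting for review.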

Distribution of open changesets (by date of last patchset)
(The proposed definition has changed in order to count the dates of last patchset uploaded, see )


 * "Open changesets" (blue bars): for the current backlog of open changesets, number of changesets with a last patchset uploaded during each month. Example: number of changesets currently open with their last patchset uploaded during January 2014.


 * "Waiting for review" (green bars): for the subset of current backlog of open changesets which are waiting for review, number of changesets with a last patchset uploaded during each month. Example: number of changesets currently under review with their last patchset uploaded during January 2014.
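As an illustration of the two bar series, the following sketch counts currently open changesets by the month of their last patchset. The sample data and the `distribution` helper are hypothetical.

```python
from collections import Counter
from datetime import date

# Hypothetical currently open changesets:
# (date of last patchset, waiting for review?)
OPEN_CHANGESETS = [
    (date(2013, 11, 20), False),
    (date(2014, 1, 3), True),
    (date(2014, 1, 28), False),
    (date(2014, 1, 30), True),
    (date(2014, 2, 14), True),
]


def distribution(changesets, waiting_only=False):
    """Count open changesets per (year, month) of their last patchset."""
    months = Counter()
    for last_patchset, waiting in changesets:
        if waiting_only and not waiting:
            continue
        months[(last_patchset.year, last_patchset.month)] += 1
    return dict(sorted(months.items()))


print(distribution(OPEN_CHANGESETS))                     # blue bars
print(distribution(OPEN_CHANGESETS, waiting_only=True))  # green bars
```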

Age of open changesets (monthly snapshots)

 * "Time from submission" (blue line): for open changesets at the end of the specified month, median time from the initial upload of those changesets until the end of the specified month. Example: median time since first upload at the end of 31st January 2014, for all changesets that are still open on 31st January 2014.
 * "Time from last patchset" (green line): for changesets that are still open at the end of the specified month, median time from the last patchset at that time until the end of the specified month. Example: median time since the most recent patchset, as measured on 31st January 2014, for all changesets that are still open on 31st January 2014.

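For each monthly snapshot, "Time from last patchset" is computed over exactly the same population of changesets as "Time from submission". Since the last patchset of a changeset can never be older than its initial upload, the values of "Time from last patchset" should always be equal to or lower than those of "Time from submission".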
 * This is unclear. Please clarify.--Qgil-WMF (talk) 19:27, 13 September 2014 (UTC)
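A minimal sketch of both medians for a single snapshot, assuming hypothetical changesets that are all still open at the snapshot and whose last patchset was uploaded before it. The field names and the `median_age_days` helper are illustrative.

```python
from datetime import datetime
from statistics import median

# Hypothetical changesets, all still open at the snapshot.
CHANGES = [
    {"first_upload": datetime(2013, 12, 1), "last_patchset": datetime(2014, 1, 21)},
    {"first_upload": datetime(2014, 1, 11), "last_patchset": datetime(2014, 1, 11)},
    {"first_upload": datetime(2014, 1, 1), "last_patchset": datetime(2014, 1, 26)},
]


def median_age_days(changes, snapshot, field):
    """Median number of whole days between `field` and the snapshot."""
    ages = [(snapshot - change[field]).days for change in changes]
    return median(ages)


snapshot = datetime(2014, 1, 31, 23, 59)
from_submission = median_age_days(CHANGES, snapshot, "first_upload")
from_last_patchset = median_age_days(CHANGES, snapshot, "last_patchset")
print(from_submission, from_last_patchset)
```

Both medians use the same three changesets, and the median measured from the last patchset cannot exceed the one measured from submission.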

Age of open changesets by affiliation (monthly snapshot)
Each line corresponds to one type of affiliation. Unidentified developers are classified as Unknown.

For each affiliation: same definition as "Age of open changesets (monthly snapshots)" for changesets initially submitted by developers with the specified type of affiliation. Example: median time since first upload at the end of 31st January 2014, for all changesets that are still open on 31st January 2014 that were initially submitted by Independent developers.

Ranking of repositories
Sorted by the oldest "Time from last patchset (green line)" at "Age of open changesets" (third column).


 * "Backlog of open changesets" (first column, lines): same definition as "Backlog of open changesets (monthly snapshots)", for the repository specified.
 * "Distribution of open changesets" (second column, bars): same definition as "Distribution of open changesets (by date of last patchset)", for the repository specified.
 * "Age of open changesets" (third column, lines): same definition as "Age of open changesets (monthly snapshots)", for the repository specified.

Vocabulary
[ This is based on "Terminology for Gerrit" in http://qt-project.org/wiki/Gerrit-Introduction and on the Gerrit interface itself, since we didn't find anything similar in the Wikimedia Gerrit docs (http://www.mediawiki.org/wiki/Gerrit). We've tried to adapt it to Wikimedia terminology a bit, though. ]

 * Change: A single commit and unit of a review. Changes are reviewed and may be submitted to (merged into) the Git repository once the review is passed. A change usually includes several patch-sets, as new versions are submitted for review. Synonym: change-set.
 * Change-set: change.
 * Patch-set: A version of a change. Each time a change is modified, it receives a new patch-set.
 * Submit a patch-set: An action that consists of a new patch-set being submitted to a change (change-set).
 * Developer: Person starting a change (change-set) by proposing a change to the code.
 * Reviewer: Person deciding on the acceptance or not of a change (in the form of a specific patch-set).
 * Review process (for a patch-set): process, carried out by reviewers, to decide whether a patch-set is accepted (merged) or a new patch-set is requested from the developer.
 * Waiting for review: A state of a change in which the developer has submitted a patch-set, and is still waiting for the corresponding review process to finish.
 * Work in progress: A state of a change in which the developer still doesn't want the change to be reviewed, because there is work on it underway.
 * Merge: An action that allows Gerrit to merge a change into the Git repository.
 * Abandon: Action that archives a change. An abandoned change can be restored later.

Queries
FIXME: links to the actual queries in GitHub.

Metrics

 * Total current queue size: 1033
 * Evolution in time queue size (for the current pending reviews), new reviews:

| NEW | YEAR | MONTH |
|   1 | 2012 |     3 |
|   7 | 2012 |     4 |
|  11 | 2012 |     5 |
|  10 | 2012 |     6 |
|   5 | 2012 |     7 |
|  11 | 2012 |     8 |
|  15 | 2012 |     9 |
|  20 | 2012 |    10 |
|  21 | 2012 |    11 |
|  46 | 2012 |    12 |
|  28 | 2013 |     1 |
|  59 | 2013 |     2 |
|  73 | 2013 |     3 |
|  55 | 2013 |     4 |
| 145 | 2013 |     5 |
| 103 | 2013 |     6 |
| 182 | 2013 |     7 |
| 241 | 2013 |     8 |


 * Evolution in time for merged issues

| MERGED | YEAR | MONTH |
|     15 | 2012 |     2 |
|    404 | 2012 |     3 |
|   1399 | 2012 |     4 |
|   2397 | 2012 |     5 |
|   2828 | 2012 |     6 |
|   2468 | 2012 |     7 |
|   4329 | 2012 |     8 |
|   2784 | 2012 |     9 |
|   4283 | 2012 |    10 |
|   3903 | 2012 |    11 |
|   4311 | 2012 |    12 |
|   3752 | 2013 |     1 |
|   3412 | 2013 |     2 |
|   3645 | 2013 |     3 |
|   3216 | 2013 |     4 |
|   3168 | 2013 |     5 |
|   3654 | 2013 |     6 |
|   3958 | 2013 |     7 |
|   2201 | 2013 |     8 |


 * Evolution in time for abandoned issues

| ABANDONED | YEAR | MONTH |
|        20 | 2012 |     2 |
|       101 | 2012 |     3 |
|       132 | 2012 |     4 |
|        98 | 2012 |     5 |
|       171 | 2012 |     6 |
|       111 | 2012 |     7 |
|       148 | 2012 |     8 |
|       136 | 2012 |     9 |
|       172 | 2012 |    10 |
|       221 | 2012 |    11 |
|       162 | 2012 |    12 |
|       193 | 2013 |     1 |
|       164 | 2013 |     2 |
|       337 | 2013 |     3 |
|       201 | 2013 |     4 |
|       175 | 2013 |     5 |
|       207 | 2013 |     6 |
|       196 | 2013 |     7 |
|        70 | 2013 |     8 |


 * Evolution in time of queue size for all issues (merged+abandoned+new):

| TOTAL | YEAR | MONTH |
|   506 | 2012 |     3 |
|  1538 | 2012 |     4 |
|  2506 | 2012 |     5 |
|  3009 | 2012 |     6 |
|  2584 | 2012 |     7 |
|  4488 | 2012 |     8 |
|  2935 | 2012 |     9 |
|  4475 | 2012 |    10 |
|  4145 | 2012 |    11 |
|  4519 | 2012 |    12 |
|  3973 | 2013 |     1 |
|  3635 | 2013 |     2 |
|  4055 | 2013 |     3 |
|  3472 | 2013 |     4 |
|  3488 | 2013 |     5 |
|  3964 | 2013 |     6 |
|  4336 | 2013 |     7 |
|  2512 | 2013 |     8 |
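The four tables above can be cross-checked with a short script. The figures are copied from the tables, starting at 2012-03 because the NEW and TOTAL tables have no row for 2012-02. The variable names are illustrative.

```python
# Months covered by all four tables: 2012-03 to 2013-08.
MONTHS = [(2012, m) for m in range(3, 13)] + [(2013, m) for m in range(1, 9)]

# Figures copied from the tables above, in month order.
NEW = [1, 7, 11, 10, 5, 11, 15, 20, 21, 46, 28, 59, 73, 55, 145, 103, 182, 241]
MERGED = [404, 1399, 2397, 2828, 2468, 4329, 2784, 4283, 3903, 4311,
          3752, 3412, 3645, 3216, 3168, 3654, 3958, 2201]
ABANDONED = [101, 132, 98, 171, 111, 148, 136, 172, 221, 162,
             193, 164, 337, 201, 175, 207, 196, 70]
TOTAL = [506, 1538, 2506, 3009, 2584, 4488, 2935, 4475, 4145, 4519,
         3973, 3635, 4055, 3472, 3488, 3964, 4336, 2512]

# The still-open ("new") changes of every month add up to the current queue.
queue_size = sum(NEW)

# Months where merged + abandoned + new does not match the reported total.
mismatches = [
    month
    for month, new, merged, abandoned, total in zip(MONTHS, NEW, MERGED, ABANDONED, TOTAL)
    if new + merged + abandoned != total
]

print(queue_size)
print(mismatches)
```

The sum of the NEW column equals the reported total queue size of 1033, and every monthly total equals merged + abandoned + new.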


 * Time to review for the Top 10 slowest reviews

|    id | revtime | date                | submitted_by | email                |
|   146 |     437 | 2013-08-16 07:46:28 |           11 | liangent@g           |
|   208 |     378 | 2013-08-13 03:39:28 |           59 | hashar@f             |
| 13796 |     317 | 2013-07-22 19:41:39 |          208 | jan@j                |
|   323 |     307 | 2013-08-02 19:21:13 |           56 | daniel@n             |
| 17642 |     300 | 2013-08-02 19:17:39 |           90 | chughakshay16@g      |
| 17641 |     300 | 2013-08-02 19:18:08 |           90 | chughakshay16@g      |
|    23 |     295 | 2013-08-21 11:12:16 |           30 | amir.aharoni@m       |
| 49272 |     284 | 2013-07-10 23:09:57 |          203 | toniher@c            |
|   920 |     278 | 2013-07-01 23:09:47 |          105 | helder.wiki@g        |
| 20147 |     276 | 2013-07-01 17:04:39 |           75 | vitalif@y            |


 * Evolution in time to review in days:

| SUM(revtime)/COUNT(revtime) | YEAR(date) | MONTH(date) |
|                      2.8988 |       2013 |           5 |
|                      2.5753 |       2013 |           6 |
|                      2.7894 |       2013 |           7 |
|                      4.8007 |       2013 |           8 |
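The last table is a per-month mean, SUM(revtime)/COUNT(revtime). A minimal sketch of the same aggregation, using hypothetical review records rather than the real database:

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical reviews: (date, review time in days).
REVIEWS = [
    (datetime(2013, 5, 3), 1),
    (datetime(2013, 5, 20), 4),
    (datetime(2013, 6, 7), 2),
    (datetime(2013, 6, 9), 3),
    (datetime(2013, 6, 30), 7),
]


def mean_review_time_by_month(reviews):
    """Return {(year, month): mean review time in days}."""
    sums = defaultdict(lambda: [0, 0])  # (year, month) -> [sum, count]
    for when, revtime in reviews:
        bucket = sums[(when.year, when.month)]
        bucket[0] += revtime
        bucket[1] += 1
    return {month: total / count for month, (total, count) in sorted(sums.items())}


print(mean_review_time_by_month(REVIEWS))
```

A mean is sensitive to a few very slow reviews, such as those in the Top 10 table, so it can move differently from the medians used in the review queue charts.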

Analysis

 * The number of open reviews grew faster in July and August 2013.
 * The time to review grew in August.
 * The number of merged issues declined in August.
 * The rate of merged issues varied a lot from month to month at the start. It is now (2013-08) more stable.
 * High number of abandoned issues in 2013-03: 337

Issues

 * Time to review is only computed once changes dates are correct: after 2013-05

Code contributors new / gone
Who are the new code contributors (commits + reviews)? Are they increasing their involvement? Who seems to be on a way out or gone? How are our contributor intake & loss evolving? Two charts? Which kind of charts?
 * Number of new contributors with 1 / 2-5 / 6+ changes submitted in the past 3 months (values may be fine tuned based on actual data).
 * (How to register increasing engagement versus one-offs or new contributors disengaging and vanishing after a short period?)
 * Number of contributors stopping contributing or decreasing continuously in the past 3 months.

Queries

 * New (min(submitted_on)) code contributors for last 3 months with more than 5 contributions:

SELECT id, email, total, age
FROM (
  SELECT people.id, email, COUNT(*) AS total,
         DATEDIFF(NOW(), MIN(submitted_on)) AS age  -- days since first submission
  FROM issues, people
  WHERE issues.submitted_by = people.id
  GROUP BY people.id
) a
WHERE age <= 90 AND total > 5
ORDER BY age


 * New (min(submitted_on)) code contributors for last 3 months with more than 5 contributions MERGED:

SELECT id, email, total, age
FROM (
  SELECT people.id, email, COUNT(*) AS total,
         DATEDIFF(NOW(), MIN(submitted_on)) AS age  -- days since first merged submission
  FROM issues, people
  WHERE issues.submitted_by = people.id AND status = 'merged'
  GROUP BY people.id
) a
WHERE age <= 90 AND total > 5
ORDER BY age


 * New (min(submitted_on)) code contributors for last 3 months with more than 5 contributions ABANDONED:

SELECT id, email, total, age
FROM (
  SELECT people.id, email, COUNT(*) AS total,
         DATEDIFF(NOW(), MIN(submitted_on)) AS age  -- days since first abandoned submission
  FROM issues, people
  WHERE issues.submitted_by = people.id AND status = 'abandoned'
  GROUP BY people.id
) a
WHERE age <= 90 AND total > 5
ORDER BY age


 * Gone code contributors, last contributions more than 6 months ago.

SELECT email, age
FROM (
  SELECT people.id, email, COUNT(*) AS total,
         DATEDIFF(NOW(), MAX(submitted_on)) AS age  -- days since last submission
  FROM issues, people
  WHERE issues.submitted_by = people.id
  GROUP BY people.id
) t
WHERE age > 180
ORDER BY age


 * Total gone code contributors, last contributions more than 6 months ago.

SELECT COUNT(email) AS total
FROM (
  SELECT people.id, email,
         DATEDIFF(NOW(), MAX(submitted_on)) AS age  -- days since last submission
  FROM issues, people
  WHERE issues.submitted_by = people.id
  GROUP BY people.id
) t
WHERE age > 180


 * Evolution in time of age of gone code contributors:

SELECT COUNT(email) AS total, YEAR(last_contrib), MONTH(last_contrib)
FROM (
  SELECT people.id, email,
         DATEDIFF(NOW(), MAX(submitted_on)) AS age,  -- days since last submission
         MAX(submitted_on) AS last_contrib
  FROM issues, people
  WHERE issues.submitted_by = people.id
  GROUP BY people.id
) t
WHERE age > 180
GROUP BY YEAR(last_contrib), MONTH(last_contrib)
ORDER BY YEAR(last_contrib), MONTH(last_contrib)
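The logic of the queries above can also be sketched without a database. This is an illustration with hypothetical submissions, not the code used by the dashboard. The thresholds (90 days, more than 5 contributions, 180 days) are taken from the queries, and the helper names are made up.

```python
from datetime import date

# Hypothetical submissions: (contributor email, submission date).
SUBMISSIONS = (
    [("new@example.org", date(2013, 7, d)) for d in range(1, 8)]     # 7 changes
    + [("oneoff@example.org", date(2013, 8, 1))]                     # 1 change
    + [("gone@example.org", date(2012, 6, d)) for d in range(1, 4)]  # 3 changes
)


def summarize(submissions):
    """Return {email: (first submission, last submission, total)}."""
    stats = {}
    for email, day in submissions:
        first, last, total = stats.get(email, (day, day, 0))
        stats[email] = (min(first, day), max(last, day), total + 1)
    return stats


def new_contributors(submissions, today, max_age=90, more_than=5):
    """First submission at most max_age days ago, with more than `more_than` changes."""
    return sorted(
        email
        for email, (first, _, total) in summarize(submissions).items()
        if (today - first).days <= max_age and total > more_than
    )


def gone_contributors(submissions, today, min_idle=180):
    """Last submission more than min_idle days ago."""
    return sorted(
        email
        for email, (_, last, _) in summarize(submissions).items()
        if (today - last).days > min_idle
    )


today = date(2013, 8, 22)
print(new_contributors(SUBMISSIONS, today))
print(gone_contributors(SUBMISSIONS, today))
```

Passing `today` explicitly, instead of reading the clock as `NOW()` does, makes the result reproducible for a given database snapshot.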

Metrics

 * Total new code contributors: 13
 * Total gone code contributors: 127 (total contributors: 394)

 * New code contributors

|  id | email           | total | age |
|  21 | bdavis@w        |     6 |   7 |
| 198 | jack@c          |    14 |  31 |
|  20 | rainerrillke@h  |    19 |  40 |
| 156 | kartik.mistry@g |    10 |  54 |
| 332 | karima.rafes@g  |    15 |  58 |
| 411 | nilesh@n        |    41 |  59 |
| 409 | simon.eu@g      |     7 |  59 |
| 360 | abreault@w      |    20 |  65 |
|  69 | neverett@w      |    46 |  66 |
|  97 | ebrahim@b       |     9 |  66 |
| 330 | eu.vlasenko@g   |     6 |  76 |
| 313 | sam@s           |     7 |  87 |
| 402 | david@s         |     6 |  88 |
13 rows in set (0.18 sec)


 * New code contributors merged

|  id | email                | total | age |
|  21 | bdavis@w             |     6 |   7 |
| 198 | jack@c               |    12 |  31 |
|  20 | rainerrillke@h       |    13 |  40 |
| 411 | nilesh@n             |    31 |  54 |
| 156 | kartik.mistry@g      |    10 |  54 |
| 332 | karima.rafes@g       |    14 |  58 |
| 360 | abreault@w           |    18 |  64 |
|  69 | neverett@w           |    38 |  66 |
|  97 | ebrahim@b            |     8 |  66 |
|  87 | yuvipanda+suchabot@g |    10 |  66 |
10 rows in set (0.17 sec)


 * New code contributors abandoned

|  id | email            | total | age |
|  82 | addshore@w       |    15 |  54 |
| 314 | joel.natividad@o |     6 |  69 |
2 rows in set (0.05 sec)

 * Evolution of last contribution of gone code contributors:

| total | YEAR(last_contrib) | MONTH(last_contrib) |
|     1 |               2012 |                   2 |
|     2 |               2012 |                   3 |
|     7 |               2012 |                   4 |
|     3 |               2012 |                   5 |
|    22 |               2012 |                   6 |
|    13 |               2012 |                   7 |
|     7 |               2012 |                   8 |
|     8 |               2012 |                   9 |
|    11 |               2012 |                  10 |
|    13 |               2012 |                  11 |
|    15 |               2012 |                  12 |
|     8 |               2013 |                   1 |
|    17 |               2013 |                   2 |

17 people sent their last contribution in 2013-02 and have not submitted code since.

Analysis

 * 3 new people from WMF, 7 from other email domains

Comments
The analysis could be done also for organization or country, not just people.

Tracking the evolution of newcomers over time could be really cool.

Top contributors
Wikimedia professionals apart, who are the top tech community contributors, what are their areas of activity and where are they based? Let's list everybody, not just the top 10. This will help the WMF and the Wikimedia movement knowing and supporting these contributors better. Tables are good. No need for charts?
 * Combined ranking of contributors of Git/Gerrit, Phabricator, MediaWiki, Mailman, IRC. We'll need to find the formula.
 * Rankings for each channel.

Team
Quim Gil and Andre Klapper from the Wikimedia Engineering Community team are coordinating the Metrics Dashboard project, which is being implemented by Bitergia as contractors.

The Bitergia team working on the MediaWiki dashboard consists of Daniel Izquierdo, Luis Cañas and Jesus Gonzalez Barahona, with Alvaro del Castillo as project manager.

The ownership of this project might get transferred to the Wikimedia Analytics team at some point.