Community metrics

From MediaWiki.org
This page is about wikimedia.biterg.io, the Wikimedia Tech community metrics dashboard.
For a list of links to metrics and statistics, refer to Development statistics.
For metrics not related to Wikimedia's technical community (e.g. page views), refer to the Analytics mailing list.

The data sources of the Wikimedia Tech community metrics dashboard include Git and Gerrit repositories, Phabricator's Maniphest (though only basic support), mediawiki.org, and some mailing lists. The data sources are defined in a configuration file, and the dashboard's data is refreshed regularly. For other data sources (on-wiki code, GitHub repositories) currently not covered by the Wikimedia Tech community metrics dashboard, see the #Limitations section below.

Bug reports are welcome in the wikimedia.biterg.io project in Phabricator. Feedback and questions are welcome on the discussion page.

wikimedia.biterg.io offers:

  • Drill down: clicking an element applies a filtered view
  • Time frame selection
  • Exporting data
  • API access via the Elasticsearch API
  • The ability for Wikimedia administrators to create widgets and panels themselves
  • An advanced filter search box

User interface

Screenshot

The top bar lists Dashboards (also called Panels). By default the Overview is chosen. Each dashboard offers numerous widgets, and a result list at the bottom of the page (commits in Git, emails in mailing lists, etc.).

The interactive Widgets at the bottom display the actual data. Some panels support clicking displayed items to get more specific information about those items and some panels also allow downloading and exporting the displayed information as CSV or JSON.

Applying filters

In the right corner of the top bar, the Time filter allows adjusting the time span of all the data being displayed in the widgets.

Some widgets allow creating Filters: When the mouse pointer hovers over an item in a list, two small magnifier icons will be displayed. They allow showing only data for that very item on the page, or filtering out that very item from the displayed results on the page.

When creating a filter in Kibana, the filter is displayed below the Advanced filter text field and is applied to all widgets. In the screenshot above, only changesets with 'status: Merged' and by independent authors are shown in the panels. When hovering over a filter, you can enable/disable, pin/unpin (the filter will still be applied when you open that page again), invert (e.g. to get all companies listed except for one), remove or edit (e.g. to change the organization name) the filter. The "Actions" menu on the very right of the filters offers the same actions to apply them to all filters at once. For more information, see Discover Filters.

The Advanced filter text field allows searching for text in any items (commit messages, user names, repository names, etc.). It allows querying a subset of results provided by the time filter and filters already applied. By default, any free text items in any database columns are included (*; entering this also resets a search). All available fields (database columns) are listed when clicking "Add a filter" next to the filters.

The query syntax in the Advanced filter text field is based on the Lucene query syntax: You enter a field name followed by a colon followed by the value. Query examples:

  • Gerrit: author_org_name:("Wikimedia Foundation" OR "Wikimedia Deutschland") AND author_bot:false AND (repository:"mediawiki/extensions/Echo" OR repository:"mediawiki/extensions/Flow") AND status_value:"1" AND status:"NEW"
  • Git: author_name:"Andre Klapper" AND repo_name:"https://gerrit.wikimedia.org/r/operations/puppet"
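Queries like the ones above can also be assembled programmatically before pasting them into the Advanced filter text field. The following is only a minimal sketch: the helper names `quote`, `term`, `any_of` and `all_of` are hypothetical and not part of Kibana or GrimoireLab, and only the field names and values are taken from the examples above.

```python
def quote(value):
    """Wrap a value in double quotes, escaping backslashes and embedded quotes."""
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def term(field, value):
    """Build a single field:"value" clause."""
    return f"{field}:{quote(value)}"


def any_of(field, values):
    """Build field:("a" OR "b") for a list of alternative values."""
    alternatives = " OR ".join(quote(v) for v in values)
    return f"{field}:({alternatives})"


def all_of(*clauses):
    """Join clauses with AND so that every clause has to match."""
    return " AND ".join(clauses)


query = all_of(
    any_of("author_org_name", ["Wikimedia Foundation", "Wikimedia Deutschland"]),
    "author_bot:false",
    term("status", "NEW"),
)
print(query)
```

The printed string can be copied as-is into the Advanced filter text field of a Gerrit dashboard.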

The available fields across databases can also be looked up via the Discover functionality in the main panel (use the "git" dropdown below the text field to change to another database).[1] See Kibana Queries and Filters for more information.

Some more notes on advanced filters:

  • The type of field (string, number, date, etc.) influences the query syntax
  • Queries are case sensitive
  • You can only create queries which use fields within the respective index that is used in a panel (simply put, an "index" in Elasticsearch is roughly comparable to a database); otherwise the search will return "No results found".
  • Fields not available in an index by default use -1 for numbers and na for strings

Behavior that might surprise you

  • The data for Git repositories shows some individuals and companies that have never contributed to Wikimedia. That is because Wikimedia uses many upstream software projects and imports their Git repositories (including their change history) into Wikimedia Git. "There are cases where a repository's history consists primarily of upstream commits, but with some substantial packaging work by Wikimedia engineers, for example."[2] You may want to look at Wikimedia Gerrit data instead.
  • Some numbers which you might expect to be identical might differ because they are based on different data fields. For example, on the "Gerrit Overview" dashboard, the number of "Changeset Submitters" in the top widget can differ from the number of entries in the "Submitters" list widget at the bottom. The two widgets use different data fields in the database. See phab:T184741 for more information.
  • You might not always easily find yourself (your author_name) in the system, due to how the data is stored and collected: Every identity can have several profiles (for example one profile for the Gerrit account, another profile for a Phabricator account, another profile for a mediawiki.org account, and all those profile names might differ). The name of your one identity depends on which of these profile names was indexed first, or it could be an identity name that was manually edited.
  • Your data might be incomplete. That can happen when your profiles in different systems (Gerrit, Phabricator, mediawiki.org, etc) have not all been merged yet into one identity. Also note that for Phabricator, only created tasks are indexed currently but not activity such as commenting.
  • Some recently created repositories might be missing. There is no code that crawls Wikimedia Gerrit for newly created repositories. Anyone can manually find missing repositories by using this bash script and then propose a merge request for https://gitlab.com/Bitergia/c/Wikimedia/sources/blob/master/projects.json
  • Wikimedia mirrors many code repositories between Wikimedia Gerrit and GitHub. If the commit hash is the same in both GitHub and Gerrit, then the commit is correctly counted only once, even if you had both the Gerrit repository and the GitHub repository for the same project displayed.
  • Organization "-- UNDEFINED --" is shown when a MediaWiki edit was indexed which was made by a user that is now hidden (example).
  • The status_value in Gerrit stores the value of the last Code-Review approval for a change set - it does not necessarily refer to the last patch set of a change set. See discussion in phab:T224755. (For the record, get_time_first_review and get_time_first_review_patchset only consider Gerrit Code Review label values, not Gerrit Verified labels added by bots.)
  • You can only filter by reviewer_bot on comments, so this filter does not work on the Gerrit dashboard as the dashboard shows information about changesets, not comments. Generally speaking, indexes consumed by dashboards contain different types of documents (comment, approval, patchset and changeset).
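The note above about repositories mirrored between Gerrit and GitHub can be illustrated with a small sketch. This is not the dashboard's actual implementation, only an illustration of counting commits once per commit hash; the records, the hash values and the `origin` key are made up for the example.

```python
# Hypothetical commit records: the same commit appears once per mirror.
commits = [
    {"hash": "a1b2c3", "origin": "gerrit"},
    {"hash": "a1b2c3", "origin": "github"},  # mirror of the commit above
    {"hash": "d4e5f6", "origin": "gerrit"},
]

# Counting distinct hashes instead of records counts each commit only once.
unique_hashes = {commit["hash"] for commit in commits}

print(len(commits))        # number of records, including the mirrored one
print(len(unique_hashes))  # number of distinct commits
```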

How can I…?

Create a short URL to share with others

Click "Share" in the upper right corner. Under "Link", select "Short URL". You must be logged in for this (bottom left corner).

Number of Gerrit patches and patch authors in a year

Go to the "Gerrit 🡒 Overview" dashboard, set the Time Filter to "Absolute" and enter the From and To dates, and look at the "Changeset Submitters" and "Changesets" values in the "Gerrit" widget.

Number of Gerrit patches written by volunteer authors in a year

Go to the "Gerrit 🡒 Overview" dashboard, set the Time Filter to "Absolute" and enter the From and To dates, click the "Independent" portion of the pie chart in the "Organizations" panel, and look at the "Changesets" value in the "Gerrit" widget.

Number of merged Gerrit patches in the MediaWiki core repository written by volunteer authors in a year

Go to the "Gerrit 🡒 Overview" dashboard, set the Time Filter to "Absolute" and enter the From and To dates, click the "Independent" portion of the pie chart in the "Organizations" panel, click the "MERGED" portion of the pie chart in the "Status" panel, hover over "mediawiki/core" in the "Repositories" panel and click the "Filter for value" magnifier, and look at the "Changesets" value in the "Gerrit" widget.

Number of Gerrit patches in all MediaWiki extension repositories written by volunteer authors in a year

This is a bit trickier, as we need to use a wildcard (*) which we cannot enter directly in the advanced filter search box:

Go to the "Gerrit 🡒 Overview" dashboard, set the Time Filter to "Absolute" and enter the From and To dates, click the "Independent" portion of the pie chart in the "Organizations" panel. Click on "Add a filter +", click "Edit Query DSL", enter {"query":{"wildcard":{"repository":"mediawiki/extensions/*"}}} and press "Save".
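The Query DSL snippet above can also be generated rather than typed by hand, which helps avoid mismatched braces. The sketch below builds the same JSON and, as a rough local preview, uses Python's `fnmatchcase` to show which repository names such a pattern would match. The helper name `wildcard_filter` is hypothetical, and the local preview is only an approximation of how Elasticsearch evaluates wildcards.

```python
import json
from fnmatch import fnmatchcase


def wildcard_filter(field, pattern):
    """Build the Query DSL body for a wildcard match on a single field."""
    return {"query": {"wildcard": {field: pattern}}}


pattern = "mediawiki/extensions/*"
dsl = wildcard_filter("repository", pattern)

# Compact JSON, ready to paste into the "Edit Query DSL" box.
dsl_text = json.dumps(dsl, separators=(",", ":"))
print(dsl_text)

# Rough local preview of which repository names the pattern covers.
repositories = [
    "mediawiki/core",
    "mediawiki/extensions/Echo",
    "mediawiki/extensions/Flow",
]
matching = [name for name in repositories if fnmatchcase(name, pattern)]
print(matching)
```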

List of the most active patch authors in a Gerrit repository

Go to the "Gerrit 🡒 Overview" dashboard, hover the mouse over a repository name in the "Repositories" widget, click the + magnifier icon, and look at the "Submitters" widget.

List of the newest contributors in Gerrit

Go to the "Community 🡒 Demographics" dashboard, select "Gerrit" in the "Data Source" widget, click "Apply changes", and look at the "Last Attracted Developers" widget.

Note that the widget indexes any first activity / "contribution" in Gerrit (e.g. reviewing someone else's patch or commenting on someone else's patch without adding a review label), and not only activity when that developer is also the author of a patchset.

List of the most active reviewers in a Gerrit repository

Go to the "Gerrit 🡒 Approvals" dashboard, hover the mouse over a repository name, click the + magnifier icon, click "Apply now" in the Filters panel, and look at the "Approvals by Reviewer" widget.

List of the most active reviewers without CR+2 actions

Go to the custom dashboard at https://wikimedia.biterg.io/app/kibana#/dashboard/ffd01840-cdf9-11ea-8358-4d35848e335d. The first widget lists all patch reviewers who have not given any +2 in Gerrit. For further investigation on repositories, hover over a reviewer name and click the + magnifier icon to filter the results in the widget on the right for that reviewer. Note that calculating the widget on the right is slow.[3]

For some top-ranked reviewers, we may want to check whether they could be candidates for +2 rights in some repositories.

Time to first review for Gerrit patches per repository

Go to the Gerrit or Gerrit-Timing dashboard and look at the "Repositories" widget, which offers `Avg. Time First Review (Days)` and `Median Time First Review (Days)` columns. This data does not take into account patches that are still waiting for their first review. The data excludes bot reviews.[4]
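As a reminder of why the average and the median columns can differ considerably, here is a small sketch with made-up numbers (the values are purely hypothetical and not taken from the dashboard). A single patch that waited very long for its first review pulls the average up, while the median stays close to the typical case.

```python
from statistics import mean, median

# Hypothetical days until first review for four patches in one repository.
days_to_first_review = [1, 2, 3, 30]

avg_days = mean(days_to_first_review)
median_days = median(days_to_first_review)

print(avg_days)     # pulled up by the single slow review
print(median_days)  # closer to the typical waiting time
```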

Data about currently open patches in a Gerrit repository

"Open" means neither merged nor abandoned. General caveat for all subitems here: status_value stores the value of the last Code-Review approval for a change set, not necessarily for the last patch set of a change set. See https://phabricator.wikimedia.org/T224755#5813556

On any Gerrit dashboards on https://wikimedia.biterg.io,

  • setting the Advanced Filter to status:"NEW" AND status_value:"1" lists data about open patches with some (!) patch set (not necessarily the last one) having the CR+1 label
  • setting the Advanced Filter to status:"NEW" AND status_value:"-1" lists data about open patches with some (!) patch set (not necessarily the last one) with the CR-1 label
  • setting the Advanced Filter to status:"NEW" AND !_exists_:status_value lists data about open patches with no patch set ever having a CR label (CR=0)
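The three Advanced Filter strings above follow one pattern, which can be written down as a small helper. This is only a sketch; the function name `open_patch_filter` is hypothetical, and the caveat about status_value not necessarily referring to the last patch set applies unchanged.

```python
def open_patch_filter(cr_label=None):
    """Return the Advanced Filter string for open patches.

    cr_label is 1 or -1 for patches where some patch set got that
    Code-Review label, or None for patches that never got a CR label.
    """
    if cr_label is None:
        return 'status:"NEW" AND !_exists_:status_value'
    return f'status:"NEW" AND status_value:"{cr_label}"'


print(open_patch_filter(1))
print(open_patch_filter(-1))
print(open_patch_filter())
```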

Get a list of open changesets by Code Review label of their very last patchset

Use the custom dashboard at https://wikimedia.biterg.io/app/kibana#/dashboard/5e903de0-bdd0-11ea-8358-4d35848e335d created for phab:T224755 to get lists of changesets per CR-1, CR+1, or without a CR label on their very latest (!) patchset.

As of July 2020, this dashboard lacks statistics and only provides lists.

Median time that unreviewed Gerrit patchsets are open, per repository

Direct link

Go to the Gerrit Efficiency dashboard, set the Advanced Filter to !_exists_:status_value to only show unreviewed open patches, and sort the "Repositories" widget by clicking on the column header of "Median Time Open (Days)".

Median time that Gerrit patchsets with some changeset which was +1'ed are open, per repository

Direct link - Caveat: status_value stores the value of the last Code-Review approval for a change set, not necessarily for the last patch set of a change set. See https://phabricator.wikimedia.org/T224755#5813556

Go to the Gerrit Efficiency dashboard, set the Advanced Filter to status_value:"1" to only show open patches which have a changeset which at some point (!) was +1ed, and sort the "Repositories" widget by clicking on the column header of "Median Time Open (Days)".

Number of unreviewed open Gerrit patchsets, per repository

Direct link; note that results might still be incorrect due to Gerrit data inconsistencies - Caveat: status_value stores the value of the last Code-Review approval for a change set, not necessarily for the last patch set of a change set. See https://phabricator.wikimedia.org/T224755#5813556

Go to the Gerrit Overview dashboard, set the Advanced Filter to status:"NEW" AND !_exists_:status_value to only show unreviewed open patches (CR=0), and sort the "Repositories" widget by clicking on the column header of "Changesets".

Number of open Gerrit patchsets with some changeset that got +1'ed, per repository

Direct link - Caveat: status_value stores the value of the last Code-Review approval for a change set, not necessarily for the last patch set of a change set. See https://phabricator.wikimedia.org/T224755#5813556

Go to the Gerrit Overview dashboard, set the Advanced Filter to status:"NEW" AND status_value:"1" to only show open patches with CR+1, and sort the "Repositories" widget by clicking on the column header of "Changesets".

Average time of open Gerrit changesets per author affiliation/organization

See the custom Gerrit Timing by author affiliation dashboard for the average time that changesets with either no Code Review label or with a +1 Code Review label for some (!) changeset are open per author affiliation.

List of the most active email senders in mailing lists

Go to the Mailing Lists dashboard and look at the "Email Senders" widget. Note that only some mailing lists are indexed: see the section "pipermail" in projects.json for an exact list.

Architecture and source code

Everything is based on Kibana dashboards and Elasticsearch. The database provides indexes whose fields are used in panels, widgets and for searches.

Details on the underlying software architecture can be found on grimoirelab.github.io. A comprehensive GrimoireLab Tutorial and some webinar videos are available.

Source code of the Grimoirelab components is available. Most code is written in Python. The existing repositories are:

  • perceval: Data retrieval platform which creates JSON files. perceval/backends contains the available backends. Data is stored in Elasticsearch. (Source code)
  • kingarthur: Commander tool to run perceval and set up the panels. (Source code)
  • kibiter: Visualization on top of ElasticSearch. A fork of Kibana which contains changes until they get merged in the upstream code base. (Source code)
  • Sigils: The GrimoireLab standard set of panels in numerous JSON files. (Source code)
  • GrimoireELK: An incubator for new ideas. (Source code)
  • SortingHat: Command line interface to manage the data in our database. For admins, a complete database dump is available as a JSON file which allows manual account merging, updating affiliations, adding country information or marking an account as a bot. (Source code)
  • Hatstall: Web-based interface to manage the data in our database. (Source code; Access for Wikimedia administrators)
  • sirmordred: Orchestrates the execution of tools to produce a dashboard.
  • Bestiary: Web-based interface to manage the Sirmordred projects.json configuration files which specify the URLs of data sources and structure of projects. (Source code)
  • The configuration of our data sources is defined in a json configuration file. See the documentation for what is supported.

The steps performed are basically: Sources → Data gathering (mining via Perceval) → Data enrichment (e.g. producing indexes in ElasticSearch via GrimoireELK) → Visualization (ElasticSearch and Kibana).

For administrators

Once logged in via the "Login" item at the bottom of the main panel, functionality such as inspecting the parameters of visualizations or saving dashboards and widgets becomes available, without the error messages shown to users who are not logged in. You can analyze specific data, and create and edit widgets, visualizations and dashboards (including custom elements). To exit edit mode, click the "Logout" item at the bottom of the main panel.

Discover allows you to analyze specific data.

Choose a database from the dropdown in the left panel. Then expand the time span.
Results are displayed as a list of dropdown data items. Opening a dropdown displays all fields and their values as JSON or a table. A Kibana/ES visualization based on the JSON data is displayed on top.
Specific fields can be added as columns to the displayed results by adding/removing those fields in the left panel. It is basically a huge matrix, and if we wanted more data, more fields could be added in the future (e.g. "Gender").

Visualize allows creating a new visualization/widget (available types are e.g. data table, line chart, pie chart) or opening an existing saved visualization. Admins could rearrange and save. If you alter a saved visualization and want to keep the previous one, save the new one under a new name and then insert it into the dashboard.

When opening an existing saved visualization, the right panel shows the visualization view. The left panel shows the definitions: There are y-axis metrics (for each group; what am I going to solve) and x-axis buckets (grouping things).
  • Metrics have an Aggregation (e.g. median, sum, unique count, percentiles) on a certain Field and a CustomLabel to display.
  • Buckets have the same parameters and an Interval (e.g. to display yearly instead of weekly bars).
To write a new visualization from scratch, click the + button and select for example "Pie". Choose From a New Search, Select Index and select for example the "git" index. An empty pie chart will be shown as nothing is defined yet (no buckets, hence it is the total number of everything).
Under buckets, choose Select buckets type and choose for example Split Slices. Set Aggregation to for example Terms (means: look for a specific field in every commit). Set Field to a value, for example "author_org_name" (means: by organization names). Set Order for example to Descending and Size to 10 to display the ten biggest companies in the pie chart. To display these changes, click the green Apply changes button at the top left.
Advanced: You can also Add sub-buckets at the bottom. For example, if you visualize bars and want to split each displayed bar to display several companies, go for Split bars. The order of buckets can be important when having sub-buckets, for example if you split bars before the x-axis in the previous example, the legend field in the visualization will be ordered by displaying the most active company in the first place of the legend list.
Advanced: When creating a new visualization you can also choose Or, From a Saved search to create new visualizations on top of searches instead of indices to avoid using a full index. Beforehand, under Discover you have to define a search as a specific view of a search.
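For readers who prefer to think in terms of queries, the pie chart configuration described above (Terms aggregation on "author_org_name", Descending order, Size 10) roughly corresponds to an Elasticsearch terms aggregation. The sketch below builds such a request body; the function name `top_terms_body` and the aggregation name `top_values` are hypothetical, and the body Kibana generates internally may differ in detail.

```python
import json


def top_terms_body(field, size=10):
    """Build a search body that returns the most frequent values of a field."""
    return {
        "size": 0,  # return only the aggregation, not individual documents
        "aggs": {
            "top_values": {
                "terms": {
                    "field": field,
                    "size": size,
                    "order": {"_count": "desc"},
                }
            }
        },
    }


body = top_terms_body("author_org_name")
print(json.dumps(body, indent=2))
```

A body like this could be tried out in the Dev Tools Console against the "git" index.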

If you are interested in certain visualizations, contact the #Team.

Dashboard allows creating and editing dashboards.

When an administrator loads an existing dashboard (via choosing it from the displayed list of dashboards), modifies it (e.g. dragging around widgets), and saves the changes under the same name, the view of that dashboard is modified for all users. When using a different name for a dashboard, an administrator would still have to add the link to that new dashboard to make it available for all users.

Timelion is supposed to allow you to create time series using DSL queries. Example query for the gerrit_lead_time visualization in our instance: .es(index=gerrit, q='status:MERGED OR ABANDONED', timefield=closed, metric=avg:timeopen).label('Avg. Lead Time')

Dev Tools: The Console allows building custom Elasticsearch queries.

Management offers access to internal configuration.

  • The Index Patterns tab allows configuring an index pattern. It lists information about all indices and all index series (a collection of indices). You can see all the fields by name or type. Via the controls column on the right, you could for example convert the type of a field from "string" to "date". This is also the place to make Kibana aware of new data in Elasticsearch by adding the name of the Elasticsearch index.
  • The Saved Objects tab lists all saved objects such as Dashboards, Searches and Visualizations and allows editing them directly, e.g. to change the number of buckets from 5 to 10 in a visualization. This is currently not possible via the UI and it is also prone to break the raw configuration.
You can choose any object from the lists and export it as a JSON file.

Hatstall offers a web user interface to update (affiliations etc.) and merge user account data.

In general, the name of a custom object (such as a custom visualization or custom dashboard) created manually by an admin must have the prefix C_ so it does not get overwritten by the next upstream software update. Select "Save dashboard" and make sure to enable the option "Save as a new dashboard". One could also save a dashboard as a copy of the original one. If we want a new dashboard to appear in the top bar menu, we have to request a setting change from Bitergia.

REST API

Note that the database contains enrollment data (to allow filtering on organizations) so access via the API is currently not public.

The URL endpoint for the REST API which provides results in JSON format is https://wikimedia.biterg.io/data/. To search within a specific index (for example "gerrit"), use https://wikimedia.biterg.io/data/gerrit/_search. To quickly test queries, Kibana offers the Dev Tools app.

Example query: curl -X GET "https://wikimedia.biterg.io/data/_search" -u yourUsername -H 'kbn-xsrf: true' -H 'Content-Type: application/json' -d' {"aggs":{"2":{"terms":{"field":"status", "order":{"_count":"desc"}}}},"query": {"query_string":{"query":"*username"}}}'
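The same request can be prepared from a script. The following is a minimal sketch using only the Python standard library. It assumes HTTP Basic authentication (which is what curl's -u option uses by default), and the function name `build_search_request` as well as the placeholder credentials are hypothetical. The sketch only builds the request and does not send it.

```python
import base64
import json
import urllib.request


def build_search_request(base_url, index, username, password, query_string):
    """Prepare (but do not send) a _search request mirroring the curl example."""
    body = {
        "aggs": {"2": {"terms": {"field": "status", "order": {"_count": "desc"}}}},
        "query": {"query_string": {"query": query_string}},
    }
    # Assumption: HTTP Basic authentication, as used by curl -u.
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    url = f"{base_url.rstrip('/')}/{index}/_search"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Basic {token}",
            "kbn-xsrf": "true",
            "Content-Type": "application/json",
        },
        method="GET",
    )


request = build_search_request(
    "https://wikimedia.biterg.io/data", "gerrit",
    "yourUsername", "yourPassword", "*username",
)
print(request.get_method(), request.full_url)
```

Passing the prepared request to `urllib.request.urlopen` would send it; as noted above, this requires an account because API access is not public.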

See the upstream documentation for more information about the REST API and full string queries.

Team

Andre Klapper from the Developer Advocacy team coordinates the Metrics Dashboard project, which is being implemented by Bitergia as contractors.

The Bitergia team working on the MediaWiki dashboard consists of Daniel Izquierdo, Luis Cañas, and Jesus Gonzalez Barahona, with Alvaro del Castillo as project manager.

https://github.com/chaoss/ is used to track any general upstream issues. The Wikimedia Foundation can also file support requests to Bitergia in a non-public GitLab instance.

Further links

If you would like to see specific customizations, please file a request in Wikimedia Phabricator including a user story.

Limitations

Wikimedia code development happens in many places, and some places are not indexed by the Wikimedia Tech community metrics dashboard:

  • Canonical Wikimedia repositories on GitHub: The software can index GitHub repositories and this is planned in phab:T186736 but currently (December 2018) blocked on phab:T109939.
  • Code hosted in wiki pages (gadgets, user scripts, modules, potentially templates): The software currently only indexes activity on mediawiki.org and not on other Wikimedia sites, and querying activity on (code related) wiki pages only in specific namespaces would need to be implemented. Regarding external tools, Quarry allows running SQL queries but currently (December 2018) does not support queries across sites (see phab:T95582), and WMF Product Analytics uses Hive, though it is unclear how useful it could be in this context.
  • Maintainers of tools hosted on Toolforge are free to host the code repositories of their tools wherever they would like to, which makes it hard to identify and index these places.

Footnotes

  1. The list of all available fields might be confusing though as certain fields are obviously only available in certain contexts. For example, author_name works 'globally' on all data sources while author_user_name and name only work for the Gerrit data source. See the comparison screenshot in phab:T177890#3805694.
  2. https://phabricator.wikimedia.org/T103292#2083626
  3. https://phabricator.wikimedia.org/T199385#6334801
  4. phab:T242964