Wikimedia Product/Data dictionary/Reconcile datasets in Superset with Key Product Metrics

Our Key Product Metrics are a set of core metrics designed to give a high-level picture of the movement's health. These metrics are reported monthly to the Wikimedia Foundation board and at Wikimedia_Product, with further analysis in the quarterly metrics & insights presentations. The Key Product Metrics were first defined in 2013 to measure the progress of our movement strategy.

Because these metrics give only a bird's-eye view, the Product Analytics team provides a range of datasets in data exploration and visualization tools such as Superset and Turnilo for people within the Foundation who want to know more about the components of our flywheel.

Data for content, pageviews, and editors is available so that users can slice and dice the elements of our core metrics, examine them from different viewpoints, and gain a deeper understanding of the impact of our projects.

The Key Product Metrics and the datasets available in Superset are derived in slightly different ways. Because the two serve different purposes, there will always be some difference between the numbers they report.

Given below are the differences and explanations for why they exist.

Editors metric
The Editors metric gives the overall count of all new and returning active editors who have edited content pages across all Wikimedia projects in a given month. Active editors are registered users who made at least 5 content edits across all projects in the given month. New active editors are users who registered during the given month, while returning active editors are users who registered before the given month.

Both the key metrics deck and the Superset editors dashboard are generated from an aggregated table that is updated monthly. Both report the overall active editors metric and the new/returning active editors metrics.

The Superset editors dashboard also allows users to slice the data by dimensions such as Location (Market), Project family, and Project. Editors who edit on more than one wiki, or who edit from multiple locations (belonging to both the Global South and Global North markets), are counted more than once in these slices. As a result, if a user sums the active editor counts across all projects or locations, the result will be greater than the overall active editor count.
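The double counting described above can be illustrated with a minimal Python sketch. The rows, editor names, and the way a per-project slice is built are all assumptions made for illustration; the real aggregated table and its columns are not shown on this page.

```python
from collections import defaultdict

# Hypothetical (editor, project, content_edits) rows for one month.
# These names and numbers are invented for illustration only.
edits = [
    ("alice", "enwiki", 4),
    ("alice", "commonswiki", 3),
    ("bob", "enwiki", 6),
    ("bob", "dewiki", 5),
    ("carol", "frwiki", 2),
]

ACTIVE_THRESHOLD = 5  # at least 5 content edits across all projects


def overall_active_editors(rows):
    """Return the set of editors with >= 5 content edits across all projects."""
    totals = defaultdict(int)
    for editor, _project, n_edits in rows:
        totals[editor] += n_edits
    return {editor for editor, total in totals.items() if total >= ACTIVE_THRESHOLD}


def active_editors_per_project(rows):
    """Assumed slicing: an active editor appears in every project they edited."""
    active = overall_active_editors(rows)
    per_project = defaultdict(set)
    for editor, project, _n_edits in rows:
        if editor in active:
            per_project[project].add(editor)
    return {project: len(editors) for project, editors in per_project.items()}


overall_count = len(overall_active_editors(edits))
per_project_counts = active_editors_per_project(edits)
summed_count = sum(per_project_counts.values())

print("overall active editors:", overall_count)
print("sum of per-project counts:", summed_count)
```

In this toy data, alice and bob are each active on two wikis, so each is counted once per wiki in the slice, and the summed per-project figure exceeds the deduplicated overall count.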

Readers metric
The pageviews figure in the readers Key Product Metrics is the monthly pageview count for each calendar month, generated from

As noted in the restricted Phabricator task, there are pageviews from Internet Explorer user agents on desktop, without a referrer, that seem more likely to be non-human. However, we have not been able to filter these pageviews as bots, so they are still reported as user pageviews. As a workaround, the query that calculates the readers metric was modified to exclude spurious IE pageviews from the top 3 countries that contribute to this traffic: Pakistan, Iran, and Afghanistan.

The monthly readers dashboard in Superset generates its monthly pageview counts from the pageviews_daily dataset. The dashboard has also been modified to account for this change. However, Superset filters do not accommodate nested conditions, so all pageview traffic from the 3 countries (PK, IR, and AF) has been removed on the dashboard.

As a result, the readers Key Product Metrics exclude only the IE pageviews from the 3 countries, while the Superset readers dashboard excludes all pageviews from those 3 countries.
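The two filters can be compared with a small Python sketch. The rows and view counts below are made up and do not reflect real traffic; the field layout (country, browser family, views) is an assumption, not the schema of pageviews_daily.

```python
# Hypothetical (country_code, browser_family, views) rows for one month.
# All values are invented for illustration only.
pageviews = [
    ("PK", "IE", 100),
    ("PK", "Chrome", 40),
    ("IR", "IE", 50),
    ("IR", "Firefox", 30),
    ("AF", "Chrome", 10),
    ("US", "IE", 20),
    ("US", "Chrome", 500),
]

EXCLUDED_COUNTRIES = {"PK", "IR", "AF"}


def kpm_pageviews(rows):
    """Nested condition: drop a row only if it is IE AND from an excluded country."""
    return sum(
        views
        for country, browser, views in rows
        if not (country in EXCLUDED_COUNTRIES and browser == "IE")
    )


def dashboard_pageviews(rows):
    """Flat condition: drop every row from an excluded country."""
    return sum(
        views
        for country, _browser, views in rows
        if country not in EXCLUDED_COUNTRIES
    )


kpm_total = kpm_pageviews(pageviews)
dashboard_total = dashboard_pageviews(pageviews)
gap = kpm_total - dashboard_total

print("KPM-style total:", kpm_total)
print("dashboard-style total:", dashboard_total)
print("non-IE views from the 3 countries missing on the dashboard:", gap)
```

The gap between the two totals is exactly the non-IE traffic from the three countries, which the key metric keeps and the dashboard drops.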

The difference between the Key Product Metrics and the readers metrics dashboard is less than 3%; see KPM Difference for the detailed range.

Content metric
The net new content metric gives the number of content pages added since the previous month, excluding deleted pages and redirects (current and historical) and including restores of old pages. The metric is calculated from the AQS Wikistats 2 API by subtracting last month's total content pages from this month's.

In the Superset content metrics dashboard, the net new content metric is generated from the edit_hourly dataset, which counts newly created content pages and excludes pages that are currently redirects. In this case, historical redirects and restored pages are NOT included.

This is the cause of the discrepancy between the net new content counts in the Key Product Metrics and those in the content metrics dashboard.
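As a rough illustration, the sketch below contrasts the two calculations on invented data. The `PageEvent` structure, the event kinds, and all numbers are hypothetical; it models only restored pages and current redirects, and it does not call the AQS Wikistats 2 API or read the edit_hourly dataset.

```python
from dataclasses import dataclass


@dataclass
class PageEvent:
    """Hypothetical record of one content page event in the month."""
    title: str
    kind: str  # "create" or "restore" (assumed event kinds)
    is_redirect_now: bool


# Invented events for one month.
events = (
    [PageEvent(f"new_{i}", "create", False) for i in range(45)]
    + [PageEvent(f"redirect_{i}", "create", True) for i in range(5)]
    + [PageEvent(f"restored_{i}", "restore", False) for i in range(3)]
)

total_last_month = 1000  # invented total content pages last month


def total_content_pages(previous_total, month_events):
    """Totals-based view: redirects are excluded, restores are included."""
    added = sum(1 for e in month_events if not e.is_redirect_now)
    return previous_total + added


def kpm_net_new(previous_total, month_events):
    """Key metric: this month's total minus last month's total."""
    return total_content_pages(previous_total, month_events) - previous_total


def dashboard_net_new(month_events):
    """Dashboard-style view: newly created pages that are not currently redirects."""
    return sum(
        1 for e in month_events if e.kind == "create" and not e.is_redirect_now
    )


kpm_value = kpm_net_new(total_last_month, events)
dashboard_value = dashboard_net_new(events)

print("net new content (totals subtraction):", kpm_value)
print("net new content (created pages only):", dashboard_value)
```

Here the two figures differ by the three restored pages, which enter the month-over-month total but are never counted as newly created pages.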

The difference between the Key Product Metrics and the content metrics dashboard is less than 4%; see KPM Difference for the detailed range.
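When reconciling a dashboard figure against a key metric, a quick check like the following may help. It assumes the difference is measured relative to the Key Product Metrics value, which this page does not state explicitly, and the sample figures are invented.

```python
def relative_difference_pct(kpm_value: float, dashboard_value: float) -> float:
    """Absolute difference as a percentage of the KPM value (assumed definition)."""
    if kpm_value == 0:
        raise ValueError("KPM value must be non-zero")
    return abs(kpm_value - dashboard_value) / abs(kpm_value) * 100


# Thresholds taken from the text above.
READERS_THRESHOLD_PCT = 3.0
CONTENT_THRESHOLD_PCT = 4.0

# Invented sample figures, for illustration only.
readers_diff = relative_difference_pct(1_000_000, 985_000)
content_diff = relative_difference_pct(200, 194)

print(f"readers difference: {readers_diff:.2f}%")
print(f"content difference: {content_diff:.2f}%")
print("readers within expected range:", readers_diff < READERS_THRESHOLD_PCT)
print("content within expected range:", content_diff < CONTENT_THRESHOLD_PCT)
```

A difference well above these thresholds would suggest a cause other than the known ones described on this page.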