Wikimedia Product/Data dictionary/Reconcile datasets in Superset with Key Product Metrics

From mediawiki.org

Our Key Product Metrics are a set of core metrics designed to give a high-level picture of the movement's health. These metrics are reported monthly to the Wikimedia Foundation board and at Wikimedia_Product#Metrics, with further analysis in the quarterly metrics & insights presentations. The key product metrics were first defined back in 2013 to measure the progress of our movement strategy.

Since these metrics only give a bird's-eye view, the Product Analytics team provides a range of datasets in data exploration and visualization tools like Superset and Turnilo for people within the Foundation who want to know more about the components of our flywheel.

Data for content, pageviews, and editors is available so that users can slice and dice elements of our core metrics, examine them from different viewpoints, and gain a deeper understanding of the impact of our projects.

Our Key Product Metrics and the datasets available in Superset are derived in slightly different ways. Because the two serve different purposes, there will always be some difference between the numbers they report.

Given below are the differences and explanations for why they exist.

Editors metric

The editors metric gives the overall count of all new and returning active editors who have edited content pages across all Wikimedia projects in a given month. Active editors are registered users who made at least 5 content edits across all projects in the given month. New active editors are users who registered during the given month, while returning active editors are users who registered before the given month.

Both the Key Product Metrics deck and the Superset editors dashboard are generated from an aggregated table, neilpquinn.editor_month, that is updated monthly. They both report the overall active editors and new/returning active editors metrics.

The Superset editors dashboard also allows users to slice the data by dimensions such as location (market), project family, and project. Editors who edit on more than one wiki, or who edit from multiple locations (spanning the Global South and Global North markets), are counted once in each slice they appear in. As a result, if a user sums the active editor counts across all projects or locations, the result will be greater than the overall active editor count.
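The double counting can be illustrated with a small sketch. The user names, projects, and edit counts below are invented for illustration, and the per-project attribution rule (an active editor appears in every project they edited) is an assumption about how the dashboard slices the data, not a description of the actual neilpquinn.editor_month logic.

```python
from collections import defaultdict

# Hypothetical monthly data: (user, project, content edits). Not real figures.
edits = [
    ("alice", "enwiki", 4),
    ("alice", "commonswiki", 3),
    ("bob", "enwiki", 6),
    ("carol", "dewiki", 5),
    ("carol", "enwiki", 7),
    ("dave", "frwiki", 2),
]

ACTIVE_THRESHOLD = 5  # at least 5 content edits across all projects


def overall_active_editors(rows):
    """Return the set of users with enough edits summed over all projects."""
    totals = defaultdict(int)
    for user, _project, count in rows:
        totals[user] += count
    return {user for user, total in totals.items() if total >= ACTIVE_THRESHOLD}


def active_editors_by_project(rows):
    """Count active editors per project (assumed rule: an active editor
    is attributed to every project they edited in that month)."""
    active = overall_active_editors(rows)
    per_project = defaultdict(set)
    for user, project, _count in rows:
        if user in active:
            per_project[project].add(user)
    return {project: len(users) for project, users in per_project.items()}


overall = len(overall_active_editors(edits))
per_project = active_editors_by_project(edits)
summed = sum(per_project.values())

print("overall active editors:", overall)
print("per-project counts:", per_project)
print("sum of per-project counts:", summed)
```

In this toy data, alice and carol each edit two wikis, so each of them contributes twice to the summed per-project figure while counting only once in the overall figure.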

Readers metric

The pageviews metric is the total number of pageviews in a calendar month.

In Key Product Metrics, the pageviews metric is generated from wmf.pageviews_hourly. The readers metric excludes spurious IE browser pageviews from the top three countries that contribute to this traffic (Pakistan, Iran, and Afghanistan) using the following query condition. See the Phabricator task for more detail about this correction.

AND NOT (country_code IN ('PK', 'IR', 'AF') AND user_agent_map['browser_family'] = 'IE')

In the Superset readers metrics dashboard, the monthly pageviews metric is generated from the pageviews_daily dataset. Since Superset filters do not accommodate nested conditions, the spurious IE browser pageviews from Pakistan, Iran, and Afghanistan are NOT excluded.
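The effect of the exclusion can be sketched in a few lines. The rows and view counts below are made up for illustration, and the function name is hypothetical; only the condition itself (IE pageviews from PK, IR, and AF) comes from the query condition quoted in this section.

```python
# Hypothetical aggregated rows: (country_code, browser_family, view_count).
rows = [
    ("US", "Chrome", 1000),
    ("PK", "IE", 50),
    ("PK", "Chrome", 200),
    ("IR", "IE", 30),
    ("DE", "IE", 20),
]

SPURIOUS_COUNTRIES = {"PK", "IR", "AF"}


def is_spurious(country_code, browser_family):
    """Mirror the nested condition: country in the list AND browser is IE."""
    return country_code in SPURIOUS_COUNTRIES and browser_family == "IE"


# Key Product Metrics style: drop rows matching the nested condition.
kpm_total = sum(v for c, b, v in rows if not is_spurious(c, b))

# Dashboard style: no exclusion, every row is counted.
dashboard_total = sum(v for _c, _b, v in rows)

difference_pct = (dashboard_total - kpm_total) / kpm_total * 100

print("KPM total:", kpm_total)
print("dashboard total:", dashboard_total)
print(f"difference: {difference_pct:.1f}%")
```

Note that IE traffic from other countries and non-IE traffic from the three listed countries are both kept; only rows matching both parts of the condition are dropped, which is why a flat (non-nested) filter cannot reproduce it.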

The difference between Key Product Metrics and the readers metrics dashboard has been less than 3% over the last two years; see KPM Difference for the detailed range.

2020 Monthly Avg | Key Product Metrics | Superset Readers Dashboard | Difference
Pageviews | 17.2B | 17.4B | 1.2%

Content metric

The net new content metric gives the number of content pages added since the previous month.

In the Editing Movement Metrics GitHub repository, net new content is calculated from the AQS Wikistats 2 API by subtracting last month's total content pages from this month's total. It excludes deleted pages and redirects (both current and historical), and includes restores of old pages.

In the Superset content metrics dashboard, net new content is generated from the edit_hourly dataset, which counts newly created content pages. It excludes deleted pages and pages that are currently redirects, but not historical redirects, and restored pages are NOT included.

In this case, the discrepancy between the net new content counts in the Key Product Metrics and the Superset content dashboard comes from the number of historical redirects and restored pages.
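The two calculations can be contrasted with a toy model. The page IDs and sets below are invented, and the model covers only the restored-page part of the discrepancy (historical redirects are not modelled); it is a sketch of the two counting approaches described in this section, not the actual queries.

```python
# Hypothetical content-page snapshots (non-deleted, non-redirect page IDs).
content_pages_last_month = {1, 2, 3}
# Pages 4 and 5 were created this month; page 6 is an old page that was
# restored from deletion this month.
content_pages_this_month = {1, 2, 3, 4, 5, 6}

# Hypothetical creation events recorded this month.
created_this_month = {4, 5, 7}
# Page 7 was created this month but is currently a redirect.
current_redirects = {7}

# Approach 1 (difference of monthly totals): restores are included
# because the restored page appears in this month's total.
net_new_by_totals = len(content_pages_this_month) - len(content_pages_last_month)

# Approach 2 (count of creations): only pages created this month that are
# not currently redirects; the restored page is not a creation event.
net_new_by_creations = len(created_this_month - current_redirects)

print("net new content (totals):", net_new_by_totals)
print("net new content (creations):", net_new_by_creations)
print("gap:", net_new_by_totals - net_new_by_creations)
```

Here the gap of one page is exactly the restored page 6, which raises the month-end total without generating a page-creation event.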

The difference between the content metrics calculated in the Editing Movement Metrics GitHub repository and those in the content metrics dashboard has been less than 4% over the last two years; see KPM Difference for the detailed range. Note that both of these sources are used in the Key Product Metrics deck.

2020 Monthly Avg | Key Product Metrics | Superset Content Dashboard | Difference
Net New Content | 3.10M | 3.05M | 2%