Wikimedia Technology/Annual Plans/FY2019/TEC14: Smart Tools for Better Data/Goals

=Program Goals and Status for FY18/19=

TEC14: Smart Tools for Better Data
 * Goal Owner: Nuria Ruiz
 * Program Goals for FY18/19: We will maintain and increase public access to past, present and real-time data for Wikimedia projects. We will provide the infrastructure to measure the impact and reach of projects and features for editors, communities and WMF.
 * Annual Plan: TEC14: Smart Tools for Better Data
 * Primary Goal is Knowledge as a Service: Evolve our systems and structures
 * Tech Goal: Supporting our Community of contributors





=Q1 Goals =

Outcome / Output
Wikimedia Cloud Services users have easy access to high quality analytics data to answer questions about content and contributors.
 * Provision a cluster for public Data Lake access in Cloud Services

Goals

 * Order Data Lake hardware
 * Provide rationale for the SQL engine used to make data accessible in labs

Outcome 3 / Output 1
Foundation staff and community have better visual tools to access data about content, contributors and readers.
 * Wikistats 2.0 - Users (and programmatic tools) have access to most of the reports that the community consultation found important

Goals

 * Build most prolific contributors report  ✅
 * Include metrics about total article count (pages to date) in Wikistats 2 ✅

Outcome 3 / Output 2
Foundation staff and community have better visual tools to access data about content, contributors and readers.
 * Wikistats 2.0 - Beta (carry on items from last quarter)

Goals

 * Support annotations ✅
 * Improvements on pageview data per country ✅

Outcome 3 / Output 3
Foundation staff and community have better visual tools to access data about content, contributors and readers.
 * Support for more data sources and programming languages for WMF Jupyter Notebook users.

Goals

 * Better integration of Jupyter with Spark ✅

Outcome 4 / Output 1
Foundation staff and community have better visual tools to access data about content, contributors and readers.
 * Users see improvements on data computing and data quality.

Goals

 * Data sanitization backend for Hadoop that includes the ability to salt and hash. ✅
 * STRETCH GOAL: Proof of concept (POC) of more efficient bot filtering on pageview data.
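
As a rough illustration of the "salt and hash" idea in the sanitization goal above, the following minimal Python sketch replaces selected fields of a record with a keyed hash. It is not the Hadoop backend itself: the function names, the field names and the choice of HMAC-SHA256 are all assumptions made for this example.

```python
import hashlib
import hmac
import secrets


def new_salt(num_bytes: int = 32) -> bytes:
    """Generate a random salt (hypothetical helper; rotation policy is not shown)."""
    return secrets.token_bytes(num_bytes)


def salt_and_hash(value: str, salt: bytes) -> str:
    """Return a hex digest of the value keyed with the salt (HMAC-SHA256, an assumption)."""
    return hmac.new(salt, value.encode("utf-8"), hashlib.sha256).hexdigest()


def sanitize_event(event: dict, sensitive_fields: set, salt: bytes) -> dict:
    """Copy the event, replacing each sensitive field with its salted hash."""
    sanitized = {}
    for key, value in event.items():
        if key in sensitive_fields and value is not None:
            sanitized[key] = salt_and_hash(str(value), salt)
        else:
            sanitized[key] = value
    return sanitized


# Example usage with made-up field names.
salt = new_salt()
event = {"ip": "203.0.113.7", "page_title": "Main_Page"}
print(sanitize_event(event, {"ip"}, salt))
```

With the same salt, equal inputs map to equal hashes, so counts and joins on the hashed field still work; discarding or rotating the salt breaks the link between old and new hashes.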

Outcome 4 / Output 2
Foundation staff and community have better visual tools to access data about content, contributors and readers.
 * MediaWiki content is available on the cluster on a recurring schedule

Goals

 * STRETCH GOAL: Productionize MediaWiki content processing. Ingest and process the text of every Wikipedia page to use later for analytics-style computations

Outcome 5 / Output 1
We have scalable, performant and reliable software for data transport
 * Software maintenance on analytics stack to maintain current level of service

Goals

 * Spin out a tiny EventLogging ResourceLoader (RL) module for lightweight logging

Status
September 18, 2018
 * Work continues with the Performance team; the work was completed by the end of Q4



=Q2 Goals =

Outcome / Output
Foundation staff and community have better visual tools to access data about content, contributors and readers.
 * Wikistats 2.0 - Users (and programmatic tools) have access to most of the reports that the community consultation found important

Goals

 * Create report for "articles with most contributors" in Wikistats2 ❌
 * Create report for Active editor metrics per project family ❌
 * Provide easier mapping between Wikistats1 metrics and Wikistats2 metrics (example: "active editors") ❌
 * Provide ability to query metrics per project family (*.wikipedia.org)  in Wikistats UI  ✅

Status
December 2018
 * Changes to display project family data (for example, new registrations for all Wikipedias) deployed.

Outcome / Output
Wikimedia Cloud Services users have easy access to high quality analytics data to answer questions about content and contributors.
 * In this iteration (spanning several quarters) the Data Lake will include historical data about editing (revisions, pages, users) for all Wikimedia projects since the beginning. Data is optimized to be queried in an analytics-friendly way that allows for simple and fast queries.

Goals

 * Presto cluster online and usable, with test data pushed from the analytics prod infrastructure and accessible by Cloud (labs) users ✅
 * Edit Data Lake Quality: Resolve known issues (ongoing goal) ❌

Status
November 14, 2018
 * Presto setup on labs has started; we are discussing the flow of data with SRE

December 2018
 * Missed the rest of the goals due to issues with sqooping MediaWiki data from labs; those issues are being worked on in and

Outcome / Output
Foundation staff and community have better visual tools to access data about content, contributors and readers.
 * Users see improvements on data computing and data quality.

Goals

 * STRETCH GOAL: Proof of concept (POC) of more efficient bot filtering on pageview data. ✅

Status
December 2018
 * Finished the initial phase of the POC; running additional tests and writing up the results

Outcome 4 / Output 2
Foundation staff and community have better visual tools to access data about content, contributors and readers.
 * MediaWiki content is available on the cluster on a recurring schedule

Goals

 * STRETCH GOAL: Productionize MediaWiki content processing. Ingest and process XML dumps to use later for analytics-style computations ✅
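
To make "ingest and process XML dumps" more concrete, here is a small Python sketch that streams pages out of a MediaWiki-style XML export and computes a simple per-page statistic. It is an illustration only, not the production pipeline: the sample document, the function names and the word-count metric are assumptions made for this example.

```python
import io
import xml.etree.ElementTree as ET

# Tiny made-up sample in the shape of a MediaWiki XML export.
SAMPLE_DUMP = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Example</title><revision><id>1</id><text>Hello world</text></revision></page>
  <page><title>Other</title><revision><id>2</id><text>Some more text here</text></revision></page>
</mediawiki>"""


def local_name(tag: str) -> str:
    """Strip the XML namespace prefix, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, text) per <page>, streaming so the whole dump is never held in memory.

    If a page holds several revisions, the text of the last one wins.
    """
    title, text = None, ""
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, ""
            elem.clear()  # release the finished page's subtree


# Example analytics-style computation: words per page.
word_counts = {title: len(text.split()) for title, text in iter_pages(io.BytesIO(SAMPLE_DUMP))}
print(word_counts)
```

The same generator accepts an open file object, so a decompressed dump file could be passed in place of the in-memory sample.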



=Q3 Goals =

Outcome / Output
Foundation staff and community have better visual tools to access data about content, contributors and readers.


 * Wikistats 2.0 - Users (and programmatic tools) have access to most of the reports that the community consultation found important

Goals

 * Create report for "Articles With Most Contributors" in Wikistats2
 * Create report for "Active Editors" metrics per project family in Wikistats2
 * Provide easier mapping between Wikistats1 metrics and Wikistats2 metrics (example: "active editors")
 * Import Edit Data Lake dataset into Turnilo (WMF data exploration tool)
 * Create staging environment to test upgrades of Superset: z

Status
February 14, 2019
 * The comment and actor refactor on MediaWiki and performance problems on the labs db replicas have delayed much of this work. Those issues are being tracked in T210749 and T210693
 * We plan to deploy Superset's latest release to a staging environment by the end of the quarter

March 14, 2019

 * We have a staging environment in which we are testing the Superset upgrade. Both the "Articles With Most Contributors" and "Active Editors" metrics will be worked on next quarter. The import of edit data into Druid is under way, with a first import expected this quarter.

Outcome / Output
Wikimedia Cloud Services users have easy access to high quality analytics data to answer questions about content and contributors.


 * In this iteration (spanning several quarters) the Data Lake will include historical data about editing (revisions, pages, users) for all Wikimedia projects since the beginning. Data is optimized to be queried in an analytics-friendly way that allows for simple and fast queries.

Goals

 * Edit Data Lake Quality: Resolve known issues (ongoing goal)
 * Sunset wikimetrics. It is being replaced by the event-metrics tool:

Status
February 2019
 * Efforts to improve data quality in the Data Lake are on track to be completed this quarter.
 * The effort to sunset Wikimetrics is scheduled to start by the beginning of March.

March 2019

 * Wikimetrics is on track to be deprecated by the end of the quarter. The main work on quality issues is set to be done this quarter; some of it will spill over to next quarter.



=Q4 Goals =

Outcome / Output
Foundation staff and community have better visual tools to access data about content, contributors and readers.


 * Wikistats 2.0 - Users (and programmatic tools) have access to most of the reports that the community consultation found important

Goals

 * Create report for "articles with most contributors" in Wikistats2
 * Create report for "active editor metric" per project family (like "editors for wikisource")

Status
April 2019
 * Discussed...

May 2019
 * Discussed...

June 2019
 * Discussed...

Outcome / Output
Users see improvements on data computing
 * Foundations for ML: Initial deployment Pipeline

Dependencies: SRE

Goals

 * STRETCH GOAL: Develop a workflow to move computed data from Hadoop to production services T213976

Status
April 2019
 * Discussed...

May 2019
 * Discussed...

June 2019
 * Discussed...