Wikimedia Technology/Annual Plans/FY2019/TEC14: Smart Tools for Better Data/Goals

=Program Goals and Status for FY18/19=

TEC14: Smart Tools for Better Data
 * Goal Owner: Nuria Ruiz
 * Program Goals for FY18/19: We will maintain and increase public access to past, present and real-time data for Wikimedia projects. We will provide the infrastructure to measure the impact and reach of projects and features for editors, communities and WMF.
 * Annual Plan: TEC14: Smart Tools for Better Data
 * Primary Goal is Knowledge as a Service: Evolve our systems and structures
 * Tech Goal: Supporting our Community of contributors



= Q1 Goals =

Outcome 1 / Output
Wikimedia Cloud Services users have easy access to high quality analytics data to answer questions about content and contributors.
 * Provision a cluster for public Data Lake access in Cloud Services

Goal(s)

 * Order Data Lake hardware
 * Provide Rationale for SQL engine used to make data accessible in labs

Status
August 22, 2018
 * both goals are

September 18, 2018
 * Order Data Lake hardware
 * Provide Rationale for SQL engine used to make data accessible in labs

Outcome 3 / Output 1
Foundation staff and community have better visual tools to access data about content, contributors and readers.
 * Wikistats 2.0 - Users (and programmatic tools) have access to most of the reports that the community consultation found important

Goal(s)

 * Build most prolific contributors report
 * Include metrics about total article count (pages to date)

Status
August 22, 2018
 * both goals are

September 28, 2018
 * Both goals are ✅: the backend is deployed and the team is working on the frontend UI for the metrics.

Outcome 3 / Output 2
Foundation staff and community have better visual tools to access data about content, contributors and readers.
 * Wikistats 2.0 - Beta (carry on items from last quarter)

Goal(s)

 * Support annotations
 * Improvements on pageview data per country ✅

Status
July 2018

August 22, 2018
 * One goal is ✅; the other is not yet complete.

September 18, 2018
 * Both goals are done ✅

Outcome 3 / Output 3
Foundation staff and community have better visual tools to access data about content, contributors and readers.
 * Support for more data sources and programming languages for WMF Jupyter Notebook users.

Goal(s)

 * Better integration of Jupyter with Spark ✅

Outcome 4 / Output 1
Foundation staff and community have better visual tools to access data about content, contributors and readers.
 * Users see improvements on data computing and data quality.

Goal(s)

 * Data sanitization backend for Hadoop that includes the ability to salt and hash. ✅
 * STRETCH GOAL: POC of more efficient bot filtering on pageview data.
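
As an illustration of what "salt and hash" means for sanitization, here is a minimal Python sketch. It is not the Hadoop backend described in the goal. The function names, the field names and the choice of HMAC-SHA256 are assumptions made for this example only. The idea is that a sensitive value is replaced by a keyed hash, so that records can still be joined on that value while the salt is kept, and can no longer be linked once the salt is rotated or discarded.

```python
import hashlib
import hmac
import secrets


def new_salt(num_bytes: int = 32) -> bytes:
    """Generate a random salt (hypothetical helper; rotate it to break linkability)."""
    return secrets.token_bytes(num_bytes)


def salt_and_hash(value: str, salt: bytes) -> str:
    """Return a keyed SHA-256 digest of value; the same salt gives the same digest."""
    return hmac.new(salt, value.encode("utf-8"), hashlib.sha256).hexdigest()


def sanitize_record(record: dict, sensitive_fields: set, salt: bytes) -> dict:
    """Copy record, replacing non-empty sensitive fields with their salted hash."""
    return {
        key: salt_and_hash(str(val), salt)
        if key in sensitive_fields and val is not None
        else val
        for key, val in record.items()
    }


# Example with made-up field names and values.
salt = new_salt()
event = {"ip": "192.0.2.1", "page": "Main_Page"}
clean = sanitize_record(event, {"ip"}, salt)
```

In this sketch `clean["page"]` is left as it was, while `clean["ip"]` becomes a 64-character hexadecimal digest that changes whenever a new salt is used.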

Outcome 4 / Output 2
Foundation staff and community have better visual tools to access data about content, contributors and readers.
 * MediaWiki content is available on cluster on recurrent schedule

Goal(s)

 * STRETCH GOAL: Productionize MediaWiki content processing. Ingest and process the text of every Wikipedia page for later use in analytics-style computations

Status
July 2018

August 22, 2018

September 18, 2018
 * Better and easier ingestion will be available by end of quarter

Outcome 5 / Output 1
We have scalable, performant and reliable software for data transport
 * Software maintenance on analytics stack to maintain current level of service

Goal(s)

 * Spin out a tiny EventLogging RL module for lightweight logging

Status
August 22, 2018

September 18, 2018
 * Work continues with the Performance team to refactor the EventLogging extension



= Q2 Goals =

Outcome / Output
Foundation staff and community have better visual tools to access data about content, contributors and readers.
 * Wikistats 2.0 - Users (and programmatic tools) have access to most of the reports that the community consultation found important

Goal(s)

 * Create report for "articles with most contributors" in Wikistats2
 * Create report for Active editor metrics per project family
 * Provide easier mapping between Wikistats1 metrics and Wikistats2 metrics (example: "active editors")
 * Provide the ability to query metrics per project family (*.wikipedia.org) in the Wikistats UI
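
To make the "per project family" idea concrete, the sketch below rolls up per-project counts into a family using a wildcard such as `*.wikipedia.org`. It is an illustration only, not the Wikistats or AQS implementation. The function name and the sample numbers are made up, and it assumes an additive metric (for example edits). Non-additive metrics such as unique devices cannot be summed this way and have to be computed per family.

```python
from fnmatch import fnmatchcase


def family_total(counts: dict, family_pattern: str) -> int:
    """Sum an additive metric over all project hostnames matching the pattern."""
    return sum(
        value
        for project, value in counts.items()
        if fnmatchcase(project, family_pattern)
    )


# Made-up sample data: metric value per project hostname.
sample = {
    "en.wikipedia.org": 10,
    "de.wikipedia.org": 5,
    "en.wiktionary.org": 3,
}

wikipedia_total = family_total(sample, "*.wikipedia.org")
```

With the sample data above, only the two `*.wikipedia.org` entries match the pattern, so the Wiktionary count is left out of the total.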

Status
October 19, 2018
 * New wikistats reports can be seen here: http://stats.wikimedia.org/v2

November 2018
 * Easier mapping for Wikistats1 and Wikistats2 metrics is still in a state.

December 2018
 * Working on the UI changes to display projects per family. We have made the AQS changes needed to serve unique devices data per family to the UI, and we are backfilling that data.

Outcome / Output
Wikimedia Cloud Services users have easy access to high quality analytics data to answer questions about content and contributors.
 * In this iteration (spanning several quarters) the Data Lake will include historical data about editing (revisions, pages, users) for all Wikimedia projects since the beginning. Data is optimized to be queried in an analytics-friendly way that allows for simple and fast queries.

Goal(s)

 * Presto cluster online and usable by Cloud (labs) users, with test data pushed from the analytics production infrastructure
 * Edit Data Lake Quality: Resolve known issues (ongoing goal) ❌

Status
October 19, 2018
 * This is yet to be started.

November 14, 2018
 * Presto setup on labs has started; we are discussing the flow of data with SRE.
 * We have had many problems with the MediaWiki imports we use to reconstruct history, and these will prevent us from meeting the goals of this program. The work being done in labs by the analytics, cloud and DBA teams to solve the issues we found can be tracked here:
 * https://phabricator.wikimedia.org/T210693
 * https://phabricator.wikimedia.org/T210749

December 2018
 * Discussed...



Outcome / Output
Foundation staff and community have better visual tools to access data about content, contributors and readers.
 * Users see improvements on data computing and data quality.

Goal(s)

 * STRETCH GOAL: POC of more efficient bot filtering on pageview data.

Status
December 2018
 * Finished the initial phase of the POC; running additional tests and writing up the results.

= Q3 Goals =




= Q4 Goals =
