Wikimedia Technology/Annual Plans/FY2019/TEC14: Smart Tools for Better Data

Program outline
'''Efforts in this program are centered around making our data as discoverable as it can be for Wikimedia communities, Foundation staff, and the world at large. This program is critical in that it provides the tools for WMF and communities to measure the success of their efforts.'''

Work in this program over the past year has been centered on the alpha launch of Wikistats 2, a new platform for exploring statistics about Wikimedia projects. This year we plan to continue that work and integrate Wikistats 2 with more data, per our community consultation. The project will add features to the Wikistats 2 user interface, such as localization, and will also make more data discoverable for machines via newer APIs. As part of the Wikistats 2 project we have developed the Data Lake, a denormalized dataset that is the best data we have had to date for answering questions about content and contributors. At present this data is available only within the WMF, and part of the work of this program will be making the Data Lake data available on our public cloud infrastructure for the community at large; the more accessible that data is, the more impact it can have.
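To illustrate why denormalization matters here, the sketch below builds a toy, in-memory stand-in for the Data Lake's history data. The table name and columns are illustrative simplifications, not the actual Data Lake schema; the point is that once edit events live in one wide table, a question about contributors becomes a single join-free query.

```python
import sqlite3

# Toy in-memory stand-in for denormalized edit-history data.
# Table and column names are hypothetical simplifications.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE mediawiki_history (
        wiki TEXT, event_entity TEXT, event_type TEXT,
        event_user TEXT, event_timestamp TEXT
    )
""")
conn.executemany(
    "INSERT INTO mediawiki_history VALUES (?, ?, ?, ?, ?)",
    [
        ("enwiki", "revision", "create", "Alice", "2018-01-01"),
        ("enwiki", "revision", "create", "Alice", "2018-01-02"),
        ("enwiki", "revision", "create", "Bob",   "2018-01-02"),
        ("enwiki", "page",     "create", "Bob",   "2018-01-03"),
    ],
)

# "How many edits did each user make?" is one simple aggregate —
# no joins across separate revision/page/user tables are needed.
rows = conn.execute("""
    SELECT event_user, COUNT(*) AS edits
    FROM mediawiki_history
    WHERE event_entity = 'revision' AND event_type = 'create'
    GROUP BY event_user
    ORDER BY edits DESC
""").fetchall()
print(rows)  # → [('Alice', 2), ('Bob', 1)]
```

The same shape of query, run against the real data, is what "analytics-friendly" means in practice for researchers and tool builders.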

On the quality front, we aim to improve our pageview data by removing non-human traffic (bots) using machine-learning techniques. While the objective of this program is clear, the computational needs may require expanding our capacity, so we include a significant research component to study this problem. We also aim to make the Wikipedia text corpus available for analytics purposes; this project will unblock several initiatives and help us tremendously to improve the quality of our data.
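As a rough sketch of the problem, the minimal rule-based filter below labels pageview requests as bot or human; it is the kind of heuristic baseline that a machine-learning classifier would replace. The patterns, the rate threshold, and the `classify_request` helper are all hypothetical — the real pipeline and its feature set are not specified here.

```python
import re

# Self-declared crawlers, per common user-agent conventions.
# Pattern is illustrative, not the production rule set.
BOT_UA_PATTERN = re.compile(r"(bot|crawler|spider|https?://)", re.IGNORECASE)

def classify_request(user_agent, requests_per_minute):
    """Label one pageview request as 'bot' or 'user'.

    Combines a user-agent check (self-identified crawlers) with a
    simple rate threshold standing in for behavioral features that
    an ML model would learn.
    """
    if BOT_UA_PATTERN.search(user_agent):
        return "bot"
    if requests_per_minute > 120:  # arbitrary illustrative threshold
        return "bot"
    return "user"

requests = [
    ("Mozilla/5.0 (Windows NT 10.0) Firefox/60.0", 3),
    ("Googlebot/2.1 (+http://www.google.com/bot.html)", 50),
    ("python-requests/2.18", 500),
]
labels = [classify_request(ua, rpm) for ua, rpm in requests]
print(labels)  # → ['user', 'bot', 'bot']
```

Heuristics like these catch only self-declared or crude bots; the research component above targets the harder case of automated traffic that mimics human behavior.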

A big part of this program's focus is on infrastructure and tools for better public data access; however, we also include significant improvements to internal WMF data tools, such as upgrades to our current internal Jupyter notebook infrastructure and tools for visualization of user data.

Teams contributing to the program
Analytics, Wikimedia Cloud Services

Annual Plan priorities
Primary Goal: 3. Knowledge as a Service - evolve our systems and structures

How does your program affect annual plan priority?
This program helps teams elsewhere in the Foundation measure the outcomes of their own programs.

Program Goal
We will maintain and increase public access to past, present, and real-time data for Wikimedia projects. We will provide the infrastructure to measure the impact and reach of projects and features for editors, communities, and the WMF.
 * Outcome 1
 * Wikimedia Cloud Services users have easy access to high quality analytics data to answer questions about content and contributors.
 * Output
 * Provision a cluster for public Data Lake access in Cloud Services that can be used as a Quarry backend. In this iteration the Data Lake will include historical editing data (revisions, pages, users) for all Wikimedia projects since their beginning. The data is stored in an analytics-friendly format that allows simple, fast queries.
 * Target
 * Number of Cloud Services Data Lake users, counting developers, researchers, and tools


 * Outcome 2
 * Increase stability and availability of user created datasets hosted in Cloud Services
 * Output
 * Build a replacement for the ToolsDB database servers
 * Output
 * Build a replacement for the shared Postgres database servers


 * Outcome 3
 * Foundation staff and community have better visual tools to access data about content, contributors and readers.
 * Output
 * Wikistats 2.0 - Users (and programmatic tools) have access to most of the reports the community consultation identified as important.
 * Output
 * Wikistats 2.0 - Production Release
 * Target
 * Unique Users/ Pageviews of Wikistats 2.0.
 * Output
 * Develop a pipeline for easy visual exploration of EventLogging data for WMF users.
 * Output
 * Support for more data sources and programming languages for WMF Jupyter Notebook users.
 * Target
 * Number of users of the new tools
 * Volume of data moved


 * Outcome 4
 * Users see improvements in data computation and data quality.
 * Output
 * More efficient Bot filtering on pageview data
 * Target
 * Tickets filed due to bot-traffic spikes in pageview data
 * Output
 * Make the Wikipedia text corpus available for analytics purposes


 * Outcome 5
 * We have scalable, performant and reliable software for data transport
 * Output
 * Software maintenance on analytics stack to maintain current level of service

Outcome 2 (continued)

 * Target
 * Legacy database servers retired


 * Measurement method
 * Legacy server hardware returned to SRE/DCOPS management by the end of Q3 FY2018/19.

Dependencies
n/a