Wikimedia Technology/Annual Plans/FY2019/TEC14: Smart Tools for Better Data

From mediawiki.org

Program outline

Efforts in this program are centered around making our data as discoverable as it can be for Wikimedia communities, Foundation staff, and the world at large. This program is critical in that it provides the tools for WMF and communities to measure the success of their efforts.

Work in this program over the past year has been centered on the Alpha launch of Wikistats 2, a new platform to explore statistics about Wikimedia projects. This year we plan to continue working on this project and integrate Wikistats 2 with more data, per our community consultation. This project will add features to the Wikistats 2 user interface, such as localization, and will also make more data discoverable for machines via newer APIs. As part of the Wikistats 2 project we have developed the Data Lake, a denormalized dataset that is the best data we have had to date to answer questions about content and contributors. At this time this data is only available within the WMF, and part of the work of this program will be making the Data Lake data available on our public cloud infrastructure for our community at large; the more accessible that data is, the more impact it can have.

On the quality front, we aim to improve our pageview data by removing non-human traffic (bots) using machine learning techniques. While the objective of this work is clear, its computational needs might exceed our capacity, so we include a significant research component to study this problem. We also aim to make the Wikipedia text corpus available for analytics purposes, a project that will unblock several initiatives and help us tremendously in improving the quality of our data.

A big part of the focus of this program is on infrastructure and tools for better public data access; however, we also include significant improvements to internal WMF data tools, such as upgrades to our current internal Jupyter notebooks infrastructure and tools for visualization of user data.

Teams contributing to the program

Analytics, Wikimedia Cloud Services

Annual Plan priorities

Primary Goal: 3. Knowledge as a Service - evolve our systems and structures

How does your program affect annual plan priority?

This program helps teams elsewhere in the Foundation measure the outcomes of their own efforts.

Program Goal

We will maintain and increase public access to past, present, and real-time data for Wikimedia projects. We will provide the infrastructure to measure the impact and reach of projects and features for editors, communities, and the WMF.

Outcome 1
Wikimedia Cloud Services users have easy access to high quality analytics data to answer questions about content and contributors.
Output
Provision a cluster for public Data Lake access in Cloud Services that can be used as a Quarry backend. In this iteration the Data Lake will include historical data about editing (revisions, pages, users) for all Wikimedia projects since the beginning. Data is optimized to be queried in an analytics-friendly way that allows for simple and fast queries.
Target
Number of users of the Cloud Services Data Lake, counted as developers, researchers, and tools
Outcome 2
Increase stability and availability of user-created datasets hosted in Cloud Services
Output
Build a replacement for ToolsDB database servers
Output
Build a replacement for shared Postgres database servers
Outcome 3
Foundation staff and community have better visual tools to access data about content, contributors and readers.
Output
Wikistats 2.0 - Users (and programmatic tools) have access to most of the reports that the community consultation identified as important.
Output
Wikistats 2.0 - Production Release
Target
Unique users / pageviews of Wikistats 2.0.
Output
Develop a pipeline for easy visual exploration of EventLogging data for WMF users.
Output
Support for more data sources and programming languages for WMF Jupyter Notebook users.
Target
Users of newer tools
Data moved
Outcome 4
Users see improvements in data computing and data quality.
Output
More efficient bot filtering on pageview data
Target
Tickets filed due to bot spikes
Output
Make the Wikipedia text corpus available for analytics purposes


Outcome 5
We have scalable, performant, and reliable software for data transport
Output
Software maintenance on the analytics stack to maintain the current level of service

Resources

People FY2017–18 FY2018–19
Analytics
  • Software Engineer
  • Software Engineer
  • Software Engineer
  • Software Engineer
  • Software Engineer
  • Software Engineer
  • Software Engineer
  • 0.5 ✕ Software Engineer (reduction)
WMCS
  • 0.25 ✕ Product Manager
  • Operations Engineer (reallocated)
  • 0.33 ✕ Product Manager (reproportioned)
Research
  • 0.5 ✕ Senior Data Analyst
  • 0.5 ✕ Senior Data Analyst (no change)
  • 0.25 ✕ Researcher (reallocated)
CapEx
  • None
  • None
Travel & Other
  • None
  • None

Targets

Outcome 2

Target
Legacy database servers retired
Measurement method
  1. Legacy server hardware returned to SRE/DCOPS management by the end of Q3 FY2018/19.

Dependencies

n/a