Wikimedia Technology/Annual Plans/MTP Priority OKR: Data as a Service

From mediawiki.org

Data as a Service

FY21/22 MTP Priority OKR for Wikimedia Technology Department

Accountable: Tajh Taylor

OKR Overview

Wikimedia application data is easily discoverable and well-prepared to enable data-informed decision making, application development and research by the community and the Foundation.

Key Result 1
Establish organizational data management structure - Build a browseable, shareable data dictionary, and describe 25% of known data elements.

Key Result 2
Enable efficient program evaluation and decision support through three novel use cases. - Create a baseline selection of three use cases.

Key Result 3
Build machine learning services - Operationalize an ML governance strategy for the Foundation, and create ways to understand, evaluate, and provide feedback on ML models. Baseline to be determined.


Objective Rationale

Today, there are many barriers to the use of Wikimedia application data for analysis, decision-making, intelligence, and applied data science. These include lack of shared information describing the data, varying methods of access and access control, distributed and unclear data stewardship, technical and architectural impedance mismatches, unclear responsibility for data policy enforcement, etc. Although we have several teams around the organization performing data analysis and using data for a variety of purposes, their capacity is limited by these barriers.

Our purpose is to address these problems at a scope and scale that crosses organizational boundaries, to establish a home for clear answers to questions about data access, accountability, and organizational policy, and to dissolve the barriers, enabling and empowering the data capabilities of the entire community (staff, volunteers, and external users of data).

By establishing the data governance capabilities described in Key Result 1, we provide the organizational structure to manage data at the Foundation level. In fulfilling the use cases described in Key Result 2, we demonstrate the ability to deliver capabilities that have previously been stymied by the barriers described above. And in Key Result 3, we transform our machine learning capabilities to be modern, standardized, flexible, scalable, and transparent.

To fulfill these goals, we must create a data strategy that clearly articulates how enhancing the data management capabilities of the Foundation enables us to better support Movement and Foundation strategy and to better measure our own performance and capabilities in the intersection of systems, programs, and people.


Key Result 1: Organizational Data Management and Data Catalog

Establish organizational data management structure - Build a browseable, shareable data dictionary, and describe 25% of known data elements.

Accountability for this Key Result is assigned to Olja Dimitrijevic, Director of Data Engineering


Intent and Desired Outcomes
Data management is an organizational-scale discipline and approach that:

  • Recognizes the high value of reliable, well-maintained, and easy-to-use data to inform our strategic mission and that of our community
  • Describes the access to and the use of data for the Foundation and community as a set of services for stakeholders
  • Provides systematic approaches to data discovery, assurance, access control, and policy application

The strategic value of data is evident in the frequent questions that we are asked and the informational requests that we receive for data. We can currently answer only a fraction of the questions and fulfill only some of the requests even when the requisite data is collected and available.

It is important to recognize that we are not starting from a blank slate. Some teams around the Foundation manage their own data sets, with varying tools and degrees of sophistication. This independent approach has the advantage of liberating teams to determine their own data destinies, but presents several issues: duplication of effort as teams determine and implement Foundation policy for things like sensitive data access control; derivable insight is somewhat limited to what can be learned from within particular sets of data because combining data from different sources and teams is cumbersome and difficult; and not every team has the same degree of technical data expertise on-hand to fulfill needs.

To address these issues and to more fully enable the distillation of value from Foundation & movement data, we intend to establish Foundation-wide data management practice, with the following outcomes:

  • A data governance council, composed of stakeholder representatives from across the Foundation, and empowered by senior management to make decisions and determinations regarding data standards and practices
  • A Foundation data strategy describing our principles and objectives regarding the use and development of Foundation data and access
  • A data catalog describing the Foundation’s data (encoding, format, location, character, provenance, applicable policy, accessibility, stewardship), available online to Foundation and community users
  • An internal data services team that will support the embedded domain analysts, as well as teams that do not have expertise in data preparation and delivery
  • Recognition and elevation of data services and operations as products that are publicly accessible and available, with establishment of relevant SLOs and other treatment

Clearly, not all of these outcomes are achievable within 12 months, and they are not all presently included in this fiscal year’s annual planning. They represent the long term goals, and will be reflected in the data strategy we create.
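
As an illustration of how a catalog entry and the "describe 25% of known data elements" target could be represented, here is a minimal sketch. It is not the Foundation's design: the class name, the field names (taken from the catalog attributes listed above), the sample values, and the rule that an element counts as "described" only when every attribute is filled in are all assumptions made for this example.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class CatalogEntry:
    """One data element in a hypothetical catalog."""
    name: str
    encoding: Optional[str] = None
    format: Optional[str] = None
    location: Optional[str] = None
    character: Optional[str] = None
    provenance: Optional[str] = None
    applicable_policy: Optional[str] = None
    accessibility: Optional[str] = None
    steward: Optional[str] = None

    def is_described(self) -> bool:
        # Assumption: "described" means every attribute has a value.
        return all(getattr(self, f.name) is not None for f in fields(self))


def described_share(entries: list[CatalogEntry]) -> float:
    """Return the fraction of known data elements that are fully described."""
    if not entries:
        return 0.0
    return sum(e.is_described() for e in entries) / len(entries)


# Invented entries for illustration only.
entries = [
    CatalogEntry("example_table.field_a", "UTF-8", "Parquet", "data lake", "event",
                 "application logs", "retention policy", "internal", "Data Engineering"),
    CatalogEntry("example_table.field_b"),
    CatalogEntry("example_table.field_c", format="Parquet"),
    CatalogEntry("example_table.field_d"),
]
print(f"{described_share(entries):.0%} of known elements described")
```

A stricter or looser definition of "described" (for example, requiring only a steward and a policy) would change the measured share, so the definition itself is something the governance council would need to agree on.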


Definitions

  • Wikimedia Application data – Structured data elements and records that are generated by the operation of our publicly accessible systems. This includes product metrics and production user data collected by and generated from Wikimedia properties, but does not include unstructured wiki content data, survey data, or third party data, which may be used with Wikimedia application data to support analyses or other work.
  • Metadata – Data about data. E.g. location, modification time, ownership, format, access permissions, privacy sensitivity, etc.
  • Discoverable – New and experienced users of WM data are able to find new data elements relevant to their use cases.
  • Accessible – The experience of locating and retrieving relevant data sets or viewing relevant reporting is self-directed and easy.
  • Well-defined – Data elements and records each have definitions that explain their provenance, appropriate and inappropriate uses, formats, constraints, expected ranges/distributions, and restrictions.
  • Navigable – The relationships between data elements are documented and defined.
  • Sourceable - It is easy to load data into systems for serving production features and analytics
  • Integrable – It is easy to use data sets from different sources in combination with one another.
  • Prepared – The data is in or close to the format in which the user wants to consume it.


Related Quarterly OKRs

  • OKRs to be drafted


Activities & Deliverables
We expect that we will undertake the following activities in fulfillment of our objective:

1. Data Strategy
  • Organize a working group of staff participants to define the strategy
  • Identify and describe the high-level mission principles and objectives that will guide the development of a data strategy
  • Hold regular working group meetings to determine the scope and content of the strategy
  • Identify and solicit input from community members with high interest in data access, including tool developers, researchers, and current users of bulk data access
  • Ratify a first draft of the data strategy with signoff from stakeholders
2. Data Governance Assessment and Plan
  • Standard templates/formats to collect information about data sources and data policies
  • Inventory data sources (leverage use cases)
    • Where data comes from
    • Data stewards - Who may own it (if owned)
    • Maintenance policies
  • Collation of data policies
  • Organize investigation and insights from use cases to begin building out:
    • Data catalog
    • Data dictionary
  • SWOT analysis & preliminary findings
  • Present findings to the organization
  • Validate & decide on focus areas
3. Data Governance Implementation
  • Soliciting participant commitments from all relevant departments and stakeholders for the data governance council
  • Build a shared understanding of data principles, concepts, and best practices
    • Targeting the C team
    • Targeting data council members
  • Establish an initial (non-comprehensive) charter to define the work of the data governance council for the current FY
  • Determine the high-level requirements for the data catalog
  • Establish comprehensive document of data sources across the organizational silos
  • Identify and train data stewards
  • Iterate:
    • Determine the scope of data sets and elements to enter into governance
    • Collect and synthesize current and new governing policies and practices in a single place
    • Write guidance for the access and use of the scoped data elements
    • Solicit feedback on the guidance at the Foundation and with the movement (?)
    • Release guidance
  • Repeat
4. Data Catalog
  • Determine scope and establish requirements for gathering and publishing metadata information
  • Engineering design and implementation of data catalog
  • Data steward assignments and dataset metadata review
  • Iterate with release and community input
  • Determine policies to show and hide metadata


Resourcing

Activity Responsible Accountable Consulted Informed
Data Source Data Engineering Data PM Data Executives
Data Strategy Document Tajh Taylor, VP of Data Science & Engineering

Olja Dimitrijevic, Director of Data Engineering Desiree Abad, Director of Product Management Sumeet Bodington, Director of Global Data Insights Chris Albon, Director of Machine Learning Leila Zia, Director of Research, Guillaume Lederrey, Engineering Manager for Search Kate Zimmerman, Director of Data Science

Tajh Taylor Technology leadership, Product leadership, Advancement, Community Investment, Legal, Trust & Safety, Site Reliability Engineering, Administration VP Cohort, Foundation, Community
Governance Council Formation Tajh Taylor, Olja Dimitrijevic, John Bennett / Director of Security, Desiree Abad, Kate Zimmerman / Director of Data Science, other participants in Data Engineering Tajh Taylor C-Team, VP Cohort, other departmental leaders
Data Governance Assessment and Plan Data Engineering Team (Dan Andreescu, Olja Dimitrijevic), Security Team, Legal Team Olja Dimitrijevic Data Governance Council, T&S, Data Persistence
Inventory (Dan A., Olja D.), Global Data Insights, Data PM (Desiree as stand-in) Data PM (Desiree as stand-in) Governance Council, Product Analytics, Data Persistence, GD Insights & ops FR-Tech SLO working group
Data Catalog Requirements Data PM (Desiree as stand-in), Data Engineering, (Dan Andreescu, Olja Dimitrijevic, Desiree Abad) Data PM (Desiree as stand-in) Data Governance Council, Product Analytics, Security, GD Insights & Ops, Designers
Data Catalog Implementation Data Engineering Team (Dan Andreescu, Olja Dimitrijevic) Olja Dimitrijevic Product Analytics, Data Engineering, Product Management, Research, Machine Learning


Key Result 2: New Efficiencies in Program Evaluation

Enable efficient program evaluation and decision support through three novel use cases. - Create a baseline selection of three use cases

Accountability for this Key Result is assigned to Desiree Abad, Director of Product Management


Intent and Desired Outcomes
Over time, the Foundation has invested heavily in various departments, teams, tools, and processes that can collect and analyze data to distill business insights. While we have analytics capabilities across the organization, these capabilities are often siloed by team and/or dataset, making it difficult to answer questions, distill insights, and derive intelligence across the Foundation. Key challenges include:

  • Data is often PII sensitive and must have adequate security and privacy controls to ensure that data is sanitized, access is controlled, and that data is handled as per our policies.
  • Data is siloed across the organization, being stored in different locations with different policies and security.
  • Data is difficult to translate within and across departments due to misaligned definitions and interpretations and the lack of a common language.
  • Data is sourced from a wide variety of product platforms, applications, tools, and surveys, requiring customized ingestion/pipeline solutions.
  • Data analysis tools and skill sets vary across individuals and teams.

In order to address these challenges, we will narrow down the scope of these problems in the context of three novel use cases that specifically target challenging areas with cross-functional involvement:

1. Grant & Grantee Reporting
  • Context:
    • The Foundation uses an application called Fluxx to manage grants and interact with Grantees. As such, this data will need to be collected from Fluxx, existing sources will need to be switched, and additional analytics will be required.
  • Goals:
    • Ingest grant-related data from Fluxx and reflect it on wiki
    • Combine site-generated data with grant data
    • Identify and generate grantee impact metrics
  • Challenges:
    • Custom data ingestion will be required for Fluxx
    • A data privacy & security strategy must be identified and agreed upon across teams to support co-location of data, serving on wiki, and/or dataset blending for the purpose of impact metrics.
    • Data analysis skills, especially across different analytics datasets, vary.
2. SpamBots impact on content, admins, and users
  • Context:
    • We currently hypothesize that, because of inadequate captchas, spambots are getting through, which results in spammed content and in administrators spending time removing these accounts and reverting spambot changes.
  • Goals:
    • Understand the impact of spambots on our content and users
    • Understand whether captchas are successfully blocking spambot accounts
    • Identify root causes of spam
  • Challenges:
    • Inventory the observability and product data we have and determine any potential gaps
    • Examine how we can isolate specific observability data and serve that data so that it may be joined with product analytics data
    • Support additional ETL, as needed, in order to join and analyze different datasets
3. Diversity, Equity, and Inclusion (DEI) Reporting
  • Context:
    • Global Data Insights (GDI) will launch the Foundation's first public-facing dashboard to enable movement organizers and partners to map diversity, equity & inclusion among movement-wide programs and spaces.
  • Goals:
    • Understand what data is available and what data is missing that we would need to collect for each lens
    • Establish public-facing dashboards, built on any existing data, that can be used to analyze and uncover insights and to glean intelligence that informs data-driven decisions in the DEI space
  • Challenges:
    • Legal barriers to data collection and use
    • Data can be over- or under-counted


Other Use Cases


4. VisualEditor (VE) Load Time
  • Why this was not selected: While a better understanding of load time would be valuable for product and engineering decisions, the specific case of VE was tabled for a variety of reasons, and the Editing team is focused on developing Talk Pages this year
  • Context:
    • User adoption of the VE on mobile was slower than expected and didn't meet expectations in certain geographic areas. One hypothesis was that VE caused load time to be too long. https://phabricator.wikimedia.org/T221198
  • Goals:
    • Determine whether we collect page load time and related data at the required level of granularity and with the appropriate additional metadata to answer the questions in the hypothesis
    • Join the load time data with the product data in a single dashboard to drive product intelligence.
    • Understand what level of performance is required and distill SLOs.
  • Challenges:
    • Inventory the observability and performance data we have and determine any potential gaps.
    • Examine how we can isolate specific observability and performance data and serve that data so that it may be joined with product analytics data.
    • Support additional ETL, as needed, in order to join and analyze different datasets.
5. A/B testing
  • Why this was not selected: We’re not ready for this yet; it has more involved architecture and stack needs.


Definitions

  • ETL – Extract, Transform, and Load. A common approach to retrieving and transforming transactional application data for analytical use cases.
  • PII – Personally-identifying information. Data elements such as user name or IP address, that either alone or in combination with other data elements can be used to uniquely identify a person.
  • Observability Data - Data collected for the primary purpose of server administration, but which may also have applications for data insights, such as request rate and server-side latency.
  • Data Blending - A technique to combine data from multiple data sources in data analytics, reporting, and/or visualizations (see "Data blending" on Wikipedia)
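
To make the ETL, PII, and data blending definitions above concrete, the following minimal sketch transforms two small datasets so that the raw user name is replaced by a salted hash, then blends them on that pseudonymous key. The record layouts, field names, salt handling, and the choice of SHA-256 are assumptions for illustration, not a description of any existing Wikimedia pipeline.

```python
import hashlib
from collections import defaultdict

# Invented sample records; field names are assumptions for this sketch.
edit_events = [
    {"user_name": "ExampleUserA", "edits": 12},
    {"user_name": "ExampleUserB", "edits": 3},
    {"user_name": "ExampleUserA", "edits": 5},
]
grant_records = [
    {"user_name": "ExampleUserA", "grant_id": "G-1"},
]


def pseudonymize(value: str, salt: str) -> str:
    """Replace a PII value with a salted SHA-256 digest so records can still be joined."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def transform(rows, salt):
    """Transform step: drop the raw user name and keep only a pseudonymous key."""
    out = []
    for row in rows:
        clean = {k: v for k, v in row.items() if k != "user_name"}
        clean["user_key"] = pseudonymize(row["user_name"], salt)
        out.append(clean)
    return out


def blend(events, grants):
    """Blend step: total edits per grant, joined on the pseudonymous key."""
    grant_by_user = {g["user_key"]: g["grant_id"] for g in grants}
    totals = defaultdict(int)
    for event in events:
        grant_id = grant_by_user.get(event["user_key"])
        if grant_id is not None:
            totals[grant_id] += event["edits"]
    return dict(totals)


SALT = "example-salt"  # placeholder; a real salt would be a managed secret
events = transform(edit_events, SALT)
grants = transform(grant_records, SALT)
print(blend(events, grants))  # {'G-1': 17}
```

Salted hashing is pseudonymization rather than anonymization: anyone holding the salt can recompute the keys, so the blended output would still need the access and retention controls discussed in this Key Result.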


Activities & Deliverables

  • Data Security - Ensure data is stored, transformed, and accessed in a way that doesn’t compromise security and privacy.
    • Supporting individuals’ privacy needs
    • Blending and/or Co-location of data - establish a framework for how to blend and/or co-locate data
    • Access and usage controls
      • Authentication
        • Provision an authentication system at the required level of security and scale to facilitate more broad user access to data analysis tools
      • Authorization
        • Who can access what dashboards and data sets
        • Who can access data for exploratory use-cases
    • Address retention policies and practices
    • Auditability of access as well as access control changes
  • Define & Develop Data Ingestion Processes
    • Establish ingestion methods for:
      • Fluxx
      • Bespoke datasets (ex: survey data)
    • Streamline data ingestion for:
      • Product instrumentation data (metrics platform)
    • Determine a methodology to support blending or cross-dataset analysis
      • Data destination
  • Discovery & Analytics
    • Do product instrumentation dives to look at what data can be leveraged
    • Look across datasets
  • Data Analysis Support
    • Uncover and address any unmet technical requirements for data analysis engines and analytical data storage
    • Address scale and performance issues in data reporting tools; provision or improve an existing shared platform for data reporting
  • Data Dictionaries
    • For each use case define the fields, definitions, calculation methodology, etc. in a standardized format
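
The authorization and auditability items under Data Security above can be sketched as follows. This is only one possible shape, under stated assumptions: the `AccessPolicy` class, the group-based model, and the dataset and user names are hypothetical, and authentication is assumed to be handled by a separate system.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class AccessPolicy:
    """Hypothetical mapping from dataset name to the groups allowed to read it."""
    grants: dict[str, set[str]] = field(default_factory=dict)
    audit_log: list[dict] = field(default_factory=list)

    def _record(self, action: str, **details) -> None:
        # Both access checks and access control changes are appended to the log.
        self.audit_log.append(
            {"time": datetime.now(timezone.utc).isoformat(), "action": action, **details}
        )

    def allow(self, dataset: str, group: str, changed_by: str) -> None:
        """Grant a group read access to a dataset and record who made the change."""
        self.grants.setdefault(dataset, set()).add(group)
        self._record("grant", dataset=dataset, group=group, changed_by=changed_by)

    def can_read(self, user: str, user_groups: set[str], dataset: str) -> bool:
        """Return True if any of the user's groups may read the dataset."""
        allowed = bool(self.grants.get(dataset, set()) & user_groups)
        self._record("read_check", dataset=dataset, user=user, allowed=allowed)
        return allowed


policy = AccessPolicy()
policy.allow("grant_metrics", "analysts", changed_by="admin_example")
print(policy.can_read("user_a", {"analysts"}, "grant_metrics"))    # True
print(policy.can_read("user_b", {"volunteers"}, "grant_metrics"))  # False
print([entry["action"] for entry in policy.audit_log])
```

Keeping policy changes and access checks in the same log is a simplification; a production system would more likely write them to separate, tamper-resistant stores.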


Related Quarterly OKRs

  • OKRs to be drafted


Resourcing

Activity Decision-Makers Responsible Accountable Consulted Informed
Discovery & Analytics: Product Instrumentation Kate Zimmerman, Mikhail Popov Product Analytics (Maya Kampurath, Irene Florez, and analysts working with consulted Product Teams) Kate Zimmerman Product Teams
Discovery & Analytics: Grants Data & Metrics Kassia Echevarri-Queen Community Investment Kassia Echevarri-Queen Product Analytics, Ilana Fried, Irene Florez
Discovery & Analytics: DEI Reporting Sumeet Bodington Global Data Insights Sumeet Bodington Product Analytics
Technology Implementation: Analytics Stack Olja Dimitrijevic, Data Engineering Data Engineering Olja Dimitrijevic Product Analytics
Technology Implementation: Metrics Platform Jason Linehan, Analytics Data PM, Desiree Abad Analytics Data PM (Desiree back-up) Desiree Abad Product Analytics, Data Engineering Senior Leadership, Product
Data Privacy & Security John Bennett Data Engineering John Bennett Security, Privacy, Legal, Product Analytics Senior Leadership, Product
Requirements, Roadmaps, & Product Management Analytics Data PM, Implementation Teams Platform Product Management Desiree Abad Product Analytics


Key Result 3: Build Machine Learning Services

Build machine learning services - Operationalize an ML governance strategy for the Foundation, and create ways to understand, evaluate, and provide feedback on ML models. Baseline to be determined.

Accountability for this Key Result is assigned to Chris Albon, Director of Machine Learning


Intent and Desired Outcomes
Over the last five years, the Wikimedia Foundation has proven that machine learning can add meaningful value to the experiences of both readers and editors. However, despite these successes, there are areas for improvement around how the Foundation conducts machine learning, including:

  • Models are trained using a framework that is difficult to maintain and limited in the variety of models able to be created.
  • Models are served using an aging infrastructure requiring users to be mindful of their own usage in order to prevent system failures.
  • There is no formal review process for evaluating models hosted by the Foundation.
  • Models hosted by the Foundation are largely opaque to the communities impacted.

In this Key Result, we have an opportunity to elevate machine learning at the Foundation to an example of best practices for applied ethical machine learning, while at the same time strengthening the technical infrastructure and expanding its features.

To accomplish this, the desired outcome for the fiscal year is to create a new modern training, serving, and management infrastructure incorporating the best practices in MLOps and used to host a wide variety of existing and new machine learning models, all of which are transparent and accessible to the public and governed by a well-developed Wikimedia machine learning strategy.


Activities & Deliverables

1. Ethical Machine Learning Governance
  • Publish data and model cards
  • Draft machine learning governance strategy
2. New Model Deployment
  • Launch Lift Wing model serving framework
  • Migrate ORES/RevScoring models to Lift Wing
  • Deploy new models to Lift Wing
  • Deprecate the ORES infrastructure
3. New Model Training And Management Infrastructure
  • Launch minimum viable product of Train Wing model training framework
  • Launch minimum viable product feature store


Related Quarterly OKRs

1. All machine learning models hosted by the Foundation are managed by an Ethical Machine Learning Governance Strategy
  • Q2 - 50% of machine learning models hosted by the Foundation have an accompanying model card
  • Q2 - Draft a proposed ML governance strategy
  • Q4 (Stretch KR) - Operationalize final governance strategy
2. Machine learning models hosted by the Foundation are easy to train, deploy, and manage at scale
  • Q2 - A trained model can be loaded, deployed, and serve API requests in less than four hours in a repeatable process
  • Q4 - Wikimedia hosts and serves 100 machine learning models on Lift Wing
  • Q4 (Stretch KR) - One machine learning model is automatically retrained and deployed nightly in a repeatable process
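
The Q2 target of 50% of hosted models having a model card could be tracked with something like the sketch below. The `ModelCard` fields, the model names, and the sample text are assumptions for illustration; the source does not specify what a Foundation model card contains.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelCard:
    """Minimal, hypothetical model card; a real card would carry more detail."""
    intended_use: str
    training_data: str
    known_limitations: str
    feedback_contact: str


@dataclass
class HostedModel:
    name: str
    card: Optional[ModelCard] = None


def card_coverage(models: list[HostedModel]) -> float:
    """Return the fraction of hosted models that have an accompanying model card."""
    if not models:
        return 0.0
    return sum(m.card is not None for m in models) / len(models)


# Invented models for illustration only.
models = [
    HostedModel(
        "example-model-a",
        ModelCard(
            intended_use="Flag edits for human review",
            training_data="Labeled sample of past edits",
            known_limitations="Lower accuracy on small wikis",
            feedback_contact="Project talk page",
        ),
    ),
    HostedModel("example-model-b"),
]
coverage = card_coverage(models)
print(f"model card coverage: {coverage:.0%}; Q2 target met: {coverage >= 0.5}")
```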


Resourcing

Activity Responsible Accountable Consulted Informed
Publish data and model cards Machine Learning Chris Albon Product Management Design Research
Draft machine learning governance strategy Machine Learning Chris Albon Movement Comms, Product Management, Data Engineering Research
Internal Launch Lift Wing Machine Learning, Data Center Ops Chris Albon SRE teams, Product Management Research
Migrate ORES/RevScoring models to Lift Wing Machine Learning Chris Albon SRE teams, Product Management
Deploy new models to Lift Wing Machine Learning Chris Albon Product Management, Research
Deprecate ORES Machine Learning, Data Center Ops Chris Albon Movement Comms, Product Management, Data Engineering
Launch Train Wing MVP Machine Learning, Data Center Ops Chris Albon Research, Product Management
Launch feature store Machine Learning Chris Albon Data Engineering, Research, Product Management