Wikimedia Technology/Goals/2019-20 Q4

Technology Department Team Goals and Status for Q4 FY19/20 in support of the Medium Term Plan (MTP) Priorities and Annual Plan for FY19/20



Analytics
Team Manager: Nuria Ruiz
 * MTP-Y1: Platform Evolution Build a reliable, scalable, and comprehensive platform for creating services, tools and user facing features that produce and consume event data
 * 5% of production and analytics events have been migrated to the new event platform.
 * By June 2020 all production and consumption of new event data originated in our websites is flowing through this new...
 * Build a reliable, scalable, and comprehensive platform for building services, tools and user facing features that...
 * ✅ Client Error Logging is deployed to 1 wiki and error stats are displayed on our operation dashboards.
 * ✅ Enable better scaling of our production infrastructure by moving our events to a standardized, modern event system


 * MTP-Y1: Platform Evolution Reduce the complexity of workflows for building, training and deploying machine learning models to enable ML-aided product augmentation and research
 * ✅ Deploy a fully open source solution for GPU-enhanced computation infrastructure that improves training times
 * By the end of fiscal year present a design on how to speed up our model training by providing models


 * Modern Event Platform Build a reliable, scalable, and comprehensive platform for creating services, tools and user facing features that produce and consume event data
 * ✅ Working Stream Config Service and client side library for sending events to EventGate in MediaWiki Vagrant
 * One new event stream created and deployed by Product by the end of Q3 2019/2020
 * 2 existing EventLogging analytics streams migrated to Modern Event Platform by end of Q4 2019/2020
 * Resolve Kafka Connect HDFS Licensing issue and decide if we will use Kafka Connect task T223626
 * One new automated dashboard created and deployed by Product and Analytics engineering by end of Q4 2019/2020
 * Deploy a new event stream for analytics using the new Event Platform infrastructure
 * Vertical MEP from web to backend: Migrate SearchSatisfaction EventLogging event stream to Event Platform


 * Smart Tools for Better Data Make it easier to understand the history of all Wikimedia projects
 * Wikistats UI is localized for languages and number formatting
 * Define computation for "Active Editors per project family"
 * Wikistats UI is more flexible when exploring metrics: allow splitting and filtering simultaneously
 * Add "Active Editors per project family" as a metric to Wikistats UI
 * Design (together with core platform team) an alternative architecture for historic data endpoints used by iOS application
 * Define computation for active editors per project family
 * Implement foundations for Newpyter (Hadoop-hosted distributed Jupyter notebook setup)
 * Enhancements to Wikistats UI so you can split/filter simultaneously


 * Smart Tools for Better Data Increase Data Quality, Privacy and Security
 * Bots: Label high volume bot spikes in pageview data as automated traffic


 * Core Operational Excellence. Increase Resilience of Systems
 * Create a MySQL replica for backups of all MySQL instances we use, such as those for Oozie or Superset
 * Airflow as an easier job scheduling alternative, PoC for refine workflow
 * Unify stats and notebook clusters. Decommission notebook hosts and make the puppet role of stat1007 match the other stats boxes


 * Cassandra3 migration plan proposal



Fundraising Tech
Team Manager: Erika Bjune
 * Key Deliverable

Dependencies on:

 Status 
 * April 2020 status:
 * May 2020 status:
 * June 2020 status:



Engineering Productivity Team
Team Manager: Greg Grossmeier
 * MTP-Y1: Platform Evolution Maintain and evolve developer tooling, testing infrastructure, validation environments, deployment infrastructure, and supporting processes
 * Release Engineering and SRE jointly create a plan of action to implement a Deployment Pipeline compliant...
 * Developers have a consistent and dependable deployment service.
 * Improve developer productivity by automating manual steps out of the model development and deployment pipeline
 * Build and support a fully automated and continuous Code Health and Deployment Infrastructure
 * Maintain and improve the Continuous Integration and Testing services
 * Reduce infrastructure gaps in the areas of backups & disaster preparedness, observability, infrastructure...
 * Service owners, deployers and other stakeholders are able to develop, test, deploy, observe and maintain services...
 * The organization is able to make data-driven decisions about tests, testing infrastructure, and deployments.
 * ✅ Ensure WDQS Performance stability over time
 * MTP-Y1: Platform Evolution We will improve developer efficiency for all developers, new and experienced, internal and external
 * Create a cohesive documentation portal to onboard new developers to our API
 * Services are able to intercommunicate in a reliable, secure and standardised way in our infrastructure
 * Improve all baseline developer efficiency metrics by 10% by the end of the year.
 * Production-like containers
 * Create an easy to use REST API with the basic functionality needed to interact with our platform
 * ✅ Determine a baseline set of metrics to assess internal developer efficiency, including time to first merge (new...
 * We will improve Cycle Time for Internal Experienced Developers by 10% year over year.
 * Local development container system
 * ✅ We will plan and execute this year’s Technical Conference and produce a prioritized list of future work outcomes as...
 * Successfully run Wikimedia’s technical internship and outreach programs
 * ✅ Find committed owners for all prioritized outcomes of the 2019 Technical Conference in time for annual planning.
 * Increase visibility & knowledge of technical contributions, services and consumers across the Wikimedia ecosystem
 * Develop, test and evaluate different formats to build technical capacity in smaller wikis
 * Align developer services with SRE best practices
 * Improve Resilience of Wikimedia's Gerrit Install
 * ✅ Deploy developer services with Scap3
 * ✅ Improve availability and redundancy of Wikimedia's Phabricator
 * Build and support a fully automated and continuous Code Health and Deployment Infrastructure
 * ✅ Work across teams to ensure that 5% of projects are moving through the continuous delivery pipeline
 * Automate MediaWiki Train
 * Provide a new CI/CD platform to Wikimedia Technology and Product teams by the end of FY19-20
 * Continuously deploy at least one project by the end of fiscal year 2020
 * Static variant configuration for production
 * Improve ability for developers to share changes
 * Improve all baseline developer efficiency metrics by 10% by the end of the year
 * Develop the 2020 Developer Satisfaction Survey, gather, and summarize results, giving us another yearly data point...
 * Extend and improve Phabricator to enhance engineering productivity.
 * The organization is able to make data-driven decisions about tests, testing infrastructure, and deployments
 * Provide infrastructure to store data and metrics in support of decision making with regards to tests, testing...

Performance
Team Manager: Gilles Dubuc
 * MTP-Y1: Platform Evolution Create a culture of performance across all Wikimedia engineering teams by the end of the fiscal year, to reduce the frequency of performance regressions
 * ✅ By early November, smoothly and successfully give managerial duties of the Performance team to Gilles
 * Foster a culture of performance
 * Expand the coverage of performance monitoring
 * Improve performance
 * Collaborate with other teams via performance reviews
 * Performance review of the client-side-only version of the Graph extension
 * Performance review of the DiscussionTools extension
 * Performance review of Wikidata Bridge
 * Performance review of Improved Commons search
 * ✅ Performance review of the KaiOS app
 * Performance review of the GrowthExperiments extension
 * Performance review of Push Notifications infrastructure
 * Expand the coverage of performance monitoring
 * ArcLamp flame graphs stored in Swift, retained at least 2 years
 * Expand the coverage of synthetic performance monitoring
 * ✅ Collect and graph First Input Delay
 * Implement alerts for synthetic search
 * Add operational monitoring for 100% of the performance-team services
 * ✅ Expand the coverage of backend performance testing
 * Organise and oversee implementation of First Paint on Safari
 * Test out small variance (2%) in latency for our test
 * Remove XHGui dependency on MongoDB
 * Be able to run WPT & WPR on our own Kubernetes cluster.
 * Simulate slow connections on real devices.
 * Foster a culture of performance
 * ✅ Provide performance expertise to FAWG outcome (Sept '19-June '20)
 * ✅ Make it easier for engineers to find performance issues during development (Fresnel)
 * ✅ Organise and run the Web Performance devroom at FOSDEM 2020
 * Publish 8 blog posts about performance
 * ✅ Publish blog post about Wikipedia JS init improvement
 * Publish blog post about ResourceLoader feature test
 * ✅ Publish blog post about organising and running a FOSDEM devroom
 * ✅ Publish blog post on web performance calendar about Long tasks and FID
 * ✅ Publish blog post about WikimediaDebug v2
 * Blog post (with video?) about the WebPageTest and WebPageReplay setup
 * ✅ Publish blog post on web performance calendar about RUM insights
 * Publish blog post about CPU microbenchmark
 * ✅ Document and evangelize synthetic testing, User Timing and Element Timing
 * Collaborate with other teams via performance reviews
 * Organise and oversee org-wide frontend web performance training
 * Improve performance
 * Improve software consistency (speed/success) in handling contributor actions
 * ✅ Improve Wikipedia save-edit performance
 * Improve MediaWiki PHP startup time and recover from PHP7 regression
 * Audit default JS payload and lead efforts to reduce its cost
 * Make logged-in MediaWiki end users use the closest datacenter
 * Migrate prod localisation cache to faster static-array distribution
 * Parallel MediaWiki phpunit test prototype

Quality and Test Engineering
Team Manager: JR Branaa
 * MTP-Y1: Platform Evolution Enable engineering-wide quality and testing strategy, tooling, education, and personnel.
 * An explicit set of unit, integration, and system testing tools is available for all supported engineering languages
 * Quality and Testing Engineering - Team formation and migration
 * ✅ Introduce TDD as a way of getting better-quality code throughout the Foundation
 * ✅ Evaluate alternative system level testing tooling options and provide a single recommendation by the end of Q2.
 * Work closely with Product teams to build and establish working relationships for the Quality and Test Engineering...
 * Organise regular TDD workshops

Release Engineering
Team Manager: Tyler Cipriani
 * Improve Resilience of Wikimedia's Gerrit Install
 * Gerrit is migrated to 2.16
 * Gerrit's backup restore is proven to work correctly



Machine Learning / Scoring Platform
Team Manager: Aaron Halfaker
 * Key Deliverable



Platform
Team Manager: Corey Floyd
 * Create a cohesive documentation portal to onboard new developers to our API
 * Create infrastructure for developing better structured documentation to make it easier to build easy to read
 * Developers can easily understand the contents of the portal and find the information they need
 * Allow developers to quickly get started building knowledge based applications using our APIs.
 * ✅ Build a prototype for the documentation portal
 * The portal is a hub for a thriving community of developers.
 * Develop a technical direction for the Wikimedia Platform to support Wikimedia Medium Term Plan
 * Enable the development of full-featured JavaScript web clients
 * Communicate the vision and plan for the Core Platform Team's work through the end of the FY resulting from PE
 * Develop a strategy to integrate JavaScript frameworks into the MediaWiki platform to enable easy development
 * Improve the sustainability of MediaWiki and the ease of building on top of it
 * Quantify and reduce coupling in MediaWiki Core
 * Initiatives that the Core Platform Team begins are driven to completion
 * Allow for more confident refactoring of core code
 * Close out MCR work
 * MW Core Code is better logically decomposed into libraries, introduction of new cross dependencies is...
 * ✅ Product requirements for upcoming Core Platform Team initiatives are documented
 * Further decoupling efforts
 * Close out actor and comment migration
 * Limit vandalism requests by bad actors and guarantee levels of service through securing the API
 * Reduce the risk of vandalism by bad actors by limiting throughput of anonymous API calls
 * Completion and shipping of the OAuth 2.0 initiative Epic 1 and 2
 * Reduce the risk of vandalism by bad actors by enabling the ability to disable access of known API users

Architecture
Team Manager: Kate Chapman
 * Define target architecture for structured content so pieces of content can be more easily used to engage users.
 * ✅ Perform task analysis modeling with product managers to help determine pain points and needed system capabilities
 * Engage stakeholders to present a plan on system changes needed to better enable structured data.
 * ✅ Develop proposal for modern system to enable structured data.
 * Present proposal to CTO and CPO to gain support for no longer focusing on building page-building software
 * Make architectural decision process clear so teams have clear direction as to what decisions have been made and how to proceed.
 * Create plan for decision making process
 * Develop template for technical design and decisions
 * Engage stakeholders in decision making process for feedback.

Platform Engineering
Team Manager: Mat Nadrofsky
 * Drive the Delivery of Q4 Platform Engineering Initiatives
 * Limit the ability for bad actors and misinformed users to impact the availability of our services
 * Enable developers to make system changes while maintaining a consistent stable experience
 * Help a team migrate their service to Kubernetes



Research
Team Manager: Leila Zia
 * Key Deliverable



Search Platform
Team Manager: Guillaume Lederrey
 * Key Deliverable



Security
Team Manager: John Bennett
 * Key Deliverable



Site Reliability Engineering
Directors: Mark Bergsma and Faidon Liambotis
 * Cross-cutting

Service Operations
Team Manager: Mark Bergsma
 * Key Deliverable

Data Persistence
Team Manager: Mark Bergsma
 * Key Deliverable

Traffic
Team Manager: Brandon Black
 * Key Deliverable

Infrastructure Foundations
Team Manager: Faidon Liambotis
 * Key Deliverable

Observability
Team Manager: Faidon Liambotis
 * Key Deliverable

Data Center Operations
Team Manager: Willy Pao
 * Key Deliverable



Technical Engagement
Team Manager: Birgit Müller

Developer Advocacy
Team Manager: Birgit Müller
 * Develop, test and evaluate different formats to build technical capacity in smaller wikis
 * ✅ Conduct workshop and document the technical challenges small wikis face in North America
 * Organize an online workshop series for Indic language small wikis
 * Create a hub for the Small Wiki Toolkits initiative
 * A starter kit for small wikis containing a recommended set of templates, Gadgets, bots, etc. is available by Q4
 * Write a report highlighting lessons learned from developing and testing different formats to build technical capacity in smaller wikis


 * Increase visibility & knowledge of technical contributions, services and consumers across the Wikimedia ecosystem
 * ✅ Establish Coolest Tool Award
 * Share stories and insights from the technical community
 * Increase knowledge on scope and breadth of technical contributions and contributors
 * Train people how to use Phabricator to increase acceptance and foster collaboration


 * Successfully run Wikimedia’s technical internship and outreach programs
 * Successfully coordinate Outreachy and GSoC
 * ✅ Mentor the first season of Google Season of Docs.
 * ✅ Submit and hold session on Wikimedia's Tech internships at WikiCon North America
 * ✅ Successfully coordinate Google Code-in 2019
 * Mentor 1 intern on the WikiContrib project via Outreachy round 20

Wikimedia Cloud Services
Team Manager: Bryan Davis
 * All Debian Jessie instances are removed/replaced in Cloud VPS hosted projects
 * ✅ Remove Debian Jessie from the Cloud VPS "toolsbeta" project
 * ✅ Remove Debian Jessie from the Cloud VPS "tools" project
 * Remove Debian Jessie from the Cloud VPS "openstack" project


 * Increase application security by hosting tools using unique hostnames rather than path based routing
 * ✅ Update front proxy to support host based routing
 * ✅ Create redirect system to preserve function of legacy URLs following conversion from path-based to host-based routing of each tool
 * Migrate all tools to host based routing
 * ✅ Update `webservice` to support host based routing
 * ✅ Migrate 5+ early adopter/beta tester tools to host based routing
 * Interwiki links support for $tool.toolforge.org
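
The redirect goal above (keeping legacy path-based URLs working after a tool moves to its own hostname) can be illustrated with a minimal sketch. It is not the front proxy's actual implementation: the legacy host `tools.wmflabs.org`, the function name `legacy_to_host_based`, and the choice to return `None` for non-matching URLs are assumptions made for illustration; only the `$tool.toolforge.org` pattern comes from the list above.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit

# Assumption: legacy tools were served under this host as /<tool>/<path>.
LEGACY_HOST = "tools.wmflabs.org"
# From the "$tool.toolforge.org" pattern named in the goals above.
NEW_DOMAIN = "toolforge.org"


def legacy_to_host_based(url: str) -> Optional[str]:
    """Map a path-based legacy URL to its host-based equivalent.

    Returns None when the URL is not a legacy tool URL, so a caller
    can fall through to normal handling instead of redirecting.
    """
    parts = urlsplit(url)
    if parts.hostname != LEGACY_HOST:
        return None
    # The first path segment is the tool name; the rest is the tool's own path.
    tool, _, rest = parts.path.lstrip("/").partition("/")
    if not tool:
        return None
    return urlunsplit(
        ("https", f"{tool}.{NEW_DOMAIN}", "/" + rest, parts.query, parts.fragment)
    )
```

For example, under these assumptions `legacy_to_host_based("https://tools.wmflabs.org/mytool/foo?x=1")` yields `https://mytool.toolforge.org/foo?x=1`. Isolating each tool on its own hostname is what gives the security benefit named in the goal, since browsers scope cookies and same-origin rules by host rather than by path.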


 * Upgrade Toolforge Kubernetes to 1.16
 * Update `webservice` to support k8s 1.16 APIs
 * Determine blockers for k8s 1.16 upgrade and assign as tasks/KRs to team
 * ✅ Fix psp API group to work with k8s 1.16
 * Deploy Kubernetes 1.16 in Toolforge


 * WMCS Infrastructure as a Service (IaaS)
 * Debian Jessie operating system deprecation
 * OpenStack platform upgrades
 * Galera cluster
 * Ceph instance storage
 * Fix Cloud VPS and Toolforge mail servers to work with the modern internet


 * WMCS Platform as a Service (PaaS)
 * Provide a more modern, secure, and performant PaaS experience for Toolforge tools
 * Increase quality of technical documentation for Toolforge and Cloud VPS users
 * PAWS Kubernetes rebuild