Wikimedia Technology/Goals/2019-20 Q1

Technology Department Team Goals and Status for Q1 FY19/20 in support of the Medium Term Plan (MTP) Priorities and Annual Plan for FY19/20



Technical Engagement
Team Manager: Birgit Müller

Core
 * - HA for OpenStack API endpoints (keystone, glance, nova, designate)
 * - OpenStack version upgrade(s) - tbc in Q2
 * - Jessie deprecation (infra + Cloud VPS) - tbc in Q2
 * - Ceph cluster POC
 * - Improve Cloud VPS documentation - tbc in Q2
 * - Toolforge Kubernetes redesign/upgrade
 * - Improve Toolforge documentation - tbc in Q2

Increased visibility & knowledge of technical contributions, services and consumers across the Wikimedia ecosystem (Reduce Complexity of the Platform, Movement Diversity)
 * - Continue Tech Talks
 * - Conduct Coolest Tool Award
 * - Publish Technical Contributors Map
 * - Blog posts on Small Wiki Toolkits & Coolest Tool Award
 * - Design & publish Tech Engagement quarterly newsletter
 * - Develop visualization tool for WMCS edit data - tbc in Q2
 * - Publish Developer Metrics

Technical communities are supported & it becomes easier to contribute (Reduce Complexity of the Platform; Movement Diversity)
 * - Develop support formats: Coordinate Small Wiki Toolkits focus area; Create toolkits & experiment, evaluate
 * - Technical internships and mentoring: Mentor students in GSoD, GSoC, and Outreachy
 * Always - Provide continuous bug management support in Phabricator

Dependencies on:

 Status 
 * July 2019 -
 * August 2019 -
 * September 2019 -



Fundraising Tech
Team Manager: Erika Bjune
Dependencies on:

 Status 
 * July 2019 -
 * August 2019 -
 * September 2019 -



Research
Team Manager: Leila Zia
Dependencies on:

 Status 
 * July 2019 -
 * August 2019 -
 * September 2019 -



Security
Team Manager: John Bennett

Core


 * - Finalize and publish service catalog
 * - Draft new employee security awareness content
 * - Create initial set of security measurements and metrics
 * - Create initial version of PHP security toolkit
 * - Create design document for how DAST will work
 * - Create team learning circles
 * - Publication of security team roadmap
 * - Release of Phan 2.x
 * - Security release
 * - Bug Bounty SOP
 * - Deploy StopForumSpam
 * - Draft 3 new security policies
 * - Draft 3 new Security Incident Response playbooks
 * - Socialize Corrective Action plan for Security Incidents
 * - Incident response Table Top and updates to security after action reports and improvement plans
 * - Discovery ticket for ElastAlert detection and alerting
 * - Phishing Security Awareness, at least 2 completed Phishing campaigns
 * - Team retro, implement agile ceremonies for appsec related projects
 * - Publish data protection and retention guidelines
 * - Create privacy engineering charter
 * - Update data classification policy
 * - Publication of privacy review template

Dependencies on:

 Status 
 * July 2019 -
 * August 2019 -
 * September 2019 -



Core Platform
Team Manager: Corey Floyd
Dependencies on:

 Status 
 * July 2019 -
 * August 2019 -
 * September 2019 -



Analytics
Team Manager: Nuria Ruiz


 * Make it easier to understand the history of all Wikimedia projects
 * Release Public Edit Data Lake Dataset in JSON/CSV or MySQL dump format


 * Make it easier to understand how Commons media is used across our projects.
 * Start work on the mediarequest API.


 * Increase Data Quality
 * Entropy-based alarms for data issues


 * Increase Data Privacy and Security
 * Make Kerberos infrastructure production-ready.


 * Modern Event Platform
 * * Continue moving events from job queue to EventGate main.
 * * Development work for Kafka Connect
 * * Schema Repository CI for convention and backwards compatibility enforcement


 * Operational Excellence. Increase Resilience of Systems
 * * New ZooKeeper cluster for tier-2
 * Operational Excellence. Reduce Operational Load by Phasing Out Legacy Systems
 * * Sunset MySQL data store for eventlogging.
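
As an illustration of the "Entropy-based alarms for data issues" goal above, here is a minimal sketch of one way such an alarm could work. It is not the Analytics team's implementation: the function names, the standard-deviation threshold, and the sample counts are all assumptions made for this example. The idea is to compute the Shannon entropy of a categorical distribution (for instance, request counts per category on a given day) and raise an alarm when it drifts far from recent values.

```python
import math


def shannon_entropy(counts):
    """Return the Shannon entropy, in bits, of a mapping category -> count."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


def entropy_alarm(baseline_entropies, todays_counts, tolerance=3.0, min_std=0.05):
    """Hypothetical alarm rule: fire when today's entropy is more than
    `tolerance` standard deviations away from the baseline mean.

    `min_std` is an assumed floor that keeps a very stable baseline
    from making the alarm fire on tiny fluctuations.
    """
    mean = sum(baseline_entropies) / len(baseline_entropies)
    variance = sum((h - mean) ** 2 for h in baseline_entropies) / len(baseline_entropies)
    std = max(math.sqrt(variance), min_std)
    today = shannon_entropy(todays_counts)
    return abs(today - mean) > tolerance * std, today


# Made-up example: the baseline days were spread over about four
# categories, but today all traffic collapsed into a single category.
fired, today_entropy = entropy_alarm([2.0, 1.98, 1.99], {"x": 100})
print(fired, today_entropy)
```

A sudden drop in entropy like this can indicate that a field stopped being populated or that one value is swamping the rest. A sudden rise can indicate noise or bot traffic. The right threshold depends on the dataset.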

 Status 
 * July 2019 -
 * August 2019 -
 * September 2019 -



Search Platform
Team Manager: Guillaume Lederrey
 * - Reduce complexity of the platform: Reduce technical debt and increase automation to reduce workload and make it easier to add new search features
 * Refactor query highlighting to make it extensible by other extensions
 * Refactor Mjolnir jobs into separate smaller jobs
 * - Core work: Maintain CirrusSearch and the Search API and WDQS
 * Core maintenance work (always)
 * Improve WDQS updater performance by writing custom code for updates
 * Full data reimport for WDQS to enable optimizations that were done last quarter
 * Work through the backlog of bugs and performance improvements for WDQS with our contractor
 * Start the hiring process for a new WDQS Engineer
 * Hardware renewal: replace elastic1017-1031
 * - Continue to identify and enable machine learning and natural language processing techniques to improve the quality of search
 * "Did you mean" suggestions: deploy method0 to production
 * - Underserved communities benefit from search techniques that to date are only used on big wikis, such as machine learning-aided ranking, word embeddings, or language-specific analyzers: Language analysis / Phab work
 * Work on highest priority language tickets (Discovery Search board / Language Stuff; always)
 * - Structured Data on Commons support (as needed)
 * RDF export
 * Address the indexing issues of MediaInfo (labels vs descriptions)

Dependencies on: RDF export: WMDE / Wikidata, Hardware renewal: DC Ops, MediaInfo indexing: SDoC

 Status 
 * July 2019 - (to be discussed)
 * August 2019 -
 * September 2019 -



Scoring Platform
Team Manager: Aaron Halfaker
 * - Build out the Jade API to support user-actions
 * - Build/improve models in response to community demand
 * - Support operations infrastructure improvements (k8s, redis SPOF)

Dependencies on: SRE

 Status 
 * July 2019 -
 * August 2019 -
 * September 2019 -



Release Engineering
Team Manager: Greg Grossmeier

Priority: Reduce complexity of the platform to make it easier for new developers to contribute.

 * - All applicable new and existing services (and partially MediaWiki) exist in the Deployment Pipeline
 * Migrate restrouter
 * (Stretch): MobileContentService
 * (Stretch): Preparatory MediaWiki config clean-up & static loading work
 * - Actionable code health metrics are provided for code stewards
 * Scope out requirements for a self-hosted version of SonarQube for our use.
 * Expand set of repositories covered by code health metrics (via SonarQube)
 * - Provide a standardized local MediaWiki development environment
 * Migrate local-charts to deployment-charts
 * Instantiate testing and linting of Helm charts
 * Preliminary work on a CLI for setup/management

Dependencies on: SRE, Code Health Metrics WG

Core: Developers have a consistent and dependable deployment service.

 * - Iteratively improve our deployment tooling, service, and processes.
 * Streamline the Kibana -> Phab error reporting workflow (using client-side code, at first)
 * - Align developer services with SRE best practices.
 * Work with SRE to identify and implement needs of Phabricator and Gerrit (expected to last into Q2)

Dependencies on: SRE, Performance

Core: Maintain and improve the Continuous Integration and Testing services

 * - Maintain CI and testing services
 * Scope updated CI/testing KPIs
 * Set up an experimental Elasticsearch instance to store and analyze CI logs and metrics
 * - Evaluate, select, and implement a new CI infrastructure.
 * POCs of GitLab and Zuul3 systems; evaluate options
 * Document an implementable architecture for what we want in new CI

Dependencies on: SRE/Others invested in CI architecture choices

Core: A clear set of unit, integration, and system testing tools is available for all supported engineering languages.

 * - Update the existing system test tooling and developer education.
 * Update existing Selenium documentation (https://www.mediawiki.org/wiki/Selenium/Node.js)

Dependencies on: none.

 Status 
 * July 2019 -
 * August 2019 -
 * September 2019 -



Performance
Team Manager: Kate Chapman

Platform Evolution: Reduce complexity of the platform to make it easier for new developers to contribute.

 * - Improve the filtering of obsolete domains in GTIDs to avoid timeouts on GTID_WAIT. (get reviewed and merged)
 * - Support Parsing Team with performance insights on Parsoid-php roll out.
 * - Reduce reliance on master-DB writes for RL file-dependency tracking (Multi-DC prep). T113916
 * - Audit use of CSS image-embedding (improve page-load time by reducing the size of stylesheets) T121730
 * - Figure out the right store to use for the main stash (dynamo? mcrouter?). T212129
 * - Swift cleanup + WebP ramp up. T211661

Core: Maintain libraries for which Performance is currently responsible, evaluate libraries to determine whether they should be owned by another team, and perform handoffs to other teams when possible.

 * - [Ongoing] Support and maintenance of MediaWiki's object caching and data access components.
 * - [Ongoing] Support and maintenance of WebPageTest and synthetic testing infrastructure.
 * - [Ongoing] Support and maintenance of MediaWiki's ResourceLoader.
 * - [Ongoing] Support and maintenance of Fresnel.
 * - Support AbuseFilterCachingParser deployment. T156095

Core: We can quickly detect performance regressions and better detect potential ones prior to deployment.

 * - Add Grafana dashboard for WANObjectCache stats. T197849

Core: Create a culture of performance in Wikimedia

 * - Write two performance topic blog posts.
 * - Line up interested speakers for a FOSDEM Web Performance devroom proposal.

Dependencies on: SRE, CPT, Parsing

 Status 
 * July 2019 -
 * August 2019 -
 * September 2019 -



Site Reliability Engineering
Directors: Mark Bergsma and Faidon Liambotis

Cross-cutting

 * Firefighting improvements, ONFIRE (continuation)
 * Produce a standardized template for a status document for ongoing major incidents
 * Iterate on a process for running the incident documentation review board; review 90% of incident documents written this quarter
 * [stretch] Research possible implementations for synchronizing team contact information to everyone's phone


 * Database automation (continuation)
 * Productionize dbctl (deploy, import data, set up alerts)
 * Set up MediaWiki to optionally read the database configuration from etcd
 * Gradually migrate all MediaWiki instances to read the database configuration from etcd

Service Operations
Team Manager: Mark Bergsma


 * Complete the transition to PHP 7 in production
 * Move all application server & API traffic to PHP 7
 * Move maintenance scripts to PHP 7
 * Move jobrunners to PHP 7
 * [stretch] Remove HHVM from production


 * Self-service Deployment Pipeline
 * Define and document the process for service owners to deploy a new service onto the pipeline
 * Support migration of services RESTrouter, wikifeeds by service owners

Dependencies on: Release Engineering, Core Platform, Performance

Data Persistence
Team Manager: Mark Bergsma


 * Address Database infrastructure blockers on datacenter switchover
 * Order, rack, and set up 10 new hosts in codfw
 * Failover all codfw masters
 * Failover eqiad masters to new hosts and decommission old masters
 * [stretch] Deploy codfw non-MediaWiki database proxies


 * Strengthen backup infrastructure and support
 * Deploy new Bacula hardware
 * Transfer ownership and knowledge of Bacula backup infrastructure
 * [stretch] Migrate general backup service from old to new host(s)

Traffic
Team Manager: Brandon Black


 * Create usable TLS ciphersuite dashboard
 * Decide on Prometheus vs Webrequest
 * Send all the right data from the cp boxes upstream
 * Make useful charts and graphs that can correlate ciphers to UA, Geo, ASN, etc.
 * Finish TLS deployment via ATS
 * Continuation of previous Q goal
 * Switch production edge TLS termination to ATS
 * [stretch] Support TLS1.3
 * ATS Backends: Test live cache_text traffic
 * Implement basic TLS termination for cache_text services (may not be final solution w/ real PKI)
 * Begin testing a small fraction of live cache_text traffic through ATS backends
 * AuthDNS: Implement smooth geoip repooling solution
 * Design new dynamic response architecture for future needs
 * MVP/Draft code for geoip smooth repooling using above
 * [stretch] release code, use in production
 * Deploy anycast recdns to all production
 * Finish evaluating current running implementation under live test
 * Implement any minor improvements we need
 * Switch most production hosts to using anycast recdns @ 10.3.0.1

Infrastructure Foundations
Team Manager: Faidon Liambotis


 * Puppet 5 (continuation & wrap-up)
 * Upgrade all production Puppetmasters to Puppet 5.5
 * Upgrade production PuppetDB to 6.2 in both data centers


 * Configuration management for network operations
 * Productionize existing configuration management software (jnt)
 * Integrate with Netbox for device selection and topology data gathering
 * Add safe push method for the configuration: interactive and sequential
 * [stretch] Evaluate Netbox to store network secrets


 * Bare metal cloud
 * Import existing management interface IPs into Netbox
 * Automate the assignment of new hosts' management interface IPs
 * Automate the generation of management interface DNS records


 * Identity Management & Single Sign On
 * Build a production prototype of an Apereo CAS identity provider
 * Switch (at least) one service to authenticate against the identity provider

Observability
Team Manager: Faidon Liambotis


 * Improve our alerting capabilities
 * Produce and circulate an alerting infrastructure roadmap
 * Establish periodic alert reviews, complete one by EOQ
 * Reduce Icinga alert noise


 * Tech debt: sunsetting of Graphite (part 1)
 * Deprecate statsd: fully migrate >= 30% of producers off statsd
 * [stretch] Deploy Thanos (long-term storage) stateless components: sidecar and query

Data Center Operations
Team Manager: Willy Pao


 * Refine procurement process
 * Improve average end-to-end turnaround time from hardware request to hardware delivery
 * Tighten up procurement cycle by implementing regularly scheduled deadlines for quotes, approvals, and purchase orders
 * Implement general template form for service owners to fill in


 * Improve turnaround times on repair/break-fix tasks
 * Implement a new hardware repair template & refine existing triaging processes
 * Enforce regular use of hardware troubleshooting runbook
 * Hire and on-board a contractor for additional support in eqiad
 * Identify 3rd party contractor to take care of straightforward tasks at remote caching sites


 * Operational excellence: resolve all inventory inconsistencies
 * Clean up existing backlog of Netbox inconsistencies and data errors
 * Keep all Netbox reports in a "passed" state
 * Maintain zero error reports going forward


 * Recycle all existing decommissioned hardware
 * Clear out existing decommissioned hardware in ulsfo and codfw
 * Determine alternative disposition company for Juniper equipment

 Status 
 * July 2019 -
 * August 2019 -
 * September 2019 -