Wikimedia Technology/Goals/2019-20 Q1

From MediaWiki.org
Wikimedia Technology Goals, FY2019–20, Q1 (July - September 2019)

Technology Department Team Goals and Status for Q1 FY19/20 in support of the Medium Term Plan (MTP) Priorities and Annual Plan for FY19/20


Analytics

Team Manager: Nuria Ruiz

Make it easier to understand the history of all Wikimedia projects
  • Release Public Edit Data Lake Dataset in JSON/CSV or MySQL dump format, task T208612 - To do
Make it easier to understand how Commons media is used across our projects
  • Start work on the mediarequests API to get view statistics for individual Wikimedia images, task T210313 - In progress
Increase Data Quality
  • Entropy-based alarms for data issues, task T215863 - Partially done
Increase Data Privacy and Security
  • Make Kerberos infrastructure production ready, task T226089 - In progress, will continue into Q2 as well
Modern Event Platform
  • Continue moving events from the job queue to EventGate main, task T211248 - Partially done
  • Development work for Kafka Connect, task T223626 - Postponed to next quarter
  • Schema Repository CI for convention and backwards compatibility enforcement - Done
Operational Excellence. Increase Resilience of Systems
  • New ZooKeeper cluster for tier-2, task T217057 - To do
Operational Excellence. Reduce Operational Load by Phasing Out Legacy Systems
  • Sunset MySQL data store for EventLogging, task T159170 - In progress this quarter and next

Status

  • July 23, 2019 - In progress
    • All work related to Kafka Connect is moving to next quarter due to licensing issues, so it is marked as not done for this quarter.
    • Migration of events to EventGate main has been rolling out without issues.
    • Work on the mediarequests API has started; the API will probably be in service next quarter.
    • Entropy-based alarms for data issues: the work planned for this quarter is done and will be picked up again next quarter.
  • August 2019 - To do
  • September 2019 - To do
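
The page does not describe how the entropy-based alarms are implemented. As a rough illustration of the general idea only, the sketch below computes the Shannon entropy of a categorical field (for example, one day's values of some request attribute) and flags a day whose entropy deviates strongly from recent history. The function names, the z-score rule, and the threshold are all hypothetical and are not taken from the Analytics team's code.

```python
import math
from collections import Counter
from statistics import mean, stdev


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_alarm(history, current, threshold=3.0):
    """Hypothetical rule: alarm when the current entropy is more than
    `threshold` sample standard deviations away from the historical mean."""
    mu = mean(history)
    sigma = stdev(history)  # needs at least two historical points
    if sigma == 0:
        # No variation in history: any change at all is treated as anomalous.
        return current != mu
    return abs(current - mu) / sigma > threshold


if __name__ == "__main__":
    # Made-up daily entropies for an illustrative field.
    history = [2.0, 2.1, 1.9, 2.0]
    today = shannon_entropy(["a"] * 97 + ["b", "c", "d"])  # one value dominates
    print(entropy_alarm(history, today))
```

A sudden drop in entropy, as in the usage above, can indicate that a field has collapsed to a single value, which is the kind of data issue such an alarm is meant to surface.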


Core Platform

Team Manager: Corey Floyd

To do - Kick off Front End Working Group to explore recommendations from the Q4 research and identify a project to begin working on in Q2 (PE, Reduce Complexity of the Platform)
To do - Build out platform infrastructure for partner APIs to support better access and increased load (PE, Tech and Product Partnerships)
To do - Develop Multi-DC storage solution(s) to hold the remaining content in the main stash in order to unblock the move to Multi-DC reads (Core)

Dependencies on: Product (front end working group and API work); SRE and Performance (Multi-DC main stash)

Status

  • July 25, 2019 - In progress
    • Kickoff of the working group is going well; it started this week with Technology and Product.
    • Platform infrastructure build-out is in progress; we are waiting on ongoing Parsoid work (deploying APIs) to be completed.
    • Multi-DC storage solutions work is in progress; the team is figuring out possible alternate solutions.
  • August 2019 - To do
  • September 2019 - To do


Fundraising Tech

Team Manager: Erika Bjune

To do - Get India form to first 1-hour test and continue further development
To do - Get recurring up-sell to first 1-hour test and continue further development
To do - Support ongoing fundraising activities

Dependencies on: Advancement team, Dlocal, Ingenico

Status

  • July 2019 - In progress on all 3 points.
  • August 2019 - To do
  • September 2019 - To do


Performance

Team Manager: Kate Chapman

Platform Evolution: Reduce complexity of the platform to make it easier for new developers to contribute.

To do - Improve the filtering of obsolete domains in GTIDs to avoid timeouts on GTID_WAIT (get reviewed and merged).
To do - Support Parsing Team with performance insights on Parsoid-php roll out.
To do - Reduce reliance on master-DB writes for RL file-dependency tracking (Multi-DC prep). T113916
To do - Audit use of CSS image-embedding (improve page-load time by reducing the size of stylesheets). T121730
To do - Figure out the right store to use for the main stash (dynamo? mcrouter?). T212129
To do - Swift cleanup + WebP ramp up. T211661

Core: Maintain libraries for which Performance is currently responsible, evaluate libraries to determine whether they should be owned by another team, and perform handoffs to other teams when possible.

To do - [Ongoing] Support and maintenance of MediaWiki's object caching and data access components.
To do - [Ongoing] Support and maintenance of WebPageTest and synthetic testing infrastructure.
To do - [Ongoing] Support and maintenance of MediaWiki's ResourceLoader.
To do - [Ongoing] Support and maintenance of Fresnel.
To do - Support AbuseFilterCachingParser deployment. T156095

Core: We can quickly detect performance regressions and are better able to detect potential ones prior to deployment.

To do - Add Grafana dashboard for WANObjectCache stats. T197849

Core: Create a culture of performance in Wikimedia

To do - Write two performance topic blog posts.
To do - Line up interested speakers for a FOSDEM Web Performance devroom proposal.

Dependencies on: SRE, CPT, Parsing

Status

  • July 2019 - In progress
  • August 2019 - To do
  • September 2019 - To do


Release Engineering

Team Manager: Greg Grossmeier

Priority: Reduce complexity of the platform to make it easier for new developers to contribute.

In progress - All applicable new and existing services (and partially MediaWiki) exist in the Deployment Pipeline
  • Migrate restrouter - Done
  • (Stretch): MobileContentService is now in progress
  • (Stretch): Preparatory MediaWiki config clean-up & static loading work - To do
In progress - Actionable code health metrics are provided for code stewards
  • Scope out requirements for a self-hosted version of SonarQube for our use - Stalled for now, pending more investigation
  • Expand set of repositories covered by code health metrics (via SonarQube) - In progress
In progress - Provide a standardized local MediaWiki development environment
  • Migrate local-charts to deployment-charts - In progress
  • Instantiate testing and linting of Helm charts - In progress
  • Preliminary work on a CLI for setup/management - In progress

Dependencies on: SRE, Code Health Metrics WG

Core: Developers have a consistent and dependable deployment service.

In progress - Iteratively improve our deployment tooling, service, and processes.
  • Streamline the Kibana -> Phab error reporting workflow (using client-side code, at first) - In progress
To do - Align developer services with SRE best practices.
  • Work with SRE to identify and implement needs of Phabricator and Gerrit (expected to last into Q2) - To do

Dependencies on: SRE, Performance

Core: Maintain and improve the Continuous Integration and Testing services

To do - Maintain CI and testing services
  • Scope updated CI/testing KPIs - To do
  • Set up an experimental Elasticsearch instance to store and analyze CI logs and metrics - To do
In progress - Evaluate, select, and implement a new CI infrastructure.
  • POCs of GitLab and Zuul3 systems (as well as Argo); evaluate options - In progress
  • Document an implementable architecture for what we want in new CI - In progress

Dependencies on: SRE/Others invested in CI architecture choices

Core: A clear set of unit, integration, and system testing tools is available for all supported engineering languages.

In progress - Update the existing system test tooling and developer education.

Dependencies on: none.

Status

  • July 25, 2019 - In progress
    • Migrate restrouter is done and is now in the Services team's hands.
    • Some portions of SonarQube are not open source, so we're looking into options.
    • Streamline the Kibana -> Phab error reporting workflow: a POC exists now and should be deployed soon.
  • August 2019 - To do
  • September 2019 - To do


Research

Team Manager: Leila Zia

To do - [P-O14-D4] Run a series of interviews, office hours, or surveys to gather the volunteer editor community's input on citation needed template recommendations. The result of this work will inform the specifications of an API (to be developed) to surface citation needed recommendations as well as future directions for this research. task T228442
In progress - [P-O14-D4] Complete the research on characterizing Wikipedia citation usage (Why We Leave Wikipedia). This goal will continue in Q2 and, depending on the submission results, potentially in Q3. task T227790
In progress - [W-O6-D3] Computer vision consultation as part of Structured Data on Commons. task T228440
In progress - [P-O14-D6] Building a pipeline for image classification based on Commons categories. task T228441
In progress - [P-O14-D4] Make substantial progress towards a comprehensive literature review about automatic detection of misinformation and disinformation on the Web. We expect this work to be completed in Q2 and inform the work in this direction in Q3+. task T229595
In progress - [P-O14-D4] Understand patrolling on Wikipedia. A write-up describing how patrolling is being done on Wikipedia across the languages. This work may be extended further by understanding the patrolling on Wikipedia in the context of Wikipedia's interaction with other projects such as Wikidata, Wikimedia Commons, ... task T228817
In progress - Conduct the analysis on reader surveys to understand the relation between demographics and the consumption of content on Wikipedia across languages (Why We Read Wikipedia + Demographics). This research will be concluded in Q2 and we expect substantial progress in Q1. task T228279
In progress - Hiring and onboarding. We expect 1-2 scientists to join the team in Q1 and the onboarding work will need to happen. We also expect to open an engineering position in the team. task T229259
In progress - [T-O12-D3] Determine important features of articles w/r/t level of reader interest across different demographic groups (as motivation for what aspects a general article category model should capture). task T228319
Partially done - Wrap up editor gender work. task T227793

Dependencies on: Product, Community Liaisons, and Structured Data teams

Status

  • July 23, 2019 - In progress, notes:
    • Complete the research on characterizing Wikipedia citation usage -- bulk of the work will be done in Q1 and Q2, and submitted in Q3.
    • Computer vision consultation as part of Structured Data on Commons -- more continued work on this, deadline is end of calendar year, currently waiting on word from Product on direction.
    • Building a pipeline for image classification based on Commons categories -- this work is ongoing through this quarter and next.
    • Comprehensive literature review about automatic detection of misinformation and disinformation -- this work will go on, but is not sustainable long term without addition of headcount for the team.
    • Analysis on reader surveys to understand the relation between demographics and the consumption of content -- we hope to present this at Wikimania 2019
  • August 2019 - To do
  • September 2019 - To do


Scoring Platform

Team Manager: Aaron Halfaker

To do - Build out the Jade API to support user-actions
To do - Build/improve models in response to community demand
To do - Support operations infrastructure improvements (k8s, Redis SPOF)

Dependencies on: SRE

Status

  • July 2019 - To do
  • August 2019 - To do
  • September 2019 - To do


Search Platform

Team Manager: Guillaume Lederrey

In progress - Reduce complexity of the platform: Reduce technical debt and increase automation to reduce workload and make it easier to add new search features
  • Refactor query highlighting to make it extensible by other extensions - In progress
  • Refactor Mjolnir jobs into separate smaller jobs - In progress
To do - Core work: Maintain CirrusSearch and the Search API and WDQS
  • Core maintenance work (always in progress)
  • Improve WDQS updater performance by writing custom code for updates, task T212826 - In progress
  • Full data reimport for WDQS to enable optimizations that were done last quarter - Done
  • Work through the backlog of bugs and performance improvements for WDQS with our contractor - In progress
  • Start the hiring process for a new WDQS Engineer - In progress
  • Hardware renewal: replace elastic1017-1031, task T226843 - In progress
To do - Continue to identify and enable machine learning and natural language processing techniques to improve the quality of search
  • "Did you mean" suggestions: deploy method0 to production - In progress
To do - Underserved communities benefit from search techniques that to date are only used on big wikis, like machine learning–aided ranking, word embeddings, or language-specific analyzers: Language analysis / Phab work
To do - Structured Data on Commons support (as needed)
  • RDF export, task T221916 - In progress
  • Address the indexing issues of MediaInfo (labels vs descriptions), task T226722 - Done

Dependencies on: RDF export: WMDE / Wikidata, Hardware renewal: DC Ops, MediaInfo indexing: SDoC

Status

  • July 30, 2019 - In progress
    • Hiring process is in full swing for WDQS engineer - lots of folks applying!
    • Hardware renewal is in progress and we're getting quotes
  • August 2019 - To do
  • September 2019 - To do


Security

Team Manager: John Bennett

Core

In progress - Finalize and publish service catalog
In progress - Draft new employee security awareness content
To do - Create initial set of security measurements and metrics
In progress - Create initial version of PHP security toolkit
To do - Create design document for how DAST will work
To do - Create team learning circles
In progress - Publication of security team roadmap
To do - Release of Phan 2.x
To do - Security release
To do - Bug Bounty SOP
To do - Deploy StopForumSpam
To do - Draft 3 new security policies
To do - Draft 3 new Security Incident Response playbooks
To do - Socialize Corrective Action plan for Security Incidents
To do - Incident response Table Top and updates to security after action reports and improvement plans
To do - Discovery ticket for ElastAlert detection and alerting
To do - Phishing Security Awareness, at least 2 completed Phishing campaigns
To do - Team retro, implement agile ceremonies for appsec related projects
In progress - Publish data protection and retention guidelines
In progress - Create privacy engineering charter
In progress - Update data classification policy
In progress - Publication of privacy review template

Dependencies on: New employee security awareness needs OIT onboarding and new account process integration.

Status

  • July 25, 2019 - In progress
    • The service catalog and 4-5 service descriptions are being drafted and are scheduled for release at the end of the quarter.
    • New employee security awareness content will bolt on to the OIT new employee process. Content being prepared, hope to deploy this quarter.
    • Initial measurements around the number of concept reviews for both appsec and privacy engineering will be collected this quarter.
    • Ongoing work in the creation of some appsec automation via PHP security toolkit
    • Ongoing work and investigation on how a DAST solution could fit into our appsec pipeline
    • Security team roadmap is being built in Asana and will be published on office wiki this quarter.
    • Lots of work in the data protection and privacy engineering space.
  • August 2019 - To do
  • September 2019 - To do


Site Reliability Engineering

Directors: Mark Bergsma and Faidon Liambotis

Cross-cutting

To do - Firefighting improvements, ONFIRE (continuation)
  • Produce a standardized template for a status document for ongoing major incidents
  • Iterate on a process for running the incident documentation review board; review 90% of incident documents written this quarter
  • [stretch] Research possible implementations for synchronizing team contact information to everyone's phone
In progress - Database automation (continuation)
  • Productionize dbctl (deploy, import data, set up alerts)
  • Set up MediaWiki to optionally read the database configuration from etcd
  • Gradually migrate all MediaWiki instances to read the database configuration from etcd

Service Operations

Team Manager: Mark Bergsma

In progress - Complete the transition to PHP 7 in production
  • Move all application server & API traffic to PHP 7
  • Move maintenance scripts to PHP 7
  • Move jobrunners to PHP 7 - Done
  • [stretch] Remove HHVM from production
In progress - Self-service Deployment Pipeline
  • Define and document the process for service owners to deploy a new service onto the pipeline
  • Support migration of services RESTrouter, wikifeeds by service owners

Dependencies on: Release Engineering, Core Platform, Performance

Data Persistence

Team Manager: Mark Bergsma

In progress - Address Database infrastructure blockers on datacenter switchover
  • Order, rack and set up 10 new hosts in codfw
  • Failover all codfw masters
  • Failover eqiad masters to new hosts and decommission old masters
  • [stretch] Deploy codfw non-MediaWiki database proxies
To do - Strengthen backup infrastructure and support
  • Deploy new Bacula hardware
  • Transfer ownership and knowledge of Bacula backup infrastructure
  • [stretch] Migrate general backup service from old to new host(s)

Traffic

Team Manager: Brandon Black

To do - Create usable TLS ciphersuite dashboard (continued)
  • Decide on Prometheus vs Webrequest
  • Send all the right data from the cp boxes upstream
  • Make useful charts and graphs that can correlate ciphers to UA, Geo, ASN, etc.
In progress - Finish TLS deployment via ATS
  • Continuation of the previous quarter's goal
  • Switch production edge TLS termination to ATS
  • [stretch] Support TLS1.3
In progress - ATS Backends: Test live cache_text traffic
  • Implement basic TLS termination for cache_text services (may not be final solution w/ real PKI)
  • Begin testing a small fraction of live cache_text traffic through ATS backends
To do - AuthDNS: Implement smooth geoip repooling solution
  • Design new dynamic response architecture for future needs
  • MVP/Draft code for geoip smooth repooling using above
  • [stretch] release code, use in production
In progress - Deploy anycast recdns to all production
  • Finish evaluating current running implementation under live test
  • Implement any minor improvements we need
  • Switch most production hosts to using anycast recdns @ 10.3.0.1

Infrastructure Foundations

Team Manager: Faidon Liambotis

In progress - Puppet 5 (continuation & wrap-up)
  • Upgrade all production Puppetmasters to Puppet 5.5
  • Upgrade production PuppetDB to 6.2 in both data centers
In progress - Configuration management for network operations
  • Productionize existing configuration management software (jnt)
  • Integrate with Netbox for device selection and topology data gathering
  • Add safe push method for the configuration: interactive and sequential
  • [stretch] Evaluate Netbox to store network secrets
In progress - Bare metal cloud
  • Import existing management interface IPs into Netbox
  • Automate the assignment of a new host's management interface IP
  • Automate the generation of management interface DNS records
In progress - Identity Management & Single Sign On
  • Build a production prototype of an Apereo CAS identity provider
  • Switch (at least) one service to authenticate against the identity provider

Observability

Team Manager: Faidon Liambotis

In progress - Improve our alerting capabilities
  • Produce and circulate an alerting infrastructure roadmap
  • Establish periodic alerts reviews, complete one by EOQ
  • Reduce Icinga alert noise
To do - Tech debt: sunsetting of Graphite (part 1)
  • Deprecate statsd: fully migrate >= 30% of producers off statsd
  • [stretch] Deploy Thanos (long-term storage) stateless components: sidecar and query

Data Center Operations

Team Manager: Willy Pao

In progress - Refine procurement process
  • Improve average end-to-end turnaround time from hardware request to hardware delivery
  • Tighten up procurement cycle by implementing regularly scheduled deadlines for quotes, approvals, and purchase orders
  • Implement general template form for service owners to fill in
In progress - Improve turnaround times on repair/break-fix tasks
  • Implement a new hardware repair template & refine existing triaging processes
  • Enforce regular use of hardware troubleshooting runbook
  • Hire and on-board a contractor for additional support in eqiad
  • Identify 3rd party contractor to take care of straightforward tasks at remote caching sites
To do - Operational excellence: resolve all inventory inconsistencies
  • Clean up existing backlog of Netbox inconsistencies and data errors
  • Keep all Netbox reports in a "passed" state
  • Maintain zero error reports going forward
In progress - Recycle all existing decommissioned hardware
  • Clear out existing decommissioned hardware in ulsfo and codfw
  • Determine alternative disposition company for Juniper equipment

Status

  • July 23, 2019 - In progress
    • Complete the transition to PHP 7 in production is currently partially blocked
    • Self-service Deployment Pipeline draft has been posted
    • Refine procurement process is in testing right now (2-week cycle)
    • Improve turnaround times on repair/break-fix tasks is also in progress with a new hire
    • Recycle all existing decommissioned hardware is in progress; we are getting quotes for the work to be done
  • August 2019 - To do
  • September 2019 - To do


Technical Engagement

Team Manager: Birgit Müller

Core

In progress - HA for OpenStack API endpoints (keystone, glance, nova, designate)
In progress - OpenStack version upgrade(s) - tbc in Q2
To do - Jessie deprecation (infra + Cloud VPS) - tbc in Q2
In progress - Ceph cluster POC
To do - Improve Cloud VPS documentation - tbc in Q2
In progress - Toolforge Kubernetes redesign/upgrade
To do - Improve Toolforge documentation - tbc in Q2

Increased visibility & knowledge of technical contributions, services and consumers across the Wikimedia ecosystem (Reduce Complexity of the Platform, Movement Diversity)

In progress - Continue Tech Talks
In progress - Conduct Coolest Tool Award
In progress - Publish Technical Contributors Map
To do - Blog posts on Small Wiki Toolkits & Coolest Tool Award
To do - Design & publish Tech Engagement quarterly newsletter
To do - Develop visualization tool for WMCS edit data - tbc in Q2
To do - Publish Developer Metrics

Technical communities are supported & it becomes easier to contribute (Reduce Complexity of the Platform; Movement Diversity)

In progress - Develop support formats: Coordinate Small Wiki Toolkits focus area; create toolkits & experiment, evaluate
To do - Technical internships and mentoring: Mentor students in GSOD, GSOC, Outreachy
Always in progress - Provide continuous bug management support in Phabricator

Dependencies for core work are on: SRE/Data Center Operations team

Status

  • July 23, 2019 - In progress as marked above
  • August 2019 - To do
  • September 2019 - To do