Wikimedia Technology/Goals/2019-20 Q2

Wikimedia Technology Goals, FY2019–20, Q2 (October - December 2019)

Technology Department Team Goals and Status for Q2 FY19/20 in support of the Medium Term Plan (MTP) Priorities and Annual Plan for FY19/20


Analytics

Team Manager: Nuria Ruiz

Reduce platform Complexity. Modern Event Platform
Build a reliable, scalable, and comprehensive platform for creating services, tools and user facing features that produce and consume event data
Resolve Kafka Connect HDFS licensing issue and decide whether we will use Kafka Connect task T223626 Postponed
Initial (Stream) Config Service implementation in vagrant task T233634 Done
Smart Tools for Better Data. Make it easier to understand the history of all Wikimedia projects
Release MediaWiki History in JSON/CSV or MySQL dump format (the best dataset to date for measuring content and contributors) Blocked
Deploy Hadoop client to dump hosts so the MediaWiki History public dataset can get to dumps in a reasonable timeframe task T234229 In progress
Smart Tools for Better Data. Make it easier to understand how Commons media is used across our projects.
Announce the deployment of the mediarequests API: task T231589 Done
Add mediarequests metrics to Wikistats UI task T234589 Done
Smart Tools for Better Data. Increase Data Quality, Privacy and Security
Deploy entropy-based alarms for data issues that could indicate bugs, traffic drops due to censorship, or inconsistencies task T215863, this work continues from Q1 In progress
Productionize Kerberos Service Done
Create test Kerberos identities/accounts for some selected users from Analytics Team in test cluster T212258 Done
Core. Operational Excellence. Increase Resilience of Systems
New zookeeper cluster for tier-2 task T217057 Done
Core. Operational Excellence. Reduce Operational Load by Phasing Out Legacy Systems/Technologies
Sunset MySQL data store for eventlogging. task T159170, this work continues from Q1 Done
Migrate eventlogging to python3 task T234593 Done

Dependencies on:

Status

  • October 28, 2019 status:
    • Finalize productionizing kerberos service, and then possibly enabling it Done
    • Set up a generic workflow to create Kerberos accounts In progress
    • Create test Kerberos identities/accounts for some selected users from Analytics Team in test cluster Done
    • Deprecate eventlogging-service-eventbus Done
    • Bot Detection “Remove automated traffic not identified as such from readers data” In progress
  • December 12, 2019 status:
    • Done
      • Make kerberos infra prod ready
      • New zookeeper cluster for tier-2
      • Sunset MySQL data store for eventlogging
      • Allow all Analytics tools to work with Kerberos auth
      • Superset upgrade
      • Release of editors per country dataset
      • Finish Swift workflow to transfer binaries from Hadoop to production
      • Enable GPU infrastructure on stats machines with purely OS components
      • Schema Repository CI for convention and backwards compatibility enforcement
      • Continue moving events from eventbus to eventgate-main
      • Start planning work on Stream Configuration Service and Product use of the Event Platform
      • Set up of Mediarequests API public endpoint. Phase 1. Infra.
      • Bot Detection Code Prototype: “Remove automated traffic not identified as such from readers data”.
      • Develop mediarequests API to get view statistics for individual Wikimedia images
    • In progress
      • Enqueue eventlogging requests for better performance
      • Release MediaWiki History in JSON/CSV or MySQL dump format (the best dataset to measure content and contributors)
      • Entropy-based alarms for data issues
      • Presto experiments, interaction with HDFS/superset.
    • Blocked
      • Describe statement of work (and task) for upcoming designer for wikistats
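
The entropy-based alarms listed among the Analytics goals are not described in detail on this page. As a rough illustration only, the sketch below assumes such an alarm compares the Shannon entropy of a categorical distribution (for example, request counts per category) against recent baseline values. The function names, the sample numbers, and the three-standard-deviation rule are hypothetical and are not taken from the Analytics implementation.

```python
import math
import statistics


def shannon_entropy(counts):
    """Shannon entropy (in bits) of a mapping of category -> count."""
    total = sum(counts.values())
    probs = [c / total for c in counts.values() if c > 0]
    return -sum(p * math.log2(p) for p in probs)


def entropy_alarm(current_counts, baseline_entropies, n_sigmas=3.0):
    """Return True when the current entropy is far from the baseline mean.

    Hypothetical rule: flag values more than n_sigmas standard
    deviations away from the mean of recent daily entropies.
    """
    mean = statistics.mean(baseline_entropies)
    spread = statistics.pstdev(baseline_entropies)
    current = shannon_entropy(current_counts)
    return abs(current - mean) > n_sigmas * spread


# Made-up daily entropies for a healthy period (illustrative only).
baseline = [1.52, 1.50, 1.49, 1.51, 1.48]

normal_day = {"A": 500, "B": 300, "C": 200}
skewed_day = {"A": 980, "B": 15, "C": 5}  # categories B and C collapse

print(entropy_alarm(normal_day, baseline))  # False
print(entropy_alarm(skewed_day, baseline))  # True
```

A drop in entropy like the one in `skewed_day` means traffic has concentrated in fewer categories, which is the kind of shift that a bug or a regional traffic drop could produce.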


Core Platform

Team Manager: Corey Floyd

Reduce platform Complexity
Migrate Service - changeprop
Modernizing front end project planning (from Front End Working Group)
Add API Integration tests and decouple components
Initial librarization of MediaWiki
Frontend Architecture Group Planning for Desktop Improvements
Tech and Product Partnerships
Implement MediaWiki REST APIs for MVP
Integrate OAuth 2.0 into API
Prototype Documentation Portal

Dependencies on:

Status

  • October 28, 2019 status:
    • Modernizing front end project planning (from Front End Working Group) In progress
    • Implement MediaWiki REST APIs for MVP In progress
    • Prototype Documentation Portal In progress
  • December 12, 2019 status:
    • Done
      • Integrate Session Service
      • Migrate Mainstash
      • API Integration testing infrastructure
      • Kick off Front End Working Group to explore recommendations from the Q4 research and identify a project to begin working on in Q2
      • Schema Registry CI
      • Stream Config Planning and Design
      • REST API for Parsoid
    • In progress
      • OAuth 2.0 Initial implementation


Fundraising Tech

Team Manager: Erika Bjune

Core Work
Support high revenue/high risk campaigns
Extra attention paid to security and privacy during highest revenue campaigns

Dependencies on:

Status

  • October 28, 2019 status:
    • all goals In progress
  • December 12, 2019 status:
    • Done
      • Support Advancement in testing and planned Q1 campaigns
      • Get India form to first 1 hour test and continue further development
      • Get recurring up-sell to first 1 hour test and continue further development
      • Proactively update current systems with latest security patches and respond to compliance or regulation changes.
      • Complete required security and compliance scans


Performance

Team Manager: Gilles Dubuc

Core Work
In progress - Provide performance expertise to FAWG outcome
To do - Hold 3 or more workshops and training sessions with 1 engineering team
In progress - Hire and onboard Systems Performance Engineer
To do - Publish 2 blog posts about performance
In progress - Organise and run the Web Performance devroom at FOSDEM 2020
Reduce Complexity of the Platform
To do - Create performance alerts for 12 different wikis
To do - Create synthetic tests for backend editing with XHGui profile comparison
To do - Expand coverage of metrics from synthetic testing (introducing user journeys). Add 5 new user journeys and a minimum of 7 new metrics
To do - Add a new Graphite instance for synthetic metrics. It needs to be connected with our current Grafana instance and documented.
To do - Migrate ResourceLoader dependency tracking off the RDBMS

Dependencies on:

Status

  • October 28, 2019 status:
    • Hire Systems Performance Engineer and create onboarding material, ensuring that this new hire has a shared understanding of the team’s performance culture. In progress
    • Organise and run the Web Performance devroom at FOSDEM 2020 In progress
    • Add a new Graphite instance for synthetic metrics. It needs to be connected with our current Grafana instance and documented. In progress
    • MachineVision extension performance review In progress
  • December 12, 2019 status:
    • Done
      • Support AbuseFilterCachingParser deployment
      • Create Grafana dashboard for WANObjectCache statistics
      • Support Parsing Team with performance insights on Parsoid-php roll out
      • Line up interested speakers for a FOSDEM Web Performance devroom proposal
      • Audit use of CSS image-embedding (improve page-load time by reducing the size of stylesheets)
    • In progress
      • Improve the filtering of obsolete domains in GTIDs to avoid timeouts on GTID_WAIT. (get reviewed and merged)
      • Reduce reliance on master-DB writes for RL file-dependency tracking (Multi-DC prep)
      • Figure out the right store to use for the main stash
      • Publish 8 blog posts about performance
      • Support and maintenance of MediaWiki's object caching and data access components.
      • Support and maintenance of WebPageTest and synthetic testing infrastructure
      • Support and maintenance of MediaWiki's ResourceLoader
      • Support and maintenance of Fresnel
      • Provide performance expertise to FAWG outcome
    • Blocked
      • Swift cleanup + WebP ramp up


Quality and Test Engineering

Team Manager: JR Branaa

Core Work
A clear set of unit, integration, and system testing tools is available for all supported engineering languages.
Update WebdriverIO from version 4 to 5 for Core.
Core Work
Actionable code health metrics are provided for code stewards
Add all applicable repos to the Code Health pipeline (Code Health Metrics).
Solicit feedback from current users of CHM POC and define phase 2 enhancements.
Improve Code Review experience
Interview engineering teams to understand their current code review practices - To do
Relaunch the Code Review Office Hours - In progress
Put in place Code Review performance metrics - In progress
Reduce complexity of the platform to make it easier for new developers to contribute
Actionable code health metrics are provided for code stewards
Make CI warn about slow tests, and publish a collated list of slow tests

Dependencies on:

Status

  • October 28, 2019 status:
    • Solicit feedback from current users of CHM POC and define phase 2 enhancements In progress
    • Relaunch the Code Review Office Hours In progress
    • Put in place Code Review performance metrics In progress (We've defined things, but need to implement)
  • December 12, 2019 status:
    • Done
    • In progress
      • Team inception, formalization, and assessment of current organizational practices
    • Blocked
      • Scope out requirements for a self-hosted version of SonarQube for our use.


Release Engineering

Team Manager: Tyler Cipriani

Reduce Complexity of Platform
Build and support a fully automated and continuous Code Health and Deployment Infrastructure
Update weekly branchcut script for MediaWiki to allow for automation
Production configuration is compiled into static files on deployment servers
Seakeeper (New CI) proposal for a dedicated CI cluster submitted for feedback
A demonstration MediaWiki development environment hosts the full TimedMediaHandler front-end and back-end workflow
Other service deployment pipeline migrations as prioritized between SRE/RelEng and relevant teams.
Core Work
Improve and maintain the Wikimedia code review system
Migrate Gerrit master from Cobalt to Gerrit1001
Migrate from Gerrit version 2.15 to 2.16
Continuation of Phabricator and Gerrit improvement (in conjunction with SRE)

Dependencies on:

Status

  • October 28, 2019 status:
    • Migrate Gerrit master from Cobalt to Gerrit1001 Done (Completed on 2019-10-22; needed to be done early in the quarter to ensure we could also jump Gerrit versions this quarter)
    • Update weekly branchcut script for MediaWiki to allow for automation In progress
    • Production configuration is compiled into static files on deployment servers In progress
    • Seakeeper (New CI) proposal for a dedicated CI cluster submitted for feedback In progress
  • December 12, 2019 status:
    • Done
      • Streamline the Kibana -> Phab error reporting workflow (using client-side code, at first)
      • Work with SRE to identify and implement needs of Phabricator and Gerrit
      • Determine path forward with current CI infrastructure given Jan 1, 2020 python2 EOL
      • Document an implementable architecture for what we want in new CI
      • POCs of GitLab, Argo, and Zuul3 systems (as possible); evaluate options
      • Migrate restrouter
      • (Stretch): Preparatory MediaWiki config clean-up & static loading work
      • Preliminary work on a CLI for setup/management (local charts)
      • Instantiate testing and linting of helm charts
    • In progress
      • (Stretch): MobileContentService
    • Blocked
      • Migrate local-charts to deployment-charts
    • Postponed
      • Scope updated CI/testing KPIs


Research

Team Manager: Leila Zia

Content Integrity
In progress - A comprehensive literature review of disinformation published in arxiv and meta (completing the work started in Q1).
To do - Build a prioritized list of actions to take (tools to build, datasets to release, etc.) for combating disinformation (through discussions with the community of editors and developers, internal consultation, and maybe with external researchers)
To do - Build one formal collaboration in the disinformation space to start the research for building solutions starting Q3.
Foundational
To do - Prepare the Research Internship proposal.
In progress - Finalize the research brief for crosslingual topical model laying out the work that will be done in this space starting Q3.
To do - Literature review of reuse. task T235780
To do - Review of the different types of re-use and what we know about their effect on traffic to Wikimedia. task T235781
To do - Review of what data is available to us and what data is not. What questions we can currently answer. What questions we can't. task T235784
To do - Initiate monthly or quarterly office hours for the community. (trial for 6 months if monthly and 12 months if quarterly)
Done - Wiki Workshop 2020 proposal submission. task T236066
To do - Plan for a challenge: come up with an initial format, put a committee together, choose a venue for presentations.
Address Knowledge Gaps
To do - Finalize the taxonomy of readership gaps
In progress - Make significant progress towards building the taxonomy of search (usage gaps). (We expect the research part of this work to conclude in Q3, as a stretch in Q2).
To do - Literature review of identified content gaps in Wikipedia
To do - Taxonomy of the causes of content gaps in Wikipedia
To do - Build a series of hypotheses for the possible causes of skewed demographic representation of Wikipedia readers (specific to gender). Identify possible formal collaborations for research and testing starting Q3 if relevant based on the learnings from the list of hypotheses.
Done - Submit the citation usage paper to TheWebConf 2020. task T236067
In progress - (via mentoring an Outreachy) start work on the development of the data-set for statements in need of citation. task T233707
In progress - Supervise a student evaluating methods to recommend images to Wikipedia pages. task T236142
To do - Train from scratch and evaluate an end-to-end (simple) classification model using Wikimedia Commons categories, optimized for GPU usage. task T221761
To do - Conduct a literature review, plan and set up collaborations for projects about understanding engagement with Wikimedia images around the world.
Core Work
In progress - Complete two 30-60-90 day plans.
To do - Finalize a proposal for changes in Research based on learnings about Research's audience, what they expect from the team, our positioning within WMF, Movement, and the Research community, and the opportunities for impact.
To do - Document and communicate with the team: expectations of the Research Scientist role and trajectory in the IC track.
To do - Research Showcase feedback collection, assessment, and proposal for changes if relevant.
To do - A half-yearly newsletter for Research with the goal of making it quarterly if bandwidth allows and/or project is successful.

Dependencies on:

Status

  • October 28, 2019 status:
    • Finalize the research brief for crosslingual topical model laying out the work that will be done in this space starting Q3. In progress
    • Finalize the taxonomy of readership gaps In progress
    • Make significant progress towards building the taxonomy of search (usage gaps). (We expect the research part of this work to conclude in Q3, as a stretch in Q2). In progress
    • Literature review of identified content gaps in Wikipedia In progress
    • Build a series of hypotheses for the possible causes of skewed demographic representation of Wikipedia readers (specific to gender). Identify possible formal collaborations for research and testing starting Q3 if relevant based on the learnings from the list of hypotheses. In progress
    • Submit the citation usage paper to TheWebConf 2020. Done
    • (via mentoring an Outreachy) start work on the development of the data-set for statements in need of citation. In progress
    • Supervise a student evaluating methods to recommend images to Wikipedia pages. In progress
    • Build a prioritized list of actions to take (tools to build, datasets to release, etc.) for combating disinformation (through discussions with the community of editors and developers, internal consultation, and maybe with external researchers) In progress
    • A comprehensive literature review of disinformation published in arxiv and meta (completing the work started in Q1) Partially done
    • Prepare the Research Internship proposal. In progress
    • Literature review of reuse. In progress
    • Initiate monthly or quarterly office hours for the community. (trial for 6 months if monthly and 12 months if quarterly) In progress
    • Wiki Workshop 2020 proposal submission. Done
    • Complete two 30-60-90 day plans. In progress
  • December 12, 2019 status:
    • Done
      • Determine important features of articles w/r/t level of reader interest across different demographic groups (as motivation for what aspects a general article category model should capture)
      • Conduct the analysis on reader surveys to understand the relation between demographics and the consumption of content on Wikipedia across languages. (Why We Read Wikipedia + Demographics). This research will be concluded in Q2 and we expect substantial progress in Q1
      • Wrap up editor gender work
      • Complete the research on characterizing Wikipedia citation usage. (Why We Leave Wikipedia). This goal will continue in Q2 and depending on the submission results potentially in Q3.
      • Make substantial progress towards a comprehensive literature review about automatic detection of misinformation and disinformation on the Web. We expect this work to be completed in Q2 and inform the work in this direction in Q3+
      • Understand patrolling on Wikipedia. A write-up describing how patrolling is being done on Wikipedia across the languages. This work may be extended further by understanding the patrolling on Wikipedia in the context of Wikipedia's interaction with other projects such as Wikidata, Wikimedia Commons, ...
      • Run a series of interviews, office hours, or surveys to gather volunteer editor community's input on citation needed template recommendations. The result of this work will inform the specifications of an API (to be developed) to surface citation needed recommendations as well as future directions for this research.
      • A comprehensive literature review of disinformation published in arxiv and meta (completing the work started in Q1)
      • Hiring and onboarding. We expect 1-2 scientists to join the team in Q1 and the onboarding work will need to happen. We also expect to open a position for an engineering position in the team.
      • Computer vision consultation as part of Structured Data on Commons
    • Postponed
      • Building a pipeline for image classification based on Commons categories.


Machine Learning / Scoring Platform

Team Manager: Aaron Halfaker

Core Work
Hire ML Engineer
Machine Learning Infrastructure
Jade use, maintenance, and user-research
Deployment of session-based models
Jade Entity Page UI
Newcomer quality session models
Expansion of Topic Model to ar, ko, and cswiki

Dependencies on:

Status

  • October 28, 2019 status:
    • (no updates available)
  • December 12, 2019 status:
    • Done
      • Build out the Jade API to support user-actions
      • Build/improve models in response to community demand (ongoing every quarter)
    • In progress
      • Hire ML Engineering Manager


Search Platform

Team Manager: Guillaume Lederrey

Address Knowledge Gaps
Any new data retention requirements are implemented
Core Work
New query parser is used in production by the end of Q2
WDQS storage expansion
CirrusSearch writes are split into per cluster kafka partitions to isolate clusters from each other by end of Q2
Get "explore similar" running again, with whatever has changed since we last looked at it
Increase understanding of our work outside our team, and outside the Foundation
Improve search quality, especially for non-English wikis by prioritizing community requests - Positive feedback from speakers/community on changes made
CirrusSearch writes can be paused during cluster operations without causing excessive stress on change propagation infrastructure by end of Q2
Rerun "explore similar" A/B test with rigorous analysis of results
Enable cross-wiki searching for 3+ new languages/projects (stretch)
Machine Learning Infrastructure
Glent method 0 (session reformulation) A/B tested and deployed by end of Q2
Learning to Rank (LTR) applied to additional languages and projects to improve ranking (needs experimentation, might not work at all)
Glent method 1 (comparison to other users' queries) offline tested, tuned, A/B tested and possibly deployed end of Q2
Structured Data
Proof of Concept SPARQL endpoint for SDoC is available on WMCS and updated weekly. (stretch)

Dependencies on:

Status

  • October 28, 2019 status:
    • WDQS storage expansion In progress (Quote requested, waiting for feedback from vendor)
    • Glent method 0 (session reformulation) A/B tested and deployed by end of Q2 In progress (A/B test running, still need to evaluate results and activate in production (provided the results are positive))
    • Glent method 1 (comparison to other users' queries) offline tested, tuned, A/B tested and possibly deployed end of Q2 In progress (Some quality issues are identified in offline tests and need to be addressed before we can move forward. The biggest problems are that we are looking at edit distance per-string rather than per-token (probably because we thought too much about single word queries, where per-string and per-token are the same thing), and that Method 1 is too ready to add spaces or change the first letter of a word, all of which can make the "semantic distance" between a query and a suggestion much bigger.)
    • Proof of Concept SPARQL endpoint for SDoC is available on WMCS and updated weekly. In progress (SPARQL endpoint for SDC (Commons Query Service - CQS) is blocked on having dumps from SDC that we can load on the endpoint.)
  • December 12, 2019 status:
    • Done
      • Refactor query highlighting
      • RDF export
      • Address the indexing issues of MediaInfo (labels vs descriptions)
      • Full data reimport for WDQS to enable optimizations
      • Start the hiring process for a new WDQS Engineer
    • In progress
      • Refactor Mjolnir jobs into separate smaller jobs
      • 2.1. Hardware renewal: replace elastic1017-1031
      • 3.1. "Did you mean" suggestions: deploy method0 to production by end of Q2
      • Improve WDQS updater performance
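
The October 28 note on Glent method 1 contrasts edit distance measured per-string with edit distance measured per-token. The sketch below illustrates that difference with a plain Levenshtein distance. It is not Glent's code: the helper names, the whitespace tokenization, and the choice to return None when the number of tokens changes are all assumptions made for illustration.

```python
def levenshtein(a, b):
    """Character-level Levenshtein distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def per_string_distance(query, suggestion):
    """Edit distance over the whole query, spaces included."""
    return levenshtein(query, suggestion)


def per_token_distances(query, suggestion):
    """Edit distance for each aligned token pair (hypothetical helper).

    Returns None when the suggestion changes the number of tokens,
    for example by adding or removing a space.
    """
    q_tokens, s_tokens = query.split(), suggestion.split()
    if len(q_tokens) != len(s_tokens):
        return None
    return [levenshtein(q, s) for q, s in zip(q_tokens, s_tokens)]


# Single-word query: both views agree.
print(per_string_distance("wikipdia", "wikipedia"))  # 1
print(per_token_distances("wikipdia", "wikipedia"))  # [1]

# Multi-word query: a per-string total of 2 hides one edit in each token.
print(per_string_distance("cat dog", "bat fog"))     # 2
print(per_token_distances("cat dog", "bat fog"))     # [1, 1]

# An added space costs 1 per-string but changes the token boundaries.
print(per_string_distance("notebook", "note book"))  # 1
print(per_token_distances("notebook", "note book"))  # None
```

In this sketch, a per-string budget treats an inserted space like any other single-character edit, while the per-token view makes it visible that the suggestion has a different set of words.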


Security

Team Manager: John Bennett

Core Work
Security Engineering and Governance
Create initial version of PHP security toolkit
Deploy StopForumSpam
Create privacy engineering charter
Incident response Table Top and updates to security after action reports and improvement plans
Release of Phan 2.x
Update and publish data classification policy
Create initial set of security measurements and metrics
Publish data protection and retention guidelines (goal is being refined)
Bug Bounty SOP
Draft new employee security awareness content
Publication of privacy review template
Finalize and publish Security services catalog
Vulnerability Management
ERM implementation
Supplier assessments
Draft 3 new Security Incident Response playbooks Q2
Draft 3 new security policies Q2
Security release Q2
Assess, produce, and socialize Security documentation
Create or improve language-based best security practices documentation
Perform 2 phishing campaigns and provide awareness content
Assess / Refine Phab Usage and Workflows
Facilitate Agile / Scrum adoption
Develop Security PM Best Practices

Dependencies on:

Status

  • October 28, 2019 status (all are In progress)
    • Create initial version of PHP security toolkit
    • Deploy StopForumSpam
    • Create privacy engineering charter
    • Incident response Table Top and updates to security after action reports and improvement plans
    • Release of Phan 2.x
    • Update and publish data classification policy
    • Create initial set of security measurements and metrics
    • Publish data protection and retention guidelines (goal is being refined)
    • Draft new employee security awareness content
    • Publication of privacy review template
    • Finalize and publish Security services catalog
    • ERM implementation
    • Draft 3 new Security Incident Response playbooks Q2
    • Draft 3 new security policies Q2
    • Security release Q2
    • Assess, produce, and socialize Security documentation
    • Create or improve language-based best security practices documentation
    • Perform 2 phishing campaigns and provide awareness content
    • Assess / Refine Phab Usage and Workflows
    • Facilitate Agile / Scrum adoption
    • Develop Security PM Best Practices
  • December 12, 2019 status:
    • Done
      • Team retro, implement agile ceremonies for appsec related projects
      • Draft 3 new security policies
      • Create team learning circles
      • Socialize and Formalize Corrective Action plan for Security Incidents
      • Publication of security team roadmap
      • Phishing Security Awareness, at least 2 completed Phishing campaigns
      • Security release Q1
      • Discovery ticket for ElastAlert detection and alerting


Site Reliability Engineering

Directors: Mark Bergsma and Faidon Liambotis

Cross-cutting
Begin hiring for the SRE Engineering Manager positions and ensure at least 4 candidates are interviewed by the end of Q2, to position ourselves to fill our remaining IC positions
Deliver 80% of the asks set by the System of Performance project by EOQ

Service Operations

Team Manager: Mark Bergsma

Core Work
Finish what we started: Cleanup remnants of HHVM from our infrastructure by end of Q2
Migrate core software components of the Deployment Pipeline to current major releases

Data Persistence

Team Manager: Mark Bergsma

Core Work
Ensure general backup service is migrated to new hardware infrastructure by end of Q2 and general backup runs are monitored for basic success/failure criteria

Traffic

Team Manager: Brandon Black

Core Work

Infrastructure Foundations

Team Manager: Faidon Liambotis

Core Work
Integrate with Netbox for device selection and topology data gathering
Assist with adoption of at least 2 additional services into the Deployment Pipeline by service owners by end of Q2
Develop a new alert notification, escalation and paging capability to accommodate the increased needs of the team and department.
Enable opt-in 2FA for web services SSO
Extend security vulnerability tracking for container images
Upgrade the Elastic/Logstash version to >= 7.2
Replace/renew the internal Certificate Authority (expires Jun 2020)
Reduce the number of service clusters running a soon-to-be unsupported Debian release by 8
Reduce the number of manual steps involved in the provisioning and decommissioning of new services by 1
Drive the configuration of the networking infrastructure via automated means & ensure multiple team members are able to deploy new configuration

Observability

Team Manager: Faidon Liambotis

Core Work

Data Center Operations

Team Manager: Willy Pao

Core Work
Deliver 80% of new installs by their requested need-by date.
Complete decommission of at least 50% (currently 48 tasks) of existing decommission tasks in eqiad, with servers completely unracked, to make room for new installs.
Grant root access for Papaul, to take over remote portion of decommissioning servers in eqiad.
Complete the rebuild/refresh of the esams caching facility in/near Amsterdam by end of October.
Upgrade all PDUs in eqiad to new Servertech models (15 racks total) by end of November.
Return all servers back to Cisco from previous server donations by end of Q2.
Identify at least 3 new vendors as potential options for future disposition and sale of goods/services.
Order and upgrade all PDUs in eqsin by end of quarter.
Provide proper training for dc-ops team for receiving equipment in Coupa.
Partner with Finance and determine point person for submitting orders in Coupa.
Utilize bi-weekly meetings with Finance to target and resolve all issues within Coupa that may impede our current hardware procurement process.

Dependencies on:

Status

  • October 28, 2019 status: all goals below are In progress
    • Deliver 80% of new installs by their requested need-by date.
    • Complete decommission of at least 50% (currently 48 tasks) of existing decommission tasks in eqiad, with servers completely unracked, to make room for new installs.
    • Grant root access for Papaul, to take over remote portion of decommissioning servers in eqiad.
    • Complete the rebuild/refresh of the esams caching facility in/near Amsterdam by end of October.
    • Upgrade all PDUs in eqiad to new Servertech models (15 racks total) by end of November.
    • Return all servers back to Cisco from previous server donations by end of Q2.
    • Identify at least 3 new vendors as potential options for future disposition and sale of goods/services.
    • Order and upgrade all PDUs in eqsin by end of quarter.
    • Utilize bi-weekly meetings with Finance to target and resolve all issues within Coupa that may impede our current hardware procurement process.
    • Ensure general backup service is migrated to new hardware infrastructure by end of Q2 and general backup runs are monitored for basic success/failure criteria
    • Finish what we started: Cleanup remnants of HHVM from our infrastructure by end of Q2
  • December 12, 2019 status:
    • Done
      • Clear out existing decommissioned hardware in ulsfo and codfw
      • Implement a new hardware repair template & refine existing triaging processes
      • Implement general template form for service owners to fill in
      • Improve average end-to-end turnaround time from hardware request to hardware delivery
      • Clean up existing backlog of Netbox inconsistencies and data errors
      • Maintain zero error reports going forward (catch up and get to close to zero)
      • Determine alternative disposition company for Juniper equipment
      • Hire and on-board a contractor for additional support in eqiad
      • Keep all Netbox reports in a "passed" state
      • Identify 3rd party contractor to take care of straightforward tasks at remote caching sites
      • Tighten up procurement cycle by implementing regularly scheduled deadlines for quotes, approvals, and purchase orders
      • [stretch] Deploy codfw non-Mediawiki database proxies
      • Failover all codfw masters
      • Transfer ownership and knowledge of Bacula backup infrastructure
      • Deploy new Bacula hardware
      • Failover eqiad masters to new hosts and decommission old masters
      • Order, rack and setup 10 new hosts in codfw
      • [stretch] Migrate general backup service from old to new host(s)
      • Ensure general backup service is migrated to new hardware infrastructure by end of Q2 and general backup runs are monitored for basic success/failure criteria
      • Build a production prototype of an Apereo CAS identity provider
      • [stretch] Evaluate Netbox to store network secrets
      • Switch (at least) one service to authenticate against the identity provider
      • Import existing management interface IPs into Netbox
      • Move all application server & API traffic to PHP 7
      • Support migration of services RESTrouter, wikifeeds by service owners
      • Move jobrunners to PHP 7
      • Move maintenance scripts to PHP 7
      • Begin testing a small fraction of live cache_text traffic through ATS backends
      • Finish evaluating current running implementation under live test
      • Switch most production hosts to using anycast recdns @ 10.3.0.1
      • Implement any minor improvements we need (anycast, etc)
      • Decide on Prometheus vs Webrequest
    • In progress In progress
      • Gradually migrate all MediaWiki instances to read the database configuration from etcd
      • Set up MediaWiki to optionally read the database configuration from etcd
      • Productionize dbctl (deploy, import data, set up alerts)
      • Automate the assignment of new host's management interface IP
      • Establish periodic alerts reviews, complete one by EOQ
      • Produce and circulate an alerting infrastructure roadmap
      • Reduce Icinga alert noise
      • [stretch] Remove HHVM from production
      • Switch production edge TLS termination to ATS
      • Implement basic TLS termination for cache_text services (may not be final solution w/ real PKI)
      • Design new dynamic response architecture for future needs
      • Continuation of previous Q goal - Finish TLS deployment via ATS
    • N Postponed or N Blocked
      • Iterate on a process for running the incident documentation review board; review 90% of incident documents written this quarter
      • [stretch] Research possible implementations for synchronizing team contact information to everyone's phone
      • Produce a standardized template for a status document for ongoing major incidents
      • Automate the generation of management interface DNS records
      • Add safe push method for the configuration: interactive and sequential
      • Upgrade production PuppetDB to 6.2 in both data centers
      • Productionize existing configuration management software (jnt)
      • Upgrade all production Puppetmasters to Puppet 5.5
      • Define and document the process for service owners to deploy a new service onto the pipeline


Technical Engagement

Team Manager: Birgit Müller

Core Work
Yes Done - [IaaS] All out of warranty hardware used for offsite backups of Cloud Services data in the codfw datacenter is replaced
In progress In progress - [IaaS] 60% of the remaining Debian Jessie systems in the hardware layer underlying Cloud VPS are upgraded to Debian Buster or Stretch
In progress In progress - [IaaS] All Debian Jessie instances are removed/replaced in 95% of Cloud VPS hosted projects
Yes Done - [IaaS] Deploy a minimum viable Ceph cluster in eqiad and convert 1+ cloudvirt servers to use it for instance storage
To do To do - [IaaS] Measure IOPS as seen at the instance level, IOPS as seen at the Ceph cluster level, and network activity generated in delivering IOPS at the backbone network level to produce a forecast for impact of full conversion of cloudvirt servers to Ceph instance storage.
In progress In progress - [IaaS] Create a shared understanding of systems and service continuity and availability constraints in the current Cloud VPS product which can be used to design follow-on projects to reduce single points of failure and establish practices for testing and maintaining continuity and availability of Cloud VPS core services.
Yes Done - [IaaS] OpenStack APIs and services are upgraded to the "Ocata" release
Yes Done - [PaaS] Deploy a Kubernetes 1.15.2+ cluster in Toolforge which will be used to provide a more modern, secure, and performant PaaS baseline to Tool maintainers.
In progress In progress - [PaaS] Migrate 5+ early adopter/beta tester tools from legacy Kubernetes cluster to new Kubernetes cluster to validate integration with ingress proxy layer and sandboxing/isolation of new Kubernetes cluster deployment.
In progress In progress - [PaaS] Create timeline and operational plan for migrating all Kubernetes workloads in Toolforge to the new Kubernetes cluster and decommissioning the legacy cluster by the end of FY19/20.
Yes Done - [Docs] Create a functional template and content checklist for Help pages in the Toolforge and Cloud VPS technical content collections.
Yes Done - [Docs] Establish a technical content review process with developers on WMCS team.
Yes Done - [Docs] Noticeably improve readability for 5 instances of Toolforge and Cloud VPS "Help" documentation on Wikitech.
Reduce Complexity of the Platform, Movement Diversity
Increased visibility & knowledge of technical contributions, services and consumers across the Wikimedia ecosystem
In progress In progress - Create a blog by and for technical audiences where members of the technical community can post about their technical work
N Postponed - Publish 6 (min) technical blog posts
In progress In progress - Coordinate Tech Talks and increase views on tech talks by 10%/quarter
In progress In progress - Prepare release of 2nd edition of the Tech Community Newsletter (publishing date: Jan 2020)
Yes Done - A dashboard for Wikimedia Cloud Services edit data is available to the Wikimedia movement
To do To do - Provide a “showroom” that introduces newcomers to a variety of different tools to show what developers can do in Toolforge by Q3
To do To do - Find out what is needed to get data on all technical contributions/contributors
In progress In progress - Coordinate with Bitergia and get data on "Avg. Time Open (Days)" for Gerrit patchsets per affiliation and "time to first review" data for patches (by end of Q4).
To do To do - Gather and publish current numbers on technical contributions provided by Bitergia in the Quarterly Tech Community newsletter (by Jan 2020)
Reduce Complexity of the Platform, Movement Diversity
Support Wikimedia's diverse technical communities
To do To do - Develop workshop concept with partner community for technical workshops in Q3
To do To do - Conduct workshop and document the technical challenges small wikis face in North America
In progress In progress - Coordinate GCI. In Q2/Q3, in Google Code-in, > 35 mentors volunteer to provide tasks and mentor students in >70 task instances
In progress In progress - Coordinate Outreachy round 19. At least 5 featured projects are accepted for Outreachy round 19 by Oct 1st Yes Done. At least five projects are successfully completed by Outreachy interns by end of Q3.
In progress In progress - Prepare and hold session on Wikimedia's Tech internships at WikiCon North-America

Core work depends on: SRE/Data Center Operations team

Status

  • October 28, 2019 status:
    • All Debian Jessie instances are removed/replaced in 95% of Cloud VPS hosted projects (Annual unused project/instance purge) In progress In progress
    • 60% of the remaining Debian Jessie systems in the hardware layer underlying Cloud VPS are upgraded to Debian Buster or Stretch (Cloud VPS Domain name(s) migration) In progress In progress
    • Create a shared understanding of systems and service continuity and availability constraints in the current Cloud VPS product which can be used to design follow-on projects to reduce single points of failure and establish practices for testing and maintaining continuity and availability of Cloud VPS core services. In progress In progress
    • Deploy a Kubernetes 1.15.2+ cluster in Toolforge which will be used to provide a more modern, secure, and performant PaaS baseline to Tool maintainers. In progress In progress
    • Technical internships + mentoring - Q2 In progress In progress
    • Coordinate new rounds, GCI In progress In progress
    • Create a blog by and for technical audiences where members of the technical community can post about their technical work. In progress In progress
    • Publish 6 (min) technical blog posts In progress In progress
    • Coordinate Tech Talks and increase views on tech talks by 10%/quarter In progress In progress
    • A dashboard for Wikimedia Cloud Services edit data is available to the Wikimedia movement In progress In progress
    • Coordinate with Bitergia and get data on "Avg. Time Open (Days)" for Gerrit patchsets per affiliation and "time to first review" data for patches (by end of Q4). In progress In progress
    • Coordinate GCI. In Q2/Q3, in Google Code-in, > 35 mentors volunteer to provide tasks and mentor students in >70 task instances In progress In progress
    • Coordinate Outreachy round 19. At least 5 featured projects are accepted for Outreachy round 19 by Oct 1st Yes Done. At least five projects are successfully completed by Outreachy interns by end of Q3. Yes Done
    • At least five projects are successfully completed by Outreachy interns by end of Q3. In progress In progress
    • Prepare and hold session on Wikimedia's Tech internships at WikiCon North-America In progress In progress
  • December 12, 2019 status:
    • Yes Done
      • Develop Technical Engagement narrative and shared understanding in the team
      • Technical internships and mentoring: Mentor 3 students in GSOD, GSOC, Outreachy
      • Blog posts on Small Wiki Toolkits & Coolest Tool Award
      • Conduct Coolest Tool Award 2019
      • Design & publish Tech Engagement quarterly newsletter Ed1
      • Continue Tech Talks
      • Develop support format: Coordinate Small Wiki Toolkits focus area, create toolkits & experiment, evaluate, iterate, document
      • Provide continuous bug management support in Phabricator (ongoing)
      • Publish Technical Contributors Map
      • Develop visualization tool for WMCS edit data/integrate WMCS edit data in existing tools
      • Advocate for better processes to support developer productivity (ongoing)
      • HA for OpenStack API endpoints (keystone, glance, nova, designate)
      • Improve Toolforge documentation (ongoing every quarter)
      • Jessie deprecation (infra + Cloud VPS)
      • OpenStack version upgrade(s)
      • Toolforge Kubernetes redesign/upgrade
      • Improve Cloud VPS documentation (ongoing)
    • In progress In progress
      • Hire Developer Advocate