Wikimedia Platform Engineering/MediaWiki Core Team/Backlog

This page contains the backlog for the MediaWiki Core team.

Items designated as high priority are either in active development, or are being considered for development during either the current or next quarter. Items are designated as medium priority if they lack the necessary details to be worked on or if they are not planned for the current or next quarter. Items are designated as low priority if they are recognised as good ideas, but are not currently being considered in the next few quarters.

For more details of the process behind this backlog, see the process subpage.

In progress

 * HHVM
 * CirrusSearch
 * SUL finalization

Library infrastructure for MediaWiki
See Library infrastructure for MediaWiki

Job queue work (done)
There have been repeated calls in the past for a "proper" job queue implementation for Wikimedia production use, where "proper" means "not relying on a shell script". Some recent discussion on the operations@ mailing list led the team to at least consider this as a possible near-term project.

The tracking bug that comes closest to describing this project is Bug 46770 “Rewrite jobs-loop.sh in a proper programming language”. That is probably an oversimplification, though. In addition to removing shell scripting, we would at least like better monitoring of the system and a more sophisticated system for assigning priorities to jobs. Aaron Schulz has already made the system much better with many incremental improvements over the past 18 months that, taken together, represent a pretty massive improvement to the features and reliability of the system. Hence, we've decided not to make further work on this a priority until an urgent need is identified.

Benefits
 * Better monitoring
 * Fix poor code quality
 * Better performance (less overhead due to less need to start processes)
 * The ability to wrap runJobs.php in a per-section PoolCounter pool. It has recently become apparent (Ie5bb11b0) that if the CPU power available to the job runners is going to be efficiently utilised, then the number of job runners running on any given DB master needs to be limited.
 * Make it straightforward to provide a configuration file. Currently Puppet configures jobs-loop.sh in two different ways simultaneously: by changing the command line parameters via JR_EXTRA_ARGS in mw-job-runner.default, and by altering the code itself, since the shell script is a template. I don't think either is an especially elegant configuration method.
 * The ability to check for the exit status from runJobs.php, somehow avoiding a tight loop of respawns.
 * An error log might be nice too.
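The exit-status and error-log items above could be sketched as a small supervisor loop. This is only an illustration in Python, not a proposed design: `next_delay`, `supervise`, and the backoff constants are hypothetical names and values, and `command` stands in for whatever invocation of runJobs.php would actually be used.

```python
import subprocess
import sys
import time


def next_delay(current, exit_code, base=1.0, cap=60.0):
    """Return the pause in seconds before the next respawn."""
    if exit_code == 0:
        return 0.0          # clean exit: respawn immediately
    if current <= 0:
        return base         # first failure in a row
    return min(current * 2, cap)  # repeated failure: double, up to the cap


def supervise(command, max_runs, run=subprocess.run, sleep=time.sleep):
    """Run `command` up to `max_runs` times, backing off after failures.

    `run` and `sleep` are injectable so the loop can be exercised
    without spawning processes. Returns the exit codes seen.
    """
    delay = 0.0
    exit_codes = []
    for _ in range(max_runs):
        code = run(command).returncode
        exit_codes.append(code)
        delay = next_delay(delay, code)
        if code != 0:
            # Minimal error log: the failure and the pause chosen.
            print(f"{command!r} exited with {code}; pausing {delay:.0f}s",
                  file=sys.stderr)
        if delay > 0:
            sleep(delay)
    return exit_codes
```

Because the delay doubles after each consecutive failure and resets on a clean exit, a persistently failing runner cannot respawn in a tight loop, while a healthy one is restarted without any pause.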

Other work:
 * Implement immediate priority jobs (replaces DeferredUpdates basically?) <-- I don't think this is done, but everything above it is.

Push Phabricator to production
The Engineering Community Team is currently working toward consensus around standardizing project management, and ended up with the possibility of a much bigger project: replacing many of our tools (Gerrit, Bugzilla, Trello, Mingle, maybe even RT) with Phabricator.

File handling and thumbnailing
The Multimedia team will, at some point, need to embark on substantial work on our file upload pipeline, how we name files, and how we handle thumbnailing. Aaron Schulz in particular has deep expertise in this area. At this point, the Multimedia team hasn't requested help in this area, so it's premature for us to get involved, but this may come up again in July.
 * Next step: scoping by Aaron and/or Gilles
 * Include version in thumbnail URL (17577)

Elasticsearch category intersection

 * Elasticsearch category intersection with simple JS UI embedded on CategoryPage, presented to the user as "filtering" or "refining" a category.
 * Needs scoping with Chad and Nik
 * Possibly medium priority

Elasticsearch search quality

 * Tuning "fat fingering" and other stuff (talk to Howie for more)

Setup.php improvements

 * Setup.php speed improvements and service registry

Next step: Aaron to explain what this means :-) (Maybe: speed up MediaWiki initialization? We've talked about having a configuration backend that is a good fit for the complexity and performance requirements of WMF's deployment)
 * Design work on service registry (e.g. advancing Andrew Green's dependency injection RFC)

Dedicated cluster for private/closed wikis

 * Dedicated cluster for private/closed wikis (e.g. officewiki)
 * Need to work with Ops on priority
 * Need to establish which piece of this would be owned by MediaWiki-Core
 * Look at existing things like the fundraising cluster and zero config wiki

Authn/z service to interact with other services
As we move towards independent services (SOA), we need a system for identifying users and their rights across services, which involves building infrastructure for providing proper authentication tokens for API access. Possible solutions for this include OAuth 2, Kerberos, or a custom solution built on JSON Web Tokens; the solutions for non-interactive applications are not as clear as they are for interactive web applications (where basic OAuth works pretty well).


 * Requests for comment/SOA Authentication
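To make the JSON Web Token option more concrete, here is a minimal sketch of issuing and verifying an HMAC-signed (HS256) token using only the Python standard library. It is an illustration under assumptions, not a proposed design: the function names and the `rights` claim are hypothetical, and expiry checks, key rotation, and algorithm negotiation are deliberately left out.

```python
import base64
import hashlib
import hmac
import json


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs do."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    padding = "=" * (-len(text) % 4)
    return base64.urlsafe_b64decode(text + padding)


def issue_token(claims: dict, secret: bytes) -> str:
    """Return header.payload.signature signed with HMAC-SHA256."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(secret, signing_input.encode("ascii"),
                         hashlib.sha256).digest()
    return signing_input + "." + _b64url(signature)


def verify_token(token: str, secret: bytes):
    """Return the claims if the signature is valid, otherwise None."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        supplied = _b64url_decode(signature_b64)
    except ValueError:
        return None  # wrong number of parts or malformed base64
    signing_input = f"{header_b64}.{payload_b64}".encode("ascii")
    expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, supplied):
        return None
    return json.loads(_b64url_decode(payload_b64))
```

A service holding the shared secret can check a user's identity and rights without a round trip to MediaWiki, which is the property that matters for non-interactive clients. With a shared secret every verifier can also mint tokens, so an asymmetric signature may be a better fit once several independent services are involved.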

Performance metrics
There is quite a lot of data emitted and logged all over our infrastructure, but a lot of work still needs to happen in specifying what the numbers actually represent, in a way that is both mathematically credible and convincing and understandable to people. Aggregation sometimes happens at every one of these layers: the application emitting metrics, statsd, graphite (carbon), and graphite (whisper). So it's very hard to say exactly what the numbers represent. MediaWiki profiling data is a good example: Profiler.php aggregates, mwprof aggregates that, Asher's Python script aggregates that, carbon aggregates that, and whisper aggregates that.

So we need:
 * Infrastructure that collects and summarizes data in transparent and credible ways
 * A small set of well-defined metrics that capture site health and that people trust and understand
 * Regular reporting
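A small worked example shows why stacked aggregation makes the numbers hard to interpret. The data and function names below are invented for illustration. Averaging per-host averages gives a very different answer from the true mean, whereas passing (sum, count) pairs between layers keeps the result exact.

```python
from statistics import mean

# Hypothetical request timings in milliseconds, grouped by app server.
timings_by_host = {
    "host-a": [1000],      # one slow request
    "host-b": [10] * 99,   # many fast requests
}


def mean_of_means(groups):
    """What you get when each layer averages the previous layer's averages."""
    return mean(mean(values) for values in groups.values())


def true_mean(groups):
    """The mean over every individual request."""
    return mean(v for values in groups.values() for v in values)


def mergeable_summary(groups):
    """Summarise each group as (sum, count) so later layers can combine exactly."""
    return [(sum(values), len(values)) for values in groups.values()]


def combine(summaries):
    """Recover the exact mean from (sum, count) pairs."""
    total = sum(s for s, _ in summaries)
    count = sum(n for _, n in summaries)
    return total / count
```

Here the mean of means is 505 ms, while the true mean over all 100 requests is 19.9 ms. The same concern applies to percentiles, which cannot be recovered from averaged data at all.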

Localisation cache do-over

 * We have a distributed data store cobbled together with shell scripts, rsync, etc. It's incredibly opaque (to me, at least).
 * There are some security issues with it (I don't remember the details..)
 * It's the biggest bottleneck on the deployment process. But truly novel data (i.e., new messages/translations) represents a tiny sliver of the byte payload, which consists primarily of unmodified messages that already exist on each deployment target.

One thing that makes our current deployments unreliable is the fact that we have many workarounds to make it possible to avoid rewriting the l10n cache. Ideally, we would use "scap" no matter how small the change we're making to the site, but that's not practical due to l10n cache rebuild and distribution often taking over 10 minutes. Some of us have some nascent ideas about how to make this faster, but we haven't yet agreed on a solution we want to pursue.
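One of the possible directions, shipping only the messages that actually changed, can be sketched as follows. This is a hypothetical illustration, not an agreed solution: the function names and message keys are invented, and the real cache format is not modelled.

```python
import hashlib


def manifest(messages):
    """Map each message key to a digest of its text."""
    return {
        key: hashlib.sha256(text.encode("utf-8")).hexdigest()
        for key, text in messages.items()
    }


def delta(new_messages, target_manifest):
    """Work out what a target needs, given the manifest it already holds.

    Returns (changed, removed): `changed` maps keys to the new text for
    messages that are new or modified; `removed` lists keys to delete.
    """
    new_manifest = manifest(new_messages)
    changed = {
        key: new_messages[key]
        for key, digest in new_manifest.items()
        if target_manifest.get(key) != digest
    }
    removed = sorted(set(target_manifest) - set(new_manifest))
    return changed, removed


def apply_delta(messages, changed, removed):
    """Apply a delta on the deployment target and return the updated messages."""
    updated = dict(messages)
    updated.update(changed)
    for key in removed:
        updated.pop(key, None)
    return updated
```

The payload sent to each target is then proportional to the number of new or modified messages rather than to the size of the whole cache.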

Next step: RFC from Ori and/or Bryan Davis

Central code repo
Gadgets, Lua, templates

API cleanup
Requests for comment/API roadmap

A/B testing framework

 * Port PlanOut to PHP: https://facebook.github.io/planout/
 * See ACM paper: "Designing and Deploying Online Field Experiments": https://www.facebook.com/download/255785951270811/planout.pdf (PDF)
 * Port BetaFeatures to core, or provide some other mechanism for feature flags in core.
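Both experiment assignment and percentage-based feature flags can be built on deterministic hashing, so that a given user always lands in the same bucket without any stored state. The sketch below shows that general technique in Python. It is not PlanOut's API, and the function names and experiment names are hypothetical.

```python
import hashlib


def _bucket(salt, unit, size):
    """Hash (salt, unit) to a stable integer in range(size)."""
    digest = hashlib.sha1(f"{salt}.{unit}".encode("utf-8")).hexdigest()
    return int(digest[:15], 16) % size


def assign_variant(experiment, user_id, variants):
    """Deterministically map a user to one of `variants` for an experiment."""
    return variants[_bucket(experiment, user_id, len(variants))]


def feature_enabled(flag, user_id, rollout_percent):
    """Return True for roughly `rollout_percent` percent of users."""
    return _bucket(flag, user_id, 100) < rollout_percent
```

Using the experiment or flag name as the salt means that a user's bucket in one experiment is independent of their bucket in another.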

Revision storage revamp
This would involve:
 * Specifying an API for storage of revisions; anything implementing that API could be trivially used with MediaWiki.
 * Refactor core so that revision storage is an implementation of that API.
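As a rough sketch of what such an API might look like, the following Python defines an abstract storage interface and a trivial in-memory implementation. All names here (`StoredRevision`, `RevisionBackend`, `InMemoryRevisionBackend`) are hypothetical and do not correspond to existing MediaWiki classes. A real interface would also need to cover metadata queries, deletion, and content models.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class StoredRevision:
    """Immutable value object for one revision."""
    rev_id: int
    page_id: int
    parent_id: Optional[int]
    text: str


class RevisionBackend(ABC):
    """The storage API; any implementation could be swapped in."""

    @abstractmethod
    def save(self, page_id: int, text: str) -> StoredRevision:
        """Store a new revision of a page and return it."""

    @abstractmethod
    def get(self, rev_id: int) -> Optional[StoredRevision]:
        """Return one revision, or None if it does not exist."""

    @abstractmethod
    def latest(self, page_id: int) -> Optional[StoredRevision]:
        """Return the newest revision of a page, or None."""


class InMemoryRevisionBackend(RevisionBackend):
    """Simplest possible implementation, e.g. for tests."""

    def __init__(self) -> None:
        self._revisions: Dict[int, StoredRevision] = {}
        self._latest: Dict[int, int] = {}
        self._next_id = 1

    def save(self, page_id: int, text: str) -> StoredRevision:
        revision = StoredRevision(
            rev_id=self._next_id,
            page_id=page_id,
            parent_id=self._latest.get(page_id),
            text=text,
        )
        self._revisions[revision.rev_id] = revision
        self._latest[page_id] = revision.rev_id
        self._next_id += 1
        return revision

    def get(self, rev_id: int) -> Optional[StoredRevision]:
        return self._revisions.get(rev_id)

    def latest(self, page_id: int) -> Optional[StoredRevision]:
        rev_id = self._latest.get(page_id)
        return None if rev_id is None else self._revisions[rev_id]
```

Callers that depend only on the abstract interface would not need to change when the backing store is replaced by an SQL-based or service-based implementation.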

There are two parts to revision storage: revision metadata and revision text. For revision text storage, we have External Storage, which was very innovative in its day, but is showing its age. Due to the design, reads and writes are largely focused on a single node (where the most recent revisions are), so adding nodes doesn't necessarily improve performance. The compression scripts may be fine, but we don't know for sure because we're a bit afraid of running them to find out. Thus, we're not really getting the benefit of the compression offered by the system.

A likely solution for storing revision text is Rashamon. Metadata could be stored in Rashamon as well, but we may still need a copy of the metadata in our SQL database for fast and simple querying. Regardless of our plans for Rashamon, there is significant work involved in MediaWiki to abstract away the many built-in assumptions that our code has about retrieving revision metadata directly from a database.

This project would be done in service of a broader push toward a service-oriented architecture. We would use this project to set an example for how we foresee other aspects of MediaWiki being turned into modular services, and would likely use this as an opportunity to further establish use of the value-object pattern exemplified in TitleValue. As part of this work, we would also provide a simple SQL-based implementation for developer installations of MediaWiki. We would also like to work on the infrastructure for providing proper authentication tokens for API access (see "Authn/z service to interact with other services" above). Possible solutions for this include OAuth 2, Kerberos, or a custom solution built on JSON Web Tokens; the solutions for non-interactive applications are not as clear as they are for interactive web applications (where basic OAuth works pretty well).

This project also offers us an opportunity to make revision storage a nicely abstracted interface in core, and could be quite complementary to Rashamon, giving Gabriel some clean API points to plug it in. It also offers us the ability to establish the template for how MediaWiki can be incrementally refactored into cleanly-separated services.

After some investigation, we decided to pass on this project. What made us consider it were indications that our current revision table for English Wikipedia is becoming very unwieldy, and that it's really time for a sharded implementation. However, Sean Pringle made some adjustments, and we're comfortable that we have enough headroom that this project isn't as urgent as we first thought.


 * Should we ping Sean to check if this is still his opinion? (Yes.)

OpenID Connect
One thing we've long wanted to do as complementary work with our OAuth work is to implement some form of standardized federated login. We briefly considered working on OpenID Connect, which integrates well with OAuth. Doing work in this area would make Phabricator integration with our Wikimedia project logins very simple.

See also: OpenID provider (request for OpenID for use in Tool Labs)

Admin tools
See Admin tools development

Next step: Dan to offer specific choices for the group to consider.

git-deploy / deploy-tooling

 * will be fleshed out within the wider "Deployment status and improvements" work (on point: Greg)
 * Dependency: Localisation cache

Moving VCL logic to MediaWiki

 * Separate Cache-Control header for proxy and client (48835)
 * IMO Ops need to reconcile themselves with the fact that there will always be compelling reasons to keep *some* application logic on the edge. We should use a Lua VMOD, expose some of Varnish's APIs to Lua, and use that to replace complicated VCL that relies on inline C.
 * You're saying they disagree with this? They have always seemed keen to move logic from app to VCL, to me. [Tim]
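For the Cache-Control item in the list above, HTTP already distinguishes the two audiences: `max-age` applies to any cache, while `s-maxage` overrides it for shared caches such as the Varnish layer. A minimal sketch of the application building such a header follows. The function name and parameters are hypothetical, and this is not how MediaWiki currently generates its headers.

```python
def cache_control(client_max_age, proxy_max_age):
    """Build a Cache-Control value with separate proxy and client lifetimes.

    `s-maxage` is honoured by shared caches (proxies); `max-age` is what
    browsers and other private caches use.
    """
    if client_max_age < 0 or proxy_max_age < 0:
        raise ValueError("lifetimes must be non-negative")
    directives = [f"s-maxage={proxy_max_age}"]
    if client_max_age == 0:
        # Clients may not reuse a stored copy without checking back first.
        directives.append("must-revalidate")
    directives.append(f"max-age={client_max_age}")
    return ", ".join(directives)
```

For example, `cache_control(0, 2678400)` lets the proxy layer keep a page for 31 days while browsers always revalidate.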

LogStash

 * buuugs

User attribution

 * Infrastructure for "claim an edit" feature
 * Making user renaming simple and fast

Installation consolidation
Aligning MediaWiki developer and third-party installation methods. Easy install of more complicated MediaWiki installs (e.g. Parsoid, PDF rendering, Math, etc). Possible use of Vagrant-Composer?
 * https://developers.google.com/compute/docs/containers


 * Composer package types:
 * https://github.com/composer/installers
 * there exists a very nominal 'mediawiki-extension' type: https://github.com/composer/installers/blob/master/src/Composer/Installers/MediaWikiInstaller.php
 * https://bugzilla.wikimedia.org/show_bug.cgi?id=65188#c3


 * Assess viability of deploying MediaWiki using Containers to both the production cluster and various cloud platforms
 * ...as a way of improving our internal deployment process
 * ...as a way of making MediaWiki substantially easier to install by third parties

Configuration management

 * Allowing Stewards to set certain things in the UI (e.g. per-wiki logos)
 * Cleaner command line methods to make configuration changes
 * Get rid of configuration globals
 * Requests for comment/Extension registration


 * https://gerrit.wikimedia.org/r/#/c/109850/

Better Captcha infrastructure
possibly as part of thumbnailing revamp
 * Periodically, someone needs to manually generate a new batch of unused captchas.
 * Need more from Aaron

OAuth UX refinement
The OAuth user interface needs refinement to improve the user experience. The user experience on mobile also needs to be looked into, as the mobile workflow was basically ignored in the initial release of OAuth.

User preferences
Possible high priority for another team.
 * Requests for comment/Redesign user preferences

Edit notices

 * 22102: Add edit notices/warnings in a consistent fashion, and put them in a sensible order

Structured license metadata
tl;dr: it would be good if MediaWiki had a way to include structured license/source data in the wiki, because the assumption that the page has a single source is slowly-but-increasingly inaccurate, and therefore legal is concerned about continuing to assume there is just one license and one set of authors.
 * Moving to Multimedia

The problem
Wikipedia has a gradually increasing number of pages that contain content not created in that wiki page - page merges, page translations, actual third-party sources, etc. These pages cause license problems, because the edit history doesn't have a standard/structured way to say "this came from somewhere else". Current ad-hoc bandaids include licensing information on talk pages; in page histories; or sometimes even in the body of the article. Commons also has a similar problem.

This generally works OK for Wikipedia - the site complies with the license when the article is viewed through the website. However, this may cause increasing problems for display outside of the main website. For example, generated PDFs of a translated article can arguably cause a license violation, because the list of authors doesn't include the authors of the original-language article. It is possible to imagine scenarios where it creates issues in the mobile and API context as well.

Solutions?
We probably need some structured metadata for things like (1) source information (it came from where, when?) (2) license information (3) authorship information, and some UI to add those when necessary. People who are putting together the output (pdf team, mobile team, etc.) would be responsible for accurately using the metadata.
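As a sketch of what such metadata could look like, the following Python models the three kinds of information as plain data, plus the author list an output generator would need. The class and field names are hypothetical and only illustrate the shape of the data, not a storage format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class ExternalSource:
    origin: str               # where the content came from
    imported_on: str          # when it came, as an ISO 8601 date
    license: str
    authors: Tuple[str, ...]


@dataclass
class PageAttribution:
    title: str
    license: str
    local_authors: List[str]
    sources: List[ExternalSource] = field(default_factory=list)

    def all_authors(self) -> List[str]:
        """Local authors first, then source authors, without duplicates."""
        names = list(self.local_authors)
        for source in self.sources:
            names.extend(source.authors)
        return list(dict.fromkeys(names))  # keeps first-seen order

    def all_licenses(self) -> List[str]:
        """Every distinct license that applies to the page content."""
        licenses = [self.license] + [s.license for s in self.sources]
        return list(dict.fromkeys(licenses))
```

A PDF generator, for instance, would call `all_authors()` rather than reading the local edit history alone, so that the authors of a translated article's original are credited.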