Wikimedia Platform Engineering/MediaWiki Core Team/Backlog

This page contains the backlog for the MediaWiki Core team. This backlog is maintained by the Product Manager for Platform (currently Dan Garry).

Items designated as high priority are either in active development, or are being considered for development during either the current or next quarter. Items are designated as medium priority if they lack the necessary details to be worked on or if they are not planned for the current or next quarter. Items are designated as low priority if they are recognised as good ideas, but are not currently being considered in the next few quarters.

For more details of the process behind this backlog, see the process subpage.

In progress

 * HHVM
 * CirrusSearch

Library infrastructure for MediaWiki
Currently, MediaWiki encourages monolithic design by virtue of making tightly-coupled code the easiest way to incorporate new functionality into core. This project would accomplish the following:
 * Incorporate the infrastructure for splitting out libraries for third party use
 * Port some widely used functionality to this infrastructure, with the goal that new libraries need depend only on that functionality (now in library form) rather than the whole of MediaWiki

Candidates for this work
 * MediaWiki components
 * CLDR parser
 * cssmin
 * HashRing
 * Aaron's UUID generator
 * Zip directory reader
 * PHP JSON parser
 * Monolog
 * Profiler
 * There is a lot of code that would be reusable were it not for wfDebug / wfProfile calls, which are often the only things tying it to MediaWiki
 * Other components
 * Pybal

Benefits:
 * Encourages open-source contributions
 * Encourages developers to think in terms of clearly-defined interfaces
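
As a rough sketch of what a split-out library could look like, here is a hypothetical composer.json for a standalone CLDR parser package. The package name, namespace, and PHP version constraint are all invented for illustration, not an agreed convention:

```json
{
    "name": "mediawiki/cldr-plural-parser",
    "description": "Standalone CLDR plural rule parser split out of MediaWiki core",
    "license": "GPL-2.0-or-later",
    "require": {
        "php": ">=5.3.0"
    },
    "autoload": {
        "psr-4": {
            "MediaWiki\\CLDR\\": "src/"
        }
    }
}
```

Core could then depend on such packages via its own composer.json, and third parties could install them without pulling in the rest of MediaWiki.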

Job queue work
There have been repeated calls in the past for a "proper" job queue implementation for Wikimedia production use, where "proper" means "not relying on a shell script". Some recent discussion on the operations@ mailing list led the team to at least consider this as a possible near-term project.

The tracking bug that comes closest to describing this project is Bug 46770, “Rewrite jobs-loop.sh in a proper programming language”. That is probably an oversimplification, though. In addition to removing shell scripting, we would at least like better monitoring of the system and a more sophisticated system for assigning priorities to jobs. Aaron Schulz has already made the system much better with many incremental improvements over the past 18 months that, taken together, represent a pretty massive improvement to the features and reliability of the system. Hence, we've decided not to make further work on this a priority until an urgent need is identified.

Benefits
 * Better monitoring
 * Fix poor code quality
 * Better performance (less overhead due to less need to start processes)
 * The ability to wrap runJobs.php in a per-section PoolCounter pool. It has recently become apparent (Ie5bb11b0) that if the CPU power available to the job runners is going to be efficiently utilised, then the number of job runners running on any given DB master needs to be limited.
 * Make it straightforward to provide a configuration file. Currently Puppet configures jobs-loop.sh in two different ways simultaneously: by changing the command line parameters via JR_EXTRA_ARGS in mw-job-runner.default, and by altering the code itself by making the shell script be a template. I don't think either is an especially elegant configuration method.
 * The ability to check for the exit status from runJobs.php, somehow avoiding a tight loop of respawns.
 * An error log might be nice too.
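
The point about checking exit statuses while avoiding a tight respawn loop could be handled with exponential backoff in the replacement runner. A minimal sketch in Python (the command and delay parameters are illustrative, not a proposed design):

```python
import subprocess
import time

def backoff_delay(failures, base=1.0, cap=300.0):
    """Exponential backoff: 1s, 2s, 4s, ... capped at five minutes."""
    return min(cap, base * (2 ** failures))

def run_forever(cmd):
    """Respawn the job runner, backing off after consecutive failures."""
    failures = 0
    while True:
        result = subprocess.run(cmd)
        if result.returncode == 0:
            failures = 0  # healthy exit: restart immediately
        else:
            failures += 1  # non-zero exit: wait before respawning
            time.sleep(backoff_delay(failures))

# run_forever(["php", "maintenance/runJobs.php", "--wait"])
```

This also gives a natural place to hook in error logging and monitoring, since the supervisor sees every exit status.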

Other work:
 * Implement immediate-priority jobs (essentially replacing DeferredUpdates?)

Push Phabricator to production
The Engineering Community Team is currently working toward consensus on standardizing project management, and has ended up considering a much bigger project: replacing many of our tools (Gerrit, Bugzilla, Trello, Mingle, maybe even RT) with Phabricator.

File handling and thumbnailing
The Multimedia team will, at some point, need to embark on substantial work on our file upload pipeline, how we name files, and how we handle thumbnailing. Aaron Schulz in particular has deep expertise in this area. At this point, the Multimedia team hasn't requested help in this area, so it's premature for us to get involved, but this may come up again in July.
 * Next step: scoping by Aaron and/or Gilles
 * Include version in thumbnail URL (17577)

Elasticsearch

 * Elasticsearch category intersection with simple JS UI embedded on CategoryPage, presented to the user as "filtering" or "refining" a category.

Setup.php improvements

 * Setup.php speed improvements and service registry

Next step: Aaron to explain what this means :-)

Dedicated cluster for private/closed wikis

 * Dedicated cluster for private/closed wikis (e.g. officewiki)

Authn/z service to interact with other services
As we move towards independent services (SOA), we need a system for identifying users and their rights across services, which involves building infrastructure for providing proper authentication tokens for API access. Possible solutions for this include OAuth 2, Kerberos, or a custom solution built on JSON Web Tokens; the solutions for non-interactive applications are not as clear as they are for interactive web applications (where basic OAuth works pretty well).
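
To make the JSON Web Token option concrete, here is a minimal sketch of HS256 token issuing and verification using only the standard library. The claim names and secret are invented for illustration; a real deployment would also need expiry, key rotation, and audience checks:

```python
import base64
import hashlib
import hmac
import json

def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWT requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")

def issue_token(claims: dict, secret: bytes) -> str:
    """Build a signed header.payload.signature token."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps(claims).encode())
    signing_input = f"{header}.{payload}".encode()
    sig = hmac.new(secret, signing_input, hashlib.sha256).digest()
    return f"{header}.{payload}.{b64url(sig)}"

def verify_token(token: str, secret: bytes) -> dict:
    """Check the signature and return the claims, or raise ValueError."""
    header, payload, sig = token.split(".")
    signing_input = f"{header}.{payload}".encode()
    expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
    if not hmac.compare_digest(b64url(expected), sig):
        raise ValueError("bad signature")
    padded = payload + "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))
```

Because verification needs only the shared secret, any service could check a token without calling back to a central login service, which is the main appeal for non-interactive API clients.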

Revision storage revamp
This would involve:
 * Specifying an API for storage of revisions; anything implementing that API could be trivially used with MediaWiki.
 * Refactor core so that revision storage is an implementation of that API.

There are two parts to revision storage: revision metadata and revision text. For revision text storage, we have External Storage, which was very innovative in its day, but is showing its age. Due to the design, reads and writes are largely focused on a single node (where the most recent revisions are), so adding nodes doesn't necessarily improve performance. The compression scripts may be fine, but we don't know for sure because we're a bit afraid of running them to find out. Thus, we're not really getting the benefit of the compression offered by the system.

A likely solution for storing revision text is Rashamon. Metadata could also be stored in Rashamon as well, but we may still need a copy of the metadata in our SQL database for fast and simple querying. Regardless of our plans for Rashamon, there is significant work involved in MediaWiki to abstract away the many built-in assumptions that our code has about retrieving revision metadata directly from a database.

This project would be done in service of a broader push toward a service-oriented architecture. We would use it to set an example for how we foresee other aspects of MediaWiki being turned into modular services, and would likely take the opportunity to further establish the value-object pattern exemplified in TitleValue. As part of this work, we would also provide a simple SQL-based implementation for developer installations of MediaWiki, and we would like to work on infrastructure for providing proper authentication tokens for API access (see the authentication section above).

This project also offers us an opportunity to make revision storage a nicely abstracted interface in core, and could be quite complementary to Rashamon, giving Gabriel some clean API points to plug it in. It also offers us the ability to establish the template for how MediaWiki can be incrementally refactored into cleanly-separated services.

After some investigation, we decided to pass on this project. What made us consider it was indications that our current revision table for English Wikipedia is becoming very unwieldy, and it's really time for a sharded implementation. However, Sean Pringle made some adjustments, and we're comfortable that we have enough headroom that this project isn't as urgent as we first thought.

OpenID Connect
One thing we've long wanted to do as complementary work to our OAuth work is to implement some form of standardized federated login. We briefly considered working on OpenID Connect, which integrates well with OAuth. Work in this area would make it very simple to integrate Phabricator with our Wikimedia project logins.

See also: OpenID provider (request for OpenID for use in Tool Labs)

Localisation cache write performance improvement
One thing that makes our current deployments unreliable is the many workarounds we use to avoid rewriting the l10n cache. Ideally, we would use "scap" no matter how small the change we're making to the site, but that's not practical because rebuilding and distributing the l10n cache often takes over 10 minutes. Some of us have nascent ideas about how to make this faster, but we haven't yet agreed on a solution we want to pursue.

Next step: RFC from Ori and/or Bryan Davis

Admin tools
See Admin tools development

Next step: Dan to offer specific choices for the group to consider.

git-deploy / deploy-tooling

 * will be fleshed out within the wider "Deployment status and improvements" work (on point: Greg)


 * Dependency: Localisation cache

Moving VCL logic to MediaWiki

 * Separate Cache-Control header for proxy and client (48835)

Central code repo
Gadgets, Lua, templates

LogStash

 * Bugs

User attribution

 * Infrastructure for "claim an edit" feature
 * Making user renaming simple and fast

User preferences
Possible high priority for another team.
 * Requests for comment/Redesign user preferences

Edit notices

 * 22102: Add edit notices/warnings in a consistent fashion, and put them in a sensible order

Installation consolidation
Aligning MediaWiki developer and third party installation methods. Easy install of more complicated MediaWiki installs (e.g. Parsoid, PDF rendering, Math, etc). Possible use of Vagrant-Composer?
 * https://developers.google.com/compute/docs/containers


 * Composer package types:
 * https://github.com/composer/installers
 * there exists a very nominal 'mediawiki-extension' type: https://github.com/composer/installers/blob/master/src/Composer/Installers/MediaWikiInstaller.php
 * https://bugzilla.wikimedia.org/show_bug.cgi?id=65188#c3


 * Assess viability of deploying MediaWiki using Containers to both the production cluster and various cloud platforms
 * ...as a way of improving our internal deployment process
 * ...as a way of making MediaWiki substantially easier to install by third-parties

Configuration management

 * Allowing Stewards to set certain things in the UI (e.g. per-wiki logos)
 * Cleaner command line methods to make configuration changes
 * Get rid of configuration globals
 * Requests for comment/Extension registration


 * https://gerrit.wikimedia.org/r/#/c/109850/

SOA Authentication

 * Requests for comment/SOA Authentication

SUL finalization
The goal of SUL finalisation is to forcibly rename any clashing user accounts so that every user has a single unified login across all Wikimedia sites. This long-overdue project benefits community members with advanced permissions who have to deal with the fallout of accounts not being unified. It also benefits engineering teams at the Wikimedia Foundation, who can then develop features which exploit SUL, such as Flow boards which span multiple projects.

Better Captcha infrastructure
Possibly as part of the thumbnailing revamp.

API cleanup
Requests for comment/API roadmap

OAuth UX refinement
The OAuth user interface needs refinement to improve the user experience. The mobile experience also needs to be looked into, as the mobile workflow was largely ignored in the initial release of OAuth.

Structured license metadata
tl;dr: it would be good if MediaWiki had a way to include structured license/source data in the wiki, because the assumption that a page has a single source is slowly but increasingly inaccurate, and the legal team is therefore concerned about continuing to assume there is just one license and one set of authors.

The problem
Wikipedia has a gradually increasing number of pages containing content not created on that page: page merges, page translations, actual third-party sources, etc. These pages cause licensing problems, because the edit history has no standard, structured way to say "this came from somewhere else". Current ad-hoc band-aids include putting licensing information on talk pages, in page histories, or sometimes even in the body of the article. Commons has a similar problem.

This generally works OK for Wikipedia - the site complies with the license when the article is viewed through the website. However, this may cause increasing problems for display outside of the main website. For example, generated pdfs of a translated article can arguably cause a license violation, because the list of authors doesn't include the original-language article. It is possible to imagine scenarios where it creates issues in the mobile and API context as well.

Solutions?
We probably need some structured metadata for things like (1) source information (it came from where, when?) (2) license information (3) authorship information, and some UI to add those when necessary. People who are putting together the output (pdf team, mobile team, etc.) would be responsible for accurately using the metadata.
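
As a sketch only, the metadata for a translated page might look something like the following. Every field name here is invented to illustrate the three categories above; no schema has been agreed:

```json
{
    "sources": [
        {
            "origin": "https://de.wikipedia.org/wiki/Beispiel",
            "retrieved": "2014-05-01",
            "relation": "translation"
        }
    ],
    "license": "CC-BY-SA-3.0",
    "authors": {
        "local": true,
        "external": ["de.wikipedia.org article history"]
    }
}
```

Consumers such as the PDF renderer could then assemble a complete attribution list mechanically instead of assuming one license and one local author set.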