Requests for comment/Moving database abstractions out of MediaWiki core

Request for comment (RFC)
Moving database abstractions out of MediaWiki core
Component	General
Creation date	24 July 2015
Author(s)	Legoktm, Skizzerz
Document status	in draft

Background

We "support" mysql, postgres, sqlite, oracle, and mssql as database abstraction layers in MediaWiki. But in reality only mysql and sqlite are well supported. MediaWiki has no support whatsoever for third-party database drivers, meaning that if someone wants to take it on themselves to add a new abstraction layer, the code must live in core. This leads to numerous problems, however, and is furthermore inconsistent with many other parts of the MediaWiki software where it is very easily customized and extended by third-party code.

Problem

For oracle and mssql especially, there are very few developers who can support those codebases, and don't frequently have the time to do so. Skizzerz, for example, maintains the mssql implementation, but only plans on maintaining the LTS releases and doesn't have time to keep up with master's development cycle. Even for DBMSes that are free (postgres), bundling the libraries in core when they may not fully support the latest version or are partially broken gives false pretenses to users about the level of support when they see the options available in the installer. If the user needed to explicitly install the libraries themselves, they would have knowledge of how well-supported said DBMS is and can make a more educated decision on whether or not to go with it for their mediawiki installation. Additionally, our unit test suite does not touch at all on alternate DBMSes, which further exacerbates the issue of not knowing how well-supported or buggy a particular implementation is.

A simple solution is to remove these abstractions from core and be done with it, however if they are removed from core it becomes impossible to use them without hacking core code to put them back in. This RFC aims to provide the underlying infrastructure support in mediawiki to allow for database abstractions to be installed/included much like extensions and skins are. This gives us the best of both worlds, where someone downloading the core code can be assured that the functionality and features they want to use are supported by all bundled database abstractions and also have the option of installing third-party abstractions that may offer less functionality, are less supported, or were made for older versions of mediawiki with full knowledge of the caveats that result by doing so.

Proposal

Allow a new type of extension for custom database abstractions. This has two needs, both during installation as well as runtime:
- For installation, we need to automatically detect any abstractions since they are needed as part of the install process where LocalSettings doesn't exist yet. This autodetection requires that there is a standardized place to put these abstractions in order for mediawiki to pick up on them. $IP/db is proposed for this in order to model where extensions and skins are stored ($IP/extensions and $IP/skins, respectively, e.g. a first level subdirectory of the mediawiki installation).
- After installation, abstractions are loaded on an as-needed basis. This is done to reduce code attack surface as well as belay any compatibility issues that may cause fatal errors and bring the wiki down due to outdated abstractions. Multiple abstractions must be able to load in order to allow querying disparate dbms systems should the need arise via the Autoloader. In order for the Autoloader to know what exists or is valid, either autodetection can be performed or a new configuration variable can be introduced to enumerate the valid abstractions. I prefer the former, since we need autodetection for the installer anyway and the abstraction type is well-specified in $wgDBtype or $wgExternalServers.
- The autodetection would either rely on a consistent naming scheme for class names or that some form of json metadata file is present in order to present the class name to use. A strategy employing both could be used (use json if present otherwise guess based on the naming scheme), however that is likely more complex than is really needed. Simply enforcing a consistent naming scheme is the preferred solution of this proposal, e.g. DatabaseDbms (such as DatabaseMysql, DatabasePostgres, etc.). Should multiple drivers for a dbms exist (such as mysql vs mysqli), the class that the Autoloader expects can initialize and return an instance of whatever class it feels like using, this is best accomplished by requiring a static factory method on the class.
Core abstractions should be moved to $IP/db as well (out of $IP/includes/db) to provide better guidance to developers/wiki owners on how to develop or load their own abstractions as well as ensuring that the machinery used to load 3rd party abstractions actually works (e.g. dogfooding). This is also consistent with core skins living alongside 3rd party skins.
The unit test suite should be modified to allow running on an arbitrary dbms backend. This would allow a number of benefits across the pipeline for third-party database abstractions, as well as first-party ones. Third-party developers can also use the unit tests as a guideline for determining that their abstraction supports all of the things it needs to in order to work properly with core code or any given extension that defines its own tests.
- Note that the above is only for manually running the test suite, getting jenkins to automatically run the test suite for an arbitrary dbms is out of scope of this RFC.
The bane of supporting a myriad of DBMSes is instances where raw SQL strings are passed in to be directly queried. As such, this RFC aims to deprecate the ability for an extension to pass in arbitrary SQL and instead enforce usage of the abstraction wrappers. There are cases in the wrappers as well where raw SQL fragments can be passed in, and eliminating those would be part of this as well.
- Hopefully no core code would be impacted by this, but I can foresee extensions being impacted by this change, especially extensions written for corporate environments where cross-DBMS functionality would be considered a waste of time rather than a desirable feature. As a result, a compromise position would be to still allow the execution of arbitrary SQL strings, but requiring an extra step to do so in a way that makes the developer explicitly signal that they are eliminating the ability to run the code across multiple DBMSes. This will likely be accomplished by renaming the function to run an arbitrary query in a way to indicate this, such as $db->queryMysql() for a mysql-based query, or $db->queryPostgres() for a postgres-based query. Such can be automatically handled by use of __call() forwarding to a protected/private query() function and throwing an exception if the abstraction layer does not match the dbms specified. Alternatively, we could rely on abstractions to explicitly implement this function by way of convention, but I'm less a fan of that way as there is no way to enforce that in code.