User:Ladsgroup/Abstract Schema


 * Copied from User:Anomie/Abstract schema

This page holds some notes on the idea of abstracting MediaWiki's schema.

Problem
MediaWiki claims to support five databases: MySQL/MariaDB, SQLite, PostgreSQL ("PG"), Microsoft SQL Server ("MSSQL"), and Oracle. For normal runtime-type queries, we have abstractions that make these all mostly work pretty well.

But at the DDL level it's a completely different story. One major piece of (and source of) technical debt is the fact that MediaWiki does not have a database schema, it has four. And most schema changes have to be written five times, one for each supported database. In practice, this means schema changes for the less-supported databases are often omitted, or when not omitted are often merged without being tested.


 * MySQL/MariaDB always works, since it's the one used on Wikimedia wikis and most third-party installations.
 * SQLite is fairly well supported, since the WMF CI infrastructure uses it for some tests. And its schema changes are often extremely annoying to write, since many have to be done as "create the new version of the table, copy all data over, drop the old, and rename".
 * PostgreSQL (PG) mostly works, and is tested in WMF CI. There are a few WMF developers who can easily run tests (and will do so if asked), and it's included in Linux distributions.
 * MSSQL and Oracle probably don't work all that well. No WMF developers both use either one and care enough to have made themselves known, and the few volunteer developers who're interested can't devote a lot of time to supporting them. Licensing prevents us from adding them to CI, even if someone were willing to do the work to make it technically possible.

Requirements
We can easily improve the situation by abstracting the schema and schema change definitions, with code per database to translate that into the actual DDL statements. To do this, we need:


 * An abstract data structure for schemas.
 * An abstract data structure for schema changes.
 * APIs to deal with these data structures. Functions needed include:
 * Turn a schema structure into something resembling tables.sql, for human review.
 * Populate an empty database with tables corresponding to a schema structure (i.e. the installer).
 * Compare a schema structure with an existing database and report differences.
 * Turn a schema change structure into a set of SQL commands, for human review and WMF DBAs.
 * Alter an existing database according to the instructions in a schema change structure (i.e. update.php).
 * Given a schema structure and a schema change structure, output a new schema structure that is the result of applying those changes.
 * Implementations of those APIs for supported databases.
 * Integration with MediaWiki, including translation of the existing tables.sql and schema change files.

Abstract data structures
The sections below are intentionally specifying a data structure rather than a file format. We can use any generic file format as long as it can represent the structure correctly.

The features we need from a file format are:
 * Arrays, i.e. an ordered list of values.
 * Unordered key-value maps. Keys are always strings.
 * Ordered key-value maps (although we could work around that by making "columns" an array of unordered maps and move the column name inside the map)
 * Basic data types: strings, integers, booleans, null. Maybe floating point numbers too.

Using JSON as the format on disk would seem to be the current standard, although T212460 suggests we should use PHP array format.

For simplicity, let's define the canonical on-disk ordering of an unordered map as sorting it by key.

Schema format
The root of the data structure is an unordered map:
 * "tables": map, probably unordered:
 * Each key is the table name. Values are unordered maps ("table definition"):
 * "comment": (string, optional) A description of the purpose of the table. This may be stored into the DB if the DB supports it.
 * "columns": An ordered map (but see below):
 * Each key is the column name. Values are unordered maps ("column definition"):
 * "comment": (string, optional) A description of the purpose of the column. This may be stored into the DB if the DB supports it.
 * "type": (string) Type of the column, see /DB Requirements.
 * "nullable": (boolean, default false) Whether the column is declared NOT NULL (false) or NULL (true). Some types don't support this flag.
 * "default": (type depends on "type", optional) If present, the DEFAULT for the column. If omitted and the column is nullable, the default should be NULL (most if not all of our DBs do this automatically anyway). If omitted and the column isn't nullable, then the column has no default and must be specified on every INSERT.
 * "references": (optional) Defines a foreign key constraint for this column. See /DB Requirements for semantics. An unordered map:
 * "table": (string) The table being referenced.
 * "column": (string) The column being referenced.
 * "on update": (string, default "restrict") The action to take if the referenced column is updated. See /DB Requirements.
 * "on delete": (string, default "restrict") The action to take if the referenced row is deleted. See /DB Requirements.
 * Other types may have additional flags. See the type documentation for details.
 * "pk": (array of 1+ strings, optional) The names of the columns making up the primary key.
 * "indexes": (optional) An unordered map. See /DB Requirements for semantics.
 * Keys are the index names. Values are unordered maps ("index definition"):
 * "comment": (string, optional) A description of the purpose of the index. This may be stored into the DB if the DB supports it.
 * "columns": (array of 1+ strings or unordered maps) The columns making up the index. If a string, the string is the column name. If an unordered map, it has the following keys:
 * "column": (string) The column name.
 * "length": (int, optional) The prefix length for text-type columns. See /DB Requirements for details.
 * "direction": (string, default 'ASC') One of the constants 'ASC' or 'DESC'.
 * "unique": (boolean, default false) If true, the index also defines a unique constraint. See /DB Requirements for semantics.
 * "initRows": (optional) A valid value for the  parameter to  . When the table is first created, these rows are inserted.

Two schema data structures can be merged. At the root level, values for each key are merged independently:
 * "tables": If both schema files define the same key, an error is raised.

Implementation notes:
 * Yes, a schema is just a bunch of tables. We could add views later, if we decide to start using views in the schema. Other things seem unlikely to be sanely abstracted.
 * Someday the "pk" should be non-optional. But first we'd need to fix MediaWiki's schema.
 * The "installer" will have to make at least two passes: one to create all the tables and one to create any foreign keys after all the tables are created.
 * While the columns are defined as an ordered map, MediaWiki must not actually depend on the order since some DBs don't support sane schema changes for addition of columns in the middle of the table.
 * This proposal has yet to specify how to deal with "fulltext" indexes. See /DB Requirements.

Schema change format
The root of the data structure is an array of unordered maps defining individual changes:
 * "name": (string) A short human-readable name, as might be displayed by update.php
 * "comment": (string, optional) A longer description of the change.
 * "order": (string) A unique key for this update that also defines its order. Sorting uses PHP's  function. The format string would look like "1.31.0-wmf.12+core+123" or "1.31.0-wmf.12+extension-MyCoolExtension+123", where "1.31.0-wmf.12" is the MediaWiki version number, the next field is a constant identifying the source, and the final field is a serial number in case there are multiple schema changes within one MW version.
 * "check": (array, optional) Precondition check to run. If the check returns false, the individual schema change is skipped. The array consists of an operation followed by 0 or more parameters to the operation. Operations are:
 * "AND": Parameters are "check" arrays. They are evaluated in order. If any of those checks returns false, the "AND" immediately returns false (without evaluating the rest). Returns true if all pass.
 * "OR": Parameters are "check" arrays. They are evaluated in order. If any of those checks returns true, the "OR" immediately returns true (without evaluating the rest). Returns false if none pass.
 * "NOT": There must be exactly one parameter, which is a "check" array. Evaluates that check and returns the logical inverse.
 * Any boolean-returning function defined in the interface.
 * "before": A snapshot of the abstract table schema before.
 * "after": A snapshot of the abstract table schema after.

Two schema change arrays can be merged by concatenating them. Note "order" values must remain unique after combining.

Implementation notes:
 * Ideally we'd say to wrap each change in a transaction so a failure in the middle of running the "updates" can be cleanly rolled back, but MySQL for one can't actually do that.
 * The reason we can't use SchemaOperations is that sqlite needs to know the table structure before to make temporary tables and move data, something simple as "drop field foo from table bar" is not possible.
 * Doctrine DBAL has a simple differ to build the alter table sql.

Introspector
This interface has methods for introspecting the schema of an existing database.

I'm not going to try to anticipate all the methods it might need, but off the top of my head the possibilities include:
 * bool tableExists( string $table )
 * bool columnExists( string $table, string $column )
 * bool columnIsType( string $table, string $field, string $type )
 * bool columnIsNullable( string $table, string $field )
 * bool columnHasDefault( string $table, string $field, mixed $value )
 * bool indexExists( string $table, string $index )
 * bool indexIsUnique( string $table, string $index )
 * string[] listTables
 * string[] listColumns( string $table )
 * string[] listIndexes( string $table )
 * array getTableDefinition( string $table )
 * array getColumnDefinition( string $table, string $column )
 * array getIndexDefinition( string $table, string $index )

Implementation notes:
 * We can very likely have a base class that implements most of the boolean methods in terms of the get-definition methods.
 * Should columnExists and indexExists return false if the table doesn't exist, or throw exceptions? I lean towards the former.
 * We'll also want a version that introspects a schema structure directly with no backing database. This will be used with the similar no-DB SchemaOperations to handle the "Given a schema structure and a schema change structure, output a new schema structure that is the result of applying those changes" use case.

SchemaUpdater
This is the class that turns a schema structure or schema update structure into calls to methods on Introspector and SchemaOperations.

Implementation notes:
 * I'm hoping we can completely leave out any equivalent to the existing sourceFile and sourceStream methods.

MediaWiki integration
Implementation notes:
 * For CI we currently have an old version of tables.sql, which we load into SQLite and then apply all the schema changes to to see if it result in an equivalent database. I propose we do the same thing with the abstract schema, and possibly that we run that against the testing database as well.
 * We might even save the abstract schema at every release and check upgrades from all of them.
 * We might make a maintenance script like maintenance/generateLocalAutoload.php
 * For MySQL, we'll need to make sure that the code will work with both the old tables.sql schema and the new abstract schema, and we'll want to hold needed schema changes to make an existing database match the abstract schema to a minimum. For other databases that restriction isn't needed.
 * We'll probably want to make use of foreign keys optional, and disable that on Wikimedia wikis. At least at first.
 * We'll likely (especially for non-MySQL) need to make a special schema change that does nothing to an installation using the abstract schema but will update an old tables.sql installation to match the abstract schema.

Prior art

 * Requests for comment/Abstract table definitions - declined due to implementation concerns with a custom DSL
 * abstract-schema SVN branch - was backed out of the installer rewrite to avoid delaying that project