User:Anomie/Abstract schema/DB Requirements

From mediawiki.org

This page lists some things MediaWiki assumes of a database, and other assumptions made by the Abstract schema proposal. See that page for background, details, and definitions.

Identifiers[edit]

This proposal suggests the following restrictions on schema identifiers (table names, column names, and index names):

  • Identifiers must match the regular expression /^[a-z][a-z0-9_]{0,62}$/.
  • Identifiers are case-insensitive. The canonical representation is lowercase.
  • Identifiers used in the schema must be at most 50 characters long. This leaves 13 characters for automatic prefixing, e.g. "unittest_" used by our PHPUnit testing.
  • Index names must be unique across all tables.
  • Index names matching /_[a-z]k[0-9]*$/ are reserved and may not be defined in the schema.
  • There are no reserved words, as all databases support quoting identifiers.

Implementation notes:

  • The minimum Oracle version would have to be raised to 12c for this to work, or else the Oracle implementation would have to remap identifiers all over the place, or else we'd have to lower the limits to 30 characters rather than 63.
  • MySQL and MSSQL have per-table indexes, while SQLite, PostgreSQL, and Oracle require the names to be unique across all tables. Thus we need to pass the table name to index-manipulation functions for the former but require all-table uniqueness for the latter.

Character sets and collations[edit]

Modern MediaWiki expects to write UTF-8 data to string and text-type fields. The database layer (in MediaWiki or in the actual database) may refuse to accept non-UTF-8 data, and probably should do so. The database itself need not store the data as UTF-8 as long as MediaWiki cannot tell the difference.

MediaWiki further expects that the database-level collation is "binary", in that it compares by Unicode code point (or it directly compares the bytes when encoded as UTF-8).

Implementation notes:

  • If we need to continue supporting MySQL without innodb_large_prefix in effect, we may need MediaWiki to have "utf8" and "binary" modes where it assumes byte-length limits for the latter and codepoint-length limits for the former.
  • In MySQL, we'd ideally want to use utf8mb4 with utf8mb4_bin as the collation, which is not the default. If we have a "binary" mode, the "binary" character set is what we'd use.
  • SQLite is going to be using either UTF-8 or UTF-16, determined at compile time. The default BINARY collation is what we want.
  • In PostgreSQL, UTF8 is the character set with "C" or "POSIX" as the collation, which is not necessarily the default. If we have a "binary" mode, the "SQL_ASCII" character set is the one to use.
  • In MSSQL we can use the nvarchar type to always use UTF-8 regardless of the database character set. I don't know what collation is the default or what a good collation is.
  • It looks like Oracle has an "nvarchar2" type as well. No idea about collation.

Data types[edit]

Basic data types are needed that can be mapped to database-specific types. Where the descriptions below mention "additional flags", that refers to the column definition map in the schema format.

General implementation notes:

  • Should we recommend optional use of CHECK constraints on databases that have them to enforce limits that aren't enforced natively? For example, byte length on string-type fields for DBs that use Unicode natively.
  • The practice of using VARBINARY for string types in MySQL means that we likely have a lot of bugs in the other DBs where MediaWiki normally stores Unicode text, and the non-MySQL DBs use a non-text field, but then someone decides to start gzipping the data because MySQL doesn't care.

string[edit]

A string column stores short Unicode text with no U+0000 characters. The underlying database field may use some other encoding as long as it can be losslessly converted.

The field must be able to distinguish between null and empty-string.

This is intended to be roughly equivalent to a VARCHAR type.

Additional flags are:

  • "length": (int, 1-255) The intended size of the field in bytes. The database may allow storing more than this many bytes.
  • "fixed": (boolean, default false) If true, the field is intended to store values of exactly "length" bytes. Creation as a CHAR type rather than VARCHAR is recommended.

Implementation notes:

  • While the description above talks in terms of bytes, if we can manage it it would probably be more sane to define it in terms of Unicode code points instead. See #Character sets and collations above.
  • Oracle can't satisfy the "must be able to distinguish between null and empty-string" requirement. A few options:
    • Drop Oracle support.
    • Have DatabaseOracle::addQuotes() pad any string matching /^ *$/ with one extra space, and ORAResult remove one space from any string matching that regex. And note that anything that hard-codes column = '' or column != '' instead of using column != {$db->addQuotes( '' )} will be broken on Oracle only (but it is already, so...).

binary[edit]

A binary column stores short binary data. It should only be used if you know the data can be something that's not a valid "string".

The field must be able to distinguish between null and empty-string.

Additional flags are:

  • "length": (int, 1-255) The intended size of the field in bytes. The database may allow storing more than this many bytes.
  • "fixed": (boolean, default false) If true, the field is intended to store values of exactly "length" bytes. Creation as a BINARY type rather than VARBINARY is recommended.

Implementation notes:

  • MySQL: VARBINARY again.
  • SQLite: BLOB is the only option.
  • PostgreSQL: BYTEA is technically correct.
  • MSSQL: VARBINARY is technically correct.
  • Oracle: BLOB or RAW, apparently. And possibly it has the same empty-vs-null bug.
  • In general, IDatabase has encodeBlob and decodeBlob methods that are supposed to be used for wrapping such values so the DB layer knows to encode them properly. Except MySQL doesn't care, so they hardly ever get used. Sigh.

text[edit]

The equivalent of "string" without the length limit.

The field must be able to distinguish between null and empty-string. The "default" flag cannot be used. The column cannot be part of a normal index when size!=tiny, but a fulltext index may be created regardless of size.

Additional flags are:

  • "size": (string, default 'medium') One of the following constants:
    • 'tiny': The field holds up to 255 bytes.
    • 'small': The field holds up to 65535 bytes.
    • 'medium': The field holds up to 16777215 bytes.
    • 'big': The field holds up to 1GB, maybe more.

Implementation notes:

  • While the description above talks in terms of bytes, if we can manage it it would probably be more sane to define it in terms of Unicode code points instead. See #Character sets and collations above.
  • MySQL: TINYBLOB, SMALLBLOB, MEDIUMBLOB, and LONGBLOB for the different sizes. Or replace BLOB with TEXT, with the same notes as VARBINARY vs VARCHAR for "string".
  • SQLite: TEXT.
  • PostgreSQL: TEXT.
  • MSSQL: NVARCHAR(max).
  • Oracle: CLOB. And possibly it has the same empty-vs-null bug.

blob[edit]

The equivalent of "binary" without the length limit.

The field must be able to distinguish between null and empty-string. The "default" flag cannot be used. The column cannot be part of a normal index when size!=tiny.

Additional flags are:

  • "size": (string, default 'medium') Same as for "text".

Implementation notes:

  • MySQL: TINYBLOB, SMALLBLOB, MEDIUMBLOB, and LONGBLOB for the different sizes.
  • SQLite: BLOB.
  • PostgreSQL: BYTEA.
  • MSSQL: VARBINARY(max).
  • Oracle: BLOB. And possibly it has the same empty-vs-null bug.
  • In general, IDatabase has encodeBlob and decodeBlob methods that are supposed to be used for wrapping such values so the DB layer knows to encode them properly. Except MySQL doesn't care, so they hardly ever get used. Sigh.

boolean[edit]

A field that holds two values representing TRUE and FALSE. Or three if you count the possibility of NULL.

Implementation notes:

  • MySQL: Has a BOOL type, but it's just an alias for TINYINT(1).
  • SQLite: INTEGER is the only real option.
  • PostgreSQL: BOOL.
  • MSSQL: BIT.
  • Oracle: CHAR(1) storing '0' or '1'.

integer[edit]

A field representing an integer value.

Additional flags:

  • "size": (string, default 'medium') One of the following constants:
    • 'tiny': The field holds values from 0–255 bytes.
    • 'small': The field holds values from -32768–32787.
    • 'medium': The field holds values from -2147483648–2147483647.
    • 'big': The field holds values from -9223372036854775808–9223372036854775807.
  • "unsigned": (boolean, default false) If true, the field will only store non-negative integers. The upper end of the range may or may not increase.
  • "serial": (boolean, default false) If true, this is the auto-incrementing primary key for the field.

A column with serial=true must be non-nullable, must not have a default (the default is the "serial" behavior), and must be the primary key. Thus, there can only be one in a table.

The auto-incrementing values for a serial column must not be reused if the row is deleted.

Implementation notes:

  • It looks like only MySQL actually supports unsigned. Can we just not support that?
  • MySQL: TINYINT UNSIGNED / SMALLINT / INT / BIGINT. "serial" is AUTO_INCREMENT.
  • SQLite: INTEGER. "serial" is AUTOINCREMENT.
  • PostgreSQL: SMALLINT / INT / BIGINT. "tinyint" will have to be SMALLINT too. "serial" could be SMALLSERIAL / SERIAL / BIGSERIAL (or in PG10, maybe, "GENERATED BY DEFAULT AS IDENTITY").
  • MSSQL: TINYINT / SMALLINT / INT / BIGINT. Yes, MSSQL's TINYINT is only unsigned while the rest are only signed. "serial" is IDENTITY.
  • Oracle: NUMBER is the only option? Although it seems to support INTEGER as an alias. "serial" is a sequence with the trigger as it's currently done.

float[edit]

A field representing a floating-point value.

Additional flags:

  • "precision": (string, default 'single') 'single' or 'double'.

Implementation notes:

  • MySQL: FLOAT or DOUBLE.
  • SQLite: REAL.
  • PostgreSQL: REAL and DOUBLE PRECISION.
  • MSSQL: REAL and DOUBLE PRECISION.
  • Oracle: BINARY_FLOAT and BINARY_DOUBLE?

decimal[edit]

A field representing a fixed-point decimal value.

Additional flags:

  • "precision": (int) Total number of digits, 1–38.
  • "scale": (int) Number of digits after the decimal. Must not be greater than "precision".

A "decimal" column with precision=5 and scale=3 would represent values like "00.000".

Implementation notes:

  • MySQL: DECIMAL(p,s) (or NUMERIC(p,s)). The DB-level limit on precision limit is 65.
  • SQLite: NUMERIC is the option. Possibly declaring it as "DECIMAL(p,s)" would be clearer.
  • PostgreSQL: NUMERIC(p,s) (or DECIMAL(p,s)). The DB-level limit on precision limit is 1000.
  • MSSQL: NUMERIC(p,s) (or DECIMAL(p,s)). The DB-level limit on precision limit is 38.
  • Oracle: NUMBER(p,s). The DB-level limit on precision limit is (effectively) 39.

timestamp[edit]

A field representing a date and time, with second precision and no associated timezone.

The implementation must be able to store date-time values between 1000-01-01T00:00:00 and 9999-12-31T23:59:59, plus a special value (distinct from NULL) representing "infinity".

Implementation notes:

  • Do we actually need the infinity value? It looks like it's only used for some expiry fields where NULL would probably work as well.
  • MySQL: Current schema uses VARBINARY(14), storing a "YYYYMMDDHHIISS" string or the string "infinity". MySQL's DATETIME type could also work, if we figure out some appropriate way to represent infinity.
  • SQLite: TEXT, storing the same values as MySQL currently does.
  • PostgreSQL: Current schema uses TIMESTAMPTZ, although a timestamp without timezone would be more appropriate. Both types specifically support "infinity".
  • MSSQL: Currently does what MySQL does. Could use a DATETIME2 type with the same caveats as MySQL.
  • Oracle: TIMESTAMP. Currently uses 2030-12-13T12:00:00 for infinity for some unknown reason.

Enumerations[edit]

A field representing a value from a list of predefined values.

Additional flags:

  • "value": (array of strings) The possible values, which should probably all be ASCII with a reasonable maximum length.

Implementation notes:

  • While they show up nicely in the database, it's a pain that adding a new possible value requires a schema change and sorting for MySQL and PG goes by integer ID rather than text value. There's also a comment in the code that "foo > 'value'" works unreliably in MySQL. We might do better to replace these with application-level enums using NameTableStore.
  • MySQL: ENUM.
  • SQLite: TEXT storing the value.
  • PostgreSQL: Current schema uses "CREATE TYPE foo AS ENUM" and TEXT storing the value in different places.
  • MSSQL: Uses various-length VARCHAR with a CHECK constraint to limit the values.
  • Oracle: Uses various-length VARCHAR2.

Foreign keys[edit]

A foreign key should probably have the following requirements:

  • The target column is the primary key. We might be able to loosen that to "is a unique index".
  • The source column is indexed too.
  • Both source and target columns are integers with the same size and unsigned flags.

Actions for "on update" and "on delete" are:

  • 'restrict': Don't allow the update/delete. At the database level this might be RESTRICT or NO ACTION, the only difference is that some DBs check RESTRICT immediately while waiting on NO ACTION until the transaction is committed.
  • 'cascade': Update the source column to match the new target column value, or delete the source column row.
  • 'set null': Set the source column to NULL.
  • 'set default': Set the source column to its default value.

We might also make it optional to actually create the foreign keys in the database; they give better integrity, but slow down inserts, updates, and deletes.

Indexes[edit]

MediaWiki's expectations of indexes are:

  • Indexes should work for most data types. It's acceptable if "text" and "blob" types require partial indexes (described below).
  • An index on a column should significantly speed up most queries that access the table with WHERE or JOIN conditions having an equality or comparison on that column and ORDER BY on the column.
  • Multi-column indexes are possible, and should significantly speed up most queries that access the table with WHERE or JOIN conditions having an equality or comparison on any prefix of the index, and ORDER BY as well.
    • For example, an index on (foo_col1, foo_col2, foo_col3) should be usable for queries that look at foo_col1 or both foo_col1 and foo_col2, and should also work for a query that looks at foo_col1 and orders by foo_col2 and foo_col3 or the like.
  • When using a well-indexed ORDER BY with a LIMIT clause, the database should not fetch all matching rows before applying the limit.
    • For example, with an index on (foo_col1, foo_col2) it should fetch only 50 rows to satisfy "SELECT * FROM foo ORDER BY foo_col1, foo_col2 LIMIT 50" or "SELECT * FROM foo WHERE foo_col1 > 'value' ORDER BY foo_col2 LIMIT 50"
  • When using a non-primary index, ORDER BY should be efficient when it appends the primary key.
    • For example, if a table has primary key (foo_id) and an index on (foo_col1, foo_col2), a query for "SELECT * FROM foo WHERE foo_col1 > 'value' ORDER BY foo_col2, foo_id LIMIT 50" should also fetch only 50 rows.

The database need not actually use the provided name for the index as long as the USE INDEX hint to IDatabase::select() and the various Introspector and SchemaOperations functions can find the index when supplied with the appropriate name.

Column-prefix indexes[edit]

"string", "binary", "text", and "blob" columns can be defined in the index such that only a prefix is indexed, using the optional "length" key in the index definition for the column. If a database does not support prefixes, it is free to index the whole column or do something else to achieve similar behavior. A functional index may only be used if "WHERE column_name = 'value'" will still properly use the index (i.e. something like "WHERE SUBSTR(column_name, 1, 20) = 'value'" is not needed).

Primary key[edit]

Every table should have a primary key. Every column in the primary key must be non-nullable. MediaWiki expects the primary key to also define a unique index on the columns.

For the purposes of the USE INDEX hint to IDatabase::select() and the various Introspector and SchemaOperations functions that take an index name, MediaWiki will supply "PRIMARY" as the name. The database should do the right thing, and are encouraged to use "{$table}_pk" as the index name if a name is needed for the primary key index.

Unique indexes[edit]

MediaWiki expects the following of unique indexes:

  • If all columns in the index are non-null, the database will raise an error if an INSERT or UPDATE would result in two rows with the same values for those columns.
    • This error can be suppressed by passing the IGNORE option to IDatabase::insert(), IDatabase::update(), and so on.
  • If any column in the index is null, no error will be raised.
  • The index should still function as an index, at least when no columns are null.

Implementation notes:

  • MSSQL will have to create a filtered index, filtering all (nullable) columns as IS NOT NULL.
  • Oracle will have to create a functional index where every column is actually CASE WHEN col1 IS NOT NULL AND col2 IS NOT NULL ... THEN colX ELSE NULL END for the unique constraint, and a second non-unique index to be used as the actual index.

Fulltext indexes[edit]

A "fulltext" index is an index for searching within large text fields, i.e. article search. Each database has its own way of declaring such indexes and its own code to search against them.

  • MySQL has "CREATE FULLTEXT INDEX" on CHAR, VARCHAR, or TEXT columns.
  • SQLite uses the FTS extension (disabled by default) for fulltext searches. It requires creating a virtual table, rather than creating an index on the table. However, given the inability to add PKs to such fulltext tables, it may not be advantageous to use them
  • PostgreSQL needs a column with type 'tsvector' and indexes with "USING gin(column)". There's a special function to turn text into tsvector.
  • MSSQL apparently has "CREATE FULLTEXT INDEX" much like MySQL. The table basically needs to have an integer primary key.
  • Oracle apparently does a CREATE INDEX with "INDEXTYPE IS ctxsys.context". Allowed types include CLOB and VARCHAR2.

I'm thinking it might work to have a special "fulltext" column type that automatically defines an index (named something like "{$table}_tk") without the schema having to say so, and then note that application code is going to have to branch on the database type to actually populate it and search against it.