Requests for comment/Abstract table definitions

From mediawiki.org
Request for comment (RFC)
Abstract table definitions
Component General
Creation date
Author(s) Dantman
Document status stalled
Seek comment on DSL details. -- Tim Starling (talk) 04:54, 19 July 2013 (UTC)[reply]

MediaWiki supports multiple database engines. Our query system is basically entirely abstract and works on any database engine.

Our table creation and updating code on the other hand is a mess. We write incompatible raw sql forcing us to write the same thing multiple times in different ways for different database engines. As a result we have a bad track of it being hard to run MediaWiki on other database engines and because extensions that define their own tables also have to write custom extra code to support other databases very few extensions with custom data tables support anything besides MySQL.

To fix that this is a proposal for an abstract language for defining tables, indexes, alters, etc... that; Can be sanely written and read. Is not raw sql (we'll probably PEG parse it) and can be used to build the database for any database engine we support. And can be used by extensions too to define their database pieces abstractly so the extension will work in other databases. It also uses abstractly defined types instead of relying on mysql types so that in the future we can start using data types that actually suit the database engine best (eg: Being able to use BOOL in some database engines instead of an INT type just because we use a TINYINT in MySQL).

Examples[edit]

table externallinks(el) {
	from reference(page.page_id) default(0),
	to string,
	index string,

	@index from (from, to(40));
	@index to (to(60), from);
	@index index (index(60));
}

table hitcounter {
	@ENGINE(HEAP)
	@MAX_ROWS(25000)

	hc_id uint
}

table logging(log) {
	log_id rowid,
	log_type string(32) default(''),
	log_action string(32) default(''),
	log_timestamp timestamptz default('19700101000000'),
	log_user reference(user.user_id) default(0),
	log_user_text string(255) default(''),
	log_namespace int default(0),
	log_title string(255) default(''),
	log_page uint nullable,
	log_comment unicode(255) default(''),
	log_params string,
	log_deleted uint(tiny) default(0),

	@index "type_time" (log_type, log_timestamp);
	@index "user_time" (log_user, log_timestamp);
	@index "page_time" (log_namespace, log_title, log_timestamp);
	@index "times" (log_timestamp);
	@index "log_user_type_time" (log_user, log_type, log_timestamp);
	@index "log_page_id_time" (log_page,log_timestamp);
	@index "type_action" (log_type, log_action, log_timestamp);
}

table page(page) {
	id rowid,
	namespace int,
	title string(255),
	restrictions string(tiny),
	counter uint(big) default(0),
	is_redirect bool default(false),
	is_new bool default(false),
	random udouble,
	touched timestamptz,
	latest reference(revision.rev_id),
	len int,

	@unique "name_title" (page_namespace,page_title);
	@index "page_random" (page_random);
	@index "page_len" (page_len);
	@index "page_redirect_namespace_len" (page_is_redirect, page_namespace, page_len);
}

table revision(rev) {
	@MAX_ROWS(10000000)
	@AVG_ROW_LENGTH(1024)

	id rowid,
	page reference(page.page_id),
	text_id uint,
	comment string(tiny),
	user reference(user.user_id),
	user_text string(255),
	timestamp timestamptz,
	minor_edit bool(old),
	deleted bool(old),
	len int nullable,
	parent_id uint nullable,
	sha1 string(32),

	@unique "rev_page_id" (page, id);
	@index "rev_timestamp" (timestamp);
	@index "page_timestamp" (page, timestamp);
	@index "user_timestamp" (user, timestamp);
	@index "usertext_timestamp" (user_text, timestamp);
	@index "page_user_timestamp" (page, user, timestamp);
}

table text(old) {
	id rowid,
	text string(medium),
	flags string(tiny),
}

table pagelinks(pl) {
	from reference(page.page_id) default(0),
	namespace int,
	title string(255) default(''),

	@unique from (from, namespace, title);
	@unique namespace (namespace,title,from);
}

Syntax[edit]

# A table is defined with one of these blocks
# tablename will be prefixed and converted with tableName()
table tablename { ... }
# A table name can also be specified in quotes to stop it from being prefixed and converted
table "rawtablename" { ... }
# A column name prefix can be defined by doing this you can use things like `id` instead of `rev_id`
table table(t_) {
	# @AT rules define table options that end up at the end of the CREATE TABLE definition
	# normally without these the table options in LocalSettings.php are just used
	# Most databases besides MySQL just ignore these
	@ENGINE(MyISAM)

	# A column definition always starts with a column name followed by a type
	# rowid is a special type referring to an auto-incremented unsigned integer set as the primary key
	# rowid does not need to be followed by "primary"
	id rowid,
	# Defaults are specified by a default('')
	data string(255) default(''),
	# All columns are NOT NULL by default, you must specify nullable to remove the NOT NULL constraint
	extra string(32) nullable,
	# Simple single column indexes can be defined with "primary", "unique" and "indexed" (- maybe we'll drop unique and indexed)
	key string(20) unique,

	# Unique indexes can be defined with @unique.
	@unique key (key);
	# Normal indexes can be defined with @index.
	@index id_data (id, data(32));
	# Normally the index name is prefixed with the prefix but you can use quotes just like with table names to stop that
	@index "table_idextra" (id, extra);
}
# A named enumerated type. This is here so pgsql can use CREATE TYPE in some cases. Other databases like MySQL
# just remember the values and use an ENUM.
type typename ('...', '...');
# An index defined outside of a table definition.
# Becase there is no prefix here quotes must be used to avoid ambiguity.
# We could drop the @, do something like `index name on tablename (...);` but on second thought using the same syntax for
# indexes in and outside of table definitions with a very simple [tablename] for outside makes this nice and simple to understand.
@index[tablename] "t_id" ("t_id");

# Alter a table in many ways
alter table foo(f) {
	# Drop the id column
	-id,
	# Modify data's type
	~data bytes,
	# Add a new extra column after data
	+extra bytes after(data),
	# Rename ts to timestamp and also make it nullable
	~ts -> time timestamp nullable,

	# Drop the key index
	-@index key;
	# Make key the primary key
	+@primary (key);
	# Add an index on extra
	+@index extra (extra(80));
}

# Occasionally migrations need to update or insert a bit of data for that we have some extra syntax
# These just become simple $db->update() and $db->insert() calls.
update foo { f_timestamp = NULL } where { f_timestamp = '0' };

# A single key based insert uses the same ='s syntax as an update
insert[ignore] foo { key = 'foo', data = '...' };

# If you have a lot of data you can use this syntax
insert foo (key, data) values (
	('bar', '...'),
	('baz', '...')
);

# Rename a table
rename table foo -> bar;

#Drop a table
drop table bar;

Types[edit]

Type Description MySQL PGSQL SQLite
rowid [Special] The row id for a table, typically an auto-incremented unsigned integer set as the primary key int unsigned PRIMARY KEY AUTO_INCREMENT INTEGER PRIMARY KEY DEFAULT nextval('{tablename}_{colname}_seq')
reference(table.colname) [Special] Uses the same type as rowid and in some databases adds a reference/foreign key int unsigned INTEGER REFERENCES table(col) ???
bool(old) or bool A boolean. The "(old)" triggers our current behaviour of always using a {tiny,small}int for bools. tinyint unsigned smallint or bool
int or uint A normal integer. int is signed while uint is unsigned. int [unsigned] INTEGER
int(tiny) or uint(tiny) A small integer, no more than 0-255 for unsigned tinyint [unsigned] smallint
int(big) or uint(big) A bigint. bigint [unsigned] BIGINT
double or udouble A real/double sized floating-point number. real [unsigned] NUMERIC(15,14) ?
char(##) or unichar(##) A fixed-length string. In MySQL we use binary instead of char for char. Using unichar will force the use of char in MySQL. binary(##) or char(##) TEXT
string(##) or unicode(##) A variable length string. In MySQL we use varbinary instead of varchar for string. Using unicode will force the use of varchar in MySQL. varbinary(##) or varchar(##) TEXT
string(tiny) or unicode(tiny) A small string no larger than 256, uses blob types. tinyblob or tinytext TEXT
string or unicode A big block of textual data. blob or text TEXT
string(large) or unicode(large) mediumblob or mediumtext TEXT
string(huge) or unicode(huge) bigblob or bigtext TEXT
bytes A large block of raw binary data. (used by objectcache and image metadata) mediumblob BYTEA
timestamp A timestamp. binary(14) TIMESTAMPTZ
timestamp(hack) A hack timestamp type added for categorylinks that uses mysql's timestamp type instead of using our binary pattern. timestamp TIMESTAMPTZ
datetime A slight-hack to make MySQL use datetime for objectcache. datetime TIMESTAMPTZ
enum(...) An enumerated type. ENUM(...) TEXT
type(thename) A predefined named enumerated type. Uses ENUM in some databases like MySQL but CREATE TYPE in some databases like pgsql. ENUM(...) thename

ToDo[edit]

  • For REFERENCES we need to figure out how to declare all the different things our PostgreSQL does after it:
    • ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED
    • ON DELETE RESTRICT DEFERRABLE INITIALLY DEFERRED
    • ON DELETE SET NULL DEFERRABLE INITIALLY DEFERRED
    • ON DELETE CASCADE ON UPDATE CASCADE DEFERRABLE INITIALLY DEFERRED
  • The pgsql code uses a lot of strange code and procedures. We need to pick out what we want to expose abstractly and what we leave as custom pgsql code.
  • indexes
  • Alter table
  • Find a better way to handle the schema oddities we have than creating special types just for them. ie: Stick with a standard set of types but add some overrides for oddities.
  • Rethink the string/unicode types. Rather than the idea of varbinary vs. varchar these types were based on, there are actually strong differences between varbinary, varchar binary, and varchar.
  • Instead of debating between "@index ..." and something like "index name on tablename ..." perhaps new indexes can be done inside of an alter table block even though adding an index is done with a different sql command.

See also[edit]