Multi-Content Revisions/Database Schema

From MediaWiki.org
Jump to navigation Jump to search

This is an overview over the database schema for representing multi-content revisions.

See the Glossary for the intended meaning of the terms used in this document.

The basic entity/relationship structure is as follows: revision <-- slot --> content --> text

For a discussion of the scalability and efficiency of this design, see Multi-Content Revisions/Content Meta-Data#Database Schema. The source code of the proposed schema is available on gerrit. For notes on the initial population of the new tables, see Multi-Content_Revisions/Schema_Migration.

Below there is a description of all the tables involved, and comments about what should be done with individual fields. The status column gives a quick indication of what changes a field may undergo, and what decisions are to be made:

Status Explanation
add needed for MCR
keep no change needed
keep* subject to change/removal by another project or idea, no change needed for MCR.
drop redundant or misleading with MCR. Should be removed, can be kept and ignored for a while, though.
drop... redundant with MCR, but keep for backwards compat for now.

slots[edit]

Field Type Ref Index Status Comment
slot_revision_id bigint unsigned rev_id PRI add slot_revision+slot_role is unique and can be used as primary.

Note: benchmark performance implications of joining a BIGINT against an INT field.

slot_role_id smallint unsigned role_id PRI add Normalized representation of the role, as a reference to the roles table. The mapping can be aggressively cached.
slot_content_id bigint(10) unsigned content_id IDX add Indexed so we can find out where which content is used.
slot_origin bigint unsigned rev_id IDX add The revision that last changed this slot's content.

Can be used to determine whether a slot's content was inherited from a previous revision by checking if slot_revision_id != slot_origin.

(Note: We originally decided against tracking the origin, and in favor of a slot_inherited flag, but went back on this decision due to the need to track the origin of content in the presence of revision deletion and undeletion)

content[edit]

Field Type Ref Index Status Comment
content_id bigint(10) unsigned PRI add Auto-increment key.
content_size int(10) unsigned add Nominal size if the content, in bogo-bytes. Used to re-calculate rev_len for new revisions.

This is not the length of the blob in bytes!

content_sha1 varbinary(32) add Needed to re-calculate rev_sha1 for new revisions.

Note: Should be nullable in case we want to switch to on-the-fly calculation to save space.

content_model smallint unsigned model_id add Normalized representation of the model, as a reference to the models table. The mapping can be aggressively cached.
content_address varbinary(255) IDX? add URL-Like content address, e.g. "tt:7681347" or "ex:DB://s15/8734/7623".

revision[edit]

Field Type Ref Status Comment
rev_id int(10) unsigned keep* Should probably become a BIGINT
rev_page int(10) unsigned page_id keep* This could be used for sharding if we had a separate contributions table. Listing user contributions the the only access pattern for this table that cannot be made to pre-select by page.
rev_text_id int(10) unsigned old_id drop Replaced by slot_rev, slot_content, and content_address
rev_comment varbinary(767) keep* Subject to the comment extraction project.
rev_user int(10) unsigned user_id keep* Subject to the actor extraction project.
rev_user_text varchar(255) keep* Subject to the actor extraction project.
rev_timestamp binary(14) keep
rev_minor_edit tinyint(3) unsigned keep
rev_deleted tinyint(3) unsigned keep
rev_len int(10) unsigned keep Calculated as the sum of the length of all slots associated with the revision. This is the nominal size of the content of the revision in bogo-bytes. Its only legitimate use is to given an approximate idea of how much information an edit added or removed.
rev_parent_id int(10) unsigned rev_id keep* BIGINT?
rev_sha1 varbinary(32) keep* Calculated from the hashes of all slots associated with the revision.

(Note: we decided to keep this for now, but may change to calculating it on the fly when generating dumps etc).

rev_content_model varbinary(32) drop... Semantically, a revision no longer has a content model, each slot may have one. We can however keep this field around and use the model of the main slot.
rev_content_format varbinary(64) drop... Semantically, a revision no longer has a content format, each slot may have one. We can just ignore the field for now. However, the benefit of storing the serialization format is dubious: it can be derived from the model, or at least be auto-detected by code associated with the model.

archive[edit]

Field Type Ref Status Comment
ar_id int(10) unsigned keep* Ideally, this would be the same as ar_rev_id. Since nothing should refer to these IDs, it should be possible to transition legacy data easily. Removing one of the two fields would be a little more tricky.

May need to become a BIGINT.

ar_namespace int(11) keep
ar_title varchar(255) keep* Subject to the title normalization project.
ar_text mediumblob keep* Apparently disused in favor of ar_text_id
ar_comment varbinary(767) keep* Subject to the comment extraction project.
ar_user int(10) unsigned user_id keep* Subject to the actor extraction project.
ar_user_text varchar(255) keep* Subject to the actor extraction project.
ar_timestamp binary(14) keep
ar_minor_edit tinyint(4) keep
ar_flags tinyblob keep* Flags for disused ar_text
ar_rev_id int(10) unsigned (rev_id) keep Does not point to an existing revision! Ideally, could be replaced by plain ar_id. But that's for another time.
ar_text_id int(10) unsigned old_id drop Replaced by slot_rev, slot_content, and content_address. Note that slot_rev refers to ar_rev_id, not to ar_id.
ar_deleted tinyint(3) unsigned keep
ar_len int(10) unsigned keep Calculated as the sum of the length of all slots associated with the revision (or just copied from rev_len)
ar_page_id int(10) unsigned (page_id) keep May not point to an existing page
ar_parent_id int(10) unsigned rev_id keep May not point to an existing revision
ar_sha1 varbinary(32) keep* see rev_sha1
ar_content_model varbinary(32) drop... Semantically, a revision no longer has a content model, each slot may have one. We can however keep this field around and use the model of the main slot.
ar_content_format varbinary(64) drop... Semantically, a revision no longer has a content format, each slot may have one. We can however just ignore this field for now. Anyway, the benefit of storing the serialization format is dubious: it can be derived from the model, or at least be auto-detected by code associated with the model.

page[edit]

Field Type Ref Status Comment
page_id int(10) unsigned keep
page_namespace int(11) keep* We may want to introduce a namespaces table, like the tables we use for normalized storage of the content model, etc.
page_title varchar(255) keep* Subject to the title normalization project.
page_restrictions tinyblob keep* Has been replaced by the page_restrictions table for a long time.
page_is_redirect tinyint(3) unsigned keep
page_is_new tinyint(3) unsigned keep
page_random double unsigned keep
page_touched binary(14) keep
page_links_updated varbinary(14) keep
page_latest int(10) unsigned rev_id keep May need to become a BIGINT
page_len int(10) unsigned keep* Redundant to the current revision's length. Whether it's useful to keep this depends on the access patterns.
page_content_model varbinary(32) drop... Semantically, a page no longer has a content model, each stream may have one. We can however keep this field around and use it as the default for the main slot, but the exact semantics of the interaction between page_content_model and rev_content_model have been causing confusion before. Also, this usually NULL.
page_lang varbinary(35) keep* This should really be defined per stream. Would need a separate table. Perhaps later. We could take that opportunity to normalize language code storage.

text[edit]

Field Type Ref Status Comment
old_id int(10) unsigned keep May need to grow into a BIGINT if we keep using it
old_text mediumblob keep Content blob, or some sort of reference to the content blob, depending on old_flags
old_flags tinyblob keep

Note that MCR can use ExternalStore directly by using the appropriate URL in content_address, bypassing the text table entirely.

Also, it was suggested that the text table could become the content table, instead of being referenced from the content table.

Name Tables[edit]

All name tables (roles, models, formats, namespaces, etc) have the same structure:

Field Type Ref Status Comment
xxx_id smalint unsigned keep
xxx_name varbinary(255) keep The canonical (human readable) name. Normalization rules and character set restrictions may vary.

The table must allow lookups in both directions. Both columns are unique. The id is auto-incrementing.

The mapping defined by the table may be cached aggressively. The mapping should update automatically when attempting to look up the id for an unknown name.