Requests for comment/Schema update for multiple content objects per revision (MCR) in XML dumps

DRAFT for discussion

We need to update the XML export schema (https://www.mediawiki.org/xml/) so that it accommodates revisions with multiple content objects.

Background
Currently, each revision is associated with one piece of content, which may reside directly in the text table or may be retrieved via an address in the text table pointing to some external storage cluster.

By October 1, 2018, Multi-Content Revisions is expected to be writeable on Commons (citation needed); this means that each revision may be associated with multiple pieces of content, connected via entries in the slots table. These pieces of content may, as before, reside directly in the text table or be retrievable from some external storage cluster. In either case, a reference will now be stored in the content table.

XML dumps of page content with full revision history are made available every month for various uses, including bots that fix up content, researchers that do analysis, and sites that maintain local or public mirrors of Wikimedia projects. Additionally, users may export collections of pages from Wikimedia projects as XML, using Special:Export. The schema for these dumps will need to be updated so that multiple pieces of content can be provided for a revision.

MCR introduces four tables whose data will need to be added to the dumps, either directly or as part of the XML-formatted output: slots, content, content_models and slot_roles.

Problem
XML dumps of revision content are generated so that we re-use the previous dump content to the extent possible; this is faster than querying the database server for each content blob, and it avoids extra load on those servers. Thus, the content dumps are generated in two passes, first writing out all of the metadata for each piece of content (the so-called 'stub dumps'), and then writing out the content itself (the 'revision content dumps'). We should be sure that the new schema permits this.
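
The second pass described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual dump code: the Stub record, the use of a content id as the lookup key, and the fetch_from_db callback are hypothetical names chosen for the example.

```python
from typing import Callable, Dict, Iterable, Iterator, NamedTuple, Tuple


class Stub(NamedTuple):
    """Metadata written by the first ('stubs') pass. Hypothetical shape."""
    rev_id: int
    content_id: int  # assumed key for locating the content blob


def fill_content(
    stubs: Iterable[Stub],
    previous_dump: Dict[int, str],
    fetch_from_db: Callable[[int], str],
) -> Iterator[Tuple[int, str]]:
    """Second pass: attach content to each stub.

    Content already present in the previous dump is re-used; the database
    (here a hypothetical callback) is queried only for content that is missing.
    """
    for stub in stubs:
        text = previous_dump.get(stub.content_id)
        if text is None:
            text = fetch_from_db(stub.content_id)
        yield stub.rev_id, text
```

Whatever form the new schema takes, the stubs output has to carry enough per-slot metadata for a lookup of this kind to work for every piece of content, not only for the main slot.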

The October 1, 2018 deadline is not far away. If this RFC were adopted, and code were written and published by then to generate dumps containing multi-content revisions without maintaining basic backwards compatibility, there would be virtually no time for dumps users to rewrite their tools or reconfigure their workflows to process the new dumps.

It would be nice if the schema treated the content in all slots identically as to format. Since backwards compatibility is desired for the short term, we may want two schemas: first, a transitional schema which could be processed by any tool that ignores unknown tags, still extracting content from the 'main slot', and then, after a period of some months, a final schema that does not provide backwards compatibility but formats content in all slots in the same fashion.
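
To illustrate what "a tool that ignores unknown tags" means in practice, here is a minimal sketch of such a reader. The sample XML is made up for this illustration: the content and role elements are placeholders, not the proposed schema, and the XML namespace used in real dumps is omitted for brevity.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical revision: the main slot text stays where existing tools expect
# it, and an extra slot is wrapped in an element that older tools do not know.
SAMPLE_REVISION = """<revision>
  <id>101</id>
  <text>Main slot wikitext</text>
  <content>
    <role>extra</role>
    <text>Extra slot data</text>
  </content>
</revision>"""


def main_slot_text(revision_xml: str) -> Optional[str]:
    """Return the main slot text, ignoring any elements the tool does not know."""
    revision = ET.fromstring(revision_xml)
    # find() with a bare tag name matches direct children only, so a <text>
    # nested inside an unknown element is never picked up by mistake.
    node = revision.find("text")
    return node.text if node is not None else None


print(main_slot_text(SAMPLE_REVISION))
```

A reader written this way keeps working as long as the main slot content stays in its current position, which is the property the transitional schema would need to preserve.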

Discussion/background reading

 * Multi-Content_Revisions/Dumps
 * User:ArielGlenn/MCR_and_dumps
 * User_talk:ArielGlenn/MCR_and_dumps

Proposal
The new tables to be accounted for are:

Of these, the fields content_id, content_size, content_sha1 and content_model correspond to fields or attributes of the text in the existing dumps and their information should simply be swapped in for those; slot_role_id should be published since it tracks a specific piece of content over multiple revisions; slot_origin should be published so that dumps users can easily see which pieces of content have been changed for a given revision, even for 'stubs' dumps; and the rest are either duplicate information or can be ignored.
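
As a sketch of how a dumps user might use slot_origin, the following assumes that slot_origin holds the id of the revision in which that slot's content was introduced, so that a slot whose origin equals the current revision id was changed in that revision. The SlotStub record and its field names are hypothetical and only loosely mirror the slots table.

```python
from typing import Iterable, List, NamedTuple


class SlotStub(NamedTuple):
    """Per-slot metadata as it might appear in a stubs dump. Hypothetical shape."""
    role_id: int      # corresponds to slot_role_id
    content_id: int   # corresponds to content_id
    origin: int       # corresponds to slot_origin (assumed: originating revision id)


def changed_roles(rev_id: int, slots: Iterable[SlotStub]) -> List[int]:
    """Return the role ids of slots whose content was introduced by rev_id."""
    return [slot.role_id for slot in slots if slot.origin == rev_id]


# Made-up values: revision 205 changed the slot with role 2 and inherited
# the slot with role 1 unchanged from revision 198.
slots_of_205 = [SlotStub(1, 9001, 198), SlotStub(2, 9107, 205)]
print(changed_roles(205, slots_of_205))
```

Because this check needs only metadata, it would work on the 'stubs' dumps without reading any content.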

Possible header changes
We might decide that slot role names are important to publish, in which case they could be added to the siteinfo header after the namespaces element. We might similarly decide to publish content model names together with their ids in the siteinfo header, and to use content model id numbers in the main body of the XML dumps. We cannot just provide dumps of those tables separately, because the Special:Export facility should provide all of the information needed about the pages being exported, and this will necessarily be in a single XML file.

Changes to the header in order to include slot role name and content model name information, if we decide this is a good idea:

Current header format:

New format:

Revision changes
Under the current XML schema, pages are written out with one or all of their revisions; ordering is not specified but we assume that ordering is by revision id. We won't need to alter anything else, so only the portion of the schema dealing with revisions is shown here.

Schema
Below, the current, proposed transitional and proposed final schemas:

Current revision format:

Proposed transitional format:

Proposed final format:

Stubs dumps output
Below, the current, proposed transitional and proposed final schemas applied to the same revision and content, 'stubs' dump output (values for content and slot table fields are made up for this example, to show at least one extra chunk of content in the output):

Current revision format (sample):

Proposed transitional format (sample):

Proposed final format (sample):

Revision content dumps output
Below, the current, proposed transitional and proposed final schemas applied to the same revision and content, 'revision content dump' output (values for content and slot table fields are made up for this example):

Current revision format (sample):

Proposed transitional format (sample):

Proposed final format (sample):