Requests for comment/Schema update for multiple content objects per revision (MCR) in XML dumps

WIP WIP WIP

We need to update the XML export schema (https://www.mediawiki.org/xml/) so that it accommodates Multi-Content Revisions (MCR), in which a single revision can carry more than one content object.

Background
Currently, each revision is associated with one piece of content, which may reside directly in the text table or may be retrieved via an address in the text table pointing to some external storage cluster.

By October 1, 2018, Multi-Content Revisions is expected to be writeable on Commons (citation needed); this means that each revision may be associated with multiple pieces of content, connected via entries in the slots table. These pieces of content may, as before, reside directly in the text table or be retrievable from some external storage cluster. In either case, a reference will now be stored in the content table.

XML dumps of page content with full revision history are made available every month for various uses, including bots that fix up content, researchers that do analysis, and sites that maintain local or public mirrors of Wikimedia projects. The schema for these dumps will need to be updated so that multiple pieces of content can be provided for a revision.

MCR introduces four tables that will need to be added to the dumps, either directly or as part of the XML-formatted output: slots, content, content_models and slot_roles.
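The relationships described above (a revision connected through slot entries to content entries, each of which holds a reference to the stored blob) can be sketched as a small data model. This is only an illustration under stated assumptions: the class and field names are simplified and hypothetical, they are not the actual column names, and the address strings are placeholders rather than a real address format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Content:
    content_id: int
    model: str    # name that would be resolved via content_models
    address: str  # opaque reference to a text row or external storage


@dataclass
class Slot:
    role: str        # name that would be resolved via slot_roles, e.g. "main"
    content_id: int  # points at a Content entry


@dataclass
class Revision:
    rev_id: int
    slots: List[Slot] = field(default_factory=list)


def contents_for(rev: Revision, contents: Dict[int, Content]) -> Dict[str, Content]:
    """Map each slot role of a revision to its content entry."""
    return {slot.role: contents[slot.content_id] for slot in rev.slots}


# Made-up values, for illustration only.
contents = {
    1: Content(1, "wikitext", "address-of-blob-1"),
    2: Content(2, "json", "address-of-blob-2"),
}
rev = Revision(100, [Slot("main", 1), Slot("extra", 2)])
by_role = contents_for(rev, contents)
```

A dump writer working from such a model would need to emit one content entry per slot, which is the case the current schema cannot express.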

Problem
XML dumps of revision content are generated so that we re-use the previous dump content to the extent possible; this is faster than querying the database server for each content blob, and it avoids extra load on those servers. Thus, the content dumps are generated in two passes, first writing out all of the metadata for each piece of content (the so-called 'stub dumps'), and then writing out the content itself (the 'revision content dumps'). We should be sure that the new schema permits this.
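The two-pass approach described above can be sketched roughly as follows. This is a minimal illustration, not the actual dumps code: the function names, the dictionary layout, and the choice to key previous-dump content by (revision id, slot role) are all assumptions made for the example.

```python
from typing import Callable, Dict, Iterable, List, Tuple

SlotKey = Tuple[int, str]  # (revision id, slot role); hypothetical key choice


def write_stubs(slots: Iterable[SlotKey]) -> List[Dict[str, object]]:
    """Pass 1: emit metadata only, one entry per piece of content."""
    return [{"rev_id": rev_id, "role": role} for rev_id, role in slots]


def fill_content(
    stubs: List[Dict[str, object]],
    previous_dump: Dict[SlotKey, str],
    fetch_from_db: Callable[[SlotKey], str],
) -> List[Dict[str, object]]:
    """Pass 2: attach text, re-using the previous dump where possible."""
    filled = []
    for stub in stubs:
        key = (stub["rev_id"], stub["role"])
        text = previous_dump.get(key)
        if text is None:
            # Not in the previous dump, so fall back to the database.
            text = fetch_from_db(key)
        filled.append({**stub, "text": text})
    return filled


# Made-up data: revision 100 was in the previous dump, revision 101 is new.
previous = {(100, "main"): "old main text"}
db_calls: List[SlotKey] = []


def fake_fetch(key: SlotKey) -> str:
    db_calls.append(key)
    return "fetched %d/%s" % key


stubs = write_stubs([(100, "main"), (101, "main"), (101, "extra")])
result = fill_content(stubs, previous, fake_fetch)
```

In this sketch only the two pieces of content for revision 101 reach the stand-in database function. For the second pass to work this way, the stubs must identify every piece of content in a revision, which is why the new schema has to carry per-slot metadata in the stub output.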

The October 1, 2018 deadline is close. If this RFC were adopted, and code to generate dumps containing multi-content revisions were written and published by then without maintaining basic backwards compatibility, dumps users would have virtually no time to rewrite their tools or reconfigure their workflows for the new dumps.

It would be nice if the schema treated the content in all slots identically as to format. Since backwards compatibility is desired for the short term, we may want two schemas: first, a transitional schema that could be processed by any tool that ignores unknown tags and still extracts content from the 'main slot', and then, after a period of some months, a final schema that does not provide backwards compatibility but formats content in all slots in the same fashion.
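To illustrate what "a tool that ignores unknown tags" might look like, here is a minimal sketch of a tolerant reader. The sample XML is invented for this example and is not the proposed transitional format: in particular, the element name hypothetical_extra_slot is made up, and real dump files declare an XML namespace that this sample omits.

```python
import xml.etree.ElementTree as ET

# Invented sample: main-slot content sits where existing tools expect it,
# and additional content is wrapped in an element they do not know about.
SAMPLE = """<revision>
  <id>100</id>
  <timestamp>2018-09-01T00:00:00Z</timestamp>
  <text>main slot text</text>
  <hypothetical_extra_slot>
    <text>other slot text</text>
  </hypothetical_extra_slot>
</revision>"""

KNOWN_TAGS = {"id", "timestamp", "text"}


def read_revision(xml_text: str) -> dict:
    """Read the tags this tool knows about and skip everything else."""
    rev = ET.fromstring(xml_text)
    # Only direct children are examined, so a <text> nested inside an
    # unknown element is never mistaken for the main-slot text.
    return {child.tag: (child.text or "") for child in rev if child.tag in KNOWN_TAGS}


parsed = read_revision(SAMPLE)
```

A reader written this way keeps working against such input, while a tool that rejects unknown elements or validates strictly against the old schema would not.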

Discussion/background reading

 * Multi-Content_Revisions/Dumps
 * User:ArielGlenn/MCR_and_dumps
 * User_talk:ArielGlenn/MCR_and_dumps

Proposal
MOSTLY PLACEHOLDERS, IT'S BEING DRAFTED RIGHT NOW

Under the current XML schema, pages are written out with one or all of their revisions; ordering is not specified but we assume that ordering is by revision id. We won't need to alter anything else, so only the portion of the schema dealing with revisions is shown here.

Schema
The current, proposed transitional and proposed final schemas follow:

Current revision format:

Proposed transitional format:

Proposed final format:

Stubs dumps output
The samples below show 'stubs' dump output for the same revision and content under the current, proposed transitional and proposed final schemas. Values for content and slot table fields are made up for this example, so that the output includes at least one extra chunk of content:

Current revision format (sample):

Proposed transitional format (sample):

Proposed final format (sample):

Revision content dumps output
The samples below show 'revision content dump' output for the same revision and content under the current, proposed transitional and proposed final schemas. Values for content and slot table fields are made up for this example:

Current revision format (sample):

Proposed transitional format (sample):

Proposed final format (sample):