Requests for comment/Schema update for multiple content objects per revision (MCR) in XML dumps

Request for comment (RFC)
Component: General
Creation date: 2018-08-08
Author(s): ArielGlenn, Daniel Kinzler (WMDE)
Document status: See Phabricator (T199121).

DRAFT for discussion

We need to update the XML export schema (https://www.mediawiki.org/xml/) so that it accommodates revisions with multiple content objects.

Background

Currently, each revision is associated with one piece of content, which may reside directly in the text table or may be retrieved via an address in the text table pointing to some external storage cluster.

By October 1, 2018, Multi-Content Revisions [1] is expected to be writeable on Commons (citation needed); this means that each revision may be associated with multiple pieces of content, connected via entries in the slots table. These pieces of content may, as before, reside directly in the text table or be retrievable from some external storage cluster. In either case, a reference will now be stored in the content table.

XML dumps of page content with full revision history are made available every month [2] for various uses, including bots that fix up content, researchers that do analysis, and sites that maintain local or public mirrors of Wikimedia projects. Additionally, users may export collections of pages from Wikimedia projects as XML, using Special:Export. The schema for these dumps will need to be updated so that multiple pieces of content can be provided for a revision.

MCR introduces four tables whose data will need to be added to the dumps, either directly or as part of XML formatted output: slots, content, content_models and slot_roles.

Problem

XML dumps of revision content are generated so that we re-use the previous dump content to the extent possible; this is faster than querying the database server for each content blob, and it avoids extra load on those servers. Thus, the content dumps are generated in two passes, first writing out all of the metadata for each piece of content (the so-called 'stub dumps'), and then writing out the content itself (the 'revision content dumps'). We should be sure that the new schema permits this.
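The two-pass approach described above can be sketched roughly as follows. This is only an illustration, not the dumps code: the function names `write_stubs` and `fill_content`, the dictionary layout, and keying the previous dump by revision id are all assumptions made for the example.

```python
from typing import Callable, Dict, Iterable, Iterator, List


def write_stubs(revision_ids: Iterable[int]) -> List[dict]:
    """Pass 1 (hypothetical): record metadata only, no content."""
    return [{"rev_id": rev_id} for rev_id in revision_ids]


def fill_content(
    stubs: Iterable[dict],
    previous_dump: Dict[int, str],
    fetch_from_db: Callable[[int], str],
) -> Iterator[dict]:
    """Pass 2 (hypothetical): attach content to each stub.

    Content found in the previous dump is re-used; the database
    is queried only for revisions the previous dump does not have.
    """
    for stub in stubs:
        rev_id = stub["rev_id"]
        if rev_id in previous_dump:
            text = previous_dump[rev_id]
        else:
            text = fetch_from_db(rev_id)
        yield {**stub, "text": text}


# Usage with a fake database that records which revisions were requested.
db_calls: List[int] = []


def fake_db(rev_id: int) -> str:
    db_calls.append(rev_id)
    return f"text of revision {rev_id} from db"


stubs = write_stubs([1, 2])
result = list(fill_content(stubs, {1: "text from previous dump"}, fake_db))
```

With multiple slots per revision, the lookup key would presumably need to identify a slot as well as a revision; the sketch leaves that out.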

The October 1, 2018 deadline is not far away. If this RFC were adopted, and code to generate dumps containing multi-content revisions were written and published by then without maintaining basic backwards compatibility, there would be virtually no time for dumps users to rewrite their tools or reconfigure their workflows to process the new dumps.

It would be nice if the schema treated the content in all slots identically as to format, but doing this right away means that we'd break backwards compatibility; doing it later means folks would have to update their tools twice in a short (some months) period of time. Instead we have a compromise which will make everyone just a little bit unhappy.

Discussion/background reading

Proposal

The new tables to be accounted for are:

table           fields
content         content_id, content_size, content_sha1, content_model, content_address
slots           slot_revision_id, slot_role_id, slot_content_id, slot_origin
content_models  model_id, model_name
slot_roles      role_id, role_name

Of these, the fields content_id, content_size, content_sha1 and content_model correspond to fields or attributes of the text in the existing dumps, and their values should simply be swapped in for those. The role_name corresponding to a given slot_role_id should be published, since it tracks a specific piece of content over multiple revisions. slot_origin should be published so that dumps users can easily see which pieces of content have been changed for a given revision, even in 'stubs' dumps. The rest are either duplicate information or can be ignored.

In general, id numbers associated with role names or content model names aren't useful to dump processors; we should avoid exposing those and use the full names, expecting that they will not change over time.
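As a rough sketch of that mapping, the following resolves role and content model ids to names before anything is exported. The table rows are made up (loosely following the sample values used later in this document), the in-memory dictionaries stand in for the real tables, and `export_slots` is a hypothetical helper, not an existing function. Publishing content_address as a location is an assumption based on the proposed format.

```python
# Made-up rows standing in for the content_models and slot_roles tables.
CONTENT_MODELS = {1: "wikitext", 2: "metadata"}   # model_id -> model_name
SLOT_ROLES = {1: "main", 2: "wd_entity"}          # role_id -> role_name

# Made-up rows standing in for the content table, keyed by content_id.
CONTENT = {
    305112983: {"content_size": 143, "content_model": 1,
                "content_address": "tt:305112983"},
    305113486: {"content_size": 234, "content_model": 2,
                "content_address": "tt:305113486"},
}

# Made-up rows standing in for the slots table.
SLOTS = [
    {"slot_revision_id": 308722154, "slot_role_id": 1,
     "slot_content_id": 305112983, "slot_origin": 308722154},
    {"slot_revision_id": 308722154, "slot_role_id": 2,
     "slot_content_id": 305113486, "slot_origin": 308722098},
]


def export_slots(rev_id):
    """Return the per-slot fields to publish for one revision.

    Numeric role and model ids are replaced by their names; the ids
    themselves are not included in the output.
    """
    exported = []
    for slot in SLOTS:
        if slot["slot_revision_id"] != rev_id:
            continue
        content = CONTENT[slot["slot_content_id"]]
        exported.append({
            "role": SLOT_ROLES[slot["slot_role_id"]],
            "origin": slot["slot_origin"],
            "model": CONTENT_MODELS[content["content_model"]],
            "bytes": content["content_size"],
            "location": content["content_address"],
        })
    return exported


exported = export_slots(308722154)
```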

Revision changes

Under the current XML schema, pages are written out with one or all of their revisions; ordering is not specified but we assume that ordering is by revision id. We won't need to alter anything else, so only the portion of the schema dealing with revisions is shown here.

Schema

Below, the current and proposed new schemas:

Current revision format:

    <complexType name="RevisionType">
      <sequence>
        <element name="id" type="positiveInteger" />
        <element name="parentid" type="positiveInteger" minOccurs="0" />
        <element name="timestamp" type="dateTime" />
        <element name="contributor" type="mw:ContributorType" />
        <element name="minor" minOccurs="0" maxOccurs="1" />
        <element name="comment" type="mw:CommentType" minOccurs="0" maxOccurs="1" />
        <element name="model" type="mw:ContentModelType" />
        <element name="format" type="mw:ContentFormatType" />
        <element name="text" type="mw:TextType" />
        <element name="sha1" type="string" />
      </sequence>
    </complexType>

    <complexType name="TextType">
      <simpleContent>
        <extension base="string">
          <attribute ref="xml:space" use="optional" default="preserve" />
          <!-- This allows deleted=deleted on non-empty elements, but XSD is not omnipotent -->
          <attribute name="deleted" use="optional" type="mw:DeletedFlagType" />
          <!-- This isn't a good idea; we should be using "ID" instead of "NMTOKEN" -->
          <!-- However, "NMTOKEN" is strictest definition that is both compatible with existing -->
          <!-- usage ([0-9]+) and with the "ID" type. -->
          <attribute name="id" type="NMTOKEN" />
          <attribute name="bytes" use="optional" type="nonNegativeInteger" />
        </extension>
      </simpleContent>
    </complexType>

Proposed new format:

Note that all use="optional" declarations will be removed, and a comment will be added near the top of the schema reminding folks that attributes are optional by default.

    <complexType name="RevisionType">
      <sequence>
        <element name="id" type="positiveInteger" />
        <element name="parentid" type="positiveInteger" minOccurs="0" />
        <element name="timestamp" type="dateTime" />
        <element name="contributor" type="mw:ContributorType" />
        <element name="minor" minOccurs="0" maxOccurs="1" />
        <element name="comment" type="mw:CommentType" minOccurs="0" maxOccurs="1" />
        <element name="model" type="mw:ContentModelType" />
        <element name="format" type="mw:ContentFormatType" />
        <element name="text" type="mw:TextType" minOccurs="0" maxOccurs="1" />
        <element name="content" type="mw:ContentType" minOccurs="0" maxOccurs="unbounded" />
        <!-- sha1 of the revision, a combined sha1 of content in all slots -->
        <element name="sha1" type="string" />
      </sequence>
    </complexType>

    <complexType name="ContentType">
      <sequence>
        <!-- corresponds to slot origin -->
        <element name="origin" type="positiveInteger" />
        <!-- corresponds to slot role_name; the contents in 'main' are always exposed as a text element -->
        <element name="role" type="mw:SlotRoleType" />
        <element name="model" type="mw:ContentModelType" />
        <element name="format" type="mw:ContentFormatType" />
        <element name="text" type="mw:ContentTextType" minOccurs="0" maxOccurs="1" />
      </sequence>
    </complexType>

    <simpleType name="SlotRoleType">
      <restriction base="string">
        <pattern value="[a-zA-Z][-+./a-zA-Z0-9]*" />
      </restriction>
    </simpleType>

    <complexType name="ContentTextType">
      <simpleContent>
        <extension base="string">
          <attribute ref="xml:space" default="preserve" />
          <!-- This allows deleted=deleted on non-empty elements, but XSD is not omnipotent -->
          <attribute name="deleted" type="mw:DeletedFlagType" />
          <attribute name="location" type="xsd:anyURI" />
          <attribute name="bytes" type="nonNegativeInteger" />
        </extension>
      </simpleContent>
    </complexType>

    <complexType name="TextType">
      <simpleContent>
        <extension base="string">
          <attribute ref="xml:space" default="preserve" />
          <!-- This allows deleted=deleted on non-empty elements, but XSD is not omnipotent -->
          <attribute name="deleted" type="mw:DeletedFlagType" />
          <!-- This isn't a good idea; we should be using "ID" instead of "NMTOKEN" -->
          <!-- However, "NMTOKEN" is strictest definition that is both compatible with existing -->
          <!-- usage ([0-9]+) and with the "ID" type. -->
          <attribute name="id" type="NMTOKEN" />
          <!-- This attribute should not be used until locations with schemes other than 'tt' are enabled. -->
          <attribute name="location" type="xsd:anyURI" />
          <attribute name="sha1" type="string" />
          <attribute name="bytes" type="nonNegativeInteger" />
        </extension>
      </simpleContent>
    </complexType>
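To make the SlotRoleType restriction concrete, here is a small sketch of the same check in Python. XSD patterns implicitly match the whole value, so `re.fullmatch` is used; the helper name `is_valid_slot_role` is made up for this example.

```python
import re

# Same expression as the SlotRoleType pattern in the proposed schema.
SLOT_ROLE_PATTERN = re.compile(r"[a-zA-Z][-+./a-zA-Z0-9]*")


def is_valid_slot_role(name: str) -> bool:
    """Return True if the whole name matches the SlotRoleType pattern."""
    return SLOT_ROLE_PATTERN.fullmatch(name) is not None
```

Under this pattern `main` is accepted, while names that start with a digit or contain an underscore are rejected, since neither is allowed by the character classes. A role name such as the made-up `wd_entity` used in the samples would therefore need either a different spelling or a wider pattern.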

Stubs dumps output

Below, the current and proposed new formats applied to the same revision and content in 'stubs' dump output. Values for content and slot table fields are made up for this example, to show at least one extra chunk of content in the output.

Current revision format (sample):

    <revision>
      <id>308722154</id>
      <parentid>303602635</parentid>
      <timestamp>2018-06-30T11:42:24Z</timestamp>
      <contributor>
        <username>Rudolphous</username>
        <id>128310</id>
      </contributor>
      <comment>wi</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text id="305112983" bytes="143" />
      <sha1>a9kdtqq3buy5tribez2u0ad4b6fdxq2</sha1>
    </revision>

Proposed new format (sample); note the new sha1 and location attributes in the main slot text tag:

     <revision>
       <id>308722154</id>
       <parentid>303602635</parentid>
       <timestamp>2018-06-30T11:42:24Z</timestamp>
       <contributor>
         <username>Rudolphous</username>
         <id>128310</id>
       </contributor>
       <comment>wi</comment>
       <model>wikitext</model>
       <format>text/x-wiki</format>
       <text location="tt:305112983" sha1="d74e1ddb9b916bd3dd0714d5cd4521534c5f5af9" bytes="143" />
       <sha1>a9kdtqq3buy5tribez2u0ad4b6fdxq2</sha1>
       <content>
         <origin>308722098</origin>
         <role>wd_entity</role>
         <model>metadata</model>
         <format>text/json</format>
         <text location="tt:305113486" sha1="..." bytes="234" />
       </content>
       <content>
         ...
       </content>
     </revision>
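A dump consumer might read such a stub along the following lines. This is a sketch under several assumptions: the fragment is a trimmed, well-formed copy of the sample with no XML namespace (real dump files declare one), a location is split on its first colon into a scheme and an address, and an extra slot is treated as changed in this revision when its origin equals the revision id. The helper names are made up.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the proposed stub sample, without a namespace.
STUB_REVISION = """\
<revision>
  <id>308722154</id>
  <parentid>303602635</parentid>
  <timestamp>2018-06-30T11:42:24Z</timestamp>
  <model>wikitext</model>
  <format>text/x-wiki</format>
  <text location="tt:305112983" sha1="d74e1ddb9b916bd3dd0714d5cd4521534c5f5af9" bytes="143" />
  <sha1>a9kdtqq3buy5tribez2u0ad4b6fdxq2</sha1>
  <content>
    <origin>308722098</origin>
    <role>wd_entity</role>
    <model>metadata</model>
    <format>text/json</format>
    <text location="tt:305113486" sha1="..." bytes="234" />
  </content>
</revision>
"""


def parse_location(location):
    """Split a location such as 'tt:305112983' into (scheme, address)."""
    scheme, _, address = location.partition(":")
    return scheme, address


def describe_stub(revision_xml):
    """List the slots of one stub revision with their metadata."""
    rev = ET.fromstring(revision_xml)
    rev_id = int(rev.findtext("id"))
    main = rev.find("text")  # direct child only, so this is the main slot
    slots = [{
        "role": "main",
        "location": parse_location(main.get("location")),
        "bytes": int(main.get("bytes")),
        "changed": None,  # the main text tag carries no origin element
    }]
    for content in rev.findall("content"):
        text = content.find("text")
        slots.append({
            "role": content.findtext("role"),
            "location": parse_location(text.get("location")),
            "bytes": int(text.get("bytes")),
            # Assumption: origin is the revision that introduced this content.
            "changed": int(content.findtext("origin")) == rev_id,
        })
    return slots


slots = describe_stub(STUB_REVISION)
```

In this made-up sample the wd_entity slot has an origin that differs from the revision id, so under the assumption above it is reported as unchanged in this revision.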

Revision content dumps output

Below, the current and proposed new formats applied to the same revision and content in 'revision content dump' output (values for content and slot table fields are made up for this example):

Current revision format (sample):

    <revision>
      <id>308722154</id>
      <parentid>303602635</parentid>
      <timestamp>2018-06-30T11:42:24Z</timestamp>
      <contributor>
        <username>Rudolphous</username>
        <id>128310</id>
      </contributor>
      <comment>wi</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text xml:space="preserve">{{wikidata Infobox}}
[[Category:NGC objects 3000–3999|179]]
[[Category:NGC lenticular galaxies|3179]]
[[Category:Ursa Major (constellation)]]</text>
      <sha1>a9kdtqq3buy5tribez2u0ad4b6fdxq2</sha1>
    </revision>

Proposed new format (sample); note the new sha1 attribute in the main slot text tag:

     <revision>
       <id>308722154</id>
       <parentid>303602635</parentid>
       <timestamp>2018-06-30T11:42:24Z</timestamp>
       <contributor>
         <username>Rudolphous</username>
         <id>128310</id>
       </contributor>
       <comment>wi</comment>
       <model>wikitext</model>
       <format>text/x-wiki</format>
       <text sha1="d74e1ddb9b916bd3dd0714d5cd4521534c5f5af9" xml:space="preserve">{{wikidata Infobox}}
[[Category:NGC objects 3000–3999|179]]
[[Category:NGC lenticular galaxies|3179]]
[[Category:Ursa Major (constellation)]]</text>
       <sha1>a9kdtqq3buy5tribez2u0ad4b6fdxq2</sha1>
       <content>
         <origin>308722098</origin>
         <role>wd_entity</role>
         <model>metadata</model>
         <format>text/json</format>
         <text sha1="..." xml:space="preserve">{"QID":1103379, ...}</text>
       </content>
       <content>
         ...
       </content>
     </revision>
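The backwards-compatibility aspect of this layout can be illustrated with a short reader. It is a sketch only: the fragment is a trimmed, well-formed copy of the sample without an XML namespace, and `iter_slots` is a made-up helper that presents the main slot and the extra slots in one uniform shape. A tool written for the current format, which only looks at the revision-level text element, keeps working unchanged.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the proposed revision content sample, without a namespace.
REVISION = """\
<revision>
  <id>308722154</id>
  <model>wikitext</model>
  <format>text/x-wiki</format>
  <text sha1="d74e1ddb9b916bd3dd0714d5cd4521534c5f5af9" xml:space="preserve">{{wikidata Infobox}}
[[Category:NGC objects 3000–3999|179]]
[[Category:NGC lenticular galaxies|3179]]
[[Category:Ursa Major (constellation)]]</text>
  <sha1>a9kdtqq3buy5tribez2u0ad4b6fdxq2</sha1>
  <content>
    <origin>308722098</origin>
    <role>wd_entity</role>
    <model>metadata</model>
    <format>text/json</format>
    <text sha1="..." xml:space="preserve">{"QID":1103379, ...}</text>
  </content>
</revision>
"""


def iter_slots(rev):
    """Yield (role, model, format, text) for every slot of a revision.

    The main slot comes from the revision-level elements; all other
    slots come from the content elements.
    """
    yield ("main", rev.findtext("model"), rev.findtext("format"),
           rev.findtext("text"))
    for content in rev.findall("content"):
        yield (content.findtext("role"), content.findtext("model"),
               content.findtext("format"), content.findtext("text"))


rev = ET.fromstring(REVISION)
all_slots = list(iter_slots(rev))

# What a tool written for the current format would read: main slot only.
legacy_text = rev.findtext("text")
```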