Talk:Requests for comment/Schema update for multiple content objects per revision (MCR) in XML dumps

Notes on reviewing the new schema

EpochFail (talkcontribs)

I'm looking at https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/master/docs/export-0.11.xsd#241

A few notes:

  • The "sha1" field seems to be missing from the new ContentTextType that appears inside ContentType, although it shows up in the examples on this page.
  • It appears that every <content> tag (ContentTextType) has a "deleted" field. Is it possible to delete individual slots, or is the deleted status just duplicated across all of the slots?
  • Can we add DELETED_RESTRICTED to the schema while we're refactoring?

A general note:

It's probably too late for this, but I would find the following change way more intuitive:

original:
<revision>
  <id>123</id>
  <text>Some text</text>
  <sha1>abc123</sha1>
</revision>

current:
<revision>
  <id>123</id>
  <text sha1="abc123">Some text</text>
  <sha1>ebf234</sha1>
  <content>
    <role>wd_entity</role>
    <format>text/json</format>
    <text sha1="cc23de">{"QID":1103379, ...}</text>
  </content>
</revision>

proposed:
<revision>
  <id>123</id>
  <text>Some text</text>
  <sha1>abc123</sha1>
  <slots sha1="ebf234">
    <content role="wd_entity" sha1="cc23de" format="text/json">{"QID":1103379, ...}</content>
  </slots>
</revision>

This strategy preserves backwards compatibility and puts the slots-level sha1 in a more obvious location.
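
To make the backwards-compatibility point concrete, here is a minimal Python sketch of a reader for the proposed layout. It assumes the element and attribute names from the proposed example (slots, content, role, sha1, format), none of which are in a published schema; read_revision is a hypothetical helper, not part of any existing tool.

```python
import xml.etree.ElementTree as ET

# Sample taken from the "proposed" example; the slot body is kept as written there.
PROPOSED = """\
<revision>
  <id>123</id>
  <text>Some text</text>
  <sha1>abc123</sha1>
  <slots sha1="ebf234">
    <content role="wd_entity" sha1="cc23de" format="text/json">{"QID":1103379, ...}</content>
  </slots>
</revision>
"""


def read_revision(xml_text):
    """Return (main, slots_sha1, extra_slots) for one <revision> element."""
    rev = ET.fromstring(xml_text)
    # These three lookups are what a pre-MCR reader would already do.
    main = {
        "id": rev.findtext("id"),
        "text": rev.findtext("text"),
        "sha1": rev.findtext("sha1"),
    }
    # Everything MCR-specific lives under <slots>; an old reader never looks here.
    slots_el = rev.find("slots")
    slots_sha1 = None
    extra_slots = {}
    if slots_el is not None:
        slots_sha1 = slots_el.get("sha1")
        for content in slots_el.findall("content"):
            extra_slots[content.get("role")] = {
                "sha1": content.get("sha1"),
                "format": content.get("format"),
                "text": content.text,
            }
    return main, slots_sha1, extra_slots
```

An old script that only reads id, text and sha1 would get the same values as before, while a newer one can pick up the extra slots from the second and third return values.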

EpochFail (talkcontribs)

One other note. It appears there's a new "id" attribute for the old <text> tag but it doesn't appear in the new <content> tag. Should the <content> tag have a sub-tag for "id"? Or maybe the new <text> tag should have an attribute? I'm unclear why "format" is a new tag but "sha1" remains an attribute of the new <text> tag.

<!-- This isn't a good idea; we should be using "ID" instead of "NMTOKEN" -->
<!-- However, "NMTOKEN" is strictest definition that is both compatible with existing -->
<!-- usage ([0-9]+) and with the "ID" type. -->
<attribute name="id" type="NMTOKEN" />
Reply to "Notes on reviewing the new schema"
(talkcontribs)

Writing from a Wikimedia Commons perspective, could someone confirm that the use of "metadata" in this RfC is meant as structured data or data from Wikidata, not the "metadata" that exists for files on the database which is extracted from EXIF data?

197.218.87.249 (talkcontribs)

Technically, it refers to any content stored in the slots described in Requests for comment/Multi-Content Revisions. Such data can be generated by an extension such as Wikibase, or even by MediaWiki itself. Commons currently obtains this data from Wikidata and from the Wikibase instance installed on Commons itself.

Media metadata such as Exif does not seem to get dumped at all (https://dumps.wikimedia.org).

197.218.87.249 (talkcontribs)

Technically, it can probably be found in the Elasticsearch dumps, but that's unrelated to this RFC.

Reply to "Metadata"
Tgr (WMF) (talkcontribs)

The main slot's SHA1 should probably also be published somewhere. Also, reusers running old scripts and relying on B/C should be mindful that the semantics of the (revision) SHA1 change somewhat: in the past, a change in the revision SHA1 implied a change in the main slot content, and now it does not necessarily do so. In addition, <sha1> = sha1(<text>) no longer holds, in case someone used that for some kind of error detection.
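
For anyone who did use that equality as an error check, a rough Python sketch of the check follows. It assumes the dump stores SHA-1 values in base 36, zero-padded to 31 characters; base36_sha1 and text_matches_revision_sha1 are hypothetical helper names chosen for illustration.

```python
import hashlib

_DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def base36_sha1(text):
    """SHA-1 of the UTF-8 encoded text, as base 36 padded to 31 characters.

    The base-36, 31-character encoding is an assumption about the dump format.
    """
    n = int(hashlib.sha1(text.encode("utf-8")).hexdigest(), 16)
    out = ""
    while n:
        n, r = divmod(n, 36)
        out = _DIGITS[r] + out
    return out.rjust(31, "0")


def text_matches_revision_sha1(text, revision_sha1):
    """True if <sha1> equals sha1(<text>), which held before MCR."""
    return base36_sha1(text) == revision_sha1
```

Under the new semantics, a False result from text_matches_revision_sha1 no longer means the text is corrupted; it may only mean the revision has additional slots contributing to its SHA1.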

Reply to "Main slot SHA1"
Duesentrieb (talkcontribs)

The proposed dump format is still using numeric text IDs. That cannot be guaranteed to work, because text blobs are now identified by URL-like blob addresses: "tt:12345" is the address of text row 12345, and we may start using "ext:DB:..." for ExternalStore soon.

So, instead of <text id="305112983" bytes="143" /> we need to use <text id="tt:305112983" bytes="143" />. The numeric form could still be supported for backwards compatibility, with the prefix "tt:" being assumed if none is given.
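
The fallback rule described above (a bare number is read as a "tt:" address) could be sketched in Python like this. normalize_blob_address is a hypothetical name, and treating anything containing a colon as an already-prefixed address is an assumption made for the sketch, not something the schema defines.

```python
import re


def normalize_blob_address(text_id):
    """Map a <text id="..."> value to a blob address.

    A purely numeric ID is assumed to mean text row N, so "tt:" is prepended.
    Values that already carry a scheme prefix are returned unchanged.
    """
    text_id = text_id.strip()
    if re.fullmatch(r"[0-9]+", text_id):
        return "tt:" + text_id
    if ":" in text_id:
        # Assumption: any prefixed value is already a blob address.
        return text_id
    raise ValueError("not a numeric ID or a prefixed blob address: %r" % text_id)
```

With this, both normalize_blob_address("305112983") and normalize_blob_address("tt:305112983") give "tt:305112983", so old and new dumps can be handled by the same reader.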

Tgr (WMF) (talkcontribs)

Or it could be hidden as an internal detail. That's a B/C break, but it seems like some kind of break is necessary anyway? The blob IDs do not seem to serve any useful purpose that's not already served by the sha1.

Reply to "Blob Addresses"