User talk:ArielGlenn/MCR and dumps

Duesentrieb (talkcontribs)

I agree with what's in the "XML" section.

I'd also prefer to represent all slots in the same way; I have now added an alternative proposal to Multi-Content_Revisions/Dumps.

For compatibility with existing consumers of dumps: I expect that most of these consumers just grab the data from the first <text> tag. That will continue to work exactly as before. With the new proposal, there will simply be more <text> tags, which will probably be ignored by legacy code, and more attributes on the <text> tag, which will also be ignored by old code.
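
To illustrate the point about legacy consumers, here is a minimal sketch in Python. The sample fragment is made up for illustration: the `role` attribute and the second <text> element are assumptions, not the agreed schema. A consumer that only asks for the first <text> child gets the same result whether or not further <text> elements follow.

```python
import xml.etree.ElementTree as ET

# Hypothetical revision fragment. The "role" attribute and the extra
# <text> element are assumptions for illustration, not the final schema.
SAMPLE = """
<revision>
  <id>123</id>
  <text role="main" bytes="11">Hello world</text>
  <text role="mediainfo" bytes="2">{}</text>
</revision>
"""


def legacy_first_text(revision_xml):
    """Mimic an old consumer: take the first <text> child, ignore the rest."""
    rev = ET.fromstring(revision_xml)
    first = rev.find("text")  # find() returns the first matching child
    return first.text if first is not None and first.text else ""


def all_texts(revision_xml):
    """A slot-aware consumer would iterate over every <text> child instead."""
    rev = ET.fromstring(revision_xml)
    return [(t.get("role"), t.text or "") for t in rev.findall("text")]


if __name__ == "__main__":
    print(legacy_first_text(SAMPLE))
    print(all_texts(SAMPLE))
```

This only holds as long as the main slot is emitted first, which is what the paragraph above assumes.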

Removing the <model> tag from the revision is a breaking change. Perhaps we can just keep the main slot's model there for a while, but include a comment that this is deprecated.

Reply to "XML schema"

Dump process with MCR

Duesentrieb (talkcontribs)

I agree with the assessment in the "dumps" section. The idea is to use "blob addresses" wherever "text ids" are currently used. Instead of looking up text in the text table, blob addresses are resolved through a BlobStore.

For Wikimedia sites, which use ExternalStore, the access pattern of fetchText.php was: select the row from the text table, then load from ExternalStore. This behavior will stay exactly the same when using BlobStore.

The dumpTextPass script will have to join three tables (revision, slots, and content) instead of using just the revision table, but that should not have much of an impact.
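
As a rough sketch of the lookup described above, the following Python snippet uses an in-memory SQLite database. All column names, the address strings, and the dict standing in for a BlobStore are assumptions made for illustration; they are not the actual MediaWiki schema or API.

```python
import sqlite3


def build_demo_db():
    """Create toy revision/slots/content tables (column names are hypothetical)."""
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE revision (rev_id INTEGER PRIMARY KEY);
        CREATE TABLE content (content_id INTEGER PRIMARY KEY,
                              content_address TEXT);
        CREATE TABLE slots (slot_revision_id INTEGER,
                            slot_role TEXT,
                            slot_content_id INTEGER);
        INSERT INTO revision VALUES (1), (2);
        INSERT INTO content VALUES (10, 'addr:100'), (11, 'addr:101'),
                                   (12, 'addr:102');
        INSERT INTO slots VALUES (1, 'main', 10), (2, 'main', 11),
                                 (2, 'mediainfo', 12);
        """
    )
    return db


def blob_addresses(db, rev_id):
    """Join revision, slots and content to get one blob address per slot."""
    rows = db.execute(
        """
        SELECT s.slot_role, c.content_address
        FROM revision r
        JOIN slots s ON s.slot_revision_id = r.rev_id
        JOIN content c ON c.content_id = s.slot_content_id
        WHERE r.rev_id = ?
        ORDER BY s.slot_role
        """,
        (rev_id,),
    )
    return rows.fetchall()


def fetch_texts(db, blob_store, rev_id):
    """Resolve each address through a blob store (here just a dict stand-in)."""
    return {role: blob_store[addr] for role, addr in blob_addresses(db, rev_id)}
```

The point of the sketch is only that the per-revision query gains two joins; the step that resolves an address to text keeps the same shape as the old text id lookup.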


Growth of the Commons dump due to MediaInfo

Duesentrieb (talkcontribs)

Some thoughts on how Commons will develop once MediaInfo is introduced:

  • The slots-per-edit count for Commons will stay at 1, since MediaInfo doesn't support atomic edits of metadata and wikitext.
  • We expect that bots will be used to convert metadata from wikitext to MediaInfo. So Commons will probably quickly approach 2 slots (or streams, respectively) per page in the file namespace. But it will stay at 1 for other namespaces (for now), so the overall average may not be too far from the originally estimated 1.33.
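
The overall average mentioned in the last point is just a weighted mean over namespaces. A small Python sketch, with page counts that are made up purely for illustration and not taken from real Commons statistics:

```python
def average_slots_per_page(file_pages, other_pages,
                           slots_per_file_page=2, slots_per_other_page=1):
    """Weighted mean of slots per page across the two groups of namespaces."""
    total_pages = file_pages + other_pages
    if total_pages == 0:
        raise ValueError("need at least one page")
    total_slots = (file_pages * slots_per_file_page
                   + other_pages * slots_per_other_page)
    return total_slots / total_pages


if __name__ == "__main__":
    # Hypothetical split: 1 file page for every 3 other pages.
    print(average_slots_per_page(file_pages=1, other_pages=3))
```

The larger the share of file pages, the closer the average moves toward 2, so the real figure depends on the actual namespace split.
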