User:ArielGlenn/MCR and dumps

Some rough notes on Multi-Content Revisions (MCR), how it may impact xml dumps of revision content and metadata, and related musings

First, a digression about growth:

According to the growth estimates given here, a wiki with 50 million pages might be expected to reach 100 million in 8 years, with the side effects that entails. Commons is already at about 48 million files, having grown by nearly 10 million in the past year; at that rate of growth it will hit 100 million files in 5 years (see the content figures). Its growth curve is not flat, so 4 years is more likely.

Commons is, as I understand it, one of the primary consumers of MCR, since MCR would permit file-related information to be stored in a secondary 'slot' of a revision. I can easily imagine bots passing over the project to automate the conversion of existing file metadata into such slots; I would be a proponent of this myself, as a user of the metadata. So we might see the number of slots per page grow much faster than predicted in the initial growth estimates.

Now, on to the dumps.

The most expensive dumps get done in two passes: the first pass generates metadata about the revisions of each page, the so-called 'stubs'. The second pass walks through the stubs in page id order while reading the previous dump of revision content; any revisions found in the old dump are copied into the new one as-is, and only revision content that is missing is requested from the database servers.
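
In outline, the prefetch step looks something like the following sketch (Python, with plain dicts and a stand-in fetch callable in place of the real stub and dump readers, which stream XML in page id order):

 def content_pass(stub_revs, prev_dump_texts, fetch_from_db):
     # stub_revs: rev ids in page/revision order, from the stubs pass
     # prev_dump_texts: dict rev_id -> text, read from the previous content dump
     # fetch_from_db: callable rev_id -> text, hitting the database/external store
     out = []
     for rev_id in stub_revs:
         text = prev_dump_texts.get(rev_id)   # reuse content already dumped last time
         if text is None:
             text = fetch_from_db(rev_id)     # only missing revisions hit the servers
         out.append((rev_id, text))
     return out
 
 # Example: revisions 1 and 2 were in the old dump, so only revision 3 is fetched.
 new_dump = content_pass(
     [1, 2, 3],
     {1: "old text 1", 2: "old text 2"},
     lambda rev_id: "fresh text %d" % rev_id,
 )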

For dumps under the MCR model to be as kind to the servers, we'll want something analogous. We don't want to request all content for all slots of a revision if only one slot's content has been updated. We can probably be clever about this by examining slot_origin and comparing it to the rev_id and slot_origin information from the revision we just wrote, and thus request content from the external store only when we need it.
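
A rough sketch of that comparison, assuming slot_origin holds the id of the revision in which the slot's content was set (the helper names and the prev_written structure here are hypothetical, not the real dump code):

 def slot_content(rev_id, role, slot_origin, prev_written, fetch_from_store):
     # prev_written: role -> (rev_id, slot_origin, text) for the revision we just wrote
     prev = prev_written.get(role)
     if prev is not None:
         prev_rev_id, prev_origin, prev_text = prev
         # a slot_origin pointing at the previous revision, or at the same origin as
         # the previous revision's slot, means the content is unchanged: reuse it
         if slot_origin in (prev_rev_id, prev_origin):
             return prev_text
     # content new in this revision (or nothing to compare against): fetch it
     return fetch_from_store(rev_id, role)
 
 # e.g. main slot content dates from revision 90 and is unchanged in revision 101,
 # while the mediainfo slot content is replaced in revision 101
 prev = {"main": (100, 90, "wikitext blob"), "mediainfo": (100, 100, "old structured data")}
 slot_content(101, "main", 90, prev, lambda r, s: "fetched")        # -> "wikitext blob"
 slot_content(101, "mediainfo", 101, prev, lambda r, s: "fetched")  # -> "fetched"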

This does require access to content at a low level. The dump scripts currently get content out of the external store via Revision::getRevisionText, which used to call ExternalStore::fetchFromURL and now has a few levels of indirection in there, none of which should be particularly heavy. No serialization of any sort is done; we do xml-escape the content before it gets written out to a file, but that happens in the dump script itself, so that it can be changed or extended if we want to dump content in other formats.
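
That escaping is just a thin wrapper at write time, along these lines (Python sketch; the element layout and attributes are abbreviated from the real export schema):

 import sys
 from xml.sax.saxutils import escape
 
 def write_text_element(out, text):
     # xml-escape in the dump script itself, at write time; a dump in some
     # other output format would simply swap this function out
     out.write('      <text xml:space="preserve">%s</text>\n' % escape(text))
 
 write_text_element(sys.stdout, 'foo < bar & baz')
 # prints:       <text xml:space="preserve">foo &lt; bar &amp; baz</text>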

We'll want this same approach to slot content retrieval under MCR, whether we are dumping the content of the main slot only or of all slots.

XML schema thoughts

As far as the xml schema goes, my gut feeling is that we'll probably want to fix the slot order (alphabetical by slot name, except that 'main' is always first?) for readers' convenience. But I'm not sure about that yet.
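
If we do fix the order, the rule floated above is at least trivial to implement; a sketch (nothing decided):

 def slot_order_key(role):
     # 'main' sorts first (False before True), other slots alphabetical by role name
     return (role != 'main', role)
 
 print(sorted(['mediainfo', 'main', 'annotations'], key=slot_order_key))
 # ['main', 'annotations', 'mediainfo']   ('annotations' is a made-up role name)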

As soon as any MCR content makes it into any wiki, that wiki's dumps will no longer be readable with the current tools out there in the wild, so I wonder how much we gain from having the 'main' slot treated as plain old revision content and dumped in a 'backwards-compatible' style. Let's discuss this! I also dislike having one slot recorded differently from the rest in the output; it makes reading and writing the dumps a little more complex.