Multi-Content Revisions/Blob Storage

It should be possible to use different storage backends for different slots (or for different content models, TBD). For each storage mechanism, a BlobStore service would be provided:
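As a rough illustration of what such a service could look like, here is a minimal sketch. It is written in Python only to keep it short and runnable; the actual interface would be PHP. The method names mirror storeBlob and loadBlob from this page. The in-memory backend and its incremental numbering are assumptions made for the example.

```python
from abc import ABC, abstractmethod


class BlobStore(ABC):
    """Minimal blob storage service: store bytes, get back an address."""

    @abstractmethod
    def store_blob(self, data: bytes) -> str:
        """Persist data and return the address needed to load it again."""

    @abstractmethod
    def load_blob(self, address: str) -> bytes:
        """Return the data stored under address; raise KeyError if unknown."""


class InMemoryBlobStore(BlobStore):
    """Toy backend (hypothetical): keeps blobs in a dict, numbered incrementally."""

    def __init__(self) -> None:
        self._blobs = {}
        self._next_id = 1

    def store_blob(self, data: bytes) -> str:
        address = str(self._next_id)
        self._next_id += 1
        self._blobs[address] = data
        return address

    def load_blob(self, address: str) -> bytes:
        return self._blobs[address]


store = InMemoryBlobStore()
address = store.store_blob(b"Hello, slots!")
restored = store.load_blob(address)
```

A file system, Cassandra, or HTTP backed store would implement the same two methods and differ only in how the address is produced and resolved.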

Besides the SQL-based storage currently used by MediaWiki, BlobStores could be implemented on top of the raw file system, Cassandra, OpenStack Swift, higher-level HTTP-based services like RESTBase, etc.

Addressing blobs:


 * The BlobStore has full control over the address that shall be used later to retrieve the blob. storeBlob returns the address that can be used with loadBlob to retrieve the blob.
 * The address is completely opaque. It may be based on the blob's content hash, use incremental numbering, or GUIDs, or some other scheme.
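The opacity of addresses can be illustrated with two toy stores that use different addressing schemes. This is a sketch in Python, not the proposed PHP code. Both class names and the choice of SHA-1 are assumptions for the example. The point is that calling code only passes the address back and never parses it.

```python
import hashlib


class ContentHashBlobStore:
    """Hypothetical store whose addresses are the SHA-1 hash of the content."""

    def __init__(self):
        self._blobs = {}

    def store_blob(self, data):
        address = hashlib.sha1(data).hexdigest()
        self._blobs[address] = data
        return address

    def load_blob(self, address):
        return self._blobs[address]


class CounterBlobStore:
    """Hypothetical store whose addresses are incremental numbers."""

    def __init__(self):
        self._blobs = {}
        self._counter = 0

    def store_blob(self, data):
        self._counter += 1
        address = str(self._counter)
        self._blobs[address] = data
        return address

    def load_blob(self, address):
        return self._blobs[address]


def round_trip(store, data):
    """Caller code: works with any store because it never inspects the address."""
    address = store.store_blob(data)
    return store.load_blob(address)


payload = b"= Main Page ="
hash_store = ContentHashBlobStore()
counter_store = CounterBlobStore()
hash_result = round_trip(hash_store, payload)
counter_result = round_trip(counter_store, payload)

# Content hashing maps identical content to one address; counters never do.
same_address = hash_store.store_blob(payload) == hash_store.store_blob(payload)
distinct_addresses = counter_store.store_blob(payload) != counter_store.store_blob(payload)
```

A side effect of the hash-based scheme is deduplication of identical content, which a caller cannot rely on precisely because the address scheme is private to the store.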

TBD: should the BlobStore be able to store, and be aware of, any meta-data such as the blob's MIME type? Do we need getBlobInfo( $address )?

Note: for now, storeBlob is atomic and cannot be undone. This appears to be consistent with the current behavior of ExternalStore::insertToDefault. If we need a transactional context here, there are two approaches: pass the context to storeBlob, or provide the context to the BlobStore's constructor. In the latter case, we would need a BlobStoreFactory for each storage backend, for creating BlobStores for the current transaction context, as in newBlobStore( $trx ). The question of how such a transactional context would be implemented is worth its own RFC.
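To make the second approach more concrete, the following Python sketch binds a store to a transaction context through a factory, mirroring newBlobStore( $trx ). Everything about the context itself is an assumption, since its design is explicitly left open above: here it is a hypothetical object that collects undo callbacks, and the addresses are random UUIDs.

```python
import uuid


class TransactionContext:
    """Hypothetical transaction handle that collects undo callbacks."""

    def __init__(self):
        self._undo = []

    def on_rollback(self, callback):
        self._undo.append(callback)

    def commit(self):
        # Writes become permanent: forget how to undo them.
        self._undo.clear()

    def rollback(self):
        # Undo in reverse order of registration.
        while self._undo:
            self._undo.pop()()


class TransactionalBlobStore:
    """Store bound to one transaction context at construction time."""

    def __init__(self, blobs, trx):
        self._blobs = blobs  # storage shared by all stores of one backend
        self._trx = trx

    def store_blob(self, data):
        address = uuid.uuid4().hex
        self._blobs[address] = data
        # Register how to remove this blob if the transaction is rolled back.
        self._trx.on_rollback(lambda: self._blobs.pop(address, None))
        return address

    def load_blob(self, address):
        return self._blobs[address]


class BlobStoreFactory:
    """One factory per storage backend; hands out stores per transaction."""

    def __init__(self):
        self._blobs = {}

    def new_blob_store(self, trx):
        return TransactionalBlobStore(self._blobs, trx)


factory = BlobStoreFactory()

trx1 = TransactionContext()
kept = factory.new_blob_store(trx1).store_blob(b"kept")
trx1.commit()

trx2 = TransactionContext()
store2 = factory.new_blob_store(trx2)
discarded = store2.store_blob(b"discarded")
trx2.rollback()
```

The alternative approach would keep a single long-lived store and add the context as an extra argument to storeBlob, at the cost of changing the method signature.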

The mapping between slots and storage backends is maintained by the SlotDataStore service:

Addressing slot data:


 * The mapping between slots and storage backends is implemented in two steps:
   * BlobStore names are associated with BlobStore implementations and configurations, designating a concrete storage location. This association must NEVER change, otherwise any stored data will become inaccessible (this is similar to how ExternalStore clusters are configured).
   * Slot names (or content models, TBD) are associated with a BlobStore name. This indicates which store is to be used when storing new data. This association can be changed at will.
 * The string returned by storeSlotData is an opaque URL for later loading the slot data using loadSlotData. It is composed of two parts: the name of the BlobStore, and the address returned by the BlobStore.
 * loadSlotData relies on the prefix in the $url to find the correct BlobStore to load the slot data.
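The two-step mapping described above can be sketched as follows (Python for illustration only). The "name:address" URL format, the store names "sql1" and "files1", and the assign_slot helper are assumptions for this example. The page only states that the URL combines the BlobStore name with the address returned by that store.

```python
class DictBlobStore:
    """Toy BlobStore used as a stand-in for a real backend."""

    def __init__(self):
        self._blobs = {}

    def store_blob(self, data):
        address = str(len(self._blobs) + 1)
        self._blobs[address] = data
        return address

    def load_blob(self, address):
        return self._blobs[address]


class SlotDataStore:
    def __init__(self, stores_by_name, store_name_by_slot):
        # Step 1: store name -> concrete BlobStore. Must never change.
        self._stores = stores_by_name
        # Step 2: slot name -> store name for NEW data. May change at will.
        self._slot_to_store = dict(store_name_by_slot)

    def assign_slot(self, slot, store_name):
        self._slot_to_store[slot] = store_name

    def store_slot_data(self, slot, data):
        name = self._slot_to_store[slot]
        address = self._stores[name].store_blob(data)
        return name + ":" + address  # assumed URL format

    def load_slot_data(self, url):
        # Only the prefix matters here; the slot mapping is not consulted.
        name, address = url.split(":", 1)
        return self._stores[name].load_blob(address)


stores = {"sql1": DictBlobStore(), "files1": DictBlobStore()}
slots = SlotDataStore(stores, {"main": "sql1"})

old_url = slots.store_slot_data("main", b"wikitext v1")
slots.assign_slot("main", "files1")  # re-route future writes
new_url = slots.store_slot_data("main", b"wikitext v2")
```

Because loading depends only on the prefix, data written before the slot was re-routed stays readable, which is why the first association has to be stable while the second does not.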

Once the SlotDataStore service is available, PageUpdater, RevisionUpdater, and RevisionContentLookup shall be implemented on top of it. Maintaining any meta-information about the slots (in the new revision_slots table), and handling serialization, remains the responsibility of PageUpdater and RevisionUpdater. Interpreting such meta-data from the revision_slots table, and handling deserialization, remains the responsibility of RevisionContentLookup.

Hints
TBD

Integration with ExternalStore
The proposed BlobStore interface defines the following methods:

A multiplexing BlobStore can be made by using prefixes to indicate the actual blob store when loading. For storing, a separate interface is needed to pick a blob store explicitly, and compose the address:

Alternatively, a registry interface could be used, but this would leave it to the caller to correctly compose the address from prefix and suffix:

Current draft for composition: A BlobStoreMux instance that implements BlobLookup and MultiplexingBlobStore. The MuxInstance knows a BlobStoreRegistry instance.
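A possible shape of this composition, sketched in Python rather than PHP. The class names BlobStoreMux, BlobLookup, MultiplexingBlobStore and BlobStoreRegistry come from the draft above. The method names store_blob_to and get_blob_store, the ":" separator, and the "mem://" toy addresses are assumptions for the example.

```python
from abc import ABC, abstractmethod


class BlobLookup(ABC):
    @abstractmethod
    def load_blob(self, address): ...


class MultiplexingBlobStore(ABC):
    @abstractmethod
    def store_blob_to(self, prefix, data): ...


class MemBlobStore:
    """Toy backend whose own addresses already contain a colon."""

    def __init__(self):
        self._blobs = {}

    def store_blob(self, data):
        address = "mem://%d" % (len(self._blobs) + 1)
        self._blobs[address] = data
        return address

    def load_blob(self, address):
        return self._blobs[address]


class BlobStoreRegistry:
    def __init__(self, stores):
        self._stores = dict(stores)

    def get_blob_store(self, prefix):
        return self._stores[prefix]


class BlobStoreMux(BlobLookup, MultiplexingBlobStore):
    SEPARATOR = ":"

    def __init__(self, registry):
        self._registry = registry

    def store_blob_to(self, prefix, data):
        suffix = self._registry.get_blob_store(prefix).store_blob(data)
        return prefix + self.SEPARATOR + suffix

    def load_blob(self, address):
        # Split on the FIRST separator only; the suffix may contain more.
        prefix, suffix = address.split(self.SEPARATOR, 1)
        return self._registry.get_blob_store(prefix).load_blob(suffix)


registry = BlobStoreRegistry({"ES": MemBlobStore(), "TT": MemBlobStore()})
mux = BlobStoreMux(registry)
address = mux.store_blob_to("ES", b"external blob")

# Registry-only alternative: the caller composes the address itself.
manual = "TT" + ":" + registry.get_blob_store("TT").store_blob(b"text table blob")
```

With the multiplexer, the address format lives in one place. With the bare registry, every caller has to repeat the composition correctly.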

The old ExternalStoreMedium exposes similar methods:

The main difference is that the caller has to always specify the "location", the specific store. The interface resembles the multiplexer interface more than the basic BlobStore. ExternalStoreMedium is not accessed directly; instead, a purely static interface in the ExternalStore class is used to do the multiplexing between different ExternalStoreMedium objects. It exposes the following methods:

Note that insertWithFallback doesn't get one store, but a fallback chain of stores; this doesn't seem to be used in practice, though.

The $params argument means that for each call to insertWithFallback or fetchFromURL, a new ExternalStoreMedium instance is created based on these parameters. The main purpose of $params seems to be to select a wiki, in case we are trying to load blobs that belong to another site. With the BlobStore interface, each service instance would be permanent, and bound to a specific wiki. To access blobs for a different wiki, an appropriate BlobStore instance would have to be acquired from a factory.

When introducing BlobStore, the existing older interface needs to be integrated. There seem to be three options:


 * 1) Don't introduce a new blob storage interface, expand and adopt the existing one. This would be a big logical break: ExternalStore would no longer be an implementation detail hidden by the code that manages the text table - to the contrary, it would be the primary way to access content data, and the text table would just be one possible ExternalStoreMedium. Addresses for direct storage in the text table would look something like "TT:7641432", externally stored blobs would keep using addresses like "DB://cluster5/873284".
 * 2) Turn ExternalStore into a BlobStore (with MultiplexingBlobStore interface), in addition to the purely static interface. This means two-level multiplexing, once for the BlobStore interface, and once for the ExternalStoreMedium interface. Blob addresses using the external store would look something like "ES:DB://cluster5/873284".
 * 3) Create an ExternalStoreMediumBlobStore adapter that implements the BlobStore interface on top of an ExternalStoreMedium instance. The multiplexing BlobStore would use these adapters directly, bypassing the old ExternalStore class completely. Blob addresses from this adapter would be the same as the old external store URLs, e.g. "DB://cluster5/873284".
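Option 3 could look roughly like the following sketch (Python for illustration; the real adapter would be PHP). FakeExternalStoreMedium is a stand-in written for this example. It assumes the medium offers a store call that takes a location and a fetch call that takes a URL, in line with the description above that the caller always has to specify the location.

```python
class FakeExternalStoreMedium:
    """Stand-in for an ExternalStoreMedium; models only the two calls used below."""

    def __init__(self):
        self._rows = {}

    def store(self, location, data):
        blob_id = len(self._rows) + 1
        url = "DB://%s/%d" % (location, blob_id)
        self._rows[url] = data
        return url

    def fetch_from_url(self, url):
        return self._rows[url]


class ExternalStoreMediumBlobStore:
    """Adapter: presents one medium plus one fixed location as a BlobStore."""

    def __init__(self, medium, location):
        self._medium = medium
        self._location = location  # bound once, so callers need not pass it

    def store_blob(self, data):
        # The old external store URL is used unchanged as the blob address.
        return self._medium.store(self._location, data)

    def load_blob(self, address):
        return self._medium.fetch_from_url(address)


medium = FakeExternalStoreMedium()
blob_store = ExternalStoreMediumBlobStore(medium, "cluster5")
address = blob_store.store_blob(b"compressed revision text")
loaded = blob_store.load_blob(address)
```

Since the adapter passes addresses through untouched, URLs already recorded for existing revisions would remain valid, which is the main attraction of this option.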