Delta Storage

From mediawiki.org

In addition to marshaling article changes in svndiff format, it might also be nice to store article changes in svndiff format. This would optimize space by not storing a second copy of a large article when a small change is made. It would certainly have some cost in performance, though it might be negligible, or it might be considerable, possibly depending how it is implemented.

The Subversion source includes a couple of notes on how revisions are stored in its Berkeley DB backend. I have not yet found notes on how revisions are stored in the Subversion FSFS backend.

At present, Subversion generally stores the youngest strings in "fulltext" form, and older strings as "delta"s against them (unless the delta would save no space compared to the fulltext). However, there's nothing magic about that particular arrangement. Other interesting alternatives:

  • We could store the N most recently accessed strings as fulltexts, letting access patterns determine the most appropriate representation for each revision.
  • We could occasionally store deltas against the N'th younger revision, storing larger jumps with a frequency inverse to the distance covered, yielding a tree-structured history.

The simplest implementation would be to copy Subversion and store the current revision as fulltext, and previous revisions as incremental deltas. This should imply no performance change to retrieving the current revision, but retrieving the oldest revision would involve retrieving deltas for every revision of an article. Is this acceptable given MediaWiki's access pattern?

However, that would mean reacalculating all deltas for previous revisions whenever the page is edited, which I suspect would be prohibitively expensive unless the history is very short. --HappyDog 22:59, 16 June 2010 (UTC)[reply]