Talk:Proposed Database Schema Changes/August 2006

From mediawiki.org
The following discussion has been transferred from Meta-Wiki.
Any user names refer to users of that site, who are not necessarily users of MediaWiki.org (even if they share the same username).

Page renaming[edit]

How should page renaming work under the new scheme? It's confusing to change previous entries in RC to the new name, and slightly confusing to change them in Special:Contributions. Perhaps the revision should contain the namespace/title the entry had when it was written. -- Tim Starling 07:58, 10 Jan 2004 (UTC)

Separate texts[edit]

Based on the benchmarks in effects of large blobs on InnoDB speed, I suggest we add an additional table for just the texts.

CREATE TABLE `texts` (
  `txt_id` int(11) NOT NULL auto_increment,
  `txt_text` mediumtext,
  `txt_md5` varchar(32) default NULL,
   PRIMARY KEY  (`txt_id`),
   KEY (`txt_md5`(8))
)  TYPE=InnoDB;

Besides the mentioned speed improvements (large, hopefully, since this means all tables but 'texts' will fit in RAM), this will allow a nice segementation of data dumps, since a row in texts will never change after its insertion. The md5 hash can be used to avoid duplicates in the table, so that successive reverts doesn't use more space than necessary. Many revisions can point to the same text.

Update: from analyzing the 'old' table for en, I can say that storing only unique texts would reduce the data volume by no more than about 3%.

Also, since 'texts' doesn't change, it can take full advantage of memcached. Every text will be uniquely tied to its row id. // E23 00:51, 4 Feb 2004 (UTC)

An additional thought on the above: if we also add a datetime field containing the timestamp of the latest revision that uses the text, we can segment table dumps chronologically in a nice way. "current.sql", "old-february-2004.sql", "old-january-2004" and so on. Backups would not be any easier, since they can be segmented on id in any case, but it would make downloading easier on the users. They would always download the full revisions table (not very large) and then the text dumps for only as far back as needed. It can be done without an extra field, of course, but that would require more from the database and the dump creation script. // E23 17:33, 9 Feb 2004 (UTC)
This would be wonderful. I would love to be able to get something like "all pages and the last 20 revisions" or "all pages and the last month's revisions", instead of the full multi-year history. Was there any progress with this? +sj+ 17:33, 1 Aug 2004 (UTC)

Size, compression[edit]

Any objection to adding compression flag and [uncompressed] size fields?

I guess that may be acceptable. Would you like an index with that? // E23 01:15, 4 Feb 2004 (UTC)