Talk:Proposed Database Schema Changes/October 2004

JD_A
MySQL/InnoDB stores records in leaf index nodes in primary key order. Our troublesome accesses to old are for page history, which currently needs about one disk seek per revision because revisions are stored in id/time order across all pages. So I changed the primary key to be the ID of the article, then the unique ID in old/revision/text. Now a typical history page view will need relatively few seeks because many revisions of that page are likely to be in the same index node. I added a unique index on the old/revision/text id to ensure that it remains unique. This adds space to every other index (each must store the primary key), but that is far offset by avoiding lots of disk seeks. It requires no additional programming work - it just delivers the speed benefits. Jamesday 00:40, 3 Oct 2004 (UTC)
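A rough sketch of the change, assuming column names in the style of the proposed revision table (rev_page, rev_id and the index name are illustrations, not the actual schema):

```sql
-- InnoDB clusters rows by primary key, so leading with the page ID
-- groups all of a page's revisions into the same leaf pages.
ALTER TABLE revision
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (rev_page, rev_id),
  -- Secondary unique index keeps the global revision ID unique
  -- (and usable for lookups by revision ID alone).
  ADD UNIQUE INDEX rev_id_idx (rev_id);
```

Every secondary index then carries (rev_page, rev_id) as its row pointer, which is the per-index space cost mentioned above.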


 * Sounds fine for the revision table, but I'm not sure about for text. In particular, we can't add fields or change indexes on it without rebuilding the old old table, which is one of the key things we were trying to avoid. --Brion VIBBER 00:57, 3 Oct 2004 (UTC)


 * Adding it in text can wait until we actually chop it up. I'm documenting the desire for it here so we'll know that we want it later. For text, it means that both versions needed for a diff are fairly likely to be in the same database page. Jamesday 01:19, 3 Oct 2004 (UTC)


 * You might also think of adding a page_text table to cache just the current version. Old/text is destined to be (able to be) on a completely different database server - with big, cheap disks instead of small, fast ones - because it's growing so fast. We could go with two different arrays in one server, one RAID 10, small/fast, the other RAID 5, big/slow, but that's a lot of disks. Or we could continue filling relatively costly disks with seldom-read text, which seems like a bad idea. In this situation it'll be very convenient to have the current text available on the same server for a quick join. After this split to a different server, the primary servers and the query slaves will have much reduced disk space needs, so we'll save very significant amounts of money as things grow. The actual split can wait (for a few months), but now's the time to start preparing. Compressing old will put this off for a while, but I don't think we can avoid it. JeLuF seemed to appreciate this approach - you might want to discuss it with him also. For smaller sites, the "different" database server will just point to the same place. Jamesday 01:19, 3 Oct 2004 (UTC)
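A minimal sketch of such a cache table, with hypothetical names (pt_page, pt_text are illustrations, not proposed identifiers):

```sql
-- Hypothetical table holding only the current text of each page,
-- kept on the primary server so rendering a page never touches the
-- big/slow archive server that holds old revisions.
CREATE TABLE page_text (
  pt_page INT UNSIGNED NOT NULL PRIMARY KEY,  -- page ID
  pt_text MEDIUMBLOB NOT NULL                 -- current revision's text
) TYPE=InnoDB;

-- On save, overwrite the cached copy for that page:
REPLACE INTO page_text (pt_page, pt_text)
VALUES (1234, 'new wikitext ...');
```

(TYPE=InnoDB is the MySQL 4-era spelling of what later became ENGINE=InnoDB.)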

JD_B
Going back in page history 10,000 revisions currently requires a limit 10000,50 and a read of all 10,050 revisions. JD_B adds a per-page revision sequence number to the revision table. Going back 10,000 revisions can then use a range check on the index instead: where rev_page=nnn and rev_seq > (max/current rev_seq for this record - 10050) order by rev_seq limit 50. The current sequence value is also added to page so it's easily accessible. The code needs to be modified to set the rev_seq value on save. For conversion, the new field is initialised with cur_id; the revised seek code won't work with that, but it gets the conversion done without duplicate key value errors. The change to use this for queries can wait. Jamesday 01:23, 3 Oct 2004 (UTC)
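The two query shapes side by side, sketched with the proposed rev_seq column (the literal page ID and @current variable are illustrations):

```sql
-- Today: MySQL must read and discard 10,000 rows before returning 50.
SELECT * FROM revision
WHERE rev_page = 1234
ORDER BY rev_seq
LIMIT 10000, 50;

-- With rev_seq: a range check lets the index jump straight to the spot.
-- @current is the page's latest sequence value, read from the page row.
SELECT * FROM revision
WHERE rev_page = 1234
  AND rev_seq > @current - 10050
ORDER BY rev_seq
LIMIT 50;
```

Given an index on (rev_page, rev_seq), the second form touches only the 50 rows it returns, however deep into the history the reader pages.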


 * A sequence value is potentially a bit more problematic in that it restricts our ability to merge or otherwise alter individual page histories. How would these cases be handled?


 * History paging could also be made more efficient by keying off the timestamp rather than absolute sequence numbers. --Brion VIBBER 06:04, 3 Oct 2004 (UTC)
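The timestamp alternative sketched the same way: each "next" link carries the last timestamp already shown, so no offset or sequence column is needed (index and column names are assumptions):

```sql
-- Assumes an index on (rev_page, rev_timestamp).
SELECT * FROM revision
WHERE rev_page = 1234
  AND rev_timestamp < '20041003000000'  -- last timestamp already shown
ORDER BY rev_timestamp DESC
LIMIT 50;
```

This sidesteps the merge/renumber problem entirely, at the cost of slightly fuzzier paging when several revisions share a timestamp.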


 * An explicit revision sequence for each page, counting up from its first revision, will be very useful for any future CMS hooks (for instance, for allowing add-ons to cleanly handle publishing workflow, translation synchronization, etc). What are the cases in which it is useful to merge or alter page history?  Should a merged history not be distinguishable from an unmerged history?  +sj+ 09:09, 4 Oct 2004 (UTC)