Talk:Proposed Database Schema Changes/October 2004

JD_A
MySQL/InnoDB stores records in leaf index nodes in primary key order. Our troublesome accesses to old are for page history, which currently requires about one disk seek per revision because revisions are stored in id/time order rather than grouped by article. So, I changed the primary key to be the ID of the article, then the unique ID in old/revision/text. Now a typical history page view will require relatively few seeks because many revisions of that page are likely to be in the same index node. I added a unique index on the old/revision/text id to ensure that it remains unique. This adds space to every secondary index (each must store the primary key) but that is far offset by dodging lots of disk seeks. This requires no additional programming work - it just delivers the speed benefits. Jamesday 00:40, 3 Oct 2004 (UTC)
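A minimal sketch of the composite-primary-key idea, using an in-memory SQLite table as a stand-in for InnoDB (SQLite only clusters rows by primary key for WITHOUT ROWID tables, and the column names here are illustrative, not the actual proposed schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Illustrative schema: the composite primary key groups all revisions of a
# page together in the clustered index, much as InnoDB would lay them out
# on disk. WITHOUT ROWID makes SQLite cluster by the primary key too.
con.execute("""
    CREATE TABLE revision (
        rev_page INTEGER NOT NULL,
        rev_id   INTEGER NOT NULL UNIQUE,  -- keep the global id unique
        rev_text TEXT,
        PRIMARY KEY (rev_page, rev_id)
    ) WITHOUT ROWID
""")
con.executemany("INSERT INTO revision VALUES (?, ?, ?)",
                [(2, 10, "a"), (1, 11, "b"), (1, 12, "c"), (2, 13, "d")])

# A history view for one page is now a single range scan on the primary key
# rather than one seek per revision scattered across the table.
history = [r[0] for r in con.execute(
    "SELECT rev_id FROM revision WHERE rev_page = 1 ORDER BY rev_id")]
print(history)  # [11, 12]
```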


 * Sounds fine for the revision table, but I'm not sure about text. In particular, we can't add fields or change indexes on it without rebuilding the old old table, which is one of the key things we were trying to avoid. --Brion VIBBER 00:57, 3 Oct 2004 (UTC)


 * Adding it to text can wait until we actually chop it up; I'm documenting the desire for it so we'll know that we want it later. For text, it means that both versions needed for a diff are fairly likely to be in the same database page. Jamesday 01:19, 3 Oct 2004 (UTC)


 * You might also think of adding a page_text table to cache just the current version. Old/text is destined to be (able to be) on a completely different database server - with big, cheap disks instead of small, fast ones - because it's growing so fast. We could go with two different arrays in one server, one RAID 10 (small/fast), the other RAID 5 (big/slow), but that's a lot of disks. Or we could continue filling relatively costly disks with seldom-read text, which seems like a bad idea. In this situation it'll be very convenient to have the current text available on the same server for a quick join. After this split to a different server, the primary servers and the query slaves will have much reduced disk space needs, so we'll save very significant amounts of money as things grow. The actual split can wait (for a few months), but now's the time to start preparing. Compressing old will put this off for a while, but I don't think we can avoid it. JeLuF seemed to appreciate this approach - might want to discuss with him also. For smaller sites, the "different" database server will just point to the same place. Jamesday 01:19, 3 Oct 2004 (UTC)

JD_B
Going back 10,000 revisions in page history currently requires a LIMIT 10000,50 and a read of all 10,050 revisions. JD_B adds a per-page revision sequence number to the revision table. Going back 10,000 revisions can then use a range check on the index instead: WHERE rev_page=nnn AND rev_seq > (current rev_seq for this page - 10050) ORDER BY rev_seq LIMIT 50. The current sequence value is added to page so it's easily accessible. The code needs to be modified to set the rev_seq value on save. For conversion, the new field is initially filled with cur_id. The revised seek code won't work with that, but it's something which gets the conversion done without duplicate key errors. The change to use this for queries can wait. Jamesday 01:23, 3 Oct 2004 (UTC)
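A small sketch of the two query shapes, using SQLite with toy numbers (100 revisions, paging back 20, pages of 5) in place of the 10,050-row example above; column names are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE revision (
        rev_page INTEGER NOT NULL,
        rev_seq  INTEGER NOT NULL,   -- per-page sequence, set on save
        rev_id   INTEGER NOT NULL,
        PRIMARY KEY (rev_page, rev_seq)
    )
""")
con.executemany("INSERT INTO revision VALUES (1, ?, ?)",
                [(i, 1000 + i) for i in range(1, 101)])

cur_seq, back, page = 100, 20, 5   # current rev_seq would live in the page row

# Current style: the database must read past every skipped row.
slow = con.execute(
    "SELECT rev_seq FROM revision WHERE rev_page = 1 "
    "ORDER BY rev_seq DESC LIMIT ? OFFSET ?", (page, back)).fetchall()

# Proposed style: a range check the index can satisfy directly.
fast = con.execute(
    "SELECT rev_seq FROM revision WHERE rev_page = 1 AND rev_seq > ? "
    "ORDER BY rev_seq LIMIT ?", (cur_seq - back - page, page)).fetchall()

# Both return the same page of history (sequence numbers 76..80).
print(sorted(r[0] for r in slow) == [r[0] for r in fast])  # True
```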


 * A sequence value is potentially a bit more problematic in that it restricts our ability to merge or otherwise alter individual page histories. How would these cases be handled?


 * History paging could also be made more efficient by keying off the timestamp rather than absolute sequence numbers. --Brion VIBBER 06:04, 3 Oct 2004 (UTC)


 * An explicit revision sequence for each page, counting up from its first revision, will be very useful for any future CMS hooks (for instance, for allowing add-ons to cleanly handle publishing workflow, translation synchronization, etc). What are the cases in which it is useful to merge or alter page history?  Should a merged history not be distinguishable from an unmerged history?  +sj+ 09:09, 4 Oct 2004 (UTC)


 * Current practice is that page histories may be merged when they are broken, eg by cutting off an existing page to make a redirect to a new title instead of renaming it.


 * Very occasionally revisions may be removed, eg due to copyright infringing materials stuck into an established page. In this case you could leave a 'hole' or fill in a bogus "this revision removed" revision, so renumbering is not needed.


 * For the merging case, the histories may include some overlap in time when edits were being made under both titles, which requires either renumbering revisions or some freaky non-integer scheme. --Brion VIBBER 08:34, 6 Oct 2004 (UTC)


 * I'd be in favour of including sequence numbers. We already have sequence numbers in the user interface, they just have to be recalculated on demand. Including sequence numbers in the schema could be seen as a kind of aggregation or caching. My possibly naive understanding is that history page views are O(N) and history merges are O(1) in the total number of revisions associated with a given article. Considering the relative frequency of those two operations, I'd rather have it the other way around. Let's work on the concept of timestamps being canonical, and sequence numbers being derived. We can make a function to regenerate the sequence numbers from the timestamps, which can be called on undelete.


 * The other problem with timestamps is that collisions are possible. Sequence numbers would give the correct order in history views even if the edits occurred in the same few milliseconds. The renumbering algorithm could check the sequence number ordering on collision. Edit conflict checks could just use rev_id in the new schema, so we won't need sequence numbers there. -- Tim Starling 23:41, 8 Oct 2004 (UTC)
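A sketch of the regeneration function described above, with timestamps canonical and rev_id as the tie-breaker for same-timestamp collisions (function and field names are hypothetical, not MediaWiki code):

```python
def regenerate_sequence(revisions):
    """Derive per-page sequence numbers from canonical timestamps.

    revisions: list of (rev_id, rev_timestamp) tuples for one page.
    Returns {rev_id: rev_seq}, numbering from 1 in chronological order;
    rev_id breaks ties when two edits collide on the same timestamp.
    """
    ordered = sorted(revisions, key=lambda r: (r[1], r[0]))
    return {rev_id: seq for seq, (rev_id, _) in enumerate(ordered, start=1)}

# Two edits sharing a timestamp still get a stable, correct order:
seqs = regenerate_sequence([(7, "20041008234100"), (5, "20041008234100"),
                            (3, "20041008233000")])
print(seqs)  # {3: 1, 5: 2, 7: 3}
```

This is the O(N log N) recalculation that would run on undelete or history merge, leaving ordinary saves to just increment the stored sequence.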

My changes
Can we include support for differential storage and application-level splitting? I've added my ideas for the necessary schema changes directly to the page. In the text table, there's an integer old_diff field. The idea is that if this field is zero, the text is literal. Otherwise, old_diff is the old_id of another revision. The text of the revision of interest is constructed by retrieving the text of the linked revision, then applying the diff given in old_text. This data structure allows for arbitrarily long chains of diffs, with diffs going in either the forward or reverse chronological directions.

For application-level splitting I added a rev_link field. This could contain a URI or similar, where the revision text can be found. -- Tim Starling 23:56, 8 Oct 2004 (UTC)
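The diff-chain lookup described above can be sketched as follows; the row layout and apply_diff are stand-ins, since the actual diff format isn't specified here:

```python
def get_revision_text(old_id, rows, apply_diff):
    """Reconstruct a revision's text by walking its diff chain.

    rows: {old_id: (old_diff, old_text)} - old_diff == 0 means old_text
    is literal; otherwise it is the old_id of the revision this diff is
    applied against. apply_diff(base, diff) stands in for the real
    diff-application routine.
    """
    chain = []                      # diffs from the requested revision down
    while True:
        old_diff, old_text = rows[old_id]
        if old_diff == 0:
            text = old_text         # literal base text: stop walking
            break
        chain.append(old_text)
        old_id = old_diff
    for diff in reversed(chain):    # apply diffs base-first
        text = apply_diff(text, diff)
    return text

# Toy example where a "diff" is simply text to append:
rows = {1: (0, "Hello"), 2: (1, " world"), 3: (2, "!")}
print(get_revision_text(3, rows, lambda base, d: base + d))  # Hello world!
```

This also illustrates Jamesday's objection below: the chain must be walked link by link unless the set of needed rows can be computed up front.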


 * We're doing our darnedest to not require altering the old table's field layout here. :)


 * Consider whether an old_flag flag and handling in Article::getRevisionText would be appropriate. --Brion VIBBER 02:01, 9 Oct 2004 (UTC)


 * Jamesday points out that a linked list isn't exactly ideal for performance. There needs to be some way to determine in advance which rows need to be loaded, so that all the data for a revision can be fetched in one or two queries. -- Tim Starling 23:42, 9 Oct 2004 (UTC)

Page change?
For page it's probably a good idea to add a field for the CRC32 of page_namespace plus page_title, and an index on that CRC32 (and no other fields). A page existence check then becomes: CRC32 matches (probably fairly unique) and namespace matches and title matches. This is of benefit because the CRC32 index is very small compared to namespace plus title, so it will be easy to cache, and because integer comparisons are fast and easy. It doesn't have to be CRC32 - anything producing a small integer result with a uniform hash distribution will do (so some variation on a cryptographic hash might be better than CRC32, which is sometimes not very uniformly distributed). This is one of the routine performance improvement suggestions from the MySQL people. Jamesday 01:39, 23 Oct 2004 (UTC)
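A sketch of the check, with a dict standing in for the small CRC32 index and zlib.crc32 as the hash (all names here are illustrative):

```python
import zlib

def title_crc(namespace, title):
    # Small integer key for the index; collisions are possible, so the
    # full namespace/title comparison below still decides existence.
    return zlib.crc32(f"{namespace}:{title}".encode("utf-8"))

# Stand-in for the CRC32 index: hash -> rows that hashed to it.
index = {}
for ns, title in [(0, "Main Page"), (1, "Main Page"), (0, "MySQL")]:
    index.setdefault(title_crc(ns, title), []).append((ns, title))

def page_exists(namespace, title):
    # Cheap integer lookup first; exact match second rules out collisions.
    candidates = index.get(title_crc(namespace, title), [])
    return (namespace, title) in candidates

print(page_exists(0, "MySQL"), page_exists(0, "InnoDB"))  # True False
```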

WikiStats
I know I'm a bit late with this, but it is pretty hard to keep in touch with everything going on.

I'm trying to grasp what the consequences will be for the wikistats scripts, which use raw database dumps for everything. The current setup seems to imply that I need to build huge tables which exceed physical memory, hence sharp performance penalties (the job already runs +/- 24 hrs), or that I need to sort and merge these huge files several times before the real work starts.

I need to know which user edits were for namespace 0. Also which article texts belong to the same title (or any unique article id), were in namespace 0 and at what time they were saved.

If I understand correctly, I would have to sort Page and Revision on page_id=rev_page and merge them into a new file, say PageRev, then sort the PageRev file on rev_id and merge it with Text.
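The sort-and-merge described above amounts to a merge join over pre-sorted dump files; a minimal sketch of the first step (field layouts are illustrative, not the actual dump format):

```python
def merge_join(pages, revisions):
    """Join pre-sorted dump records, as a dump-processing script might.

    pages:     (page_id, namespace, title) rows, sorted on page_id
    revisions: (rev_page, rev_id, timestamp) rows, sorted on rev_page
    Yields joined (page_id, namespace, title, rev_id, timestamp) rows
    in a single sequential pass over both inputs.
    """
    pages = iter(pages)
    page = next(pages, None)
    for rev_page, rev_id, timestamp in revisions:
        while page is not None and page[0] < rev_page:
            page = next(pages, None)        # advance to the matching page
        if page is not None and page[0] == rev_page:
            yield (*page, rev_id, timestamp)

pages = [(1, 0, "MySQL"), (2, 1, "Talk")]
revs = [(1, 10, "t1"), (1, 11, "t2"), (2, 12, "t3")]
print(list(merge_join(pages, revs)))
```

Replicating the namespace into Revision, as suggested below, would make this whole pass unnecessary for the namespace-0 filtering case.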

All of this would not be necessary if a few small fields were replicated across tables. The impact on database size would be trivial, and on page save time zero.

Replicate Namespace into Revision, and Namespace, Rev_Page and Timestamp into Text.

Please comment. Erik Zachte 14:59, Dec 19, 2004 (UTC)

Unrelated: will there be a (costly) query after each database dump to produce something similar to the cur dump, which is used by quite a few scripts? Downloading all the complete databases is not feasible.