Proposed Database Schema Changes

This page contains proposals for database schema changes.

Archive: /October 2004, /August 2006, /November 2006

table for auto-suggest page title search
An OpenSearch standard allows browsers to get a list of suggestions when they type in the search box (upper right corner). A query  returns a list of titles that begin with what the user typed, but the list is alphabetical, and not very relevant when only a few characters have been typed.

To improve relevancy, the list can be sorted by how many other pages link to the title, for example:.
 * Add a  field to the page table, and create an index.
 * Pros: efficient space - just one more integer field. Cons: Counter update will happen in the same table.
 * Or, add a new table with title and backlinkscount fields, and also create an index.
 * Pros: When generating a backlinkcount, it might be faster to create a new table, and once complete, swap it in place. The replication is (IIUC) much more efficient. Cons: Duplication of titles. Pages might be out of date.

revisions table: Revision size field

 * Bug 6277 - add revision length to revisions table

Tables: add rev_len unsigned integer field to revisions table Indexes: no changes at this time

This field would contain the length of the revision's raw text, same as page_len in page table. Having this field would tremendously help vandal-fighting bots, as it will allow simple queries for page blanking and bulk imports (fairly common forms of vandalism). It will also reduce the load on the server from such tools, because the raw text will not be needed in many cases. The length will, potentially, allow much more sophisticated analysis then what the next, rc_change field would allow.


 * Drawback
 * Very expensive operation, as all revision rows have to be updated. Possibly requires considerable downtime.

''The schema change can be made offline on slaves [addition of a field, easy], and the values can be filled in lazily or as a batch operation. If done carefully this should not be disruptive. --brion 20:01, 11 November 2006 (UTC)''

recentchanges table: add "Change size" field

 * Bug 6277 - add change in length (delta) to the recentchanges table

Tables: add rc_change signed integer field to recentchanges table Indexes: need?

This field would contain the size of the change (delta) between two subsequent revisions (either positive or negative).

The field would allow for quick vandalism detection, as considerable number of vandals either erase large amount of content, or inject large quantities of bogus text.

Interwiki Links table
Bug: not filed

Currently, there is a table langlinks that contains interwiki links from all articles to the titles in all other languages of the same family. langlinks contains these fields: +--+-+--+-+-+---+ +--+-+--+-+-+---+ +--+-+--+-+-+---+
 * Field   | Type            | Null | Key | Default | Extra |
 * ll_from | int(8) unsigned | NO   | PRI | 0       |       |
 * ll_lang | varchar(10)     | NO   | PRI | NULL    |       |
 * ll_title | varchar(255)   | NO   |     | NULL    |       |


 * Drawbacks : The table does not contain two important types of interwiki links:
 * the inline langlink  he:Einstein  is a link to a page in another language embedded in the text
 * the cross-family link  Langlinks table  is a link to the page in another site (i.e. Meta)


 * Proposed solutions
 * Introduce a new table with identical structure as langlinks, containing both the inline language links and interwikies, but without the language links shown on the left hand side. It can later be determined if it's a language link by cross-referencing the interwiki table's iw_local field.
 * Introduce a "type" field into the langlinks table, determining the type of the link - 0=language, 1=inline language, 2=interwiki, etc

Index optimization

 * Bug 6440 - Change indexes on pagelinks, templatelinks, and imagelinks tables to considerably optimize several queries.

pagelinks: pl_namespace index: add pl_from templatelinks: tl_namespace index: add tl_from imagelinks: il_to index: add il_from
 * Changes

The additional field will allow multi-page queries to limit results and have an easy way to continue the query without using filesort.

New field in table page for languages with language variants
Bug: not filed

Add new field page_no_title_convert to table page to complete the implementation of the __NOTITLECONVERT__ magic word. This magic prevents conversion of page titles into different variants (e.g. cyrillic, latin, ...). The additional field enables a Parser::replaceLinkHolders function to figure out if the link text can be converted to different variants, without fetching the article text (and checking for the magic word).

Full patch (of directory maintenance/) is available at svn diff -r16523:16525.

ALTER TABLE /*$wgDBprefix*/page ADD page_no_title_convert bool NOT NULL default 0;
 * Changes

user_options to own table
The user_options field in the user table is not properly normalized. It is proper to have a user_options table with the fields useroption_id (PK), useroption_user (foreign key to user_id), useroption_key (option key), useroption_value (option value). This format would allow extension writers to do things like associate specific user options with a user and then search for all users with a specific option value. Space requirements might even be less because it would be possible to prune rows that correspond to default values, as opposed to storing a serialized array in the row as is currently done.

Archive table add primary key
The  table doesn't have a primary key currently; this is generally annoying, and means that we refer to deleted revisions with a (namespace,title,timestamp) tuple which isn't guaranteed to be unique. (Dupe timestamps can occur in a page's history due to funny bugs or history merging from two formerly separate pages; very quick consecutive operations may also produce dupes on second-resolution timestamps.)

Adding an auto-increment primary key should be fairly easy to do offline [via slaves + master switch].

Additional indexes may also be wise, for instance to pull items by name to maintain user names or other such operations.

Other fields that might be useful on archive could be:
 * original page_id, to maintain across restores
 * timestamp of deletion action
 * maybe the user and reason fields also?