Database optimization

Database optimization is an important consideration in MediaWiki coding, particularly when the code is intended for deployment on Wikimedia projects.

Database selection tips

Typical read queries

Calling code should use the DB_REPLICA constant when using wfGetDB(), getConnection(), getConnectionRef(), and getLazyConnectionRef(). Using replicas whenever possible improves site scalability, and it keeps wikis that span multiple datacenters efficient: most requests can be served entirely from local replica databases rather than sometimes querying the (remote) master database from a read-only datacenter. If possible, master queries should only happen on POST requests, with exceptional cases using the DeferredUpdates system. Doing otherwise can result in slow cross-datacenter queries. Additionally, the master database is a single point of failure (SPOF).
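
For example, a typical replica read might look like the following sketch (the fields and conditions are illustrative):

$dbr = wfGetDB( DB_REPLICA );
// Fetch only the fields needed, from the local replica
$row = $dbr->selectRow(
    'page',
    [ 'page_id', 'page_touched' ],
    [ 'page_namespace' => NS_MAIN, 'page_title' => 'Example' ],
    __METHOD__
);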

Write queries and read-for-write queries

Write queries must use a DB_MASTER handle, as all replicas are read-only and MediaWiki is targeted at master/replica scale-out setups.

In the case where a read query needs to see the latest data, the DB_MASTER constant should be used to select the master database. The main case for this is when a SELECT query is used (at least partly) to determine the next write queries. Using DB_REPLICA here is not advisable as temporary replication lag could be amplified into permanent data anomalies. Be aware that additional query flags such as "LOCK IN SHARE MODE" or "FOR UPDATE" or the use of Database::lock() may also be needed to avoid race conditions. Always be aware of stale reads caused by REPEATABLE-READ isolation, flushing transactions as needed.
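
For example, a read-for-write sequence on the master might look like the following sketch (the cat_attributes table, its fields, and $catId are hypothetical):

$dbw = wfGetDB( DB_MASTER );
// Read the current row from the master and lock it until the transaction
// commits, so that concurrent writers cannot race with this update
$row = $dbw->selectRow(
    'cat_attributes',
    [ 'cat_lives' ],
    [ 'cat_id' => $catId ],
    __METHOD__,
    [ 'FOR UPDATE' ]
);
// The write is based on the value just read
$dbw->update(
    'cat_attributes',
    [ 'cat_lives' => $row->cat_lives - 1 ],
    [ 'cat_id' => $catId ],
    __METHOD__
);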

In some cases, DB_REPLICA may be used for read-for-write queries anyway, typically when both:

  • (a) there is a large performance advantage
  • (b) only minimal changes or tolerable anomalies are likely to result from replication lag

For example, when parsing a page after it has been edited, the exact state of each template and data source it uses matters less than reflecting the new source text of the page itself. The user only expects to see *their* changes reflected, and is generally unaware of changes made by other users to templates and assets used by the page they just edited.

Relation to GET/POST method

It is good practice to decide at a high level whether a request will need database write access or not. Ideally, once the entry point (API module, SpecialPage class, Action class, etc.) is known, whether DB writes are allowed for POST requests should be known early on, without having to execute the entire entry point. DB writes should not be allowed for GET and HEAD requests; exceptional cases can use the DeferredUpdates system.

The main methods for specifying which endpoints require database writes are:

  • SpecialPage::doesWrites()
  • Action::doesWrites()
  • ApiBase::isWriteMode()
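
For example, a special page that performs writes can declare this up front, as in the following sketch (the class name is hypothetical):

class SpecialPurgeCatCache extends SpecialPage {
    public function doesWrites() {
        // Signal early that execute() may write to the master database
        return true;
    }

    public function execute( $subPage ) {
        // ... master DB writes happen here, on POST requests ...
    }
}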

GET requests for entry points should be able to be efficiently executed in any datacenter, while POST requests are assumed to be routed to the master datacenter (where the master databases reside) to avoid high latency.

Query tips

General

  • ORDER BY on an expression forces a filesort, which is bad.
  • COUNT(), SUM(), GROUP BY, etc. place no limit on the number of rows scanned; MAX()/MIN() of an indexed field over an entire table is okay.
  • In some cases, denormalize for performance: duplicate information from other tables, or use summary tables, counter tables, cache tables, etc.
  • Think about how the DB will run your query.
  • Add indexes where needed.
  • Batch queries (when it makes sense).

Extensions generally shouldn't add fields to core tables. Instead, they should make their own table(s), and JOIN them if necessary.

Profiling

When in doubt, don't make assumptions; profile.

Select only needed columns

Unless all columns are needed, avoid 'SELECT * FROM ...' as this consumes resources transferring extraneous data. Select only the needed columns.
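
For example, the following sketch selects only the two columns the caller needs instead of every column in the row (the fields shown are illustrative):

$dbr = wfGetDB( DB_REPLICA );
// Equivalent to SELECT page_id, page_len FROM page WHERE page_namespace = 0
$res = $dbr->select(
    'page',
    [ 'page_id', 'page_len' ],
    [ 'page_namespace' => NS_MAIN ],
    __METHOD__
);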

WHERE

Avoid full table scans, except in edge cases where a scan actually improves performance because almost every row is going to be selected anyway.

Indices

Weigh the advantage of an index in making queries more efficient against the cost of maintaining it on every insertion, deletion, and update. Generally, you shouldn't run unindexed queries. An unindexed WHERE on a condition that is rarely false is usually okay. Unindexed ORDER BY (filesort) is never okay.

LIMIT

Use LIMIT to limit the number of rows returned. When possible, also limit the number of rows scanned. Remember, LIMIT 10 OFFSET 5000 scans 5010 rows. If you can, use WHERE foo_id >= 12345 instead.
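
For example, the following sketch pages through results by continuing from the last ID seen rather than using a large OFFSET ($lastSeenId is a hypothetical continuation value):

$dbr = wfGetDB( DB_REPLICA );
// With an index on rc_id this scans about 10 rows, unlike LIMIT 10 OFFSET 5000
$res = $dbr->select(
    'recentchanges',
    [ 'rc_id', 'rc_title' ],
    [ 'rc_id >= ' . (int)$lastSeenId ],
    __METHOD__,
    [ 'ORDER BY' => 'rc_id', 'LIMIT' => 10 ]
);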

JOIN

Use JOINs rather than looping SELECT queries.
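
For example, the following sketch fetches pages together with their outgoing links in a single JOIN, instead of running one SELECT per page (the fields shown are illustrative):

$dbr = wfGetDB( DB_REPLICA );
$res = $dbr->select(
    [ 'page', 'pagelinks' ],
    [ 'page_id', 'page_title', 'pl_title' ],
    [ 'page_namespace' => NS_MAIN ],
    __METHOD__,
    [],
    // JOIN condition; one query replaces a SELECT per page row
    [ 'pagelinks' => [ 'JOIN', 'pl_from = page_id' ] ]
);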

Calculations

Do calculations in such a way that indexes can still be used; for example, avoid wrapping an indexed column in a function or expression in the WHERE clause, and compute constants on the PHP side instead.
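
The following sketch computes the cutoff timestamp in PHP so that the indexed rc_timestamp column is compared directly and its index remains usable:

$dbr = wfGetDB( DB_REPLICA );
// Compute the constant here instead of wrapping rc_timestamp in SQL functions
$cutoff = $dbr->timestamp( time() - 86400 );
$res = $dbr->select(
    'recentchanges',
    [ 'rc_id' ],
    [ 'rc_timestamp >= ' . $dbr->addQuotes( $cutoff ) ],
    __METHOD__
);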

Multiple row operations

Use multiple row operations to avoid looped INSERT and UPDATE queries.[1]
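
For example, the following sketch inserts several rows with one query instead of a looped single-row INSERT (the table and fields are hypothetical):

$dbw = wfGetDB( DB_MASTER );
// One multi-row INSERT instead of one query per row
$dbw->insert(
    'cat_attributes',
    [
        [ 'cat_id' => 1, 'cat_name' => 'Felix' ],
        [ 'cat_id' => 2, 'cat_name' => 'Tom' ],
        [ 'cat_id' => 3, 'cat_name' => 'Garfield' ],
    ],
    __METHOD__
);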

Caching tips

Expensive queries that are likely to be frequently requested are good candidates for caching. If there is a convenient and cheap database value that changes whenever the computed value might change, such as page_touched, it is often useful to include that cheap value in the cache key. This avoids the need for explicit purging of cache keys. Also, consider keying on something that makes the cache value immutable, such as rev_sha1 or img_sha1. Both of these approaches handle replication lag gracefully.

In general, the best interface for caching SQL queries is WANObjectCache. Cache-aside logic is made easy by the getWithSetCallback method. This method also makes it easy to avoid high TTL anomalies due to replication lag by calling Database::getCacheSetOptions() at the top of the callback. Explicit purges can use the delete() or touchCheckKey() methods of WANObjectCache.

Cache-aside caching example:

$catInfo = $cache->getWithSetCallback(
    // Key to store the cached value under
    $cache->makeKey( 'cat-attributes', $catId ),
    // Time-to-live (in seconds)
    $cache::TTL_MINUTE,
    // Function that derives the new key value
    function ( $oldValue, &$ttl, array &$setOpts ) {
        $dbr = wfGetDB( DB_REPLICA );
        // Account for any snapshot/replica DB lag
        $setOpts += Database::getCacheSetOptions( $dbr );

        return $dbr->selectRow( ... );
    }
);

Cache purging example:

// Purge the key and disable caching for a few seconds
$cache->delete( $cache->makeKey( 'cat-attributes', $catId ) );
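
Alternatively, touchCheckKey() can invalidate cached values without deleting them directly, assuming the corresponding getWithSetCallback() call passed the same (hypothetical) key in its 'checkKeys' option:

// Values saved under this check key are treated as stale from now on
$cache->touchCheckKey( $cache->makeKey( 'cat-attributes-check', $catId ) );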

Communication

If you have questions on this topic that are not answered by the documentation, you may wish to join the #wikimedia-databases IRC channel and talk to meta:User:Jynus, who is WMF's Senior Database Administrator and an expert on MySQL performance. If possible, be prepared to explain which queries are involved in the situation. Roan Kattouw, another MySQL optimization expert, answers questions on the wikitech-l mailing list and is sometimes available in #mediawiki.

References

  1. Galperin, Eran (14 May 2009) Multiple row operations in MySQL / PHP, Techfounder.
