Database optimization is an important consideration in MediaWiki coding, particularly when the code is intended for deployment on Wikimedia projects.
Database selection tips
Typical read queries
Calling code should use the DB_REPLICA constant when obtaining a database handle via getLazyConnectionRef(). Using replicas whenever possible improves site scalability: wikis spanning multiple datacenters can serve most requests entirely from local replica databases, rather than sometimes querying the (remote) master database from a read-only datacenter. If possible, master queries should only happen on POST requests, with exceptional cases using the DeferredUpdates system. Doing otherwise can result in slow cross-datacenter queries. Additionally, the master database is a single point of failure (SPOF).
Write queries and read-for-write queries
Write queries must use a DB_MASTER handle, as all replicas are read-only and MediaWiki is targeted at master/replica scale-out setups.
When a read query needs to see the latest data, the DB_MASTER constant should be used to select the master database. The main case for this is when a SELECT query is used (at least partly) to determine subsequent write queries. Using DB_REPLICA here is not advisable, as temporary replication lag could be amplified into permanent data anomalies. Be aware that additional query options such as LOCK IN SHARE MODE or FOR UPDATE, or the use of Database::lock(), may also be needed to avoid race conditions. Always be aware of stale reads caused by REPEATABLE READ isolation, flushing transactions as needed.
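The read-for-write pattern described above can be sketched generically. The following is an illustration only, using Python's built-in sqlite3 module with an invented table: MediaWiki runs on MySQL, where a locking read (SELECT ... FOR UPDATE) holds the row lock; SQLite has no FOR UPDATE, so BEGIN IMMEDIATE stands in for it here by taking the write lock before the read.

```python
import sqlite3

# Keep the SELECT and the UPDATE inside one transaction so they see a
# consistent view and no other writer can slip in between them.
conn = sqlite3.connect(":memory:", isolation_level=None)  # autocommit mode
conn.execute("CREATE TABLE counter (name TEXT PRIMARY KEY, hits INTEGER)")
conn.execute("INSERT INTO counter VALUES ('page', 0)")

conn.execute("BEGIN IMMEDIATE")  # acquire the write lock up front
(hits,) = conn.execute(
    "SELECT hits FROM counter WHERE name = 'page'"
).fetchone()
# The read determines the write: a race here would lose increments.
conn.execute("UPDATE counter SET hits = ? WHERE name = 'page'", (hits + 1,))
conn.execute("COMMIT")

print(conn.execute("SELECT hits FROM counter").fetchone()[0])  # 1
```

Reading and then writing in separate transactions (or reading from a lagged replica) is what opens the window for the race conditions and anomalies described above.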
In some cases, DB_REPLICA may be used for read-for-write queries anyway, typically when both:
- (a) there is a large performance advantage
- (b) only minimal changes or tolerable anomalies are likely to result from replication lag
For example, when parsing a page, the exact state of each template and data source may matter less than reflecting the primary change: the newly edited source text of the page itself. The user expects to see *their* changes reflected, while typically being oblivious to changes other users have made to templates and assets used by the page they just edited.
Query tips
- An ORDER BY on an expression forces a filesort, which is bad.
- Be careful with COUNT(), SUM(), GROUP BY, etc.: there is no limit on the number of rows scanned. MAX() or MIN() of an indexed field over an entire table is okay.
- In some cases, denormalize for performance: duplicate information from other tables, or use summary tables, counter tables, cache tables, etc.
- Think about how the database will run your query, and add indexes where needed.
- Batch queries (when it makes sense).
Extensions generally shouldn't add fields to core tables. Instead, they should make their own table(s), and JOIN them if necessary.
When in doubt, don't make assumptions; profile.
Select only needed columns
Unless all columns are needed, avoid 'SELECT * FROM ...' as this consumes resources transferring extraneous data. Select only the needed columns.
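As an illustration, using Python's built-in sqlite3 module with a simplified table loosely modeled on MediaWiki's page table (column names here are illustrative, not the real schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page (page_id INTEGER PRIMARY KEY,"
    " page_title TEXT, page_text TEXT)"
)
conn.execute(
    "INSERT INTO page VALUES (1, 'Main_Page', 'a very large blob of wikitext')"
)

# Bad: drags the whole row, including the large text column, across the wire.
row = conn.execute("SELECT * FROM page WHERE page_id = 1").fetchone()

# Good: names only the column the caller actually uses.
title, = conn.execute(
    "SELECT page_title FROM page WHERE page_id = 1"
).fetchone()
print(title)  # Main_Page
```

SELECT * also couples the code to the table layout: adding a column later silently changes what every such query returns.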
Avoid full table scans, except in edge cases where a scan would actually be faster because almost all of the rows are going to be selected anyway.
Weigh the advantages of an index in making queries more efficient against the cost of maintaining that index on every insertion, deletion, and update. Generally, you shouldn't run unindexed queries. An unindexed WHERE on a rarely false condition is usually okay; an unindexed ORDER BY (filesort) is never okay.
Use LIMIT to limit number of rows returned. When possible, also limit the number of rows scanned. Remember, LIMIT 10 OFFSET 5000 scans 5010 rows. If you can, use WHERE foo_id >= 12345 instead.
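The difference between OFFSET pagination and seeking on an indexed column can be demonstrated with Python's sqlite3 module (the foo table and its columns are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (foo_id INTEGER PRIMARY KEY, foo_val TEXT)")
conn.executemany(
    "INSERT INTO foo VALUES (?, ?)",
    [(i, f"v{i}") for i in range(1, 6001)],
)

# OFFSET pagination: the engine scans and discards 5000 rows
# before returning the 10 you asked for.
slow = conn.execute(
    "SELECT foo_id FROM foo ORDER BY foo_id LIMIT 10 OFFSET 5000"
).fetchall()

# Keyset pagination: seek directly into the primary-key index,
# so only the 10 returned rows are scanned.
fast = conn.execute(
    "SELECT foo_id FROM foo WHERE foo_id >= 5001 ORDER BY foo_id LIMIT 10"
).fetchall()

print(slow == fast)  # True: same page of results, far fewer rows scanned
```

For this to work, the caller passes the last seen foo_id as the starting point of the next page instead of a page number.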
Use JOINs rather than looping SELECT queries.
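The "N+1 queries" anti-pattern and its JOIN replacement can be sketched with sqlite3 (the page and revision tables here are drastically simplified stand-ins, not the real MediaWiki schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT);
CREATE TABLE revision (rev_id INTEGER PRIMARY KEY,
                       rev_page INTEGER, rev_comment TEXT);
INSERT INTO page VALUES (1, 'A'), (2, 'B');
INSERT INTO revision VALUES (10, 1, 'first'), (11, 2, 'second');
""")

# Bad: one extra query per page (the "N+1" pattern).
looped = []
for page_id, title in conn.execute(
    "SELECT page_id, page_title FROM page ORDER BY page_id"
):
    comment, = conn.execute(
        "SELECT rev_comment FROM revision WHERE rev_page = ?", (page_id,)
    ).fetchone()
    looped.append((title, comment))

# Good: a single JOIN returns everything in one round trip.
joined = conn.execute("""
    SELECT page_title, rev_comment
    FROM page JOIN revision ON rev_page = page_id
    ORDER BY page_id
""").fetchall()

print(joined)  # [('A', 'first'), ('B', 'second')]
```

With N pages, the loop issues N+1 queries and pays N+1 round-trip latencies; the JOIN pays one.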
Do calculations in such a way that the indexes are still being used. For example, WHERE foo_id + 1 = 10 defeats an index on foo_id, while the equivalent WHERE foo_id = 9 can use it.
Multiple row operations
Use multiple row operations to avoid looped INSERT and UPDATE queries.
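A sketch of the same idea with sqlite3 (the link table is invented; executemany() stands in here for a single multi-row INSERT ... VALUES (...), (...), ... statement):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE link (l_from INTEGER, l_to INTEGER)")

rows = [(1, n) for n in range(2, 7)]

# Bad: one INSERT statement per row, each a separate round trip:
#   for r in rows:
#       conn.execute("INSERT INTO link VALUES (?, ?)", r)

# Good: one batched operation for all rows.
conn.executemany("INSERT INTO link VALUES (?, ?)", rows)

count, = conn.execute("SELECT COUNT(*) FROM link").fetchone()
print(count)  # 5
```

The same applies to updates: a single UPDATE with an IN (...) list or a CASE expression usually beats a loop of single-row UPDATE statements.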
Caching tips
Expensive queries that are likely to be frequently requested are good candidates for caching. If there is a convenient and cheap database value that changes whenever the computed value might change, such as page_touched, it is often useful to include that cheap value in the cache key; this avoids the need for explicit purging of cache keys. Also consider keying on something that makes the cache value immutable, such as rev_sha1 or img_sha1. Both of these approaches handle replication lag gracefully.
In general, the best interface for caching SQL queries is WANObjectCache. Cache-aside logic is made easy by the getWithSetCallback() method, which also makes it easy to avoid high-TTL anomalies due to replication lag by calling Database::getCacheSetOptions() at the top of the callback. Explicit purges can use the touchCheckKey() method of WANObjectCache.
Communication
If you have questions on this topic that are not answered by the documentation, you may wish to ask in the #wikimedia-databases IRC channel and talk to meta:User:Jynus, the WMF's Senior Database Administrator and an expert on MySQL performance. If possible, be prepared to explain which queries are involved in the situation. Roan Kattouw, another MySQL optimization expert, answers questions on the wikitech-l mailing list and is sometimes available in #mediawiki.
See also
- Development policy#Database policy and Development policy#Database patches
- Manual:Database access#Basic query optimization
- Manual:Database layout/MySQL Optimization/Tutorial
External links
- Galperin, Eran (14 May 2009). "Multiple row operations in MySQL / PHP". Techfounder.
- Galperin, Eran (25 March 2011). "Database optimization techniques you can actually use". Techfounder.