Wikimedia Performance Team/Backend performance

This page explains what we do to make MediaWiki code (core, extensions, user scripts, and gadgets) run fast. If you write code like that, please follow these guidelines so you don't slow down the site, whether you're writing a one-line bugfix, a new gadget feature, a whole new extension, or a big change to core. The aim is to optimize user experience, especially minimizing latency in the user experiences.

MediaWiki developers care a lot about performance, and this guide interrelates with the Architecture guidelines and Security for developers/Architecture.

This page talks about how to make your code fast given how the site works right now. We will have a separate document about what direction we want our performance architecture to go.

How to think about performance

 * Be prepared to be surprised by the performance of your code - we are famously bad at predicting this.
 * Be scrupulous about measuring performance (in your development environment AND in production) and know where time is being spent.
 * When latency is identified, take responsibility for it and make it a priority; you have the best idea of usage patterns & what to test.
 * Performance is often related to other code smells; think about the root cause.
 * Expensive but valuable actions that miss the cache should take, at most, 5 seconds; 2 seconds is better.
 * If that isn't enough, consider using the job queue to perform a task on background servers.

General performance principles
MediaWiki application development:
 * Deliver CSS and JavaScript fast (bundled, minified, and avoiding duplication) while retaining the benefits of caching. The ResourceLoader does this.
 * Defer loading modules that don't affect the initial rendering of the page, particularly the "above the fold" part of the page visible on the user's screen. So load as little JavaScript as possible with ; instead load more components asynchronously or with the default ResourceLoader behavior of loading at the bottom of the page. See loading modules for more information.
 * Users should have a smooth experience; different components should render progressively. Preserve positioning of elements (e.g. avoid pushing down content in a reflow). need code from a bad example

Wikimedia infrastructure:
 * Your code is running in a shared environment. Thus, long-running SQL queries can't run as part of the web request. Instead they should be made to run on a dedicated server (use the JobQueue), and watch out for deadlocks and lock-wait timeouts.
 * The tables you create will be shared by other code. Every query must be able to use one of the indexes (including write queries!).
 * Choose the right persistence layer for your needs: Redis job queue, MariaDB database, or Swift file store. Maybe use a cache instead of a persistence layer. need a bad example
 * Wikimedia uses and depends heavily on many different caching layers, so your code needs to work in that environment! (But it also must work if everything misses cache.) need a good example
 * We share our caches and want to increase the cache hit ratio; watch out if you're introducing new cookies, shared resources, bundled requests or calls, or other changes that will vary requests and reduce cache hit ratio.

How to think about performance
Measure: Measure how fast your code works, so you can make decisions based on facts instead of superstition or feeling. Use these principles together with the Architecture guidelines and Security guidelines.

Always consider 99% numbers rather than averages. In other words, you don't want half of your users to have a good experience, you want all of them to. So you need to look at the 99th slowest sample to really understand performance.

Latency: We aim for an acceptable speed of service regardless of network latency, but some operations may have surprisingly variable network latency, such as looking up image files when Instant Commons is enabled. There can be some ways to manage this:
 * first, be aware of which code paths are meant to always be fast (database, memcache) and which may be slow (fetching File info or spam blacklists that might be cross-wiki and go over the internet)
 * when creating a code path that may be intermittently slow, document this fact.
 * be careful not to pile on requests -- for instance an external search engine might be slow to return under poor conditions while it's normally fast. Bottlenecking may cause all web servers to get caught up.
 * consider breaking operations into smaller pieces which can be separated
 * alternately, consider running operations in parallel -- this can be tricky though, we don't have good primitives for doing multiple HTTP fetches at once right now

(Latency of course depends on the user's connection somewhat. We serve many people on mobile or dialup connections. Our goal is reasonable up to 300 milliseconds round-trip time or so. If someone's on satellite with 2000ms RTT, then they can expect everything to be slow, but that's a small minority of users.)

In the worst case, an action that is sort of expensive but valuable, if it misses hitting the cache, should take at most 5 seconds. Strive for two seconds.


 * example: saving a new edit to a page
 * example: rendering a video thumbnail

How often will this happen? We have several critical code paths in MediaWiki. It's important to think about how often the site or the browser will have to execute your code. Here are the main cases:
 * Always. This is obviously the most critical.
 * On page views (when actually sending HTML) -- that is, nearly always, unless we reply with  or some such.
 * When rendering page content. This is executed a lot less, so more expensive operations are acceptable. Rendering is typically not done while a user is waiting -- unless the user just edited the page, which leads to...
 * When saving an edit. This is the rarest code path, and the one on which the largest delays are acceptable. Users tend to accept a longer wait after performing an action that "feels heavy", like editing a page.

You're not alone: Work with your product managers/development manager/Site performance and architecture to understand general performance targets before you start architecting your system. For example, a user-facing application might have an acceptable latency of 200 ms while a database might have something like 20 ms or less, especially if further access is decided based on the results of previous queries. You don't want to prematurely optimize, but you need to understand if your targets are physically possible.

You might not need to design your own backend; consider using an existing service, or having someone design an interface for you. Consider modularization. Performance is hard; do not try to reinvent the wheel.

ResourceLoader
ResourceLoader is the delivery system for the optimized loading and managing of modules. Learn how to develop code with ResourceLoader and its features.

Examples:
 * The Example extension demonstrates how to register a ResourceLoader module, and how to load it on a page.
 * Extension:SyntaxHighlight GeSHi used to embed the styles in the section of every page HTML directly. It has been migrated to instead use ResourceLoader to manage and load these styles.

Modules requested over HTTP are cached by timestamp. If your module is made up of wiki pages or plain files, the default ResourceLoaderModule classes take care of measuring the invalidation timestamp for you. When implementing a custom module, you're responsible for timestamp measuring and freshness computation.

Examples:
 * TODO: "ResourceLoaderModule hash", uses hash of data inside the key, or hash of request context. In a previous version it used a generic key and then whenever a request came in for a different language than the previous request, it invalidated the cache. This cached the module to infinitely invalidate its own cache causing it essentially to not be cached at all.

Deferring loading
Defer loading modules that don't affect the initial rendering of the page (above the fold). Load as little JavaScript as needed from the top loading queue; load more components asynchronously or from the bottom queue.

Downloading and executing resources (styles, MediaWiki messages, and scripts) can slow down a user's experience. Per "Developing with ResourceLoader", when possible, load modules that the user will not immediately need via the bottom queue, rather than the top queue.

Chrom(e|ium)'s developer tools are a good way to introspect the order in which your code is loading resources. For further advice on tuning front-end performance, Ilya Grigorik's book "High Performance Browser Networking" is excellent and available to read for free.

Don't load anything synchronously that will freeze the user interface.

Good example: https://github.com/wikimedia/mediawiki-extensions-Wikibase/blob/master/client/resources/wikibase.client.linkitem.init.js performs lazy loading.

MultimediaViewer follows the init model pattern. mmv.bootstrap.autostart.js https://git.wikimedia.org/blob/mediawiki%2Fextensions%2FMultimediaViewer.git/832cbf3f030fede74f76f4f26e2137813cdf2edf/resources%2Fmmv%2Fmmv.bootstrap.autostart.js - is a bootstrapper that loads everything else.

Bad examples:
 * That Wikibase JS code used to be a script, "a big JS module in Wikibase that would have loaded ~90kib of javascript in the wikipedias". hoo says, "note that it has *huge* dependencies on other things not usually loaded in client at all. that's the actual point, introducing dependencies".
 * See Bug 55550 - Reduce amount of code loaded in MwEmbedSupport and TimedMediaHandler startup and BeforePageDisplay, addressing MwEmbedSupport's and TimedMediaHandler's premature loading of several modules.

Preserving positioning
Users should have a smooth experience; different components should render progressively. Preserve positioning of elements (e.g. avoid pushing content in a reflow).

Don't have your code discover that an element is needed and then cause them to appear. Instead, have your code preserve a place for the element ahead of time, or display an option greyed out until you know whether it's active.

It's good to explicitly state width/height on elements, or to reserve space for elements that will be rendered by JavaScript if you're sure they will be there.

Good example:
 * VisualEditor used to have its "Edit" tab appear last, effectively pushing the "Edit source" one step to the left. Aside from performance impact of having the "Edit source" be moved to a different place after it was initially rendered, it also causes additional user experience problems. When users clicked on "Edit source", they often ended up unintentionally clicking the (new) "Edit" tab instead, as that one was inserted in place of where there cursor was on "Edit source".
 * The Media Viewer within Extension:MultimediaViewer - see mmv.bootstrap.js starts the display by displaying a large black background, and then starts displaying the content from top to bottom and also absolutely positioned, so that items render smoothly and do not jump around or cause reflows.

Bad example:
 * Fundraising and other site info banners on Wikipedia may cause a "jump" of content when the banner is inserted.
 * This sort of thing can be tricky to fix "right" because the presence or height of the banner is semi-randomized. Cf. CentralNotice/Optimizing banner loading.

Shared environment
Your code is running in a shared environment. Thus, long-running queries should be on a dedicated server, and watch out for deadlocks and lock-wait timeouts. When assessing whether your queries will take "a long time" or cause contention, profile them. These numbers will always be relative to the performance of the server in general, and to how often it will be run.

Long-running queries
Long-running queries that do reads should be on a dedicated server, as we do with analytics, whether the read is longrunning or you have repeatable read. Keeping one transaction open for more than (ideally) seconds is a bad idea on production servers. MySQL has to keep various rows open in indices, and that makes things slow in general - including slowing down other queries on other tables! We have research databases - use those.

Locking
Wikimedia's MySQL/MariaDB servers use InnoDB, which supports repeatable read transactions.

Gap locking is part of "Next-key Locks" which is how InnoDB implements REPEATABLE READ transaction isolation level. We at Wikimedia have repeatable read transaction isolation on by default (unless the code is running in Command-Line Interaction (CLI) mode, as with the maintenance scripts), so all the SQL SELECT queries you do within one request will automatically get bundled into a transaction. For more on understanding this, look at the Wikipedia articles on en:Isolation (database systems) and look up repeatable read (snapshot isolation), to understand why we want to avoid phantom reads and other phenomena.


 * Anytime you are doing a write/delete/update query that updates something, it will have gap locks on it unless it is by a unique index. Even if you are not in repeatable read, even if you are doing one SELECT, it will be internally consistent if, for example, it returns multiple rows.

Thus: do your operations, e.g., DELETE or UPDATE or REPLACE, on a unique index, such as a primary key. The situations where you were causing gap locks and you want to switch to doing operations on a primary key are ones where you want to do a SELECT first to find the ID to operate on -- also, this can't be SELECT FOR UPDATE. This also means you might have to deal with consistency issues, so you may want to use INSERT IGNORE instead of INSERT.

Here's a common mistake that causes inappropriate locking: take a look at, for instance, the table  (line 208 of tables.sql), in which you have a three-column table that follows the "Entity-value-attribute" pattern.
 * 1) Column 1: the object/entity (here, UserID)
 * 2) Column 2: the name of a property for that object
 * 3) Column 3: the value associated with that property for the object

That is, you have a bunch of key-values for each entity that are all in one table. (This table schema is kind of an antipattern. But at least this is a reasonably specific table that just holds user preferences.)

In this situation, it's tempting to create a workflow for user preference changes that deletes all the rows for that userID, then reinserts new ones. But this causes a lot of work for the database. Instead, change the query so you only delete by the primary key. SELECT it first, and then, when you INSERT new values, you can use INSERT IGNORE (which ignores the insert if the row already exists). This is more efficient. (Alternatively, you can use a JSON blob, but this is hard to join on individual rows.)


 * On gap locking - Aaron Schulz wants to create some slides or diagrams to help you understand what ranges get gap locked under certain circumstances. Contact him if you want them! Read the MySQL docs, but in this case you may still be confused.

Transactions
Every web request and every database operation, in general, should occur within a transaction. However, be careful when mixing a database transaction with an operation on something else, such as another database transaction or accessing an external service like Swift. Be particularly careful with locking order. Every time you update or delete or insert anything, ask:
 * what you are locking?
 * are there other callers?
 * what are you doing, after making the query, all the way to making the commit?

Avoid excessive contention. Avoid locking things in an unnecessary order, especially when you're doing something slow and committing at the end. For instance, if you have a counter column that you increment every time something happens, then DON'T increment it in a hook just before you parse a page for 10 seconds.

Do not use READ UNCOMMIT (if someone updates a row in a transaction and has not committed it, another request can still see it) or SERIALIZABLE (every time you do SELECT, it's as if you did SELECT FOR UPDATE, a.k.a. lock-and-share mode -- locking every row you select until you commit the transaction -leads to lock-wait timeouts and deadlocks).

Examples
Good example:. When we update message blobs (JSON collections of several translations of specific messages), we have to conditionally update certain rows and deal with concurrent attempts to update. In a previous version of the code, the code locked a row in order to write to it and avoid overwrites, but this could have led to contention. In contrast, in the current codebase, the  method performs a repeated attempt at update until it determines (by checking timestamps) that there will be no conflict. See lines 212-214 for an explanation and see and 208-234 for the outer do-while loop that processes  until it is empty.

Bad example: How we used to do ArticleFeedbackv5. Code included:

Bad practices here include the multiple counter rows with id = '0' updated every time feedback is given on any page, and the use of DELETE + INSERT IGNORE to update a single row. Both result in locks that prevent >1 feedback submission saving at a time (due to the use of transactions, these locks persist beyond than the time needed by the individual statements). See minutes 11-13 of Asher Feldman's performance talk & page 17 of his slides for more explanation.

Indexing
The tables you create will be shared by other code. Use indexing, yes, including writes.

Unless you're dealing with a tiny table, you need to index writes (similarly to reads). Watch out for deadlocks and for lock-wait timeouts (e.g., doing an update or a delete by primary query, rather than some secondary key). And make sure join conditions are cleanly indexed.

You cannot index blobs, but you can index blob prefixes (the substring comprising the first several characters of the blob).

Compound keys - namespace-title pairs are all over the database. You need to order your query by asking for namespace first, then title!

Use EXPLAIN & MYSQL DESCRIBE query to find out which indexes are affected by a specific query. (If it says "Using temporary table" or "Using filesort" in the EXTRA column, that's bad! If "possible_keys" is NULL, that's bad!) See the Performance profiling for Wikimedia code guide, and for more details, see Roan Kattouw's 2010 talk on security, scalability and performance for extension developers, Roan's MySQL optimization tutorial from 2012 (slides), and Tim Starling's 2013 performance talk.

Good example: See the sections starting at line 802 and line 1429 of tables.sql, specifically the  and   tables. One of them also offers a reverse index, which gives you a cheap alternative to SORT BY.

Bad example: See this changeset (a fix). As the note states, "needs to be id/type, not type/id, according to the definition of the relevant index in :  ". Rather than using the index that was built on the id-and-type combination, the previous code (that this is fixing) attempted to specify an index that was "type-and-id", which did not exist. Thus, MariaDB did not use the index, and thus instead tried to order the table without using the index, which caused the database to try to sort 20 million rows with no index.

Persistence layer
Choose the right persistence layer for your needs: Redis job queue, MariaDB database, or Swift file store. Maybe use a cache instead of a persistence layer.

At Wikimedia, Redis, MariaDB, Swift, and memcached are all local services that we expect to fetch things from. (Also things like Parsoid that plug in for specific things like VisualEditor.) We expect them to be on a low-latency network. They are local services, as opposed to remote services like Varnish.

People often put things into databases that ought to be in a cache or a queue. Here's when to use which:
 * 1) MySQL/MariaDB database - longterm storage of structured data and blobs.
 * 2) Swift file store - longterm storage for binary files that may be large. See Media storage for details.
 * 3) Redis jobqueue - you add a job to be performed, the job is done, and then the job is gone. You don't want to lose the jobs before they are run. But you are ok with there being a delay.
 * (in the future maybe we should have a high-latency and a low-latency queue.)

A cache, such as memcached, is storage for things that persist between requests, and that you don't need to keep - you're fine with losing any one thing. Use memcached to store objects if the database could recreate them but it would be computationally expensive to do so, so you don't want to recreate them too often. You can imagine a spectrum between caches and stores, varying on how long we expect objects to live in the service before getting evicted; see the Caching layers section for more.

Permanent names: In general, store resources under names that won't change. We made the mistake of storing files under their "pretty names" - if you click Move, it ought to be fast (renaming title), but other versions of the file also have to be renamed. And Swift is distributed, so you can't just change the metadata on one volume of one system.

Object size: Memcached sometimes gets abused by putting big objects in there, or where it would be cheaper to recalculate than to retrieve. So don't put things in memcached that are TOO trivial - that causes an extra network fetch for very little gain. A very simple lookup, like "is a page watched by current user", does not go in the cache, because it's indexed well so it's a fast database lookup.

When to use the job queue: If the thing to be done is fast (~5 milliseconds) or needs to happen synchronously, then do it synchronously. Otherwise, put it in the job queue. Examples using the job queue:


 * Updating link table on pages modified by a template change
 * Transcoding a video that has been uploaded

HTMLCacheUpdate used to be partly synchronous (it would update a few backlinks immediately) but this workflow started causing deadlocks, and users do not want to see deadlock notifications! So we changed that in 2012. We also used to think that, for instance, file uploads needed to be synchronous, but we moved them to an async workflow because users started experiencing timeouts.

In some cases it may be valuable to create separate classes of job queues -- for instance video transcoding done by Extension:TimedMediaHandler is stored in the job queue, but a dedicated runner is used to keep the very long jobs from flooding other servers. Currently this requires some manual intervention to accomplish (see TMH as an example).

Extensions that use the jobqueue a lot include RenameUser, TranslationNotification, Translate, GWToolset, and MassMessage.

Additional examples:
 * large uploads. UploadWizard has API core modules and core jobs take care of taking chunks of file, reassembling, turning into a file the user can view. The user starts defining the description, metadata, etc., and the data is sent 1 chunk at a time.
 * purging all the pages from Varnish & bumping the  col in the database, which tells parser cache it's invalid and needs to be regenerated
 * refreshing links: when a page links to many pages, or it has categories, it's better to refresh links or update categories after saving, then propagate the change. (For instance, adding a category to a template or removing it, which means every page that uses that template needs to be linked to the category -- likewise with files, externals, etc.)

How slow or contentious is the thing you are causing? Maybe your code can't do it on the same web request the user initiated. You do not want an HTTP request that a user is waiting on to take more than a few seconds.

Example: You create a new kind of notification. Good idea: put the actual notification action (emailing people) or adding the flags (user id n now has a new notification!) into the jobqueue. Bad idea: putting it into a database transaction that has to commit and finish before the user gets a response.

Good example: The Beta features extension lets a user opt in for a "Beta feature" and displays, to the user, how many users have opted in to each of the currently available Beta features. We store those counts in the database as part of  (it's a preference and therefore belongs there). It's important to immediately update the user's own preference and be able to display the updated preference on page reload, but it's not important to immediately report to the user the increase or decrease in the count, and this information doesn't get reported in Special:Statistics.
 * Therefore, we update those user counts every half hour or so, and no more. Specifically, we do a SELECT query in a job that gets executed in the job queue. This takes a long time - upwards of 5 minutes each time! So we do it once, and then on the next user request, it gets cached in memcached for the page https://en.wikipedia.org/wiki/Special:Preferences#mw-prefsection-betafeatures . (They won't get updated at all if no one tries to fetch them, but that is unlikely.) If a researcher needs a realtime count, she can directly query the database outside of MediaWiki application flow.
 * Code: UpdateBetaFeatureUserCountsJob.php and BetaFeaturesHooks.php.

Bad example:

Work when cache hits and misses
Wikimedia uses and depends heavily on many different caching layers, so your code needs to work in that environment! (But it also must work if everything misses cache.)

Cache-on-save: We have a preemptive cache-repopulation strategy at Wikimedia: if your code will create or modify a large object when the user hits "save" or "submit", then along with saving the modified object in the database/filestore, populate the right cache with it (or schedule a job in the job queue to do so). This will give users faster results than if we regenerated those large things dynamically when someone hit the cache. We aggressively cache localization (i18n) messages, SpamBlacklist data, and parsed text upon save. (See "Caching layers" for more.)

If something is VERY expensive to recompute, then use a cache that is somewhat closer to a store. For instance, you might use the backend (secondary) Varnishes, which we call a cache, but which are closer to a store on the cache-to-store spectrum because objects tend to persist longer there (on disk).

Cache misses are normal: Avoid writing code that, on cache miss, is ridiculously slow. (For instance, it's not okay to  and assume that a memcache between the database and the user will make it all right; cache misses and timeouts eat a lot of resources. Caches are not magic.) The cluster has a limit of 180 seconds per script (see the limit in Puppet); if your code is so slow that a function exceeds the max execution time, it will be killed.

Write your queries such that an uncached computation will take a reasonable amount of time. To figure out what is reasonable for your circumstance, ask the Site performance and architecture team.

If you can't make it fast, see if you can do it in the background. For example, see some of the statistics special pages that run expensive queries. These can then be run on a dedicated time on large installations. But again, this requires manual setup work -- only do this if you have to.

Watch out for cached HTML: HTML output may sit around for a long time and still needs to be supported by the CSS and JS. Problems where old JS/CSS hang around are in some ways more obvious, so it's easier to find them early in testing, but stale HTML can be insidious!

Good example:

Bad example: a change "disabled varnish cache, where previously it was set to cache in varnish for 10 seconds. Given the amount of hits that page gets, even a 10 second cache is probably helpful."

Caching layers
We share our caches and want to increase the cache hit ratio; watch out if you're introducing new cookies, shared resources, bundled requests or calls, or other changes that will vary requests and reduce cache hit ratio.

Caching layers that you need to care about:
 * 1) Browser caches
 * 2) native browser cache
 * 3) LocalStorage. See meta:Research:Module storage performance to see the statistical proof that storing ResourceLoader storage in LocalStorage speeds page load times and causes users to browse more.
 * 4) Front-end Varnishes
 * The Varnish caches cache entire HTTP responses, including thumbnails of images, frequently-requested pages, ResourceLoader modules, and similar items that can be retrieved by URL. The front-end Varnishes keep these in memory. A weighted-random load balancer (LVS) distributes web requests to the front-end Varnishes.
 * Because we distribute our front-end Varnishes geographically (in our Amsterdam & San Francisco caching centers as well as our Texas and Virginia data centers) to reduce latency to users, we also call our front-end Varnishes "edge caching" and you may hear us refer to them as a CDN (content delivery network).
 * 1) Back-end Varnishes
 * If a frontend Varnish doesn't have a response cached, it passes the request to the back-end Varnishes via hash-based load balancing (on the hash of the URI). The backend Varnishes hold more responses, storing them ondisk. Every URL is on at most one backend Varnish.
 * 1) object cache (implemented in memcached in WMF production, but other implementations include Redis, APC, etc.)
 * 2) database's query cache

The object cache is generic service used for many things. One: user object cache. Generic service that many services can stash things in.

You can also use that service as a layer in a larger caching strategy, which is what the parser cache does in Wikimedia's setup. One layer of the parser cache lives in the object cache.

Generally, don't disable the parser cache. See: How to use the parser cache.

Also watch out for putting inappropriate objects into a cache.

How do you choose which cache(s) to use? See "Picking the right cache: a guide for MediaWiki developers".

Figure out how to appropriately invalidate content from caching by purging, directly updating (pushing data into cache), or otherwise bumping timestamps or versionIDs. Your application needs will determine your Cache purging strategy.

Good example: (from an in-progress change) of not poisoning the cache with request-specific data (when cache is not split on that variable). Background:  will use MediaWiki's cookie settings, so client-side developers don't think about this. These are passed via the ResourceLoader startup module. Issue: However, it doesn't use Manual:$wgCookieSecure (instead, this is documented not to be supported), since the default value (' ') varies by the request protocol, and the startup module does not vary by protocol. Thus, the first hit could poison the module with data that will be inapplicable for other requests.

Bad examples:
 * GettingStarted error: Don't use Token in your cookie name - https://git.wikimedia.org/commit/mediawiki%2Fextensions%2FGettingStarted.git/22fbeedfc51fdd3ed780ed52bc880df8d07bfc19
 * WikidataClient was fetching a large object from memcached just to decide which project group it was on, when it would have been more efficient to simply recompute it by putting the very few values needed into a global variable. (See the changeset that fixed the bug.)
 * Template parse on every page view is a bad thing, as it obviates the advantage of the parser cache (the cache that caches parsed wikitext).