Wikimedia Performance Team/Backend performance

How to think about performance

 * Be prepared to be surprised by the performance of your code - we are famously bad at predicting this.
 * Be scrupulous about measuring performance (in your development environment AND in production) and know where time is being spent.
 * When latency is identified, take responsibility for it and make it a priority
 * (you have the best idea of usage patterns & what to test)
 * Performance is often related to other code smells; think about the root cause.
 * Expensive but valuable actions that miss the cache should take, at most, 5 seconds; 2 seconds is better.
 * If that isn't enough consider using the job queue to perform a task on background servers

General performance principles

 * Front-end:
 * We want to deliver CSS and JavaScript fast (bundled, minified, and avoiding duplication) while retaining the benefits of caching. Thus, we use ResourceLoader.
 * Defer loading modules that don't affect the initial rendering of the page (above the fold). Load as little JavaScript as needed from the top loading queue.
 * Users should have a smooth experience; different components should render progressively. Preserve positioning of elements (e.g. avoid pushing content in a reflow). need good and bad examples
 * Back-end:
 * Your code is running in a shared environment. Thus, long-running queries should be on a dedicated server, and watch out for deadlocks and lock-wait timeouts.
 * The tables you create will be shared by other code. Use indexing, yes, including writes. have a good example, need a bad example
 * Wikimedia-specific gotchas
 * Choose the right persistence layer for your needs: Redis job queue, MariaDB database, Swift file store, or memcached cache. need good and bad examples
 * Wikimedia uses and depends heavily on many different caching layers, so your code needs to work in that environment! (But it also must work if everything misses cache.) need good and bad examples
 * We share a cache and want to increase the cache hit ratio; watch out if you're introducing new cookies, shared resources, bundled requests or calls, or other changes that will vary requests and reduce cache hit ratio. need good example, have bad examples

How to think about performance
Measure how fast your code works, so you can make decisions based on facts instead of superstition or feeling. Use these principles together with the Architecture guidelines and Security guidelines.

Dos and Don'ts
In the worst case, an action that is sort of expensive but valuable, if it misses hitting the cache, should take at most 5 seconds. Strive for two seconds.


 * example: saving a new edit to a page
 * example: rendering a video thumbnail

ResourceLoader
We want to deliver CSS and JavaScript fast (bundled, minified, and avoiding duplication) while retaining the benefits of caching. Thus, we use ResourceLoader.

Learn how to develop code with ResourceLoader; we have reasonably complete documentation in the ResourceLoader section.

Be careful about timestamp and freshness computation.

Good example: Line 104 and onwards of the Example extension demonstrates how to use ResourceLoader well.

Bad examples:
 * See the "before" parts of the "before-and-after" explanations in ResourceLoader/Developing with ResourceLoader. Fix: There is no such explanation
 * Extension:SyntaxHighlight GeSHi before used to embed the styles in the section of page HTML directly, without using ResourceLoader.

Deferring loading
Defer loading modules that don't affect the initial rendering of the page (above the fold). Load as little JavaScript as needed from the top loading queue.

Downloading and executing resources (styles, MediaWiki messages, and scripts) can slow down a user's experience. Per "Developing with ResourceLoader", when possible, load modules that the user will not immediately need via the bottom queue, rather than the top queue.

Chrome's developer tools are a good way to introspect the order in which your code is loading resources. For further advice on tuning front-end performance, Ilya Grigorik's book "High Performance Browser Networking" is excellent and available to read for free.

Good example: https://github.com/wikimedia/mediawiki-extensions-Wikibase/blob/master/client/resources/wikibase.client.linkitem.init.js performs lazy loading.

Bad example: That Wikibase JS code used to be a script, "a big JS module in Wikibase that would have loaded ~90kib of javascript in the wikipedias". hoo says, "note that it has *huge* dependencies on other things not usually loaded in client at all. that's the actual point, introducing dependencies".

Preserving positioning
Users should have a smooth experience; different components should render progressively. Preserve positioning of elements (e.g. avoid pushing content in a reflow).

Don't have your code discover that an element is needed and then cause them to appear. Instead, have your code preserve a place for the element ahead of time, or display an option greyed out until you know whether it's active.

Good example:
 * Putting width/height on elements!
 * Reserve space for elements that will be rendered by JavaScript if you're sure they will be there

Bad example:
 * Fundraising and other site info banners on Wikipedia may cause a "jump" of content when the banner is inserted.
 * This sort of thing can be tricky to fix "right" because the presence or height of the banner is semi-randomized.
 * The VisualEditor tab ("Edit") used to display last, pushing the "Edit source" tab to one side just as some users clicked on an "Edit source" pixel. Users ended up using an unintended editing interface.

Shared environment
Your code is running in a shared environment. Thus, long-running queries should be on a dedicated server, and watch out for deadlocks and lock-wait timeouts.

long-running queries that do reads should be on a dedicated server like we do with analytics, whether the read is longrunning or you have repeatable read - 1 transaction open for more than (ideally) seconds: bad idea on production servers. MySQL has to keep various rows open in indices, makes things slow in general. For other queries on other tables! ** There are research databases - use those.

Numbers: relative to perf server & to how often it will be run. (also more esoteric considerations)

gap locking - there are antipatterns (could be link tables, etc.) Entity doesn't have to be userID. BETTER: a few ways.
 * entity value attribute - key value map per entity - user id, pref id, pref value - it's tempting to - when you change prefs, delete all the rows for that userID, then reinsert new ones.
 * Have a JSON blob (hard to join on indiv rows)
 * change the query so you only delete by the primary key (which means you have to SELECT it first)
 * Locking select? don't do that.
 * So: select first, then decide what to do.... when you INSERT, you can INSERT IGNORE - if the row already exists, meh.


 * Careful in mixing ops on an external thing like SWIFT or another database with a db transaction. Be careful. This is also about locking order. Every time you update or delete or insert anything, ask what you are locking, are there other callers, what are you doing after making the query all the way to making the commit?


 * Every web request, everything should happen in a transaction


 * avoid excessive contention
 * avoid locking things in an unnecessary order, espec slow, committing at the end
 * counter column you increment every time something happens. DON'T increment it in a hook before you parse a page for 10 seconds.

Good example:. When we update message blobs (JSON collections of several translations of specific messages), we have to conditionally update certain rows and deal with concurrent attempts to update. In a previous version of the code, the code locked a row in order to write to it and avoid overwrites, but this could have led to contention. In contrast, in the current codebase, the  method performs a repeated attempt at update until it determines (by checking timestamps) that there will be no conflict. See lines 212-214 for an explanation and see and 208-234 for the outer do-while loop that processes  until it is empty.

Bad example: How we used to do ArticleFeedbackv5. See minutes 11-13 of Asher's talk & https://commons.wikimedia.org/w/index.php?title=File:MediaWikiPerformanceProfiling.pdf&page=17

Indexing
The tables you create will be shared by other code. Use indexing, yes, including writes.

Unless you're dealing with a tiny table, you need to index writes (similarly to reads). Watch out for deadlocks and for lock-wait timeouts (e.g., doing an update or a delete by primary query, rather than some secondary key).

Use EXPLAIN & MYSQL DESCRIBE query to find out which indexes are affected by a specific query. (will go into HOWTO) (If it says "Using temporary table" or "Using filesort" in the EXTRA column, that's bad! If "possible_keys" is NULL, that's bad!)

Make sure join conditions are cleanly indexed.

Compound keys - namespace-title pairs are all over the database. You need to order your query by asking for namespace first, then title!

You cannot index blobs, but you can index blob prefixes (the substring comprising the first several characters of the blob).

Good example: See the sections starting at line 802 and line 1429 of tables.sql, specifically the  and   tables. One of them also offers a reverse index, which gives you a cheap alternative to SORT BY.

Bad example: See this changeset (a fix). As the note states, "needs to be id/type, not type/id, according to the definition of the relevant index in wikibase.sql: wb_entity_per_page (epp_entity_id, epp_entity_type)". Rather than using the index that was built on the id-and-type combination, the previous code (that this is fixing) attempted to specify an index that was "type-and-id", which did not exist. Thus, MariaDB did not use the index, and thus instead tried to order the table without using the index, which caused the database to try to sort 20 million rows with no index.

Persistence layer
Choose the right persistence layer for your needs: Redis job queue, MariaDB database, Swift file store, or memcached cache.

These four tools are all local services that we expect to fetch things from. (Also things like Parsoid that plug in for specific things like VisualEditor.) We expect them to be on a low-latency network. They are local services, as opposed to remote services like Varnish.

People often put things into databases that ought to be in a cache or a queue. Here's when to use which:
 * 1) MySQL/MariaDB database - longterm storage of structured data and blobs.
 * 2) Swift file store - longterm storage for binary files that may be large.
 * 3) memcached cache - storage for quick things that you don't need to keep - you're fine with losing any one thing
 * 4) Redis jobqueue - you put them in, the job is done, and then they are done. You don't want to lose them before they are run. But you are ok with there being a delay.
 * (in the future maybe we should have a high-latency and a low-latency queue.)

Images: Image loading involves Swift but not just Swift! First, your code must get the metadata about the image, which might come from the local database, or memcached, or Commons. Then, you need to get a thumbnail of the image at the dimensions your page requires. Rather than create the thumbnail immediately on demand via parsing the filename and dimensions, Wikimedia's MediaWiki is configured to use the "404 handler." (see Manual:Thumb_handler.php) Your page first receives a URL indicating the eventual location of the thumbnail, then the browser asks for that URL; the web server initially gets an internal 404 error; the 404 handler then kicks off the thumbnailer to create the thumbnail, and the response gets sent to the client.


 * As it is sent to the client, each thumbnail is stored in a SWIFT store (and later discarded if the canonical image changes or is deleted) and in our Varnish caches (and evicted if the image changes or is deleted).

Permanent names: In general, store resources under names that won't change. We made the mistake of storing files under their "pretty names" - if you click Move, it ought to be fast (renaming title), but other versions of the file also have to be renamed. And Swift is distributed, so you can't just change the metadata on one volume of one system.

Object size: Memcached sometimes gets abused by putting big objects in there, or where it would be cheaper to recalculate than to retrieve. So don't put things in memcached that are TOO trivial - that causes an extra network fetch for very little gain. A very simple lookup, like "is a page watched by current user", does not go in the cache, because it's indexed well so it's a fast DB lookup.

When to use the job queue: If the thing to be done is fast (~5 milliseconds) or needs to happen synchronously, then do it synchronously. Otherwise, put it in the job queue. Examples using the job queue:


 * Updating link table on pages modified by a template change
 * Transcoding a video that has been uploaded

In some cases it may be valuable to create separate classes of job queues -- for instance video transcoding done by Extension:TimedMediaHandler is stored in the job queue, but a dedicated runner is used to keep the very long jobs from flooding other servers. Currently this requires some manual intervention to accomplish (see TMH as an example).

Good example:

Bad example:

Work when cache hits and misses
Wikimedia uses and depends heavily on many different caching layers, so your code needs to work in that environment! (But it also must work if everything misses cache.)

axes:
 * avoid things that, on cache miss, are ridiculously slow. (People think that it's ok to count * and put memcache in front of it, but misses and timeouts eat a lot of resources. Caches are not magic.)
 * Make your queries such that uncached is okay
 * If you can't make it fast, see if you can do it in the background -- such as some of the statistics special pages that run expensive queries. These can then be run on a dedicated time on large installations. But again, this requires manual setup work -- only do this if you have to.

Good example:

Bad example:

Caching layers
We share a cache and want to increase the cache hit ratio; watch out if you're introducing new cookies, shared resources, bundled requests or calls, or other changes that will vary requests and reduce cache hit ratio.

Also watch out for putting inappropriate objects into a cache.

Good example:

Bad examples:
 * GettingStarted error: Don't use Token in your cookie name - https://git.wikimedia.org/commit/mediawiki%2Fextensions%2FGettingStarted.git/22fbeedfc51fdd3ed780ed52bc880df8d07bfc19
 * WikidataClient was fetching a large object from memcached just to decide which project group it was on, when it would have been more efficient to simply recompute it by putting the very few values needed into a global variable. (See the changeset that fixed the bug.)
 * Template parse on every page view is a bad thing, as it obviates the advantage of the parser cache (the cache that caches parsed wikitext).