Wikimedia Performance Team/Backend performance

How to think about performance

 * Be prepared to be surprised by the performance of your code - we are famously bad at predicting this.
 * Be scrupulous about measuring performance (in your development environment AND in production) and know where time is being spent.
 * When latency is identified, take responsibility for it and make fixing it a priority (you have the best idea of usage patterns and what to test).
 * Performance is often related to other code smells; think about the root cause.
 * Expensive but valuable actions that miss the cache should take, at most, 5 seconds; 2 seconds is better.
 * If that isn't enough, consider using the job queue to perform the task on background servers.

General performance principles

 * Front-end:
   * We want to deliver CSS and JavaScript fast (bundled, minified, and avoiding duplication) while retaining the benefits of caching. Thus, we use ResourceLoader.
   * Defer loading modules that don't affect the initial rendering of the page (above the fold). Load as little JavaScript as needed from the top loading queue.
   * Users should have a smooth experience; different components should render progressively. Preserve the positioning of elements (e.g. avoid pushing content down in a reflow). (need good and bad examples)
 * Back-end:
   * Your code runs in a shared environment. Long-running queries should therefore be on a dedicated server, and watch out for deadlocks and lock-wait timeouts. (need a good example; we have a bad example)
   * The tables you create will be shared by other code. Use indexing, including for writes. (need good and bad examples)
 * Wikimedia-specific gotchas:
   * Choose the right tool for the job: job queue versus database versus SWIFT versus memcached. (need good and bad examples)
   * Wikimedia uses and depends heavily on many different caching layers, so your code needs to work in that environment! (But it also must work if everything misses the cache.) (need good and bad examples)
   * We share a cache and want to increase the cache hit ratio; watch out if you're introducing new cookies, shared resources, bundled requests or calls, or other changes that will vary requests and reduce the cache hit ratio. (need a good example; we have a bad example)

How to think about performance
Measure how fast your code works, so you can make decisions based on facts instead of superstition or feeling. Use these principles together with the Architecture guidelines and Security guidelines.

Dos and Don'ts
In the worst case, an expensive but valuable action that misses the cache should take at most 5 seconds. Strive for 2 seconds.


 * example: saving a new edit to a page
 * example: rendering a video thumbnail

ResourceLoader
We want to deliver CSS and JavaScript fast (bundled, minified, and avoiding duplication) while retaining the benefits of caching. Thus, we use ResourceLoader.

Learn how to develop code with ResourceLoader; we have reasonably complete documentation in the ResourceLoader section.

Good example: Line 104 and onwards of the Example extension demonstrates how to use ResourceLoader well.

Bad examples: See the "before" parts of the "before-and-after" explanations in ResourceLoader/Developing with ResourceLoader.

Deferring loading
Defer loading modules that don't affect the initial rendering of the page (above the fold). Load as little JavaScript as needed from the top loading queue.

Ilya Grigorik's book "High Performance Browser Networking" is excellent and available to read for free.

Good example: https://github.com/wikimedia/mediawiki-extensions-Wikibase/blob/master/client/resources/wikibase.client.linkitem.init.js performs lazy loading.

Bad example: That Wikibase JS code used to be loaded eagerly as a plain script, "a big JS module in Wikibase that would have loaded ~90kib of javascript in the wikipedias". hoo says, "note that it has *huge* dependencies on other things not usually loaded in client at all. that's the actual point, introducing dependencies".

Preserving positioning
Users should have a smooth experience; different components should render progressively. Preserve the positioning of elements (e.g. avoid pushing content down in a reflow). (need good and bad examples)

Shared environment
Your code is running in a shared environment. Thus, long-running queries should be on a dedicated server, and watch out for deadlocks and lock-wait timeouts.

Long-running read queries should go to a dedicated server, as we do with analytics. Keeping one transaction open for more than (ideally) a few seconds is a bad idea on production servers, whether that's because a single read is long-running or because you use REPEATABLE READ: MySQL has to keep old row versions around in its indices, which makes things slow in general, even for other queries on other tables. There are research databases - use those.

What counts as "long-running" is relative to the performance of the server and to how often the query will be run (there are also more esoteric considerations).

Gap locking has antipatterns (link tables and the like; the entity doesn't have to be a user ID). Consider an entity-value-attribute table - a key-value map per entity: user ID, preference ID, preference value. When a user changes their preferences, it is tempting to delete all the rows for that user ID and then reinsert the new ones, but that takes gap locks. Better, in a few ways:
 * Store the preferences as a JSON blob instead (though then it's hard to join on individual rows).
 * Change the query so that you only delete by the primary key (which means you have to SELECT it first).
 * A locking SELECT? Don't do that.
 * In short: SELECT first, then decide what to do. When you INSERT, you can use INSERT IGNORE - if the row already exists, no harm done.
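The select-first pattern can be sketched with Python's standard sqlite3 module. The user_properties table here mirrors MediaWiki's naming, but this is only an illustration of the access pattern: SQLite doesn't take gap locks, and production code would run against MySQL with explicit transactions.

```python
import sqlite3

# Hypothetical entity-value-attribute table: one row per (user, preference).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE user_properties (
        up_user  INTEGER,
        up_name  TEXT,
        up_value TEXT,
        PRIMARY KEY (up_user, up_name)
    )
""")

def save_preferences(conn, user_id, prefs):
    """Update a user's preferences without a blanket DELETE by user ID.

    SELECT first, then delete only the rows that actually disappeared
    (addressed by primary key), and use INSERT OR IGNORE plus UPDATE
    (the sqlite analogue of MySQL's INSERT IGNORE) for the rest, so we
    never take broad locks on rows we don't touch.
    """
    existing = dict(conn.execute(
        "SELECT up_name, up_value FROM user_properties WHERE up_user = ?",
        (user_id,)))
    # Delete only the keys that were removed, by primary key.
    for name in set(existing) - set(prefs):
        conn.execute(
            "DELETE FROM user_properties WHERE up_user = ? AND up_name = ?",
            (user_id, name))
    for name, value in prefs.items():
        if existing.get(name) == value:
            continue  # unchanged: no write, no lock
        conn.execute(
            "INSERT OR IGNORE INTO user_properties VALUES (?, ?, ?)",
            (user_id, name, value))
        conn.execute(
            "UPDATE user_properties SET up_value = ? "
            "WHERE up_user = ? AND up_name = ?",
            (value, user_id, name))
    conn.commit()

save_preferences(conn, 1, {"skin": "vector", "lang": "en"})
save_preferences(conn, 1, {"skin": "monobook"})  # drops "lang", updates "skin"
```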


 * Be careful when mixing operations on an external system, such as SWIFT or another database, with a database transaction. This is also about locking order. Every time you update, delete, or insert anything, ask: what am I locking? Are there other callers? What am I doing between making the query and issuing the commit?


 * Within every web request, everything should happen in a single transaction.


 * Avoid excessive contention.
 * Avoid locking things in an unnecessary order, especially slowly, with the commit only at the end.
 * Example: a counter column that you increment every time something happens. Don't increment it in a hook before spending 10 seconds parsing a page, because the counter row stays locked until the commit.
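The counter advice can be sketched with sqlite3 (site_stats and ss_edits echo MediaWiki's naming; the sleep is a stand-in for the slow parse, and SQLite's locking is coarser than MySQL's row locks):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:", isolation_level=None)  # manual transactions
conn.execute(
    "CREATE TABLE site_stats (ss_row INTEGER PRIMARY KEY, ss_edits INTEGER)")
conn.execute("INSERT INTO site_stats VALUES (1, 0)")

def parse_page():
    time.sleep(0.01)  # stands in for ~10 seconds of parsing

# Bad: increment first, then do the slow work in the same transaction.
# The counter row stays locked until COMMIT, so every concurrent editor
# queues up behind the parse.
def record_edit_bad(conn):
    conn.execute("BEGIN")
    conn.execute("UPDATE site_stats SET ss_edits = ss_edits + 1 WHERE ss_row = 1")
    parse_page()  # the row is locked for this entire time
    conn.execute("COMMIT")

# Better: do the slow work first, take the lock last, commit immediately.
def record_edit_good(conn):
    parse_page()
    conn.execute("BEGIN")
    conn.execute("UPDATE site_stats SET ss_edits = ss_edits + 1 WHERE ss_row = 1")
    conn.execute("COMMIT")

record_edit_good(conn)
```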

Good example: When we update message blobs (JSON collections of several translations of specific messages), we have to conditionally update certain rows and deal with concurrent attempts to update. In a previous version of the code, a row was locked in order to write to it and avoid overwrites, but this could have led to contention. In contrast, the current code repeatedly attempts the update until it determines (by checking timestamps) that there will be no conflict. See lines 212-214 for an explanation, and lines 208-234 for the outer do-while loop that repeats until there is nothing left to process.

Bad example: How we used to do ArticleFeedbackv5. See minutes 11-13 of Asher's talk & https://commons.wikimedia.org/w/index.php?title=File:MediaWikiPerformanceProfiling.pdf&page=17

Indexing
The tables you create will be shared by other code. Use indexing, yes, including writes.

Unless you're dealing with a tiny table, you need to index writes as well as reads. Watch out for deadlocks and for lock-wait timeouts (e.g. do an UPDATE or DELETE by primary key, rather than by some secondary key).

Use MySQL's EXPLAIN and DESCRIBE to find out which indexes are affected by a specific query (this will go into a HOWTO). If the Extra column says "Using temporary" or "Using filesort", that's bad! If possible_keys is NULL, that's bad!

Make sure join conditions are cleanly indexed.

Compound keys: namespace-title pairs are all over the database. You need to write your query so that it filters on the namespace first, then the title!
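The leading-column rule can be seen with Python's sqlite3 (the plan strings differ from MySQL's EXPLAIN output, but the principle is the same; the table and index follow MediaWiki's page schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page (page_id INTEGER PRIMARY KEY, "
             "page_namespace INTEGER, page_title TEXT)")
conn.execute("CREATE INDEX name_title ON page (page_namespace, page_title)")

# Good: the WHERE clause supplies the leading column (namespace), so
# the compound index can be used.
good = conn.execute(
    "EXPLAIN QUERY PLAN SELECT page_id FROM page "
    "WHERE page_namespace = 0 AND page_title = 'Foo'").fetchone()[-1]
print(good)  # the plan mentions the name_title index

# Bad: filtering on the title alone skips the leading column, so the
# index is unusable and the whole table is scanned (in MySQL's EXPLAIN,
# this is where possible_keys shows up as NULL).
bad = conn.execute(
    "EXPLAIN QUERY PLAN SELECT page_id FROM page "
    "WHERE page_title = 'Foo'").fetchone()[-1]
print(bad)   # a full table scan
```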

Good example: See the sections starting at line 802 and line 1429 of tables.sql, specifically the  and   tables.

Persistence layer
Choose the right tool for the job: job queue versus database versus SWIFT versus memcached.

These are all services we expect to fetch things from. (Also things like Parsoid that plug in for specific things like VisualEditor.)

We expect them to be on a low-latency network.

There are local services & remote services.


 * 1) MySQL/MariaDB database - long-term storage of structured data.
 * 2) SWIFT file store - long-term storage for binary files that may be large.
 * 3) memcached - cache storage for quick things that you don't need to keep; you're fine with losing any one item.
 * 4) Redis job queue - you put jobs in, and they get done at some point. You don't want to lose them before they run, but you are OK with a delay.
 * (In the future, maybe we should have a high-latency and a low-latency queue.)

People often put things into DBs that ought to be in a cache or a queue.

Memcached sometimes gets abused: people put big objects in it, or things that would be cheaper to recalculate than to retrieve. Don't put things in memcached that are too trivial - that causes an extra network fetch for very little gain. For example, a very simple lookup like "is this page watched by the current user" is not cached, because the table is well indexed and the DB lookup is fast.
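The trade-off can be sketched with a get-or-compute helper (a toy in-process dict stands in for memcached; `get_with_set` is a hypothetical name, loosely in the spirit of MediaWiki's getWithSetCallback):

```python
import time

class Cache:
    """Toy in-process stand-in for memcached: get/set with a TTL."""
    def __init__(self):
        self._data = {}
    def get(self, key):
        value, expiry = self._data.get(key, (None, 0.0))
        return value if time.time() < expiry else None
    def set(self, key, value, ttl):
        self._data[key] = (value, time.time() + ttl)

def get_with_set(cache, key, ttl, compute):
    """Fetch from the cache, recomputing and storing on a miss.

    Worth it only when compute() costs more than a cache round-trip; a
    well-indexed primary-key lookup ("is this page watched?") should
    just query the database directly.
    """
    value = cache.get(key)
    if value is None:
        value = compute()
        cache.set(key, value, ttl)
    return value

cache = Cache()
calls = []
def active_user_count():
    calls.append(1)  # stands in for a slow aggregation query
    return 12345

assert get_with_set(cache, "active-users", 60, active_user_count) == 12345
assert get_with_set(cache, "active-users", 60, active_user_count) == 12345
assert len(calls) == 1  # the second call was a cache hit
```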


 * Example: loading an image.
   1. Get the metadata about the image: from the local database, or memcached, or Commons, or via InstantCommons from another site (very slowly, through the API).
   2. Then produce the image. MediaWiki has two modes: either create the thumbnail on demand during parsing, or via a 404 handler (which is what we do).

Thumbnails are stored in a SWIFT store.
 * We discard a thumbnail if the image changes or if there's a mistake.


 * We don't store files in a database. We store files somewhere else - usually SWIFT for Wikimedia, though you could make the case for putting them elsewhere.
 * If you are storing blobs, see if you can reuse the current system. In general, store resources under names that won't change.
 * We made the mistake of storing files under their "pretty names": if you click Move, the rename ought to be fast (just a title change), but every other version of the file also has to be renamed. And SWIFT is distributed, so you can't just change the metadata on one volume of one system.

When to use the job queue: if the thing to be done is fast (~5 milliseconds) or needs to happen synchronously, do it synchronously. Otherwise, put it in the job queue.


 * example: updating link table on pages modified by a template change
 * example: transcoding a video that has been uploaded
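The decision rule above can be sketched like this (Python's queue and threading modules stand in for Redis and the job-runner processes; the 5 ms threshold comes from the guideline):

```python
import queue
import threading

job_queue = queue.Queue()

def run_or_defer(task, estimated_ms):
    """Run fast work synchronously; defer slow work to the job queue.

    `task` is any zero-argument callable; `estimated_ms` is your own
    estimate of how long it takes.
    """
    if estimated_ms <= 5:
        task()               # cheap enough to do within the request
    else:
        job_queue.put(task)  # a background runner will pick it up

def job_runner():
    # Stand-in for a job-runner process consuming the queue.
    while True:
        task = job_queue.get()
        if task is None:     # sentinel: shut down
            break
        task()               # e.g. refresh link tables, transcode a video

done = []
run_or_defer(lambda: done.append("fast"), estimated_ms=2)     # runs now
run_or_defer(lambda: done.append("slow"), estimated_ms=3000)  # deferred

worker = threading.Thread(target=job_runner)
worker.start()
job_queue.put(None)          # tell the runner to stop after draining
worker.join()
```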

Work when cache hits and misses
Wikimedia uses and depends heavily on many different caching layers, so your code needs to work in that environment! (But it also must work if everything misses cache.)

Two axes:
 * Avoid things that are ridiculously slow on a cache miss. (People think it's OK to run COUNT(*) and put memcached in front of it, but misses and timeouts eat a lot of resources. Caches are not magic.)
 * Write your queries such that running uncached is okay.
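One way to make the uncached path okay is to bound it. This sketch (sqlite3 again; revision and rev_page follow MediaWiki's naming, count_capped is a hypothetical helper) replaces an unbounded COUNT(*) with a capped count, so even a total cache miss costs at most ~500 rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_page INTEGER)")
conn.executemany("INSERT INTO revision (rev_page) VALUES (?)", [(1,)] * 2000)

# Anti-pattern: SELECT COUNT(*) over an unbounded set, hidden behind a
# cache. On a miss it scans every matching row, however many there are.

def count_capped(conn, page_id, cap=500):
    """Count revisions of a page, but stop counting at `cap` rows.

    Returns (count, capped); a UI can then display "500+" instead of an
    exact number, and a cache miss has a bounded cost.
    """
    n = conn.execute(
        "SELECT COUNT(*) FROM "
        "(SELECT 1 FROM revision WHERE rev_page = ? LIMIT ?)",
        (page_id, cap + 1)).fetchone()[0]
    return (min(n, cap), n > cap)

print(count_capped(conn, 1))  # (500, True): more than 500 revisions
print(count_capped(conn, 2))  # (0, False): this page has no revisions
```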

Caching layers
We share a cache and want to increase the cache hit ratio; watch out if you're introducing new cookies, shared resources, bundled requests or calls, or other changes that will vary requests and reduce the cache hit ratio. (need a good example; we have a bad example)