Wikimedia Performance Team/Backend performance

How to think about performance

 * Be prepared to be surprised by the performance of your code - we are famously bad at predicting this.
 * Be scrupulous about measuring performance (in your development environment AND in production) and know where time is being spent.
 * When latency is identified, take responsibility for it and make it a priority
 * (you have the best idea of usage patterns & what to test)
 * Performance is often related to other code smells; think about the root cause.
 * Expensive but valuable actions that miss the cache should take, at most, 5 seconds; 2 seconds is better.
 * If that isn't enough consider using the job queue to perform a task on background servers

General performance principles

 * Front-end:
 * We want to deliver CSS and JavaScript fast (bundled, minified, and avoiding duplication) while retaining the benefits of caching. Thus, we use ResourceLoader. have good examples, need bad examples
 * Users and clients should get the first useful bit of JavaScript first. Thus: adopt the init module pattern, a.k.a. lazy loading. already have good and bad examples
 * Users should have a smooth experience; avoid reflows. need good and bad examples
 * Back-end:
 * Your code is running in a shared environment. Thus, long-running queries should be on a dedicated server, and watch out for deadlocks and lock-wait timeouts. need good example, have a bad example
 * The tables you create will be shared by other code. Use indexing, yes, including writes. need good and bad examples
 * Wikimedia-specific gotchas
 * Choose the right tool for the job: job queue versus database versus SWIFT need good and bad examples
 * We are massively cached! Your code needs to work in that environment! (but also work if everything misses cache.) need good and bad examples
 * We share a cache and want to increase the cache hit ratio; watch out if you're introducing new cookies, shared resources, bundled requests or calls, or other changes that will vary requests and reduce cache hit ratio. need good example, have a bad example

Details
In the worst case, an action that is sort of expensive but valuable, if it misses hitting the cache, should take at most 5 seconds. Strive for two seconds.


 * example: saving a new edit to a page
 * example: rendering a video thumbnail

Back-end performance (hitting the database)
Use EXPLAIN & MYSQL DESCRIBE query to find out which indexes are affected by a specific query. (will go into HOWTO)

Long-running queries
long-running queries that do reads should be on a dedicated server like we do with analytics, whether the read is longrunning or you have repeatable read - 1 transaction open for more than (ideally) seconds: bad idea on production servers. MySQL has to keep various rows open in indices, makes things slow in general. For other queries on other tables! ** There are research databases - use those.

Numbers: relative to perf server & to how often it will be run. (also more esoteric considerations)

Writes
Unless you're dealing with a tiny table, you need to index writes (similarly to reads). Watch out for deadlocks and for lock-wait timeouts (e.g., doing an update or a delete by primary query, rather than some secondary key).

gap locking - there are antipatterns (could be link tables, etc.) Entity doesn't have to be userID. BETTER: a few ways.
 * entity value attribute - key value map per entity - user id, pref id, pref value - it's tempting to - when you change prefs, delete all the rows for that userID, then reinsert new ones.
 * Have a JSON blob (hard to join on indiv rows)
 * change the query so you only delete by the primary key (which means you have to SELECT it first)
 * Locking select? don't do that.
 * So: select first, then decide what to do.... when you INSERT, you can INSERT IGNORE - if the row already exists, meh.

Bad example: how we used to do ArticleFeedbackv5. See minutes 11-13 of Asher's talk & https://commons.wikimedia.org/w/index.php?title=File:MediaWikiPerformanceProfiling.pdf&page=17

Mixing DB and non-DB transactions

 * Careful in mixing ops on an external thing like SWIFT or another database with a db transaction. Be careful. This is also about locking order. Every time you update or delete or insert anything, ask what you are locking, are there other callers, what are you doing after making the query all the way to making the commit?


 * Every web request, everything should happen in a transaction

Front-end performance (JavaScript and the browser)

 * We want to deliver CSS and JavaScript fast (bundled, minified, and avoiding duplication) while retaining the benefits of caching.
 * Thus, we use ResourceLoader. See good examples.
 * Adopt the init module pattern.
 * Avoid reflows (lots of good and bad examples, ask for help)

Wikimedia-specific gotchas

 * image loading?
 * cache & cachebusting?

axes:
 * avoid excessive contention
 * avoid locking things in an unnecessary order, espec slow, committing at the end
 * counter column you increment every time something happens. DON'T increment it in a hook before you parse a page for 10 seconds.
 * avoid things that, on cache miss, are ridiculously slow. (People think that it's ok to count * and put memcache in front of it, but misses and timeouts eat a lot of resources. Caches are not magic.)
 * Make your queries such that uncached is okay


 * What other hooks and weird extensions are happening?


 * We don't store files in a database. We store files Somewhere Else. Usually SWIFT for Wikimedia, but you could make the case to put it somewhere else.
 * If you are storing blobs, ... see if you can reuse current sys. In general, store resources under names that won't change.
 * We made the mistake of storing files under their "pretty names" - if you click Move, it ought to be fast (renaming title), but other versions of the file also have to be renamed. And Swift is distributed, so you can't just change the metadata on one volume of one system.

When to use the job queue
If the thing to be done is fast (~5 milliseconds) or needs to happen synchronously, then do it synchronously. Otherwise, put it in the job queue.


 * example: updating link table on pages modified by a template change
 * example: transcoding a video that has been uploaded