Wikimedia Performance Team/Backend performance

How to think about performance

 * Be prepared to be surprised by the performance of your code - we are famously bad at predicting this.
 * Be scrupulous about measuring performance (in your development environment AND in production) and know where time is being spent.
 * When latency is identified, take responsibility for it and make it a priority (you have the best idea of usage patterns and what to test).
 * Performance is often related to other code smells; think about the root cause.
 * Expensive but valuable actions that miss the cache should take, at most, 5 seconds; 2 seconds is better.

General performance principles

 * Front-end:
   * We want to deliver CSS and JavaScript fast (bundled, minified, and avoiding duplication) while retaining the benefits of caching. Thus, we use ResourceLoader.
   * Users and clients should get the first useful bit of JavaScript first. Thus: adopt the init module pattern.
   * Users should have a smooth experience; avoid reflows.
   * (More specific HOWTOs: look at your Chrome developer tools, the general Wikimedia grid graph, the bits Ganglia graph, and the app servers Ganglia graph, and see Performance statistics.)
 * Back-end:
   * Your code is running in a shared environment. Thus, long-running queries should run on a dedicated server, and watch out for deadlocks and lock-wait timeouts.
   * The tables you create will be shared by other code. Index them properly; this matters for writes as well as reads.
   * (More specific HOWTOs: follow Manual:Profiling, look at your local log, the general Wikimedia grid graph, the errors/fatals graph, the app servers Ganglia graph, and the Graphite graphs for the sections you've profiled; look for slow queries and parses, and see Performance statistics.)
 * Wikimedia-specific gotchas:
   * Choose the right tool for the job: job queue versus database versus Swift.
   * We are massively cached! Your code needs to work in that environment, but it must also work if everything misses the cache.

Details
In the worst case, an action that is somewhat expensive but valuable and that misses the cache should take at most 5 seconds. Strive for two seconds.


 * Example: saving a new edit to a page
 * Example: rendering a video thumbnail

Back-end performance (hitting the database)
Use MySQL's EXPLAIN and DESCRIBE to find out which indexes a specific query uses. (This will go into a HOWTO.)
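
A minimal sketch of the workflow, using MediaWiki's user_properties table (the specific user ID and property value here are made up):

    -- Show the table's columns and keys.
    DESCRIBE user_properties;

    -- Prefix a query with EXPLAIN to see the execution plan.
    EXPLAIN SELECT up_value
    FROM user_properties
    WHERE up_user = 12345 AND up_property = 'skin';
    -- In the output, "key" is the index MySQL chose and "rows" is the
    -- estimated number of rows scanned; key = NULL means a full table scan.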

Long-running queries
Long-running read queries should go to a dedicated server, as we do with analytics. Whether a single read is long-running or you hold a REPEATABLE READ transaction open, keeping one transaction open for more than (ideally) a few seconds is a bad idea on production servers: MySQL has to keep old row versions around for the open transaction, which makes things slow in general - even for other queries on other tables! There are research databases - use those.
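
As a quick check (assuming MySQL 5.5 or later, where the InnoDB information_schema tables exist), you can list transactions that have been open too long:

    -- List InnoDB transactions that have been open for more than 5 seconds.
    SELECT trx_id, trx_started, trx_mysql_thread_id, trx_query
    FROM information_schema.INNODB_TRX
    WHERE trx_started < NOW() - INTERVAL 5 SECOND;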

What counts as "long-running" is relative to the performance of the server and to how often the query will be run. (There are also more esoteric considerations.)

Writes
Unless you're dealing with a tiny table, writes need to use indexes just as reads do. Watch out for deadlocks and for lock-wait timeouts: for example, do an UPDATE or a DELETE by the primary key rather than by some secondary key. A sketch of the difference follows.
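
Here is a sketch using MediaWiki's recentchanges table as an example (the specific values are made up):

    -- Riskier: updating by a secondary key. InnoDB locks every row (and
    -- gap) the index range scan touches, inviting deadlocks and
    -- lock-wait timeouts under concurrency.
    UPDATE recentchanges SET rc_patrolled = 1
    WHERE rc_timestamp < '20130101000000';

    -- Safer: read the primary keys first with a plain SELECT, then
    -- update by primary key only.
    SELECT rc_id FROM recentchanges
    WHERE rc_timestamp < '20130101000000';
    UPDATE recentchanges SET rc_patrolled = 1
    WHERE rc_id IN (101, 102, 103);  -- IDs from the SELECT above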

Gap locking
Gap locking leads to antipatterns (in link tables and the like; the entity doesn't have to be a user ID). The classic case is an entity-attribute-value table - a key-value map per entity, such as (user ID, pref ID, pref value). When a user changes preferences, it's tempting to delete all the rows for that user ID and then reinsert the new ones, but that range DELETE takes gap locks and invites deadlocks. Better, a few ways:
 * Store a JSON blob instead (though that makes it hard to join on individual rows).
 * Change the query so you only delete by the primary key (which means you have to SELECT it first). A locking SELECT? Don't do that.
 * So: SELECT first, then decide what to do. When you INSERT, you can use INSERT IGNORE - if the row already exists, meh. (See the sketch after this list.)
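
A sketch of the pattern, using MediaWiki's user_properties table, where the (up_user, up_property) pair is the unique key (the specific values are made up):

    -- Antipattern: wipe and reinsert. The range DELETE takes gap locks
    -- and deadlocks with concurrent writers on nearby user IDs.
    DELETE FROM user_properties WHERE up_user = 12345;
    INSERT INTO user_properties (up_user, up_property, up_value)
    VALUES (12345, 'skin', 'vector'), (12345, 'language', 'en');

    -- Better: plain SELECT first (not FOR UPDATE), then delete only by
    -- the full unique key, and INSERT IGNORE the new rows.
    SELECT up_property FROM user_properties WHERE up_user = 12345;
    DELETE FROM user_properties
    WHERE up_user = 12345 AND up_property = 'obsolete_pref';  -- from the SELECT
    INSERT IGNORE INTO user_properties (up_user, up_property, up_value)
    VALUES (12345, 'skin', 'vector'), (12345, 'language', 'en');
    -- IGNORE skips rows whose unique key already exists instead of erroring.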

Bad example: how we used to do ArticleFeedbackv5. See minutes 11-13 of Asher's talk & https://commons.wikimedia.org/w/index.php?title=File:MediaWikiPerformanceProfiling.pdf&page=17

Mixing DB and non-DB transactions

 * Be careful when mixing operations on something external, like Swift or another database, with a DB transaction. This is also about locking order. Every time you UPDATE, DELETE, or INSERT anything, ask: what am I locking? Are there other callers? What happens between making the query and making the COMMIT? (See the sketch below.)


 * On every web request, everything should happen inside a transaction - which also means any locks you take are held until that transaction commits.
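
A sketch of the lock window (the table is MediaWiki's image table; the Swift call stands in for any external operation):

    BEGIN;
    UPDATE image SET img_size = 123456 WHERE img_name = 'Example.jpg';
    -- The row lock is held from the UPDATE until the COMMIT. If a slow
    -- external operation (say, a Swift upload) happens in between, every
    -- other writer to this row waits for the whole duration, and the
    -- external change cannot be rolled back along with the DB transaction.
    COMMIT;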

Front-end performance (JavaScript and the browser)

 * We want to deliver CSS and JavaScript fast (bundled, minified, and avoiding duplication) while retaining the benefits of caching.
 * Thus, we use ResourceLoader. See good examples.
 * Adopt the init module pattern.
 * Avoid reflows (lots of good and bad examples, ask for help)

Wikimedia-specific gotchas

 * image loading?
 * cache & cachebusting?

Axes:
 * Avoid excessive contention.
 * Avoid locking things in an unnecessary order, especially when slow work happens before the commit at the end.
 * Example: a counter column you increment every time something happens. DON'T increment it in a hook and then spend 10 seconds parsing a page before the commit; the counter row stays locked the whole time.
 * Avoid things that, on a cache miss, are ridiculously slow. (People think it's OK to do a COUNT(*) and put memcache in front of it, but misses and timeouts eat a lot of resources. Caches are not magic.) See the sketch after this list.
 * Make your queries such that running them uncached is okay.
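
A sketch of both points (revision is a real MediaWiki table; the site_counts counter table and its columns are made up for illustration):

    -- Ridiculously slow on a cache miss: counting a huge table each time.
    SELECT COUNT(*) FROM revision;

    -- Cheaper: maintain a counter row and increment it atomically. But
    -- commit promptly - if this runs in a hook followed by a 10-second
    -- parse, the counter row stays locked (serializing all writers)
    -- until the transaction commits.
    UPDATE site_counts SET sc_edits = sc_edits + 1 WHERE sc_row = 1;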


 * Also ask: what other hooks and weird extensions are running in the same request?


 * We don't store files in a database. We store files Somewhere Else - usually Swift for Wikimedia, but you could make the case to put them somewhere else.
 * If you are storing blobs, ... see if you can reuse the current system. In general, store resources under names that won't change.
 * We made the mistake of storing files under their "pretty names": clicking Move ought to be fast (it just renames the title), but all the other versions of the file have to be renamed too. And Swift is distributed, so you can't just change the metadata on one volume of one system. (A sketch of the immutable-name approach follows.)
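
A sketch of the principle with a hypothetical schema (file_meta and its columns are made up): the storage key never changes, so a rename touches one metadata row and nothing moves in Swift.

    CREATE TABLE file_meta (
        fm_storage_key VARBINARY(64)  NOT NULL PRIMARY KEY,  -- immutable (e.g. a content hash)
        fm_title       VARBINARY(255) NOT NULL,              -- the mutable "pretty name"
        UNIQUE KEY fm_title_idx (fm_title)
    ) ENGINE=InnoDB;

    -- Renaming is now one cheap metadata update; the Swift object,
    -- stored under fm_storage_key, never has to move.
    UPDATE file_meta SET fm_title = 'New_name.jpg'
    WHERE fm_title = 'Old_name.jpg';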

When to use the job queue
If the thing to be done is fast (~5 milliseconds) or needs to happen synchronously, then do it synchronously. Otherwise, put it in the job queue.