User:Sharihareswara (WMF)/Performance guidelines

These performance guidelines aim to help MediaWiki developers avoid common performance problems that slow down MediaWiki and the site. Moved most to Performance guidelines and Performance profiling for Wikimedia code.

Wikimedia-specific gotchas

 * image loading?
 * cache & cachebusting?

axes:
 * avoid excessive contention
 * avoid locking things in an unnecessary order, espec slow, committing at the end
 * counter column you increment every time something happens. DON'T increment it in a hook before you parse a page for 10 seconds.
 * avoid things that, on cache miss, are ridiculously slow. (People think that it's ok to count * and put memcache in front of it, but misses and timeouts eat a lot of resources. Caches are not magic.)
 * Make your queries such that uncached is okay


 * What other hooks and weird extensions are happening?


 * We don't store files in a database. We store files Somewhere Else. Usually SWIFT for Wikimedia, but you could make the case to put it somewhere else.
 * If you are storing blobs, ... see if you can reuse current sys. In general, store resources under names that won't change.
 * We made the mistake of storing files under their "pretty names" - if you click Move, it ought to be fast (renaming title), but other versions of the file also have to be renamed. And Swift is distributed, so you can't just change the metadata on one volume of one system.

When to use the job queue
If the thing to be done is fast (~5 milliseconds) or needs to happen synchronously, then do it synchronously. Otherwise, put it in the job queue.

The job queue does things with the database.

Does a thing happen synchronously or async (maybe triggered by a user action)? * MW developers have sometimes thought things need to be synchronous - e.g., file uploads - and then squid timed out. So we had to move that to async * HTMLCacheUpdate used to be partly synchronous (a few backlinks, would do them immediately) - but then there were deadlocks. Users do not want to see deadlock notifications! Got changed maybe 2012 * GWToolset

We use Redis to do most of the heavy lifting - to store queue itself. * The runner part is still in-house;

Presentations and documents

 * 1) 2010 - Roan's Wikimania talk on security, scalability and performance for extension developers
 * 2) July 2011 - Tim's security & performance talk
 * 3) August 2011 - Tim on performance
 * 4) March 2012 - Asher on site performance, graphite, and gdash
 * 5) May 2012 - "Scalable Web Architecture and Distributed Systems" by Kate Matsudaira
 * 6) June 2012 - Roan's MySQL optimization tutorial (SQL indexing Tutorial.pdf)
 * 7) September 2012 - MediaWiki Performance Profiling.ogv
 * 8) February 2013 - Sumana on graphite and ganglia
 * 9) spring 2013 - Tim's performance talk at Amsterdam hackathon
 * 10) January 2014 - Architecture Summit notes on performance
 * 11) January 2014 - Graphite docs
 * 12) ? - Job class reference
 * 13) March 2014 - Our use cases for Redis
 * 14) April 2014 - Manual:Job queue (and, from November 2013, Manual:Job queue/For developers)
 * 15) ? - Manual:How_to_debug
 * 16) April 2014 - http://ljungblad.nu/post/83400324746/80-of-end-user-response-time-is-spent-on-the-frontend

Checking how your code works with cache layers
We are massively cached! Your code needs to work in that environment! (but also work if everything misses cache.)

There are ways for extensions to disable caching, but don't. You need to understand why caching is breaking things - it's a code smell. But this is a rookie mistake and it would be condescending to emphasize it too strongly.

We have stuff in Vagrant or should get to adding it -- so, for more complicated bits of caching, there's a Vagrant role you can flip on to test it. Note to Sumana: ask Ori or other Vagrant experts for more on this.
 * also - when to use the jobqueue

For future: The beta cluster should be env of last resort and basically will not tell you about perf problems, because it is all on VMs. Won't tell you about cachebusting either, really. So don't wait till it hits beta cluster to figure it out! (Maybe someday we'll have a real performance testing cluster that's not a set of VMs.)