User:Sharihareswara (WMF)/Performance guidelines

These performance guidelines aim to help MediaWiki developers avoid common performance problems that slow down MediaWiki and the site.

What To Do (summary)
Meta-TODOs:
 * Be prepared to be surprised by the performance of your code; we are famously bad at predicting this.
 * Be scrupulous about measuring performance, and know where time is being spent.
 * When latency is identified, take responsibility for it and make it a priority (you have the best idea of usage patterns and what to test).
 * Performance problems are often related to other code smells; think about the root cause.

TODOs:
 * Run EXPLAIN (or DESCRIBE, its synonym in MySQL) on a query to find out which indexes that query uses
 * Adopt the init module pattern
 * Avoid reflows (lots of good and bad examples, ask for help)

Details
In the worst case, an action that is expensive but valuable and that misses the cache should take at most 5 seconds. Strive for 2 seconds. Examples of such actions:
 * saving a new edit to a page
 * rendering a video thumbnail

Long-running queries
Long-running queries that do reads should run on a dedicated server, as we do with analytics. This applies whether a single read is long-running or you are relying on repeatable read: keeping one transaction open for more than (ideally) a few seconds is a bad idea on production servers. MySQL has to keep various rows open in its indexes for as long as the transaction lasts, which makes things slow in general, even for other queries on other tables! There are research databases; use those.

Numbers: what counts as acceptable is relative to the performance of the server and to how often the query will be run (there are also more esoteric considerations).

Writes
Unless you're dealing with a tiny table, writes need to use indexes just as reads do. Watch out for deadlocks and for lock-wait timeouts; to avoid them, do an update or a delete by primary key rather than by some secondary key.

Gap locking: there are antipatterns that trigger it. The classic one is an entity-attribute-value table, that is, a key-value map per entity (user ID, preference ID, preference value). The entity doesn't have to be a user ID; link tables and the like can have the same problem. When preferences change, it's tempting to delete all the rows for that user ID and then reinsert the new ones. There are a few better ways:
 * Store a JSON blob instead (but it is hard to join on individual rows).
 * Change the query so you only delete by the primary key (which means you have to SELECT it first).
 * A locking SELECT? Don't do that.
 * So: SELECT first, then decide what to do. When you INSERT, you can use INSERT IGNORE; if the row already exists, the insert is simply skipped.
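The "select first, then delete by primary key" approach can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for MySQL, so it shows the shape of the queries rather than InnoDB's locking behaviour. The `user_pref` table, its columns, and `save_prefs` are hypothetical names, and `INSERT OR IGNORE` is SQLite's spelling of MySQL's `INSERT IGNORE`.

```python
import sqlite3

# Hypothetical schema: one row per (user, preference) pair.
SCHEMA = """
CREATE TABLE user_pref (
    pref_id    INTEGER PRIMARY KEY,
    user_id    INTEGER NOT NULL,
    pref_key   TEXT    NOT NULL,
    pref_value TEXT,
    UNIQUE (user_id, pref_key)
)
"""


def save_prefs(conn, user_id, new_prefs):
    """Bring one user's stored preferences in line with new_prefs."""
    # Plain, non-locking SELECT first, then decide what to do.
    existing = {
        key: (pref_id, value)
        for pref_id, key, value in conn.execute(
            "SELECT pref_id, pref_key, pref_value FROM user_pref WHERE user_id = ?",
            (user_id,),
        )
    }
    stale = [(pid,) for key, (pid, _) in existing.items() if key not in new_prefs]
    changed = [
        (value, existing[key][0])
        for key, value in new_prefs.items()
        if key in existing and existing[key][1] != value
    ]
    fresh = [(user_id, k, v) for k, v in new_prefs.items() if k not in existing]

    with conn:  # commits on success, rolls back on an exception
        # Deletes and updates touch rows by primary key only.
        conn.executemany("DELETE FROM user_pref WHERE pref_id = ?", stale)
        conn.executemany(
            "UPDATE user_pref SET pref_value = ? WHERE pref_id = ?", changed
        )
        # If the row already exists (say, a concurrent request added it), skip it.
        conn.executemany(
            "INSERT OR IGNORE INTO user_pref (user_id, pref_key, pref_value) "
            "VALUES (?, ?, ?)",
            fresh,
        )
```

Compared with "delete everything for the user, then reinsert", unchanged rows are never touched, and every write names an exact primary key instead of a range of a secondary index.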

Bad example: how we used to do ArticleFeedbackv5. See minutes 11-13 of Asher's talk & https://commons.wikimedia.org/w/index.php?title=File:MediaWikiPerformanceProfiling.pdf&page=17

Mixing DB and non-DB transactions

 * Be careful when mixing operations on an external system, such as Swift or another database, with a database transaction. This is also about locking order. Every time you update, delete, or insert anything, ask: what are you locking? Are there other callers? What are you doing between making the query and making the commit?
 * Within every web request, everything should happen in a transaction.

Front-end performance (JavaScript and the browser)

 * Adopt the init module pattern
 * Avoid reflows (lots of good and bad examples, ask for help)

Wikimedia-specific gotchas

 * image loading?
 * cache & cachebusting?

axes:
 * Avoid excessive contention.
 * Avoid locking things in an unnecessary order, especially when you do something slow and only commit at the end.
 * If you have a counter column that you increment every time something happens, DON'T increment it in a hook before you parse a page for 10 seconds.
 * Avoid things that, on a cache miss, are ridiculously slow. (People think that it's OK to run a COUNT(*) and put memcached in front of it, but misses and timeouts eat a lot of resources. Caches are not magic.)
 * Write your queries so that running them uncached is okay.
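One way to read the last three points together is sketched below, as an assumption-laden illustration rather than anything MediaWiki actually does. It uses Python's sqlite3 module in place of MySQL and a plain dict in place of memcached; the `feedback` and `feedback_count` tables and both function names are hypothetical. The counter row is touched last, right before the commit, and a cache miss costs one primary-key lookup instead of a COUNT(*).

```python
import sqlite3

# Hypothetical tables: the raw rows plus one counter row per page.
SCHEMA = """
CREATE TABLE feedback (
    feedback_id INTEGER PRIMARY KEY,
    page_id     INTEGER NOT NULL,
    body        TEXT    NOT NULL
);
CREATE TABLE feedback_count (
    page_id INTEGER PRIMARY KEY,
    total   INTEGER NOT NULL
);
"""

cache = {}  # stand-in for memcached; a real cache can miss at any time


def add_feedback(conn, page_id, body):
    with conn:  # one transaction, committed when the block exits
        conn.execute(
            "INSERT INTO feedback (page_id, body) VALUES (?, ?)", (page_id, body)
        )
        # Touch the contended counter row last, so that it is only held
        # until the commit that follows immediately.
        conn.execute(
            "INSERT OR IGNORE INTO feedback_count (page_id, total) VALUES (?, 0)",
            (page_id,),
        )
        conn.execute(
            "UPDATE feedback_count SET total = total + 1 WHERE page_id = ?",
            (page_id,),
        )
    cache.pop(("feedback_count", page_id), None)  # drop the stale cached value


def get_feedback_count(conn, page_id):
    key = ("feedback_count", page_id)
    if key in cache:
        return cache[key]
    # Cache miss: a single primary-key lookup, not COUNT(*) over the table.
    row = conn.execute(
        "SELECT total FROM feedback_count WHERE page_id = ?", (page_id,)
    ).fetchone()
    total = row[0] if row else 0
    cache[key] = total
    return total
```

Any slow work (such as parsing) would happen before `add_feedback` is called, so nothing slow sits between the counter update and the commit.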


 * What other hooks and weird extensions are happening?


 * We don't store files in a database. We store files Somewhere Else. Usually that is Swift for Wikimedia, but you could make the case to put them somewhere else.
 * If you are storing blobs, ... see if you can reuse the current system. In general, store resources under names that won't change.
 * We made the mistake of storing files under their "pretty names". If you click Move, it ought to be fast (it's just renaming a title), but other versions of the file also have to be renamed. And Swift is distributed, so you can't just change the metadata on one volume of one system.

When to use the job queue
If the thing to be done is fast (~5 milliseconds) or needs to happen synchronously, then do it synchronously. Otherwise, put it in the job queue.
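The rule of thumb above can be written down as a tiny dispatcher. This is a hypothetical sketch, not MediaWiki's job queue API: `dispatch`, `run_jobs`, and the in-memory `queue.Queue` are all stand-ins chosen for illustration, and the caller is assumed to know roughly how long the work takes.

```python
import queue

SYNC_BUDGET_SECONDS = 0.005  # the ~5 millisecond rule of thumb

job_queue = queue.Queue()  # stand-in for the real job queue


def dispatch(task, *, expected_seconds, must_be_synchronous=False):
    """Run task now if it is fast or must be synchronous; otherwise queue it."""
    if must_be_synchronous or expected_seconds <= SYNC_BUDGET_SECONDS:
        return task()
    job_queue.put(task)
    return None  # the result will only exist once a runner executes the job


def run_jobs():
    """Drain the queue. In production a separate runner process does this."""
    results = []
    while not job_queue.empty():
        job = job_queue.get_nowait()
        results.append(job())
    return results
```

A call such as `dispatch(render_thumbnail, expected_seconds=2.0)` (with a hypothetical `render_thumbnail`) returns immediately and leaves the work for the runner.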

The job queue does things with the database.

Does a thing happen synchronously or asynchronously (maybe triggered by a user action)?
 * MW developers have sometimes thought things need to be synchronous (e.g., file uploads), and then Squid timed out. So we had to move that to async.
 * HTMLCacheUpdate used to be partly synchronous (if there were only a few backlinks, it would do them immediately), but then there were deadlocks. Users do not want to see deadlock notifications! This got changed, maybe in 2012.
 * GWToolset

We use Redis to do most of the heavy lifting: it stores the queue itself.
 * The runner part is still in-house.

Presentations and documents

 * 1) July 2011 - Tim's security & performance talk
 * 2) August 2011 - Tim on performance
 * 3) March 2012 - Asher on site performance, graphite, and gdash
 * 4) May 2012 - "Scalable Web Architecture and Distributed Systems" by Kate Matsudaira
 * 5) June 2012 - Roan's MySQL optimization tutorial (SQL indexing Tutorial.pdf)
 * 6) September 2012 - MediaWiki Performance Profiling.ogv
 * 7) February 2013 - Sumana on graphite and ganglia
 * 8) January 2014 - Architecture Summit notes on performance
 * 9) January 2014 - Graphite docs
 * 10) ? - Job class reference
 * 11) March 2014 - Our use cases for Redis
 * 12) April 2014 - Manual:Job queue (and, from November 2013, Manual:Job queue/For developers)
 * 13) ? - Manual:How_to_debug
 * 14) April 2014 - http://ljungblad.nu/post/83400324746/80-of-end-user-response-time-is-spent-on-the-frontend

Profiling/graphs to watch

 * General advice: in your development environment, follow https://www.mediawiki.org/wiki/Manual:How_to_debug#Profiling for the backend (including logging all your queries and inspecting that log), and look at your Chrome developer tools for the frontend. Once your code is in production, watch the general Wikimedia grid graph, the errors/fatals graph, the bits Ganglia graph, the app servers Ganglia graph, and the Graphite graphs for the sections you've profiled. Use https://commons.wikimedia.org/w/index.php?title=File:MediaWikiPerformanceProfiling.pdf&page=21 and http://noc.wikimedia.org/dbtree/ to look for slow queries.
 * backend: look out for duplicate queries, and queries generated within a for-loop. Run EXPLAIN on all read queries to ensure you're using the indices effectively.
 * check https://performance.wikimedia.org/profiler/report ? Note to Sumana: figure out how to read this effectively.
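As a rough illustration of the backend advice above (log your queries, look for queries generated inside a loop, and check the plan of each read query), here is a sketch using Python's sqlite3 module as a stand-in for MySQL. The `page` table layout and all function names are assumptions made for the example, and `EXPLAIN QUERY PLAN` is SQLite's counterpart to MySQL's `EXPLAIN`.

```python
import sqlite3


def start_query_log(conn):
    """Collect every statement SQLite runs on this connection."""
    log = []
    conn.set_trace_callback(log.append)
    return log


def count_selects(log):
    return sum(1 for sql in log if sql.lstrip().upper().startswith("SELECT"))


def titles_one_by_one(conn, page_ids):
    # Anti-pattern: one query per loop iteration.
    titles = []
    for page_id in page_ids:
        row = conn.execute(
            "SELECT title FROM page WHERE page_id = ?", (page_id,)
        ).fetchone()
        titles.append(row[0])
    return titles


def titles_batched(conn, page_ids):
    # Same result from a single query.
    if not page_ids:
        return []
    marks = ",".join("?" for _ in page_ids)
    rows = conn.execute(
        f"SELECT page_id, title FROM page WHERE page_id IN ({marks})",
        list(page_ids),
    ).fetchall()
    by_id = dict(rows)
    return [by_id[page_id] for page_id in page_ids]


def query_plan(conn, sql, params=()):
    """Return SQLite's plan description for a read query, one string per step."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params)]
```

Counting the SELECT statements in the log before and after a change is a cheap way to notice a query that has crept into a loop; `query_plan` shows whether a read is served from an index or from a full scan.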


 * RoanKattouw: let's say that you have written new code and it has just deployed. You are happy! Yay! You maybe even are listening to a pleasant bit of music in celebration. But what graphs should you keep an eye on to see whether your new code has materially affected performance?
 *  That depends radically on what kind of code it is
 *  For frontend things I usually watch network graphs
 *  Because most frontend performance problems correlate in some way with changes in network patterns on the bits cluster (which serves JS/CSS code to users)
 *  For other things, I tend to look at general health graphs (load avg, CPU) of the cluster that would be affected (bits Apache (ResourceLoader), API, etc.)

Ganglia
The Wikimedia grid is the umbrella for everything. From there it breaks down into cluster, then service, then maybe subservice, then location. The quartet of graphs up top for any cluster is always load/memory/CPU/network - the things you most care about! The "metric" choice determines what thumbnail gets displayed (after that quartet) for each machine in the cluster, so it doesn't matter too much. It's worth playing with the duration and switching it from an hour to a day or a week.

Sadly, there are often gaps in the data because of Ganglia outages; we'd like it to be more robust.
 * https://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=cpu_report&s=by+name&c=Bits+caches+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 - bits caches in eqiad
 * https://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=cpu_report&s=by+name&c=Application+servers+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 - the main Apache cluster, 136 boxes
 * You can look at the cache clusters, but it is often more instructive to look at the backends behind those caches, which take the hit when the caches stop caching.
 * For example, instead of looking at the Memcached eqiad cluster report, look at the general Application servers eqiad report.
 * If you are worried about JS/CSS performance issues: look at the bits servers.
 * To check Varnish hits: look at the app servers.
 * Object cache: ? It depends; could be the app servers or the databases.
 * The error/warning graph shows you the number of fatal errors & PHP warnings over time:
 * https://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&title=MediaWiki+errors&vl=errors+%2F+sec&n=&hreg[]=vanadium.eqiad.wmnet&mreg[]=fatal|exception&gtype=stack&glegend=show&aggregate=1&embed=1 - last hour
 * https://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&title=MediaWiki+errors&vl=errors+%2F+sec&n=&hreg[]=vanadium.eqiad.wmnet&mreg[]=fatal|exception&gtype=stack&glegend=show&aggregate=1&embed=1 - last day

Graphite/Gdash
The Graphite data gets gathered into curated dashboards at http://gdash.wikimedia.org; WMF staff can also look at graphite.wikimedia.org for full coverage.

graphite.wikimedia.org is a tool for exploratory data analysis that provides a hierarchical, tree-like interface to let WMF staff browse all available metrics. wfProfile* data is aggregated under 'MediaWiki' in graphite-web. (We generate wfProfile data by annotating code with wfProfileIn / wfProfileOut calls, which tell MediaWiki to trace code execution through a particular code path. Profiling data for code deployed to Wikimedia's production cluster is then aggregated and plotted automatically in Graphite.)
 * Example: in https://github.com/wikimedia/mediawiki-extensions-CentralAuth/blob/master/CentralAuthUser.php#L167 the class name is, and the method name is  . So it's in Graphite under   ->   ->.

The "total frontend page load time" dashboard covers:
 * "did we accidentally reduce the localstorage and/or native browser cache hit rate"
 * "did we just deploy something that adds slow code to the critical path for loading a page"
 * "Did we just deploy something that causes more stuff to be downloaded on a page view"

Slow parse log
The slow parse log lives on fluorine; see https://wikitech.wikimedia.org/wiki/Logs#fluorine:.2Fa.2Fmw-log.2F (this should also be on https://wikitech.wikimedia.org/wiki/Fluorine). It is reasonably self-explanatory when you look at the log.

Examples:
 * 2014-04-29 21:31:40 mw1175 enwiki: 4.67 Frank_Rijkaard
 * 2014-04-29 21:31:40 mw1087 enwiki: 13.82 Wikipedia
 * 2014-04-29 21:31:40 mw1036 frwiki: 3.65 Grace_Kelly
 * 2014-04-29 21:31:40 mw1092 zhwiki: 3.14 福州长乐国际机场

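Each line appears to give the timestamp, the server, the wiki, the parse time in seconds, and the page title. So here the Grace Kelly page on frwiki took 3.65 seconds to parse.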
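To pull the slowest pages out of such a log, something like the following sketch would do. The line format is inferred only from the four sample lines above (timestamp, host, wiki, seconds, title), so treat the regular expression as an assumption; `parse_line` and `slowest` are hypothetical helper names.

```python
import re
from typing import NamedTuple

# Assumed format: "YYYY-MM-DD HH:MM:SS host wiki: seconds Page_title"
LINE_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<wiki>\S+): (?P<seconds>\d+(?:\.\d+)?) (?P<title>.+)$"
)


class SlowParse(NamedTuple):
    timestamp: str
    host: str
    wiki: str
    seconds: float
    title: str


def parse_line(line):
    """Parse one log line; return None if it does not match the assumed format."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return SlowParse(
        match["timestamp"],
        match["host"],
        match["wiki"],
        float(match["seconds"]),
        match["title"],
    )


def slowest(lines, limit=10):
    """Return up to `limit` entries, slowest parse first."""
    entries = [entry for entry in map(parse_line, lines) if entry is not None]
    return sorted(entries, key=lambda entry: entry.seconds, reverse=True)[:limit]
```

Passing an open log file to `slowest` works as well, since it only needs an iterable of lines.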

Discussed in http://www.gossamer-threads.com/lists/wiki/wikitech/340434 "Identifying pages that are slow to render" and http://www.gossamer-threads.com/lists/wiki/wikitech/335559?do=post_view_threaded ("Re: Lua rollout to en.wikipedia.org and a few others").

It would be nice to have a web front end for this, as with Ishmael. Other logs listed at https://wikitech.wikimedia.org/wiki/Logs would probably be nice to put behind web front ends too.

Checking how your code works with cache layers
We are massively cached! Your code needs to work in that environment! (but also work if everything misses cache.)

There are ways for extensions to disable caching, but don't. You need to understand why caching is breaking things - it's a code smell. But this is a rookie mistake and it would be condescending to emphasize it too strongly.

We have support for this in Vagrant, or should get around to adding it: for more complicated bits of caching, there's a Vagrant role you can flip on to test it. Note to Sumana: ask Ori or other Vagrant experts for more on this.
 * also - when to use the jobqueue

For the future: the beta cluster should be the environment of last resort. It basically will not tell you about performance problems, because it all runs on VMs, and it won't really tell you about cachebusting either. So don't wait until your code hits the beta cluster to figure it out! (Maybe someday we'll have a real performance testing cluster that's not a set of VMs.)