Performance profiling for Wikimedia code

This is meant to be a guide that you can use, from the start of writing your code all the way till after it's in production, to help you profile your code and look out for performance problems. It assumes you are making some kind of change to MediaWiki that is meant to run on Wikimedia sites like Wikipedia.

Summary
General advice: in your dev env, follow Manual:Profiling for backend (including logging all your queries and inspecting that log) and look at your Chrome developer tools for frontend :-) and once it's in production, watch the general Wikimedia grid graph, the errors/fatals graph, the bits Ganglia graph, the app servers Ganglia graph, and the Graphite graphs for the sections you've profiled, and use https://commons.wikimedia.org/w/index.php?title=File:MediaWikiPerformanceProfiling.pdf&page=21 and http://noc.wikimedia.org/dbtree/ to look for slow queries.

Starting in your development environment
As you develop your database queries, use EXPLAIN & MYSQL DESCRIBE query to find out which indexes are affected by a specific query, and make sure to run EXPLAIN on all read queries to ensure you're using the indices effectively. Follow Manual:Profiling to check the backend (including logging all your queries and inspecting that log). Look out for duplicate queries, and queries generated within a for-loop.

We want to deliver CSS and JavaScript fast (bundled, minified, and avoiding duplication) while retaining the benefits of caching. Thus, use use ResourceLoader. See good examples. Look at your Chrome developer tools to check performance.

Use one of the (soon to come) Vagrant roles to test how your code works with Wikimedia's caching layers. If caching is breaking things, figure out why.

On the beta cluster
The beta cluster is a set of virtual machines and thus is not a good environment to check for performance problems.

In production
Once your code is on real Wikimedia sites, you have more graphs to watch and more logs to mine for information.

What graphs should you keep an eye on to see whether your new code has materially affected performance? That depends radically on what kind of code it is. For instance, for frontend things, you should watch network graphs; most frontend performance problems correlate in some way with changes in network patterns on the bits cluster (which serves JS/CSS code to users). For other things, you should look at general health graphs (load average, CPU) of the cluster that would be affected (bits Apache (ResourceLoader), API, etc.).

It's always appropriate to check for round-trip time (via Graphite/Gdash, per instructions below). In the worst case, an action that is sort of expensive but valuable, if it misses hitting the cache, should take at most 5 seconds. Strive for two seconds.


 * example: saving a new edit to a page
 * example: rendering a video thumbnail

Also check right after deploy and a few hours and a few days after deploy for cache problems, increases in the job queue, slow queries, and so on (see below for how).

Profiling/graphs to watch

 * check https://performance.wikimedia.org/profiler/report ? Note to Sumana: figure out how to read this effectively.



Ganglia
The Wikimedia grid is the umbrella for everything. From there it breaks down into cluster, then service, then maybe subservice, then location. The quartet of graphs up top for any cluster is always load/memory/CPU/network - the things you most care about! The "metric" choice determines what thumbnail gets displayed (after that quartet) for each machine in the cluster, so it doesn't matter too much. It's worth playing with the duration and switching it from an hour to a day or a week.

Sadly, there are often gaps in the data because of Ganglia outages; we'd like it to be more robust.
 * https://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=cpu_report&s=by+name&c=Bits+caches+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 - bits caches in eqiad
 * https://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=cpu_report&s=by+name&c=Application+servers+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 - the main Apache cluster, 136 boxes
 * You can look at the cache clusters - it is often more instructive to look at the backends behind those caches, which take the hit when the caches stop caching.
 * so, like, instead of looking at the Memcached eqiad cluster report, you look at the general Application servers in eqiad report.
 * if you were worried about JS/CSS perf issues: the bits servers
 * to check Varnish hits, look at the app servers
 * object cache: ? depending, could be app servers or databases
 * The error/warning graph shows you the number of fatal errors & PHP warnings over time:
 * |exception&gtype=stack&glegend=show&aggregate=1&embed=1 https://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&title=MediaWiki+errors&vl=errors+%2F+sec&n=&hreg[ =vanadium.eqiad.wmnet&mreg[ ] =fatal|exception&gtype=stack&glegend=show&aggregate=1&embed=1] last hour
 * |exception&gtype=stack&glegend=show&aggregate=1&embed=1 https://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&title=MediaWiki+errors&vl=errors+%2F+sec&n=&hreg[ =vanadium.eqiad.wmnet&mreg[ ] =fatal|exception&gtype=stack&glegend=show&aggregate=1&embed=1] last day

Graphite/Gdash
The Graphite data gets gathered into curated dashboards at http://gdash.wikimedia.org or WMF staff can look at graphite.wikimedia.org for full coverage.

graphite.wikimedia.org is a tool for exploratory data analysis that provides a hierarchical, tree-like interface to let WMF staff browse all available metrics. wfProfile* data is aggregated under 'MediaWiki' in graphite-web. (We generate wfProfile data by annotating code with wfProfileIn / wfProfileOut calls, which tells MediaWiki to trace code execution through a particular code path. Profiling data for code deployed to Wikimedia's production cluster is then aggregated and plotted automatically in graphite.)
 * Example: in https://github.com/wikimedia/mediawiki-extensions-CentralAuth/blob/master/CentralAuthUser.php#L167 the class name is, and the method name is  . So it's in Graphite under   ->   ->.

The "total frontend page load time" dashboard covers:
 * "did we accidentally reduce the localstorage and/or native browser cache hit rate"
 * "did we just deploy something that adds slow code to the critical path for loading a page"
 * "Did we just deploy something that causes more stuff to be downloaded on a page view"

Slow parse log
Fluorine - slow parse log - https://wikitech.wikimedia.org/wiki/Logs#fluorine:.2Fa.2Fmw-log.2F - should be on https://wikitech.wikimedia.org/wiki/Fluorine. Reasonably self-explanatory when you look at the log.

Examples:
 * 2014-04-29 21:31:40 mw1175 enwiki: 4.67 Frank_Rijkaard
 * 2014-04-29 21:31:40 mw1087 enwiki: 13.82 Wikipedia
 * 2014-04-29 21:31:40 mw1036 frwiki: 3.65 Grace_Kelly
 * 2014-04-29 21:31:40 mw1092 zhwiki: 3.14 福州长乐国际机场

The Grace Kelly page takes 3.65 seconds to parse.

Discussed in http://www.gossamer-threads.com/lists/wiki/wikitech/340434 "Identifying pages that are slow to render" and http://www.gossamer-threads.com/lists/wiki/wikitech/335559?do=post_view_threaded ("Re: Lua rollout to en.wikipedia.org and a few others").

(It would be nice to have web front ends for this, as with Ishmael. Probably others too in https://wikitech.wikimedia.org/wiki/Logs would be nice to put behind web frontends.)