Performance profiling for Wikimedia code

Profiling/graphs to watch

 * General advice: in your dev env, follow https://www.mediawiki.org/wiki/Manual:How_to_debug#Profiling for backend (including logging all your queries and inspecting that log) and look at your Chrome developer tools for frontend :-) and once it's in production, watch the general Wikimedia grid graph, the errors/fatals graph, the bits Ganglia graph, the app servers Ganglia graph, and the Graphite graphs for the sections you've profiled, and use https://commons.wikimedia.org/w/index.php?title=File:MediaWikiPerformanceProfiling.pdf&page=21 and http://noc.wikimedia.org/dbtree/ to look for slow queries.
 * backend: look out for duplicate queries, and queries generated within a for-loop. Run EXPLAIN on all read queries to ensure you're using the indices effectively.
 * check https://performance.wikimedia.org/profiler/report ? Note to Sumana: figure out how to read this effectively.


 * RoanKattouw: let's say that you have written new code and it has just deployed. You are happy! Yay! You maybe even are listening to a pleasant bit of music in celebration. But what graphs should you keep an eye on to see whether your new code has materially affected performance?
 *  That depends radically on what kind of code it is
 *  For frontend things I usually watch network graphs
 *  Because most frontend performance problems correlate in some way with changes in network patterns on the bits cluster (which serves JS/CSS code to users)
 *  For other things, I tend to look at general health graphs (load avg, CPU) of the cluster that would be affected (bits Apache (ResourceLoader), API, etc.)
 *  For other things, I tend to look at general health graphs (load avg, CPU) of the cluster that would be affected (bits Apache (ResourceLoader), API, etc.)

Ganglia
The Wikimedia grid is the umbrella for everything. From there it breaks down into cluster, then service, then maybe subservice, then location. The quartet of graphs up top for any cluster is always load/memory/CPU/network - the things you most care about! The "metric" choice determines what thumbnail gets displayed (after that quartet) for each machine in the cluster, so it doesn't matter too much. It's worth playing with the duration and switching it from an hour to a day or a week.

Sadly, there are often gaps in the data because of Ganglia outages; we'd like it to be more robust.
 * https://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=cpu_report&s=by+name&c=Bits+caches+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 - bits caches in eqiad
 * https://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=cpu_report&s=by+name&c=Application+servers+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 - the main Apache cluster, 136 boxes
 * You can look at the cache clusters - it is often more instructive to look at the backends behind those caches, which take the hit when the caches stop caching.
 * so, like, instead of looking at the Memcached eqiad cluster report, you look at the general Application servers in eqiad report.
 * if you were worried about JS/CSS perf issues: the bits servers
 * to check Varnish hits, look at the app servers
 * object cache: ? depending, could be app servers or databases
 * The error/warning graph shows you the number of fatal errors & PHP warnings over time:
 * |exception&gtype=stack&glegend=show&aggregate=1&embed=1 https://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&title=MediaWiki+errors&vl=errors+%2F+sec&n=&hreg[ =vanadium.eqiad.wmnet&mreg[ ] =fatal|exception&gtype=stack&glegend=show&aggregate=1&embed=1] last hour
 * |exception&gtype=stack&glegend=show&aggregate=1&embed=1 https://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&title=MediaWiki+errors&vl=errors+%2F+sec&n=&hreg[ =vanadium.eqiad.wmnet&mreg[ ] =fatal|exception&gtype=stack&glegend=show&aggregate=1&embed=1] last day

Graphite/Gdash
The Graphite data gets gathered into curated dashboards at http://gdash.wikimedia.org or WMF staff can look at graphite.wikimedia.org for full coverage.

graphite.wikimedia.org is a tool for exploratory data analysis that provides a hierarchical, tree-like interface to let WMF staff browse all available metrics. wfProfile* data is aggregated under 'MediaWiki' in graphite-web. (We generate wfProfile data by annotating code with wfProfileIn / wfProfileOut calls, which tells MediaWiki to trace code execution through a particular code path. Profiling data for code deployed to Wikimedia's production cluster is then aggregated and plotted automatically in graphite.)
 * Example: in https://github.com/wikimedia/mediawiki-extensions-CentralAuth/blob/master/CentralAuthUser.php#L167 the class name is, and the method name is  . So it's in Graphite under   ->   ->.

The "total frontend page load time" dashboard covers:
 * "did we accidentally reduce the localstorage and/or native browser cache hit rate"
 * "did we just deploy something that adds slow code to the critical path for loading a page"
 * "Did we just deploy something that causes more stuff to be downloaded on a page view"

Slow parse log
Fluorine - slow parse log - https://wikitech.wikimedia.org/wiki/Logs#fluorine:.2Fa.2Fmw-log.2F - should be on https://wikitech.wikimedia.org/wiki/Fluorine. Reasonably self-explanatory when you look at the log.

Examples:
 * 2014-04-29 21:31:40 mw1175 enwiki: 4.67 Frank_Rijkaard
 * 2014-04-29 21:31:40 mw1087 enwiki: 13.82 Wikipedia
 * 2014-04-29 21:31:40 mw1036 frwiki: 3.65 Grace_Kelly
 * 2014-04-29 21:31:40 mw1092 zhwiki: 3.14 福州长乐国际机场

The Grace Kelly page takes 3.65 seconds to parse.

Discussed in http://www.gossamer-threads.com/lists/wiki/wikitech/340434 "Identifying pages that are slow to render" and http://www.gossamer-threads.com/lists/wiki/wikitech/335559?do=post_view_threaded ("Re: Lua rollout to en.wikipedia.org and a few others").

Would be nice to have web front ends for this, as with Ishmael. Probably others too in https://wikitech.wikimedia.org/wiki/Logs would be nice to put behind web frontends.

Checking how your code works with cache layers
We are massively cached! Your code needs to work in that environment! (but also work if everything misses cache.)

There are ways for extensions to disable caching, but don't. You need to understand why caching is breaking things - it's a code smell. But this is a rookie mistake and it would be condescending to emphasize it too strongly.

We have stuff in Vagrant or should get to adding it -- so, for more complicated bits of caching, there's a Vagrant role you can flip on to test it. Note to Sumana: ask Ori or other Vagrant experts for more on this.
 * also - when to use the jobqueue

For future: The beta cluster should be env of last resort and basically will not tell you about perf problems, because it is all on VMs. Won't tell you about cachebusting either, really. So don't wait till it hits beta cluster to figure it out! (Maybe someday we'll have a real performance testing cluster that's not a set of VMs.)