Performance profiling for Wikimedia code

This is meant to be a guide that you can use, from the start of writing your code all the way till after it's in production, to help you profile your code and look out for performance problems. It assumes you are making some kind of change to MediaWiki that is meant to run on Wikimedia sites like Wikipedia.

Summary
General advice on checking performance:
 * 1) In your development environment, follow Manual:Profiling for backend (including logging all your queries and inspecting that log) and look at your Firefox and Chrome developer tools for frontend.
 * 2) Once it's in production, watch the general Wikimedia grid graph, the errors/fatals graph, the bits Ganglia graph, the app servers Ganglia graph, and the Graphite graphs for the sections you've profiled, and use https://commons.wikimedia.org/w/index.php?title=File:MediaWikiPerformanceProfiling.pdf&page=21  and http://noc.wikimedia.org/dbtree/ to look for slow queries.


 * Work with your product managers/dev manager/yourself to understand general performance targets before you start architecting your system. For example, a user facing application might have an acceptable latency of 200ms while a database might something like 20 ms or less, especially if further access is decided by the results of previous queries. You don't want to prematurely optimize but understand if your targets are physically possible.


 * Always consider 99% numbers rather than averages. IOW, you don't want half of your users to have a good experience, you want all of them to. So you need to look at the 99th slowest sample to really understand performance.

Starting in your development environment
As you develop your database queries, use EXPLAIN & MYSQL DESCRIBE query to find out which indexes are affected by a specific query, and make sure to run EXPLAIN on all queries - any query that has a WHERE clause - to ensure you're using the indices effectively. Follow Manual:Profiling to check the backend (including logging all your queries and inspecting that log). Look out for duplicate queries, and queries generated within a for-loop.

We want to deliver CSS and JavaScript fast (bundled, minified, and avoiding duplication) while retaining the benefits of caching. Thus, use use ResourceLoader. See good examples. Look at your Chrome developer tools to check performance.

Use one of the (soon to come) Vagrant roles to test how your code works with Wikimedia's caching layers. If caching is breaking things, figure out why.


 * You must consider the cache characteristics of your underlying systems and modify your testing methodology accordingly. For example, if your database has a 4 GB cache, you'll need to make sure that cache is cold before you begin by accessing 4 GB of random data before you begin.


 * Particularly with databases, but in general performance is heavily dependent on the size of the data you are storing (as well as caching) -- make sure you do your testing with realistic data sizes.


 * Spinning disks are really slow; use cache or solid state whenever you can; However as the datasize grows, the advantages of solid state (avoiding seek times) are reduced.

On the beta cluster
The beta cluster is a set of virtual machines and thus is not a good environment to check for performance problems.

In production
Once your code is on real Wikimedia sites, you have more graphs to watch and more logs to mine for information.

What graphs should you keep an eye on to see whether your new code has materially affected performance? That depends radically on what kind of code it is. For instance, for frontend things, you should watch network graphs; most frontend performance problems correlate in some way with changes in network patterns on the bits cluster (which serves JS/CSS code to users). For other things, you should look at general health graphs (load average, CPU) of the cluster that would be affected (bits Apache (ResourceLoader), API, etc.).

It's always appropriate to check for round-trip time (via Graphite/Gdash, per instructions below). In the worst case, an action that is sort of expensive but valuable, if it misses hitting the cache, should take at most 5 seconds. Strive for two seconds.


 * example: saving a new edit to a page
 * example: rendering a video thumbnail

Also check right after deploy and a few hours and a few days after deploy for cache problems, increases in the job queue, slow queries, and so on (see below for how).

Ganglia
The Wikimedia grid is the umbrella for everything. From there it breaks down into cluster, then service, then maybe subservice, then location. The quartet of graphs up top for any cluster is always load/memory/CPU/network - the things you most care about! The "metric" choice determines what thumbnail gets displayed (after that quartet) for each machine in the cluster, so it doesn't matter too much. It's worth playing with the duration and switching it from an hour to a day or a week.

Sadly, there are often gaps in the data because of Ganglia outages; we'd like it to be more robust.
 * https://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=cpu_report&s=by+name&c=Bits+caches+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 - bits caches in eqiad
 * https://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=cpu_report&s=by+name&c=Application+servers+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 - the main Apache cluster, 136 boxes
 * You can look at the cache clusters - it is often more instructive to look at the backends behind those caches, which take the hit when the caches stop caching.
 * so, like, instead of looking at the Memcached eqiad cluster report, you look at the general Application servers in eqiad report.
 * if you were worried about JS/CSS perf issues: the bits servers
 * to check Varnish hits, look at the app servers
 * object cache: ? depending, could be app servers or databases
 * The error/warning graph shows you the number of fatal errors & PHP warnings over time:
 * |exception&gtype=stack&glegend=show&aggregate=1&embed=1 https://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&title=MediaWiki+errors&vl=errors+%2F+sec&n=&hreg[ =vanadium.eqiad.wmnet&mreg[ ] =fatal|exception&gtype=stack&glegend=show&aggregate=1&embed=1] last hour
 * |exception&gtype=stack&glegend=show&aggregate=1&embed=1 https://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&title=MediaWiki+errors&vl=errors+%2F+sec&n=&hreg[ =vanadium.eqiad.wmnet&mreg[ ] =fatal|exception&gtype=stack&glegend=show&aggregate=1&embed=1] last day

Graphite/Gdash
The Graphite data gets gathered into curated dashboards at http://gdash.wikimedia.org or WMF staff can look at graphite.wikimedia.org for full coverage.

graphite.wikimedia.org is a tool for exploratory data analysis that provides a hierarchical, tree-like interface to let WMF staff browse all available metrics. wfProfile* data is aggregated under 'MediaWiki' in graphite-web. (We generate wfProfile data by annotating code with wfProfileIn / wfProfileOut calls, which tells MediaWiki to trace code execution through a particular code path. Profiling data for code deployed to Wikimedia's production cluster is then aggregated and plotted automatically in graphite.)
 * Example: in https://github.com/wikimedia/mediawiki-extensions-CentralAuth/blob/master/CentralAuthUser.php#L167 the class name is, and the method name is  . So it's in Graphite under   ->   ->.

The "total frontend page load time" dashboard covers:
 * "did we accidentally reduce the localstorage and/or native browser cache hit rate"
 * "did we just deploy something that adds slow code to the critical path for loading a page"
 * "Did we just deploy something that causes more stuff to be downloaded on a page view"

We also display some of this same data in https://performance.wikimedia.org/profiler/report ? Note to Sumana: figure out how to read this effectively.

Logstash
http://logstash.wikimedia.org/ is an exploratory data tool to view production log files, e.g., the logs of fatal errors, exceptions, slow parses, and so on. Only Wikimedia Foundation staff can view it, since the logs contain identifying information such as IP addresses.

You can query or filter by "type" (the name of the log) and by strings that appear in the logs, such as unique strings that your code would write in log messages.

Right after you deploy, check out:

Fatal monitor
Use the "load" icon (the folder) in the navigation bar to load the "fatalmonitor" dashboard. Click the time-limit setting in the navbar to limit the query to the last 5 minutes, then click it again to choose an auto-refresh setting of every 10 seconds.

Look out for fatal errors that mention your code.

If you see a severely unequal distribution of fatal errors across the servers, then one box needs to be removed from the pool - contact Ops!

Exception monitor
As with the fatal monitor, choose "exceptionmonitor" from the "Load" dropdown. Watch out for a rise in exceptions and for exceptions that mention your code.

Slow parse log
You can choose the  type (by typing it into the query textfield). Thus you'll see pages that took a long time for the parser to parse. Here are some examples of  log messages:


 * 2014-04-29 21:31:40 mw1175 enwiki: 4.67 Frank_Rijkaard
 * 2014-04-29 21:31:40 mw1087 enwiki: 13.82 Wikipedia
 * 2014-04-29 21:31:40 mw1036 frwiki: 3.65 Grace_Kelly
 * 2014-04-29 21:31:40 mw1092 zhwiki: 3.14 福州长乐国际机场

The Grace Kelly page takes 3.65 seconds to parse.

A few days after your deploy, try filtering by some of the pages that are always slow to parse, like Barack Obama, to see whether they are trending upwards.