Performance profiling for Wikimedia code

This is meant to be a guide that you can use, from the start of writing your code all the way till after it's in production, to help you profile your code and look out for performance problems. It assumes you are making some kind of change to MediaWiki that is meant to run on Wikimedia sites like Wikipedia. It is a companion to the performance guidelines.

Summary
General advice on checking performance:
 * 1) In your development environment, follow Manual:Profiling for backend (including logging all your queries and inspecting that log) and look at your Firefox and Chrome developer tools for frontend.
 * 2) Once it's in production, watch the general Wikimedia grid graph, the errors/fatals graph, the bits Ganglia graph, the app servers Ganglia graph, and the Graphite graphs for the sections you've profiled, and use https://commons.wikimedia.org/w/index.php?title=File:MediaWikiPerformanceProfiling.pdf&page=21  and https://noc.wikimedia.org/dbtree/ to look for slow queries.
 * Work with your product managers/dev manager/yourself to understand general performance targets before you start architecting your system. For example, a user facing application might have an acceptable latency of 200ms while a database might something like 20 ms or less, especially if further access is decided by the results of previous queries. You don't want to prematurely optimize but understand if your targets are physically possible.


 * Always consider 99% numbers rather than averages. In other words, you don't want half of your users to have a good experience, you want all of them to. So you need to look at the 99th slowest sample to really understand performance.

Starting in your development environment
As you develop your database queries, use EXPLAIN & MYSQL DESCRIBE query to find out which indexes are affected by a specific query, and make sure to run EXPLAIN on all queries - any query that has a WHERE clause - to ensure you're using the indices effectively. Follow Manual:Profiling to check the backend (including logging all your queries (see Bryan Davis's overview of current logging variables) and inspecting that log). Look out for duplicate queries, and queries generated within a for-loop. (For more details, see Roan Kattouw's 2010 talk on security, scalability and performance for extension developers, Roan's MySQL optimization tutorial from 2012 (slides), and Tim Starling's 2013 performance talk.)

You can profile your code with kcachegrind & MW tools:
 * http://kcachegrind.sourceforge.net/html/Home.html
 * PHP: xdebug Xdebug
 * Node.js: https://github.com/gwicke/nodegrind
 * Chrome: https://github.com/gwicke/chrome2calltree

We want to deliver CSS and JavaScript fast (bundled, minified, and avoiding duplication) while retaining the benefits of caching. Thus, use ResourceLoader. See good examples. Look at your Chrome developer tools to check performance.
 * You must consider the cache characteristics of your underlying systems and modify your testing methodology accordingly. For example, if your database has a 4 GB cache, you'll need to make sure that cache is cold as otherwise your data is likely still in the cache from previous queries.


 * Particularly with databases, but in general, performance is heavily dependent on the size of the data you are storing (as well as caching) -- make sure you do your testing with realistic data sizes.


 * Spinning disks are really slow; use cache or solid state whenever you can; However as the data size grows, the advantages of solid state (avoiding seek times) are reduced.

On the beta cluster
The Beta cluster is hosted in Wikimedia Cloud, meaning it runs on virtual machines. Thus, while a good place to detect functional problems, it is not a good environment to check for performance problems. (Bug 65394 discusses creation of a more realistic test environment for performance.)

In production
Once your code is on real Wikimedia sites, you have more graphs to watch and more logs to mine for information.

We load-test as part of our deployment process. New code gradually roll out over the course of a week, each week, starting with smaller wikis, to the more high-traffic wikis. See Deployments/One week for more information. Profiling data (Xenon) from MediaWiki PHP is tagged with the name of the deployment branch, which helps us correlate fluctuations in load with specific code changes.

What graphs should you keep an eye on to see whether your new code has materially affected performance? That depends radically on what kind of code it is. For instance, for front-end things you might want to monitor Navigation Timing or ResourceLoader graphs. For backend changes, you might want to create an XHProf profile of a relevant web request before and after the deployment.
 * Navigation Timing graphs.  reflects how fast the server generated a response and delivered it to the client ("time to first byte").   reflects how long it took for the browser to first draw content on the screen (e.g. stylesheets arrived and sufficient HTML content).
 * ResourceLoader: A significant change in http request rate, or cache hit ratio, is usually sign of a problem. You can also monitor the time it takes to "build" the module, at https://grafana.wikimedia.org/dashboard/db/resourceloadermodule

It's always appropriate to check for round-trip time (via Graphite/Grafana, per instructions below). In the worst case, an action that is sort of expensive but valuable, if it misses hitting the cache, should take at most 5 seconds. Strive for two seconds.


 * example: saving a new edit to a page
 * example: rendering a video thumbnail

Also check right after deploy and a few hours and a few days after deploy for cache problems, increases in the job queue, slow queries, and so on (see below for how).

Ganglia
The Wikimedia grid is the umbrella for everything. From there it breaks down into cluster, then service, then maybe subservice, then location. The quartet of graphs up top for any cluster is always load/memory/CPU/network - the things you most care about! The "metric" choice determines what thumbnail gets displayed (after that quartet) for each machine in the cluster, so it doesn't matter too much. It's worth playing with the duration and switching it from an hour to a day or a week.

Sadly, there are often gaps in the data because of Ganglia outages; we'd like it to be more robust.
 * https://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=cpu_report&s=by+name&c=Bits+caches+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 - bits caches in eqiad
 * https://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=cpu_report&s=by+name&c=Application+servers+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 - the main Apache cluster, 136 boxes
 * You can look at the cache clusters - it is often more instructive to look at the backends behind those caches, which take the hit when the caches stop caching.
 * so, like, instead of looking at the Memcached eqiad cluster report, you look at the general Application servers in eqiad report.
 * if you were worried about JS/CSS perf issues: the bits servers
 * to check Varnish hits, look at the app servers
 * object cache: ? depending, could be app servers or databases
 * The error/warning graph shows you the number of fatal errors & PHP warnings over time:
 * |exception&gtype=stack&glegend=show&aggregate=1&embed=1 https://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&title=MediaWiki+errors&vl=errors+%2F+sec&n=&hreg[ =vanadium.eqiad.wmnet&mreg[ ] =fatal|exception&gtype=stack&glegend=show&aggregate=1&embed=1] last hour
 * |exception&gtype=stack&glegend=show&aggregate=1&embed=1 https://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&title=MediaWiki+errors&vl=errors+%2F+sec&n=&hreg[ =vanadium.eqiad.wmnet&mreg[ ] =fatal|exception&gtype=stack&glegend=show&aggregate=1&embed=1] last day

Graphite
The Graphite data gets gathered into curated dashboards at https://grafana.wikimedia.org

The "frontend Navigation Timing" dashboard covers: https://grafana.wikimedia.org/dashboard/db/webpagetest-drilldown
 * "did we accidentally reduce the localStorage and/or native browser cache hit rate"
 * "did we just deploy something that adds slow code to the critical path for loading a page"
 * "Did we just deploy something that causes more stuff to be downloaded on a page view"

Logstash
https://logstash.wikimedia.org/ is an exploratory data tool to view production log files, e.g., the logs of fatal errors, exceptions, slow parses, and so on. Only Wikimedia Foundation staff can view it, since the logs contain identifying information such as IP addresses.

You can query or filter by "type" (the name of the log) and by strings that appear in the logs, such as unique strings that your code would write in log messages.

Right after you deploy, check out:

Fatal monitor
Use the "load" icon (the folder) in the navigation bar to load the "fatalmonitor" dashboard. Click the time-limit setting in the navbar to limit the query to the last 5 minutes, then click it again to choose an auto-refresh setting of every 10 seconds.

Look out for fatal errors that mention your code.

If you see a severely unequal distribution of fatal errors across the servers, then one box needs to be removed from the pool - contact Ops!

Exception monitor
As with the fatal monitor, choose "exceptionmonitor" from the "Load" dropdown. Watch out for a rise in exceptions and for exceptions that mention your code.

Slow parse log
You can choose the  type (by typing it into the query textfield). Thus you'll see pages that took a long time for the parser to parse. Here are some examples of  log messages:


 * 2014-04-29 21:31:40 mw1175 enwiki: 4.67 Frank_Rijkaard
 * 2014-04-29 21:31:40 mw1087 enwiki: 13.82 Wikipedia
 * 2014-04-29 21:31:40 mw1036 frwiki: 3.65 Grace_Kelly
 * 2014-04-29 21:31:40 mw1092 zhwiki: 3.14 福州长乐国际机场

The Grace Kelly page takes 3.65 seconds to parse.

A few days after your deploy, try filtering by some of the pages that are always slow to parse, like Barack Obama, to see whether they are trending upwards.