Talk:Wikimedia Performance Team/Backend performance

Good & bad examples
Leave 'em here! Sharihareswara (WMF) (talk) 16:56, 5 May 2014 (UTC)
 * hoo has https://github.com/wikimedia/mediawiki-extensions-Wikibase/blob/master/client/resources/wikibase.client.linkitem.init.js as a good example of lazy loading - used to be a script, "a big JS module in Wikibase that would have loaded ~90 KiB of JavaScript in the wikipedias". Hoo is finding it to add here. Sharihareswara (WMF) (talk) 23:01, 7 May 2014 (UTC) ✅
 * https://github.com/wikimedia/mediawiki-extensions-Wikibase/blob/877b838aee/client/resources/Resources.php - hoo says, "note that it has *huge* dependencies on other things not usually loaded in client at all. that's the actual point, introducing dependencies" Sharihareswara (WMF) (talk) 23:04, 7 May 2014 (UTC) ✅
 * https://github.com/wikimedia/mediawiki-extensions-examples/blob/ae0aac5f9/Example/Example.php#L104 has good examples of ResourceLoader modules. Krinkle (talk) 13:06, 9 May 2014 (UTC) ✅
 * Migration for existing non-ResourceLoader-using extensions (bad example): ResourceLoader/Developing with ResourceLoader Krinkle (talk) 13:31, 9 May 2014 (UTC) ✅

Bad example for "We want to deliver CSS and JavaScript fast": Extension:SyntaxHighlight GeSHi before – it used to put the styles directly into the page HTML. Matma Rex (talk) 13:17, 9 May 2014 (UTC) ✅
 * TimedMediaHandler has historically had a lot of problems with aggressively preloading CSS/JS modules, not sure if that's been cleaned up yet. Need to dig for specific examples. --brion (talk) 15:18, 9 May 2014 (UTC)

It's kind of hard to provide examples for "We are massively cached!" that would be understandable, but I guess these each provide some kind of bad example plus a fix for some kind of cache. You could probably search Bugzilla for 'cache' for more :) Matma Rex (talk) 13:17, 9 May 2014 (UTC)
 * These are good -- the notion that HTML output may sit around for a long time and still needs to be supported by the CSS and JS is a basic one to hammer in. Things where old JS/CSS hang around are in some ways more obvious, but stale HTML can be insidious! --brion (talk) 15:17, 9 May 2014 (UTC)
 * Still need to look through these more thoroughly, but I think I used at least one.... Sharihareswara (WMF) (talk) 21:00, 16 May 2014 (UTC)


 * Parsoid has parallel HTTP, though, using curl_multi. Superm401 - Talk 03:51, 10 May 2014 (UTC)

Good example (from an in-progress change) of not poisoning the cache with request-specific data (when cache is not split on that variable): Background: mw.cookie will use MediaWiki's cookie settings, so client-side developers don't think about this. These are passed via the ResourceLoader startup module. Issue: However, it doesn't use Manual:$wgCookieSecure (instead, this is documented not to be supported), since the default value ('detect') varies by the request protocol, and the startup module does not vary by protocol. Thus, the first hit could poison the module with data that will be inapplicable for other requests. Superm401 - Talk 03:51, 10 May 2014 (UTC) ✅

55550 has some fixes for MwEmbedSupport and TimedMediaHandler for ResourceLoader issues. Superm401 - Talk 12:52, 10 May 2014 (UTC) ✅

CentralAuth had increased load - the first versions were not optimised for caching. App server load, and the requests per second, indicate cache misses: https://ganglia.wikimedia.org/latest/?r=year&cs=&ce=&m=ap_rps&s=by+name&c=Application+servers+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4

A patchset for checking Squid cache proxies by network range instead of individually listing them increased CPU on app servers to ~50% and upwards. It needed reworking - ipz - an optimized way of storing this data structure. This kind of problem happens only at scale.
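The idea being described -- matching a request IP against a handful of network ranges instead of a long per-IP proxy list -- can be sketched in Python (the networks and function name below are made up for illustration, not the actual patchset):

```python
import ipaddress

# Hypothetical trusted proxy networks (CIDR ranges), standing in for a
# long list of individual Squid proxy IPs.
TRUSTED_NETWORKS = [
    ipaddress.ip_network("10.64.0.0/16"),
    ipaddress.ip_network("203.0.113.0/24"),
]

def is_trusted_proxy(ip_str):
    # Parsing the networks once up front and testing membership per
    # request is cheap; rebuilding the data structure on every request
    # is the kind of thing that melts app servers at scale.
    ip = ipaddress.ip_address(ip_str)
    return any(ip in net for net in TRUSTED_NETWORKS)
```

For example, `is_trusted_proxy("10.64.3.4")` is true while `is_trusted_proxy("192.0.2.9")` is false; the per-request cost stays proportional to the (small) number of ranges, not the number of proxies.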


 * Thanks for the examples! I'm marking them ✅ when I've integrated them into the Performance guidelines page.  Sharihareswara (WMF) (talk) 13:49, 10 May 2014 (UTC)

Possible example for "Work when cache hits and misses" might be TwnMainPage extension. It offloads stats (site stats and user stats) recalculation to job queue, adding jobs to the queue before the cache expires. In case of cache miss it does not show anything. It also sets a limit of 1 second for calculating message group stats. --Nikerabbit (talk) 09:38, 11 May 2014 (UTC)
 * Thanks! ✅ Sharihareswara (WMF) (talk) 21:00, 16 May 2014 (UTC)

Regarding reflow, someone should confirm this, but I believe part of the reason VE changed to construct tabs on the server was to reduce reflow due to JavaScript UI changes. Superm401 - Talk 09:49, 11 May 2014 (UTC)


 * Yes, it was; thanks, had forgotten! Jdforrester (WMF) (talk) 09:59, 11 May 2014 (UTC)
 * Thanks! ✅ Sharihareswara (WMF) (talk) 21:00, 16 May 2014 (UTC)

Failure to ensure that new or upgraded extensions function properly with other core extensions
It should be required that all upgraded or new extensions that permit the addition of content visible to the public operate with the revision deletion/suppression module, and that any actions related to content creation be included in the logs reported to checkusers - before installation on any non-testing project. This should be a mandatory criterion before installing, even as a test example, on any "real" project; failure to do this has resulted in extremely inappropriate content addition or difficulty for checkusers to identify and block vandals. AFT5 did not have this ability designed in, and required significant re-engineering to fix the problem; after that, a promise was made not to release future extensions on production wikis, even as tests, until the ability to revision delete/suppress and checkuser was demonstrated. Then Flow was released without the ability to checkuser contributions, or to revision delete/suppress. (Incidentally, the reverse is also true - any actions taken to revision delete/suppress any form of content addition needs to show up in the deletion and/or suppression logs.)

I am certain there are other core extensions with which anything new needs to be able to interact appropriately; these are the two I'm most familiar with, so I'm giving this example. Risker (talk) 01:22, 8 May 2014 (UTC)
 * Risker, thank you so much for leaving this detailed comment! I think you are absolutely right that a MediaWiki or MediaWiki extension developer needs to consider revision deletion/suppression compliance and the other criteria and tasks you mentioned. However, Performance guidelines is about *how fast* we deliver content to users, not about security concerns like the one you have mentioned. Therefore I am going to copy and paste your comment onto the talk page of Security for developers/Architecture and have already brought it to the attention of Chris Steipp, the Wikimedia Foundation software security expert. Thank you again! Sharihareswara (WMF) (talk) 14:59, 9 May 2014 (UTC)


 * Yes, this isn't performance, but it is gold. It belongs in extension guidelines (I think there's a page somewhere for it, maybe as part of "getting your extension reviewed"). Flow has massive interaction with these and many many many other features of MediaWiki at WMF, I captured some of them at Flow/Architecture. -- S Page (WMF) (talk) 09:50, 11 May 2014 (UTC)

What to do

 * Work with your product managers/dev manager/yourself to understand general performance targets before you start architecting your system. For example, a user-facing application might have an acceptable latency of 200 ms, while a database might need something like 20 ms or less, especially if further access is decided by the results of previous queries. You don't want to prematurely optimize, but you should understand whether your targets are physically possible.

General Principles

 * Always consider 99th-percentile numbers rather than averages. IOW, you don't want half of your users to have a good experience, you want all of them to. So you need to look at the 99th-percentile (slowest) samples to really understand performance.
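A minimal sketch of why the tail matters more than the mean (Python, with made-up latency numbers in milliseconds):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the value at the ceil(p% * n)-th rank."""
    s = sorted(samples)
    k = max(1, math.ceil(p / 100 * len(s)))  # 1-indexed rank
    return s[k - 1]

# 98 fast requests and 2 very slow ones (illustrative numbers):
latencies = [100] * 98 + [2000, 5000]

avg = sum(latencies) / len(latencies)   # 168 ms -- looks acceptable
p50 = percentile(latencies, 50)         # 100 ms -- the median hides the tail too
p99 = percentile(latencies, 99)         # 2000 ms -- the real bad news
```

The average (168 ms) and the median (100 ms) both look fine, while 2% of users are waiting two seconds or more; only the high percentiles expose that.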

Backend

 * You must consider the cache characteristics of your underlying systems and modify your testing methodology accordingly. For example, if your database has a 4 GB cache, you'll need to make sure that cache is cold before you begin, e.g. by first accessing 4 GB of random data.


 * Particularly with databases, but in general, performance is heavily dependent on the size of the data you are storing (as well as caching) -- make sure you do your testing with realistic data sizes.


 * Spinning disks are really slow; use cache or solid state whenever you can. However, as the data size grows, the advantages of solid state (avoiding seek times) are reduced. -from Toby Negrin, 10 May 2014


 * Toby, thank you! I am moving some of this to Performance guidelines and some to the performance profiling page. Sharihareswara (WMF) (talk) 12:54, 10 May 2014 (UTC)

Latency
On latency: some operations may have surprisingly variable network latency, such as looking up image files when Instant Commons is enabled. There can be some ways to manage this: --brion (talk) 15:23, 9 May 2014 (UTC)
 * first, be aware of which code paths are meant to always be fast (DB, memcache) and which may be slow (fetching File info or spam blacklists that might be cross-wiki and go over the internet)
 * when creating a code path that may be intermittently slow, DOCUMENT THIS FACT
 * be careful not to pile on requests -- for instance an external search engine might be slow to return under poor conditions while it's normally fast. Bottlenecking may cause all web servers to get tied up.
 * Consider breaking operations into smaller pieces which can be separated
 * Alternately, consider running operations in parallel -- this can be tricky though, we don't have good primitives for doing multiple HTTP fetches at once right now
 * Thanks, Brion! (Moved from the examples topic so I can think about it separately.) Sharihareswara (WMF) (talk) 13:51, 10 May 2014 (UTC)
 * Moved into the "how to think about performance" section. Thanks! Sharihareswara (WMF) (talk) 13:00, 16 May 2014 (UTC)
 * User:Superm401, in this revision you removed a paragraph about round-trip time. Can you say more about why you decided to take that out? Thanks! Sharihareswara (WMF) (talk) 03:27, 17 May 2014 (UTC)
 * That was the other Matt (Mwalker (WMF)). It looks like the round-trip text was mostly refactored, but he did remove the part about "300 milliseconds".  I don't know his reasoning, but the old version could have been a little clearer.  In particular, it said, "Our goal is reasonable up to 300 milliseconds round-trip time or so" without making clear what that goal was (my understanding is RTT here roughly means ping time, and does not have anything to do with PHP performance).  Perhaps the goal is a reference to some overall time which includes both network and server components, but it wasn't stated. Superm401 - Talk 22:30, 17 May 2014 (UTC)
 * I was trying to be bold and make that sentence make sense. I figured that regardless of what the server is doing people should be aware of high latency connections (and give examples). I didn't totally realize the intent was to say, 5 seconds - 300ms latency is the performance we're going for -- and I think it confuses the issue anyways because of how variable it is. Mwalker (WMF) (talk) 23:35, 19 May 2014 (UTC)
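Brion's "run operations in parallel" suggestion above can be sketched with Python threads (the URLs are hypothetical and a sleep stands in for network latency; in PHP land, curl_multi plays this role, as noted for Parsoid above):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_fetch(url):
    # Stand-in for a slow remote call (e.g. cross-wiki file info).
    time.sleep(0.1)
    return (url, "ok")

urls = ["https://example.org/a", "https://example.org/b", "https://example.org/c"]

start = time.monotonic()
with ThreadPoolExecutor(max_workers=len(urls)) as pool:
    results = list(pool.map(slow_fetch, urls))
elapsed = time.monotonic() - start
# Three fetches take roughly one round trip of wall-clock time, not three.
```

The caveat from the list still applies: parallelism like this is tricky to get right (error handling, timeouts, resource limits), which is why having good shared primitives for it matters.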

parser cache
About the parser cache: you need to know by what parameters the cache is partitioned. It's not one cache entry per page; it's per page, and it also fragments on user preferences like language (or the value of the uselang query string parameter) and date format. (That's it by default, but other things may be included.)

Use the Edit Preview, which is not cached.

If you're working on a parser tag extension....

General strategy for parser caching: almost all attributes are cached only if they are used in the parse.
 * Developers - if you do something like use the language object in something that gets called upon parse, like a parser hook, the parsercache will notice this and, say, fragment by language.
 * We need better parsercache documentation. Sumana is moving this stuff to the talk page and, at some point in the future, someone (maybe Sumana) will use this + past parsercache bugs to write it. Sharihareswara (WMF) (talk) 17:59, 10 May 2014 (UTC)
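The "fragment only on what the parse actually used" behaviour could be sketched like this (hypothetical function and option names only, not MediaWiki's actual ParserOptions API):

```python
# Sketch: the cache key starts as just the page, and grows an extra
# dimension for each user option the parse actually touched
# (language, date format, ...).
def parser_cache_key(page_id, used_options, user_options):
    parts = [f"page={page_id}"]
    for opt in sorted(used_options):
        parts.append(f"{opt}={user_options.get(opt)}")
    return "|".join(parts)

# A parse that only read the language option fragments only on language;
# two users with different date formats still share this entry.
key = parser_cache_key(42, {"language"},
                       {"language": "de", "dateformat": "dmy"})
```

This is why a parser hook that, say, reads the language object causes the cache to fragment by language: touching the option adds it to the key.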

Critical paths
This document should mention the different critical paths we have in MediaWiki. It's important to think about when (or rather: how often) the code you write is executed. Here are the main cases:
 * Always. This is obviously the most critical.
 * On page views (when actually sending HTML) -- that is, nearly always, unless we reply with 304 Not Modified or some such.
 * When rendering page content. This is executed a *lot* less, a lot more expensive operations are acceptable. Rendering is typically not done while a user is waiting (unless the user just edited the page, see below).
 * When saving an edit. This is the rarest code path, and the one on which the largest delays are acceptable. Users tend to accept a longer wait after performing an action that "feels heavy", like editing a page.
 * Done! Added to "How to think about performance". Thanks! Sharihareswara (WMF) (talk) 13:26, 16 May 2014 (UTC)

Etherpad
https://etherpad.wikimedia.org/p/performance Sharihareswara (WMF) (talk) 08:28, 11 May 2014 (UTC)

Bus factor
I think the main (or only) goal of such a document should be to increase our performance bus factor and remove bottlenecks. Do you think it would help with that as it is currently? --Nemo 08:33, 11 May 2014 (UTC)

99th percentile

 * I would word this to say 50th, 90th and 99th. See explanation below. NRuiz (WMF) (talk) 10:03, 20 May 2014 (UTC)

Why? There are middle ways (e.g. 75th percentile), this looks like a false dichotomy. --Nemo 08:42, 11 May 2014 (UTC)


 * Often performance is relatively flat until the last few % of users – e.g. (fake example) 0.3s at 25%, 0.4s at 50%, 0.5s at 75%, 1s at 80%, 2s at 85%, 5s at 90%, 10s at 95%, 30s at 99% or the like. Just looking at the mean/median/quartiles will give you a false picture of just how bad the system could be for a large number of users. 1% of page views, for example, is 10m a day, every day. 1% is still an awful lot of people. Leaving 1% of users to have a terrible outcome is not good enough. Our scale is such that it's not OK to write off sections of users without a really good reason. Jdforrester (WMF) (talk) 08:48, 11 May 2014 (UTC)


 * It depends on what those users are doing. In your example 99th percentile is useful, but in more specific/edge actions with less data points the 99th percentile can be too skewed and not actually point to any problem. For instance, if I save a heavy page like m:User:Cbrown1023/Logos I expect it to timeout; if I look at the Special:Contributions for a bot with a million edits and filter by ns8, I know it's going to be slow. That doesn't mean that such heavy pages should be forbidden or that the slow feature should be removed, because it's still better than nothing and doesn't affect normal/reasonable usages. --Nemo 09:31, 11 May 2014 (UTC)
 * In any web request sample set you are likely to have two very distinct signals: users coming to your application with a full cache and with an empty cache. If you have mobile data you will see maybe a third signal (mobile data is quite noisy though). While I agree with you that the 99th percentile might be too skewed, the average is meaningless as it muddles both signals. The usual practice in performance is to present data for the 50th and 90th percentiles and, if you want to dig further, the 99th and 1st. In order to calculate those percentiles with some confidence in their statistical significance you need at least 10,000 data points. NRuiz (WMF) (talk) 10:01, 20 May 2014 (UTC)
 * Thanks Nuria, I agree with your proposal, can you edit the text directly? Average is particularly misleading and I've already replaced it with median. --Nemo 10:14, 20 May 2014 (UTC)
 * Right, averages are meaningless when talking about performance. I have edited the "measure" section adding data of how and what to measure NRuiz (WMF) (talk) 10:28, 20 May 2014 (UTC)

Aspirations for the future
The most ambitious are invited to edit/take over/comment Requests for comment/Performance standards for new features to cover what we may not be ready for yet, but desire to reach at some point. --Nemo 09:39, 11 May 2014 (UTC)

Visibility
If we want this document to be more visible for the average developer, or even make it part of their routine, what's the way to do so? Include it as one point of (a revamped) Manual:Pre-commit checklist? --Nemo 09:39, 11 May 2014 (UTC)

TODO for Sumana, 14 May 2014
You can look at the history of this talk page for May 14-16 if you want to see my old TODO list. :) Sharihareswara (WMF) (talk) 22:53, 16 May 2014 (UTC)

on resources and indexing
I have this note from Zurich, probably from Gabriel Wicke, and can't quite make sense of it. Anyone want to help?
 * "think of resources as being the thing you deliver. That is the main rep of the content. Add indexing you need in ephemeral tables. You can directly retrieve/access the main request by getting resource - HTML, JSON, etc - and rearchitect your indexing layer separately...."

I think I'm stuck here because I'm not sure what indexing has to do with the delivery of resources here. Sharihareswara (WMF) (talk) 22:55, 16 May 2014 (UTC)

Credits
I think the talk page or perhaps the subject-space page should have a credits section. The page history isn't really sufficient, I don't think. --MZMcBride (talk) 00:01, 17 May 2014 (UTC)

How to think about performance
 How often will this happen? We have several critical code paths in MediaWiki. It's important to think about how often the site or the browser will have to execute your code. Here are the main cases:


 * Always. This is obviously the most critical.
 * On page views (when actually sending HTML) -- that is, nearly always, unless we reply with 304 Not Modified or some such. Nearly every time an anonymous (not logged in) reader reads a Wikipedia page, she will get canned, pre-rendered HTML sent to her. If you add new code that runs every time anyone views a page, watch out.
 * When rendering page content. We usually only have to render page content (on the server side) after an edit or after a cache miss. So we do it a lot less, so more expensive operations are acceptable. Rendering is typically not done while a user is waiting -- unless the user just edited the page, which leads to...
 * When saving an edit. This is the rarest code path, and the one on which the largest delays are acceptable. Users tend to accept a longer wait after performing an action that "feels heavy", like editing a page.

This part of the guidelines positions itself as a hierarchical way of looking at performance, but I'm not sure this is how people should actually be thinking about performance. I think this section sends the wrong message.

Even though we can perhaps trick users into not being as annoyed about longer delays, the reality is that the fact that editing "feels heavy" is a bug that we should be (and are) working to address. In other words, ideally common good actions such as editing should feel fast and lightweight. Developers and everyone else should be designing and implementing code that moves us in this direction.

We should re-evaluate whether an importance hierarchy is best here. --MZMcBride (talk) 00:02, 17 May 2014 (UTC)


 * I think this part of the text gets clearer once you consider its priority is to answer "how do I not kill the Wikimedia cluster?". Actually, is there anything on this document that's not geared towards that? (Yaron also asked on #mediawiki to separate WMF-specific stuff from the rest.) --Nemo 13:34, 19 May 2014 (UTC)

Indexing not a silver bullet
One thing to note about database indexes is that more isn't always better. Once an index gets big enough that it doesn't fit into RAM anymore, it slows down dramatically. Additionally, an index can make reads faster, but writes slower. -- RobLa-WMF (talk) 01:09, 17 May 2014 (UTC)

Indexes and query plans
I also have an indexing-related question: I've heard that sometimes the query plans might differ between WMF production and your local instance for many reasons: custom WMF indexes, different MySQL/MariaDB versions, and simply the lack of data in tables. I'm wondering whether this is relevant in practice and, if so, whether there are ways to avoid being surprised by those differences. --Nikerabbit (talk) 11:44, 19 May 2014 (UTC)

Failure paths
Perhaps add something on failure paths. Something that is dangerous is a 'tight retry loop': for instance, if possible, after a failure you should reschedule and/or cache the error for a short time before trying again. Failure paths in Wikipedia are important, if 300 servers start playing ping-pong.... :) Another dangerous fail scenario is an incorrectly cached error. TheDJ (talk) 16:33, 19 May 2014 (UTC)
 * Both in success (with job queues) and in failure (with retry), you could think about the concepts of throttle and debounce respectively. Both are a form of rate limiting and they are applied quite often in various ways throughout the code. TheDJ (talk) 16:38, 19 May 2014 (UTC)
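One way to avoid a tight retry loop is exponential backoff plus a short-lived negative cache, so concurrent callers don't all hammer a failing backend at once. A sketch (all names hypothetical; in MediaWiki the error cache would live in something like memcached rather than a dict, and the TTL must be short enough that a transient error isn't "incorrectly cached" for long):

```python
import random
import time

error_cache = {}  # key -> expiry timestamp (stand-in for a shared cache)

def fetch_with_backoff(key, fetch, retries=3, base_delay=0.05, error_ttl=1.0):
    # If this key failed very recently, fail fast instead of retrying.
    if error_cache.get(key, 0) > time.monotonic():
        raise RuntimeError("recent failure cached; not retrying yet")
    for attempt in range(retries):
        try:
            return fetch(key)
        except Exception:
            if attempt == retries - 1:
                # Cache the error briefly before giving up, so other
                # callers back off too.
                error_cache[key] = time.monotonic() + error_ttl
                raise
            # Exponential backoff with jitter instead of an immediate retry.
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
```

Throttling (rate-limiting successful work, e.g. via the job queue) and debouncing (collapsing bursts of retries) are the two rate-limiting shapes TheDJ mentions; the error cache above is essentially a debounce on failure.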

Cookies
There is a brief mention of why cookies are detrimental to performance. I think it's worth going into that a little more. A couple of well-known studies about cookie impact:

http://www.yuiblog.com/blog/2007/03/01/performance-research-part-3/

http://www.stevesouders.com/blog/2009/08/21/the-skinny-on-cookies/

I know 2007 sounds like "are you from the past?" but I think that mobile, in terms of speed and browsers, is years behind desktop, thus the concerns with cookies increasing payload are very real ones today in mobile traffic. There are also architectural concerns but hey, this doc is about performance. NRuiz (WMF) (talk) 09:56, 20 May 2014 (UTC)

Other link: https://developers.google.com/speed/docs/best-practices/request

Summing it up, there are two main concerns when it comes to cookie usage:
 * Cookies bloat the payload; there is more data sent back and forth unnecessarily. More so on mobile, as this can end up being many unneeded kilobytes sent with every single request.


 * Cookies might affect cache hit ratios: if you are caching full requests, adding a cookie changes the request, and thus the user's request is no longer served from cache.
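A back-of-the-envelope illustration of the first point, with made-up but plausible numbers -- the key is that cookies ride along on *every* request to the domain, including static-asset requests that don't need them:

```python
# Illustrative arithmetic only; all numbers are hypothetical.
cookie_bytes = 500          # size of the Cookie header on each request
requests_per_page = 30      # images, CSS, JS fetched per page view
page_views = 1_000_000      # daily page views for a hypothetical site

daily_overhead_mb = cookie_bytes * requests_per_page * page_views / 1_000_000
# 15,000 MB (~15 GB) of upstream cookie traffic per day, and upstream
# bandwidth is exactly where slow mobile connections hurt the most.
```

Serving static assets from a cookie-less domain, as the linked articles recommend, removes most of that overhead.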

"Preserving positioning"
Should this section, about making sure things don't move around while the page loads, be here at all? It seems very much like a UI guideline, not a performance guideline. Yaron Koren (talk) 15:15, 20 May 2014 (UTC)
 * It's about reducing the time needed before a page is or is perceived "ready", an aspect of frontend performance. Do you think frontend performance is out of scope here? Backend performance seems to be the main focus of this page (for a reason or another), but there is some confusion about this, see several sections above on the topic. --Nemo 16:10, 20 May 2014 (UTC)