Talk:Wikimedia Performance Team/Backend performance

Good & bad examples
Leave 'em here! Sharihareswara (WMF) (talk) 16:56, 5 May 2014 (UTC)
 * hoo has https://github.com/wikimedia/mediawiki-extensions-Wikibase/blob/master/client/resources/wikibase.client.linkitem.init.js as a good example of lazy loading - used to be a script, "a big JS module in Wikibase that would ahve loaded ~90kib of javascript in the wikipedias". Hoo is finding it to add here. Sharihareswara (WMF) (talk) 23:01, 7 May 2014 (UTC) ✅
 * https://github.com/wikimedia/mediawiki-extensions-Wikibase/blob/877b838aee/client/resources/Resources.php - hoo says, "note that it has *huge* dependencies on other things not usually loaded in client at all. that's the actual point, introducing dependencies" Sharihareswara (WMF) (talk) 23:04, 7 May 2014 (UTC) ✅
 * https://github.com/wikimedia/mediawiki-extensions-examples/blob/ae0aac5f9/Example/Example.php#L104 has good examples of ResourceLoader modules. Krinkle (talk) 13:06, 9 May 2014 (UTC) ✅
 * Migration for existing non-ResourceLoader-using extensions (bad example): ResourceLoader/Developing with ResourceLoader Krinkle (talk) 13:31, 9 May 2014 (UTC) ✅

Bad example for "We want to deliver CSS and JavaScript fast": Extension:SyntaxHighlight GeSHi before – it used to put the styles in the section of page HTML. Matma Rex (talk) 13:17, 9 May 2014 (UTC) ✅
 * TimedMediaHandler has historically had a lot of problems with aggressively preloading CSS/JS modules, not sure if that's been cleaned up yet. Need to dig for specific examples. --brion (talk) 15:18, 9 May 2014 (UTC)

It's kind of hard to provide examples for "We are massively cached!" that would be understandable, but I guess    each provide some kind of a bad example plus a fix for some kind of a cache. You could probably search Bugzilla for 'cache' for more :) Matma Rex (talk) 13:17, 9 May 2014 (UTC)
 * These are good -- the notion that HTML output may sit around for a long time and still needs to be supported by the CSS and JS is a basic one to hammer in. Things where old JS/CSS hang around are in some ways more obvious, but stale HTML can be insidious! --brion (talk) 15:17, 9 May 2014 (UTC)


 * Parsoid has parallel HTTP, though, using curl_multi. Superm401 - Talk 03:51, 10 May 2014 (UTC)

Good example (from an in-progress change) of not poisoning the cache with request-specific data (when cache is not split on that variable): Background: mw.cookie will use MediaWiki's cookie settings, so client-side developers don't think about this. These are passed via the ResourceLoader startup module. Issue: However, it doesn't use Manual:$wgCookieSecure (instead, this is documented not to be supported), since the default value ('detect') varies by the request protocol, and the startup module does not vary by protocol. Thus, the first hit could poison the module with data that will be inapplicable for other requests. Superm401 - Talk 03:51, 10 May 2014 (UTC) ✅

55550 has some fixes for MwEmbedSupport and TimedMediaHandler for ResourceLoader issues. Superm401 - Talk 12:52, 10 May 2014 (UTC) ✅

CentralAuth had increased - first versions were not optimised for caching. App server load, and the requests per second -- indicates misses https://ganglia.wikimedia.org/latest/?r=year&cs=&ce=&m=ap_rps&s=by+name&c=Application+servers+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4

patchset for checking squid cache proxies by network instead of indiv listing them - increased CPU on app servers to ~50% upwards. Needed reworking - ipz - optimized way of storing this data structure. This happens only at scale.


 * Thanks for the examples! I'm marking them ✅ when I've integrated them into the Performance guidelines page.  Sharihareswara (WMF) (talk) 13:49, 10 May 2014 (UTC)

Failure to ensure that new or upgraded extensions function properly with other core extensions
It should be required that all upgraded or new extensions that permit the addition of content visible to the public operate with the revision deletion/suppression module, and that any actions related to content creation be included in the logs reported to checkusers - before installation on any non-testing project. This should be a mandatory criterion before installing, even as a test example, on any "real" project; failure to do this has resulted in extremely inappropriate content addition or difficulty for checkusers to identify and block vandals. AFT5 did not have this ability designed in, and required significant re-engineering to fix the problem; after that, a promise was made not to release future extensions on production wikis, even as tests, until the ability to revision delete/suppress and checkuser was demonstrated. Then Flow was released without the ability to checkuser contributions, or to revision delete/suppress. (Incidentally, the reverse is also true - any actions taken to revision delete/suppress any form of content addition needs to show up in the deletion and/or suppression logs.)

I am certain there are other core extensions with which anything new needs to be able to interact appropriately; these are the two I'm most familiar with, so I'm giving this example. Risker (talk) 01:22, 8 May 2014 (UTC)
 * Risker, thank you so much for leaving this detailed comment! I think you are absolutely right that MediaWiki or MediaWiki extension developer needs to consider revision deletion/suppression compliance and the other criteria and tasks you mentioned. However, Performance guidelines is about *how fast* we deliver content to users, not about security concerns like the one you have mentioned. Therefore I am going to copy and paste your comment onto the talk page of Security for developers/Architecture and have already brought it to the attention of Chris Steipp, the Wikimedia Foundation software security expert. Thank you again! Sharihareswara (WMF) (talk) 14:59, 9 May 2014 (UTC)

What to do

 * Work with your product managers/dev manager/yourself to understand general performance targets before you start architecting your system. For example, a user facing application might have an acceptable latency of 200ms while a database might something like 20 ms or less, especially if further access is decided by the results of previous queries. You don't want to prematurely optimize but understand if your targets are physically possible.

General Principles

 * Always consider 99% numbers rather than averages. IOW, you don't want half of your users to have a good experience, you want all of them to. So you need to look at the 99th slowest sample to really understand performance.

Backend

 * You must consider the cache characteristics of your underlying systems and modify your testing methodology accordingly. For example, if your database has a 4 GB cache, you'll need to make sure that cache is cold before you begin by accessing 4 GB of random data before you begin.


 * Particularly with databases, but in general performance is heavily dependent on the size of the data you are storing (as well as caching) -- make sure you do your testing with realistic data sizes.


 * Spinning disks are really slow; use cache or solid state whenever you can; However as the datasize grows, the advantages of solid state (avoiding seek times) are reduced. -from Toby Negrin, 10 May 2014


 * Toby, thank you! I am moving some of this to Performance guidelines and some to the performance profiling page. Sharihareswara (WMF) (talk) 12:54, 10 May 2014 (UTC)

Latency
On latency: some operations may have surprisingly variable network latency, such as looking up image files when Instant Commons is enabled. There can be some ways to manage this: --brion (talk) 15:23, 9 May 2014 (UTC)
 * first, be aware of which code paths are meant to always be fast (DB, memcache) and which may be slow (fetching File info or spam blacklists that might be cross-wiki and go over the internet)
 * when creating a code path that may be intermittently slow, DOCUMENT THIS FACT
 * be careful not to pile on requests -- for instance an external search engine might be slow to return under poor conditions while it's normally fast. Bottlenecking may cause all web servers to get caught up.
 * Consider breaking operations into smaller pieces which can be separated
 * Alternately, consider running operations in parallel -- this can be tricky though, we don't have good primitives for doing multiple HTTP fetches at once right now
 * Thanks, Brion! (Moved from the examples topic so I can think about it separately.) Sharihareswara (WMF) (talk) 13:51, 10 May 2014 (UTC)

parser cache
About the parser cache: you need to know - by what parameters the cache is particioned. It's not 1 cache entry per page. It's - per page and per user language. (That's it by default but other things may be included.)

Use the Edit Preview, which is not cached.

If you're working on a parser tag extension....

General strategy for parser caching: almost attributes are cached only if they are used in the parse.
 * Developers - if you do something like use the language object in something that gets called upon parse, like a parser hook, the parsercache will notice this and, say, fragment by language.
 * We need better parsercache documentation. Sumana is moving this stuff to the talk page and, at some point in the future, someone (maybe Sumana) will use this + past parsercache bugs to write it. Sharihareswara (WMF) (talk) 17:59, 10 May 2014 (UTC)

Critical Pathes
This document should mention the different critical pathes we have in mediawiki. It's important to think about when (or rather: how often) the code you write is executed. Here are the main cases
 * Always. This is obviously the most critical.
 * On page views (when actually sending HTML) that is, nearly always, unless we reply with 304 Not Modified or some such.
 * When rendering page content. This is executed a *lot* less, a lot more expensive operations are acceptable. Rendering is typically not done while a user is waiting (unless the user just edited the page, see below).
 * When saving an edit. This is the rarest code path, and the one on which the largest delays are acceptable. Users tend to accept a longer wait after performing an action that "feels heavy", like editing a page.

Etherpad
https://etherpad.wikimedia.org/p/performance Sharihareswara (WMF) (talk) 08:28, 11 May 2014 (UTC)

Bus factor
I think the main (or only) goal of such a document should be to increase our performance bus factor and remove bottlenecks. Do you think it would help with that as it is currently? --Nemo 08:33, 11 May 2014 (UTC)