Talk:Wikimedia Performance Team/Backend performance

Good & bad examples
Leave 'em here! Sharihareswara (WMF) (talk) 16:56, 5 May 2014 (UTC)
 * hoo has https://github.com/wikimedia/mediawiki-extensions-Wikibase/blob/master/client/resources/wikibase.client.linkitem.init.js as a good example of lazy loading - used to be a script, "a big JS module in Wikibase that would ahve loaded ~90kib of javascript in the wikipedias". Hoo is finding it to add here. Sharihareswara (WMF) (talk) 23:01, 7 May 2014 (UTC) ✅
 * https://github.com/wikimedia/mediawiki-extensions-Wikibase/blob/877b838aee/client/resources/Resources.php - hoo says, "note that it has *huge* dependencies on other things not usually loaded in client at all. that's the actual point, introducing dependencies" Sharihareswara (WMF) (talk) 23:04, 7 May 2014 (UTC) ✅
 * https://github.com/wikimedia/mediawiki-extensions-examples/blob/ae0aac5f9/Example/Example.php#L104 has good examples of ResourceLoader modules. Krinkle (talk) 13:06, 9 May 2014 (UTC) ✅
 * Migration for existing non-ResourceLoader-using extensions (bad example): ResourceLoader/Developing with ResourceLoader Krinkle (talk) 13:31, 9 May 2014 (UTC) ✅

Bad example for "We want to deliver CSS and JavaScript fast": Extension:SyntaxHighlight GeSHi before – it used to put the styles in the section of page HTML. Matma Rex (talk) 13:17, 9 May 2014 (UTC) ✅
 * TimedMediaHandler has historically had a lot of problems with aggressively preloading CSS/JS modules, not sure if that's been cleaned up yet. Need to dig for specific examples. --brion (talk) 15:18, 9 May 2014 (UTC)

It's kind of hard to provide examples for "We are massively cached!" that would be understandable, but I guess    each provide some kind of a bad example plus a fix for some kind of a cache. You could probably search Bugzilla for 'cache' for more :) Matma Rex (talk) 13:17, 9 May 2014 (UTC)
 * These are good -- the notion that HTML output may sit around for a long time and still needs to be supported by the CSS and JS is a basic one to hammer in. Things where old JS/CSS hang around are in some ways more obvious, but stale HTML can be insidious! --brion (talk) 15:17, 9 May 2014 (UTC)


 * Parsoid has parallel HTTP, though, using curl_multi. Superm401 - Talk 03:51, 10 May 2014 (UTC)

Good example (from an in-progress change) of not poisoning the cache with request-specific data (when cache is not split on that variable): Background: mw.cookie will use MediaWiki's cookie settings, so client-side developers don't think about this. These are passed via the ResourceLoader startup module. Issue: However, it doesn't use Manual:$wgCookieSecure (instead, this is documented not to be supported), since the default value ('detect') varies by the request protocol, and the startup module does not vary by protocol. Thus, the first hit could poison the module with data that will be inapplicable for other requests. Superm401 - Talk 03:51, 10 May 2014 (UTC) ✅

55550 has some fixes for MwEmbedSupport and TimedMediaHandler for ResourceLoader issues. Superm401 - Talk 12:52, 10 May 2014 (UTC) ✅

CentralAuth had increased - first versions were not optimised for caching. App server load, and the requests per second -- indicates misses https://ganglia.wikimedia.org/latest/?r=year&cs=&ce=&m=ap_rps&s=by+name&c=Application+servers+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4

patchset for checking squid cache proxies by network instead of indiv listing them - increased CPU on app servers to ~50% upwards. Needed reworking - ipz - optimized way of storing this data structure. This happens only at scale.


 * Thanks for the examples! I'm marking them ✅ when I've integrated them into the Performance guidelines page.  Sharihareswara (WMF) (talk) 13:49, 10 May 2014 (UTC)

Possible example for "Work when cache hits and misses" might be TwnMainPage extension. It offloads stats (site stats and user stats) recalculation to job queue, adding jobs to the queue before the cache expires. In case of cache miss it does not show anything. It also sets a limit of 1 second for calculating message group stats. --Nikerabbit (talk) 09:38, 11 May 2014 (UTC)

Regarding reflow, someone should confirm this, but I believe part of the reason VE changed to construct tabs on the server was to reduce reflow due to JavaScript UI changes. Superm401 - Talk 09:49, 11 May 2014 (UTC)


 * Yes, it was; thanks, had forgotten! Jdforrester (WMF) (talk) 09:59, 11 May 2014 (UTC)

Failure to ensure that new or upgraded extensions function properly with other core extensions
It should be required that all upgraded or new extensions that permit the addition of content visible to the public operate with the revision deletion/suppression module, and that any actions related to content creation be included in the logs reported to checkusers - before installation on any non-testing project. This should be a mandatory criterion before installing, even as a test example, on any "real" project; failure to do this has resulted in extremely inappropriate content addition or difficulty for checkusers to identify and block vandals. AFT5 did not have this ability designed in, and required significant re-engineering to fix the problem; after that, a promise was made not to release future extensions on production wikis, even as tests, until the ability to revision delete/suppress and checkuser was demonstrated. Then Flow was released without the ability to checkuser contributions, or to revision delete/suppress. (Incidentally, the reverse is also true - any actions taken to revision delete/suppress any form of content addition needs to show up in the deletion and/or suppression logs.)

I am certain there are other core extensions with which anything new needs to be able to interact appropriately; these are the two I'm most familiar with, so I'm giving this example. Risker (talk) 01:22, 8 May 2014 (UTC)
 * Risker, thank you so much for leaving this detailed comment! I think you are absolutely right that MediaWiki or MediaWiki extension developer needs to consider revision deletion/suppression compliance and the other criteria and tasks you mentioned. However, Performance guidelines is about *how fast* we deliver content to users, not about security concerns like the one you have mentioned. Therefore I am going to copy and paste your comment onto the talk page of Security for developers/Architecture and have already brought it to the attention of Chris Steipp, the Wikimedia Foundation software security expert. Thank you again! Sharihareswara (WMF) (talk) 14:59, 9 May 2014 (UTC)


 * Yes, this isn't performance, but it is gold. It belongs in extension guidelines (I think there's a page somewhere for it, maybe as part of "getting your extension reviewed"). Flow has massive interaction with these and many many many other features of MediaWiki at WMF, I captured some of them at Flow/Architecture. -- S Page (WMF) (talk) 09:50, 11 May 2014 (UTC)

What to do

 * Work with your product managers/dev manager/yourself to understand general performance targets before you start architecting your system. For example, a user facing application might have an acceptable latency of 200ms while a database might something like 20 ms or less, especially if further access is decided by the results of previous queries. You don't want to prematurely optimize but understand if your targets are physically possible.

General Principles

 * Always consider 99% numbers rather than averages. IOW, you don't want half of your users to have a good experience, you want all of them to. So you need to look at the 99th slowest sample to really understand performance.

Backend

 * You must consider the cache characteristics of your underlying systems and modify your testing methodology accordingly. For example, if your database has a 4 GB cache, you'll need to make sure that cache is cold before you begin by accessing 4 GB of random data before you begin.


 * Particularly with databases, but in general performance is heavily dependent on the size of the data you are storing (as well as caching) -- make sure you do your testing with realistic data sizes.


 * Spinning disks are really slow; use cache or solid state whenever you can; However as the datasize grows, the advantages of solid state (avoiding seek times) are reduced. -from Toby Negrin, 10 May 2014


 * Toby, thank you! I am moving some of this to Performance guidelines and some to the performance profiling page. Sharihareswara (WMF) (talk) 12:54, 10 May 2014 (UTC)

Latency
On latency: some operations may have surprisingly variable network latency, such as looking up image files when Instant Commons is enabled. There can be some ways to manage this: --brion (talk) 15:23, 9 May 2014 (UTC)
 * first, be aware of which code paths are meant to always be fast (DB, memcache) and which may be slow (fetching File info or spam blacklists that might be cross-wiki and go over the internet)
 * when creating a code path that may be intermittently slow, DOCUMENT THIS FACT
 * be careful not to pile on requests -- for instance an external search engine might be slow to return under poor conditions while it's normally fast. Bottlenecking may cause all web servers to get caught up.
 * Consider breaking operations into smaller pieces which can be separated
 * Alternately, consider running operations in parallel -- this can be tricky though, we don't have good primitives for doing multiple HTTP fetches at once right now
 * Thanks, Brion! (Moved from the examples topic so I can think about it separately.) Sharihareswara (WMF) (talk) 13:51, 10 May 2014 (UTC)

parser cache
About the parser cache: you need to know - by what parameters the cache is particioned. It's not 1 cache entry per page. It's - per page and also fragments on user prefs like language (or value of uselang query string param) and date format. (That's it by default but other things may be included.)

Use the Edit Preview, which is not cached.

If you're working on a parser tag extension....

General strategy for parser caching: almost attributes are cached only if they are used in the parse.
 * Developers - if you do something like use the language object in something that gets called upon parse, like a parser hook, the parsercache will notice this and, say, fragment by language.
 * We need better parsercache documentation. Sumana is moving this stuff to the talk page and, at some point in the future, someone (maybe Sumana) will use this + past parsercache bugs to write it. Sharihareswara (WMF) (talk) 17:59, 10 May 2014 (UTC)

Critical paths
This document should mention the different critical paths we have in mediawiki. It's important to think about when (or rather: how often) the code you write is executed. Here are the main cases
 * Always. This is obviously the most critical.
 * On page views (when actually sending HTML) that is, nearly always, unless we reply with 304 Not Modified or some such.
 * When rendering page content. This is executed a *lot* less, a lot more expensive operations are acceptable. Rendering is typically not done while a user is waiting (unless the user just edited the page, see below).
 * When saving an edit. This is the rarest code path, and the one on which the largest delays are acceptable. Users tend to accept a longer wait after performing an action that "feels heavy", like editing a page.

Etherpad
https://etherpad.wikimedia.org/p/performance Sharihareswara (WMF) (talk) 08:28, 11 May 2014 (UTC)

Bus factor
I think the main (or only) goal of such a document should be to increase our performance bus factor and remove bottlenecks. Do you think it would help with that as it is currently? --Nemo 08:33, 11 May 2014 (UTC)

99th percentile
Why? There are middle ways (e.g. 75th percentile), this looks like a false dichotomy. --Nemo 08:42, 11 May 2014 (UTC)


 * Often performance is relatively flat until the last few % of users – e.g. (fake example) 0.3s at 25%, 0.4s at 50%, 0.5s at 75%, 1s at 80%, 2s at 85%, 5s at 90%, 10s at 95%, 30s at 99% or the like. Just looking at the mean/median/quartiles will give you a false picture of just how bad the system could be for a large number of users. 1% of page views, for example, is 10m a day, every day. 1% is still an awful lot of people. Leaving 1% of users to have a terrible outcome is not good enough. Our scale is such that it's not OK to write off sections of users without a really good idea. Jdforrester (WMF) (talk) 08:48, 11 May 2014 (UTC)


 * It depends on what those users are doing. In your example 99th percentile is useful, but in more specific/edge actions with less data points the 99th percentile can be too skewed and not actually point to any problem. For instance, if I save a heavy page like m:User:Cbrown1023/Logos I expect it to timeout; if I look at the Special:Contributions for a bot with a million edits and filter by ns8, I know it's going to be slow. That doesn't mean that such heavy pages should be forbidden or that the slow feature should be removed, because it's still better than nothing and doesn't affect normal/reasonable usages. --Nemo 09:31, 11 May 2014 (UTC)

Aspirations for the future
The most ambitious are invited to edit/take over/comment Requests for comment/Performance standards for new features to cover what we may not be ready for yet, but desire to reach at some point. --Nemo 09:39, 11 May 2014 (UTC)

Visibility
If we want this document to be more visible for the average developer, or even make it part of their routine, what's the way to do so? Include it as one point of (a revamped) Manual:Pre-commit checklist? --Nemo 09:39, 11 May 2014 (UTC)

TODO for Sumana, 14 May 2014
Summary of TODOs: (for Sumana) Sharihareswara (WMF) (talk) 20:57, 14 May 2014 (UTC)
 * redlink: improving API Etiquette
 * redlink: overall Wikimedia infrastructure, especially how MediaWiki is used, with diagrams
 * redlink: aspirational "what we want our performance architecture to be in the future" vision (may interrelate with similar vision doc for architecture in general)
 * redlink: list of doc tasks that need doing
 * redlink: UX guidelines
 * change: memcache as a persistence layer? maybe say, "maybe use memcache (and/or other caches) INSTEAD OF a persistence layer!"
 * add: performance goal: making anonymous things and logged in the same speed
 * add: explanation of performance vs scalability
 * add: think of resources as being the thing you deliver. That is the main rep of the content. Add indexing you need in ephemeral tables. You can directly  retrieve/access the main request by getting resource - HTML, JSON, etc -  and rearchitect your indexing layer separately....
 * add: These guidelines are meant to help people writing from the small to the big
 * add: picking the right cache (may be a redlink)
 * add: If you're gonna implement a backend, look for an interface first
 * add: "Maybe going too much into details: in context of frontend caching &  Varnish - do not serve different content from the same URL. Diff  content, diff URL. Must also be true for anon users! URLs  must be deterministic - with proviso that not if you are logged in!  Then wrinkles - cookies submitted ..... caching layers..... should  strive to cache logged in requests..... longstanding problem."
 * add: On cache hits & misses - we do not have a paradigm at WMF of  regenerating big things on cache hit - rather, for big hit, cache on  save. parsercache, i18n, spamblacklist ..... it's a strategy we use. If  you have something large that will take a long time, cache on save  instead of generating dynamically on request. (or schedule a job on  save). Preemptive cache population strategy. (cache-on-save.... vs store.) If something is very expensive to recompute ... use something closer to a store. (we use backend Varnish, which we call a cache, which is kind of closer to store on the cache-to-store spectrum.)
 * add: Matt's link re VE (check with James Forrester to see if it is right)
 * add (from talkpage): TwnMainPage extension cache example
 * add (from talkpage): Latency considerations
 * add (from talkpage): critical paths considerations
 * finally: after all that is done, send note to wikitech-l :)

Added at Architecture meetings/Performance guidelines discussion 2014-05-14: Sharihareswara (WMF) (talk) 13:36, 15 May 2014 (UTC)
 * improve the cache-to-store spectrum distinction, make sure no one thinks any of the caches are actually persistent stores!
 * Re the GettingStarted cookie regex, there are some Gerrit changes linked from the ops-l thread about this issue - grab them to link to if we go forward with this as an example
 * clarify object cache vs memcached in Caching Layers section
 * consider putting https://www.mediawiki.org/wiki/Talk:Performance_guidelines#Latency into the "how to think about perf" section