Requests for comment/Performance standards for new features

It would be useful to define a codified set of minimum performance standards that new MediaWiki features must meet before they can be deployed to larger Wikimedia sites such as English Wikipedia, or be considered complete.

Background
Originally from a wikitech-l discussion. Right now, varying amounts of effort are made to highlight potential performance bottlenecks in code review, and engineers are encouraged to profile and optimize their own code. But beyond "is the site still up for everyone / are users complaining on the village pump / am I ranting in IRC", we've offered no guidelines as to what sort of request latency is reasonable or acceptable. With such standards in place, if a new feature (like AFTv5 or Flow) turned out not to meet them after deployment, that would be a high-priority bug, and the feature could be disabled, depending on the impact or if the problem were not addressed in a reasonable time frame. Obviously standards like this can't be applied to certain existing parts of MediaWiki, but systems other than the parser or preprocessor that don't meet new standards should at least be prioritized for improvement.

Goals
From a backend perspective, the Wikimedia Foundation now has more resources than it used to have. However, features still need to respect some minimal standard, with two goals:
 * don't bring the cluster down (or otherwise affect other areas of the platform severely),
 * be able to scale features up as planned (as opposed to being forced to disable some crucial component of a feature in a way that makes it useless).

Once those goals are accomplished, most of the exploration should go towards the uncharted territory of what happens between the servers and the users' monitors. We must provide users with what they need for reading or contributing, and neither waste their precious time waiting nor drive them away to some more quickly delivered candy elsewhere on the web.

What it could look like
We could set some measurable standards (numbers are just examples):
 * p999 (long tail) full page request latency of 2000ms
 * p99 page request latency of 800ms
 * p90 page request latency of 150ms
 * p99 banner request latency of 80ms
 * p90 banner request latency of 40ms
 * p99 db query latency of 250ms
 * p90 db query latency of 50ms
 * 1000 write requests/sec (if applicable; write operations must be free from concurrency issues)
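
To make thresholds like these checkable by developers, a set of latency samples can be compared against a budget per percentile. The following is only a minimal sketch: the function names (percentile, checkBudgets), the pageBudgets table and the nearest-rank percentile definition are assumptions made for illustration, not an existing MediaWiki or Wikimedia tool, and the numbers are the example values from the list above.

```javascript
// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(samples, p) {
  if (samples.length === 0) {
    throw new Error('no samples');
  }
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Hypothetical budget table in milliseconds, using the example
// page request numbers from the list above.
const pageBudgets = [
  { p: 90, maxMs: 150 },
  { p: 99, maxMs: 800 },
  { p: 99.9, maxMs: 2000 },
];

// Returns one result per budget, with the observed percentile value
// and whether it stays within the budget.
function checkBudgets(samples, budgets) {
  return budgets.map(({ p, maxMs }) => {
    const observedMs = percentile(samples, p);
    return { p, maxMs, observedMs, ok: observedMs <= maxMs };
  });
}
```

A check like this is only meaningful with enough samples: a p999 figure computed from a few hundred requests is effectively just the maximum.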

And/or less quantitative things:
 * guidelines about degrading gracefully
 * specific limits on total resource consumption across the stack per request
 * etc.

Scope and issues
Things to address, or to exclude from the scope of the RfC.
 * Metrics, if any, should actually be measurable by devs and should not go beyond what is needed for the goals above.
   * Otherwise it can only be after-the-fact measuring by "performance experts".
 * Measurable metrics can make people more confident about +2; volunteers especially are often stuck for months with unactionable performance concerns (or even years for WMF configuration like maintenance reports).
 * Metrics on overall platform performance need to be kept in mind and are being expanded, e.g. page load time (navigationStart to loadEventStart) in milliseconds (slides).
 * Better PHP profiling?
 * At what point to measure, e.g. from the backend perspective or from a transcontinental user's point of view.
 * Also, does it matter if some big areas of the world or big portions of the population are significantly disadvantaged, e.g. by worse latency, bandwidth or hardware? Should some minimum level of functioning be guaranteed?
 * Not everything can be measured. Define a process/safety net to detect performance regressions/fiascos caused in ways nobody thought of, e.g. critical areas of the code/features/kinds of changes which need to be reviewed in some particular way or triple-checked (by some particular people?). Examples of dangerous changes:
   * anything that adds a module to every page (or every auth page),
   * anything that adds an HTTP request to every page,
   * anything that specifies cache headers,
   * anything that sets cookies.
 * An actual testing/measuring playground to replicate realistic conditions (e.g. wiki dataset size, replag, job queue, network topology, ...).
   * Beta cluster?
   * MediaWiki-Vagrant?
   * Testing in production? Gradual deployment.
   * Beta Features?
 * Educational/help material: could be a byproduct to assist a "policy", or the actual outcome of the RfC.
   * For devs: "like understanding if algorithms or queries are O(n) or O(2^n)".
   * For professional and amateur sysadmins: "coherent documentation on how to improve performance of a MediaWiki site at all levels", which is now completely missing.
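
The page load time metric mentioned in the list above (navigationStart to loadEventStart) can be derived from the browser's Navigation Timing data. This is a minimal sketch, not the instrumentation actually used on Wikimedia sites: the helper name pageLoadTimeMs is an assumption, and it takes the timing object as a parameter so that it can also be exercised outside a browser.

```javascript
// Computes page load time in milliseconds from a Navigation Timing
// object such as window.performance.timing.
// Returns null when the data is unavailable or the load event has
// not started yet (loadEventStart is 0 until then).
function pageLoadTimeMs(timing) {
  if (!timing || !timing.navigationStart || !timing.loadEventStart) {
    return null;
  }
  return timing.loadEventStart - timing.navigationStart;
}
```

In a browser, this would be called once the load event has fired, e.g. pageLoadTimeMs(window.performance.timing). Note that this measures from the user's side, so the result includes network conditions the backend cannot see.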

Recommendations

 * Know these things about your code:
   * What page(s) does your extension load on? Is it as granular as it could be? Check your BeforePageDisplay handlers and see if you can't exclude more pages with some judicious ifs.
   * Reload the page. Does any interface element appear after the page has loaded? Does content shift or do interface elements move after DOMContentLoaded?
   * Does your extension result in additional HTTP requests on each page? If so, you should be racking your brain thinking about how to avoid making the additional request.
   * Split your extension payload into (at least) two modules: one is a tiny chrome module that contains *only* the initial interface element that leads the user into your extension (example: the gear icon for ULS); move *everything* else to modules that are lazy-loaded on demand using mw.loader.using.
   * On that note: read about mw.loader.using. You can load modules in JS.
 * We should -2 any new GeoIP lookup that's added before this bug is resolved: https://bugzilla.wikimedia.org/show_bug.cgi?id=57126
 * Cache jQuery selectors and batch DOM modifications.
 * See also Architecture Summit 2014/Performance.
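
The recommendation above to split the payload into a tiny chrome module and a lazy-loaded remainder could be sketched as follows. The helper makeLazyOpener and the module name 'ext.example.full' are hypothetical; the only MediaWiki call assumed is mw.loader.using, which returns a promise that resolves once the named module has loaded. The loader is passed in as a parameter so the sketch stays self-contained.

```javascript
// Returns a click handler that loads the heavy module only on first
// use, then runs `open`. Later clicks reuse the same pending promise,
// so the loader is asked for the module at most once.
function makeLazyOpener(loader, moduleName, open) {
  let pending = null;
  return function onClick() {
    if (!pending) {
      pending = loader.using(moduleName);
    }
    return pending.then(open);
  };
}

// In a browser with MediaWiki loaded, the tiny chrome module might do:
//   const onClick = makeLazyOpener(mw.loader, 'ext.example.full', openPanel);
//   $('.example-gear-icon').on('click', onClick);
// where openPanel and the selector are placeholders for this example.
```

With this arrangement, pages where the user never clicks the entry point pay only for the chrome module.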