ResourceLoader/Requirements/Tim Starling

Summary
Recommended features:
 * Concatenation
 * Event-driven script loading
 * File transformations
 * Server-side caching (even with no valid $wgCacheDirectory)
 * Short Squid expiry time and optimised server-side cache hits.
 * Timestamp-based version numbers

Optional features:
 * Footer script tag placement
 * Support for JavaScript modules which work without a MediaWiki installation

Concatenation
Concatenating objects is unequivocally good for the client as soon as the total number of objects exceeds the browser's concurrent connection limit. It's very likely that if we're not in this regime already, we soon will be. Concatenation is useful for scaling up the complexity of our client-side ecosystem.

However, concatenation must be balanced against:
 * 1) The tendency to include excessive amounts of rarely-used code.
 * 2) The overhead incurred when different modules concatenate the same code into different buckets.
 * 3) The need to serve items with different Vary headers (gen=js etc.).

My analysis suggests that to begin with, the number of buckets should be fairly small: say, one for page view, one for edit, and one bucket for each special page that needs its own JS. The page view bucket should contain code that is required for all page views. Scripts like OggPlayer.js that are required on occasional page views should be served separately. Duplication of code (issue 2 above) should be avoided even if it means adding additional requests. Remember that browsers do have mechanisms to mitigate the connection setup overhead.

It makes sense to include skin-specific JS and CSS in the page view bucket. Thus the skin name should be part of the URL, and when the user changes their skin, they will have to reload the common JS and CSS. We should have an associative array of such parameters, and allow them to be added by any module. Some parameters I've identified are:


 * User logged in: only then do they get ajaxwatch.js
 * disablesuggest user option: suppresses mwsuggest.js
 * editsectiononrightclick user option: sends rightclickedit.js

For performance reasons, it makes sense to have these user options be passed from the HTML back to the script loader via the URL. This allows us to send the data with no Vary header.

The other option would be to send out these scripts along with dynamically generated content like user script subpages. But all requests with Vary:Cookie must be forwarded back to Florida, damaging performance for people in Europe. By striving to keep Vary:Cookie requests small, we reduce the number of Florida RTTs and thus the overall latency.

This approach can be extended as long as the length of the parameter blob is not too large. It may even be possible to remove the Vary:Cookie requests completely.
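
A minimal sketch of the parameter array idea, in PHP. The function name buildLoaderUrl, the load.php entry point and the parameter names skin, bucket and loggedin are hypothetical; only disablesuggest comes from the list above. Sorting the keys is an assumption made here so that the same set of options always produces the same URL, and therefore the same cache entry.

```php
<?php
// Hypothetical helper: builds a script loader URL from an associative
// array of parameters that any module may have added to.
function buildLoaderUrl( $base, array $params ) {
	// Drop options that are switched off, so they do not fragment the cache.
	$params = array_filter( $params, function ( $value ) {
		return $value !== false && $value !== null;
	} );
	// Sort by key so the URL does not depend on the order in which
	// modules registered their parameters.
	ksort( $params );
	// Encode enabled flags as "1" to keep the parameter blob short.
	foreach ( $params as $key => $value ) {
		if ( $value === true ) {
			$params[$key] = '1';
		}
	}
	return $base . '?' . http_build_query( $params, '', '&' );
}

$url = buildLoaderUrl( 'load.php', array(
	'skin' => 'vector',
	'bucket' => 'view',
	'loggedin' => true,         // logged-in users also get ajaxwatch.js
	'disablesuggest' => false,  // off, so omitted from the URL
) );
echo $url, "\n";
```

Because the options travel in the URL, the response does not depend on the cookie and can be served from a nearby cache.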

Deferred loading
There are two ways in which the loading of scripts can be deferred:


 * 1) By placing script tags in the page footer.
 * 2) By loading required JS in response to UI events such as button clicks.

For example, Drupal have chosen to move all their script tags except jQuery to the page footer. This improves the time it takes for the webpage to be displayed on the user's first visit. However, my analysis suggests that there are a large number of scripts which we would need to serve from the header:


 * For backwards compatibility and ease-of-use, user-defined scripts such as: MediaWiki:Common.js, MediaWiki:Skinname.js, User:name/skinname.js, JS subpage preview.
 * The result of the AjaxAddScript hook probably also needs to be there for b/c.
 * diff.js adjusts the display pre-render and so needs to be in the header.
 * ajax.js, some parts of gen=js (skin, stylepath variables), jQuery and wikibits are required to be loaded before other scripts, potentially including header-loaded user scripts.
 * common/IEFixes.js, edit.js, metadata.js, search.js and upload.js contain variables that are referenced from the HTML, such as event attributes, so loading them from the footer would cause JS errors if the user interacted with those elements before the script finished loading. This is fixable, but this may well be the initial situation.

Using footer script tag placement for just a few scripts, when the bulk of script loading is done in the header, may reduce the amount of concatenation which can be done, with little performance benefit to offset it.

So, I suggest making footer script tag placement an optional, low-priority part of the initial project.

Event-driven loading can be useful where a large amount of code is required to support a rarely-used feature. However, it needs to be implemented carefully. In particular, it is important to give feedback to the user to indicate that the module is loading. This reassures the user that their click did actually do something, and tells them that they are expected to wait. Footer script tag placement gives some automatic feedback via the browser's UI. With event-driven loading it needs to be entirely implemented by us.
 * A quick note, mwEmbed includes some helper functions for this sort of thing like mw.addLoaderDialog, $j('target').loadingSpinner etc.

Even with appropriate feedback, event-driven loading should only be used when the improvement in initial page view time outweighs the perceived reduction in responsiveness to events. The trade-off is not simply 1:1: the user expects to have to wait for an initial page view, and an extra 500ms there will not be so keenly noticed as a 500ms delay in response to a button click. Similarly, there is an argument for loading subfeatures of a dynamically-loaded feature immediately, instead of waiting for subsequent click events.
 * We should be careful not to mix bottom-of-the-page scripts with domReady loading events, since bottom-of-the-page scripts will delay domReady.


 * I highly recommend we stick with domReady instead of bottom of the page. This is because JavaScript modules may extend other JavaScript modules, and there needs to be a time for that to happen before interfaces start getting drawn to the page. A mw.ready function provides a clean separation between configuration / extension bindings and the interface being rendered to the DOM. With bottom-of-the-page includes, a script may think "now" is a good time to perform actions, but it may be that another script needs to extend it or bind to one of its events before it starts outputting to the DOM.


 * The way it works in the current setup is that your JavaScript module defines a small "loader.js" (more info here); it is essentially a snippet of JavaScript that defines a module and is always included (if the module is enabled). This also allows you to invoke any module in any context. For example, a "view" page could load the wiki-editor inline in response to the "edit" button being pressed. Or the add-media-wizard can load the uploadWizard in response to the upload button within the add-media-wizard. This is different from per-page JavaScript like an UploadPage.js that invokes the uploadWizard module for the current page. On per-page JS we could include all the JavaScript inline if we wanted, but domReady loading is not a bad way to go, since you get a fast page response and a little spinner for a short period of time, and you should generally be able to double up on the cache for that module. Mdale 17:27, 16 June 2010 (UTC)

Several MetavidWiki modules utilise event-driven loading, and I recommend supporting them in the initial project. Where possible, event-driven loading should support concurrent and pipelined downloads; it should not serialise requests. It should also support concatenation and minification.


 * That is correct, the javascript modules load all their resources in a single request with concatenation, minification & localization. Mdale 17:27, 16 June 2010 (UTC)

CSS
My analysis suggests that CSS needs to be included in the script loader project, as well as JavaScript. We are currently sending a large number of stylesheets to clients, which would benefit from concatenation.

It may be possible to use downlevel-revealed conditional comments to concatenate browser-specific CSS with general CSS. Instead of this:



We could have this:



Open source code for CSS minification exists.

Transformations
We would like to support the following transformations:


 * JavaScript minification
 * CSS Janus
 * CSS minification

These can all be done in pure PHP and cached in $messageMemc or in a new table in the database. Caching in $wgCacheDirectory could be benchmarked also. The usual gzip output handler can be in front (wfGzipHandler).
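
As a rough illustration of caching a transformation result, the sketch below keys the cache on a hash of the source text, so an edited file never hits a stale entry. ArrayCache is a stand-in for a real cache such as $messageMemc, and stripBlankLines is a deliberately trivial placeholder, not a real minifier; both names, and getTransformed, are hypothetical.

```php
<?php
// Stand-in for a real cache backend; only get() and set() are modelled.
class ArrayCache {
	private $data = array();

	public function get( $key ) {
		return isset( $this->data[$key] ) ? $this->data[$key] : false;
	}

	public function set( $key, $value ) {
		$this->data[$key] = $value;
	}
}

// Placeholder transformation: trims each line and drops empty ones.
// A real JavaScript or CSS minifier would be substituted here.
function stripBlankLines( $source ) {
	$lines = array_map( 'trim', explode( "\n", $source ) );
	return implode( "\n", array_filter( $lines, 'strlen' ) );
}

// Returns the transformed source, running $transform only on a cache miss.
function getTransformed( $cache, $transformName, $source, $transform ) {
	$key = 'resource:' . $transformName . ':' . md5( $source );
	$result = $cache->get( $key );
	if ( $result === false ) {
		$result = call_user_func( $transform, $source );
		$cache->set( $key, $result );
	}
	return $result;
}

$cache = new ArrayCache();
$css = "body {\n\n  margin: 0;\n}\n";
echo getTransformed( $cache, 'strip', $css, 'stripBlankLines' ), "\n";
```

The gzip output handler would sit in front of this, so the cache only needs to hold the uncompressed result.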

I assume Michael's demand that MediaWiki startup time be avoided is based on a misconception about how fast MediaWiki is to start up. Startup time for a default installation with no APC is 32ms on my laptop, as measured with a simple entry point and ab -c1. This should be fast enough, as long as our concatenation strategy is sufficiently aggressive. It doesn't load the localisation system unless it is requested, let alone the entire code base. MediaWiki has several high-traffic lightweight entry points; it has already been optimised for this role.

On Wikipedia, the performance loss due to the large number of extension setup files is offset by gains from APC and faster processors, giving a startup time of around 13ms. This overhead will be reduced by having Squid in front.

mdale notes on MediaWiki startup
Apologies for my misconception. Yes, we are not worried about start-up time in the Wikimedia or "managed" context, since it will be behind Squid or Varnish. The quick ~check cache file~ entry point is really about wikis with irregular traffic on shared hosting accounts.

Some more benchmarks. All requests are ab -c 50 -n 1000 for class=wikibits,window.jQuery,mwEmbed, about 190K, with Apache Bench (gzip off):


 * Laptop (cache file check, no MediaWiki webstart.php): 3.5 ms per request, 282.61 requests per second
 * Laptop, with the include of MediaWiki webstart.php moved a few lines above the cache check: 26.7 ms per request, 37.44 requests per second (much more memory used; laptop almost crashes)
 * Prototype includes APC (cache file check, no MediaWiki webstart.php): 0.598 ms per request, 1672.38 requests per second
 * Prototype includes APC, with the include of MediaWiki webstart.php: 5.65 ms per request, 176.96 requests per second

It should be noted that this was just moving webstart up a few lines. The real system would probably check a few things like file modification times, DB message key versions, etc.

But I guess you're right ... computers are so fast nowadays, what's an extra order of magnitude? Assuming the worst case, where a resource-starved shared hosting account gave as little as 40 requests per second, that is still roughly 3.5 million requests per day; you will probably hit other scalability issues before reading from the APC opcode cache becomes your bottleneck. On the other hand, if you can have 9x performance without too many engineering headaches, why not? You could also set up a basic expiry of one hour and read must-revalidate headers, so you only include webstart every hour or when people do a shift-refresh. This could let shared hosting sites weather "slashdottings" while not losing much in terms of maintainability. Either way, I don't think it's a demand that we avoid MediaWiki start-up.


 * The main reason for starting up MediaWiki is to get its configuration. Once you have configuration, you can access things like:
 * Memcached
 * Cache directory settings
 * Module registration (including from extensions)
 * Dynamically generated content
 * The other reason is that it makes it far easier to implement cache invalidation (as opposed to expiry). For instance, when a user changes $wgAjaxWatch in LocalSettings.php, we want it to take effect immediately. If they have to wait an hour, they will file a bug before the hour ends, and rightly so. We can't just monitor the file timestamp of LocalSettings.php itself, since the user may have split up their configuration file into multiple files, like what Wikimedia does.
 * Starting MediaWiki allows us to implement complex cache invalidation policies, where resource caches are invalidated immediately based on any number of configured criteria. -- Tim Starling 02:49, 17 June 2010 (UTC)
 * It is analogous to writing out a .html or .js file for Apache to serve, or a poor man's Squid cache. The dumb/fast entry point would (just like Squid) hit the rest of MediaWiki every hour or so to check that the cache is still valid. It does not matter if the content is "dynamic"; either way you will be unlinking the cached file or purging the cache in the Squids on any user interaction that invalidated it (LocalSettings.php updates, preference change, message key update, etc.). Then the dumb/fast entry point would just do the key-to-file check, find it missing, then continue loading MediaWiki to build out the requested data. Yes, you may need a separate LocalCacheSettings.php file or something like that. It was done this way in reaction to people saying going through MediaWiki would be too slow. But as mentioned above, I see no problem incurring MediaWiki's extra startup-time cost, and pointing the "poor man" to the Squid install documentation. Mdale 23:39, 17 June 2010 (UTC)

The registration interface
For PHP callers, there is some value in registering and grouping scripts. For event-driven loading, there is even more value in it.

I propose splitting up registration of core and extension scripts, similar to what we do with special pages and autoload classes, to avoid excessive performance overhead when loading DefaultSettings.php. Core scripts should be registered in a static member variable of the script loader class, and extension scripts should be registered in a global variable with the same format.

Like in AutoLoader.php, core filenames should be relative to $IP, and extension filenames should be absolute.

A possible format would be to have files and groups of files sharing the same namespace. A file could be registered with:

'jquery.ui.draggable' => array( 'file' => 'js/jquery/ui/draggable.js' ),

Or with shortcut notation:

'jquery.ui.draggable' => 'js/jquery/ui/draggable.js',

The type can be guessed from the filename. Type classification only needs to be done for requested files so the performance overhead would not be too onerous. However, dynamically generated content would need a type option:

'loader' => array( 'type' => 'js', 'callback' => array( 'ScriptLoader', 'getLoader' ) ),
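
One way the shortcut notation and the type guess might be handled is sketched below. normalizeResource is a hypothetical helper, and recognising only the js and css extensions is an assumption; it would run only for requested resources, as described above.

```php
<?php
// Expands the shortcut notation into the full array form and fills in a
// missing 'type' from the file extension.
function normalizeResource( $entry ) {
	if ( is_string( $entry ) ) {
		$entry = array( 'file' => $entry );
	}
	if ( !isset( $entry['type'] ) && isset( $entry['file'] ) ) {
		$ext = strtolower( pathinfo( $entry['file'], PATHINFO_EXTENSION ) );
		if ( $ext === 'js' || $ext === 'css' ) {
			$entry['type'] = $ext;
		}
	}
	// Entries with a callback keep whatever explicit type they declared.
	return $entry;
}

print_r( normalizeResource( 'js/jquery/ui/draggable.js' ) );
```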

Files can have dependencies which need to be loaded before them:

'uploadPage' => array( 'file' => 'js/uploadPage.js', 'deps' => array( 'loader' ) ),

Dependencies need not be of the same type. A complex module, then, can be defined as a list of dependencies with no file member:

'jquery.ui' => array( 'deps' => array( 'draggable', 'droppable', 'resizable', ... ) ),

Having registered all this data, the calling code to include both CSS and JS becomes very simple:

$wgOut->addResource( 'jquery.ui', array( 'bucket' => 'all' ) );

For non-concatenated requests, the bucket option would be omitted:

$wgOut->addResource( 'jquery.ui' );

The client side interface has no buckets, so adding resources would generally require only the resource name.
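
The registration format above could be turned into a load order with a depth-first walk over the deps lists. This is only a sketch: resolveLoadOrder and visitResource are hypothetical names, the droppable entry and its dependency on draggable are invented for the example, and pure groups (no file or callback) are assumed to contribute nothing but their dependencies.

```php
<?php
// Returns loadable resource names so that every dependency precedes the
// resources that need it, with each name appearing only once.
function resolveLoadOrder( array $registry, array $names ) {
	$order = array();
	$state = array(); // name => 'visiting' or 'done'
	foreach ( $names as $name ) {
		visitResource( $registry, $name, $state, $order );
	}
	return $order;
}

function visitResource( array $registry, $name, array &$state, array &$order ) {
	if ( isset( $state[$name] ) ) {
		if ( $state[$name] === 'visiting' ) {
			throw new Exception( "Circular dependency at $name" );
		}
		return; // already handled, avoids double-loading
	}
	if ( !isset( $registry[$name] ) ) {
		throw new Exception( "Unknown resource $name" );
	}
	$state[$name] = 'visiting';
	$entry = $registry[$name];
	$deps = ( is_array( $entry ) && isset( $entry['deps'] ) ) ? $entry['deps'] : array();
	foreach ( $deps as $dep ) {
		visitResource( $registry, $dep, $state, $order );
	}
	$state[$name] = 'done';
	// Pure groups have nothing of their own to send.
	if ( is_string( $entry ) || isset( $entry['file'] ) || isset( $entry['callback'] ) ) {
		$order[] = $name;
	}
}

$registry = array(
	'loader' => array( 'type' => 'js', 'callback' => array( 'ScriptLoader', 'getLoader' ) ),
	'uploadPage' => array( 'file' => 'js/uploadPage.js', 'deps' => array( 'loader' ) ),
	'draggable' => 'js/jquery/ui/draggable.js',
	'droppable' => array( 'file' => 'js/jquery/ui/droppable.js', 'deps' => array( 'draggable' ) ),
	'jquery.ui' => array( 'deps' => array( 'draggable', 'droppable' ) ),
);
print_r( resolveLoadOrder( $registry, array( 'jquery.ui', 'uploadPage' ) ) );
```

Tracking visited names also gives a simple safeguard against loading the same resource twice when two requested resources share a dependency.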


 * Pushing all dependency mapping to PHP makes it difficult for a user script or gadget to know everything that is installed in a given script context. Imagine that wiki A does not include jquery.ui by default; rather, an extension has an opt-in feature that includes jquery.ui as a dependency of "droppable", while on wiki B another extension has an opt-in feature that only includes jquery.ui. Now you have a user script that thinks "jquery.ui" includes "droppable" and dynamically requests 'jquery.ui': on some wikis ui.droppable will be defined and on others it won't. Or, if you define it the other way round, so that "droppable" includes jquery.ui, and you wanted to load "droppable" on another page where a different user script had already defined jquery.ui, there would be no way to access "droppable" via the resourceLoader without also including jquery.ui.


 * Aside from config and user script contexts, there are client-state-based conditionals. You can imagine large sets of JavaScript to support SVG with Flash being a dynamic dependency of an SVG application only if the client does not have native SVG support and has the Flash plugin. As the HTML platform rapidly grows in complexity, more and more sub-components will fill various gaps; loading all the gap-fillers for browsers that will never use that code will eventually affect performance.


 * And there is graceful degradation: in a recent bug, loading jquery.ui crashed Netscape 7, so jquery.ui had to be loaded conditionally.


 * Mapping all those conditionals to PHP could be difficult and result in hackish PHP representations of JavaScript conditional states. Modular "loader JavaScript" driving the request set based on both PHP config and client context is closer to ideal, and lets you optimize on both transport and package size rather than trying to map out everything in PHP.


 * And finally, for modular extending of JavaScript interfaces in a dynamic loading context: for example, you have the entire uploadWizard JavaScript loaded on edit pages only when the user clicks on "insert image". If a user script wants to add on a transport for uploading from a new tag, the uploadWizard loader has a binding event, 'UploadWizard.UpdateRequest', that lets any script bind to that hook and add the resources it needs, so that the uploadWizard can still do its build-out in a single resourceLoader request and include all the dynamic modules without substantially increasing the payload for the "edit" page. These modules need to check conditionals, such as whether the browser supports the tag, and maybe it shares the XHR transport mechanism that only gets included in some contexts or config usages of the uploadWizard, etc. Mdale 05:21, 16 June 2010 (UTC)
 * I believe that these concerns could be addressed with the parameterization mechanism Tim describes in his section on concatenation (search for "disablesuggest"). This would work for your second, third and fourth examples with 'includeFlashSVG', 'includeJUI' and 'supportsCamInput' parameters, respectively. As for your first example, I believe extensions need to not be stupid and avoid using conflicting resource names (e.g. by prefixing the extension name). The other part of the problem pertains to double-loading; there should be safeguards against that anyway. --Catrope 12:15, 16 June 2010 (UTC)
 * Your user script may have difficulties adding "supportsCamInput" into the PHP config files for the resource loader. Second, this is really about expected response and easy client-side dependency checking. Of course you can parameterize everything, but what you're left with is a long string with a complicated way to describe what you want, instead of explicitly listing what you want. Say you need [ResourceA, ResourceB, ResourceC] and not ResourceD or ResourceF. You look up the currently deployed version of the code or extension that includes the JS you want (being sure not to accidentally read trunk or some older checkout). You find that if you request ResourceC from the resourceLoader, you get [ResourceA, ResourceC, ResourceD, ResourceF]. You now need to send a request to the resource loader that says: give me [ResourceC, NotResourceD_Flag, NotResourceF_Flag, 'ResourceB'] (you have to remember that ResourceA is a dependency of ResourceC). Is that really better than just requesting [ResourceA, ResourceB, ResourceC]? Furthermore, by proposing the use of "draggable" as a resource name instead of $j.ui.draggable, you have a disconnect between what the resource is called and what it defines. With a one-to-one mapping of resource to defined name, it's easy to avoid unintended consequences. For example, you have extended properties of $j.ui and now you want to load "draggable" from the resourceLoader. Since you can't get "draggable" without $j.ui, it will automatically redefine $j.ui and you lose your extended methods. If instead you requested $j.ui.draggable, you know it will go into $j.ui.draggable and define $j.ui.draggable and nothing else.


 * Just to be clear: I do support being able to call grouped script names like mw.load("draggable" ...); just that the module and its dependencies should be defined in JavaScript, not PHP. This way its resources and configuration can be checked client-side, and you don't have an opaque relationship between what you request and what you get.


 * One more point, using grouped resource names php-side makes script debugging slightly complicated in dynamic loading contexts. If your script is requesting "draggable" which stands in for a few raw files, in debug mode php now has to output a chain of raw file and php generated file calls then issue the callback at the end of that call set. With one-to-one resource mapping the debug setup is less complicated. Mdale 16:56, 16 June 2010 (UTC)

I suggest the terminology "resource", since it's rarely been used in MediaWiki before, so it's suitable as a new jargon word. It's general enough that it can apply to CSS, JS, and things that we might support in the future like CSS sprites. The configuration global could be $wgResourceList or $wgResourceConf, something like that. ResourceLoader would be a good class name. "Script" implies JavaScript, which is potentially confusing. The JS2 terminology "class" is ambiguous and confusing, since that word too is used for other things.


 * I agree on naming change to "resource", "resourceName" away from "script", "class" Mdale 05:21, 16 June 2010 (UTC)

Localisation
As previously discussed with Trevor and Michael, localisation should be done by having a message key list in some special format at the top of the source file. The presence of such localisation in a file should be noted in the file registration, to avoid the overhead of scanning unlocalised files:

'foo' => array( 'file' => 'js/foo.js', 'l10n' => true ),


 * How would this registration work for user scripts? In the context of minification and gzipping, I don't know if searching for a string would be that costly. Also, we recently added an 'includeAllModuleMessages' function for extensions that are primarily JavaScript-driven and want all of their messages in their primary class. UploadWizard is using such message substitution. Mdale 14:16, 16 June 2010 (UTC)

This would add a dependency on a message resource. Message resources would need a special caching and invalidation system. The cache should be in the database. Two tables are necessary:


 * A table called msg_resource, which stores JS blobs for each resource name
 * A table called msg_resource_links, which has a row per message per resource and an index each way.

MessageCache::replace should load the list of message resources of which the given message is a part. Then each relevant message resource should be loaded, modified, and pushed back into msg_resource. Locking selects can be avoided using a blind conditional update followed by affectedRows. I can explain what that means if you need me to.
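
One common reading of "blind conditional update followed by affectedRows" is sketched below: write without locking, make the WHERE clause conditional on the row being unchanged since it was read, and use the affected row count to detect a lost race. The sketch uses PDO with an in-memory SQLite database (it needs the pdo_sqlite extension) rather than MediaWiki's database layer, and the column names mr_resource, mr_blob and mr_version are hypothetical, since only the table name is given above.

```php
<?php
$db = new PDO( 'sqlite::memory:' );
$db->setAttribute( PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION );
$db->exec( 'CREATE TABLE msg_resource (
	mr_resource TEXT PRIMARY KEY, mr_blob TEXT, mr_version INTEGER )' );
$db->prepare( 'INSERT INTO msg_resource VALUES ( ?, ?, 1 )' )
	->execute( array( 'foo', json_encode( array( 'foo-title' => 'Old' ) ) ) );

// Reads the current blob and version without taking a lock.
function loadMessageBlob( PDO $db, $resource ) {
	$stmt = $db->prepare(
		'SELECT mr_blob, mr_version FROM msg_resource WHERE mr_resource = ?' );
	$stmt->execute( array( $resource ) );
	return $stmt->fetch( PDO::FETCH_ASSOC );
}

// Writes the new blob only if the row still has the version we read.
// Returns true on success, false if another writer got there first.
function saveMessageBlob( PDO $db, $resource, $newBlob, $expectedVersion ) {
	$stmt = $db->prepare(
		'UPDATE msg_resource SET mr_blob = ?, mr_version = mr_version + 1
		 WHERE mr_resource = ? AND mr_version = ?' );
	$stmt->execute( array( $newBlob, $resource, $expectedVersion ) );
	return $stmt->rowCount() === 1;
}

// Load, modify and push back; on a lost race, reload and try again.
function replaceMessage( PDO $db, $resource, $key, $text, $maxTries = 5 ) {
	for ( $i = 0; $i < $maxTries; $i++ ) {
		$row = loadMessageBlob( $db, $resource );
		if ( $row === false ) {
			return false; // no such message resource
		}
		$messages = json_decode( $row['mr_blob'], true );
		$messages[$key] = $text;
		if ( saveMessageBlob( $db, $resource, json_encode( $messages ), $row['mr_version'] ) ) {
			return true;
		}
	}
	return false;
}

var_dump( replaceMessage( $db, 'foo', 'foo-title', 'New' ) );
```

The row is never locked between the read and the write, so concurrent message edits do not block each other; the loser of a race simply repeats its load, modify and update cycle.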


 * Sounds reasonable. I had done some hacks in that direction but we really needed specialized tables for performance reasons. It needs to be benchmarked but it might be faster to have wiki edits that update messages just unlink files and purge proxy cache that include its message key, this way dynamic / ondemand resource loading will just have to check a single file existence rather than hit the db cache / load all the db code on every request. ( I agree that in the context of page output with resource-loader inline links its not very costly to do a few more db queries. )    Mdale 14:16, 16 June 2010 (UTC)

The message resource cache should also store the last modification timestamp of the script file. When the script file changes, the message resource can be rebuilt. There should be no need for this cache to expire.


 * This adds some weight to the resource-loader for on-demand includes, i.e. if you request a set of resources on demand, then the system must check the modification timestamp of all the requested files before returning the package. It's non-obvious how that works with the reverse proxy. You either have to store the version for every possible dynamic include in an "inline linked" / generated file, or use some global version request number and assume few file modifications that don't touch some file that is "inline linked" that updates your global version. Or just wait the few hours for the cache to expire ;) Mdale 14:16, 16 June 2010 (UTC)

Compared to Michael's JS2 scheme, this scheme makes it more difficult to construct scripts which are useful both with and without MediaWiki being present. I assume this is what Michael means by "stand alone usage". Wikimedia's role is to support Wikimedia websites and MediaWiki users. I don't think non-MediaWiki users of Metavid scripts should be on our list of priorities at all.


 * I don't think this is what I mean by stand alone usage. The stand alone scripts already support localization from the standard i18n php localization files. This is how the video player / add-media-wizard etc, are localized as a gadget via translate wiki, with a mediawiki independent check out of mwEmbed. In the stand alone usage context you don't check the mediawiki DB for the messages rather just use the php files. When used with mediaWiki, we use the normal mediaWiki wfGetMsg function so it pulls messages from the MessageCache which includes the DB message updates.


 * Are you proposing the messages not be stored in i18n.php files? Mdale 14:16, 16 June 2010 (UTC)

The problem is still tractable: an interested developer could write a maintenance script which outputs the localized JavaScript for use by external installations. I just don't think that's something that Wikimedia should pay developers to do, unless we have contractual obligations.


 * I already create static (JS-only) release packages simply by requesting the scripts via the resource-loader; to get the package in a different language you just set the language key. Mdale 14:16, 16 June 2010 (UTC)

Versioning and HTTP proxy caching
Version numbers like $wgStyleVersion have limited utility. They are useful when you have simultaneous updates of the HTML and the linked resource, and they are useful for discarding the client-side cache of a logged-in user. However, they are not useful to discard the cache of a logged-out user, since these users will have the version number cached for as long as possible in the HTML. This is because generating page view HTML is very expensive. It doesn't make sense to expire expensive HTML, when we can just expire cheap script loader requests instead.

Script loader requests will be cheap as long as the script loader supports 304 responses and cache hit ratios are high. I think we should aim to meet those conditions, and then set expiry headers on the order of one hour.
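
A minimal sketch of the 304 handling, assuming the loader already knows the latest modification time of the requested resources. buildLoaderResponse is a hypothetical function that returns the status and headers instead of sending them, and the exact Cache-Control value is an assumption based on the one-hour figure above.

```php
<?php
// Decides how to answer a script loader request.
// $lastModified is a UNIX timestamp; $ifModifiedSince is the raw
// If-Modified-Since request header, or null if the client sent none.
function buildLoaderResponse( $lastModified, $ifModifiedSince, $maxAge = 3600 ) {
	$headers = array(
		'Last-Modified' => gmdate( 'D, d M Y H:i:s', $lastModified ) . ' GMT',
		'Cache-Control' => "public, max-age=$maxAge",
	);
	$clientTime = $ifModifiedSince === null ? false : strtotime( $ifModifiedSince );
	if ( $clientTime !== false && $clientTime >= $lastModified ) {
		// The client's copy is current: no body is generated or sent.
		return array( 'status' => 304, 'headers' => $headers );
	}
	return array( 'status' => 200, 'headers' => $headers );
}

$lastModified = gmmktime( 17, 27, 0, 6, 16, 2010 );
$response = buildLoaderResponse( $lastModified, 'Wed, 16 Jun 2010 17:27:00 GMT' );
echo $response['status'], "\n";
```

With a short expiry, most repeat requests end in the cheap 304 branch, which never reads or transforms the resource files.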

The highest-timestamp version scheme in JS2 was good, I think we should keep that general idea. However, using UNIX timestamps would make it very challenging for sysadmins to do a forced purge of a file from the squid cache.

To assist sysadmins, I think the timestamps should be human-readable, say ISO 8601, and I think they should be rounded down to the nearest 10 seconds. We should have a maintenance script "purgeResource.php", similar to purgeList.php, which can purge every possible variant of a given URL within a given time period. Such a script should be able to do 10k req/s or so, so it would be able to purge a whole month of URLs in about 25 seconds. Without the 10s rounding, it would take 250 seconds, which would be much less convenient.
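
The rounding and the enumeration a purge script would need can be sketched as follows. The function names are hypothetical, and the ISO 8601 basic format (no colons, so it is URL-safe) is one possible choice among the ISO 8601 variants, not something fixed above.

```php
<?php
// Rounds a UNIX timestamp down to a multiple of $step seconds and formats
// it as an ISO 8601 basic-format UTC string.
function formatResourceVersion( $unixTime, $step = 10 ) {
	$rounded = $unixTime - ( $unixTime % $step );
	return gmdate( 'Ymd\THis\Z', $rounded );
}

// Lists every version string that could have been issued between
// $start and $end inclusive, for use when purging all URL variants.
function listResourceVersions( $start, $end, $step = 10 ) {
	$versions = array();
	for ( $t = $start - ( $start % $step ); $t <= $end; $t += $step ) {
		$versions[] = formatResourceVersion( $t, $step );
	}
	return $versions;
}

// 17:27:36 UTC is rounded down to 17:27:30.
echo formatResourceVersion( gmmktime( 17, 27, 36, 6, 16, 2010 ) ), "\n";
```

A 30-day month has 2,592,000 seconds, so with 10-second rounding there are 259,200 candidate version strings per URL variant, which is where the figure of roughly 25 seconds at 10k req/s comes from.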
 * Okay, I originally wrote some objections here, but I think I understand the issue better now. At other sites I've solved the issue of skew between JS and PHP by tightly coupling them, so to stop loading an undesired JS file, you rev the PHP and JS together. That works fine for sites that are highly dynamic and different on every page load, like Flickr, but Wikipedia depends on static caching much much more. That's an interesting problem but the idea of a time-based purge still seems problematic to me. Maybe I'd just have to see it done in practice. NeilK 21:13, 11 June 2010 (UTC)


 * A caveat here is that sometimes you may not want "old" HTML with "new" JavaScript interface code. Say you add some markup to better support section editing, or need a few more PHP configuration vars passed to the inline JavaScript page output. You don't want to have to check that the markup or vars exist for 30 days while the HTML cache is cleared. We may want some way to flag script versions to freshly generated HTML. The trade-off of never purging unless the page updates the URL is that every time you fix some bug in JavaScript code, you have to manually purge all the page caches to get the update.
 * I agree with Tim here, we should lean towards low resourceLoader expiry times. The JavaScript already has to be sensitive to slight page structure differences from different skins, and we should move to packaging more configuration variables into resourceLoader requests rather than inline page output. This way JavaScript configuration changes can be handled similarly to message updates, that is, by purging some JavaScript resource rather than having to update the inline page output with configuration data. Mdale 16:12, 16 June 2010 (UTC)
 * Yes, this is an issue, but I believe we should take the trouble of coding around this in JS (not even always needed if you use  due to the semantics of the empty jQuery object) because it's worth the much nicer caching infrastructure. --Catrope 16:48, 16 June 2010 (UTC)