Requests for comment/CentralNotice Caching Overhaul - Frontend Proxy

Overview
CentralNotice is a system which delivers a banner with every page request in the content namespaces of a wiki. It currently does this in a series of three requests -- the first happens in the head and fetches the GeoIP country lookup; ResourceLoader then loads the CentralNotice controller, which in turn loads a dynamic banner. CentralNotice can deliver a custom banner to a user based on the following properties: Country (~200), Language (~300), Project (14), User Status (2), Device Type (3), and Bucket (4). There are 30 slots per property combination.

The current CentralNotice architecture is unsatisfactory for several reasons:
 * 1) It uses massively more cache objects than desired (a worst case of about 200GB of on-disk usage, based on an average banner size of ~8k.)
 * 2) The banner is not loaded until the ResourceLoader controller has loaded, which results in a visible page bump

The proposed solution to these issues is to have a static JavaScript blob, injected via skin hook into every content namespace page, that will call a special CentralNotice banner server. This server will map the request into approximately 200 disjoint allocations in a lookup table.

The lookup table will have specific details about the banners in that allocation, which will allow, with the use of a random number generator, a direct request by banner name to the backend cache. This allows the backend to handle objects much more elegantly, as it only needs to cache a banner by three varying parameters.



The JSONP Call in HEAD
A small snippet of JavaScript in the head of a page will call a new subdomain entitled banners.wikimedia.org (or similar). The request parameters will be a composition of static variables known at page generation time on a wiki and dynamic variables that will come from a cookie or local storage. Dynamic variables are required where the user state will have changed from the defaults on the wiki, i.e. they logged in or changed their UI language; they are also required so that we may bucket users for A/B testing. These variables are:
 * Static project name (e.g. wikipedia)
 * Static project content language
 * Static user login status (may be overridden by cookie)


 * Dynamic UI language
 * Dynamic bucket (1 through 4 currently)
 * Dynamic user login status

For the moment, the head JS is imagined to look something like:
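A minimal sketch of what that head snippet might look like, assuming a banners.wikimedia.org endpoint; the parameter names, callback name, and example values are all illustrative, not a finalized interface:

```javascript
// Sketch only: endpoint, parameter names, and callback name are placeholders.
function buildBannerUrl(params) {
  var query = Object.keys(params).map(function (k) {
    return encodeURIComponent(k) + '=' + encodeURIComponent(params[k]);
  }).join('&');
  return 'https://banners.wikimedia.org/banner?' + query + '&callback=insertBanner';
}

// Static variables are baked in at page generation time; the dynamic ones
// (UI language, bucket, login status) would be read from a cookie instead.
var bannerUrl = buildBannerUrl({
  project: 'wikipedia',   // static project name
  language: 'en',         // static project content language
  anonymous: true,        // static login status, overridable by cookie
  bucket: 1               // dynamic bucket (1 through 4)
});

// JSONP: inject a script tag so the response can run before mediaWiki exists.
if (typeof document !== 'undefined') {
  var script = document.createElement('script');
  script.src = bannerUrl;
  document.head.appendChild(script);
}
```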

Additional Variables added by Banner Server
CentralNotice needs additional variables in order to know what banner is suitable to serve to a user. These are unique per client and should initially be calculated on the proxy and then passed down for use in a dynamic cookie. These are:
 * The user's device (e.g. desktop, iphone, android, ...)
 * The user's location as determined by IP address

Proxy Disjoint Set Mapping
CentralNotice will routinely produce a disjoint set of banner allocations and distribute them to the proxy servers. In the current plan this will be a series of three sets of tables: several offset lookup tables, a map table of the disjoint set, and several map line -> banner entry tables. Presuming that the proxy server is a node.js server, string lookups will probably be the most efficient map format, and the lookup would look something like:


 * 1) For each variable, determine the map string offset
 * 2) In each map line, check if that offset is set; if so, keep the map line for future lookups
 * 3) Repeat until you have gone through all variables or map lines; there should be either one or no viable map lines remaining

Simplified example:
Query 1: Given an inbound call of a logged-in user in Russia on wikipedia, we search all map lines for:
 * 1) offset 0 (wikipedia), which is set in all of them, so no lines are dropped
 * 2) offset 5 (Russia), which is set only in line 2, so the others are dropped
 * 3) offset 7 (Logged in), which still leaves only line 2; it is now the final choice, so we proceed to choose a random banner
Query 2: Given an inbound call of a logged-out user in Canada on wikibooks, we search all map lines for:
 * 1) offset 1 (wikibooks), which drops line 3
 * 2) offset 4 (Canada), which drops line 2
 * 3) offset 6 (Logged out), which drops line 1
 * 4) As there are no valid map lines, no banner is shown
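The filtering described above can be sketched as follows; the offsets, bitstrings, and line IDs are hypothetical values chosen only to reproduce the two queries in the example:

```javascript
// Each map line is a bitstring; character i is '1' when that allocation
// covers the property value assigned to string offset i. (Illustrative data.)
var offsets = {
  project: { wikipedia: 0, wikibooks: 1 },
  country: { Canada: 4, Russia: 5 },
  status:  { loggedOut: 6, loggedIn: 7 }
};

var mapLines = [
  { id: 1, bits: '11001001' },
  { id: 2, bits: '11000111' },
  { id: 3, bits: '10001010' }
];

// Step through each variable, keeping only the map lines with that offset
// set; by the end there should be one or no viable map lines.
function findMapLine(query) {
  var candidates = mapLines;
  Object.keys(query).forEach(function (prop) {
    var offset = offsets[prop][query[prop]];
    candidates = candidates.filter(function (line) {
      return line.bits.charAt(offset) === '1';
    });
  });
  return candidates.length === 1 ? candidates[0] : null;
}
```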

Random Banner Choice
Using a random number generator on the proxy, we select a random banner and then request it from the backend cache. We optionally composite the request (e.g. feeding back the detected country/device) with the returned banner, which is set in a JavaScript variable ready for consumption by CentralNotice.
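A sketch of the slot-based random choice, assuming the selected map line carries banner names with slot counts (the data structure here is illustrative):

```javascript
// Pick a banner by choosing a random slot; banners holding more of the
// 30 slots per combination are proportionally more likely to be served.
function chooseBanner(allocation, rand) {
  rand = rand || Math.random;
  var totalSlots = allocation.reduce(function (sum, b) {
    return sum + b.slots;
  }, 0);
  var slot = Math.floor(rand() * totalSlots);
  for (var i = 0; i < allocation.length; i++) {
    if (slot < allocation[i].slots) return allocation[i].name;
    slot -= allocation[i].slots;
  }
  return null;
}
```

The returned name would then be used for the direct request by banner name to the backend cache.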

Returned Object and CentralNotice Controller
The returned object from the banner server will be in the form of a JSONP call that will set global variables (as the mediaWiki object is not guaranteed to be initialized.)
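As a sketch, the JSONP body might look like the following; the global names and fields are placeholders, not a finalized interface:

```javascript
// Runs as soon as the injected script tag loads, before mediaWiki is
// guaranteed to exist, so plain globals are used. (Illustrative names.)
var root = typeof window !== 'undefined' ? window : globalThis;

root.centralNoticeBanner = {
  name: 'exampleBanner',                 // banner chosen by the proxy
  html: '<div id="cn-banner">...</div>'  // banner content from backend cache
};
root.centralNoticeClient = {
  country: 'RU',      // detected from the IP address on the proxy
  device: 'desktop'   // detected from the UA string on the proxy
};
```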

When ResourceLoader loads the BannerController, that code will then decide what to do with the data in the globals, e.g. whether the banner should be shown and where it should be placed on the page. It will also handle making a reporting call back to the server for analytics purposes.

CentralNotice Provided Mapping Data
Ideally all dynamic mapping data will be provided by CentralNotice in a single blob (JSON/XML/?) that will be updated every time there is an allocation change (e.g. a banner added to or removed from a campaign, or a campaign enabled/disabled.) The blobs will come with an expiry date, and the banner servers should request fresh data from meta.wm.o once the expiry date has passed.

The banner servers should also be able to accept a purge command that will force a re-request of the data from meta.
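If JSON is chosen as the format, the blob might be shaped roughly like this; every field name here is hypothetical:

```javascript
// Hypothetical shape of the CentralNotice-provided mapping blob; a banner
// server would re-request it from meta once `expires` passes or on a purge.
var mappingBlob = {
  expires: '2013-01-01T00:00:00Z',                       // illustrative date
  deviceRegexes: { iphone: 'iphone', android: 'android' }, // UA substrings
  offsets: {},   // property value -> map string offset tables
  mapLines: []   // bitstring -> banner entry tables
};
```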

Device Detection Regex
Currently we have this in a ResourceLoader JS file; it could easily be moved into a blob for distribution. Basically, we look for simple substrings like 'iphone' or 'android' in the UA string and classify the device accordingly.
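A sketch of that substring matching; the patterns and returned labels are illustrative (the real list lives in the ResourceLoader module):

```javascript
// Crude UA sniffing as described above; patterns are illustrative only.
function detectDevice(ua) {
  if (/iphone/i.test(ua)) return 'iphone';
  if (/android/i.test(ua)) return 'android';
  return 'desktop';
}
```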

Disjoint Set Data
We currently provide only a somewhat buggy version of this, which we can debug and place into a blob. See an example [//meta.wikimedia.org/wiki/Special:GlobalAllocation live disjoint set].

Reporting
Right now we are using Udp2Log and a null endpoint to detect what banner was actually shown to the user. The endpoint is called after the banner controller has run. This will remain unchanged, and it will be even more important to keep it independent of the delivery mechanism in case JS is disabled on the client (we don't want false statistics.)

We should, however, eventually migrate to event logging; but that's not in the immediate scope of this work.

Software
We can use either Node.js or Varnish as the banner server. Because I think this will be simpler to implement in Node, I'd prefer to start there. However, if the performance is too poor I can certainly write a Varnish VMOD to do the same (but instead of distributing JSON blobs it will probably be XML, because Expat is what I'm familiar with in C).

Node Considerations
 * + Failure of the banner code does not take down the rest of cluster
 * + Faster development with fewer bugs
 * + Faster deployment in case of bugs/changes
 * - Requires a wrapper to be written around MaxMind's GeoIP libraries, and it's another place to update that data
 * - Requires additional servers to be provisioned and maintained in all our data centres (for optimal latency)
 * ? Node's efficiency
 * ? Portability -- we will be locked to this technology
 * ? Though not addressed in this RFC, with node we can run dynamic JS locally on the server that is served with a banner from the backend that can determine if it wants to display or not -- saving bandwidth. Potentially we could do the same thing in a VMOD with Lua.

Varnish VMOD Considerations
 * + Can be developed as a standalone library with VMOD/Nginx/PHP bindings for portability without changing core code
 * + Can use GeoIP code already written
 * + Can eventually reside on the frontend proxy obviating the need for additional servers once proven
 * - Bigger, more rapidly changing, list of servers that will need to be tracked by CentralNotice for purging purposes (possibly can use built in MediaWiki purge mechanism with some changes)
 * - Slower to develop / deploy
 * ? Likely to be faster / more memory efficient than node once optimized

Hardware
I have estimated that we see ~600Mbps peak CentralNotice traffic (~6,500 requests per second) based on current banner requests served and average banner size. Given that this requires redundancy and would greatly benefit from being located in caching centers, I estimate 4 servers (2 in eqiad, 2 in ams) with dual gigabit cards (and 16GB of RAM if we run Varnish on board) would easily be able to handle the load.

Future Improvements not Addressed in this RfC

 * No delivery of banners that can be deterministically hidden on the banner server
 * Removal of most of the banner controller -- have the banner inject itself appropriately using document.write or similar
 * Reduction in banner size by running JS / CSS through minifiers