Requests for comment/CentralNotice Caching Overhaul - Frontend Proxy

Overview
CentralNotice is a system which delivers a banner to every page request in the content namespaces of a wiki. It currently does this in a series of three requests: the first happens in the head and fetches the GeoIP country lookup, then ResourceLoader loads the CentralNotice controller, which in turn loads a dynamic banner. CentralNotice is able to deliver a custom banner to a user based on the following properties: Country (~200), Language (~300), Project (14), User Status (2), Device Type (3), and Bucket (4). There are 30 slots per property combination.

The current CentralNotice architecture is unsatisfactory for several reasons:
 * 1) It uses massively more cache objects than is desired. The worst case is currently about 200GB of on-disk usage based on the average size of a banner being ~8k multiplied by the number of objects in cache (Country * Language * Project * User Status * Device Type * Bucket).
 * 2) We do not load the banner until the ResourceLoader controller has loaded, which results in a page bump.
 * 3) Fundraising uses this to generate income. We know from our tests that the time to first impression is incredibly important, so reducing it is in our favour.
 * 4) Fundraising also has determined that the majority of our donations come from the first page view. We do not have an opportunity to cache these scripts.
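
As a rough check on the first point, the worst-case object count can be multiplied out from the approximate property counts given in the overview. This is only a back-of-the-envelope sketch: the counts are the rounded figures quoted above, and treating the ~8k average banner as 8 × 1024 bytes is an assumption.

```javascript
// Approximate property counts from the overview (rounded figures, not exact).
const counts = {
  country: 200,
  language: 300,
  project: 14,
  userStatus: 2,
  deviceType: 3,
  bucket: 4,
};

// Worst case: one cache object per combination of property values.
const objects = Object.values(counts).reduce((product, n) => product * n, 1);

// Assumed average banner size of ~8k, taken here as 8 * 1024 bytes.
const bytes = objects * 8 * 1024;

console.log(objects, "objects");
console.log((bytes / 1e9).toFixed(0), "GB");
```

With these inputs the product is about 20 million objects and on the order of 160 GB, which is the same order of magnitude as the worst case quoted in point 1.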

The proposed solution to these issues is to have a static JavaScript blob, injected via a skin hook, in every content namespace page that will call up to a special CentralNotice banner server. This server will map the request into approximately 200 disjoint allocations in a lookup table.

The lookup table will hold specific details about the banners in each allocation which will allow, with the use of a random number generator, a direct request by banner name to the backend cache. This lets the backend handle objects much more elegantly, as it only needs to cache a banner by three varying parameters.



The JSONP Call in HEAD
A small snippet of JavaScript in the head of each content page will start a JSONP call to a known URL on the bits servers (e.g. //bits.wm.o/banners). The request parameters will be composed of static variables known at page generation time on a wiki and dynamic variables that come from a cookie or local storage. Dynamic variables are required where the user state has changed from the defaults on the wiki, for example because they logged in or changed their UI language. They are also required so that we may bucket users for A/B testing. These variables are:
 * Static project name (e.g. wikipedia)
 * Static project content language
 * Static user login status (may be overridden by cookie)


 * Dynamic UI language
 * Dynamic bucket (1 through 4 currently)
 * Dynamic user login status

For the moment, the head JS is imagined to look something like:
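
A minimal sketch of such a snippet, under stated assumptions: the parameter names (`project`, `language`, `uselang`, `bucket`, `loggedIn`) and both function names are hypothetical, and the URL is the example one given above. Dynamic values override static ones, and the keys are sorted so that identical requests produce identical URLs.

```javascript
// Hypothetical parameter names; the RFC only lists which variables are sent.
function buildBannerUrl(staticVars, dynamicVars) {
  // Dynamic values (from a cookie or local storage) override the static
  // defaults that were known at page generation time.
  const merged = Object.assign({}, staticVars, dynamicVars);
  const query = Object.keys(merged)
    .sort() // stable parameter order, so equal requests share one URL
    .map((key) => encodeURIComponent(key) + "=" + encodeURIComponent(merged[key]))
    .join("&");
  return "//bits.wm.o/banners?" + query;
}

// JSONP-style request: inject a script tag and let the response script run
// once it has loaded. `doc` is the page's document object.
function requestBanner(doc, url) {
  const script = doc.createElement("script");
  script.src = url;
  script.async = true;
  doc.head.appendChild(script);
}

const exampleUrl = buildBannerUrl(
  { project: "wikipedia", language: "en", loggedIn: 0 },
  { uselang: "fr", bucket: 2, loggedIn: 1 }
);
console.log(exampleUrl);
```

In a page this would be followed by `requestBanner(document, exampleUrl)`; it is left out here so the sketch also runs outside a browser.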

Additional Variables added by Banner Server
CentralNotice needs additional variables in order to know what banner is suitable to serve to a user. These are unique per client and should initially be calculated on the proxy and then passed down for usage in a dynamic cookie. These are:
 * The user's device (e.g. desktop, iphone, android, ...)
 * The user's location as determined by IP address

Note: The user’s geolocation is present already in a cookie generated by the controller. The name of this cookie should not be changed, nor should the name of the global variable the data eventually resides in. So the proposed JS in the HEAD will need to be aware of this.

Proxy Disjoint Set Mapping
CentralNotice will routinely produce a disjoint set of banner allocations and distribute them to the proxy servers. In the current plan this will be a series of three sets of tables: several offset lookup tables, a map table of the disjoint set, and several map line->banner entry tables. Presuming that the proxy server is a Node.js server, string lookups will probably be the most efficient map format, and the lookup would look something like:


 * 1) For each variable, determine the map string offset
 * 2) In each map line, check if offset is set, if so keep map line in future lookups
 * 3) Repeat until you have gone through all variables or map lines; there should be either 1 or no viable map lines

Simplified example:
Query 1: Given an inbound call of a logged in user in Russia on wikipedia; we search all map lines for...
 * 1) offset 0 (wikipedia) which is all of them so no lines are dropped
 * 2) offset 5 (Russia) which is only line 2, so the others are dropped
 * 3) offset 7 (Logged in) which still is only line 2 and is now the final choice so we proceed to choose a random banner

Query 2: Given an inbound call of a logged out user in Canada on wikibooks; we search all map lines for...
 * 1) offset 1 (wikibooks) which drops line 3
 * 2) offset 4 (Canada) which drops line 2
 * 3) offset 6 (Logged out) which drops line 1
 * 4) As there are no valid map lines; no banner is shown
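
The lookup and the two queries can be sketched in a few lines of JavaScript. This is an illustration under assumptions, not the planned table format: the offsets 0, 1, 4, 5, 6 and 7 come from the example, while the key names, the country codes and the contents of the three map lines are made up so that both queries behave as described.

```javascript
// Offset lookup tables, one per request variable. Offsets follow the
// simplified example; the key names are hypothetical.
const OFFSETS = {
  project: { wikipedia: 0, wikibooks: 1 },
  country: { CA: 4, RU: 5 },
  status: { loggedOut: 6, loggedIn: 7 },
};

// Map table of the disjoint set: one string per map line, where a "1" at an
// offset means the line applies to that value. Contents are invented to
// reproduce the two example queries.
const MAP_LINES = [
  "11001001", // line 1
  "11000101", // line 2
  "10001010", // line 3
];

// Returns the 1-based number of the single viable map line, or null if none.
function findMapLine(request) {
  let candidates = MAP_LINES.map((_, index) => index);
  for (const [variable, table] of Object.entries(OFFSETS)) {
    const offset = table[request[variable]];
    // An unknown value is set in no line; otherwise keep only the lines
    // that have this offset set.
    candidates = offset === undefined
      ? []
      : candidates.filter((index) => MAP_LINES[index][offset] === "1");
    if (candidates.length === 0) return null;
  }
  // With a truly disjoint set, at most one line is left at this point.
  return candidates[0] + 1;
}

// Query 1: logged in user in Russia on wikipedia.
console.log(findMapLine({ project: "wikipedia", country: "RU", status: "loggedIn" }));
// Query 2: logged out user in Canada on wikibooks.
console.log(findMapLine({ project: "wikibooks", country: "CA", status: "loggedOut" }));
```

Query 1 resolves to line 2 and Query 2 to no line at all, matching the walk-through above.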

Random Banner Choice
Using a random number generator in the proxy, we select a random banner and then request it from the backend proxy. We optionally composite request data (e.g. feeding back the detected country/device) with the returned banner, which is set in a JavaScript variable ready for consumption by CentralNotice.
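
One way the random choice could work is sketched below. It assumes, as an interpretation rather than a stated design, that each banner entry for a map line occupies some of the 30 slots and that a banner's share of slots is its chance of being picked. The banner names, the `slots` field and `chooseBanner` are hypothetical.

```javascript
// Hypothetical banner entries for one map line. The slot counts add up to 30.
const bannerEntries = [
  { name: "BannerA", slots: 15 },
  { name: "BannerB", slots: 10 },
  { name: "BannerC", slots: 5 },
];

// Picks a banner name with probability proportional to its slot count.
// `random` is injectable so the choice can be tested deterministically.
function chooseBanner(entries, random = Math.random) {
  const totalSlots = entries.reduce((sum, entry) => sum + entry.slots, 0);
  let slot = Math.floor(random() * totalSlots);
  for (const entry of entries) {
    if (slot < entry.slots) return entry.name;
    slot -= entry.slots;
  }
  return null; // only reached when there are no entries
}

console.log(chooseBanner(bannerEntries));
```

The chosen name is what would then be requested from the backend, so the backend only ever sees requests keyed by banner name.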

Returned Object and CentralNotice Controller
The returned object from the banner server will be in the form of a JSONP call that will set global variables (as the mediaWiki object is not guaranteed to be initialized.)

When ResourceLoader loads the BannerController, that code will then decide what to do with the data in the globals, e.g. whether the banner should be shown and where it should be placed on the page. It will also handle making a reporting call back up to the server for analytics purposes.
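
To make the shape of the returned object concrete, here is a minimal sketch of how the banner server might render its response. The global name `centralNoticeBannerData`, the payload fields and `renderBannerResponse` are all hypothetical; the existing geolocation global mentioned earlier would keep its current name and is not modelled here.

```javascript
// Renders a script body that sets one global variable. The response cannot
// rely on the mediaWiki object, so it assigns to window directly.
function renderBannerResponse(bannerHtml, detected) {
  const payload = {
    bannerHtml: bannerHtml,
    country: detected.country,
    device: detected.device,
  };
  // JSON text is usable as a JavaScript expression; "<" is escaped so the
  // body stays safe if it is ever inlined into an HTML script element.
  const json = JSON.stringify(payload).replace(/</g, "\\u003c");
  return "window.centralNoticeBannerData = " + json + ";";
}

console.log(renderBannerResponse("<div>Example banner</div>", { country: "RU", device: "desktop" }));
```

The BannerController would later read this global and decide whether and where to show the banner.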

CentralNotice Provided Mapping Data
Ideally all dynamic mapping data will be provided by CentralNotice in a single blob (JSON/XML/?) that will be updated every time there is an allocation change (e.g. banner added/removed from campaign, campaign enabled/disabled). The blobs will come with an expiry date, and the banner servers should re-request the data from meta.wm.o once the expiry date has passed.

The banner servers should also be able to accept a purge command that will force a re-request of the data from meta.
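
The expiry and purge behaviour could be kept in a small holder object on each banner server, sketched below. The class name, its methods and the `expires` field (milliseconds since the epoch) are assumptions for illustration; actually fetching the blob from meta is left to the caller.

```javascript
// Holds the current mapping blob and reports when it must be fetched again.
class MappingStore {
  // `now` is injectable so expiry can be tested without waiting.
  constructor(now = Date.now) {
    this.now = now;
    this.blob = null;
  }

  // True when there is no blob yet, it was purged, or its expiry has passed.
  needsRefresh() {
    return this.blob === null || this.now() >= this.blob.expires;
  }

  // Called with a freshly fetched blob; `expires` is a hypothetical field.
  update(blob) {
    this.blob = blob;
  }

  // Purge command: drop the blob so the next check forces a re-request.
  purge() {
    this.blob = null;
  }
}

const mappingStore = new MappingStore();
console.log(mappingStore.needsRefresh()); // nothing has been loaded yet
```

A request handler would call `needsRefresh()` before each lookup, or on a timer, and fetch a new blob whenever it returns true.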

Device Detection Regex
Currently we have this in a ResourceLoader JS file; this could easily be moved into a blob for distribution. Basically we currently look for easy strings like 'iphone' or 'android' in the UA string and call it that.

Note: MobileFrontend now also has device detection code. It may be possible to pull this data from there and have it only in one place.
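
The kind of matching described in this section can be sketched as follows. The pattern list is illustrative only and is not the contents of the existing ResourceLoader file; the "desktop" fallback is an assumption based on the device examples given earlier.

```javascript
// Ordered list of [device name, pattern] pairs; the first match wins.
// These two patterns are examples, not the full production list.
const DEVICE_PATTERNS = [
  ["iphone", /iphone/i],
  ["android", /android/i],
];

// Returns the device name for a user agent string, defaulting to "desktop".
function detectDevice(userAgent) {
  for (const [device, pattern] of DEVICE_PATTERNS) {
    if (pattern.test(userAgent || "")) return device;
  }
  return "desktop";
}

console.log(detectDevice("Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X)"));
```

Keeping the patterns in a plain list is what makes it easy to ship them to the banner servers as part of a blob.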

Disjoint Set Data
We currently provide only a somewhat buggy version of this, which we can debug and place into a blob. See an example [//meta.wikimedia.org/wiki/Special:GlobalAllocation live disjoint set].

Deployment considerations

 * 1) Roll out the varnish change and have that point to Special:BannerRandom on meta with all appropriate VCL cookie massaging to URL GET params as required. The request needs to still be put through cache!
 * 2) After testing, add JS snippet to head (this will take at least 40 days to take effect on all pages.)
 * 3) Set the cache epoch to when we added the JS snippet to head to purge out all pages still in cache

Somewhere between steps 1 and 2, roll out the Node service and test the VCL for that. This allows us to switch back and forth between the two solutions as we bugfix.

Reporting
Right now we are using Udp2Log and a null endpoint to detect what banner was actually shown to the user. The endpoint is called after the banner controller has run. This will remain unchanged and will be even more important to have independent of the delivery mechanism in case JS is disabled on the client (we don't want false statistics.)

We should, however, eventually migrate to event logging, but that is not in the immediate scope of this work.

Rationale on the static JS (Content Security Policy violation)
Having a static JavaScript blob will prevent us from deploying a Content Security Policy on the site. Adding more static JavaScript is also discouraged. However, because it is so important for Fundraising to have a reduced time to load, I deem this necessary. Additionally, when we do roll out a CSP we will need to come up with some solution for the ResourceLoader static scripts as well -- whatever solution we come up with there will probably be appropriate here, and discussing that solution is beyond the scope of this RFC.

Software
We can use either Node.js or a Varnish VMOD as the banner server. However, we will implement it in Node.js because that will be simpler, and we do not have to worry about SSL termination, reporting, and all the other front-end problems we’ve already solved with Varnish.

Node Considerations
 * + Failure of the banner code does not take down the rest of cluster
 * + Faster development with fewer bugs
 * + Faster deployment in case of bugs/changes
 * - Requires a wrapper to be written around Maxmind's GeoIP libraries and it's another place to update that data. There is a [GeoIP implementation for NodeJS].
 * - Requires additional servers to be provisioned and maintained in all our data centres (for optimal latency)
 * ? Node's efficiency
 * ? Portability -- we will be locked to this technology
 * ? Though not addressed in this RFC, with node we can run dynamic JS locally on the server that is served with a banner from the backend that can determine if it wants to display or not -- saving bandwidth.

Banner object caching
We have two options:
 * 1) Cache the banner objects in the main cluster varnish cache
 * 2) Cache the banner objects in a local instance of Varnish

The advantage of caching the objects locally is that we will not have symmetric traffic outside of the box, reducing load on the cluster network.

Hardware
I have estimated that we see ~600Mbps peak CentralNotice traffic (~6,500 requests per second) based on current banner requests served and average banner size. Given that this requires redundancy and would greatly benefit from being located in caching centers, I estimate 4 servers (2 in eqiad, 2 in ams) with dual gigabit cards (and 16GB of RAM if we do on-board Varnish) would easily be able to handle the load.

Future Improvements not Addressed in this RfC

 * No delivery of banners that can be deterministically hidden on the banner server
 * Removal of most of the banner controller -- have the banner inject itself appropriately using document.write or similar
 * Reduction in banner size by running JS / CSS through minifiers