Requests for comment/Data-driven Zero Varnish configuration

From mediawiki.org
Request for comment (RFC)
Data-driven Zero Varnish configuration
Component General
Creation date
Author(s) Yurik
Document status declined

This RFC might be superseded by the Unfragmented ZERO RFC.

To remove all carrier-specific settings from the Varnish config files, we propose using the netmapper plugin to map the client IP to a magic string. That string will contain all the information required to identify Zero traffic.

In the current Varnish code, netmapper maps the client IP to the carrier ID (X-CS). Carrier-specific validation follows, and if all checks pass, the X-CS header is set:

id = netmapper(client.ip);
if (id == "250-99") {
  if ((url=="ru.wikipedia.org" || url=="en.wikipedia.org") &&
      (isDirectTraffic || proxy == "Opera") &&
      (...)
  ) {
    set Header.X-CS = id;
  }
} else if (id == "...") ...

We propose to alter this system to remove all carrier-specific code from Varnish:

magicStr = netmapper(client.ip);
// parse magicStr into id, supported_languages, proxies, and other values
id, supported_languages, proxies = parse_magic(magicStr);
if (url_language in supported_languages && current_proxy in proxies && ...) {
    set Header.X-CS = id;
}

Various carrier configurations

How carriers identify traffic
  • IP-based - carrier whitelists Wikimedia's entire IP range, implies HTTPS support
  • IP-based, no images - carrier whitelists all IPs except those of upload.wikimedia.org (legacy contracts only), only allows zero.wikipedia.org (unless we implement a no-image m.), implies HTTPS support
  • URL-based: all languages or list of languages, both m. & zero. or just one of them - carrier does DPI to whitelist matching traffic
Connections
  • Multiple direct gateways (several ranges of originating carrier IP addresses, each possibly having different settings)
  • Opera Mini
  • Nokia and other proxies
  • Some carriers support HTTPS - either for direct gateways (must whitelist IPs, not URLs), or via a custom browser+proxy, e.g. Opera Mini, which can whitelist URLs while also supporting HTTPS

Exposing configuration via API

The API returns a mapping between sets of IPs and magic strings, where each magic string contains all the data required to decide whether a given request is Zero. The magic string could take many forms, and its format is the biggest unknown at this point: it has to cover every possible usage scenario while remaining concise and easy to parse.
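As a rough illustration of the lookup side, the sketch below resolves a client IP to its magic string using Python's standard ipaddress module. It is not how netmapper itself works. The payload shape (magic string mapped to a list of CIDR ranges), the names API_RESULT, build_lookup and lookup_magic, and the documentation-reserved IP ranges are all assumptions made for the example. The magic strings are treated as opaque values here.

```python
import ipaddress

# Hypothetical API payload: magic string -> list of CIDR ranges.
# The ranges are documentation-reserved placeholders, not real carrier data.
API_RESULT = {
    "123-11 ,* +ssl,*": ["192.0.2.0/24"],
    "123-13 ,en.m,fr.m": ["198.51.100.0/25", "203.0.113.0/24"],
}


def build_lookup(api_result):
    """Flatten the mapping into a list of (network, magic_string) pairs."""
    return [
        (ipaddress.ip_network(cidr), magic)
        for magic, cidrs in api_result.items()
        for cidr in cidrs
    ]


def lookup_magic(table, client_ip):
    """Return the magic string for client_ip, or None if no range matches."""
    addr = ipaddress.ip_address(client_ip)
    for network, magic in table:
        if addr in network:
            return magic
    return None


table = build_lookup(API_RESULT)
```

A linear scan is enough to show the idea; a production lookup would more likely use a prefix tree or a sorted range table.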

The magic string is a set of substrings separated by a space character:

"<X-CS ID> <proxy[+ssl],lang.subdomain,lang.subdomain,...> <...>"
  • <X-CS ID> in format 250-99
  • proxy - either empty string for direct connection, or the name of the proxy, e.g. "Opera"
  • ssl - either empty string for non-ssl connection, or the keyword "ssl" if this connection could come via HTTPS
  • domain+languages is a set of comma-separated strings, one for each allowed language+domain pair (or a wildcard). It could be one or more of the following values:
    • "*" - all languages on all domains (ip-whitelisting). Note that this implies desktop as well as sister sites (wikiquotes, wikibooks, etc), hence header may be set on everything. If given, must be the only value present.
    • "*.m" - all languages on m.wikipedia.org domain. If given, no specific languages in .m domain should exist.
    • "*.zero" - all languages on zero.wikipedia.org domain. If given, no specific languages in .zero domain should exist.
    • "lang.subdomain" - a specific language on the wikipedia.org domain; for example, "en.m,fr.m,en.zero,fr.zero" would whitelist 2 languages on both m and zero.
{
  // supports only non-proxied connections, IP-based, with HTTPS
  "123-11 ,* +ssl,*" : [ip list],
  // supports only non-proxied connections, zero only, IP-based, with HTTPS
  "123-12 ,*.zero +ssl,*.zero" : [ip list],
  // supports only non-proxied connections, m only, English & French
  "123-13 ,en.m,fr.m" : [ip list],
  // supports IP-based whitelisting on non-proxied connections (with HTTPS), and m only for English & French when coming via the Opera proxy (also with HTTPS)
  "123-14 ,* +ssl,* Opera,en.m,fr.m Opera+ssl,en.m,fr.m" : [ip list],
}
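To make the format concrete, here is a minimal parsing sketch, assuming the space-and-comma layout described above. The function name parse_magic is borrowed from the earlier pseudocode, but the return shape (an ID plus a dictionary keyed by connection type) is an assumption of this example, not part of the proposal.

```python
def parse_magic(magic):
    """Split a magic string into (carrier_id, {connection: [domain specs]}).

    The connection key is "" for direct traffic, "+ssl" for direct HTTPS,
    or a proxy name optionally followed by "+ssl" (e.g. "Opera+ssl").
    """
    # The first space-separated token is the X-CS ID; the rest are groups.
    carrier_id, *groups = magic.split(" ")
    connections = {}
    for group in groups:
        # Each group is "<proxy[+ssl]>,<spec>,<spec>,..."
        connection, *specs = group.split(",")
        connections[connection] = specs
    return carrier_id, connections


carrier_id, connections = parse_magic(
    "123-14 ,* +ssl,* Opera,en.m,fr.m Opera+ssl,en.m,fr.m"
)
```

For the last example above, this yields the ID "123-14" and four connection keys: "", "+ssl", "Opera" and "Opera+ssl".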

Varnish implementation

Pseudo-logic for implementing the above spec in Varnish:

// Initialize values: "proxy" (string),
//                    "ssl" (bool),
//                    "domain" (string - e.g. 'wikipedia.org')
//                    "subdomain" (string - 'm' or 'zero')
//                    "language" (string - 'en', 'fr', etc; empty for bare m.wikipedia.org)
// get "carrier" (string) from netmapper ip lookup
// "carrier" is in the format described above

if (ssl) proxy = proxy + "+ssl"; // Proxy could be an empty string
proxy = " " + proxy + ",";

// In "carrier", find substring that starts with content of "proxy" and ends in space|EOL
char *info = strstr(carrier, proxy);
if (info) {
    bool isZero = false;

    // point info to the beginning of the list of languages, including the first comma
    info += strlen(proxy) - 1;
    // create a string copy with content beginning at "info" and running until the next space
    // (or to the end of the string if there is none), and set info to point to it
    char *end = strchr(info, ' ');
    info = substring(info, 0, end ? (end - info) : strlen(info));
    
    // info could be ',*' or a list of ",*.subdomain", or ",language.subdomain"
    if (info == ",*") {
        isZero = true;
    } else if (domain == "wikipedia.org") {
        if (language == "") {
            // Special case - request to 'm.wikipedia.org' or 'zero.wikipedia.org'
            // If there are any languages for the same subdomain, allow it
            isZero = strstr(info, "." + subdomain) != NULL;
        } else {
            // check if either *.subdomain or language.subdomain is listed
            isZero =
                strstr(info, ",*." + subdomain) != NULL ||
                strstr(info, "," + language + "." + subdomain) != NULL;
        }
    }

    if (isZero) {
        // Extract the first value (X-CS ID) from the carrier string
        set header.X-CS = subs(carrier, '([^ ]+)');
    }
}
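The same decision can be sketched in runnable form. The Python version below is an illustration under stated assumptions, not the Varnish implementation: the function name zero_carrier_id and its parameters are hypothetical, and it compares whole list entries instead of using the substring search of the pseudocode, which avoids accidental partial matches.

```python
def zero_carrier_id(magic, proxy, ssl, domain, subdomain, language):
    """Return the X-CS ID if the request qualifies as Zero, else None.

    magic is the string returned by the IP lookup; proxy is "" for a
    direct connection; language is "" for bare m./zero.wikipedia.org.
    """
    carrier_id, *groups = magic.split(" ")
    connection = proxy + ("+ssl" if ssl else "")
    for group in groups:
        key, *specs = group.split(",")
        if key != connection:
            continue
        if specs == ["*"]:
            # IP-whitelisting: every domain and language qualifies.
            return carrier_id
        if domain != "wikipedia.org":
            return None
        if language == "":
            # Bare subdomain: allowed if any entry targets this subdomain.
            ok = any(spec.endswith("." + subdomain) for spec in specs)
        else:
            ok = ("*." + subdomain) in specs or (language + "." + subdomain) in specs
        return carrier_id if ok else None
    # No group matches this proxy/ssl combination.
    return None


magic = "123-14 ,* +ssl,* Opera,en.m,fr.m Opera+ssl,en.m,fr.m"
direct = zero_carrier_id(magic, "", False, "wikiquote.org", "", "en")
opera_fr = zero_carrier_id(magic, "Opera", False, "wikipedia.org", "m", "fr")
opera_zero = zero_carrier_id(magic, "Opera", True, "wikipedia.org", "zero", "en")
```

Here direct and opera_fr both resolve to the carrier ID, while opera_zero is None because the Opera groups list only m entries.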