Manual:Varnish caching

From MediaWiki.org

Varnish is a lightweight, efficient reverse proxy server[1] which reduces the time taken to serve often-requested pages.

Like Squid, Varnish is an HTTP accelerator which stores copies of the pages served by the web server. The next time the same page is requested, Varnish will serve the copy instead of requesting the page from the Apache server. This caching process removes the need for MediaWiki to regenerate that same page again, resulting in a tremendous performance boost.[2]

Varnish has the advantage of being designed specifically for use as an HTTP accelerator (reverse proxy). It stores much of its cached data in memory, creating fewer disk files and fewer accesses to the filesystem than the larger, more multi-purpose Squid package. Like Squid, it serves often-requested pages to anonymous-IP users from cache instead of requesting them from the origin web server. This reduces both CPU usage and database access by the base MediaWiki server.

Because of this performance gain, MediaWiki has been designed to integrate closely with a web cache and will notify Squid or Varnish when a page should be purged from the cache in order to be regenerated.

From MediaWiki's point of view, a correctly-configured Varnish installation is interchangeable with its Squid counterpart.

The architecture

An example setup of Varnish, Apache and MediaWiki on a single server is outlined below. A more complex caching strategy may use multiple web servers behind the same Varnish caches (all of which can be made to appear to be a single host) or use independent servers to deliver wiki or image content.

Outside world <---> [ Server: Varnish accelerator (w.x.y.z:80) <---> Apache webserver (127.0.0.1:80) ]

To the outside world, Varnish appears to act as the web server. In reality it passes on requests to the Apache web server, but only when necessary. An Apache running on the same server only listens to requests from localhost (127.0.0.1) while Varnish only listens to requests on the server's external IP address. Both services run on port 80 without conflict as each is bound to different IP addresses.
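
The no-conflict condition can be checked mechanically. The following is a minimal sketch, not part of Varnish or Apache: `listeners_conflict` is a hypothetical helper that takes two IPv4 `address:port` pairs and reports whether they would compete for the same socket. It assumes that a wildcard address (`0.0.0.0` or `*`, as produced by a bare `Listen 80` in Apache) claims the port on every address.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if two IPv4 "address:port"
# listeners would collide, fail (exit 1) if they can coexist.
listeners_conflict() {
    a_ip=${1%:*}; a_port=${1##*:}
    b_ip=${2%:*}; b_port=${2##*:}

    # Different ports never collide.
    [ "$a_port" = "$b_port" ] || return 1

    # Same port and same address always collide.
    [ "$a_ip" = "$b_ip" ] && return 0

    # Same port, and one side is a wildcard bind: collision.
    for ip in "$a_ip" "$b_ip"; do
        case "$ip" in
            0.0.0.0|'*') return 0 ;;
        esac
    done
    return 1
}

# Usage (addresses are placeholders):
#   listeners_conflict 192.0.2.10:80 127.0.0.1:80 || echo "layout is fine"
```

If the check reports a conflict for the Apache side, restricting Apache to `Listen 127.0.0.1:80` is the usual way to arrive at the layout described above.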

Configuring Varnish 2.x

/etc/sysconfig/varnish

This is the first configuration file loaded by Varnish on startup. It specifies the amount of memory to be allocated to the Varnish cache, the location of the main (*.vcl) configurations and the specific IP addresses to which Varnish must respond.

The remainder of the configuration data, including the address of the backend server(s), is listed in the main *.vcl file - not here.

# Maximum number of open files
NFILES=131072
 
# Locked shared memory, default log size is 82MB + header
MEMLOCK=82000
 
## Configuration with VCL
#
# Listen on port 80, administration on localhost:6082, and forward to
# one content server selected by the vcl file, based on the request.  Use a
# fixed-size cache file.
#
# Note: you must replace "example.org" with the outside IP address of your server
# - this is the address at which Varnish receives incoming requests.
# $wgSquidServers in MediaWiki's LocalSettings.php will also need to list all addresses for this Varnish cache.
#
DAEMON_OPTS="-a example.org:80 \
             -T localhost:6082 \
             -f /etc/varnish/default.vcl \
             -u varnish -g varnish \
             -s file,/var/lib/varnish/varnish_storage.bin,4G"

If your server is to listen on more than one outside address (as will almost always be the case if you offer IPv6 support alongside IPv4), use commas to separate each individual outside address. Enclose any numeric IPv6 addresses in square brackets in this format:

DAEMON_OPTS="-a 192.170.2.1:80,[2001:db8::2]:80 \
             -T localhost:6082 \
             -f /etc/varnish/default.vcl \
             -u varnish -g varnish \
             -s file,/var/lib/varnish/varnish_storage.bin,4G"

Note that Varnish version 2.1 or later is required to enable inbound IPv6 connections.
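
The bracket-and-comma rule for the `-a` option is easy to get wrong by hand when several addresses are involved. As a sketch only, the hypothetical function `listen_arg` below builds the value from a port and a list of addresses, treating any address that contains a colon as numeric IPv6. The addresses in the usage line are the same placeholders used above.

```shell
#!/bin/sh
# Hypothetical helper: print a comma-separated value for varnishd's -a
# option. $1 is the port; the remaining arguments are addresses.
listen_arg() {
    port=$1; shift
    out=""
    for addr in "$@"; do
        case "$addr" in
            *:*) item="[$addr]:$port" ;;   # contains a colon: numeric IPv6
            *)   item="$addr:$port" ;;     # IPv4 address or host name
        esac
        out="${out:+$out,}$item"           # add a comma between items only
    done
    printf '%s\n' "$out"
}

# Usage:
#   DAEMON_OPTS="-a $(listen_arg 80 192.170.2.1 2001:db8::2) ..."
```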

A sample mediawiki.vcl

The address(es) of the backend server(s) must be specified here. In a simple installation, one server (localhost) is sufficient. Larger sites may operate multiple wiki or image servers behind a single Varnish cache[3]:

# set default backend if no server cluster specified
backend default {
        .host = "localhost";
        .port = "80"; }
 
# create a round-robin director: "apaches" uses wiki1 and wiki2 as backend servers.
director apaches round-robin {
  { .backend = { .host = "wiki1"; .port = "80"; } }
  { .backend = { .host = "wiki2"; .port = "80"; } } }
 
# access control list for "purge": open to only localhost and other local nodes
acl purge {
        "localhost";
        "wiki1";
        "wiki2";
        "image1";
}

If more than one backend webserver is available, a list of servers to be used may be selected here on a per-domain basis. This could allow multiple, relatively powerful servers to be used to respond to wiki page text requests while requests for static images are handled on a local web server. A simple one-server installation would simply pass all unhandled requests to the default web server.

Any request other than a simple GET or HEAD is passed directly through to the web server, along with all requests from logged-in users.

Most common browsers support compression (gzip or deflate) of returned pages. While Varnish itself performs no compression, it is configured here to store separate copies of a page depending on whether the user's browser supports compression.[4] If a browser accepts both gzip and deflate, the gzip version of the page is preferred. The browser's reported capabilities are checked here and the gzipped version of pages is served wherever possible.

# vcl_recv is called whenever a request is received 
sub vcl_recv {
        # Serve objects up to 2 minutes past their expiry if the backend
        # is slow to respond.
        set req.grace = 120s;
 
        # Use our round-robin "apaches" cluster for the backend.
        if (req.http.host ~ "^images.example.org$") 
           {set req.backend = default;}
        else
           {set req.backend = apaches;}
 
        # This uses the ACL action called "purge". Basically if a request to
        # PURGE the cache comes from anywhere other than localhost, ignore it.
        if (req.request == "PURGE") 
            {if (!client.ip ~ purge)
                {error 405 "Not allowed.";}
            lookup;}
 
        # Pass any requests that Varnish does not understand straight to the backend.
        if (req.request != "GET" && req.request != "HEAD" &&
            req.request != "PUT" && req.request != "POST" &&
            req.request != "TRACE" && req.request != "OPTIONS" &&
            req.request != "DELETE") 
            {pipe;}     /* Non-RFC2616 or CONNECT which is weird. */
 
        # Pass anything other than GET and HEAD directly.
        if (req.request != "GET" && req.request != "HEAD")
           {pass;}      /* We only deal with GET and HEAD by default */
 
        # Pass requests from logged-in users directly.
        if (req.http.Authorization || req.http.Cookie)
           {pass;}      /* Not cacheable by default */
 
        # Pass any requests with the "If-None-Match" header directly.
        if (req.http.If-None-Match)
           {pass;}
 
        # Force lookup if the request is a no-cache request from the client.
        if (req.http.Cache-Control ~ "no-cache")
           {purge_url(req.url);}
 
        # normalize Accept-Encoding to reduce vary
        if (req.http.Accept-Encoding) {
          if (req.http.User-Agent ~ "MSIE 6") {
            unset req.http.Accept-Encoding;
          } elsif (req.http.Accept-Encoding ~ "gzip") {
            set req.http.Accept-Encoding = "gzip";
          } elsif (req.http.Accept-Encoding ~ "deflate") {
            set req.http.Accept-Encoding = "deflate";
          } else {
            unset req.http.Accept-Encoding;
          }
        }
 
        lookup;
}
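
To see why the Accept-Encoding step matters, it helps to trace how many distinct header values survive it. The function below is a shell model of that one step, written for illustration only; it is not how Varnish evaluates VCL, and `normalize_accept_encoding` is a hypothetical name. Whatever a browser sends, the result is one of three variants: `gzip`, `deflate`, or no header at all (printed here as an empty line of output).

```shell
#!/bin/sh
# Illustrative model of the Accept-Encoding normalization in vcl_recv.
# $1 = Accept-Encoding header value, $2 = User-Agent header value.
# Prints the normalized header value, or nothing if the header is removed.
normalize_accept_encoding() {
    # No header sent: nothing to normalize.
    [ -n "$1" ] || return 0

    # MSIE 6 is not trusted with compressed responses: header removed.
    case "$2" in
        *"MSIE 6"*) return 0 ;;
    esac

    # Prefer gzip, then deflate; anything else is dropped.
    case "$1" in
        *gzip*)    echo gzip ;;
        *deflate*) echo deflate ;;
        *)         : ;;
    esac
}

# Usage:
#   normalize_accept_encoding "gzip, deflate, br" "Mozilla/5.0"
```

Collapsing the header this way keeps the number of stored copies per page at three or fewer, rather than one per distinct header string.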

Varnish must pass the user's IP address in the X-Forwarded-For header, so that MediaWiki can be configured to display the user's address in Special:RecentChanges instead of Varnish's local IP address.

sub vcl_pipe {
        # Note that only the first request to the backend will have
        # X-Forwarded-For set.  If you use X-Forwarded-For and want to
        # have it set for all requests, make sure to have:
        # set req.http.connection = "close";
 
        # This is otherwise not necessary if you do not do any request rewriting.
 
        set req.http.connection = "close";
}

Varnish must be configured to allow a PURGE request from MediaWiki, instructing the cache to discard stored copies of pages which have been modified by user edits. These requests normally originate only from wiki servers within the local site.

If the page is not in the cache, a 200 (success) code is still returned as the objective is to remove the outdated page from the cache.

# Called if the cache has a copy of the page.
sub vcl_hit {
        if (req.request == "PURGE") 
            {purge_url(req.url);
            error 200 "Purged";}
 
        if (!obj.cacheable)
           {pass;}
}
 
# Called if the cache does not have a copy of the page.
sub vcl_miss {
        if (req.request == "PURGE") 
           {error 200 "Not in cache";}
}
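
The PURGE handling can be exercised by hand from a host listed in the `purge` ACL. The sketch below wraps curl for that purpose; `purge_page`, `VARNISH_ADDR` and `WIKI_HOST` are names chosen for this example, and `example.org` is the same placeholder used in the configuration above.

```shell
#!/bin/sh
# Sketch: send an HTTP PURGE for one URL path and print the status code.
# VARNISH_ADDR and WIKI_HOST are assumptions; replace them with the address
# Varnish listens on and the host name of the wiki.
VARNISH_ADDR="${VARNISH_ADDR:-example.org:80}"
WIKI_HOST="${WIKI_HOST:-example.org}"
CURL="${CURL:-curl}"

purge_page() {
    # $1 = URL path as served by the wiki, e.g. /wiki/Main_Page
    "$CURL" -s -o /dev/null -w '%{http_code}\n' \
        -X PURGE -H "Host: $WIKI_HOST" "http://$VARNISH_ADDR$1"
}

# Usage:
#   purge_page /wiki/Main_Page
```

With the VCL shown above, the expected status is 200 whether or not the page was cached, and 405 when the request comes from an address outside the ACL.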

The web server may set default expiry times for various objects. As MediaWiki will indicate (via a PURGE request) when a page has been edited and therefore needs to be discarded from cache, the Apache-reported defaults for expiry time are best ignored or replaced with a significantly-longer expiry time.

Pages served to logged-in users (identified by MediaWiki setting browser cookies) or which require passwords to access are never cached.

In this example, the 'no-cache' flag is being ignored on pages served to anonymous-IP users. Such measures normally are only needed if a wiki is making extensive use of extensions which add this flag indiscriminately (such as a wiki packed with random <choose>/<option> Algorithm tags on the main page and various often-used templates).

# Called after a document has been successfully retrieved from the backend.
sub vcl_fetch {
 
        # set minimum timeouts to auto-discard stored objects
#       set beresp.prefetch = -30s;
        set beresp.grace = 120s;
 
        if (beresp.ttl < 48h) {
          set beresp.ttl = 48h;}
 
        if (!beresp.cacheable) 
            {pass;}
 
        if (beresp.http.Set-Cookie) 
            {pass;}
 
#       if (beresp.http.Cache-Control ~ "(private|no-cache|no-store)") 
#           {pass;}
 
        if (req.http.Authorization && !beresp.http.Cache-Control ~ "public") 
            {pass;}
 
}

Configuring Varnish 3.x

The VCL language changed considerably between versions 2.x and 3.x, so a 2.x configuration will not load unmodified. The code block below is the configuration from the previous section adapted for Varnish 3.x. It is also reduced to a single backend server.

# set default backend if no server cluster specified
backend default {
    .host = "127.0.0.1";
    .port = "8080";
    # .port = "80" led to issues with competing for the port with apache.
}
 
# access control list for "purge": open to only localhost and other local nodes
acl purge {
    "127.0.0.1";
}
 
# vcl_recv is called whenever a request is received 
sub vcl_recv {
        # Serve objects up to 2 minutes past their expiry if the backend
        # is slow to respond.
        set req.grace = 120s;
        set req.http.X-Forwarded-For = client.ip;
        set req.backend = default;
 
        # This uses the ACL action called "purge". Basically if a request to
        # PURGE the cache comes from anywhere other than localhost, ignore it.
        if (req.request == "PURGE") 
            {if (!client.ip ~ purge)
                {error 405 "Not allowed.";}
            return(lookup);}
 
        # Pass any requests that Varnish does not understand straight to the backend.
        if (req.request != "GET" && req.request != "HEAD" &&
            req.request != "PUT" && req.request != "POST" &&
            req.request != "TRACE" && req.request != "OPTIONS" &&
            req.request != "DELETE") 
            {return(pipe);}     /* Non-RFC2616 or CONNECT which is weird. */
 
        # Pass anything other than GET and HEAD directly.
        if (req.request != "GET" && req.request != "HEAD")
           {return(pass);}      /* We only deal with GET and HEAD by default */
 
        # Pass requests from logged-in users directly.
        if (req.http.Authorization || req.http.Cookie)
           {return(pass);}      /* Not cacheable by default */
 
        # Pass any requests with the "If-None-Match" header directly.
        if (req.http.If-None-Match)
           {return(pass);}
 
        # Force lookup if the request is a no-cache request from the client.
        if (req.http.Cache-Control ~ "no-cache")
           {ban_url(req.url);}
 
        # normalize Accept-Encoding to reduce vary
        if (req.http.Accept-Encoding) {
          if (req.http.User-Agent ~ "MSIE 6") {
            unset req.http.Accept-Encoding;
          } elsif (req.http.Accept-Encoding ~ "gzip") {
            set req.http.Accept-Encoding = "gzip";
          } elsif (req.http.Accept-Encoding ~ "deflate") {
            set req.http.Accept-Encoding = "deflate";
          } else {
            unset req.http.Accept-Encoding;
          }
        }
 
        return(lookup);
}
 
sub vcl_pipe {
        # Note that only the first request to the backend will have
        # X-Forwarded-For set.  If you use X-Forwarded-For and want to
        # have it set for all requests, make sure to have:
        # set req.http.connection = "close";
 
        # This is otherwise not necessary if you do not do any request rewriting.
 
        set req.http.connection = "close";
}
 
# Called if the cache has a copy of the page.
sub vcl_hit {
        if (req.request == "PURGE") 
            {ban_url(req.url);
            error 200 "Purged";}
 
        if (!obj.ttl > 0s)
           {return(pass);}
}
 
# Called if the cache does not have a copy of the page.
sub vcl_miss {
        if (req.request == "PURGE") 
           {error 200 "Not in cache";}
}
 
# Called after a document has been successfully retrieved from the backend.
sub vcl_fetch {
 
        # set minimum timeouts to auto-discard stored objects
#       set beresp.prefetch = -30s;
        set beresp.grace = 120s;
 
        if (beresp.ttl < 48h) {
          set beresp.ttl = 48h;}
 
        if (!beresp.ttl > 0s) 
            {return(hit_for_pass);}
 
        if (beresp.http.Set-Cookie) 
            {return(hit_for_pass);}
 
#       if (beresp.http.Cache-Control ~ "(private|no-cache|no-store)") 
#           {return(hit_for_pass);}
 
        if (req.http.Authorization && !beresp.http.Cache-Control ~ "public") 
            {return(hit_for_pass);}
 
}


Configuring Varnish 4.x

For more information go to https://www.varnish-cache.org/docs/4.0/whats-new/upgrading.html.

vcl 4.0;
# set default backend if no server cluster specified
backend default {
    .host = "127.0.0.1";
    .port = "8080";
    # .port = "80" led to issues with competing for the port with apache.
}
 
# access control list for "purge": open to only localhost and other local nodes
acl purge {
    "127.0.0.1";
}
 
# vcl_recv is called whenever a request is received 
sub vcl_recv {
        # Serve objects up to 2 minutes past their expiry if the backend
        # is slow to respond.
        #set req.grace = 120s;
        set req.http.X-Forwarded-For = client.ip;
        set req.backend_hint= default;
 
        # This uses the ACL action called "purge". Basically if a request to
        # PURGE the cache comes from anywhere other than localhost, ignore it.
        if (req.method == "PURGE") 
            {if (!client.ip ~ purge)
                {return(synth(405,"Not allowed."));}
            return(hash);}
 
        # Pass any requests that Varnish does not understand straight to the backend.
        if (req.method != "GET" && req.method != "HEAD" &&
            req.method != "PUT" && req.method != "POST" &&
            req.method != "TRACE" && req.method != "OPTIONS" &&
            req.method != "DELETE") 
            {return(pipe);}     /* Non-RFC2616 or CONNECT which is weird. */
 
        # Pass anything other than GET and HEAD directly.
        if (req.method != "GET" && req.method != "HEAD")
           {return(pass);}      /* We only deal with GET and HEAD by default */
 
        # Pass requests from logged-in users directly.
        if (req.http.Authorization || req.http.Cookie)
           {return(pass);}      /* Not cacheable by default */
 
        # Pass any requests with the "If-None-Match" header directly.
        if (req.http.If-None-Match)
           {return(pass);}
 
        # Force lookup if the request is a no-cache request from the client.
        # In VCL 4.0, ban() takes a ban expression string, not a bare URL.
        if (req.http.Cache-Control ~ "no-cache")
           {ban("req.url ~ " + req.url);}
 
        # normalize Accept-Encoding to reduce vary
        if (req.http.Accept-Encoding) {
          if (req.http.User-Agent ~ "MSIE 6") {
            unset req.http.Accept-Encoding;
          } elsif (req.http.Accept-Encoding ~ "gzip") {
            set req.http.Accept-Encoding = "gzip";
          } elsif (req.http.Accept-Encoding ~ "deflate") {
            set req.http.Accept-Encoding = "deflate";
          } else {
            unset req.http.Accept-Encoding;
          }
        }
 
        return(hash);
}
 
sub vcl_pipe {
        # Note that only the first request to the backend will have
        # X-Forwarded-For set.  If you use X-Forwarded-For and want to
        # have it set for all requests, make sure to have:
        # set req.http.connection = "close";
 
        # This is otherwise not necessary if you do not do any request rewriting.
 
        set req.http.connection = "close";
}
 
# Called if the cache has a copy of the page.
sub vcl_hit {
        if (req.method == "PURGE") 
            {ban("req.url ~ " + req.url);   # ban() takes a ban expression string
            return(synth(200,"Purged"));}
 
        if (!obj.ttl > 0s)
           {return(pass);}
}
 
# Called if the cache does not have a copy of the page.
sub vcl_miss {
        if (req.method == "PURGE") 
           {return(synth(200,"Not in cache"));}
}
 
# Called after a document has been successfully retrieved from the backend.
sub vcl_backend_response {
 
        # set minimum timeouts to auto-discard stored objects
#       set beresp.prefetch = -30s;
        set beresp.grace = 120s;
 
        # Raise short backend expiry times to 48 hours, as in the 2.x and 3.x examples.
        if (beresp.ttl < 48h) {
          set beresp.ttl = 48h;
        }
 
        if (!beresp.ttl > 0s) {
          set beresp.uncacheable = true;
          return (deliver);
        }
 
        if (beresp.http.Set-Cookie) {
          set beresp.uncacheable = true;
          return (deliver);
        }
 
#       if (beresp.http.Cache-Control ~ "(private|no-cache|no-store)") {
#          set beresp.uncacheable = true;
#          return (deliver);
#        }
 
        if (bereq.http.Authorization && !beresp.http.Cache-Control ~ "public") {
          set beresp.uncacheable = true;
          return (deliver);
        }
 
        return (deliver);
}
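
Because the three configurations above are not interchangeable, it is worth confirming which Varnish version is installed before copying one of them. The sketch below assumes that `varnishd -V` prints a line of the form `varnishd (varnish-4.0.3 revision ...)`; `varnish_major` and `config_section` are hypothetical helper names, and the version strings in the comments are illustrative only.

```shell
#!/bin/sh
# Sketch: map the output of "varnishd -V" to the matching section above.

varnish_major() {
    # $1 = text printed by "varnishd -V"; prints the major version number,
    # or nothing if the text does not contain "varnish-<digits>.".
    printf '%s\n' "$1" |
        sed -n 's/.*varnish-\([0-9][0-9]*\)\..*/\1/p' |
        head -n 1
}

config_section() {
    case "$(varnish_major "$1")" in
        2) echo "Configuring Varnish 2.x" ;;
        3) echo "Configuring Varnish 3.x" ;;
        4) echo "Configuring Varnish 4.x" ;;
        *) echo "not covered here" ;;
    esac
}

# Usage (varnishd writes its version banner to stderr):
#   config_section "$(varnishd -V 2>&1)"
```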

Configuring MediaWiki

Since Varnish is doing the requests from localhost, Apache will receive "127.0.0.1" as the direct remote address. However, as Varnish forwards requests to Apache, it is configured to add the "X-Forwarded-For" header so that the remote address from the outside world is preserved. MediaWiki must be configured to use the "X-Forwarded-For" header in order to correctly display user addresses in special:recentchanges.

The required configuration is the same for Squid as for Varnish. Make sure the LocalSettings.php file contains the following lines:

$wgUseSquid = true;
$wgSquidServers = array( '127.0.0.1', 'example.org' );
$wgUsePrivateIPs = true;
//Use $wgSquidServersNoPurge if you don't want MediaWiki to purge modified pages
//$wgSquidServersNoPurge = array('127.0.0.1');

Be sure to replace 'example.org' with the IP address on which your Varnish cache is listening. These settings serve two purposes:

  • If a request is received from the Varnish cache server, the MediaWiki logs need to display the IP address of the user, not that of Varnish. A special:recentchanges in which every edit is reported as '127.0.0.1' is all but useless; listing that address as a Squid/Varnish server tells MediaWiki to ignore the IP address and instead look at the 'x-forwarded-for' header for the user's IP.
  • If a page or image is changed on the wiki, MediaWiki will send notification to every server listed in $wgSquidServers telling it to discard (purge) the outdated stored page.

Use $wgSquidServersNoPurge for addresses which need to be kept out of recentchanges, but which do not receive HTTP PURGE messages. For instance, if Apache and Squid are respectively on 127.0.0.1 and an external address on the same machine, there's no need to send Apache a "purge" message intended for Squid. Likewise, if Squid is listening to multiple addresses, only send "purge" to one of them.

See also Squid configuration settings for all settings related to Squid/Varnish caching.
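
Once both sides are configured, a quick way to confirm that anonymous page views are served from cache is to look at the response headers. By default Varnish adds an `X-Varnish` header, which carries one transaction ID on a cache miss and two on a hit. The sketch below classifies a response on that basis; `cache_status` is a hypothetical helper, and a VCL that removes or rewrites the header would make it report "unknown".

```shell
#!/bin/sh
# Sketch: read HTTP response headers on stdin and print "hit", "miss"
# or "unknown", based on the number of IDs in the X-Varnish header.
cache_status() {
    tr -d '\r' | awk '
        tolower($1) == "x-varnish:" { found = 1; ids = NF - 1 }
        END {
            if (!found)        print "unknown"
            else if (ids >= 2) print "hit"
            else               print "miss"
        }'
}

# Usage (URL is a placeholder; send no cookies, since requests with
# cookies are passed straight to the backend):
#   curl -sI http://example.org/wiki/Main_Page | cache_status
```

Requesting the same page twice should normally report "miss" and then "hit"; after an edit to that page, the next request should report "miss" again if the PURGE notification arrived.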

Some notes

As most of the traffic is handled by the Varnish cache, a statistics package[5] will not give meaningful data if configured to analyse Apache's access_log. There are packages available to log Varnish access data to a file for analysis if needed. Counters on individual wiki pages will also severely underestimate the number of views to each page (and to the site overall) if a web cache is deployed. Many large sites will turn off the counters with $wgDisableCounters.
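
One such tool ships with Varnish itself: `varnishncsa` writes the shared-memory log in Apache/NCSA combined format, which most statistics packages can read. The wrapper below is a sketch under the assumption that the installed version supports the `-a` (append), `-w` (output file), `-D` (daemonize) and `-P` (PID file) options; check `man varnishncsa` for your version. The function name and paths are placeholders.

```shell
#!/bin/sh
# Sketch: start varnishncsa as a daemon that appends to an access log.
VARNISHNCSA="${VARNISHNCSA:-varnishncsa}"

start_access_log() {
    # $1 = log file to append to, $2 = PID file
    "$VARNISHNCSA" -a -w "$1" -D -P "$2"
}

# Usage (paths are placeholders):
#   start_access_log /var/log/varnish/access.log /var/run/varnishncsa.pid
```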

The display of the user's IP address in the user interface must also be disabled by setting $wgShowIPinHeader = false;

Note that Varnish is an alternative to Squid, but does not replace other portions of a complete MediaWiki caching strategy such as:

Pre-compiled PHP code
The default behaviour of PHP under Apache is to load and interpret PHP web scripts each time they are accessed. Installation of a cache such as APC (yum install php-pecl-apc, then allocate memory by setting apc.shm_size=128 or better in /etc/php.d/apc.ini) can greatly reduce the amount of CPU time required by Apache to serve PHP content.
Localisation/Internationalisation
By default, MediaWiki (as of version 1.16+) will create a huge l10n_cache database table and access it constantly - possibly more than doubling the load on the database server after an "upgrade" to the latest MediaWiki version. Set $wgLocalisationCacheConf to force the localisation information to be stored to the file system to remedy this.
Variables and session data
Storing variable data such as the MediaWiki sidebar, the list of namespaces or the spam blacklist to a memory cache will substantially increase the speed of a MediaWiki installation. Forcing user login data to be stored in a common location is also essential to any installation in which multiple, interchangeable Apache servers are hidden behind the same Varnish caches to serve pages for the same wikis. Install the memcached package and set the following options in LocalSettings.php to force both user login information and cached variables to use memcache:
$wgMainCacheType = CACHE_MEMCACHED;
$wgMemCachedServers = array ( '127.0.0.1:11211' );
$wgSessionsInMemcached = true;
$wgUseMemCached = true;
Note that, if you have multiple servers, the localhost address needs to be replaced with that of the shared memcached server(s), which must be the same for all of the matching web servers at your site. This ensures that logging a user into one server in the cluster logs them into the wiki on all the interchangeable web servers.

In many cases, there are multiple alternative caching approaches which will produce the same result. See Manual:Cache.

Apache configuration

Log file

The Apache web server log, by default, shows only the address of the Varnish cache server, in this example "127.0.0.1:80".

Apache may be configured to log the original user's address by capturing "x-forwarded-for" information under a custom log file format.[6]

An example for Apache's httpd.conf to configure logging of x-forwarded-for is:

LogFormat "%{X-Forwarded-for}i %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" cached
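
With that format in place, the first field of each log line is the forwarded client address, so ordinary text tools can summarize it. The sketch below counts requests per client; `top_clients` is a hypothetical helper and the log path in the usage line is a placeholder. Only requests that actually reached Apache (cache misses and passed requests) appear in this log.

```shell
#!/bin/sh
# Sketch: read log lines in the "cached" format on stdin and print
# "count address" pairs, busiest client first.
top_clients() {
    awk '{
            sub(/,$/, "", $1)      # drop the comma if several proxies are listed
            hits[$1]++
         }
         END { for (ip in hits) print hits[ip], ip }' |
        sort -k1,1nr -k2,2
}

# Usage:
#   top_clients < /var/log/httpd/access_log | head
```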

Image hotlinking

If a site uses Apache's mod_rewrite to block attempts by other websites to hotlink images, this configuration will need to be removed and equivalent configuration added to Varnish's configuration files. Where an image server is located behind Varnish, typically 90% or more of common image requests never reach Apache and therefore will not be blocked by a "http_referer" check in Apache's configurations.

References

  1. https://www.varnish-cache.org/about
  2. http://www.aulinx.de/oss/code/wikipedia/[dead link]
  3. http://wikia.googlecode.com/svn/utils/varnishhtcpd/mediawiki.vcl
  4. https://www.varnish-cache.org/docs/trunk/users-guide/compression.html
  5. AWStats
  6. http://httpd.apache.org/docs-2.2/mod/mod_log_config.html