Rules of thumb for robust service infrastructure

This document is aiming to collect what we learned from incidents, and distill it into rules of thumb for

typical fault handling issues & solutions, as well as
procedures & tools for testing error handling before as well as after deployment.

The focus on error handling is based on the observation that this tends to be less tested, which often leads to surprises and outages. To avoid those, we need to be especially thorough in their design and testing.

Detecting & communicating faults[edit]

Error responses[edit]

HTTP provides standard return codes for signaling errors: 4xx for request errors, and 5xx for server-side errors not caused by the specific response. Services should always report errors with the appropriate status code. New services should encode details of the issue in a JSON body following the HTTP problem JSON standard.

The most common error response codes are:

400 Bad request; Example: validation errors, invalid parameter combinations etc.
403 Forbidden: Request not authenticated / permitted.
404 Not found
429: Rate exceeded

500 Internal Server Error: Catch-all error.
503 Service Unavailable + retry-after header: Service is temporarily unavailable. If provided, the retry-after response header signals that retries are permitted after waiting for the specified number of seconds.
504 Gateway Timeout: A proxy timed out while connecting to an upstream server.

Timeouts[edit]

In the real world, there are many situations where no response is ever received for a request. The only way to avoid waiting forever is to use reasonable timeouts. HTTP clients typically have two separate timeouts:

a connect timeout for the setup of the TCP connection (typically ~15 seconds), and
a request timeout, specifying how long to wait at most for a response, after a TCP connection has been established.

While a relatively short connect timeout is basically always the right choice, setting the right overall response timeout will require a bit more thought. Our main goal is to set timeouts in a way that lets us handle a detected failure in a meaningful way. This means that timeouts for sub-requests should always be shorter than what can be tolerated by a service's caller. Many of our services ultimately get their requests from a web browser. Most browsers time out requests after 300 seconds, but many users will lose patience after less than 60 seconds. If we go with this number as an upper bound for response times, we will need to set sub-request timeouts to less than 60 seconds, so that we can report an error before the client disconnects. Applied recursively to each service level, this leads to a stacking of timeouts, and establishes fairly short response time requirements for the lowest-level service. See T97204 for an RFC discussing this technique.

Handling faults[edit]

Retries[edit]

Many failures are temporary, which makes it tempting to automatically retry the request. The hope is that this will lead to a better user experience, and might even save overall processing time if we can avoid re-doing a lot of expensive work with a retry of a cheap sub-request.

Unfortunately, ill-considered retries can also quickly amplify a small issue into a site-wide outage by creating a storm of requests. This is a very real issue, and has lead to numerous real-life outages in our infrastructure [todo: link to incident reports]. Thus, our goal needs to be to bound the maximum work amplification from retries.

To do this, we need to be very thoughtful about when and how we retry.

When to retry[edit]

A few rules of thumb are:

Never, never retry by default.
Never retry 4xx responses. "Insanity is repeating the same mistakes and expecting different results."
Never retry if the cause of a failure is overload. Sadly, most systems under overload just fail to respond. In an ideal world, they would return a 503 response instead, with an explicit retry-after to signal when retrying will be safe again. Sadly, implementing this is not easy in practice, which is why this is very rare in our infrastructure.
Retrying TCP connection setup errors is generally safe, as TCP connection setup is very cheap.

Beyond these safe defaults (implemented by our request libraries like preq), we'll need to consider the worst-case load amplification factor to make decisions on whether retrying specific additional error conditions is safe, or not. Retrying only at a single layer in the request processing graph is fairly safe, as the amplification factor in that case equals the number of retries plus one. On the other hand, retrying five times at each layer of a request processing graph 4 layers deep can multiply a failure at the lowest level by a factor of (1+5)^4 = 1296! This is a loaded footgun waiting to go off.

As a general rule, amplification factors should ideally be 2, and at most 4. This means that you can normally only retry once, in at most two levels of the request processing graph. Choose wisely.

How to retry[edit]

While being careful about retrying at all is the most important factor in keeping the system stable overall, other parameters can have a huge impact on the maximum load we can experience from retries:

Number of retries: A single retry is sufficient in almost all cases. Think thrice about the amplification factor before using more than a single retry.
Retry delay: Delaying retries can help to bound the peak load caused by retries. It also increases the chance that a retry succeeds after a temporary outage. In most situations, the initial delay should be greater than the typical no-error response time. For example, if a median response time of a service is 200ms, then a good retry delay would be a second. Repeat retries should back off exponentially. This strategy is used by default in preq.

Rate limiting & circuit breakers[edit]

A more direct way to limit load is to apply active rate limiting to each client of a service. Rejected requests are cheap to process, which helps with quick and targeted load shedding. Rates limits should be enforced both externally (ex: Varnish, RESTBase), as well as internally, for each service. The service-runner library used in most node services supports rate limiting out of the box.

Circuit breakers are basically client-side rate limits applied to error responses from a service. When a contacted service is returning a high rate of errors, it is likely that subsequent requests will fail as well. The service is likely to recover more quickly if it is not swamped by requests. When reaching an error rate threshold, the circuit breaker will stop most or all requests to the failed service, speeding recovery. A low rate of errors might still be allowed through, so that the circuit breaker can reset itself once the service returns to normal operation. We hope to integrate circuit breaker functionality in the PHP & node HTTP client libraries soon.

Testing error handling: Injecting faults[edit]

The only way to discover weaknesses in failure & load handling before they become user-visible is to exercise both for real. We can exercise some failure scenarios in unit tests, but there will always be aspects that are only discovered in integration tests, at realistic loads. While such testing won't be perfect, we can go quite a long way by testing a few standard cases. Here are some ideas:

Take out random dependencies
- DB (individual nodes, all)
- other services
Slow down random dependencies
- Blackhole with a firewall (simulates hardware disappearing)
- Introduce packet loss?
Overload: Saturate service with requests, see how it copes.
Things to look for
- Memory usage
- Decent error returns?
  - Rate limiting working?
- Recovery

Unfortunately, larger-scale integration & load testing tends to be a fairly manual affair, which means that we can't do a full set of tests for each routine deploy. Tools like ChaosMonkeyhave been developed to automate taking out individual service instances with some probability, but as far as I am aware there aren't any that could be used as-is in our infrastructure. The interesting thing with such tools is that they are exercising fault handling in normal production, at regular intervals, and with a low amount of manual effort.

Automating fault injection[edit]

ChaosMonkey