Rules of thumb for robust service infrastructure

Error handling paths tend to be less tested, which often leads to surprises and outages. To avoid those, we need to be especially thorough in their design and testing.

This document aims to describe
 * typical fault handling challenges & solutions, as well as
 * procedures & tools for testing error handling both before and after deployment.

Error responses
HTTP provides standard return codes for signaling errors: 4xx for request errors, and 5xx for server-side errors not caused by the specific request. Services should always report errors with the appropriate status code. New services should encode details of the issue in a JSON body following the HTTP problem JSON standard (RFC 7807).
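
As an illustration, here is a minimal sketch of building a problem JSON error body in Python. The helper name and the tuple it returns are made up for the example; only the body fields and media type come from the standard.

```python
import json

# Hypothetical helper building an RFC 7807 ("problem+json") error response.
# The function name and return shape are illustrative, not a real framework API.
def problem_response(status: int, title: str, detail: str, type_uri: str = "about:blank"):
    body = {
        "type": type_uri,   # URI identifying the class of error
        "title": title,     # short, human-readable summary
        "status": status,   # mirrors the HTTP status code
        "detail": detail,   # explanation specific to this occurrence
    }
    headers = {"content-type": "application/problem+json"}
    return status, headers, json.dumps(body)

status, headers, body = problem_response(
    400, "Bad Request", "Parameter 'limit' must be a positive integer")
```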

The most common error response codes are:
 * 400 Bad Request: Validation errors, invalid parameter combinations, etc.
 * 403 Forbidden: Request not authenticated or not permitted.
 * 404 Not Found: The requested resource does not exist.
 * 429 Too Many Requests: Rate limit exceeded.

 * 500 Internal Server Error: Catch-all server-side error.
 * 503 Service Unavailable + retry-after header: Service is temporarily unavailable. If provided, the retry-after response header signals that retries are permitted after waiting for the specified number of seconds.
 * 504 Gateway Timeout: A proxy timed out while waiting for an upstream server.

Timeouts
In the real world, there are many situations where no response is ever received for a request. The only way to avoid waiting forever is to use reasonable timeouts. HTTP clients typically have two separate timeouts:
 * a connect timeout for the setup of the TCP connection (typically ~15 seconds), and
 * a request timeout, specifying how long to wait at most for a response after a TCP connection has been established.

While a relatively short connect timeout is basically always the right choice, setting the right overall response timeout requires a bit more thought. Our main goal is to set timeouts in a way that lets us handle a detected failure in a meaningful way. This means that timeouts for sub-requests should always be shorter than what the service's caller can tolerate. Many of our services ultimately get their requests from a web browser. Most browsers time out requests after 300 seconds, but many users will lose patience in less than 60 seconds. If we take 60 seconds as an upper bound for response times, we need to set sub-request timeouts to less than 60 seconds, so that we can report an error before the client disconnects. Applied recursively at each service level, this leads to a stacking of timeouts, and establishes fairly short response time requirements for the lowest-level service. See T97204 for an RFC discussing this technique.
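
The stacking of timeouts described above can be sketched as a simple budget calculation. The 20% per-layer margin below is an assumption for illustration, not a recommendation from this document:

```python
def stacked_timeouts(top_budget: float, depth: int, margin: float = 0.2) -> list[float]:
    """Derive per-layer sub-request timeouts so that each layer can still
    report a meaningful error before its own caller gives up."""
    budgets = []
    budget = top_budget
    for _ in range(depth):
        budget *= 1 - margin  # leave headroom for error handling at this layer
        budgets.append(round(budget, 1))
    return budgets

# With a 60-second browser budget and four service layers:
stacked_timeouts(60, 4)  # → [48.0, 38.4, 30.7, 24.6]
```

Note how even a modest margin compounds: the lowest-level service ends up with well under half the top-level budget.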

Retries
Many failures are temporary, which makes it tempting to automatically retry the request. The hope is that this will lead to a better user experience, and might even save overall processing time if we can avoid re-doing a lot of expensive work with a retry of a cheap sub-request.

Unfortunately, ill-considered retries can also quickly amplify a small issue into a site-wide outage by creating a storm of requests. This is a very real issue, and has led to numerous real-life outages in our infrastructure [todo: link to incident reports]. Thus, our goal needs to be to bound the maximum work amplification from retries.

To do this, we need to be very thoughtful about when and how we retry.

When to retry
A few rules of thumb are:
 * Never, never retry by default.
 * Never retry 4xx responses. "Insanity is repeating the same mistakes and expecting different results."
 * Never retry if the cause of a failure is overload. Sadly, most systems under overload simply fail to respond. In an ideal world, they would return a 503 response instead, with an explicit retry-after header signaling when retrying will be safe again. Implementing this is not easy in practice, which is why it is rare in our infrastructure.
 * Retrying TCP connection setup errors is generally safe, as TCP connection setup is very cheap.

Beyond these safe defaults (implemented by our request libraries like preq), we need to consider the worst-case load amplification factor to decide whether retrying specific additional error conditions is safe. Retrying at only a single layer in the request processing graph is fairly safe, as the amplification factor then equals the number of retries plus one. On the other hand, retrying five times at each layer of a request processing graph four layers deep can multiply a failure at the lowest level by a factor of (1+5)^4 = 1296! This is a loaded footgun waiting to go off.

As a general rule, the amplification factor should ideally be at most 2, and never more than 4. This means that you can normally retry only once, in at most two levels of the request processing graph. Choose wisely.
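
To make the amplification arithmetic concrete, a tiny illustrative helper (the function is made up for this example):

```python
def worst_case_amplification(retries_per_layer):
    """Worst-case load multiplier for a failure at the lowest level:
    each layer multiplies the load by (retries + 1)."""
    factor = 1
    for retries in retries_per_layer:
        factor *= retries + 1
    return factor

# Five retries at each of four layers: the footgun case.
worst_case_amplification([5, 5, 5, 5])  # → 1296

# One retry at each of two layers, none elsewhere: the recommended ceiling.
worst_case_amplification([1, 1, 0, 0])  # → 4
```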

How to retry
While being careful about retrying at all is the most important factor in keeping the system stable overall, other parameters can have a huge impact on the maximum load we can experience from retries:
 * Number of retries: A single retry is sufficient in almost all cases. Think thrice about the amplification factor before using more than a single retry.
 * Retry delay: Delaying retries can help to bound the peak load caused by retries. It also increases the chance that a retry succeeds after a temporary outage. In most situations, the initial delay should be greater than the typical no-error response time. For example, if the median response time of a service is 200ms, a good retry delay would be one second. Repeated retries should back off exponentially. This strategy is used by default in preq.
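
A sketch of this strategy follows. It is a generic single-retry-with-exponential-backoff helper, not preq's actual implementation, and `TransientError` stands in for whatever your client treats as retryable:

```python
import time

class TransientError(Exception):
    """Stand-in for a retryable failure, e.g. a TCP connect error."""

def retry_with_backoff(fn, retries=1, base_delay=1.0, sleep=time.sleep):
    """Call fn, retrying up to `retries` times with exponentially
    growing delays: base_delay, 2*base_delay, 4*base_delay, ..."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except TransientError:
            if attempt == retries:
                raise  # out of retries: propagate the error
            sleep(base_delay * 2 ** attempt)
```

The injectable `sleep` parameter is a small design choice that makes the backoff behavior testable without real waiting.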

Rate limiting & circuit breakers
A more direct way to limit load is to apply active rate limiting to each client of a service. Rejected requests are cheap to process, which helps with quick and targeted load shedding. Rate limits should be enforced both externally (ex: Varnish, RESTBase), as well as internally, for each service. The service-runner library used in most node services supports rate limiting out of the box.
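
For illustration, per-client rate limiting is commonly implemented as a token bucket. This sketch is not how service-runner does it; the class and its parameters are made up for the example:

```python
class TokenBucket:
    """Allow up to `rate` requests per second, with bursts of up to `burst`."""
    def __init__(self, rate, burst, clock):
        self.rate = rate
        self.burst = burst
        self.clock = clock          # injectable time source, e.g. time.monotonic
        self.tokens = float(burst)
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # reject cheaply instead of doing the expensive work
```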

Circuit breakers are basically client-side rate limits applied to error responses from a service. When a contacted service is returning a high rate of errors, it is likely that subsequent requests will fail as well. The service is likely to recover more quickly if it is not swamped by requests. When reaching an error rate threshold, the circuit breaker will stop most or all requests to the failed service, speeding recovery. A low rate of errors might still be allowed through, so that the circuit breaker can reset itself once the service returns to normal operation. We hope to integrate circuit breaker functionality in the PHP & node HTTP client libraries soon.
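
A minimal sketch of the mechanism follows: a consecutive-failure breaker with a cool-down period. The threshold and half-open policy here are illustrative, not what any of our client libraries implement:

```python
import time

class CircuitBreaker:
    """Open after `threshold` consecutive failures; allow a trial
    request again once `reset_after` seconds have elapsed."""
    def __init__(self, threshold=5, reset_after=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True   # closed: requests flow normally
        # Half-open: let a trial request through after the cool-down.
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self):
        self.failures = 0
        self.opened_at = None  # reset once the service recovers

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()
```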

Testing error handling: Injecting faults

 * Take out random dependencies
   * DB (individual nodes, all)
   * other services
 * Overload: Saturate service with requests, see how it copes.
 * Things to look for
   * Memory usage
   * Decent error returns?
   * Rate limiting working?
   * Recovery
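
The "take out random dependencies" idea can be sketched as a fault-injecting wrapper around a dependency call, used in tests to exercise the caller's error handling. The helper is entirely illustrative:

```python
import random

def with_fault_injection(fn, failure_rate=0.1, rng=random.random):
    """Wrap a dependency call so that it fails randomly with the given
    probability, simulating a flaky or unavailable backend."""
    def wrapped(*args, **kwargs):
        if rng() < failure_rate:
            raise ConnectionError("injected fault")
        return fn(*args, **kwargs)
    return wrapped
```

Setting `failure_rate=1.0` simulates a fully unavailable dependency, which is useful for verifying error responses and recovery behavior.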

Automating fault injection

 * ChaosMonkey