Rules of thumb for robust service infrastructure

Error handling paths tend to be less well tested than the happy path, which often leads to surprises and outages. To avoid those, we need to compensate by being especially thorough in integration testing, both before deployment and during normal operation.

This document aims to describe
 * typical fault handling challenges & solutions, and
 * procedures & tools for testing error handling both before and after deployment.

Timeouts

 * Timeout stacking: in a chain of service calls, each layer's timeout needs to fit within its caller's (see https://phabricator.wikimedia.org/T97204)
 * Avoid introducing expensive endpoints; keep latency ideally < 10s, definitely < 60s
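To illustrate the stacking point, here is a minimal sketch of deadline propagation: the entry point sets one absolute deadline, and every nested call derives its own (smaller) timeout from it, so inner calls cannot outlive the outer budget. The function names and the 0.9 safety factor are illustrative, not from any specific library.

```python
import time

def remaining_budget(deadline):
    """Seconds left until `deadline` (an absolute time.monotonic() value)."""
    return deadline - time.monotonic()

def call_with_deadline(func, deadline, safety=0.9):
    """Invoke `func(timeout=...)` with a timeout derived from the deadline."""
    budget = remaining_budget(deadline)
    if budget <= 0:
        raise TimeoutError("deadline already exceeded")
    # Leave headroom so this layer can still produce a clean error response
    # instead of timing out at exactly the same moment as its caller.
    return func(timeout=budget * safety)

# Usage: overall budget of 10 s, well under the 60 s hard limit.
deadline = time.monotonic() + 10.0
result = call_with_deadline(lambda timeout: f"fetched within {timeout:.1f}s", deadline)
```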

Error responses

 * Distinguish 500 (permanent server error, retrying won't help) from 503 with a Retry-After header (temporary condition, retry later)
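A minimal sketch of that distinction (function name and the 5-second default are hypothetical): retryable failures get a 503 plus Retry-After so well-behaved clients know when to come back; genuine bugs get a 500 with no retry hint.

```python
def error_response(retryable, retry_after_seconds=5):
    """Return a (status, headers, body) triple for a failed request."""
    if retryable:
        # Temporary overload or a dependency being down: ask clients to back off.
        return 503, {"Retry-After": str(retry_after_seconds)}, "Service Unavailable"
    # A genuine bug; retrying the same request will not help.
    return 500, {}, "Internal Server Error"

status, headers, body = error_response(retryable=True)
```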

Timeouts & retry

 * Point out risks: retries amplify load on a service that is already struggling
 * Link to example incidents
 * Set out policy:
   * retry only in very targeted cases (see https://phabricator.wikimedia.org/T97204)
   * bound the maximum load amplification factor
   * use delays, rate limiting & back-off between attempts
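The policy points above can be sketched as a conservative retry helper (the parameters are illustrative): at most one retry, which bounds the load amplification factor at 2x even during a full outage, with jittered exponential back-off to avoid synchronized retry storms.

```python
import random
import time

def retry_once(func, base_delay=0.1, max_attempts=2):
    """Call `func`; on failure, wait with jittered back-off and retry.

    max_attempts=2 caps amplification at 2x: under a full outage this
    client sends at most twice the normal request volume.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise   # budget exhausted, surface the error
            # Exponential back-off with full jitter.
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))
```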

Testing error handling: Injecting faults

 * Take out random dependencies:
   * the DB (individual nodes, or all of them)
   * other services
 * Overload: saturate the service with requests & see how it copes
 * Things to look for:
   * memory usage
   * decent error returns?
   * rate limiting working?
   * recovery once the fault is removed
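As a hedged sketch of the "take out a dependency" idea: inject a dependency that always fails and check that the service degrades to a decent error return instead of crashing. `Service`, `fetch_article` and `broken_db` are hypothetical names for illustration.

```python
class Service:
    def __init__(self, db_query):
        self.db_query = db_query   # injected dependency, easy to replace with a fault

    def fetch_article(self, title):
        try:
            return 200, self.db_query(title)
        except ConnectionError:
            # Things to look for: a bounded, decent error response,
            # not an unhandled exception.
            return 503, "database unavailable, try again later"

def broken_db(title):
    raise ConnectionError("injected fault: DB node down")

status, body = Service(broken_db).fetch_article("Main_Page")
```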

Automating fault injection

 * Chaos Monkey (Netflix)
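In the spirit of Chaos Monkey, a tiny sketch of automated fault injection (names and the 10% rate are illustrative): wrap a dependency so that a configurable fraction of calls fail at random during normal operation, continuously exercising the error handling paths.

```python
import random

def chaotic(func, failure_rate=0.1, rng=random.random):
    """Wrap `func` so a random fraction of calls raise an injected fault."""
    def wrapper(*args, **kwargs):
        if rng() < failure_rate:
            raise ConnectionError("chaos: injected dependency failure")
        return func(*args, **kwargs)
    return wrapper
```

In production this would run at a low rate and only on services expected to tolerate it; in staging the rate can be cranked up.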