
= API routing / rate limiting design proposal =

Requirements
Core_Platform_Team/Initiatives/API_Gateway outlines the product requirements for the API Gateway initiative. The following technical requirements are derived from these, in addition to requirements necessary for the robust operation of these systems at the WMF.


 * Rate limit requests from authenticated clients (OAuth 2.0, MediaWiki as the authorization service)
 * Rate limit anonymous requests (anything not authenticated via the OAuth authorization service)
 * Limits must be applied across data centers
 ** NOTE: Cross-datacenter consistency can likely be quite relaxed; minor and/or occasional loss of data or delayed replication should be quite tolerable for this use case
 * Support for groups of clients with associated limits (so-called classes)
 ** Arbitrarily defined
 ** One or more clients per class (by client ID)
 ** One (implicit?) group representing anonymous clients
 * Ability to report on the total number of API calls made by a client
 * Optional: generate a notification to the client owner when a rate has been exceeded
 * Informative error responses that are clear as to the cause of failure and the limit being imposed
 ** Should return HTTP headers that report the rate versus the limit imposed
 * Must be free of single points of failure
 * Must be horizontally scalable
 * Must fail open where possible
 * Must have latency low enough to be in-lined on every request

Sharing state (counters)
Any rate limiter implementation will make use of request counters to calculate rates. When there are multiple instances (necessary for scalability and fault tolerance), and when requests are distributed over them without client affinity, these counters must be shared. The implication is that for every request fielded by the rate limiter, at least two operations against shared state, an increment and a read, will be made. Additionally, since the application of rate limiting is a blocking operation, any latency incurred contributes directly to overall request latency.

Typically, applications like Redis or Memcached are used for shared state like this. These stores have minimal latencies, and scale to large request volumes. However, we should take seriously the per-request cost and complexity that shared state introduces, and seek to minimize it whenever possible.
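As a concrete illustration, the naive increment-and-read pattern might look like the following fixed-window sketch. The in-process `SharedCounterStore` merely stands in for a shared store such as Redis (which would provide an atomic INCR plus key expiry); all names here are hypothetical.

```python
import time

class SharedCounterStore:
    """Stand-in for shared state such as Redis; a real deployment
    would use an atomic INCR with a TTL on each window key."""
    def __init__(self):
        self.counters = {}

    def incr(self, key):
        self.counters[key] = self.counters.get(key, 0) + 1
        return self.counters[key]

class FixedWindowLimiter:
    """Naive fixed-window limiter: one increment-and-read of shared
    state on every request, on the blocking path."""
    def __init__(self, store, limit_per_window, window_seconds=1):
        self.store = store
        self.limit = limit_per_window
        self.window = window_seconds

    def allow(self, identity, now=None):
        now = time.time() if now is None else now
        window_id = int(now // self.window)
        key = f"{identity}:{window_id}"
        count = self.store.incr(key)  # the write *and* the read
        return count <= self.limit
```

Note that the shared-state round trip happens before the request can be answered, which is why its latency matters so much.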

Optimizations
Some optimizations to shared state access are possible in theory. However, whether we can apply them may depend on whether we implement our own rate limiting, or are able to contribute such changes to an existing implementation we choose.

Asynchronous increments
One optimization over the naive approach is to perform counter increments asynchronously. This creates an obvious race, and absolute correctness would not be possible. However, any potential inaccuracies are small and temporary, and would be outweighed by the benefits of decoupling this operation.
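A sketch of the idea, with a plain dict standing in for the shared store and an explicit `flush()` standing in for the background task that would apply increments asynchronously; the names are hypothetical:

```python
from collections import Counter

class AsyncIncrementLimiter:
    """Decoupled increments: reads hit shared state on the request
    path, while increments are buffered locally and flushed out of
    band (here explicitly; in practice by a background task).
    Counts can briefly lag, so enforcement is approximate."""
    def __init__(self, shared_counts, limit):
        self.shared = shared_counts   # stand-in for e.g. Redis
        self.pending = Counter()      # local, unflushed increments
        self.limit = limit

    def allow(self, identity):
        # Only the read is synchronous; the increment is merely queued.
        allowed = self.shared.get(identity, 0) < self.limit
        self.pending[identity] += 1
        return allowed

    def flush(self):
        # Would run asynchronously (timer or background thread).
        for identity, n in self.pending.items():
            self.shared[identity] = self.shared.get(identity, 0) + n
        self.pending.clear()
```

The race is visible in the sketch: until a flush occurs, reads undercount, so a burst can briefly exceed the limit.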

Mitigation caching
It should be possible to eliminate some counter reads. Certainly, if a client exceeds their rate limit and mitigation becomes necessary, we have the information needed to establish how long that mitigation will remain in effect (i.e. how much time must pass before the rate drops below the limit); we can cache a mitigation for a period calculated to correspond with the time needed for the rate to fall below the limit. Other optimizations on the read path may also be possible.
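Assuming a fixed-window counter, a mitigation lasts until the current window rolls over, which makes the cache expiry easy to compute. A hypothetical sketch:

```python
import time

class MitigationCache:
    """Caches 'over limit' verdicts so the read path can skip the
    shared-state lookup while a mitigation is known to be in effect.
    Assumes a fixed-window limiter, where a mitigation holds until
    the current window ends."""
    def __init__(self, window_seconds):
        self.window = window_seconds
        self.mitigated_until = {}  # identity -> wall-clock expiry

    def is_mitigated(self, identity, now=None):
        now = time.time() if now is None else now
        return self.mitigated_until.get(identity, 0) > now

    def record_violation(self, identity, now=None):
        now = time.time() if now is None else now
        # Mitigation holds until the end of the window containing `now`.
        window_end = (int(now // self.window) + 1) * self.window
        self.mitigated_until[identity] = window_end
```

While a client is in the cache, requests can be rejected locally with no shared-state read at all.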

Client affinity
In a perfect world, it would be possible to avoid the use of shared state entirely, provided we had session-based load-balancing upstream of rate limiters. With the same clients always served by the same rate limiter, rate limiters could operate entirely against local counters. This would result in a simpler, more robust, and higher performing system.

Endpoint cost disparity
HTTP resources are not all equal; some endpoints are more expensive than others (computationally, in network or file I/O, etc.), so it may be necessary to impose resource limits that are relative to a target.

Cost-correction / overrides
One approach to disparate resource costs would be to apply a cost-correcting multiplier to the provided rate limit. Such multipliers could be managed by operator-defined hierarchical structures. For example: a target limit of 1,000 requests per second is increased to 10,000/s for any endpoint matching /core/*/*/*/path/to/cheap, and decreased to 100/s for those matching /core/*/*/*/path/to/expensive. Unmatched endpoints are treated as if configured with a multiplier of 1.
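The multiplier lookup described above could be sketched as follows; the glob-style matching (via Python's `fnmatch`) and the override table are illustrative assumptions, not a settled design:

```python
from fnmatch import fnmatch

# Hypothetical operator-defined overrides; first match wins, and
# unmatched endpoints default to a multiplier of 1.
COST_MULTIPLIERS = [
    ("/core/*/*/*/path/to/cheap", 10.0),
    ("/core/*/*/*/path/to/expensive", 0.1),
]

def effective_limit(base_limit, path):
    """Apply a cost-correcting multiplier to the client's base limit."""
    for pattern, multiplier in COST_MULTIPLIERS:
        if fnmatch(path, pattern):
            return base_limit * multiplier
    return base_limit
```

So a base limit of 1,000/s becomes 10,000/s for cheap endpoints and 100/s for expensive ones, with everything else unchanged.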

Alternatively, the idea of overrides could be taken beyond simple cost multipliers to permit discrete overrides.

Concurrency limits
Another approach to addressing cost disparity would be to impose per-user concurrency limits on problematic endpoints. Such limits would best be applied at the upstream service; any discussion of implementing service-based concurrency limits is out of scope for this document.

Sourcing client limits
Since the limits to enforce can vary based on “who” is accessing a resource, it must be possible to derive them from information found only in the request. Since MediaWiki is canonical for clients and users, the limits themselves either need to come from MediaWiki, or be mapped from elsewhere to MediaWiki references. There are any number of ways this might be accomplished, for example:


 * Limits could be stored by MediaWiki as session variables.  The rate limiter would then request the session object from session storage, deserialize, and extract the limit
 * Limits could be requested from a MediaWiki endpoint, using information provided in the access token (the client ID or a surrogate key)
 * The rate limiter could maintain a cache of the rates of active clients, and MediaWiki could invalidate this cache on logouts and pre-seed it on new authorizations
 * MediaWiki could encode the limit in a cryptographically signed token that the user submits for API requests.  The rate limiter would utilize the limit provided after validating the token signature

The latter approach is particularly attractive. It requires no additional network transaction on the part of the rate limiter to obtain limits, is less complex, and is agnostic of the authorization server.

JSON Web Tokens (JWT)
JSON Web Token (JWT) is an Internet standard for creating JSON-based access tokens that assert some number of claims. Tokens are signed by the server to verify that they are legitimate. JWT claims are typically used to pass the identity of authenticated users between an identity provider and a service provider, but other types of claims are possible. What is proposed here is to use JWTs, generated by MediaWiki, to pass configured rate limits (as a custom claim) to the rate limiting service.
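To make the mechanism concrete, the sketch below hand-rolls an HS256 JWT so the example is self-contained; a real deployment (MediaWiki, Envoy's JWT filter) would use vetted JWT implementations, and the `ratelimit` claim name is a hypothetical choice:

```python
import base64
import hashlib
import hmac
import json

def b64url(data: bytes) -> bytes:
    """Unpadded base64url, per the JWT serialization rules."""
    return base64.urlsafe_b64encode(data).rstrip(b"=")

def sign_token(claims: dict, key: bytes) -> str:
    """Issue a minimal HS256 JWT carrying the given claims.
    This is what the authorization server (MediaWiki) would do."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps(claims).encode())
    signing_input = header + b"." + payload
    sig = b64url(hmac.new(key, signing_input, hashlib.sha256).digest())
    return (signing_input + b"." + sig).decode()

def limit_from_token(token: str, key: bytes):
    """Rate limiter side: verify the signature, then extract the
    hypothetical 'ratelimit' claim. No call back to MediaWiki needed."""
    header, payload, sig = token.encode().split(b".")
    signing_input = header + b"." + payload
    expected = b64url(hmac.new(key, signing_input, hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        raise ValueError("invalid token signature")
    pad = b"=" * (-len(payload) % 4)  # restore base64 padding
    claims = json.loads(base64.urlsafe_b64decode(payload + pad))
    return claims["ratelimit"]
```

The key property this illustrates is that the limit travels with the request: the rate limiter trusts the value because the signature proves it was issued by the authorization server.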

Prior art
A number of freely licensed projects implementing HTTP routers and/or rate limiters exist. Please see rate limiter research for some popular examples, triaged for the purpose of this proposal.

Routing
We propose using Envoy as the API Gateway router. Envoy is extensible, performant, has a large and active user and developer community, and is already in use at the Foundation.

Rate limiting
Envoy has filters to support both local and global rate limiting, and they can be combined so that violations of the local limit preempt a call to the global limiter. The global rate limiting filter communicates with a service using gRPC, and a reference implementation of such a service does exist.

We evaluated Envoy’s rate limiting against some of the key factors mentioned in the background section to better understand its limitations and determine its feasibility.

Issues identified

 * 1) As discussed in the background section, our objective is to have the limit passed as an attribute of signed JWTs originating from MediaWiki (the authorization service).  Envoy ships with a filter that supports JSON Web Tokens, and our evaluation found it capable of validating token signatures and copying the decoded payload to dynamic state for use by subsequent filters.  However, the rate limiter service assumes limits to be part of its static (YAML-encoded) configuration; it is not currently possible to pass limit values to the global rate limiter service
 * 2) We found no straightforward means of supporting cost-correction overrides
 * 3) Redis is utilized for shared state, but this will not accommodate a multi-datacenter configuration

Solutions proposed
To address the above deficiencies, we propose the following:


 * 1) Modify the rate limiter service to accept an optional argument of a client-supplied limit.  Exact semantics are TBD, but this optional argument could operate in concert with the existing static configuration, with the service changes generalized suitably to propose upstream.  This change necessarily requires a corresponding change to the gRPC interface, and so requires changes to Envoy’s global rate limit filter as well.
 * 2) [PENDING DISCUSSION] Modify the rate limiter service to support the application of cost-correction overrides.  We’ll engage with upstream early to establish changes suitable for inclusion
 * 3) Changes to the rate limit service to support the use of Dynomite, Cassandra, or similar multi-datacenter storage solution.

Reporting
There is a product requirement that we maintain a count of accesses by client ID to support utilization graphs in the API portal. Our test configuration demonstrates the use of access logging to include a field containing the client ID. Envoy access logs can also be structured as JSON, and/or delivered via a gRPC sink. This affords a great deal of flexibility; there are myriad ways to process and aggregate this information using standard tooling (likely we can have our pick from those already in use at the Foundation, but we will seek guidance from SRE before defining this more concretely).
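For instance, a structured (JSON) access log reduces to per-client call counts with trivial tooling. The `client_id` field name below is an assumption about how the log format would be configured:

```python
import json
from collections import Counter

# Hypothetical JSON access-log lines of the kind Envoy can emit.
log_lines = [
    '{"client_id": "app-1", "path": "/core/v1/page", "code": 200}',
    '{"client_id": "app-2", "path": "/core/v1/page", "code": 200}',
    '{"client_id": "app-1", "path": "/core/v1/search", "code": 429}',
]

def calls_per_client(lines):
    """Tally API calls by client ID from structured access logs."""
    counts = Counter()
    for line in lines:
        entry = json.loads(line)
        counts[entry.get("client_id", "anonymous")] += 1
    return counts
```

In practice the same aggregation would be performed by whatever log pipeline SRE recommends, rather than ad hoc scripts.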

Why not define rates in the rate limiter, and pass the class name from MediaWiki instead?
It is a fundamental property of the API Gateway that requests are rate limited differently based on who is making the request. Specifically, the product requirements stipulate that we will rate limit on a client ID basis, where clients are grouped into classes that share a common limit.

In other words: Rate limits are attributes of a class, classes are collections of clients, and clients are associated with a MediaWiki user. This is the data model, and it belongs to the authorization service (MediaWiki + OAuth extension).

While it is technically possible to establish a mapping between class and corresponding limit in the rate limiter, and pass a class name from MediaWiki instead of a rate, it is a mistake to do so. The sorts of problems this will create are typical of poor encapsulation.

For example: It requires duplicate definitions of the class across disparate systems — MediaWiki to establish grouping of clients, the rate limiter for assignment of the limit value. Class management (creating classes, (re)assigning clients, altering limits, etc) now requires coordination between separate changes in MediaWiki (a UI?), and the rate limiter (a change in Gerrit and k8s deploy?), most likely performed by different people (and/or teams).

More significantly, it enshrines the concept of classes in the rate limiter contract. Should the semantics need to change, they will need to change in both places in a coordinated manner (and transitions like this typically require error-prone multi-step phased rollouts when performed in production). Future systems that reuse the rate limiter, either through Envoy or directly, will need to either conform to some similar notion of classes, or be otherwise (separately) accommodated.

Passing the rate from the authorization server (MW + extension) keeps the relationship between who is making the request and the corresponding rate, encapsulated within the system that establishes them. It creates a contract with the API router and rate limiter that can remain stable regardless of changes to the API Gateway semantics, or any future system that reuses this infrastructure.

Do we really need multi-datacenter support?
We think so, yes. Since we are limiting on a client ID basis, imagine a hypothetical ID belonging to a mobile reader available in app stores, and installed on hundreds of thousands of devices. If traffic from these readers is distributed over every data center, we'll need to apply a rate limit that accounts for the aggregate number of requests.

Glossary

Access token
A token forwarded to the API via an Authorization: Bearer… header. Represents a user’s authorization to access a resource.

Class
A group of clients with associated limits.

Client
Synonymous with application. A client is what accesses Wikimedia APIs on behalf of a user.

Client ID
Value issued to developers (by us) to uniquely identify a client.

Identity
Unique value that identifies what is being limited. For example: This could be an IP address for anonymous requests, or a client ID when authenticated.

JSON Web Token
JSON Web Token (JWT) is an internet standard for JSON-based access tokens.

OAuth
As used in this document, OAuth is always shorthand for OAuth 2.0.

Resource
Something a limit is imposed on. Typically (but not necessarily) corresponds to the ‘R’ in ‘URI’.

User
A person. Someone who uses the API through a client. When authenticated, these are users in MediaWiki.