Wikimedia Services Architecture Recommendations (2019)

Introduction
This document provides a high-level overview of the services architecture at Wikimedia, with recommendations that reflect current thinking and discussion about its future direction.

RESTRouter
RESTRouter allows for the composition/curation of REST endpoints (read: an API) from one or more disparate micro-services. Transformations can be applied, and responses can be persistently cached to a storage service; cached responses can then be used to avoid costly round trips to back-end micro-services.

RESTBase
RESTBase is an HTTP service providing storage primitives to RESTRouter.

Primary data model:

Service endpoints
HTTP endpoints in service of the following use-cases are hosted by the RESTRouter/RESTBase infrastructure.

Parsoid
Parsoid.JS is a stateless service implementing an alternative wikitext parser; Parsoid output is utilized by VisualEditor to enable round-trip conversion of edited HTML back to wikitext without format normalization.

As Parsoid.JS is considered too slow to be in-lined with user facing requests, its output is pre-generated and stored on every document change (edits, template transclusions).

Additionally, snapshots of Parsoid.JS output that correspond to edits in-progress are stored separately (expiring after a TTL period).

MCS/PCS
Mobile Content Service/Page Content Service (herein referred to as MCS and PCS respectively) is a stateless service that produces document transforms suited to certain use-cases (such as native mobile applications). The service is exposed via RESTRouter, which performs access checks, title normalization, and redirects.

Out of concern that the MCS/PCS backend service may be too slow to be in-lined with user-facing requests, responses are persisted to RESTBase, and content is pre-generated when the document is changed.

Feeds
RESTRouter is used to expose the Wikifeeds service by proxying requests, and enriching the responses with article summaries.

Graphs
RESTRouter proxies requests to Graphoid. No mangling of requests and responses is performed.

Recommendations
RESTRouter proxies requests to the Recommendations service. No mangling of requests and responses is performed.

Citoid
RESTRouter acts as a proxy to the Citoid service. No mangling of requests and responses is performed.

Reading lists
RESTRouter is utilized to present a REST API for the Reading Lists MediaWiki extension (via the Action API).

Mathoid
Mathoid is a stateless service that accepts LaTeX formulae as input and returns corresponding image data in various formats. Images are persisted in Cassandra, and directly referenced on page views.

NOTE: Storage of images uses a model specific to Mathoid; the Persistent Caching pattern is not reused here.

Pageviews
RESTRouter serves as a proxy to the Analytics Query Service (hereafter referred to as AQS). AQS itself is a separate instance of RESTRouter/RESTBase, providing read-only access to a Cassandra cluster where pageviews data is maintained.

Documentation portal
An endpoint which utilizes swagger-ui to provide interactive documentation for all RESTRouter-hosted APIs. The OpenAPI specification used is an aggregation of those provided by the individual services.

PDF
Proton is a stateless service that utilizes Chromium (OSS web browser) to render wiki articles to PDF documents. RESTRouter proxies incoming requests to the backend service.



Persistent caching
Derived content that is durably persisted in perpetuity, or until invalidated by a newer value. This pattern is typically paired with event-driven asynchronous tasks to regenerate content (by issuing requests with Cache-Control: no-cache), each time the canonical source has changed.
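
The pattern can be illustrated with a minimal sketch. This is not RESTRouter's implementation: the class name, the `render` callback, and the in-memory dictionary standing in for the storage service are all assumptions made for illustration. Only the use of Cache-Control: no-cache to force regeneration comes from the description above.

```python
from typing import Callable, Dict, Optional


class PersistentCache:
    """Store of derived content keyed by title (in-memory stand-in for durable storage)."""

    def __init__(self, render: Callable[[str], str]) -> None:
        self._render = render              # hypothetical call to the backend service
        self._store: Dict[str, str] = {}   # stand-in for the storage service

    def get(self, title: str, cache_control: Optional[str] = None) -> str:
        # Serve the stored value unless the caller asked for regeneration.
        if cache_control != "no-cache" and title in self._store:
            return self._store[title]
        # Regenerate and overwrite; the old value is invalidated by the newer one.
        value = self._render(title)
        self._store[title] = value
        return value


def on_change_event(cache: PersistentCache, title: str) -> None:
    """Asynchronous task run when the canonical source has changed (hypothetical hook)."""
    cache.get(title, cache_control="no-cache")
```

Ordinary reads never reach the backend once a value is stored; only the change event (or an explicit no-cache request) triggers a new render.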

Proxying
RESTRouter is used as an intermediary for requests from clients seeking resources from other services.

Composition
A REST API endpoint is composed in some way from one or more backend systems.
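
As a rough sketch of composition, consider the Feeds endpoint described earlier, where a feed response is enriched with article summaries. The function and callback names below are hypothetical, and the two callbacks stand in for requests to two separate backend services.

```python
from typing import Callable, Dict, List


def compose_feed(
    fetch_feed: Callable[[], List[str]],
    fetch_summary: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Build one response from two backends: a feed of titles plus a summary for each."""
    return [{"title": t, "summary": fetch_summary(t)} for t in fetch_feed()]
```

The client sees a single endpoint; which backends contributed to the response is an implementation detail of the router.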

Access control
MediaWiki is canonical for all documents, and so it is the arbiter of who can and cannot access them; when external systems serve content, they must apply the same access controls as MediaWiki. To speed up access checks, and avoid an Action API request, a subset of the revision metadata is replicated from the MediaWiki database into RESTBase, and is updated on every edit. This metadata is used to provide a limited form of access control to integrated services where needed.
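
A minimal sketch of such a check against replicated metadata follows. The text does not say which revision fields are replicated, so the `suppressed` flag, the class names, and the fail-closed handling of unknown revisions are assumptions of this example rather than a description of RESTBase.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class RevisionMeta:
    rev_id: int
    suppressed: bool  # hypothetical field; the real replicated subset is not specified


class RevisionReplica:
    """Local copy of revision metadata, consulted instead of calling the Action API."""

    def __init__(self) -> None:
        self._by_rev: Dict[int, RevisionMeta] = {}

    def on_revision_update(self, meta: RevisionMeta) -> None:
        # Invoked for every edit so the replica tracks the MediaWiki database.
        self._by_rev[meta.rev_id] = meta

    def can_read(self, rev_id: int) -> bool:
        meta = self._by_rev.get(rev_id)
        # Unknown revisions are denied (an assumption of this sketch).
        return meta is not None and not meta.suppressed
```

The check is only as current as the last update the replica received, which is why the access control it provides is described as limited.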

Title Normalization
Article titles are normalized according to per-wiki settings.
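
A simplified sketch of what this involves is shown below. It covers only two rules (treating spaces and underscores alike, and optional first-letter capitalization); the function name and the `capitalize_first` parameter, which stands in for a per-wiki setting, are assumptions, and real MediaWiki title handling includes more cases than are shown here.

```python
def normalize_title(title: str, capitalize_first: bool = True) -> str:
    """Normalize an article title; `capitalize_first` stands in for a per-wiki setting."""
    # Treat underscores and spaces alike, collapse runs of whitespace, trim the ends.
    words = title.replace("_", " ").split()
    normalized = "_".join(words)
    if capitalize_first and normalized:
        normalized = normalized[0].upper() + normalized[1:]
    return normalized
```

For instance, `normalize_title("  main  page ")` yields `"Main_page"`, while a wiki configured with case-sensitive first letters would call it with `capitalize_first=False` and get `"main_page"`.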

CSP and CORS headers setup
Security headers are injected into service responses. This was based on an idea that service implementations shouldn’t care about CORS and CSP, and RESTRouter would provide sane defaults.
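
The idea can be sketched as a small function applied to every backend response. The header names are the standard CORS and CSP headers, but the default values shown are purely illustrative, since the text does not state the values RESTRouter uses, and the function name is hypothetical.

```python
from typing import Dict, Mapping

# Illustrative defaults only; the values actually used are not given in the text.
DEFAULT_SECURITY_HEADERS: Dict[str, str] = {
    "Access-Control-Allow-Origin": "*",
    "Content-Security-Policy": "default-src 'none'",
}


def with_security_headers(headers: Mapping[str, str]) -> Dict[str, str]:
    """Return a copy of a backend's response headers with the defaults applied on top."""
    overridden = {name.lower() for name in DEFAULT_SECURITY_HEADERS}
    # HTTP header names are case-insensitive, so drop any variant the service set itself.
    merged = {k: v for k, v in headers.items() if k.lower() not in overridden}
    merged.update(DEFAULT_SECURITY_HEADERS)
    return merged
```

Because the router's values replace whatever the service sent, a service can omit these headers entirely and still be served with consistent defaults.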

Avoiding replication
Ad hoc replication is generally a bad idea, particularly when a loss of coherency cannot be tolerated. Our replication of data from MediaWiki’s RDBMS for access control leaves us vulnerable to data exposure if updates are lost or delayed (think: suppressed revision text).

If the latency of in-lined callbacks to MediaWiki for authorization becomes intractable, it may indicate that a service should instead be implemented in, or proxied through, MediaWiki, where access validation can be incorporated into an existing transaction.

Premature optimization
Support for pre-generating content when a source has changed, and persisting it in perpetuity (or until a subsequent change), was built into the earliest iterations of RESTRouter and its use has since become convention. This type of preemptive caching is potentially very expensive. It requires resources for the propagation of events, computation, and storage, for every edit, of every document, multiplied by the number of services that do this. Additionally, it opens us up to an entire class of coherency bugs that would not exist otherwise. This is done in anticipation of a subsequent read that may never happen, or that may happen at most once given that responses are still cached at the edge. All of which is rationalized as being necessary for performance, yet it’s rare for latency targets to even exist (there is seldom an answer for “How fast must it be?”).

That such preemptive caching is necessary in some circumstances shouldn’t be used as justification for applying it indiscriminately. Service-level objectives (SLOs) should always be established during the planning of new systems. Design proposals should take into account the performance expected of them, and employ optimizations in a measured manner.

Abstractions: A priori versus a posteriori
There are many good reasons to build a service abstraction, such as creating narrower or simpler interfaces, or generalizing in order to decouple from non-abstract systems. However, it is typically a poor use of time to create such abstractions in anticipation of future use-cases (“if you build it, [they] will come”). We seldom understand our problems well at such an early stage, and getting an abstraction wrong can have far-reaching impact if consumers design toward an incorrect or inefficient model. Instead, we should practice Just In Time Engineering, identifying opportunities to abstract when the problem is made clear, or when the economies of scale justify it.

Parsoid
With respect to Parsoid, the RESTRouter/RESTBase stack is used to establish external, alternative representations of extant MediaWiki architecture: Parsoid is a MediaWiki parser, and RESTBase persistence is used as a parser cache. This approach is flawed, as has already been well established elsewhere; work is currently underway to port Parsoid to PHP and incorporate it into MediaWiki core. Accordingly, the persistence component should also be re-integrated with MediaWiki (both caching and edit stashing).

Mathoid
RESTRouter and RESTBase provide considerable logistics support to Mathoid:


 * Storing maps of indirections to deduplicate formulae and avoid cache split
 * Maintaining mappings of formulae hashes to the corresponding input
 * Persisting rendered media

The current architecture is overwrought. A simpler design would be one where the features provided by RESTRouter and RESTBase are encapsulated in Mathoid, the Math extension, or both, rather than being spread over a third system.

Librarising features and moving them into services
Features such as title normalization and CORS/CSP header overrides could easily be turned into libraries and reused by individual services. This would decrease coupling between services and infrastructure, making services more standalone. For Node.JS services we already have service-template-node, which provides a framework into which these libraries can be integrated. A set of generalized integration tests could also be created to ensure that common semantics are properly implemented.

Off-the-shelf solutions to routing
With the adoption of the preceding recommendations, RESTRouter’s role becomes that of an HTTP router, a problem for which any number of off-the-shelf technologies exist. Given the not insignificant resources required to maintain such software, and the limited resources of the Foundation, we’d be well served by deprecating RESTRouter/RESTBase entirely and adopting a well-supported open-source alternative.