Requests for comment/Services and narrow interfaces

Problem statement
MediaWiki's codebase has mostly grown organically, which led to wide or non-existing internal interfaces. This makes it hard to test parts of the system independently and couples the development of parts strongly. Reasoning about the interaction of different parts of the system is difficult, especially once extensions enter the mix. Fault isolation is less than ideal for the same reason: a fatal error in a minor feature can bring down the entire system.

Additionally, the way clients request data from MediaWiki is changing. Richer clients request more information through APIs, which should ideally perform well even with relatively small requests. New features like notifications require long-running connections, which are difficult to implement in the PHP request-processing model. It would be useful to leverage solutions that exist outside of the PHP world for some of these applications.

Another problem is organizational. We now have several teams at the Wikimedia foundation working on new features. Currently, each team needs to handle the full stack from the front-end through caching layers and Apaches to the database. This tends to promote tight coupling of storage and code using it, which makes independent optimizations of the backend layers difficult. It also often leads to conflicts over backend issues as deployment approaches.

How embracing services can help to solve some of these issues
A common solution to the issues we are facing is to define parts of the system as independent services with clearly-defined, narrow interfaces. A popular and ubiquitous interface is HTTP. Reasons for its popularity include wide availability of implementations and middleware, a common vocabulary of verbs that can be applied to resources modeling the state (see REST) and reasonable efficiency. Even without a need for distribution, it is often useful to model interfaces in a way that would also easily map to HTTP. The value object RFC proposes complementary principles for PHP code.

Performance and scaling
With modern hardware, parallelism is the main method of improving the latency of individual requests. An architecture that makes it easy to process parts of a request in parallel is therefore likely to improve the performance of the application. Using distribution rather than shared-state multi-threading as the primary mechanism to exploit parallelism lets us scale to many machines, and provides good fault isolation without many of the problems common to naïve use of shared state.

Optimizations like batching can often be implemented transparently inside services without breaking the generality of interfaces. Services can opt to batch requests from several clients arriving within a given time window rather than just those from a single client. The transmission of many small messages is optimized by the upcoming HTTP 2.0 standard based on SPDY. SPDY support is already available in node.js, nginx, Apache, and others. Support in libcurl is under development.

Operational
Separate, distinct services that run independently provide significant advantages from an operational perspective, by breaking down large, often complex operational problems into many smaller ones that are easier to attack. While a per-service architecture can also be explored with a monolithic software architecture (similar to how API application servers running MediaWiki are split into a separate cluster), there are still significant benefits into having well-abstracted separate services.

More specifically, some of the advantages from an operational perspective are:
 * Monitoring: each function can be monitored independently and performance regressions or service outages can be easier to pinpoint to the specific component at fault (and the respective service owners), rather than requiring a wholistic investigation from scratch.
 * Service isolation: while the possibility of cascading failures is always going to be present, isolating the parts of infrastructure into separate services potentially helps with limiting the extent of many outages. For example, a separate media storage layer has limited site outages caused by NFS failures to partial outages.
 * Scaling: separate services can be scaled as needed, possibly with different hardware characteristics, adjusted to the individual service needs, and into separate availability zones.
 * Maintenance: basic maintenance of individual services can be broken down into smaller, easier maintenance tasks. For example, currently, upgrading the Linux distribution of MediaWiki application servers (e.g. from lucid to precise) is intimidating and a significant burden, due to the vast number of components that need to be examined and prepared in advance. Additionally, each of those system components map to very different software functionalities and regressions can only be identified by experienced engineers.

Security
Separate services can have considerable benefits from a security point of view, as they limit the attack vectors and allow for increased isolation of the services on the system & network level. Many services currently run in the 'application server' security context under a full MediaWiki installation, where they have full access to private settings including database passwords and the entire production cluster network. Abstracting these and other services into separate, purpose-built ones with minimal rights (no database access for example) can limit the consequences of vulnerabilities.

Interfaces between teams as a method of organizational scaling
Different teams in the foundation differ in their focus and areas of expertise. It would be great to free feature teams from the burden of implementing every detail of their services' backend infrastructure while simultaneously giving backend experts the ability to optimize the implementations behind a well-defined interface. Services can help here by splitting a bigger task between members of several teams. The interfaces developed in discussions between teams are more likely to be narrow and informed by both implementer and user concerns. Concerns would then surface early during interface design rather than being a source of conflicts in final review.

Additionally, separate services decrease the learning curve for each individual service by simplifying the architecture that both new software and operational engineers have to understand before making their contributions. A large, monolithic architecture is intimidating and more time-consuming to master and feel confident enough about to make large changes in.

Incremental change
A complex system like MediaWiki cannot be rewritten from scratch. We need a way to evolve the system incrementally. By starting to develop parts of the system like Parsoid, mathematical typesetting, and PDF rendering as services, we gain the ability to choose the most appropriate technology for each component. The ubiquity of HTTP makes it easy to use service interfaces from a variety of environments. For PHP, a separate RFC proposes a convenient and general HTTP service interface with support for parallelism.

Reusability & community fostering
The Wikimedia Foundation's infrastructure consists of several individual functions that all work together to provide the wiki experience. Some of these functions might be of broader interest if they were implemented separately and could attract numerous users and contributors outside of the traditional target group (users who want to run a wiki,) especially with some more-popular choices of software components.

For example, a simple, efficient, imagescaling service with a RESTful API, supporting a Swift backend, multiple formats and various features that we use (e.g. cgroups), that can be implemented without the complexity of MediaWiki's framework (or even in a different language) and that can run independently, could be very appealing to various third-party users, gain popularity on its own and attract contributors from the free software community.

Packaging and small installs
A strength of MediaWiki so far has been the ability to install a very stripped-down version in a PHP-only shared hosting environment. This might be insecure and slow, might not balance HTML tags in content and does not include fancy features like Math or PDF rendering. But it provides an easy starting point for running your own wiki.

In a service architecture, the challenge is to provide a similar experience for small-scale deployments. One answer to this problem can be packaging. Virtual machines running popular Linux distributions like Debian are now available at similar prices as a shared hosting install. With good packaging, the installation of MediaWiki can be as easy as  with optional   and   packages readily available. While there are definitely small overheads associated with running a distributed system in a small VM, this is likely to be offset by the choice of more efficient technologies for individual services. Another option is alternative implementations of some services for resource-constrained environments. Again, narrow interfaces make such drop-in replacements relatively straightforward.

XML, SOAP, WSDL?
What is being proposed here is very close to the Service-oriented architecture (SOA) style. The SOA term is closely associated with XML, SOAP and WSDL. Those are not universally loved. They are also specific implementation -- choices made in specific industries to implement services. We are advocating plain REST-style interfaces.

Fragmentation
While some amount of diversity is good and helps with aging architectures and can result into experimentation with new methods of writing code, there are significant risks of introducing fragmentation. This could be, for example, a proliferation of programming languages or frameworks that are employed in the development of the system as a whole. These, in turn, could increase the amount of domain knowledge and introduction of silos between separate teams that develop independently without looking at the larger picture, as well as increased complexity into the system from an architecture point-of-view.

Media storage
The split of the media storage infrastructure from NFS into Swift was one of the first large deployments of an HTTP service replacing an existing function of MediaWiki. While the deployment encountered several challenges and as a consequence, had to retain significant media storage application logic in MediaWiki, it has nevertheless provided significant benefits from an operational aspect, such as stability & scalability and is considered a success overall.

For media storage purposes, the Swift protocol was picked, which is an existing industry-standard, RESTful protocol with client bindings for multiple languages, of which we use PHP (for MediaWiki) & Python (for various tools that we wrote) and planning to use the Node.js ones (for the Offline Content Generation project). Multiple competing implementations of the Swift protocol exist, including the canonical Openstack Storage one that we use. We have already experimented with a second implementation (Ceph), with relatively small changes on the MediaWiki side or the rest of our tools.

The deployment is unique, by employing an existing protocol & software for implementing the service, which, in addition to all of the service architecture benefits listed above, is also bringing us fixes & important features implemented by the larger Openstack community without much additional effort from the Wikimedia community.

Parsoid
The Parsoid web service provides a bidirectional conversion interface between Wikitext and HTML as well as a specification of the HTML it generates and accepts. This design has worked out very well in practice:


 * We were able to implement Parsoid in node.js, which lets Parsoid perform well despite doing very complex processing. We were also able to use several specialized libraries like a HTML5 DOM treebuilder at a time where those were not readily available in other environments. Parsoid's interaction with MediaWiki happens exclusively through the web API. A small mock API server has made it straightforward to test Parsoid independently from a MediaWiki installation.
 * The use of HTTP interfaces let us quickly scale Parsoid with Varnish and LVS.
 * The VisualEditor was able to implement HTML-based editing on the client side without having to know anything about wikitext parsing (with some caveats). As long as VE conforms to the HTML specification when posting edited HTML for saving, it doesn't need to worry about wikitext serialization either. The presence of this clearly defined interface helped stand-alone VisualEditor testing. It also made it easier to pinpoint whether the source of an issue was in VisualEditor or Parsoid. VisualEditor as an HTML editor can also be used as a generic HTML editor for HTML-only wikis or other applications. This makes it useful outside the Wiki context and might attract a wider developer community.
 * Flow is now also using Parsoid to provide wikitext editing functionality in Talk pages without having to worry about wikitext itself. HTML storage of discussion entries speeds up the display. The HTML DOM spec allows content post-processing against a defined interface.
 * The new PDF rendering service has been designed to use Parsoid output for ease of reformatting based on the HTML DOM spec.
 * Parsoid has also been recently exposed as a separate external service run independently by Wikimedia's infrastructure and available to third-parties to build their own applications; it is already being used by Kiwix to implement offline wikipedias, the mobile team is starting to use the Parsoid HTML in their upcoming native apps and Google is working on moving their semantic indexing to Parsoid output rather than maintaining their own in-house parsing.

Pool queue
The pool queue is used to limit concurrent actions of some type across the cluster. Since its overhead is added to every parse, it is written in single threaded C for speed. The interface is text over a TCP connection because that is lower overhead (code wise and speed wise) than http.

lsearchd and Elasticsearch
Lsearchd and Elasticsearch are search backends for on-site search. They are written in Java to leverage the excellent Apache-Lucene library. Both of these have an http interface. Elasticsearch's http interface is very extensive.

Backend services
A very common pattern in web applications is a split between front-end and back-end services with a narrow API in between. A part of the backend service API typically doubles as public data API. Internally both the front-end and backend infrastructure might in turn use services.

Backend services typically focus on data storage and -retrieval. This makes storage a fertile starting point for moving towards a service-based architecture. A revision storage implementation motivated by concrete storage needs and more general storage API ideas are discussed in the storage service RFC.