Requests for comment/Services and narrow interfaces

Problem statement
MediaWiki's codebase has mostly grown organically, which led to wide or non-existing internal interfaces. This makes it hard to test parts of the system independently and couples the development of parts strongly. Reasoning about the interaction of different parts of the system is difficult, especially once extensions enter the mix. Fault isolation is less than ideal for the same reason: a fatal error in a minor feature can bring down the entire system.

Additionally, the way clients request data from MediaWiki is changing. Richer clients request more information through APIs, which should ideally perform well even with relatively small requests. New features like notifications require long-running connections, which are difficult to implement in the PHP request processing model. It would be useful to leverage solutions that exist outside the PHP world for some of these applications.

Another problem is organizational. We now have several teams at the Wikimedia foundation working on new features. Currently, each team needs to handle the full stack from the front-end through caching layers and Apaches to the database. This tends to promote tight coupling of storage and code using it, which makes independent optimizations of the backend layers difficult. It also often leads to conflicts over backend issues when deployment is getting closer.

How embracing services can help to solve some of these issues
A common solution to the issues we are facing is to define parts of the system as independent services with clearly defined and narrow interfaces. A popular and ubiquitous style of interface is HTTP. Reasons for its popularity include wide availability of implementations and middleware, a common vocabulary of verbs that can be applied to resources modeling the state (see REST) and reasonable efficiency. Even without a need for distribution, it is often useful to model interfaces in a way that would also easily map to HTTP. The value object RFC proposes complementary principles for PHP code.

Performance and scaling
With modern hardware, parallelism and distribution are the main methods of improving the latency of individual requests. An architecture that makes it easy to process parts of a request in parallel is thus likely to improve the performance of the application. Implementing this as distribution also lets us scale to many machines, and provides good fault isolation without the problems common with naive use of shared state.

Optimizations like batching can often be implemented transparently inside services without breaking the generality of interfaces. Services can opt to batch requests from several clients arriving within a given time window rather than just those from a single client. The transmission of many small messages is optimized by the upcoming HTTP 2.0 standard based on SPDY. SPDY support is already available in node.js, nginx, Apache and others. Support in libcurl is under development.

Operational
Separate, distinct services that are being run independently provide significant advantages from an operational perspective, by breaking down large, often complex operational problems into many smaller ones that are easier to attack. While a per-service architecture can also be explored with a monolithic software architecture (similar to how API application servers running MediaWiki are split into a separate cluster), there are still significant benefits into having well-abstracted separate services.

More specifically, some of the advantages from from an operational perspective are:
 * Monitoring: each function can be monitored independently and performance regressions or service outages can be easier to pinpoint to the specific component at fault (and the respective service owners), rather than requiring a wholistic investigation from scratch.
 * Service isolation: while the possibility of cascading failures is always going to be present, isolating the parts of infrastructure into separate services potentially helps with limiting the extent of many outages. like for example a separate media storage layer has limited site outages caused by NFS failures into partial outages.
 * Scaling: separate services can be scaled as needed, possibly with different hardware characteristics, adjusted to the individual service needs, and into separate availability zones.
 * Maintenance: basic maintenance of individual services can be broken down into smaller, easier maintenance tasks. For example, currently, upgrading the Linux distribution of MediaWiki application servers (e.g. from lucid to precise) is intimidating and a significant burden, due to the vast number of components that need to be examined and prepared in advance. Additionally, each of those system components, map to very different software functionalities and regressions can only be identified by experienced engineers.

Security
Separate services can have considerable benefits from a security point of view, as they limit the attack vectors and allow for increased isolation of the services on the system & network level. For example, imagescaling, lilypond (score) or LaTeX processing (math) currently all run on the application server security context, with each of them running with a full MediaWiki installation, including private settings & passwords, and with unfettered access to the rest of production. Abstracting these into separate purpose-built services would limit a potential vulnerability - either on Wikimedia's code or third-party code - into just the functions that these services provide (e.g. no database access).

Interfaces between teams as a method of organizational scaling
Different teams in the foundation differ in their focus and area of expertise. It would be great to free feature teams from the burden of implementing every detail of the backend infrastructure while at the same time giving backend experts the ability to optimize implementations behind a well-defined interface. Services can help here by splitting a bigger task between members of several teams. The interfaces developed in discussions between teams are more likely to be narrow and informed by both implementer and user concerns. Concerns surface early during interface design rather than being a source of conflicts in final review.

Additionally, separate services decrease the learning curve for each individual service by simplifying the architecture that both new software & operation engineers have to understand before making their contributions. A large, monolithic, architecture is intimidating and more expensive time-wise to master and feel confident to make large changes in.

The storage layer in particular seems to be a good candidate for a service abstraction. This is discussed in the storage service RFC.

Incremental change
A complex system like MediaWiki cannot be rewritten from scratch. We need a way to evolve the system incrementally. By starting to develop parts of the system like Parsoid, Math or PDF rendering as services, we gain the ability to choose the most appropriate technology for each component. The ubiquity of HTTP makes it easy to use service interfaces from a variety of environments. For PHP, a separate RFC proposes a convenient and general HTTP service interface with support for parallelism.

Reusability & community fostering
The Wikimedia infrastructure consist of several individual functions that are all working together to provide the Wiki experience. Some of these functions could be of broader interest if implemented separately and attract numerous users and contributors outside of the traditional target group (users who want to run a Wiki), especially with some more popular choices of software components.

For example, a simple, efficient, imagescaling service with a RESTful API, supporting a Swift backend, multiple formats and various features that we use (e.g. cgroups), that can be implemented without the complexity of MediaWiki's framework (or even in a different language) and that can run independently, could be very appealing to various third-party users, gain popularity on its own and attract contributors from the free software community.

Packaging and small installs
A strength of MediaWiki so far has been the ability to install a very stripped-down version in a PHP-only shared hosting environment. This might be insecure and slow, might not balance HTML tags in content and does not include fancy features like Math or PDF rendering. But it provides an easy starting point for running your own wiki.

In a service architecture, the challenge is to provide a similar experience for small-scale deployments. One answer to this problem can be packaging. Virtual machines running popular Linux distributions like Debian are now available at similar prices as a shared hosting install. With good packaging, the installation of MediaWiki can be as easy as  with optional   and   packages readily available. While there are definitely small overheads associated with running a distributed system in a small VM, this is likely to be offset by the choice of more efficient technologies for individual services. Another option is alternative implementations of some services for resource-constrained environments. Again, narrow interfaces make such drop-in replacements relatively straightforward.

XML, SOAP, WSDL?
What is being proposed here is very close to the Service-oriented architecture (SOA) style. The SOA term is closely associated with XML, SOAP and WSDL. Those are not universally loved. They are also specific implementation -- choices made in specific industries to implement services. We are advocating REST and plain HTTP interfaces.

Fragmentation
While some amount of diversity is good and helps with aging architectures and can result into experimentation with new methods of writing code, there are significant risks of introducing fragmentation. This could be, for example, a proliferation of programming languages or frameworks that are employed in the development of the system as a whole. These, in turn, could increase the amount of domain knowledge and introduction of silos between separate teams that develop independently without looking at the larger picture, as well as increased complexity into the system from an architecture point-of-view.

Media storage
The split of the media storage infrastructure from NFS into Swift was the first large deployment of an HTTP service replacing an existing function of MediaWiki. While the deployment encountered several challenges and as a consequence, had to retain significant media storage application logic in MediaWiki, it has nevertheless provided significant benefits from an operational aspect, such as stability & scalability and is considered a success overall.

For media storage purposes, the Swift protocol was picked, which is an existing industry-standard, RESTful protocol with client bindings for multiple languages, of which we use PHP (for MediaWiki) & Python (for various tools that we wrote) and planning to use the Node.js ones (for the Offline Content Generation project). Multiple competing implementations of the Swift protocol exist, including the canonical Openstack Storage one that we use. We have already experimented with a second implementation (Ceph), with relatively small changes on the MediaWiki side or the rest of our tools.

The deployment is unique, by employing an existing protocol & software for implementing the service, which, in addition to all of the service architecture benefits listed above, is also bringing us fixes & important features implemented by the larger Openstack community without much additional effort from the Wikimedia community.

Parsoid
The Parsoid web service provides a bidirectional conversion interface between Wikitext and HTML as well as a specification of the HTML it generates and accepts. This design has worked out very well in practice:


 * We were able to implement Parsoid in node.js, which lets Parsoid perform well despite doing very complex processing. We were also able to use several specialized libraries like a HTML5 DOM treebuilder at a time where those were not readily available in other environments. Parsoid's interaction with MediaWiki happens exclusively through the web API. A small mock API server has made it straightforward to test Parsoid independently from a MediaWiki installation.
 * The use of HTTP interfaces let us quickly scale Parsoid with Varnish and LVS.
 * The VisualEditor was able to implement HTML-based editing on the client side without having to know anything about wikitext parsing (with some caveats). As long as VE conforms to the HTML specification when posting edited HTML for saving, it doesn't need to worry about wikitext serialization either. The presence of this clearly defined interface helped stand-alone VisualEditor testing. It also made it easier to pinpoint whether the source of an issue was in VisualEditor or Parsoid. VisualEditor as an HTML editor can also be used as a generic HTML editor for HTML-only wikis or other applications. This makes it useful outside the Wiki context and might attract a wider developer community.
 * Flow is now also using Parsoid to provide wikitext editing functionality in Talk pages without having to worry about wikitext itself. HTML storage of discussion entries speeds up the display. The HTML DOM spec allows content post-processing against a defined interface.
 * The new PDF rendering service has been designed to use Parsoid output for ease of reformatting based on the HTML DOM spec.
 * Parsoid has also been recently exposed as a separate external service run independently by Wikimedia's infrastructure and available to third-parties to build their own applications; it is already being used by Kiwix to implement offline wikipedias, the mobile team is starting to use the Parsoid HTML in their upcoming native apps and Google is working on moving their semantic indexing to Parsoid output rather than maintaining their own in-house parsing.