Requests for comment/Services and narrow interfaces

Problem statement
MediaWiki's codebase has mostly grown organically, which has led to wide or non-existent internal interfaces. This makes it hard to test parts of the system independently and tightly couples the development of its parts. Reasoning about the interaction of different parts of the system is difficult, especially once extensions enter the mix. Fault isolation suffers for the same reason: a fatal error in a minor feature can bring down the entire system.

Additionally, the way clients request data from MediaWiki is changing. Richer clients request more information through APIs, which should ideally perform well even with relatively small requests. New features like notifications require long-running connections, which are difficult to implement in the PHP request processing model. It would be useful to leverage solutions that exist outside the PHP world for some of these applications.

Another problem is organizational. We now have several teams at the Wikimedia Foundation working on new features. Currently, each team needs to handle the full stack from the front-end through caching layers and Apaches to the database. This tends to promote tight coupling of storage and the code using it, which makes independent optimization of the backend layers difficult. It also often leads to conflicts over backend issues as deployment draws closer.

How embracing services can help to solve some of these issues
A common solution to the issues we are facing is to define parts of the system as independent services with clearly defined and narrow interfaces. A popular and ubiquitous style of interface is HTTP. Reasons for its popularity include wide availability of implementations and middleware, a common vocabulary of verbs that can be applied to resources modeling the state (see REST) and reasonable efficiency. Even without a need for distribution, it is often useful to model interfaces in a way that would also easily map to HTTP. The value object RFC proposes complementary principles for PHP code.
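As a sketch of what such a narrow, HTTP-shaped interface can look like, the following models a single hypothetical page resource whose entire interface is a handful of verbs; the routes and the in-memory storage are invented for illustration and are not MediaWiki's actual API:

```javascript
// Sketch: a narrow interface modelled as HTTP verbs applied to a resource.
// The /page/{title} route and in-memory store are hypothetical.
const store = new Map();

function handle(method, path, body) {
  // The path names the resource; the verb carries the intent.
  const title = path.replace(/^\/page\//, '');
  switch (method) {
    case 'GET': // read current resource state
      return store.has(title)
        ? { status: 200, body: store.get(title) }
        : { status: 404 };
    case 'PUT': // create or replace resource state
      store.set(title, body);
      return { status: 204 };
    case 'DELETE': // remove the resource
      store.delete(title);
      return { status: 204 };
    default:
      return { status: 405 }; // verb not supported on this resource
  }
}
```

Because the whole contract is "verbs on resources", such an interface maps directly onto HTTP when the component is later split out as a service, but can equally be called in-process until then.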

Performance and scaling
With modern hardware, parallelism and distribution are the main methods of improving the latency of individual requests. An architecture that makes it easy to process parts of a request in parallel is thus likely to improve the performance of the application. Implementing this as distribution also lets us scale to many machines, and provides good fault isolation without the problems common with naive use of shared state.
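As an illustration of processing parts of a request in parallel, the sketch below composes a response from three independent sub-requests; the service names and calls are hypothetical stand-ins for real backends:

```javascript
// Sketch: composing a page response from independent sub-requests.
// Each call could go to a different service; none blocks the others,
// so overall latency approaches that of the slowest sub-request.
async function renderPage(title, services) {
  const [html, meta, links] = await Promise.all([
    services.parse(title),    // hypothetical parsing service
    services.metadata(title), // hypothetical metadata service
    services.links(title),    // hypothetical link service
  ]);
  return { html, meta, links };
}
```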

Optimizations like batching can often be implemented transparently inside services without breaking the generality of interfaces. Services can opt to batch requests from several clients arriving within a given time window rather than just those from a single client. The transmission of many small messages is optimized by the upcoming HTTP 2.0 standard based on SPDY. SPDY support is already available in node.js, nginx, Apache and others. Support in libcurl is under development.
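One way such transparent batching can be implemented is to queue lookups that arrive close together and issue a single bulk backend call. The sketch below coalesces all requests arriving in the same tick (a fixed time window works the same way, with a timer in place of the microtask); `fetchBatch` stands in for a hypothetical bulk backend query:

```javascript
// Sketch: transparent batching inside a service. Callers still see a
// one-key interface; the coalescing is an internal optimization.
function makeBatcher(fetchBatch) {
  let pending = null; // { keys, resolvers } for the batch being collected
  return function get(key) {
    if (!pending) {
      pending = { keys: [], resolvers: [] };
      // Flush once all requests queued in the current tick are collected.
      queueMicrotask(async () => {
        const batch = pending;
        pending = null;
        const results = await fetchBatch(batch.keys); // one backend call
        batch.resolvers.forEach((resolve, i) => resolve(results[i]));
      });
    }
    pending.keys.push(key);
    return new Promise((resolve) => pending.resolvers.push(resolve));
  };
}
```

Note that the batch may mix keys from several independent clients; the interface exposed to each of them remains a simple single-key lookup.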

Operational
Separate, distinct services that are run independently provide significant advantages from an operational perspective, by breaking large, often complex operational problems down into many smaller ones that are easier to attack. While a per-service architecture can also be explored with a monolithic software architecture (similar to how API application servers running MediaWiki are split into a separate cluster), there are still significant benefits to having well-abstracted separate services.

More specifically, some of the advantages from an operational perspective are:
 * Monitoring: each function can be monitored independently, and performance regressions or service outages are easier to pinpoint to the specific component at fault (and the respective service owners), rather than requiring a holistic investigation from scratch.
 * Service isolation: while the possibility of cascading failures will always be present, isolating parts of the infrastructure into separate services helps limit the extent of many outages. For example, a separate media storage layer has reduced site outages caused by NFS failures to partial outages.
 * Scaling: separate services can be scaled as needed, possibly with different hardware characteristics, adjusted to the individual service needs, and into separate availability zones.
 * Maintenance: basic maintenance of individual services can be broken down into smaller, easier tasks. For example, upgrading the Linux distribution of the MediaWiki application servers (e.g. from lucid to precise) is currently intimidating and a significant burden, due to the vast number of components that need to be examined and prepared in advance. Additionally, each of those system components maps to very different software functionality, and regressions can only be identified by experienced engineers.

Security
Separate services can have considerable benefits from a security point of view, as they limit attack vectors and allow for increased isolation of the services at the system & network level. For example, image scaling, lilypond (score) and LaTeX (math) processing currently all run in the application server security context, each with a full MediaWiki installation, including private settings & passwords, and with unfettered access to the rest of production. Abstracting these into separate purpose-built services would limit a potential vulnerability (whether in Wikimedia's code or third-party code) to just the functions that these services provide (e.g. no database access).

Interfaces between teams as a method of organizational scaling
Different teams at the Foundation differ in their focus and areas of expertise. It would be great to free feature teams from the burden of implementing every detail of the backend infrastructure while at the same time giving backend experts the ability to optimize implementations behind a well-defined interface. Services can help here by splitting a bigger task between members of several teams. The interfaces developed in discussions between teams are more likely to be narrow and informed by both implementer and user concerns. Concerns surface early during interface design rather than becoming a source of conflict in final review.

Additionally, separate services decrease the learning curve for each individual service by simplifying the architecture that both new software & operations engineers have to understand before making their contributions. A large, monolithic architecture is intimidating and takes considerably more time to master before one feels confident making large changes to it.

The storage layer in particular seems to be a good candidate for a service abstraction. This is discussed in the storage service RFC.

Incremental change
A complex system like MediaWiki can't be rewritten from scratch. We need a way to evolve the system incrementally. By starting to develop parts of the system like Parsoid, Math or PDF rendering as services, we gain the ability to choose the most appropriate technology for each component. The ubiquity of HTTP makes it easy to use service interfaces from a variety of environments. For PHP, a separate RFC proposes a convenient and general HTTP service interface with support for parallelism.

Reusability & community fostering
The Wikimedia infrastructure consists of several individual functions that work together to provide the wiki experience. Some of these functions, if implemented separately, could be of broader interest and attract numerous users and contributors outside of the traditional target group (users who want to run a wiki), especially with more popular choices of software components.

For example, a simple, efficient image scaling service with a RESTful API, supporting a Swift backend, multiple formats and the various features that we use (e.g. cgroups), implemented without the complexity of MediaWiki's framework (or even in a different language) and able to run independently, could be very appealing to various third-party users, gain popularity of its own and attract contributors from the free software community.

Packaging and small installs
A strength of MediaWiki has so far been the ability to install a very stripped-down version in a PHP-only shared hosting environment. Such an installation might be insecure and slow, might not balance HTML tags in content and does not include fancy features like Math or PDF rendering, but it provides an easy starting point for running your own wiki.

In a service architecture, the challenge is to provide a similar experience for small-scale deployments. One answer to this problem can be packaging. Virtual machines running popular Linux distributions like Debian are now available at prices similar to a shared hosting install. With good packaging, the installation of MediaWiki can be as easy as installing a single package, with optional add-on packages readily available. While there are definitely small overheads associated with running a distributed system in a small VM, this is likely to be offset by the choice of more efficient technologies for individual services. Another option is alternative implementations of some services for resource-constrained environments. Again, narrow interfaces make such drop-in replacements relatively straightforward.

Possible critiques from a Service-Oriented-Architecture point of view
What is being proposed here is effectively to move the MediaWiki architecture to embrace the service-oriented architecture (SOA) approach. Normally, this might bring up nightmarish scenarios of XML, SOAP and WSDL configurations. We mention those here to clarify that they were specific implementation choices made in specific industries; there is no reason to go down those implementation routes. REST and plain HTTP interfaces are what we advocate.

Fragmentation
While some amount of diversity is good, helps with aging architectures and can result in experimentation with new methods of writing code, there are significant risks of introducing fragmentation. This could be expressed, for example, as a proliferation of programming languages or frameworks employed in the development of the system as a whole. These, in turn, could increase the amount of domain knowledge required, introduce silos between separate teams that develop independently without looking at the larger picture, and add complexity to the system from an architectural point of view.

Parsoid & VisualEditor
The largest example of a service architecture in the current Wikimedia infrastructure is the pair of the Parsoid and VisualEditor (VE) projects. Parsoid provides a bidirectional conversion interface between wikitext and HTML as well as a specification of the HTML it generates. This allows VE to implement HTML-based editing on the client side without having to know anything about wikitext parsing (with some caveats). As long as VE conforms to the HTML specification when posting edited HTML for saving, it doesn't need to worry about wikitext serialization either. This approach has provided both projects with the following benefits:
 * Since Parsoid solely concerns itself with converting between wikitext and HTML, it doesn't worry itself with wikitext storage, skins, resource handling, user accounts, MediaWiki configuration, interwikis, or any of the other CMS functionality that MediaWiki provides. This let us implement Parsoid as a service with clearly defined interfaces for using it, and with a clearly defined spec as to what the HTML output means. In addition, this freed us from having to write it in PHP and let us pick the technology that worked best for the situation. In this case, Parsoid is implemented in node.js and interfaces with the rest of MediaWiki via the publicly available MediaWiki HTTP API, both for accessing the wikitext of a title and for accessing information that affects wikitext parsing (e.g. the MediaWiki configuration).
 * While testing, Parsoid only needs to worry about keeping up its end of the bargain -- "accurate" bidirectional conversion between wikitext and HTML -- without concerning itself with who gives it wikitext or HTML. This lets both the Parsoid and VE projects focus on errors within their own projects, which narrows down the source of errors more than if they had been more tightly coupled.
 * Since Parsoid is client-agnostic, it can be used in a variety of applications. For example, Flow is using Parsoid to clean up Talk pages and provide the familiar wikitext editing that (a subset of) users want without having to worry about wikitext itself, and the new PDF rendering service has been designed to use Parsoid output for ease of reformatting. Additionally, Parsoid has recently been exposed as a separate external service, run independently as part of Wikimedia's infrastructure and available to third parties for building their own applications; it is already being used by Kiwix to implement offline Wikipedias, the mobile team is starting to use Parsoid HTML in their upcoming native apps, and Google is working on moving their semantic indexing to Parsoid output rather than maintaining their own in-house parsing.
 * VE itself is not bound to Parsoid and wikitext. It can be used to implement HTML-only wikis, since its internal architecture is strongly HTML-based. This is an obvious benefit that accrues from splitting the problem of visual editing on wikis into parsing wikitext to HTML and editing HTML.

Media storage
The split of the media storage infrastructure from NFS into Swift was the first large deployment of an HTTP service replacing an existing function of MediaWiki. While the deployment has met several challenges, and as a consequence there is still significant media storage application logic in MediaWiki, it has provided significant benefits from an operational aspect, such as stability & scalability, and is considered a success overall.

For media storage purposes, the Swift protocol was picked: an existing, industry-standard RESTful protocol with client bindings for multiple languages, of which we use the PHP (for MediaWiki) & Python (for various tools that we wrote) ones, and are planning to use the Node.js ones (for the Offline Content Generation project). Multiple competing implementations of the Swift protocol exist, including the canonical OpenStack Storage one that we use. We have already experimented with a second implementation (Ceph), with relatively small changes on the MediaWiki side or in the rest of our tools.
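To give a flavour of the Swift convention, the helper below (an invented name, not part of any client library) composes the request that the protocol's verb-on-object model implies: the storage URL obtained at authentication time is extended with the container and object names, the verb selects the operation, and the X-Auth-Token header carries authorization:

```javascript
// Sketch of Swift's RESTful verb-on-object convention. The helper and its
// shape are invented for illustration; real clients add many more options.
function swiftRequest(storageUrl, token, method, container, object) {
  return {
    method, // 'PUT' uploads, 'GET' downloads, 'DELETE' removes the object
    url: `${storageUrl}/${encodeURIComponent(container)}/${encodeURIComponent(object)}`,
    headers: { 'X-Auth-Token': token }, // token from the auth endpoint
  };
}
```

Because the whole contract is plain HTTP of this shape, any implementation that speaks it (OpenStack Storage, Ceph's compatibility layer) can stand behind the same clients.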

The deployment is unique in employing an existing protocol & software to implement the service, which, in addition to all of the service architecture benefits listed above, also brings us fixes & important features implemented by the larger OpenStack community without much additional effort from the Wikimedia community.