Architecture focus 2015

At the meeting at the Lyon Hackathon, 2015-05-24 (see summary), we identified several key points we think should serve as guiding principles and high-level tasks for the development of the MediaWiki platform for the foreseeable future.


 * TBD: "We" is MediaWiki developers. Make that clearer in the text.

= Content adaptability, structured data and caching =
 * Supporting a wide range of devices and use cases
 * Separating data from presentation
 * Change propagation
 * Content composition and caching

Content representation
How we represent wiki content is an essential question. On the one hand, support for more and more kinds of content is being added to MediaWiki, using the ContentHandler mechanism as well as other means. On the other hand, the shift away from editing wikitext markup towards visual editing allows and requires us to rethink how we want to store and manage textual page content, as well as meta-data such as page categories or media licenses.

Over the next months, we should survey the kinds of content we currently support and want to support in the future, and assess whether the current mechanisms we have for managing different types of content are sufficient. We also need to establish a way to manage multiple different kinds of content together as one page or revision (and perhaps add the notion of sub-revisions), see below.

Multi-content revisions
Making more kinds of wiki content machine-readable and machine-editable requires us to move storage of these kinds of content out of wikitext, where it is currently inlined, typically as template parameters, magic words, or magic links like categories. In addition to that, we need a place to store derived content, such as rendered HTML for different target platforms.


 * TBD: Gabriel: make it clearer that this is mainly about integration with page history and secondary indexes, using multiple Content objects per revision. Storage is really a detail.

Over the next months, we should establish a generic storage interface for storing arbitrary blobs (ideally in a content-addressable way). On top of that, we should establish a lookup service that associates any number of such blobs, along with information about their role, content model and serialization format, with any given revision. This would allow us to manage multiple types of revision-associated content as well as derived content with minimal disruption, though it may not be possible to avoid a breaking change to the XML dump format, if we want to include multiple content objects per revision there. (we currently provide separate dumps per content type, see https://phabricator.wikimedia.org/T93396)
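The two layers described above could be sketched roughly as follows. This is only an illustration in Python, not MediaWiki code: the class names (BlobStore, ContentRecord, RevisionContentLookup), the in-memory dictionaries and the choice of SHA-256 as the content address are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass


class BlobStore:
    """Hypothetical in-memory blob store; the address is the SHA-256 of the bytes."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        address = hashlib.sha256(data).hexdigest()
        self._blobs[address] = data  # identical content always maps to the same address
        return address

    def get(self, address: str) -> bytes:
        return self._blobs[address]


@dataclass(frozen=True)
class ContentRecord:
    """Hypothetical association of one blob with its role, content model and format."""
    role: str
    content_model: str
    serialization_format: str
    blob_address: str


class RevisionContentLookup:
    """Hypothetical lookup service: revision id -> {role: ContentRecord}."""

    def __init__(self):
        self._by_revision = {}

    def attach(self, revision_id: int, record: ContentRecord) -> None:
        self._by_revision.setdefault(revision_id, {})[record.role] = record

    def contents(self, revision_id: int) -> dict:
        # Return a copy so callers cannot mutate the index by accident.
        return dict(self._by_revision.get(revision_id, {}))


store = BlobStore()
lookup = RevisionContentLookup()

# Primary content and derived content (rendered HTML) attached to the same revision.
wikitext_address = store.put(b"'''Example''' page text")
html_address = store.put(b"<p><b>Example</b> page text</p>")
lookup.attach(42, ContentRecord("main", "wikitext", "text/x-wiki", wikitext_address))
lookup.attach(42, ContentRecord("rendered-html", "html", "text/html", html_address))
```

Because the address is derived from the content, storing the same bytes twice yields the same address, so unchanged content could be shared between revisions without being duplicated.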


 * TBD: Gabriel: RESTBase is an example for how this can work for blob storage -- it already stores multiple revision-associated content types, and supports listings on them. I see bigger challenges in:
 * more complex metadata indexing (RB secondary indexes or ElasticSearch? ContentHandler)
 * integration with history, link tables (should work via Revision and ContentHandler mechanisms)
 * change propagation

Generalized transclusion
MediaWiki currently features several transclusion mechanisms (image, template, special page, parser function, Wikidata usage, dynamic injection of graphs, etc.). These mechanisms should be generalized, and the interfaces involved should be streamlined to allow a content representation based on composing elements of different kinds. The ultimate goal should be to assemble page content as late as possible, at the edge of our application (see "Request routing and caching" below), or even on the client.

Over the next months, we should investigate how the different transclusion mechanisms used with wikitext content can be unified and extended to work with non-wikitext content. Special attention should be given to the handling of parameters, especially parameters that contain markup and, quite possibly, further transclusions. This may involve generalizing the preprocessor facility to work with other kinds of content beyond wikitext.
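One way to picture a generalized mechanism is a page made of literal text plus typed transclusion nodes, each resolved by a pluggable handler, with parameters that may themselves contain transclusions. The Python sketch below is only an assumption-laden illustration: the Transclusion class, the assemble function and the resolver signatures are invented for this example and do not describe the existing preprocessor.

```python
from dataclasses import dataclass, field


@dataclass
class Transclusion:
    """Hypothetical node: 'kind' selects the resolver, 'params' values are lists of parts."""
    kind: str
    target: str
    params: dict = field(default_factory=dict)


def assemble(parts, resolvers):
    """Expand a list of parts (strings or Transclusion nodes) into a single string."""
    output = []
    for part in parts:
        if isinstance(part, str):
            output.append(part)
        else:
            # Parameters are expanded first, so nested transclusions are resolved
            # before the enclosing one sees them.
            expanded = {name: assemble(value, resolvers)
                        for name, value in part.params.items()}
            output.append(resolvers[part.kind](part.target, expanded))
    return "".join(output)


# Hypothetical resolvers; real ones would fetch and render actual content.
resolvers = {
    "template": lambda target, params: f"[{target}: {params.get('caption', '')}]",
    "image": lambda target, params: f"<img {target}>",
}

page = [
    "Intro ",
    Transclusion("template", "Infobox",
                 {"caption": ["See ", Transclusion("image", "Map.png")]}),
    " end",
]
html = assemble(page, resolvers)
```

Since assembly only needs the parts and a resolver table, the same step could in principle run in the application, in the cache layer, or on the client.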

Change propagation
A challenge with the decomposition of content into multiple bits of data is the systematic propagation of changes through the system. Our current methods of tracking dependencies and scheduling asynchronous updates are relatively difficult to extend to new types of content, and show some signs of strain. With more dependencies to track and more types of content to update, we will need to improve the scalability, ergonomics and efficiency of change propagation.

Over the next months, we should investigate a generalized approach to change propagation, perhaps based on the publish/subscribe model. We should outline how such a system would be used to cover current and anticipated use cases.
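As a rough illustration of the publish/subscribe idea, the following Python sketch lets dependents subscribe to a resource and propagates a change transitively. The ChangeBus class, the resource names and the convention that a callback returns the resources it changed are assumptions made for this example; a real system would run asynchronously and persist its subscriptions.

```python
from collections import deque


class ChangeBus:
    """Hypothetical in-process publish/subscribe bus for change events."""

    def __init__(self):
        self._subscribers = {}  # resource name -> list of callbacks

    def subscribe(self, resource, callback):
        self._subscribers.setdefault(resource, []).append(callback)

    def publish(self, resource):
        """Notify subscribers breadth-first; each resource is processed at most once."""
        queue = deque([resource])
        seen = {resource}
        while queue:
            current = queue.popleft()
            for callback in self._subscribers.get(current, []):
                # A callback returns the resources it changed in turn (possibly none).
                for downstream in callback(current) or []:
                    if downstream not in seen:
                        seen.add(downstream)
                        queue.append(downstream)


bus = ChangeBus()
log = []


def rerender_page(changed):
    log.append(f"re-render Page:Example because {changed} changed")
    return ["Page:Example"]


def purge_cache(changed):
    log.append(f"purge cached HTML of {changed}")
    return []


bus.subscribe("Template:Infobox", rerender_page)
bus.subscribe("Page:Example", purge_cache)
bus.publish("Template:Infobox")
```

Editing the template triggers a re-render of the page that uses it, which in turn triggers a purge of the cached HTML; the `seen` set keeps cyclic dependencies from looping forever.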

Request routing and caching
The web caching layer used on the Wikimedia cluster is quite efficient for static files and "canonical" page content, but not very smart about other kinds of content. We should be able to cache (and efficiently purge) various "renderings" of a given page, for different user languages or target devices. Eventually, we should move towards a model where different parts of an HTML page get assembled only at the edge of the application, at the web cache layer (or even on the client). In addition, support for multiple data centers should be improved by allowing requests to be routed based solely on the URL.

Over the next months, we should work on improving routing and caching based on URLs (subdomains, paths). The primary use cases are caching for anonymous users of multilingual sites like Commons and Wikidata, and routing of edit requests to the primary data center. Furthermore, we should investigate how transclusion mechanisms could be implemented in the caching layer ("late assembly").
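To make "routing based solely on the URL" concrete, here is a small Python sketch. The URL layout is purely hypothetical (a language code as the first path segment and an "edit" path segment for edit requests), as are the host name, the language list and the shape of the returned decision; none of this reflects the actual Wikimedia URL scheme.

```python
from urllib.parse import urlsplit

# Assumption for this example: the user language is the first path segment.
KNOWN_LANGUAGES = {"en", "de", "fr"}


def route(url):
    """Decide data center and cache key from the URL alone (no cookies, no headers)."""
    parts = urlsplit(url)
    segments = [segment for segment in parts.path.split("/") if segment]

    language = segments[0] if segments and segments[0] in KNOWN_LANGUAGES else None
    rest = segments[1:] if language else segments

    # Assumption for this example: edit requests use an "edit" path segment.
    is_edit = bool(rest) and rest[0] == "edit"

    return {
        "data_center": "primary" if is_edit else "nearest",
        "language": language,
        # Host plus path identifies one rendering, so each language is cached separately.
        "cache_key": None if is_edit else (parts.hostname, parts.path),
    }


view = route("https://commons.example.org/de/wiki/File:Example.jpg")
edit = route("https://commons.example.org/de/edit/File:Example.jpg")
```

With the language in the path, the German and English renderings of a page get distinct cache keys, and the edit request can be sent to the primary data center without inspecting anything but the URL.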

= General architectural concerns =
These will be addressed as we work through "Content adaptability, structured data and caching" but apply more broadly.

Modularity and testability
MediaWiki has grown for more than a decade now, and several parts of the code base have become hard to maintain. In particular, a large part of the code base is inter-dependent, cannot be re-used without all the other parts, or even be tested without a dummy database. Improving modularity should improve maintainability, testability and reusability, allowing us to add new features more easily, and refactor with more confidence.

Over the next months, we aim to develop an architecture guideline describing best practices for designing classes and other components. We should also work on refactoring some key parts of MediaWiki core towards compliance with these best practices. A good starting point seems to be high-profile classes like Title or User, which could in many places be replaced by "dumb" value objects like the existing TitleValue or new classes like PageRecord and UserRecord.
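The appeal of a "dumb" value object is that it carries data and nothing else, so it can be created and tested without a database or any global state. The Python sketch below is loosely modeled on the idea behind TitleValue but is not a port of the PHP class: the name TitleValueSketch, its fields and its validation rule are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TitleValueSketch:
    """Hypothetical immutable page title: a namespace number plus a database key."""
    namespace: int
    dbkey: str

    def __post_init__(self):
        # Assumed rule for this example: keys are non-empty and use underscores.
        if not self.dbkey or " " in self.dbkey:
            raise ValueError("dbkey must be non-empty and use underscores, not spaces")

    def text(self) -> str:
        """Human-readable form of the key."""
        return self.dbkey.replace("_", " ")


main_page = TitleValueSketch(0, "Main_Page")
same_page = TitleValueSketch(0, "Main_Page")
```

Two instances with the same fields compare equal and neither can be modified after construction, which makes such objects safe to pass between components and easy to use in unit tests.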

Service-oriented architecture
MediaWiki is being developed towards being a versatile collaborative content management platform that can handle various kinds of content, can be federated with other wikis, and can be integrated with different kinds of services. To allow federation and integration in a less ad-hoc fashion, as well as to improve scalability, the idea of improving modularization should be extended to the level of services, leading to a service-oriented architecture (SOA).

Over the next months, we should develop an SOA strategy, identifying some key services we want to define and implement. Furthermore, we should specify how services are represented and used from inside the PHP code base. We also need to decide whether and how we want to continue supporting a "monolithic" distribution bundle that can easily be used on a shared hosting plan.