Architecture focus 2015

At the meeting at the Lyon Hackathon, 2015-05-24 (see summary), we identified several key points we think should serve as guiding principles and high-level tasks for the development of the MediaWiki platform for the foreseeable future.

Content adaptability, structured data and caching
This section describes some key concerns that were identified with respect to the way content is stored, managed, combined, and updated by MediaWiki. New challenges arise in particular from the increased need to support a wide range of devices, as well as from new use cases, especially for machine-readable content.

Content representation
How MediaWiki represents wiki content is an essential question. Support for more and more kinds of content is being added to MediaWiki, using the ContentHandler mechanism as well as other means. At the same time, the shift away from editing wikitext markup towards visual editing allows and requires new concepts for the storage and management of textual page content, as well as of metadata such as page categories or media licenses.

Over the next months, the MediaWiki developer community and staff should survey the kinds of content MediaWiki currently supports and should support in the future, and assess whether the current mechanisms MediaWiki has for managing different types of content are sufficient. In this context, a way to manage multiple different kinds of content together as one page or revision should be defined (see below).

Multi-content revisions
Making more kinds of wiki content machine-readable and machine-editable requires the storage of this content to be moved out of wikitext, where it is currently inlined, typically as template parameters, magic words, or magic links like categories. In addition, a place to store derived content, such as rendered HTML for different target platforms, is needed. The key concern here is the integration of the additional content with MediaWiki's revision management facilities, such as the page history, recentchanges, diffs, and link tables. The mechanism used for the actual storage of content blobs (e.g. the new RESTBase or the old ExternalStore) should be a detail hidden behind an abstract blob storage service interface.

Over the next months, the MediaWiki developer community and staff should establish a generic storage interface for storing arbitrary blobs (ideally in a content-addressable way). On top of that, a lookup service that associates any number of such blobs with a revision would be useful. This would allow managing multiple types of revision-associated content as well as derived content with minimal disruption, though it may not be possible to avoid a breaking change to the XML dump format (WMF currently provides separate dumps for some content types; see https://phabricator.wikimedia.org/T93396).
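As a rough illustration of these two layers, the following Python sketch pairs a content-addressable blob store with a lookup service that associates named blobs with a revision. It is not an existing MediaWiki interface: the class names, the use of SHA-256 as the address, and the "slot" names are all assumptions made for this example.

```python
import hashlib
from collections import defaultdict


class InMemoryBlobStore:
    """Hypothetical content-addressable store: the address is the SHA-256 of the bytes."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        address = hashlib.sha256(data).hexdigest()
        self._blobs[address] = data  # identical content is stored only once
        return address

    def get(self, address: str) -> bytes:
        return self._blobs[address]


class RevisionContentLookup:
    """Hypothetical lookup service: maps (revision id, slot name) to a blob address."""

    def __init__(self, store):
        self._store = store
        self._slots = defaultdict(dict)

    def attach(self, rev_id: int, slot: str, data: bytes) -> str:
        address = self._store.put(data)
        self._slots[rev_id][slot] = address
        return address

    def slots(self, rev_id: int) -> dict:
        # Return a copy so callers cannot modify the internal mapping.
        return dict(self._slots.get(rev_id, {}))

    def content(self, rev_id: int, slot: str) -> bytes:
        return self._store.get(self._slots[rev_id][slot])


store = InMemoryBlobStore()
lookup = RevisionContentLookup(store)
lookup.attach(1, "main", b"Some article text")
lookup.attach(1, "categories", b'["Physics"]')
lookup.attach(2, "main", b"Some article text, edited")
lookup.attach(2, "categories", b'["Physics"]')  # unchanged: reuses the same blob
```

Because the address is derived from the content, the unchanged category blob is shared between both revisions, and the lookup service never needs to know which backend actually holds the bytes.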

Generalized transclusion
MediaWiki currently features several transclusion mechanisms (image, template, special page, parser function, Wikidata usage, dynamic injection of graphs, etc.). The transclusion mechanism should be generalized, and the interfaces involved should be streamlined to allow a content representation based on composing elements of different kinds. The ultimate goal should be to assemble page content as late as possible, at the edge of our application (see "smart caching" below), or even on the client.

Over the next months, the MediaWiki developer community and staff should investigate how the different transclusion mechanisms used with wikitext content can be unified and extended to work with non-wikitext content. Special attention should be given to the handling of parameters, especially parameters that contain markup and, quite possibly, further transclusions. This may involve generalizing the preprocessor facility to work with kinds of content other than wikitext.
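One possible shape for such a unified mechanism is sketched below in Python: every transclusion is a node with a kind, a target, and parameters, and a parameter may itself be a transclusion. This is only a sketch under stated assumptions; the `Transclusion` class, the resolver registry, and the sample template and data item are invented for illustration and do not correspond to existing MediaWiki code.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Union


@dataclass
class Transclusion:
    """Hypothetical node: 'kind' selects the resolver, 'params' may nest further nodes."""
    kind: str
    target: str
    params: Dict[str, "Node"] = field(default_factory=dict)


Node = Union[str, Transclusion]

# Registry of resolvers, one per kind of transcluded content.
RESOLVERS: Dict[str, Callable[[str, Dict[str, str]], str]] = {}


def resolver(kind):
    def register(fn):
        RESOLVERS[kind] = fn
        return fn
    return register


@resolver("template")
def expand_template(target, params):
    templates = {"Greeting": "Hello, {name}!"}  # made-up sample template
    return templates[target].format(**params)


@resolver("data")
def expand_data(target, params):
    items = {"Q42": "Douglas Adams"}  # made-up sample data item
    return items[target]


def expand(node, depth=0, max_depth=10):
    """Expand parameters first, then the node itself, with a nesting limit."""
    if isinstance(node, str):
        return node
    if depth >= max_depth:
        raise RecursionError("transclusion nesting too deep")
    params = {k: expand(v, depth + 1, max_depth) for k, v in node.params.items()}
    return RESOLVERS[node.kind](node.target, params)


page = Transclusion("template", "Greeting", {"name": Transclusion("data", "Q42")})
rendered = expand(page)
```

Here the template resolver never needs to know that its `name` parameter came from a different kind of content; the same `expand` function could in principle run on the server, at the cache layer, or on the client.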

Change propagation
A challenge with the decomposition of content into multiple bits of data is the systematic propagation of changes through the system. Our current methods of tracking dependencies and scheduling asynchronous updates are relatively difficult to extend to new types of content, and show some signs of strain. With more dependencies to track and more types of content to update, the scalability, ergonomics and efficiency of change propagation are becoming crucial.

Over the next months, the MediaWiki developer community and staff should investigate a generalized approach to change propagation, perhaps based on the publish/subscribe model. The application of such a system to current and anticipated use cases should be outlined.
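To make the publish/subscribe idea more concrete, here is a minimal in-process Python sketch. It assumes a single hypothetical topic name (`resource-changed`) and made-up resource identifiers; a real system would be asynchronous and distributed, which this example deliberately ignores.

```python
from collections import defaultdict, deque


class ChangeBus:
    """Hypothetical in-process message bus keyed by topic name."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        for handler in list(self._subscribers[topic]):
            handler(event)


class DependencyTracker:
    """Records which resources depend on which, and collects stale ones on change."""

    def __init__(self, bus):
        self._dependents = defaultdict(set)
        self.stale = []
        bus.subscribe("resource-changed", self.on_change)

    def depends_on(self, dependent, dependency):
        self._dependents[dependency].add(dependent)

    def on_change(self, event):
        # Breadth-first walk over direct and indirect dependents.
        queue = deque([event["resource"]])
        seen = {event["resource"]}
        while queue:
            current = queue.popleft()
            for dependent in sorted(self._dependents.get(current, ())):
                if dependent not in seen:
                    seen.add(dependent)
                    self.stale.append(dependent)
                    queue.append(dependent)


bus = ChangeBus()
tracker = DependencyTracker(bus)
tracker.depends_on("page:Paris/html", "template:Infobox")
tracker.depends_on("page:Paris/html", "item:Q90")
tracker.depends_on("page:Paris/mobile-html", "page:Paris/html")

bus.publish("resource-changed", {"resource": "template:Infobox"})
# tracker.stale is now ["page:Paris/html", "page:Paris/mobile-html"]
```

The publisher of the template change knows nothing about renderings or devices; new kinds of derived content only need to register their dependencies and subscribe.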

Request routing and caching
The web caching layer used on the Wikimedia cluster is quite efficient for static files and "canonical" page content, but not very smart about other kinds of content. It should be possible to cache (and efficiently purge) various "renderings" of a given page, for different user languages or target devices. Eventually, MediaWiki should move towards a model where different parts of an HTML page get assembled only at the edge of the application, at the web cache layer (or even on the client). In addition, support for multiple data centers should be improved by allowing requests to be routed based solely on the URL.

Over the next months, the MediaWiki developer community and staff should work on improving routing and caching based on URLs (subdomains, paths). The primary use cases are caching for anonymous users of multilingual sites like Commons and Wikidata, and routing of edit requests to the primary data center. Furthermore, the implementation of transclusion mechanisms in the caching layer ("late assembly") should be investigated.
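The following Python sketch shows what deciding "based solely on the URL" could look like. The path layout `/<language>/<action>/<title>`, the data center names, and the example host are assumptions made for this illustration only; they are not the URL scheme of any existing Wikimedia site.

```python
from urllib.parse import urlsplit

PRIMARY_DC = "dc-primary"    # hypothetical name: the data center that accepts writes
NEAREST_DC = "dc-nearest"    # hypothetical name: any read-only data center
WRITE_ACTIONS = {"edit"}


def route(url):
    """Derive the routing target and cache key from the URL alone."""
    parts = urlsplit(url)
    segments = parts.path.split("/", 3)  # ["", language, action, title]
    if len(segments) != 4 or not all(segments[1:]):
        raise ValueError(f"unexpected path layout: {parts.path}")
    _, language, action, title = segments
    is_write = action in WRITE_ACTIONS
    return {
        "data_center": PRIMARY_DC if is_write else NEAREST_DC,
        "language": language,
        # Writes are never cached; reads get one cache entry per language.
        "cache_key": None if is_write else f"{parts.hostname}/{language}/{title}",
    }


german_view = route("https://commons.example.org/de/wiki/File:Foo.jpg")
french_view = route("https://commons.example.org/fr/wiki/File:Foo.jpg")
edit_request = route("https://commons.example.org/de/edit/File:Foo.jpg")
```

Since the user language is part of the path rather than a cookie or a preference, the cache can store the German and French renderings separately for anonymous users, and the router can send the edit request to the primary data center without inspecting anything but the URL.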

General architectural concerns
This section is about more general principles that should be applied when addressing the issues described in Content adaptability, structured data and caching.

Modularity and testability
MediaWiki has grown for more than a decade now, and several parts of the code base have become hard to maintain. In particular, a large part of the code base is interdependent: it cannot be reused without all the other parts, or even be tested without a dummy database. Improving modularity should improve maintainability, testability and reusability, allowing new features to be added more easily and code to be refactored with more confidence. A good example is splitting code out into a PHP library that has no dependency on MediaWiki and is required by core or an extension via Composer.

Over the next months, the MediaWiki developer community and staff aim to develop an architecture guideline describing best practices for designing classes and other components. To validate the newly defined best practices, some key parts of MediaWiki core should be refactored towards compliance, and the experience with that refactoring should be discussed and documented to improve the guidelines. A good starting point seems to be high-profile classes like Title or User, which could in many places be replaced by "dumb" value objects like the existing TitleValue or new classes like PageRecord and UserRecord. Unnecessary complications for using Composer should be resolved.
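The difference between a "dumb" value object and a class that reaches into the database can be sketched as follows. The example is written in Python for brevity and is only loosely inspired by the idea behind TitleValue; `PageTitle`, `PageLookup`, and `link_css_class` are hypothetical names, not MediaWiki APIs.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class PageTitle:
    """Hypothetical value object: immutable, no database access, no global state."""
    namespace: int
    text: str


class PageLookup(Protocol):
    """Hypothetical service interface; a production version would query storage."""
    def exists(self, title: PageTitle) -> bool: ...


class InMemoryPageLookup:
    """Test double that satisfies PageLookup without any database."""

    def __init__(self, existing):
        self._existing = set(existing)

    def exists(self, title: PageTitle) -> bool:
        return title in self._existing


def link_css_class(title: PageTitle, lookup: PageLookup) -> str:
    """Pure logic under test: links to missing pages get a marker class."""
    return "" if lookup.exists(title) else "new"


lookup = InMemoryPageLookup([PageTitle(0, "Paris")])
existing_class = link_css_class(PageTitle(0, "Paris"), lookup)
missing_class = link_css_class(PageTitle(0, "Atlantis"), lookup)
```

Because the value object carries only data and the lookup is passed in explicitly, the link logic can be unit tested with an in-memory stand-in instead of a dummy database.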

Service-oriented architecture
MediaWiki is being developed towards being a versatile collaborative content management platform that can handle various kinds of content, can be federated with other wikis, and can be integrated with different kinds of services. To allow federation and integration in a less ad-hoc fashion, as well as to improve scalability, the idea of improving modularization should be extended to the level of services, leading to a service-oriented architecture (SOA).

Over the next months, the MediaWiki developer community and staff should develop a SOA strategy, identifying some key services to define and implement. Furthermore, guidelines for the representation and use of services in PHP code should be defined. One key question in this context is whether support for a "monolithic" distribution bundle for the LAMP platform should be continued, or whether MediaWiki should move to a distribution of VMs that provide a pre-configured orchestration of services.