Architecture focus 2015

NOTE: THIS IS AN INCOMPLETE DRAFT! At a meeting at the Lyon Hackathon on 2015-05-24, we identified several key points that we think should serve as guiding principles and high-level tasks for the development of the MediaWiki platform for the foreseeable future.

Content Representation
How we represent wiki content is an essential question. On the one hand, support for more and more new kinds of content is being added to MediaWiki, using the ContentHandler mechanism as well as other means. On the other hand, the shift away from editing wikitext markup towards visual editing allows and requires us to re-think how we want to store and manage textual page content, as well as meta-data such as page categories or media licenses.

Over the next months, we should survey the kinds of content we currently support and want to support in the future, and assess whether our current mechanisms for managing different types of content are sufficient. We also need to establish a way to manage multiple different kinds of content together as one page or revision (perhaps adding the notion of sub-revisions); see below.

Multi-Content Revisions
Making more kinds of wiki content machine-readable and machine-editable requires us to move the storage of this content out of wikitext, where it is currently inlined, typically as template parameters, magic words, or magic links such as categories. In addition, we need a place to store derived content, such as rendered HTML for different target platforms.

Over the next months, we should establish a generic storage interface for storing arbitrary blobs (ideally in a content-addressable way). On top of that, we should establish a lookup service that associates any number of such blobs, along with information about their role, content model, and serialization format, with any given revision. This would allow us to manage multi-part content (attachments) as well as derived content with minimal disruption, though it may not be possible to avoid a breaking change to the XML dump format if we want to include multiple content objects per revision there.
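The storage interface and lookup service described above could be sketched roughly as follows. This is an illustrative Python sketch only; the names BlobStore, RevisionSlotLookup, and the "role" parameter are assumptions, not an existing MediaWiki API:

```python
import hashlib


class BlobStore:
    """Content-addressable blob storage: each blob is keyed by the
    hash of its content, so identical content is stored only once."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        address = hashlib.sha256(data).hexdigest()
        self._blobs[address] = data
        return address

    def get(self, address: str) -> bytes:
        return self._blobs[address]


class RevisionSlotLookup:
    """Associates any number of blobs with a revision, each tagged
    with a role, content model, and serialization format."""

    def __init__(self, store: BlobStore):
        self._store = store
        self._slots = {}  # revision id -> {role: (model, format, address)}

    def set_slot(self, revision_id, role, model, fmt, data: bytes):
        address = self._store.put(data)
        self._slots.setdefault(revision_id, {})[role] = (model, fmt, address)

    def get_slot(self, revision_id, role):
        model, fmt, address = self._slots[revision_id][role]
        return model, fmt, self._store.get(address)
```

A revision could then carry a primary wikitext slot alongside derived slots such as rendered HTML, each stored, purged, and regenerated independently.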

Modularity and Testability
MediaWiki has grown for more than a decade now, and several parts of the code base have become hard to maintain. In particular, large parts of the code base are inter-dependent, cannot be re-used without all the other parts, or cannot even be tested without a dummy database. Improving modularity should improve maintainability, testability, and reusability, allowing us to add new features more easily and refactor with more confidence.

Over the next months, we aim to develop an architecture guideline describing best practices for designing classes and other components. We should also work on refactoring some key parts of MediaWiki core towards compliance with these best practices. A good starting point seems to be high-profile classes like Title or User, which could in many places be replaced by "dumb" value objects like the existing TitleValue or new classes like PageRecord and UserRecord.
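To illustrate the direction, a "dumb" value object in the spirit of TitleValue might look like the following Python sketch. The actual TitleValue is a PHP class and delegates formatting to a separate formatter service; the namespace_names parameter here is a simplification for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TitleValue:
    """Immutable value object identifying a page. Unlike the full
    Title class, it holds plain data and performs no database access."""

    namespace: int
    dbkey: str  # canonical DB form, e.g. "Main_Page"

    def get_text(self) -> str:
        # Human-readable form: underscores become spaces.
        return self.dbkey.replace("_", " ")

    def get_prefixed_text(self, namespace_names: dict) -> str:
        # Formatting needs a namespace-name mapping supplied from the
        # outside; the value object itself carries no configuration.
        name = namespace_names.get(self.namespace, "")
        return f"{name}:{self.get_text()}" if name else self.get_text()
```

Because such objects are immutable and self-contained, they can be compared, cached, and unit-tested without any database fixture.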

Service Oriented Architecture
MediaWiki is being developed into a versatile collaborative content management platform that can handle various kinds of content, can be federated with other wikis, and can be integrated with different kinds of services. To allow federation and integration in a less ad-hoc fashion, as well as to improve scalability, the idea of improving modularization should be extended to the level of services, leading to a service oriented architecture.

Over the next months, we should develop an SOA strategy, identifying some key services we want to define and implement. Furthermore, we should specify how services are represented and used from inside the PHP code base. We also need to decide if and how we want to continue supporting a "monolithic" distribution bundle that can easily be used on a shared hosting plan.
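One common way to represent services inside a code base is a container that registers service factories and instantiates them lazily. A minimal Python sketch of the idea; ServiceContainer, define, and get are hypothetical names, not an existing MediaWiki interface:

```python
class ServiceContainer:
    """Registers named service factories and instantiates each
    service lazily, at most once, on first request."""

    def __init__(self):
        self._factories = {}
        self._instances = {}

    def define(self, name, factory):
        # The factory receives the container, so a service can obtain
        # its dependencies without hard-coding how they are built.
        self._factories[name] = factory

    def get(self, name):
        if name not in self._instances:
            self._instances[name] = self._factories[name](self)
        return self._instances[name]
```

Swapping a factory for a stub makes code using the container easy to test, and the same service names could later be backed by remote implementations in a federated setup.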

Generalized Transclusion
MediaWiki currently features several transclusion mechanisms (images, templates, special pages, parser functions, Wikidata usage, dynamic injection of graphs, etc.). The transclusion mechanism should be generalized, and the interfaces involved should be streamlined, to allow a content representation based on composing elements of different kinds. The ultimate goal should be to allow page content to be assembled as late as possible, at the edge of our application (see "Smart Caching" below), or even on the client.

Over the next months, we should investigate how the different transclusion mechanisms used with wikitext content can be unified and extended to work with non-wikitext content. Special attention should be given to the handling of parameters, especially parameters that contain markup and, quite possibly, further transclusions. This may involve generalizing the preprocessor facility to work with kinds of content other than wikitext.
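A unified mechanism of the kind described above might treat content as a tree of literals and transclusion calls, with parameters expanded recursively before each handler runs. A hypothetical sketch, not tied to any existing preprocessor API:

```python
def expand(node, handlers):
    """Recursively expand a content tree.

    A node is a literal string, a list of nodes, or a dict of the
    form {"call": name, "params": {...}} naming a transclusion
    handler. Parameters are expanded before the handler runs, so
    they may themselves contain nested transclusions.
    """
    if isinstance(node, str):
        return node
    if isinstance(node, list):
        return "".join(expand(child, handlers) for child in node)
    params = {key: expand(value, handlers)
              for key, value in node["params"].items()}
    return handlers[node["call"]](params)
```

Under this model, a template, a parser function, or a graph renderer would differ only in the handler registered under its name, and the same expansion logic could serve wikitext and non-wikitext content alike.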

Smart Caching
The web caching layer used on the Wikimedia cluster is quite efficient for static files and "canonical" page content, but not very smart about other kinds of content. We should be able to cache (and efficiently purge) various "renderings" of a given page, for different user languages or target devices. Eventually, we should move towards a model where different parts of an HTML page get assembled only at the edge of the application, at the web cache layer (or even on the client).

Over the next months, we should work on improving caching for multilingual sites like Commons and Wikidata. We should further investigate how inclusion mechanisms could be implemented in the caching layer.
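The idea of caching renderings per variant while still being able to purge a page as a whole can be sketched as follows. EdgeCache and its methods are illustrative assumptions, not the actual web-cache configuration:

```python
class EdgeCache:
    """Caches renderings of a page per variant (user language,
    target device) and purges all variants of a page at once."""

    def __init__(self, render):
        self._render = render  # callback: (page, lang, device) -> str
        self._cache = {}

    def get(self, page, lang, device):
        key = (page, lang, device)
        if key not in self._cache:
            # Cache miss: render this variant and remember it.
            self._cache[key] = self._render(page, lang, device)
        return self._cache[key]

    def purge(self, page):
        # Drop every cached variant of the page, e.g. after an edit.
        for key in [k for k in self._cache if k[0] == page]:
            del self._cache[key]
```

Keying on (page, language, device) keeps variants independent for serving, while the shared page component lets a single edit invalidate all of them together.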