Architecture guidelines

From MediaWiki.org

This document describes guidelines that all MediaWiki developers should consider when they make changes to the MediaWiki codebase, especially code meant to run on Wikimedia sites. It refers to all MediaWiki developers, past and present, as "we", and you should consider yourself included. It is a work in progress, and is the result of developers meeting at Wikimedia-sponsored events and on IRC, as well as online discussions via wikitech-l.

This guide interrelates with the performance guidelines, user experience guidelines, and security guidelines.

This page reflects MediaWiki as it is now and does not get into the details of how the overall architecture of MediaWiki can be improved; for that, see Architectural vision discussion.


Process for implementing architecture changes

Incremental change

MediaWiki core changes incrementally. We value the time of our third-party users and we value backwards compatibility. Be careful about taking on complete rewrites of large sections of code: they are often more difficult than they appear, and are often abandoned. This is in line with how we try to refactor code.

Our design principles below should last a long time. To discuss changing them or adding more, we have an Architectural vision discussion.

Good example: the old transition from cur/old tables to page/revision/text, which introduced an object-oriented intermediary interface in front of revisions. From there we were able to extend to sharded external storage with relatively little change to the rest of the code, and various funky compression techniques. That’s a case where we chose polymorphic objects over hooks and it worked out well.[1] And now we’re looking at further decoupling that storage layer to a separate service, which feels like it won’t be too disruptive because we made good choices on that 10 years ago.

The transition from cur/old tables to page/revision/text happened around 2004-2005. It was probably our first really successful big refactor. The experience of the horror of the pre-refactored code helped make this go well. We saw exactly which problems the old structure created, and devised a new data structure to solve those problems. Then we designed a code structure that would abstract away most of the actual table bits, which turned out to be very extensible.
Mostly, Brion Vibber worked on the main code, with Tim Starling working on the compression abstraction that was later extended to external storage, and it took a few months. The actual data table conversion took a few days on the biggest wikis.


Bad example: Authentication/authorization interface: basically, it was put together without a good idea of the requirements. It turned out to work okay for an initial version of CentralAuth, which went live around 2008, but had some weaknesses for LDAP and other uses. It lacked flexibility in the interface, and made assumptions about data flow. Thus, the bad interface has made auth fixes harder.

Introduction of new language features

Features introduced in the PHP language which impact architecture should be discussed (see examples below), and consensus reached, before they are widely adopted in the code base.

Rationale: Features being added to PHP aren't necessarily suitable for use in large-scale applications, or with our specific architecture. In the past, features have been introduced with caveats or pitfalls that were relevant to us (for example: Late Static Binding, which can be worked around by using a better design). Experienced MediaWiki developers are in a good position to critically review new PHP features.

Examples of PHP features that we aren't widely using yet, but could adopt (or reject):

  • Method chaining (enabled by PHP 5)
  • __get() magic method
  • Late static binding (PHP 5.3)
  • Namespaces (PHP 5.3)
  • Traits (PHP 5.4)
  • Generators (PHP 5.5)
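As an illustration of the kind of feature under discussion, late static binding (mentioned above as having pitfalls) changes which class a static call resolves to at runtime. A minimal sketch, using class names invented for illustration:

```php
<?php
// Hypothetical classes illustrating late static binding (PHP 5.3+).
class BaseModel {
    public static function create() {
        // "static" resolves to the class the caller named (late static
        // binding); "self" would always resolve to BaseModel here.
        return new static();
    }
}

class PageModel extends BaseModel {
}

$obj = PageModel::create();
echo get_class( $obj ), "\n"; // PageModel, not BaseModel
```

The pitfall is that the behaviour of `BaseModel::create()` now depends on which subclass it was called through, which can be surprising; the same effect can often be achieved more explicitly with a factory object.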

Interface changes

An interface is the means by which modules communicate with each other. For example, two modules written in PHP may have an interface consisting of a set of classes, their public method names, the definitions of the parameters, and the definitions of the return values.

For interfaces which are known to be used by extensions, changes to those interfaces should retain backwards compatibility if it is feasible to do so. The rationale is:

  • To reduce the maintenance burden on extensions. Many extensions are unmaintained, so a break in backwards compatibility can cause the useful lifetime of the extension to end.
  • Many extensions are developed privately, outside of Wikimedia Foundation's Gerrit, so rectification of all extensions in the mediawiki/extensions/* tree does not necessarily address the maintenance burden of a change.
  • Some extension authors have a policy of allowing a single codebase to work with multiple versions of MediaWiki. Such policies may become more common now that we have Long Term Support (LTS) releases.
  • MediaWiki's extension distribution framework contains no core version compatibility metadata. Thus, the normal result of a breaking change to a core interface is a PHP fatal error, which is not especially user-friendly.
  • WMF's deployment system has only rudimentary support for a simultaneous code update in core and extensions.
  • When creating hooks, try to keep the interfaces very narrow. Exposing a '$this' object as a hook parameter is a poor practice, which has caused trouble as we moved code from being UI-centric into separate API modules or similar.
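The hook advice above can be sketched as follows. This is a self-contained illustration, not the real MediaWiki Hooks mechanism; the class name, hook name, and parameters are all invented:

```php
<?php
// Minimal hook registry sketch; names are hypothetical.
class HookRegistry {
    private $handlers = array();

    public function register( $name, $handler ) {
        $this->handlers[$name][] = $handler;
    }

    public function run( $name, array $args ) {
        $list = isset( $this->handlers[$name] ) ? $this->handlers[$name] : array();
        foreach ( $list as $handler ) {
            call_user_func_array( $handler, $args );
        }
    }
}

$hooks = new HookRegistry();

// Poor practice would be exposing the whole caller object:
//   $hooks->run( 'PageSaved', array( $this ) );
// Better: pass only the narrow values the handler actually needs.
$hooks->register( 'PageSaved', function ( $titleText, $userName ) {
    echo "$userName saved $titleText\n";
} );
$hooks->run( 'PageSaved', array( 'Main_Page', 'Alice' ) );
```

With a narrow parameter list, the hook's contract is explicit, and the calling code can later move from a UI class into an API module without handlers breaking because they depended on the internals of `$this`.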

Good examples:

  • File storage: When we redid the file storage system, we abstracted most of it away so front-end code never had to touch storage. We did have some code that had to actually touch files, and we gradually updated that code to use the new storage system, bit by bit: first, primary images and thumbs; then, things like math image generation.
  • Notifications: We are dropping in Notifications (formerly Echo) as a supplementary layer, without fully replacing the user talk page notification system. Eventually we’ll probably drop the old bits and merge them fully. (It would be even better to systematically notify old clients of this changeover through a public comment, migration and revision period.)
  • ResourceLoader: When we moved to ResourceLoader, initially some scripts were still loaded in the legacy fashion, and we had a fairly long transition period where site scripts and gadgets got fixed up to play better with RL.

Requests for comment (RfC)

An RfC is a request for review of a proposal or idea to change the basic structure of MediaWiki. RfCs are reviewed by the community of MediaWiki developers. Final decisions on RfC status are made by the WMF architecture committee (currently Mark Bergsma, Tim Starling, and Brion Vibber).

Filing an RfC is strongly recommended before embarking on a major core refactoring project.

Data-driven change

Understand the parts of the overall Wikimedia infrastructure that your change would touch.

Do your homework before suggesting a change, so other people can check your math. And after you've made the change, repeat your benchmarks to check whether you've succeeded.

Design principles

Cf. the "end result" principles. We design our software to meet these principles.

Secure

We value the privacy of our users' data; see Security for developers/Architecture.

Efficient

We want users to be able to perform most operations within two seconds. Please see the performance guidelines.

Multilingual

We aim to empower people speaking, reading, and writing all human languages. New MediaWiki code must be internationalised and localisable. See Localisation to see how to do this.

Separation of concerns — UI and business logic

It is generally agreed that separation of concerns is essential for a readable, maintainable, testable, high-quality codebase. However, opinions vary widely on the exact degree to which concerns should be separated, and on which lines the application should be split.

MediaWiki began in 2002 with a very brief style in which "business logic" and UI code were freely intermixed. This style produced a functional wiki engine with a small outlay of time and only 14,000 lines of code. Despite the MediaWiki core now weighing in at some 235,000 lines, the marks of the original style can still be seen in important areas of the core codebase. This design is clearly untenable as the core of a large and complex project.

Many features have three user interfaces:

  • Server-generated HTML
  • The HTTP API, i.e. api.php. This is used both as a UI in itself (action=help etc.) and as an interface for client-side UIs.
  • Maintenance scripts

Currently, these three user interfaces are supported by means of either:

  • A pure-PHP backend library
  • Having one UI wrap another UI (FauxRequest etc.)
  • Code duplication

Code duplication is generally frowned upon, but the other two approaches both have significant support. The traditional position is that pure-PHP backend libraries should be constructed, and this has been a common approach over the years. The progressive position is that business logic should be moved to the HTTP API, and that the other UIs should wrap the HTTP API.
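The "one UI wraps another" approach can be sketched in a self-contained way. In MediaWiki this is done with FauxRequest, which lets internal PHP code call the HTTP API without an actual HTTP round trip; the classes below are simplified stand-ins invented for illustration, not the real FauxRequest/ApiMain interfaces:

```php
<?php
// Self-contained sketch: an internal request object stands in for an
// HTTP request, so the same API code serves both external HTTP clients
// and internal PHP callers. All names here are hypothetical.
class InternalRequest {
    private $params;

    public function __construct( array $params ) {
        $this->params = $params;
    }

    public function getVal( $name, $default = null ) {
        return isset( $this->params[$name] ) ? $this->params[$name] : $default;
    }
}

class QueryApi {
    public function execute( InternalRequest $request ) {
        // Business logic lives behind the API; every UI shares it
        // instead of duplicating it.
        $title = $request->getVal( 'title', '' );
        return array( 'title' => $title, 'length' => strlen( $title ) );
    }
}

// An internal caller (e.g. server-generated HTML or a maintenance
// script) wraps the API rather than reimplementing its logic.
$api = new QueryApi();
$result = $api->execute( new InternalRequest( array( 'title' => 'Main_Page' ) ) );
echo $result['length'], "\n"; // 9
```

The serialisation-overhead disadvantage discussed below arises when calls like this cross a process or network boundary instead of staying in-process.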

Advantages of wrapping the HTTP API

  • Certain kinds of batching are naturally supported. Some existing pure-PHP interfaces suffer from a lack of batching, for example, Revision.
  • The HTTP API provides a boundary for splitting across different servers or across daemons running different languages. This provides a migration path away from PHP, if this is desirable.
  • Tight integration of backend and HTTP API functions provides compactness.

Disadvantages of wrapping the HTTP API

  • Loss of generality in the interface, due to the need to serve both internal and external clients. For example, it is not possible to pass PHP objects or closures.
  • The inability to pass objects across the interface has various implications for architecture. For example, in-process caches may have a higher access latency.
  • Depending on implementation, there may be serialisation overhead. This is certainly the case with the idea of replacing internal FauxRequest-style calls with remote calls.
  • The command line interface is inherently unauthenticated, so it is difficult to implement it in terms of calls to functions which implement authentication. Similarly, some extensions may wish to have access to unauthenticated functions, after implementing their own authentication scheme.
  • In general, tight integration of backend and HTTP API functions causes a loss of flexibility. Abstraction and flexibility are fundamentally linked.
  • More verbose calling code.

Separation of concerns — encapsulation versus value objects

It has been proposed that service classes (with complex logic and external dependencies) be separated from value classes (which are lightweight and easily constructed). It is said that this would improve testability. The extent to which this should be done is controversial. The traditional position, most commonly followed in existing MediaWiki code, is that code should be associated with the data it operates on, i.e. encapsulation.

Disadvantages of encapsulation

  • The association of code with a single unit of data tends to limit batching. Thus, performance and the integrity of DB transactions are compromised.
  • For some classes, the number of actions which can be done on/with an object is very large or not enumerable. For example, very many things can be done with a Title, and it is not practical to put them all in the Title class. This leads to an inelegant separation between code which is in the main class and code which isn't.
  • The use of smart but new-operator-constructable classes tends to lead to the use of singletons and global variables for request context. This makes unit testing more awkward and fragile. It also leads to a loss of flexibility, since the relevant context cannot easily be overridden by callers.

Whether or not it is used in new code, it is likely that encapsulation will continue to be a feature of code incrementally developed from the current code base. So we propose the following best practices intended to limit the negative impacts of traditional encapsulation.

Encapsulation best practices

  • Where there is I/O or network access, provide repository classes with interfaces that support batching.
  • A global singleton manager should be introduced, to simplify control of request-lifetime state, especially for the benefit of unit tests. This should replace global, class-static and local-static object variables.
  • Limit the code size of "smart object" classes by splitting out service modules, which are called like $service->action( $data ) instead of $data->action().
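The first and third practices above can be sketched together; every class and method name here is invented for illustration:

```php
<?php
// Sketch of the encapsulation best practices; names are hypothetical.

// A repository interface whose methods support batching: one call can
// fetch many rows, instead of N single-row queries.
interface RevisionRepository {
    /** @return array map of id => row */
    public function fetchByIds( array $ids );
}

class InMemoryRevisionRepository implements RevisionRepository {
    private $rows;

    public function __construct( array $rows ) {
        $this->rows = $rows;
    }

    public function fetchByIds( array $ids ) {
        // A real implementation would issue one batched DB query here.
        return array_intersect_key( $this->rows, array_flip( $ids ) );
    }
}

// A service module called as $service->action( $data ) rather than
// $data->action(): the row stays a lightweight value, and the logic
// lives in a separately testable class.
class RevisionRenderer {
    public function render( array $row ) {
        return strtoupper( $row['text'] );
    }
}

$repo = new InMemoryRevisionRepository( array(
    1 => array( 'text' => 'first' ),
    2 => array( 'text' => 'second' ),
) );
$renderer = new RevisionRenderer();
foreach ( $repo->fetchByIds( array( 1, 2 ) ) as $row ) {
    echo $renderer->render( $row ), "\n";
}
```

Because the repository is an interface and the renderer takes its data as a parameter, unit tests can substitute fakes for both without touching singletons or global request state.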

You aren't gonna need it

Do an inventory of currently available abstractions before you add more complexity.

Do not introduce abstraction in advance of need unless the probability that you will need the flexibility thus provided is very high.

This is a widely accepted principle. Even Robert C. Martin, whose "single responsibility principle" tends to lead to especially verbose and well-abstracted code, stated in the book "Agile principles, patterns, and practices in C#":

If, on the other hand, the application is not changing in ways that cause the two responsibilities to change at different times, then there is no need to separate them. Indeed, separating them would smell of Needless Complexity.
There is a corollary here. An axis of change is only an axis of change if the changes actually occur. It is not wise to apply the SRP, or any other principle for that matter, if there is no symptom.

An abstraction provides a degree of freedom, but it also increases code size. When a new feature is added which needs an unanticipated degree of freedom, the difficulty of implementing that feature tends to be proportional to the number of layers of abstraction that the feature needs to cross.

Thus, abstraction makes code more flexible in anticipated directions, but less flexible in unanticipated directions. Developers tend to be very poor at guessing what abstractions will be needed in the future. When abstraction is implemented in advance of need, the abstraction is often permanently unused. Thus, the costs are borne, and no benefit is seen.

References

  1. (The hooks may have come along later.)