Platform Engineering Team/Expedition

The Expedition Team sub-team of the Platform Engineering Team formed in the beginning of 2021, consisting of Petr Pchelko, Cindy Cicalese, and Daniel Kinzler and augmented by contractors.

From the project proposal as written by Eric Evans: MediaWiki eschews decades of software engineering best-practice by failing to establish a separation of concerns. Business logic, data access, and presentation are tightly coupled with one another making code difficult to reason about, hard to test, and risky to change. Years of perpetuating antipatterns, quick-hacks, and Devil’s bargains have made the issues so acute, that it requires engineers with significant domain-specific knowledge, lengthy planning, and exhaustive review for even the most minor of changes. Developers have become risk averse, progress has ground to a standstill, and the organization has grown weary. In short, it takes an increasing amount of time and resources for diminishing returns.

We have invested a tremendous amount of time and resources understanding these problems. Countless analyses have been performed, consultants have been hired, tickets opened, proposals written, and many, many hours have been spent in discussions. The answer isn’t elusive. The solution isn’t complicated, clever, or interesting, and  should not be controversial. Years of technical debt must be repaid; The codebase must be incrementally refactored, applying basic software engineering practices to achieve a clean separation of interconnected elements. We propose utilizing model–view–controller, dividing the application into state, presentation, and the logic that intermediates them. The mission of the Expedition team is to decouple the MediaWiki code-base by applying industry standard design principles like MVC. The initial focus is on breaking up "monster classes" like Revision, User, Title, and Language. These "monsters" cause tight coupling because everything depends on them, and they depend on everything (compare Loose coupling).

Migration Strategy for Replacing Classes
Replacing the usage of highly used "monster classes" across MediaWiki core code as well as in extensions is a slow process that requires careful planning to avoid sudden breaking changes and minimize risk to the production sites. We have developed a multi-stage strategy that we follow for each new type we introduce:

Stage one: define. In the first stage, we define an interface that describes the new type. Make it so the old monster class implements that interface, so the old type can still be used as a parameter in places where the new type is required. Ideally, provide a light-weight alternative implementation of the new interface.

For example, we introduced the Authority interface to cover some of the functionality in the User class, made User implement the Authority interface, and provided several alternative implementations of Authority. Introducing the new interface and implementation can typically be done quickly, with only a handful of commits.

Stage two: accept. In the second stage, we change method signatures from requiring the old type to accepting the new type. Since the old class implements the new interface, this can be done without changing callers. It just means that the calling code no longer needs to have an instance of the old monster class. This approach however does not work for return types: here we have to introduce a new method that returns the new type, and deprecate the method that returns the old type.

When changing method signatures in this way, we only update the actual implementation if it is trivial. If the implementation still relies on the old class, we simply convert from the new type to the old using an upcast method such as Title::castFromPageIdentity. Such "upcasts" are free of performance penultimates as long as the concrete type of the object at runtime is still the old class. This approach solves a chicken-and-egg dilemma created by the fact that dependencies in our code base tend to be circular: if we tried to also convert the implementation right away, we would run into the issue that we are using other code that is still asking for the old type, so we would need to convert that first. But converting that code will have the same issue, and so on. To break this cycle, it is useful to convert "from the outside in": we address the public interface first. When all the public interfaces have been converted, converting the implementations as well becomes trivial, because methods we may want to call have already been converted to accept the new type.

For example, in WikiPage::doEditContent, we changed the parameter that represents the user who is performing an edit from a User to an Authority. Internally, we convert to a User object when needed. Depending on the usage of the old class, this stage can take quite a while, and require many patches. In the class of the Title and User class, we have to update many hundred method signatures.

Stage three: offer. In the third stage, we offer new extension points (hook interfaces, abstract methods, base classes) that use the new types in the method signatures. Old extension points are deprecated.

Since changing extension points requires action from extension maintainers, it should be done rarely and be communicated carefully. Depending on the number of extension points affected, this stage can require quite a bit of effort. In the case of the Title and User classes, there is on the order of a hundred hooks to be updated, used by about as many extensions. Also, because of the nature of this change and the requirement to provide a stable framework for extensions, the old and the new extension point have to be maintained side by side for at least one release of MediaWiki.

Stage four: migrate. In the forth stage, we change code in core and in extensions from using the old classes to using the new types, by updating implementations and migrating to new extension points. Here we start to reap some of the benefits of the migration: we should see that the implementations become less complex, and that coupling between unrelated parts of the code falls away. At this point, metrics measuring coupling should start to budge, and new developers will no longer have to understand the complexities of the old code.

Stage five: remove. In stage five, we can finally remove the old classes. With all calling code having been migrated to using the new types, the old classes are no longer needed. Only when the old code has been removed will we see the full impact of the migration in all metrics.

Metrics
The objective of the expedition is to improve code quality, and quality, by definition, cannot be quantified. So all we can do is measure indicators to guide our direction, and trust in past experience that tells us that these indicators are indeed good guidance. We must however be mindful of Goodhart's law: When a measure becomes a target, it ceases to be a good measure.

At present, our "compass" for day to day work is the Monster Monitor, which provides an overview of how far we got with removing the usages of undesirable classes.

For the future, we hope to conduct a survey of engineering satisfaction, measure the tangeldness of the code base, and track the number of incidents per line of code. However, we have not yet established resourcing for the necessary tooling and processes.

Links

 * Workboard: https://phabricator.wikimedia.org/project/board/5144/
 * Presentation at the Wikimedia Hackathon 2021: Slides, Recording