Platform Engineering Team/Expedition

The Expedition sub-team of the Platform Engineering Team was initially formed in the beginning of 2021 by Petr Pchelko, Cindy Cicalese, and Daniel Kinzler.

From the project proposal as written by Eric Evans:

MediaWiki eschews decades of software engineering best-practice by failing to establish a separation of concerns. Business logic, data access, and presentation are tightly coupled with one another making code difficult to reason about, hard to test, and risky to change. Years of perpetuating antipatterns, quick-hacks, and Devil’s bargains have made the issues so acute, that it requires engineers with significant domain-specific knowledge, lengthy planning, and exhaustive review for even the most minor of changes. Developers have become risk averse, progress has ground to a standstill, and the organization has grown weary. In short, it takes an increasing amount of time and resources for diminishing returns.

We have invested a tremendous amount of time and resources understanding these problems. Countless analyses have been performed, consultants have been hired, tickets opened, proposals written, and many, many hours have been spent in discussions. The answer isn’t elusive. The solution isn’t complicated, clever, or interesting, and should not be controversial. Years of technical debt must be repaid; The codebase must be incrementally refactored, applying basic software engineering practices to achieve a clean separation of interconnected elements. We propose utilizing model–view–controller, dividing the application into state, presentation, and the logic that intermediates them.

The mission of the Expedition team is to decouple the MediaWiki code-base by applying industry standard design principles like MVC. The initial focus is on breaking up "monster classes" like Revision, User, Title, and Language. These "monsters" cause tight coupling because everything depends on them, and they depend on everything (compare Loose coupling).

Team Process

Following the idea of an expedition, the team chooses and organizes its own work, pushing forward, detouring or backtracking as needed.

At the heart of the team's process are the collaboration sessions where we meet in a video chat to discuss design issues, agree on priorities, and decide on next steps. We sometimes also meet to write or review code together in a pair or mob programming style. We try to meet at least twice a week for 90 minutes.

Outside the meetings, we use the #pet-expedition channel on Slack to ask questions, coordinate work, and share information. The most important rule of collaboration is: ask questions!

The central place to keep track of future work and tasks that need to be picked up again later is the Expedition workboard on phabricator. It has the following columns:

Unsorted pile: this is where tickets land when they are first created. Tickets should be prioritized and moved out of the unsorted pile as soon as possible. This can be done by any team member.
Authority pile, UserStore pile, PageStore pile, etc: stack-ranked backlogs for the different themes we are currently working on. Tasks should be picked up from these piles roughly according to their priority and sorting order.
Doing: tickets that any team member is currently actively working on. All tickets in this column should be a signed to a team member.
Waiting for review: tickets that have code changes associated with them which are waiting for review. This is typicalyl only used wif the review is complex or for some other reason takes a long time.
Blocked: waiting for another team, or another aspect of the expedition to progress. The ticket should clearly state what it is blocked on.
Waiting for release: the ticket should be picked up again after the next MediaWiki release has been branched. This is typically when deprecated code needs to be removed.
On the train: The code change associated with the ticket has been merged but not yet (fully) deployed (it is "waiting for the train" or "riding the train").

Team members pick up work from the stack-ranked piles. Work progress should be tracked as follows:

When picking up a ticket, team members should move the ticket to the Doing column of the board and assign the ticket to themselves.
When work on the ticket is paused, the ticket should be placed in the column reflecting why the work has been paused: Waiting for review, Blocked, or Waiting for release. A comment should be added describing when the work has paused, and when it can be picked up again.
When the code changes associated with the ticket have been merged, the ticket should be placed in the On the train column. Once all code changes have been deployed, the ticket should be closed by setting the status the resolved.

In general, tickets can and should be moved on the board by any team member whenever they feel it is appropriate. If it is not clear where a ticket should go, this should be discussed in the channel or at a collab session.

Migration Strategy for Replacing Classes

Replacing the usage of highly used "monster classes" across MediaWiki core code as well as in extensions is a slow process that requires careful planning to avoid sudden breaking changes and minimize risk to the production sites. We have developed a multi-stage strategy that we follow for each new type we introduce:

Stage one: define. In the first stage, we define an interface that describes the new type. Make it so the old monster class implements that interface, so the old type can still be used as a parameter in places where the new type is required. Ideally, provide a light-weight alternative implementation of the new interface.

For example, we introduced the Authority interface to cover some of the functionality in the User class, made User implement the Authority interface, and provided several alternative implementations of Authority. Introducing the new interface and implementation can typically be done quickly, with only a handful of commits.

Stage two: accept. In the second stage, we change method signatures from requiring the old type to accepting the new type. Since the old class implements the new interface, this can be done without changing callers. It just means that the calling code no longer needs to have an instance of the old monster class. This approach however does not work for return types: here we have to introduce a new method that returns the new type, and deprecate the method that returns the old type.

When changing method signatures in this way, we only update the actual implementation if it is trivial. If the implementation still relies on the old class, we simply convert from the new type to the old using an upcast method such as Title::castFromPageIdentity(). Such "upcasts" are free of performance penultimates as long as the concrete type of the object at runtime is still the old class. This approach solves a chicken-and-egg dilemma created by the fact that dependencies in our code base tend to be circular: if we tried to also convert the implementation right away, we would run into the issue that we are using other code that is still asking for the old type, so we would need to convert that first. But converting that code will have the same issue, and so on. To break this cycle, it is useful to convert "from the outside in": we address the public interface first. When all the public interfaces have been converted, converting the implementations as well becomes trivial, because methods we may want to call have already been converted to accept the new type.

For example, in WikiPage::doEditContent, we changed the parameter that represents the user who is performing an edit from a User to an Authority. Internally, we convert to a User object when needed. Depending on the usage of the old class, this stage can take quite a while, and require many patches. In the class of the Title and User class, we have to update many hundred method signatures.

Stage three: offer. In the third stage, we offer new extension points (hook interfaces, abstract methods, base classes) that use the new types in the method signatures. Old extension points are deprecated.

Since changing extension points requires action from extension maintainers, it should be done rarely and be communicated carefully. Depending on the number of extension points affected, this stage can require quite a bit of effort. In the case of the Title and User classes, there is on the order of a hundred hooks to be updated, used by about as many extensions. Also, because of the nature of this change and the requirement to provide a stable framework for extensions, the old and the new extension point have to be maintained side by side for at least one release of MediaWiki.

Stage four: migrate. In the forth stage, we change code in core and in extensions from using the old classes to using the new types, by updating implementations and migrating to new extension points. Here we start to reap some of the benefits of the migration: we should see that the implementations become less complex, and that coupling between unrelated parts of the code falls away. At this point, metrics measuring coupling should start to budge, and new developers will no longer have to understand the complexities of the old code.

Stage five: remove. In stage five, we can finally remove the old classes. With all calling code having been migrated to using the new types, the old classes are no longer needed. Only when the old code has been removed will we see the full impact of the migration in all metrics.

More information:

Presentation at the Wikimedia Hackathon 2021: Slides, Recording

Metrics

The objective of the expedition is to improve code quality, and quality, by definition, cannot be quantified. So all we can do is measure indicators to guide our direction, and trust in past experience that tells us that these indicators are indeed good guidance. We must however be mindful of Goodhart's law: When a measure becomes a target, it ceases to be a good measure.

At present, our "compass" for day to day work is the Monster Monitor, which provides an overview of how far we got with removing the usages of undesirable classes.

For the future, we hope to conduct a survey of engineering satisfaction, measure the tangeldness of the code base, and track the number of incidents per line of code. However, we have not yet established resourcing for the necessary tooling and processes.