Architecture guidelines

Incremental change
The MediaWiki core should be changed incrementally. Folk wisdom has it that complete rewrites of large sections of code are more difficult than they appear, and are often abandoned.

Introduction of new language features
Features introduced to the PHP language which impact on architecture should be discussed, and consensus reached, before they are widely implemented.

Rationale: acceptance into the PHP core is not a guarantee that a feature is suitable for use in a large extant codebase with many developers. In the past, features have been introduced with caveats or pitfalls that are relevant to us. Experienced MediaWiki developers are in a good position to critically review new PHP features.

Examples of existing PHP features which impact on architecture:


 * jQuery-style method chaining (enabled by PHP 5)
 * Type hinting
 * __get magic method
 * Late static binding
 * Closures
 * Namespaces
 * Traits
 * Generators

Interface changes
An interface is the means by which modules communicate with each other. For example, two modules written in PHP may have an interface consisting of a set of classes, their public method names, the definitions of the parameters, and the definitions of the return values.

For interfaces which are known to be used by extensions, changes to those interfaces should retain backwards compatibility if it is feasible to do so. The rationale is:


 * To reduce the maintenance burden on extensions. Many extensions are unmaintained, so a break in backwards compatibility can cause the useful lifetime of the extension to end.
 * Many extensions are developed privately, outside of WMF Gerrit, so rectification of all extensions in the mediawiki/extensions/* tree does not necessarily address the maintenance burden of a change.
 * Some extension authors have a policy of allowing a single codebase to work with multiple versions of MediaWiki. Such policies may become more common now that we have LTS releases.
 * MediaWiki's extension distribution framework contains no core version compatibility metadata. Thus, the normal result of a breaking change to a core interface is a PHP fatal error, which is not especially user-friendly.
 * WMF's deployment system has only rudimentary support for a simultaneous code update in core and extensions.

Requests for comment (RFC)
The requirements for writing an RFC should be made more stringent. An RFC should be written when you have a concrete implementation in mind and when you explicitly intend to realize it fully and deploy it if you get the go-ahead. If not a working prototype, you should at least have some pseudo-code to illustrate your intent.

The current set of RFCs on MediaWiki.org contain a number of RFCs that are: --Ori
 * 1) written by someone who is hoping to motivate other people to actually implement something but who has no intent of implementing it herself.
 * 2) are thin on concrete implementation details, preferring instead to express a vague "wouldn't it be nice if..." aspiration.
 * I think you mean "accepting an RFC" instead of "writing an RFC." It also seems fairly misguided to suggest that a requests for comment process only be used in cases where there's a clear execution path. --MZ

Goals

 * Self-documenting, easy to follow code
 * More maintainable code
 * Separable and generic elements (e.g., reuse just the Http class, or just the Parser, or just the DB classes, without pulling in the entire framework)
 * Easier, stabler third party development in the future
 * True unit tests (via, for example, making it easier to mock interfaces)
 * Warning: we will have to break some old interfaces and kill some deprecated things. This will make future versions have more stable interfaces for implementations.
 * Can we standardize the deprecation process too? i.e, when we branch a new MW version, delete anything tagged as deprecated in $version-2 or something like that? Anomie (talk) 14:31, 25 May 2013 (UTC)
 * Didn't that already get decided on the ML a few months ago? or was that only deciding on hwo they should be labeled?. Peachey88 (talk) 07:43, 26 May 2013 (UTC)
 * Make it easier to parallelize and distribute processing; potentially implement parts in other languages
 * Use the same code paths (controllers) from both UI and API code

Known problems

 * Testability is limited due to lack of separation of concerns
 * High levels of abstraction / complexity / non-conformity are leading to bad security (chunked uploads).
 * This seems to be more an example of a bad abstraction / missing separation of concerns -- Gabriel
 * Lack of high-level documentation (how classes fit together)
 * Readability of the overall codebase (where do I find this object that is being used?)

Design principles
The MediaWiki code base has evolved over many years. While it currently serves us reasonably well, there were many design decisions that have outlived their useful lifetime. We intend to evolve the codebase toward more modern practices over time, using this document as a focal point for choosing which design patterns and best practices we should adopt in a long term document.

Candidate design patterns for consideration
These are design patterns that have been discussed in the community that need further consideration among MediaWiki architects and experienced developers.

One way to do things
(Borrowed from the Zen of Python / http://www.python.org/dev/peps/pep-0020/ ).

Rationale: code review back log is an issue; things would go quicker if we had canonical patterns for implementating certain things. If we agree on this, we probably need to eliminate some duplication.

Practical consequences: it should be a requirement for merging bold redesigns of existing interfaces that the previous interface be removed or a clear deprecation plan be articulated.


 * I agree we should write MediaWiki 2.0 using python go .NET.Haskell BrainFuck.WhiteSpace. Machine code nodejs
 * Let's bikeshed about whitespace! yay, WhiteSpace!
 * Switch to "more common" schema:  if (blah(blah) && blah) {

Clear separation of concerns
This proposal suggests that most objects in MediaWiki should handle one (and only one) of the following concerns:


 * Values : Plain old PHP objects (POPO) - data is stored and validated internally, but no outside calls (especially network / DB) are done from this object. The object could be mutable or not. The object could contain other POPO objects, but should not reference services/controllers. Should easily serialize to JSON. Should not know or use services.
 * Examples SHOULD be Title, User, etc.


 * Services : Narrow interfaces, could in principle be web services with JSON (see value objects); long life time (application, request). Usually only one instance (but don't use singne instance (but don't use singletons). Stateless.
 * Examples SHOULD be: BlobStore, LinkStore, RCNotifier, JobQueue, etc.


 * Controllers : Models busines logic (typically a "process" or "action", like "rename page" or "block user"); usually local, short lifetime. Stateful. Uses services.
 * examples SHOULD be: Edit (with permission checks, conflict resolution, etc), MovePage, BlockUser, etc.

...with a catch-all for "'misc."' that doesn't fit into the first three.
 * examples: exceptions, interators, callbacks, and other glue code.

Even with the current (3:07pm) definitions, these terms are somewhat opaque to me. -Ori
 * Me too. And what does this mean for users of these classes: if some function needs a Title object, how does it get the TitleService (or whatever it's called) to be able to create one? Anomie (talk) 14:36, 25 May 2013 (UTC)
 * If Title is a simple Value object, you just create it with new Title. If you need a service to do something to the title, e.g. a PageStore with a movePage operation, your class (which would probably be a controller) would require a PageStore in the constructor and store it as a member. -- Duesentrieb ⇌ 13:27, 26 May 2013 (UTC)
 * Isn't that exactly the opposite of the proposal here, where "Value" objects should not be accessing the database themselves? Anomie (talk) 13:59, 28 May 2013 (UTC)
 * You don't need a factory to create a Title, but yes, you'll need a service to load a Title. But that would not be a TitleService or TitleFactory, it would rather be a PageStore, corresponding to the page table and implementing more or less the functionality of WikiPage, just as a stateless service instead of an active record. -- Duesentrieb ⇌ 18:27, 30 May 2013 (UTC)

Keep aspects/concerns separate. Don't mix these in the same class:
 * Database / storage code
 * UI Code
 * Business Logic
 * HTTP handling
 * ...more?...

An example, though not perfect, is the separation between File objects, Store objects, and MediaHandler classes. This lets us use the same handlers regardless of storage backend, etc.

Where do process caches go? The File/FileRepo split was basically designed around the caches, i.e. File is basically a container for cached information about the file. It's hard to see how a "value object" can have references to rich and useful in-process caching.


 * How does this interact with the bit above about "values" and "services"? Do we end up with many services for each value, one for Database, one for UI, one for Business logic, and so on? Or is a "service" some sort of GenericValueFactory that handles getting all different types of Value objects? Anomie (talk) 14:36, 25 May 2013 (UTC)
 * You really don't need factories for value objects, because there is generally only one implementation. You may need factories for controllers or services when constructing them in a static context. But that should generally only happen in the set up phase.
 * Yes, there would be separate services for rendering, storing, etc, for different things. E.g. we could have a PageContentRenderer interface, and a PageTsore interface, as well as a RevisionStore and LinkStore interface. They may all be implemented by the same class though (e.g. SqlWikiStuffStore). Business logic would usually be covered by a controller, which can be instantiated locally with new ContentEditController or some such. Controllers can also be injected, directly or via a factory, but I don't think that this will often be necessary. -- Duesentrieb ⇌ 13:42, 26 May 2013 (UTC)
 * See above, isn't the point of "Value" objects that they don't access the database? So how do you get the Title object for a particular page without some "factory" that fetches the necessary information from the "storage" code? And from your second paragraph, I guess the answer to my other question is "Yes, there would be several service objects for each kind of Value object." Which seems like potentially a lot of classes. Anomie (talk) 13:59, 28 May 2013 (UTC)
 * Yes, as I said above, you need a service if you want to load a title from somewhere. You don't need a factory to create a title.
 * And yes, there would be different service interfaces for doing different things - having more granularity is the point, it makes code reuse and testing easier. However, also note that just because there may be a PageStore and a RevisionStore interface, that does not mean they can't be implemented by the same class, if that is convenient.
 * Also note that services are not bound to specific kinds of values. A serialization service might implement serialization for several kinds of values (that's quite OK if the values are strongly related, I think).
 * -- Duesentrieb ⇌ 18:27, 30 May 2013 (UTC)

What is the rationale for introducing value objects? This sounds very much like the Content/ContentHandler split, which I criticised during code review. I didn't manage to get a satisfactory answer out of Daniel at the time. To me, encapsulation, i.e. the close association of code with the data it uses, is fundamental to OOP design. It's certainly possible to separate code from data, and that is routinely done in non-OOP languages, but I hadn't heard it discussed as a defensible design principle in OOP. -- Tim Starling (talk) 23:52, 30 May 2013 (UTC)


 * "Value Objects" come from Domain-Driven Design (DDD makes another distinction between "Entities" and "Values", but that's not very relevant here; It does not differentiate between Services and Controllers, a distinction I find useful, but not critical). Keeping Services and Values/Entities separate ("newableas" vs. "injectable") seems to be a best practice for writing testable code
 * There should be two kinds of objects in your application. (1) Value-objects, these tend to have lots of getters / setters and are very easy to construct are never mocked, and probably don't need an interface. (Example: LinkedList, Map, User, EmailAddress, Email, CreditCard, etc...). (2) Service-objects which do the interesting work, their constructors ask for lots of other objects for colaboration, are good candidates for mocking, tend to have an interface and tend to have multiple implementations (Example: MailServer, CreditCardProcessor, UserAthenticator, AddressValidator). A value-object should never take a service object in its constructor (since than it is not easy to construct). 
 * To me, keeping service logic separate from value objects solves three problems:
 * A value object can be constructed on the fly. An "active record" otoh either global state internally, or access to services when constructing it. This means that it's hard to use them as input for tests cases and requires big and complicated fixtures for testing. The ORMRow class recently introduced into core had this problem, we are trying to fix this now.
 * A value object can easily be serialized, converted and passed around. An object bound to services can't. This is a serious problem with the Content interface - it doesn't allow the Content to be a simple value object. For Wikibase, this forced us to have a hierarchy of value objects (Entity, Item, Property) separate from the corresponding Content objects (EntityContent, ItemContent, PropertyContent). The value objects can be passed to and used on the client wiki, while the Content objects can't. This is quite anyoing.
 * A value object does not need to guess which operations are useful. Binding logic to values leads to a random collection of functionality on the record. Title is a good (bad) example for this: it has functionality for moving a page, while the functionality for deleting a page is in WikiPage. Also, it leads to a tangle of dependencies: If an objects knows how to store itself and render itself, it depends on the storage layer code and the output layer code. It seems that splitting the responsibility of objects along the axis of what is being done and how (and what is needed for that) is better than splitting along the axis of what is being acted on. It makes more sense to me for a booking service to be able to process credit cards than for a credit card to know how to process charges.
 * More generally, it seems to me that designing for testability, extensibility and maintainability means departing from some traditional OO principles. Most importantly:
 * Code sharing via inheritance is bad. Use composition. That makes it easier to replace individual components, to mix and match.
 * Binding logic to data (active record pattern) is bad. In traditional OO, the credit card would know how to process a charge, and hair would know how to comb itself. This is inflexible, since the behavior can't be replaced at runtime.
 * I'm still struggling with the old habits myself, though, especially with the inheritance bit. -- Duesentrieb ⇌ 09:53, 31 May 2013 (UTC)

Just-in-time abstraction
This is another way of saying "You ain't going to need it"

Writing unit tests for all code immediately requires more abstraction right away. -- Duesentrieb ⇌ 13:42, 26 May 2013 (UTC)


 * As Robert Martin says "An axis of change is only an axis of change if the changes actually occur. It is not wise to apply the SRP, or any other principle for that matter, if there is no symptom." Abstraction gives you flexibility in a particular direction, the challenge (often not met) is to identify the axes of change in advance. If you know that you need unit tests, fine. The point is to avoid abstraction that gives you flexibility in unnecessary directions. -- Tim Starling (talk)


 * But how do you condense this into concrete recommendations? Anticipating the directions in which some code will be extended is usually a matter of experience. -- Ori
 * Isn't that basically what this is saying? Until you have the experience in how this particular thing needs to be abstracted, you're just guessing. Robla told a story about some bus tunnel built in the 80s, where they thought they were eventually going to want to run rail through it as well so they ran track. Then a few years ago when they actually built the rail, they found the tunnel had a different gague... Anomie (talk) 14:39, 25 May 2013 (UTC)


 * "Do not introduce abstraction in advance of need unless the probability that you will need the flexibility thus provided in the short term is very high (>80%)" Obviously human judgement is needed at some point, but I think we are putting a foot in the ground by asking for a high probability of need rather than a conceivable need.-- Tim Starling (talk)


 * Yes, I think this works. ContentHandler is a good example: there was pre-existing evidence that the notion of "content" merited being made abstract -- namely, w/image content and stuff in the mw namespace. -- Ori


 * I agree that premature abstraction is bad. I would argue though that there are some boundaries that should not be crossed, and require decomposition per default (e.g. the value object / service boundary, or the database code / ui code boundary). Any abstraction beyond that should and can be done "just in time"; Note however that the requirement to write (actual) unit tests may require some abstraction right away that would not be needed otherwise. -- Duesentrieb ⇌ 13:46, 26 May 2013 (UTC)
 * OTOH, if your "unit" tests are forcing you to add layers of abstraction that are not otherwise necessary, then perhaps your units are too small. Anomie (talk) 14:06, 28 May 2013 (UTC)

There is a counterpoint, which is that we know from experience that abstraction left to be done "just in time" sometimes isn't done at the time it is actually needed. Instead, the second developer hacks something up on top of whatever the first developer left, in the ugliest possible way. Consider the introduction of ApiEditPage without appropriate refactoring of EditPage, or the introduction of action=render without introducing the non-static methods that would be needed to allow generation of absolute URLs for links, dependent on ParserOptions. Also, I think it's fair to say that some developers just don't have a head for architecture and can't reasonably be expected to rearchitect a module in the course of adding a feature to it. Maybe this implies a more moderate wording for this policy, along the lines of "avoid unnecessary abstraction" rather than "just-in-time abstraction". -- Tim Starling (talk) 14:02, 26 May 2013 (UTC)
 * I'd also hope that code review and RFCs actually being useful could help with the "don't have a head for architecture" problem. Anomie (talk) 14:06, 28 May 2013 (UTC)

Code structure

 * Type hinting (both in parameter lists and in doc comments) avoid errors
 * PHP provides __call which allows for proxy objects, i.e. objects which duplicate the methods of a class but do not actually inherit from that class. Consider awesome backwards compatibility hacks like WikiPage and StubObject. Type hinting breaks this. So I am hesitant to introduce it to public interfaces. -- Tim Starling (talk)
 * how does type hinting break this? -- Duesentrieb ⇌ 13:49, 26 May 2013 (UTC)

For example:

Breaks if RequestContext::getMain->getUser is migrated to return a StubUser instead of a User. -- Tim Starling (talk) 01:10, 30 May 2013 (UTC)
 * Oh right, subs can't "dynamically" implement interfaces. I want Java's proxy objects :) -- Duesentrieb ⇌ 18:19, 30 May 2013 (UTC)


 * Type hinting also prevents b/c migration hacks like

If a core class has child classes in extensions, the type hints cannot change without generating parse errors in the derived classes. So this prevents the type of a parameter from being migrated.

Since we are discussing renaming every class and completely changing all the interfaces, I don't think support for b/c hacks can be easily dismissed. The success of any renaming/refactoring project may depend on them being available. -- Tim Starling (talk)


 * Personally, I don't advocate any big rewrites. I'm mostly thinking of which guidelines should apply to new code.
 * But this is an important point: we should make clear which rules should apply to new code, and if and how they can be applied to legacy code, and which b/c strategies are to be employed, and which pit falls are to be considered. -- Duesentrieb ⇌ 13:49, 26 May 2013 (UTC)

Refactoring best practices

 * It should be easy to identify what 'concern' a class is dealing with (without even reading the code). Naming convention or phpdoc markers ?
 * The historic lack of namespaces in PHP led to the proliferation of "bags o' vaguely related things" classes. Now we have namespaces, but they look a bit alien in the MediaWiki codebase -- can we decide on the role of namespaces in future code? Are they going to be a bit of cleverness that is tolerated, or a design pattern we encourage / enforce? Anonymous functions ditto -- there are a lot of MyExtensionHook classes that are more concisely written as  --Ori
 * For abstract classes, have clear documentation on what is expected of children
 * Along the lines of the comment about type hinting above: there is a set of tricks and techniques for performing complex migrations "in-flight" that should be documented, like using a temporary configuration var that selects the old or new implementation.

Let's please try to focus on a) guidelines for new code and b) guidelines for careful, local, just-in-time refactoring. Let's avoid planning or even specing out a huge refactoring of the entire codebase. -- Duesentrieb ⇌ 13:51, 26 May 2013 (UTC)

Context and object lifetime

 * Registration systems in which class names are used instead of object instances (ie special pages or API modules) prevent control over object lifecycle and thus proper dependency injection
 * On the other hand, using objects means that you have to actually instantiate a possibly large number of classes every time when only a few will actually be used—wastes time and memory. --Brad
 * I agree that it would be bad to instantiate the entire object graph in the setup phase. There should be a limited number of static entry points that just construct the things they need, using the relevant configuration, factories, etc. E.g. an API module or a hook function may construct/request the services and controller it needs just in time.
 * In other cases, lazy instantiation can be implemented using the holder or factory pattern, though I find that a bot awkward/annoying. -- Duesentrieb ⇌ 13:56, 26 May 2013 (UTC)
 * Singletons are problematic for unit testing, and lifetime control generally, but they are an improvement over global variables
 * [for an example we should refactor the Wiki class to actually encapsulate configuration state, so multiple object trees could exist in one process -- makes cross-wiki stuff more feasible]
 * how are they an improvement? For all intents and purposes, they are global variables. -- Duesentrieb ⇌ 13:56, 26 May 2013 (UTC)


 * Singletons are better than globals because:
 * They allow for lazy initialisation. This improves performance. Lazy initialisation can be used to reduce the overhead required for backwards compatibility, by deferring the load of backwards compatible interface code until it is actually required.
 * Singleton requests can easily be flagged with wfDeprecated.
 * A global declaration provides a reference, rather than a copy of an object handle, which has negative performance consequences as described in my recent wikitech-l post.
 * Singletons could in principle be registered in some process lifetime object, to allow them to easily be discarded in bulk at the start of each unit test run. By a singleton, we mean an object with a global scope which is retreived with a function or static method. Infinite lifetime is not inherent to the concept.


 * -- Tim Starling (talk) 01:06, 31 May 2013 (UTC)

Impact on casual contributors

 * Some rules can be difficult for newbies, like requiring unit tests
 * help them write their tests!
 * don't be afraid to refactor in code review
 * Differ between (hard)core mediawiki in test case requirements vs for instance non-deployed extensions
 * Advice a good IDE/toolset, because that might help
 * Open source, please! --Brad


 * More separation makes code easier to understand and extend; understanding huge classes like User or Title is daunting.
 * On the other hand, mazes of tiny little classes that all depend on each other are difficult to understand too. We need a middle ground. --Brad
 * the win is usually when dependencies are one-way and limited. Avoid mazes but go for little clouds.
 * Tell users to be proactive about their changes when required (don't leave them hanging)
 * We can keep settings for most popuplar IDEs in source control, allowing contributors to start quickly.

Use dependency injection for external resources
If you initialize connection objects to external resources somewhere deep in your code, it makes writing unit tests very difficult and expensive. Possible pattern:

^ This makes it very easy to write unit tests that use PHPUnit mock objects --Ori


 * Passing in NULL like this is discouraged as it makes the signature less clear
 * The tight coupling is still there
 * Production code will end up being too lazy to properly inject dependencies
 * The assumption is that the dependencies are properly instantiated by the class itself and callers can and should be lazy; the ability to inject them is put in place specifically for unit tests. --Ori
 * In that case there indeed still is tight coupling and no poper dependency injection. So most problems are not solved
 * Yes, you're right: this partially solves the testing problem but misses an opportunity to make the implementation more flexible. --Ori

I'm not really on board with this and may need some more convincing. Daniel showed a code snippet from Wikibase yesterday with half a dozen constructor parameters. I am not sure it is worthwhile making unit tests easier by making everything else several times harder and uglier. I am happier at this point with the idea of a single context object passed to the constructor, even if it does make unit testing slightly more difficult.

The problem with passing many dependencies as formal parameters to constructors is that it is difficult to know in advance how many will be required, especially if you call hooks. Wrapping up all the context into a single object makes it easier to extend.


 * But it makes it hard(er) to substitute individual bits; you always have to provide everything that the object might use from the context.
 * Example: IContextSource has a getSkin method. So if I want to provide the contextual Title and User to something that takes an IContextSource, I also have to somehow provide a Skin. That doesn't make me happy...
 * It would still be an improvement over the current situation though -- Duesentrieb ⇌ 14:05, 26 May 2013 (UTC)

I'm not saying that all constructors should use the same context class, just that I prefer a single context parameter in constructors over a dozen.
 * I agree that having many parameters in the constructor is ugly. A retort I got when complaining about this is that a class needing many different services is an indication of the class doing too much. Split up the class and have less responsibilities per class, then you need fewer services in the constructor - or so they say. I can't say yet whether this always holds true, but so far, it seems like a reasonable rule of thumb. -- Duesentrieb ⇌ 14:05, 26 May 2013 (UTC)

If you don't think your class requires RequestContext::getUser, and don't want to set up a user in a test, why not use an IContextSource which throws an exception when getUser is called on it? That way, if you are proved wrong and a User is actually required, you can add one without adding constructor parameters. -- Tim Starling (talk)

I see your points, but now you have to step through the code to see what it does in fact require. I'll try to refactor a class that uses optional constructor parameters to take a single context object to see how well it works. --Ori

Perhaps a compromise would be to say that individual parameters are preferable, but a context object ("service bundle") is acceptable for convenience. The downsides should be made clear though. -- Duesentrieb ⇌ 14:05, 26 May 2013 (UTC)

Templates for common requirements

 * Taking some unit of work that has grown too expensive and converting it to a job
 * How to implement simple locks in memcached
 * Creating a new namespace for a ContentHandler implementation

I don't think any of this should be in the architecture guidelines. These sound like good candidates for a cook book page. -- Duesentrieb ⇌ 14:08, 26 May 2013 (UTC)

Declaring unit tests in an extension
--Ori

This requires one to manually maintain a list of the tests, which leads to errors. It also is not inline with the standards of the wider PHP community. Defining the PHPUnit config in a phpunit.xml.dist file and allowing extensions to register this is perhaps better. --Jeroen De Dauw (talk)

Part of the problem is that there are two different ways tests are found, depending on how PHPUnit is invoked. Invoking it against core uses UnitTestsList, while pointing it at the extension directly just looks for files ending in "Test.php". And this is non-obvious. -- Brad


 * an important discussion, but also not something I'd want to have in the architecture guidelines. It rather sounds like "best practices for testing" or some such. -- Duesentrieb ⇌ 14:09, 26 May 2013 (UTC)

Things to Read

 * Writing Testable Code on the Google Testing Blog. Concise overview of dos and don'ts.
 * Actually, we can just take that blog post as our architecture guide. It should do. -- Duesentrieb ⇌ 09:26, 31 May 2013 (UTC)