User:Duesentrieb/Architecture Guidelines

This is a DRAFT of a PROPOSAL for a RECOMMENDATION of BEST PRACTIVE regarding code architecture. It is written with MediaWiki core in mind, but should be applicable to any software project of similar size and intent, writte for a similar platform.

The general principles laid out below should serve as a guideline when writing and reviewing new code, and when refactoring old code.

Rationale
The principles layed out below are designed to improve two desirable properties of any software system: maintainability and reusability. Key to improving these two properties is the idea of modularity, meaining the idea that software components (functions, classes, libraries, and applications) should have well defined, narrow interfaces and well defined, minimal dependencies. Better modularity also means better testability, which again improves maintainability because it allows confident change.

It should be noted however that none of these principles is absolute, and sometimes they even conflict, so a balance needs to be struck. For instance, improving the separation of concerns may have a negative effect on information locality. More generally, reducing the complexity of individual components often means increasing granularity, causing an increase the complexity of the larger system that composes and uses these components. Finer granularity brings a higher degree of abstraction, making it easier to understand intent, whereas coarser granularity makes it easier to understand the actual operation. The right balance between such aspects of a software system often depend on external factors, including the available tools, code review process, as well as social and cultural factors.

Furthermore, even though the below principles are generally helpful for creating maintainable and reusable software, any of them may be discarded temporarily or even permanently for a given component, if there is a good reason to do so. Such a reason should however be thoroughly documented, to allow others to understand why the component was written in violation of best practice, and thus avoid others trying to fix that perceived shortcoming.

With regards to MediaWiki, it's important to realize that a lot of the core code is quite old, or was built upon antique foundations, and often does not conform to the principles described in this document (as of summer 2015). This legacy code can more often be used as a bad example rather than a guide to how new code should be written. However, writing new code in such an environment often requires compromises, improving the legacy code base requires patience and due care as well as attention to practical factors such as runtime performance and scalability.

-> Reusability (link/move)
 * documentation

-> Maintainability (link/move)
 * what vs why
 * operation vs intent
 * understand and change

---

No Global State
Global state should be avoided at (nearly) all cost. Global state prevents isolation of components and violates the principle of information hiding. In more practical terms, it makes it very hard if not impossible to re-use a component elsewhere, use two instances of a component in parallel, provide a reliable testing environment for a component, or prevent leakage of state between tests.

Global functions should also be avoided. As long as they do not modify state, they are not quite as bad, but they are prone to naming conflicts, and cannot easily be replaced for testing. For example, tests for code that uses the global function wfTimestampNow will need to use a lot of sleep calls when testing code pathes that depend on timestamps not being equal.

Static methods should be avoided for similar reasons: even though they do not have the issue of naming conflicts, they cannot easily be substituted for testing. Also, if their implementation was ever to change in a way that would require any service to be injected, they would have to be turned into instance methods, which would require all calling code to be adjusted.

One notable exception to this rule are constructor-like static methods, that serve to work around the fact that PHP does not support constructor overloading.

Dependency Injection
Dependency injection refers to the idea that objects should get all they need to function in the constructor (or through a call to a setter, if the respective dependency is optional). Ideally, an implementation would have static dependencies only on interfaces. Concrete implementations would be "injected" by passing the respective instances to the constructor (or to a setter, if the dependency is optional). Using this pattern, classes explicitly declare their dependencies on things they work with, as constructor parameters (typically services of some kind), in addition to what they work on, as parameters to individual methods.

For dependency injection to work, it is essential that classes have control over their constructor parameters. That means that when registering a class as an API module, or media handler, etc, it should be possible to specify how the class should be instantiated, typically be supplying a callback function, instead of relying on a fixed signature for the constructor. MediaWiki allows now this for API modules, SpecialPages, and ContentHandlers.

It is important to note that a class' constructor should only ask exactly for what the implementation needs. In particular, it should use narrow interfaces; it may even make sense to define specific interfaces just for the need that this particular class has. On the other hand, passing around "kitchen sink" objects with lots of dependencies should be avoided; in particular, service registries, configuration objects, or request context objects should not be passed in if this can be avoided.

TBD: move the boostrapping / app scope explanation The idea of dependency injection (DI for short) is that an application composes a network of service objects that can perform all the operations needed for implementing the application's logic. DI frameworks allow this network to be defined in a declarative fashion. However, using such a framework is not necessary in order to enjoy the primary benefit of DI, namely improved modularity. The construction of the application level network of service objects can be implemented explicitly, in a "top level" (or "application scope") factory (or "registry"):

The top level factory, let's call it AppServices, would implement a getter function for each service needed in the application. That getter would return the service (a singleton with respect to the application scope) after constructing it if necessary, using the lazy initialization technique. In many cases, the desired concrete implementation can simply be hard coded here.

Often, static entry points (such as static hook handler functions) will need access to the AppServices, in order to get access to the services they need to operate. To allow this, there would be a default instance of AppServices, available from a static method, e.g. AppServices::getDefaultInstance. During the bootstrapping phase of the application (essentially, in the global scope of index.php), this default instance is created, and used to instantiate and call the dynamic entry point of the application. After the initialization phase, there would typically be a call like AppServices::getDefaultInstance->getRequestHandler->handleRequest( $request ). This call starts the actual application logic.

Note: setting up an extensive network of service objects can be costly, and the bootstrapping code runs for every request, before doing anything else. Hence it is essential for the application scope factory to use lazy initialization for the service objects, instead of creating all of them immediately.

TBD: -> Migration (move/link)

Caveat: The fact that DI isolates the implementation of business logic from any knowledge about the implementations of any logic (services) it may need is often an advantage, but can also be problematic: it makes it impossibly to determine by static analysis how concrete method implementations get called. This makes it harder to review code for security issues, especially with respect to proper quiting/escaping. Unit tests alone do not fix this problem, since vulnerabilities arise especially when individual components work as designed, but there is a misunderstanding about the contract of the interfaces involved (in particular, about which parameters are expected to be escaped, and how).

Interface Segregation

 * narrow interfaces -> avoid unnecessary depdencies
 * less to implement if an alternative imple (or mock) is needed
 * Interface design should be driven by usage/need, not by "what can be done with this".
 * Interfaces cost little. Create many.


 * transitive dependencies
 * Lo level vs high level, hubs vs authorities...

Separation of Concerns

 * Reduce component complexity
 * mixing concerns -> bindeling depdenendcies
 * warning signs:
 * "and" in the description
 * some dependencies only used in part of the interface or implementation
 * more than 500 ELOC per class, more than 50 ELOC per function.

Information Locality

 * need to know
 * keep it local
 * information hiding

Composition
Composition vs. Inheritance Protected vs. Private Static comp vs injection

Error Handling

 * Use Exceptions
 * Catch late
 * Localize later
 * Document which exceptions are thrown when

Performance (link/move)

 * Hot path vs cold path
 * no premature, no micro. measure!
 * scalability >> performance
 * -> lazy init
 * -> cacheing
 * -> DB

---

Values

 * Typically "newable"
 * Equals, toString, hashing, etc
 * Immutable, unless there is a very good reason
 * If Mutable: LSP, cloning
 * Often no interface, single impl
 * If multiple implementations, use a factory
 * Mutable value -> Model, mutable list/map, etc


 * -> LSP
 * avoid mutable
 * needs generics
 * typical issue for lists/sets

Builders and Cursors

 * Represent state
 * Cursors typically for I/O
 * Builders typically for complex values
 * Model == Builder?

Services

 * Typically application scope
 * "Stateless" singletons (state: lazy initialization, caching)
 * Storage, views, etc
 * converters (formatters, serializers, etc)

Factories

 * newXXX has parameters (or the whole purpose is lazy init)
 * inject context knowledge
 * Constructs values or controllers, rarely services

Registries

 * Lazy init of services / factories / registries
 * newXXX has no parameters
 * DI entry point
 * Should not be passed around
 * App: defines application as service network
 * App: may have static singleton

Controllers

 * Business case
 * Use services
 * Created by factory
 * May be stateful
 * MVC

Glue

 * adapters
 * decorators
 * event/hook handers
 * callbacks

---

Class Level

 * Concept
 * Purpose
 * Relationship with other classes/interfaces
 * Usage

Method Level

 * Contract vs Implementation
 * array structures
 * formats / escaping

Medium Level

 * logical data model
 * information flow
 * hooks, options, etc

Guides and Howtos

 * Install & Update
 * Backup/Restore
 * Extending

Testing

 * Injection & Mocking
 * CI & confidence
 * Coverage
 * Unit vs Integration
 * Security -> Interface contracts (escaped vs unescaped)