Manual:MediaWiki architecture

This page contains the actual text of the MediaWiki architecture document project. This content will be merged into the appropriate mediawiki.org pages once the specific formatting required for inclusion in the AOSA book is not needed any more.

We are tentatively using an "architecture through history" approach, where MediaWiki's history is told chronologically, and its different parts presented along a timeline, surrounded by the context (reasons, assumptions, constraints) in which they were introduced.

Introduction

 * context
 * Wikipedia and sister sites
 * optimization and performance because of requirements of high-profile sites like those operated by the WMF

Phase I: UseModWiki
Wikipedia was launched in January 2001. At the time, it was mostly an experiment, to try and boost the production of content for Nupedia, a free-content, but peer-reviewed, encyclopedia created by Jimmy Wales. Because it was an experiment, Wikipedia was originally powered by UseModWiki, an existing GPL wiki engine written in Perl, using CamelCase and storing all pages in individual text files.

It soon appeared that CamelCase wasn't really appropriate to name encyclopedia articles. In late January 2001, UseModWiki developer and Wikipedia participant Clifford Adams added a new feature to UseModWiki: free links, i.e. the ability to link to pages with a special syntax (double square brackets) instead of automatic CamelCase linking. A few weeks later, Wikipedia upgraded to the new version of UseModWiki and enabled free links.

While this initial phase isn't about MediaWiki per se, it provides some context and also shows that, even before MediaWiki was created, Wikipedia started to shape the features of the software that powered it. UseModWiki also influenced some of MediaWiki's features, for example its basic text formatting syntax.

Phase II: the PHP script
In 2001, Wikipedia was not yet a top 10 website; it was an obscure project sitting in a dark corner of the Interwebs, unknown to most search engines, and hosted on a single server. Still, performance was already an issue, notably because UseModWiki stored its content in a flat file database. At the time, Wikipedians were worried about being "inundated with traffic" following articles in the New York Times, Slashdot or Wired.

In Summer 2001, Wikipedia participant Magnus Manske (then a university student) started to work on a dedicated wiki engine for Wikipedia in his free time. The goal was to improve Wikipedia's performance using database-driven software, but also to be able to develop Wikipedia-specific features that couldn't be provided by a "generic" wiki engine. Written in PHP and MySQL-backed, the new engine was simply called the "PHP script", "PHP wiki", "Wikipedia software" or "phase II".

The "PHP script" was made available in August 2001, shared on SourceForge in September, and tested until late 2001. As Wikipedia suffered from recurring performance issues because of increasing traffic, the English-language Wikipedia eventually switched from UseModWiki to the PHP script in January 2002. Other language versions created in 2001 were slowly upgraded as well, although some of them would stay powered by UseModWiki until 2004. An automated program called "User:Conversion script" converted the latest version of each existing article to the phase II format; the pre-January 2002 revision history was partly restored by developer Brion Vibber in September 2002.

As a PHP application backed by a MySQL database, the PHP script was the first iteration of what would later become MediaWiki. Not only that, but it also introduced many critical features still in use today, like namespaces to organize content (including talk pages), skins, and special pages (including maintenance reports, contribution lists and the user watchlist).

Phase III: MediaWiki

 * performance /load issues led to another rewrite by Lee Daniel Crocker: "Phase III". Same PHP/MySQL architecture, same basic interface
 * https://en.wikipedia.org/w/index.php?title=Wikipedia:Software_Phase_III&oldid=172368
 * +interwiki links
 * deployed to Wikipedia on Saturday, July 20, 2002, along with a hardware move to a new server
 * August 28, 2002: deployed to The German Wikipedia; deployed to other language wikipedias over 2002, with interwiki links being added along the upgrades
 * new features over 2002: maintenance special pages, edit on double click, etc.
 * still performance issues. On November 11th, 2002, Brion Vibber disabled the view counter and site statistics, which were causing two database writes on every page view. Occasional switches to read-only mode were necessary. Re-enabled later in November.
 * table locking problems => some maintenance pages disabled during high-access times in late November 2002


 * May 2003: upgraded to MySQL 4.0.12.


 * mid-June 2003: DB server separated from the web server; new page caching system for logged-out users; "As of mid-May 2003, we use the filesystem to cache rendered, ready-to-output pages for anonymous users."


 * WMF announced in June 2003; MediaWiki named in July, causing long-lived confusion between the similar names "MediaWiki" and "Wikimedia"
 * July 2003: new features: Table of contents, section editing


 * September 2003: still only two servers: a web server for the English Wikipedia (Larousse), and another server (Pliny) acting as a web server for the non-English Wikipedias and as a database server for all Wikipedias; load balancing set up in November 2003 between en.wp.o & en2.wp.o (Larousse and Pliny?)


 * January 2004: still performance problems
 * January 29th, 2004: new features: Edit toolbar "can be enabled in prefs (works perfectly in IE, near perfect in Mozilla, not so great in most others); will be refined and possibly made default in the future", Extended image syntax "allowing automatic generation of small versions of images, and alignment of images without HTML"


 * January 30, 2004: Nine new Wikimedia servers, purchased using about $20K of the money generously donated in the December/January fundraising drive. Brought online over February 2004


 * Late May 2004: MediaWiki 1.3


 * December 12, 2004: Spam blacklist


 * January 2005: MediaWiki 1.4 on the English Wikipedia

The Coming of age

 * See also MediaWiki

2004
 * Categorization system

2005
 * Hooks!
 * Major database redesign decoupling text storage from revision tracking

2006
 * ParserFunctions extension

2007
 * Gadgets extension

2008
 * FlaggedRevisions extension

2009-2010
 * ResourceLoader
 * Usability initiative (Vector skins/WikiEditor extension etc.)

Milestones to integrate

 * creation, testing, and initial deployment of Magnus' "phase 2"
 * creation, testing, and initial deployment of Lee's "phase 3"
 * Refactorings & performance improvements in the early Brion & Tim years
 * internationalization & unicode
 * addressable logs
 * 1.5 schema refactor; Proposed Database Schema Changes/October 2004
 * compression & external storage
 * web-based installer (1.2)
 * regular releases
 * Early empowerment of end-users
 * user/site JS/CSS
 * extensions
 * 1.3: The MonoBook skin, categories, templates and extensions.
 * CentralAuth
 * Gadgets
 * API
 * 1.12 or so - preprocessor rewrite (Tim); improvements to template performance
 * 1.17 - resourceloader; beginning of strong JavaScript module APIs
 * The creation of api.php, and the addition of write actions (including edit) to it

Use

 * reusability of MediaWiki by other people, not just for our purposes; it used to be difficult to install (command-line installer, many references to Wikipedia, hardcoded paths); now easier with the web-based installer
 * installer: a quick note about the installer would be good. During the early days (ask Tim Starling), you had to run a shell script to install MediaWiki. Later it was reduced to uploading the files and running the /config/ wizard.
 * Making it FLOSS since the very beginning was very important for its popularity. MediaWiki has become the 800-lb gorilla of wiki software, and it wouldn't have happened with a closed development model.
 * Reusing MediaWiki to build commons: not adapted to handling millions of media files.

Configuration

 * Globals for configuration -- partly because we started out in a PHP4 world, but it has really hurt 3rd parties over time and made the software seem rather difficult to configure and maintain.


 * Configuration variables are placed in the global namespace.
 * This had serious security implications with register_globals.
 * It limits potential abstractions for configuration, and makes optimisation of the startup process more difficult.
 * The configuration namespace is shared with variables used for registration and object context, leading to potential conflicts.
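The hazard described above can be sketched outside PHP too. The following is a hypothetical Python analogue (MediaWiki's real configuration is PHP globals like $wgSitename): when configuration and runtime state share one namespace, a collision silently clobbers configuration, whereas a config object keeps the two apart.

```python
# Hedged sketch, NOT MediaWiki code: a Python analogue of the
# global-namespace configuration problem described above.

# Global-namespace style: config and runtime variables intermingle.
wg_sitename = "MyWiki"      # configuration
wg_user = "configured-bot"  # configuration...
wg_user = "Alice"           # ...silently clobbered by request-handling code

# Object style: configuration is read through one namespaced interface,
# so runtime state cannot collide with it.
class Config:
    def __init__(self, values):
        self._values = dict(values)
    def get(self, name):
        return self._values[name]

config = Config({"sitename": "MyWiki", "user": "configured-bot"})
current_user = "Alice"  # runtime state lives elsewhere; no collision
print(config.get("user"))
```

The object style also gives a single place to hook abstractions (lazy loading, caching of the parsed configuration), which is exactly what the global-variable approach makes difficult.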

Architecture overview

 * PHP, DB
 * what is the overall workflow of a user request?
 * index.php dispatches to MediaWiki class; SpecialPage class; page, revision, & user tables, Title & WikiPage classes
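The workflow in the note above can be sketched roughly as follows. This is an illustrative Python analogue, not MediaWiki's actual PHP; the class names mirror the real ones mentioned above (Title, WikiPage), but their behavior here is invented for the sketch.

```python
# Hedged sketch of the request workflow: the entry point parses the
# request, resolves a Title, and dispatches either to a special page
# or to an action on a wiki page.

class Title:
    """Parsed page title, split into namespace and text."""
    def __init__(self, full_text):
        if ":" in full_text:
            self.namespace, self.text = full_text.split(":", 1)
        else:
            self.namespace, self.text = "", full_text

class WikiPage:
    """Loads the latest revision of a page (revision store is a dict here)."""
    def __init__(self, title, revisions):
        self.title = title
        self.revisions = revisions
    def view(self):
        return self.revisions.get(self.title.text, "(page does not exist)")

def dispatch(request, revisions):
    """Rough analogue of index.php handing off to the MediaWiki class."""
    title = Title(request.get("title", "Main_Page"))
    if title.namespace == "Special":
        return f"[special page: {title.text}]"
    action = request.get("action", "view")
    if action == "view":
        return WikiPage(title, revisions).view()
    return f"[unhandled action: {action}]"

revisions = {"Main_Page": "Welcome to the wiki."}
print(dispatch({"title": "Main_Page"}, revisions))
print(dispatch({"title": "Special:Watchlist"}, revisions))
```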

PHP

 * Unprefixed class names.
 * PHP core and PECL developers have the attitude that all class names that are English words inherently belong to them, and that if any user code is broken when new classes are introduced, it is the fault of the user code.
 * Prefixing e.g. with "MW" would have made it easier to embed MediaWiki inside another application or library.


 * We try to do things cleanly if there are benefits to it (e.g. separation of logic and output in the architecture) but at the same time we're not afraid to ignore standards or rules if that's better for us (e.g. not fully complying with stupid provisions of HTML4, denormalizing the DB schema where that brings performance benefits)


 * MediaWiki seems to have been started mostly by people who weren't really very expert in their field, and as a result a lot of ugly old code is lying around that lacks proper logic/view separation and has other nasty issues

MediaWiki grew organically and is still evolving. It's hard to criticise the founders for not implementing some abstraction which we now find to be critical, when the initial codebase was so small, and the time taken to develop it so short.

Structure and classes

 * main classes

We've seen major new architectural elements introduced to MediaWiki throughout its history, for example:


 * The Parser class
 * The SpecialPage class
 * The Database class
 * The Image class, then later the media class hierarchy and the filerepo class hierarchy
 * ResourceLoader
 * The upload class hierarchy
 * The Maintenance class
 * The action hierarchy

MediaWiki started without any of these things, despite the fact that all of them support features that have been around since the beginning. Many developers are driven primarily by feature development -- architecture is often left behind, only to catch up later as the cost of working within an inadequate architecture becomes apparent.


 * Internal API


 * Special pages
 * June 2003: PageHistory:Page name, UserContributions:User name, BackLinks:Page name

Database

 * performance tweaks


 * schema rewrite for performance (MediaWiki 1.5?); done early enough, and very well thought out. We've added to it over time, but by and large it has remained the same to this day and continues to serve us reasonably well.


 * DBMS support & abstraction
 * Manual:Database layout
 * http://yellowstone.cs.ucla.edu/schema-evolution/index.php/Schema_Evolution_Benchmark


 * DB API, cache handling (Squid purge, memcached).
 * every time the workflow requests data, there is a cache to speed it up
 * the first layer being web caches (Squid, Varnish), then memcached for data and parser output, then the database query cache (not sure it is used)
 * queries are split across multiple databases, caches are partitioned, etc.
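The layered lookup in the notes above can be sketched like this. It is a deliberately simplified Python stand-in (plain dicts instead of Squid/Varnish and memcached): each miss falls through to the next layer, and the result is written back into the layers that missed.

```python
# Hedged sketch of a layered read-through cache: front-end HTTP cache,
# then memcached, then the authoritative database.

class LayeredCache:
    def __init__(self, layers, database):
        self.layers = layers      # ordered: fastest/outermost first
        self.database = database  # authoritative store
    def get(self, key):
        for i, layer in enumerate(self.layers):
            if key in layer:
                value = layer[key]
                break
        else:
            i = len(self.layers)
            value = self.database[key]  # missed everywhere: hit the DB
        for layer in self.layers[:i]:   # populate the layers that missed
            layer[key] = value
        return value

http_cache, memcached = {}, {}
db = {"page:Main_Page": "<html>...</html>"}
cache = LayeredCache([http_cache, memcached], db)
cache.get("page:Main_Page")  # miss, miss, DB read; both layers populated
cache.get("page:Main_Page")  # now served from the front-end cache
```

The real stack adds invalidation (Squid purges on edit) on top of this read path.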

Wikipedia's software

 * software originally written for a very specific purpose: serving a community for the creation & curation of knowledge; this is not the case for most CMSes.
 * constant design decisions that never come up in other CMSes
 * openness of wikis; technology response: abusefilter, etc.
 * the community defines roles and processes to deal with vandalism, for example. That doesn't come up in other contexts: these are technological answers that regular CMSes never need. Other CMSes serve corporate customers with existing workflows.
 * needs of the community
 * the community and its needs evolved over time; it wouldn't have been possible to foresee the features the community would need
 * interplay of editing community and development

The architecture of MediaWiki has been driven many times by initiatives started or requested by the community, e.g. the creation of Wikimedia Commons, FlaggedRevs, or Wiki Loves Monuments.

Performance

 * schema rewrite for performance (MediaWiki 1.5?);
 * we wrote our own database abstraction layer and load balancer; at that time, there was not much around, so we HAD to write one


 * relying on PHP/MySQL: probably not the best choices for performance. But very popular and facilitates recruitment of new devs
 * hurt: PHP has not benefited from the performance improvements that some other dynamic languages have seen in recent years (e.g. JavaScript VMs now have aggressive JITs, but Zend's PHP still doesn't ship an opcode cache, much less try to actually compile anything)
 * Obviously using Java would have been much better for performance, and scaling up the execution of backend maintenance tasks would have been simpler.


 * hurt: MySQL has had a few specific areas it's lagged in that have been problematic:
 * lack of full native UTF-8 support (this is finally in the latest versions, but you have to jump through some hoops, and we have years of legacy databases)
 * no or limited online alter table makes even simple schema changes painful to deploy, slowing some development
 * data dump format is very hard to parallelize well; even with few changes to the database it takes forever to build one due to the compression.


 * Adding support for memcached (in-memory cache) and APC (PHP opcode cache) had a HUGE impact on performance.


 * endless compromise between performance and features, e.g. the template system degraded performance. One possibility is to turn templates into dedicated extensions.


 * ResourceLoader
 * Using jQuery for javascript.


 * hurt performance: awful, awful syntax, making it harder to plug in better-performing parser bits, etc.


 * MediaWiki has to be webscale ;-) Unlike most PHP applications MediaWiki has been built for years now with performance as a major design goal since it absolutely must scale to WMF sites (primarily: enwiki)


 * Adding a generic caching layer did amazing things for performance. We can throw practically anything expensive into the cache and expect it to come out :)


 * As discussed above: the configuration and extension registration systems hurt MediaWiki's performance.


 * The template argument default parameter feature was ultimately the most costly feature in terms of performance: it enabled the construction of a functional programming language implemented on top of PHP.


 * When the limitations imposed on us by WMF's caching infrastructure caused problems in MediaWiki or friction with things we wanted to do in MediaWiki, we found ways around that. We didn't try to shape WMF's caching-heavy optimized-to-death architecture around MW, but we did almost the reverse: make MW more flexible so it can handle our crazy caching setup, without compromising on our performance and caching needs.


 * things we support that 'normal' people never think about:
 * DB replication/lag handling
 * reverse caching proxies
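The replication/lag handling mentioned above boils down to a routing decision on every read. The sketch below is a hypothetical Python illustration of that idea, not MediaWiki's actual LoadBalancer: reads prefer the least-lagged replica, and fall back to the primary when every replica is too far behind.

```python
# Hedged sketch of replication-lag-aware read routing (assumed policy).

MAX_LAG_SECONDS = 5  # beyond this, a replica is too stale to read from

class Server:
    def __init__(self, name, lag_seconds):
        self.name = name
        self.lag_seconds = lag_seconds

def pick_read_server(primary, replicas, max_lag=MAX_LAG_SECONDS):
    """Prefer the least-lagged healthy replica; fall back to the primary."""
    healthy = [r for r in replicas if r.lag_seconds <= max_lag]
    if healthy:
        return min(healthy, key=lambda r: r.lag_seconds)
    return primary  # all replicas lagged: read from the primary

primary = Server("db-primary", 0)
replicas = [Server("db-replica1", 2), Server("db-replica2", 30)]
print(pick_read_server(primary, replicas).name)  # → db-replica1
```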


 * Factors that contribute to a 'performance culture':
 * We have a few very expert people on board who know a lot about performance optimization (DB performance is a big chunk of it, but the rest is important too)
 * MediaWiki must run on a top-ten web site. Things that would not scale to that size are fixed, reverted, or put behind a config var
 * MediaWiki must run on a top-ten web site operated on a shoestring budget, which has additional implications for performance and caching
 * Specific things that have improved performance:
 * Generic caching layer support that looks the same to the developer whether the backend is memcached (preferred), APC/ECache/whatever, a database, or even nothing (null cache)
 * Using disk-backed object cache for the parser cache greatly improved the pcache hit rate and produced some awesome hit rate graphs
 * PoolCounter prevents a Michael Jackson-esque cache stampede. It's sort of difficult to verify it actually works, but we've seen things that lead us to believe it does indeed work
 * Specific things that have harmed performance
 * The fact that wikitext slowly evolved into this almost Turing-complete programming language. Wikitext that exploits these features takes forever to parse
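The PoolCounter idea from the list above can be sketched in miniature. The real PoolCounter is a network service coordinating many web servers; this is a hedged, single-process Python analogue of the same principle: when a hot cache entry expires, exactly one worker regenerates it while the others wait, instead of every request re-parsing the same page at once.

```python
# Hedged single-process sketch of cache-stampede prevention.
import threading

cache = {}
cache_lock = threading.Lock()
regen_locks = {}  # one regeneration lock per cache key

def get_with_pool(key, regenerate):
    with cache_lock:
        if key in cache:
            return cache[key]
        lock = regen_locks.setdefault(key, threading.Lock())
    with lock:  # only one thread regenerates; the rest block here
        with cache_lock:
            if key in cache:  # a peer filled it in while we waited
                return cache[key]
        value = regenerate()  # the expensive parse runs exactly once
        with cache_lock:
            cache[key] = value
        return value

print(get_with_pool("page:Main_Page", lambda: "rendered page"))
```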


 * couple slides from talk at CCC in 2004

Security

 * Strongly focusing on security by providing wrappers around HTML output and DB queries that handle escaping for you, and making their use pretty much mandatory. This means that everyone is expected to write secure code, while at the same time writing secure code is made easy so everyone can do it. Thanks mostly to Tim Starling, we have institutionalized a security-minded development culture, and I think that contributes to the low number of security flaws found in MediaWiki.


 * WebRequest / Sanitizer
 * user input sanitization: we have WebRequest to grab parameters given by a user and make them safe, to avoid code injection
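The idea behind WebRequest can be sketched as follows. This is a hypothetical Python analogue, not MediaWiki's actual API: parameters are never read raw, but fetched through typed accessors that coerce and escape, so injection-prone strings never flow unchecked into queries or output.

```python
# Hedged sketch of typed, escaping request-parameter accessors.
import html

class WebRequest:
    def __init__(self, params):
        self._params = params
    def get_int(self, name, default=0):
        """Coerce to int; anything non-numeric becomes the default."""
        try:
            return int(self._params.get(name, default))
        except (TypeError, ValueError):
            return default
    def get_text(self, name, default=""):
        """Return text with HTML special characters escaped for output."""
        return html.escape(str(self._params.get(name, default)))

req = WebRequest({"limit": "50", "offset": "DROP TABLE page", "q": "<script>"})
print(req.get_int("limit"))   # 50
print(req.get_int("offset"))  # 0 — garbage coerced to the default
print(req.get_text("q"))      # &lt;script&gt;
```

Making accessors like these the only sanctioned way to touch request data is what makes "everyone writes secure code" the path of least resistance.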

Languages

 * localization and internationalization
 * 2003: providing PHP patches and sending them to wikitech-l
 * December 6, 2003: The MediaWiki namespace, a new MediaWiki feature for interface translation and customisation, has been switched on.

(both for content and interface, and omnipresent in MediaWiki, hence dedicated section)


 * Using per-language encodings led to a lot of issues. Eventually everything migrated to UTF-8, which makes things easier when you deal with hundreds of different languages. The old per-language encodings were eliminated in 1.5, along with the "big schema change".


 * i18n/l10n system


 * We are fully committed to internationalizing our software in any imaginable language. This i18n support is quite pervasive and impacts many parts of MediaWiki, but despite that we stuck with it anyway, and we now have a very feature-rich i18n system.


 * things we support that 'normal' people never think about:
 * internationalization in 350+ languages
 * the ability to have the interface and content in different languages
 * right-to-left languages
 * mixed directionality (i.e. interface language and content language have opposite directionality)

Authentication

 * A cleaner accounts system that spanned multiple sites from the beginning would have saved lots of trouble; CentralAuth is still a bit hacky to work with.

Permissions

 * Lack of a unified, pervasive permissions concept.
 * This stymied the development of new user rights and permissions features, and led to various security issues.

Content structure

 * Page title: CamelCase, then free links. Allowed for greater flexibility in link text and page names, and made them less confusing. Free links have since become the de facto standard for internal links in most wiki software.
 * subpages (/talk subpages initially)
 * namespaces, parentheses in titles; related to the abandonment of subpages, decided by Larry Sanger in November 2001 after intense debate.
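The CamelCase-versus-free-links shift described above is easy to illustrate. The regexes below are an assumption for the sketch (UseModWiki's real rules, written in Perl, were more involved): CamelCase linking fires on every multi-hump word whether the author wanted a link or not, while free links only link what is explicitly wrapped in double square brackets.

```python
# Hedged sketch contrasting the two linking styles.
import re

def link_camelcase(text):
    """Turn every CamelCase word into a link, wanted or not."""
    return re.sub(r"\b([A-Z][a-z]+(?:[A-Z][a-z]+)+)\b",
                  r'<a href="/\1">\1</a>', text)

def link_free(text):
    """Link only what the author wrapped in [[double square brackets]]."""
    return re.sub(r"\[\[([^]|]+)\]\]",
                  lambda m: '<a href="/%s">%s</a>'
                            % (m.group(1).replace(" ", "_"), m.group(1)),
                  text)

print(link_camelcase("See GeorgeWashington."))
print(link_free("See [[George Washington]]."))
```

Free links let article titles contain spaces and ordinary words, which is what an encyclopedia needs.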

Namespaces

 * namespaces
 * The flat namespace for articles is too simple: for Wikipedia it encourages overly long pages (which leads to performance problems, as we have to parse and copy around huge chunks of text that will not usually get read all at once, and makes it harder to navigate to relevant, more digestible chunks of data). For other sites like Wikibooks, Wikisource, Wikiversity, heck, even mediawiki.org, we could benefit a lot from more structured entities consisting of multiple subpages. It also makes it harder to separate draft or provisional pages from the published article space.


 * Magnus Manske implemented them in the first PHP script, then they were reimplemented a few times
 * way to separate the different spaces for the community to evolve
 * created the necessary preconditions for the community's meta-level discussions, community processes, and user profiles

Wikitext & Parser

 * tokenizer to parse wikitext (JeLuF wrote it). Unfortunately, poor performance of PHP array memory allocations led to a revert after 3 days of having it running on the live site. We have been back to the huge pile of regexps since then.
 * Cleaner markup syntax near the beginning would simplify our lives a lot with template & editing stuff
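The "huge pile of regexps" approach mentioned above looks roughly like this. The rules below are a hypothetical, drastically simplified Python illustration (the real parser applies hundreds of passes): wikitext becomes HTML through successive substitutions whose order matters, which is part of why the "spec" is effectively the pile itself.

```python
# Hedged, minimal sketch of regex-based wikitext-to-HTML conversion.
import re

def parse(wikitext):
    html_text = wikitext
    # Order matters: bold (''') must run before italics ('').
    html_text = re.sub(r"'''(.+?)'''", r"<b>\1</b>", html_text)
    html_text = re.sub(r"''(.+?)''", r"<i>\1</i>", html_text)
    html_text = re.sub(r"^== *(.+?) *==$", r"<h2>\1</h2>",
                       html_text, flags=re.MULTILINE)
    return html_text

print(parse("== History ==\nIt was '''bold''' and ''new''."))
```

Swap the first two substitutions and `'''bold'''` mis-renders: with dozens of interacting rules like these, that fragility is what makes an alternative parser so hard to write.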


 * The parser wasn't formally spec'd from the beginning--it just morphed and evolved as needs have demanded. This makes it difficult for alternative parsers to exist and has made changing the parser hard. The parser's spec is whatever the parser spits out, plus a few hundred test cases.


 * The parser has to remain very very stable. Hundreds of millions of wikipages worldwide depend on the parser to continue outputting HTML the way it always has. It makes changing the parser difficult.

Making wikitext such a complex and idiosyncratic language that parsing it with 'normal' parsers is very hard was definitely a bad move, and we're feeling the pain now

Editing

 * metadata (categories, interwiki links) lives in the body of the text. It should really have been stored in a separate table and exposed through a dedicated interface.


 * The visual editor project is way overdue. We're fixing it now, and that's good, but it's kind of ridiculous that, in 2011, the main interface of one of the largest sites on the web is still a plain text box from the 90s

Templates

 * While we have plenty of things to whinge about in the syntax, management etc, the ability to create partial page layouts and reuse them in thousands of articles with central maintenance has been a big boon.

ParserFunctions

 * ParserFunctions never should've seen the light of day. Granted we were responding to the needs of the time, but if we did it over, we probably should've taken the time to solve the problem properly rather than putting PFuncs as a stopgap measure. This page has some interesting history on the subject (Also this and also older versions of this page)

Media

 * DB & filesystem storage layout for media files is very awkward, with a number of problems that hinder our ability to mirror, cache, and do major online maintenance.


 * File repository code could've been done slightly differently. Ideally, wikis should be able to upload *to* foreign repos, rather than just read from them. Also, most of the code assumes a local filesystem or NFS, which isn't very flexible; other backends like the database, Swift, etc. shouldn't be so hard to add.


 * Replication support

Community

 * Open source from the start


 * MediaWiki's first couple of years involved zero engineering budget and a lot of volunteer turnover -- the original authors of the phase 2 and phase 3 PHP codebases were Wikipedians with a technical bent, as were most of our other early devs; folks like Brion started on fixes, internationalization, and feature support based on their ability to see the source in CVS & the bug tracker on SourceForge, discuss with other devs on the wikis & mailing lists, and get software updates actually pushed to production.


 * Regular releases and installer. Making sure the software was easy to set up was a BIG factor in getting more volunteer developers involved, both directly (lowering the barrier for people already interested in starting to work on it) and indirectly (easier 3rd-party usage, leading to people sending fixes and customizations upstream).


 * Deploy-then-release

backwards compatibility

 * Some aspects such as hooks or configuration variables, remain very stable for a long time. When they change, they typically go through a slow deprecation process to allow users and extension authors to catch up.
 * However, our internal APIs change all the time, which can be frustrating to extension authors (and even core devs!).
 * I think this will improve in the coming releases though. A lot of our "omg rewrite" situations in the past ~2 years have been to bring ancient code into the 21st century.


 * Inconsistent. Such is the result of having many volunteer developers with many different opinions on this fraught issue.


 * Most old methods/functions are kept in the code; nowadays they are marked as deprecated and removed after 2 or 3 releases. We still support ten-year-old skins.
 * The wikitext parser and renderer still support hacks we really want to remove for performance reasons.

QA

 * Peer reviewing of every single patch since day 1.

Beyond MediaWiki

 * Several levels:


 * System administrator:


 * Wiki administrator:


 * Common.js, Common.css, and skin-specific counterparts
 * Gadgets (per wiki, but can be enabled/disabled by each user)
 * New developments on gadgets (central repo, UI improvements)


 * User:


 * User:Foo/vector.css, User:Foo/common.js

Side programs

 * interaction with other pieces of software (i.e. LaTeX, ImageMagick, etc.)
 * January 6, 2003: Inline TeX math formulas are now supported

Skins & extensions

 * hooks
 * Extension architecture. fairly flexible infrastructure which has helped us to make specialized code more modular, keeping the core software from expanding (too) much and making it easier for 3rd-party reusers to build custom stuff on top.
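The hook mechanism behind that extension architecture can be sketched simply. This is an illustrative Python analogue with invented names (MediaWiki's real hooks are PHP entries in $wgHooks): core code fires named events, extensions register handlers, and a handler returning False vetoes the operation.

```python
# Hedged sketch of a hook/event system for extensions.
hooks = {}  # hook name -> list of handler callables

def register_hook(name, handler):
    hooks.setdefault(name, []).append(handler)

def run_hook(name, *args):
    """Call every handler in order; abort if any handler returns False."""
    for handler in hooks.get(name, []):
        if handler(*args) is False:
            return False
    return True

# An "extension" hooking into page saves; core never needs to know it exists.
def block_shouting(title, text):
    if text.isupper():
        print(f"edit to {title} rejected")
        return False

register_hook("ArticleSave", block_shouting)

print(run_hook("ArticleSave", "Main_Page", "A calm edit."))   # True
print(run_hook("ArticleSave", "Main_Page", "ALL CAPS RANT"))  # False
```

The core only pays the cost of a lookup per hook point, while extensions get well-defined seams to modify behavior without patching core.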


 * The skin system has been terrible since the beginning. It's damn near impossible for 3rd parties to write their own custom skins without reinventing the wheel.


 * Extension registration based on code execution at startup rather than cacheable data.
 * Limits abstraction and optimisation

Gadgets and site/user JS/CSS

 * hugely impactful, this has greatly increased the democratization of MediaWiki's software development. Individual users are empowered to add features for themselves; power users can share these with others both informally and through globally-configurable admin-controlled systems.


 * Using jQuery for javascript.

Future

 * Current RfCs, etc.
 * New parser (specification), wiki dom, Visual editor


 * MW is a tool that's used for very different purposes: media categorization, management of structured data, curation of content, patrolling of content, corporate CMS
 * we'll see an increased need for specialized uses
 * e.g. MW used for Commons, which it was not built for. There are benefits to doing so, but it needs improvements for that specific use case
 * other example: structured data
 * ways to make MW better: look at the workarounds (toolserver, bots, etc.)