Manual talk:MediaWiki architecture

Feedback is welcome, and even encouraged. If you see that something's wrong, please fix it or report it below.

Factual errors
Ideally, please fix the errors directly if you notice any. If you really don't want to, you can leave a note below.


 * "The code executed from  performs security checks, loads default configuration settings from , guesses configuration with   and then applies site settings contained in  ."  I actually don't know whether this is right -- could someone check?  Does setup.php get hit every time someone hits index.php?
 * I asked Chad to look at the Configuration section. "'Global configuration variables offered better performance than other configuration methods in older versions of PHP' ... Doesn't make sense. They still offer better performance over my new config model. '...and makes it more difficult to optimize the start-up process.' I kind of get what you're saying here, but I think it's superfluous and will just confuse readers." Sumanah 16:38, 31 October 2011 (UTC)
 * Also from Chad: I would also rephrase "and hurt third-party reuse of MediaWiki's code (since most other projects cannot share the same MediaWiki global variable names)"... Basically the idea there is "MediaWiki pollutes the global namespace" but the phrasing is a bit awkward. Re: "The future: storing configuration in the database" -- Please don't promise that just yet ;-) I'm not 100% settled on the model yet.

Stuff that's missing
If you see that a major piece of information related to MediaWiki's architecture is missing from this document, please add it below. Ideally, you would add the content directly to the page, or at least provide pointers to where relevant information can be found.

Please do keep in mind, though, that we are limited to ±5000 words, and we're already way over the limit, so we can't go into too much detail, and we can't mention everything. Also, this document is specifically about MediaWiki's architecture, so it's normal that every single feature isn't included in it.


 * To trim the length, I'd recommend removing some more from the caching section--it's a little verbose as is ^demon 14:41, 26 October 2011 (UTC)

Content organization and major changes
If you'd like to suggest major refactoring or reorganization of content, please do so here before editing the page, to minimize disruption.



Review
If you've reviewed the whole document, or sections of it, please add your name below and say what you've reviewed. This will help identify what has been reviewed and what hasn't, in order that the whole document is accurate.

Tim's comments

 * "The object/parser cache used by Wikimedia is memcached, with dozens of servers dedicated to it"
 * We use MySQL for the parser cache now.


 * "ResourceLoader is a particularly interesting case, as it's one of the few core components of MediaWiki that benefited from proper architecting prior to development."
 * Seems unfair, potentially offensive.


 * "such as the impossibility to write native names in a language that required a different encoding"
 * The grammar is incorrect here. Also there is a technical point: we could use foreign scripts, we just had to use HTML entities. However, in page titles and usernames, they couldn't be used. I suggest "For example, foreign scripts could not be used in page titles", as its own sentence, omit the part about HTML entities. Also "Latin-1 support was dropped in 2005": to be precise, support for character sets other than UTF-8 was dropped, Latin-1/CP1252 was just one of them.


 * "Characters not available on the editors keyboard can be customized and inserted via MediaWiki's Edittools, or its JavaScript version"
 * Suggest explaining what Edittools does.


 * "Localization of the user interface messages was implemented in many different ways in the early years of MediaWiki, especially in MediaWiki extensions. Efforts were made to standardize them; interface messages are now all stored in PHP arrays of key-values pairs. Each message is identified by a unique key, which is assigned different values across languages. This standard was established for legacy reasons, and also because other systems were deemed not to be flexible enough for MediaWiki. For example, gettext doesn't support plural forms for multiple variables."
 * I don't understand this whole paragraph. Everything in it seems to be incorrect. Messages were always stored in PHP arrays with key/value pairs, even in extensions. Messages always had a unique key which had different values in different languages. Only the registration interface has changed. How can a system be established for "legacy reasons"? I would have thought that when something is established, by definition there's no legacy. I don't know why Lee set up the i18n system the way he did, but I'm sure he didn't have plurals of multiple variables in mind. I'm not sure if he even considered gettext, but if he had, he probably would have discarded it on the basis that it's not compiled into PHP by default. I don't recall Lee using optional PHP extensions at all.


 * "MediaWiki extensions provide such features to some extent, but they are often fragile at best."
 * "often" contradicts "at best". I think "at best" is unfair and suggest removing it, since we now have fine-grained edit permission hooks which support all sorts of edit restrictions (e.g. AbuseFilter) quite robustly.

Roan's comments

 *  "Because MediaWiki is the platform for high-profile sites such as Wikipedia, security is paramount[1]. Core developers and code reviewers keep up a strong security-minded development culture. From the editors: Explain how"
 *  sumanah: For that you need only go back to my original contribution (the survey thing) and read what I wrote about security
 *  oh wait
 *  You did and it's right below there
 *  oh its not a diagram of our cache setup i was thinking about, its a squid logging setup http://wikitech.wikimedia.org/view/File:Squid_logging.jpg
 *  I don't really understand the editors' question then. I guess what I mean is that there is almost a social stigma on writing insecure code, committers are told to read the security manual as soon as they get commit access, and anything that smells of not being secure is an immediate fixme
 *  sumanah: for the fixme in the skin section you could probably pounce on dantman when he's next on
 *  "The database went through dozens of schema changes over the years, the most notable being the decoupling of text storage and revision tracking in MediaWiki 1.5. This change, aiming to make better use of the database cache and disk I/O, resulted in significant performance boosts for some operations, like rename and delete operations on pages with very long edit histories. From the editors:...
 *  ...What was the performance bottleneck, how did you decouple them, etc.?"
 *  The reason the editors commented on this paragraph is that it's wrong
 *  It seems to describe the 1.4->1.5 big schema change, but confuses it with external storage
 *  To clarify
 * RoanKattouw: just so you know, I'm just going to copy everything you write here into the talkpage under "review" or a similar section heading so guillom can integrate your review
 *  OK
 *  The 1.4->1.5 schema change made it so that you could rename pages without having to rename each revision, IIRC
 *  I would have to dig up a 1.4 schema to verify that
 *  The external storage change concerns splitting revision metadata and revision text, and does indeed make deletes faster
 * <RoanKattouw> This is because, when we delete a page, we delete all its revisions from the revision table and reinsert them into the archive table
 * <RoanKattouw> Obviously if the text is somewhere else and you can copy a pointer to the text instead of the text itself, that's faster
 * <RoanKattouw> (I guess this isn't external storage then, but the revision/text split, which was also part of the 1.4->1.5 change, sorry for the confusion. ES is completely unrelated)
 * <RoanKattouw> Yup
 * <RoanKattouw> So, to more comprehensively explain the 1.4 vs 1.5 change
 * <RoanKattouw> In the 1.4 model, there were two important tables, cur and old
 * <RoanKattouw> cur (for current), contained the current version of the page, including 1) revision metadata for the most recent revision, 2) the text for that revision and 3) per-page metadata
 * <RoanKattouw> old contained the old versions of the page, including 1) revision metadata for the old revisions, 2) the text for those old revisions and 3) for some reason I don't understand, the name (in addition to the ID) of the page the revision belongs to
 * <RoanKattouw> This meant that if you edited a page, the previously current revision became old and had to be copied to the old table (including the text!)
 * <RoanKattouw> Then the entry in the cur table would be updated to reflect your edit
 * <RoanKattouw> If you renamed a page, you would have to rename the cur entry, obviously, but because the old table also included the page title you'd have to rename all the old revisions too
 * <RoanKattouw> Finally, if you deleted a page, you'd have to copy the cur entry and all of the old entries into the archive table, then delete them. Because the cur and old entries include the full revision texts, this means you may be moving megabytes of text around
 * <RoanKattouw> In the 1.5 model, cur and old were replaced with page, revision and text
 * <RoanKattouw> This model is used to this day
 * <RoanKattouw> page contains the per-page metadata that previously lived in cur, and nothing else. This includes the page ID, page namespace+title, and the latest revision ID
 * <RoanKattouw> revision contains the revision metadata for each revision, regardless of whether it's old or current (so you don't have to move revisions around between tables anymore when a revision becomes non-current, you simply insert a new one and update page_latest). Each revision row has a page ID referring to a page row, but doesn't contain the page name itself (so you don't have to rename revisions...
 * <RoanKattouw> ...anymore, you just rename the page table entry and you're done). The revision table also doesn't contain the revision text, but instead contains a text ID, which points to...
 * <RoanKattouw> the text table, which is really just a mapping of IDs to text blobs. So now when you delete a page, the revision table rows that you copy to the archive table only contain metadata, and the text itself isn't copied
 * <RoanKattouw> At some point (seems to be before 1.5 but I don't remember) a flags field was added so you could indicate that the text blob was gzipped (yay space savings) or that the text blob doesn't contain the text at all but is a pointer to some other place where the text can be obtained (external storage)
 * nod
 * <RoanKattouw> WMF uses MySQL for external storage. Previously this lived on Apache servers because their disk space was unused (cheapness), then later we started using a dedicated cluster of 3 storage servers (MySQL boxes with large disks). Only recently have we started moving the old blobs off the Apaches into another dedicated ES cluster
 * <RoanKattouw> Our ES is optimized for space, not speed, which is why revision text fetches are cached in memcached
 * <RoanKattouw> IIRC what we use is blobs with ~50 (*) (**) revisions where the first revision is stored in full and the others are stored as diffs relative to the previous revision. These blobs are then gzipped. Because the revisions are grouped per page (contiguous revs to the same page, not contiguous revIDs) they tend to be similar, so the diffs are relatively small and gzip works quite nicely. IIRC the...
 * <RoanKattouw> ...compression ratio we achieve is something like 98%
 * <RoanKattouw> (*) The number might be something other than 50, I forget
 * <RoanKattouw> (**) There is also a size cutoff, we fill blobs up to 50 revs or 10 MB, whichever comes first. I am not sure 50 and 10 are the exact numbers
 * <RoanKattouw> sumanah:
 * * sumanah enjoyed watching it
 * <RoanKattouw> From the editors: explain how MediaWiki's support for load balancing works.
 * <RoanKattouw> This might be better explained by Tim, but I'll take a shot
 * <RoanKattouw> Essentially you can specify that there is one master DB server and any number of slave DB servers, and you can assign weights to each server. The LB (which is PHP code in MW that decides which server to connect to) will send all writes to the master, and will balance reads according to the weights. Typically the master has weight 0, but that's not required.
 * <RoanKattouw> The LB also keeps track of the replication lag of each slave. If a slave's replication lag exceeds 30 seconds, it will not receive any read queries to allow it to catch up. If all slaves are lagged >30s, MW will automatically put itself in read-only mode
 * <RoanKattouw> Also, if a request has resulted in a write query, the master position is stored in the user's session. Upon the next request from the same user, the LB will read this information from the session and try to select a slave that has caught up to that replication position. If none is available, it will wait until one is.
 * <RoanKattouw> This is called the chronology protector, and it ensures that replication lag will never cause a user to see a page that claims an action they've just performed hasn't happened yet
 * <RoanKattouw> (It may appear to /other/ users as though the action hasn't happened yet, but at least the chronology is consistent for each user)
 * <RoanKattouw> From the editors: how is your log actually implemented, and what impact does it have on performance?
 * <RoanKattouw> On the cluster we have UDP profiling for a percentage of all requests
 * <RoanKattouw> So MW just fires off UDP packets to a central server that collects them and produces profiling data
 * <RoanKattouw> From the editors: how? examples?
 * <RoanKattouw> I meant things like $wgShowIPInHeader
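
To make Roan's description of the 1.5 schema above easier to follow, here is a minimal Python sketch using an in-memory SQLite database. It is not MediaWiki code: the helper functions are hypothetical, and the tables are stripped down to the few columns needed for the illustration (apart from page_latest, which Roan names, the column names are assumptions). It only aims to show that an edit inserts rows instead of moving them, that a rename touches a single page row, and that a delete copies revision metadata but never the text itself.

```python
import sqlite3


def make_wiki():
    """Create an in-memory database with simplified 1.5-style tables."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE page (page_id INTEGER PRIMARY KEY,
                           page_title TEXT, page_latest INTEGER);
        CREATE TABLE revision (rev_id INTEGER PRIMARY KEY,
                               rev_page INTEGER, rev_text_id INTEGER);
        CREATE TABLE text (old_id INTEGER PRIMARY KEY, old_text TEXT);
        CREATE TABLE archive (ar_rev_id INTEGER, ar_title TEXT,
                              ar_text_id INTEGER);
    """)
    return db


def create_page(db, title):
    """Insert a page row with no revisions yet and return its ID."""
    cur = db.execute(
        "INSERT INTO page (page_title, page_latest) VALUES (?, 0)", (title,))
    return cur.lastrowid


def save_edit(db, page_id, content):
    """Insert one text row and one revision row, then update page_latest."""
    text_id = db.execute(
        "INSERT INTO text (old_text) VALUES (?)", (content,)).lastrowid
    rev_id = db.execute(
        "INSERT INTO revision (rev_page, rev_text_id) VALUES (?, ?)",
        (page_id, text_id)).lastrowid
    db.execute("UPDATE page SET page_latest = ? WHERE page_id = ?",
               (rev_id, page_id))
    return rev_id


def rename_page(db, page_id, new_title):
    """Rename by updating the page row only; return the number of rows changed."""
    cur = db.execute("UPDATE page SET page_title = ? WHERE page_id = ?",
                     (new_title, page_id))
    return cur.rowcount


def delete_page(db, page_id):
    """Copy revision metadata into archive; the text rows are left untouched."""
    db.execute("""
        INSERT INTO archive (ar_rev_id, ar_title, ar_text_id)
        SELECT rev_id, page_title, rev_text_id
        FROM revision JOIN page ON rev_page = page_id
        WHERE page_id = ?""", (page_id,))
    db.execute("DELETE FROM revision WHERE rev_page = ?", (page_id,))
    db.execute("DELETE FROM page WHERE page_id = ?", (page_id,))
```

In this sketch, renaming a page with any number of revisions changes exactly one row, and after delete_page the archive rows still point at the original text rows through ar_text_id. In the 1.4 model with cur and old, both operations would have had to touch every old revision, including its text.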

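Similarly, the load balancing behaviour Roan describes in the review log above (weighted reads, the 30-second lag cutoff, read-only mode, and the chronology protector) can be sketched roughly as follows. This is a toy model in Python, not MediaWiki's actual LoadBalancer: the Server and LoadBalancer classes, their fields, and the integer replication position are all hypothetical, and where the real code would wait for a slave to catch up, this sketch simply returns None.

```python
import random
from dataclasses import dataclass

MAX_LAG = 30  # seconds; the cutoff mentioned in the review log


@dataclass
class Server:
    name: str
    weight: int        # relative share of read traffic (0 = no reads)
    lag: float = 0.0   # replication lag in seconds
    position: int = 0  # replication position this server has reached


class LoadBalancer:
    def __init__(self, master, slaves, rng=None):
        self.master = master
        self.slaves = slaves
        self.rng = rng or random.Random()

    def writer(self):
        """All writes go to the master."""
        return self.master

    def read_only(self):
        """True when every slave is lagged by more than MAX_LAG seconds."""
        return bool(self.slaves) and all(s.lag > MAX_LAG for s in self.slaves)

    def reader(self, min_position=0):
        """Pick a server for reads, weighted, skipping lagged or stale ones.

        min_position plays the role of the master position saved in the
        user's session after a write (the chronology protector). Returns
        None if no server qualifies; a real implementation would wait.
        """
        candidates = [
            s for s in [self.master] + self.slaves
            if s.weight > 0 and s.lag <= MAX_LAG and s.position >= min_position
        ]
        if not candidates:
            return None
        weights = [s.weight for s in candidates]
        return self.rng.choices(candidates, weights=weights)[0]
```

A request that has just written something would store the master's position in the session and pass it as min_position on the next request, so the user is never served by a slave that has not yet replayed their own change.
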
Has this page been useful to you?
If you've learned something new about MediaWiki's architecture while reading this document, please leave a message here. It's really difficult to assess the impact and usefulness of projects like writing this document, so any feedback is appreciated. It'll help determine if similar projects should be attempted in the future.


 * Yes. I especially appreciate:
 * Execution workflow of a web request: I intend on integrating this into the intro to MediaWiki hacking workshop as soon as possible.
 * The mentions of specific historical figures, like Lee Daniel Crocker, whom new developers now wouldn't run into. Yay for recognizing important legacies!
 * Explanations of how the importance of performance affected how we do database stuff.
 * Customizing and extending MediaWiki -- again, this is the high-level overview that I will surely be giving new developers as they start learning MediaWiki.
 * And learning what paucal numbers are! :-) Sumanah 05:13, 31 October 2011 (UTC)

Other comments

 * People connect with images much better than just plain text. Could you include some architectural diagrams maybe? Probably not super detailed for the entire system. You could create a more detailed diagram per section when you talk about specifics in the text.
 * Here's a suggested diagram to add. Sumanah 20:39, 24 October 2011 (UTC) MediaWiki_database_schema_latest.png
 * Might want to update it for 1.18 (not a lot of changes), if we're going to go down that road. However, it's maybe too detailed, and too large to be easily included in a book... Reedy 02:01, 27 October 2011 (UTC)

Stuffs
In the introduction, the usage of the full stop to format numbers looks very wrong to me. I know it's a cultural/location/language thing. Should it be a comma?

Phase I "A few weeks later, Wikipedia enabled the new version of UseModWiki" - Is enabled the correct term? Upgraded to?

Execution workflow of a web request

"and crates a Title object" - Maybe note it's called $wgTitle, as it's somewhat infamous. Similar for "to create an Article object" later on - $wgArticle

Should references be before or after the full stop?

When talking about the language not being specified, is it worth mentioning that it can't be represented as a formal grammar?

Good work! :)

Reedy 02:00, 27 October 2011 (UTC)

Cross-wiki features
It seems a bit weird to me that things like CentralAuth, which are the major architectural headache for most people who tried to get into that area, are not mentioned. vvvt 22:36, 27 October 2011 (UTC)

Ecosystem
Early in this document, a description of the ecosystem or surroundings should be given, so the reader can understand what the task of the software is. This can be given for two cases, a small stand-alone wiki on a single server, and for Wikipedia's real cluster of Squid caches and multiple Apache servers and MySQL backends. The document now mentions that Wikipedia receives 100,000 hits per second, but doesn't mention how many of these are caught by Squid caches, how many reach MediaWiki, how the load is balanced across servers, or what mix of read/write/special requests to expect. Such numbers and a description of the surroundings are part of defining the problem, for which MediaWiki aims to be the solution. --LA2 04:57, 3 November 2011 (UTC)

Namespaces and categories
says: «namespaces and categories are two examples of rarer cases where, conversely, MediaWiki developers introduced unexpected features that have influenced how Wikipedia functions and how users work». Are we sure of this? I mean, categories exist also on UseModWiki, see CategoryCategory; CategoriesAndTopics doesn't mention MediaWiki in the history of the concept. Moreover, namespaces seem to have emerged from the needs of the wikis as well, for instance user pages overlapping with article titles being moved elsewhere, "namespaces" prepended or appended to page titles (e.g. /Talk), MediaWiki: used to store templates. Nemo 21:59, 4 November 2011 (UTC)
 * There's nothing technically interesting about categories as implemented on MeatballWiki. They are just lists of links. Wikipedia had lists of links since the beginning too. Any wiki running on a wiki engine can do that. What MediaWiki's categories did was provide solutions to problems that previous categorization methods had. And I guess there wasn't much appeal in categorization before that happened. I don't know, wasn't following Wikipedia at that time. Reach Out to the Truth 00:23, 5 November 2011 (UTC)
 * It might not be technically interesting but the concept is the same, categorization is a link to the category and the category itself is the automatically generated list of pages which link to the category; the only difference (not small!) is that you have to click. Nemo 07:36, 5 November 2011 (UTC)