How to become a MediaWiki hacker/2011 Workshop

A workshop to teach developers how to hack MediaWiki. Syllabus is in progress.

Allot three to four hours, including a 20-min biobreak in the middle. Works all right with 3-4 teachers and about 20 learners.

Originally taught to about 24 participants on Aug 2, 2011, Hecht House, Haifa, Israel (part of Developers' Days). Sumana is using these notes to create a useful syllabus for future teachers.

Overview
In this workshop, we will go over the basic coding toolchain and our code intake/review/merge/deploy/release workflow. We will cover the topics in "HOWTO Become A MediaWiki Hacker".

We will explain how one might change MediaWiki's behavior for various purposes. We want to start with easy things, and work our way up to more time-consuming/pioneering work.
 * User preferences
 * Config options
 * Skins
 * Extensions (refer to the example extensions, which are up-to-date)
 * Gadgets (JavaScript-based site customizations, requiring the Gadgets extension to be installed. Especially good for eye candy.)
 * Special pages
 * Parser hooks
 * Hooks in general
 * Parser functions & parser tags
 * Modifying MediaWiki core

Then we ask each person to choose a small task and spend some time, maybe 20-45 minutes, working on it. Experienced people will circulate, teams pair up, etc.

What to have prepared ahead of time?

 * Have the LAMP stack installed: Linux, Apache, MySQL (or SQLite) and PHP
 * If you are using Windows, you may want to install an Ubuntu Linux virtual machine by downloading the latest Ubuntu ISO file and VirtualBox
 * Have a text editor such as vim, TextEdit, or Notepad++. Windows's default Notepad won't work and will cause Byte Order Mark errors.
 * Install an IRC client such as xchat or ChatZilla
 * Have an account on http://bugzilla.wikimedia.org
 * Install a Subversion (SVN) client

If you have time, install MediaWiki -- download the 1.18.1 tarball and use the installer
 * if adventurous, try downloading and installing from trunk, but don't worry if you run into trouble

Also post links to these resources on IRC or the wikitech-l mailing list, so that people can find them easily if they have trouble.

Poll at start of workshop

 * Ask learners: what's your reason/interest/focus?
 * Then point out areas of interest in MediaWiki or Wikimedia. Example: tell Java people about the Lucene search plugin. When we talk about where the source code lives, tell RSS/API people where that code lives.

Syllabus

Intro resources
It's a good idea for beginners to register for an account on Bugzilla: https://bugzilla.wikimedia.org

Help & discussion about MediaWiki development: #mediawiki on freenode -- MediaWiki on IRC

How to become a MediaWiki hacker is a good resource.

Show an example of a bug -- 28296 -- and go through the bug comments to show the narrative of how a bug gets reported and fixed.

Show the room our pools of easy-to-fix bugs:
 * https://bugzilla.wikimedia.org/buglist.cgi?keywords=easy
 * Annoying Little Bug

The process for new developers to contribute code: provide a patch by attaching it to a Bugzilla issue; an experienced developer reviews the patch and commits it, and the patch is then deployed and included in a release.

Patch format: unified diff.


 * Useful training in patches & source control: OpenHatch training missions in diff and patch and Subversion.
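To show the room what a unified diff actually looks like, here is a small Python sketch using the standard library's difflib (this just demonstrates the format; in practice you make patches with `diff -u` or `svn diff`, and the file name is a hypothetical example):

```python
import difflib

# Two versions of a hypothetical config file, before and after a change.
old = ["$wgSitename = 'MyWiki';\n", "$wgDefaultSkin = 'monobook';\n"]
new = ["$wgSitename = 'MyWiki';\n", "$wgDefaultSkin = 'vector';\n"]

# unified_diff emits the same "---/+++/@@" format that `diff -u` and
# `svn diff` produce, which is what you attach to a Bugzilla issue.
patch = "".join(difflib.unified_diff(old, new,
                                     fromfile="LocalSettings.php (old)",
                                     tofile="LocalSettings.php (new)"))
print(patch)
```

Lines starting with "-" are removed, lines starting with "+" are added, and the "@@" hunk header gives the line numbers in each version.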

How your code gets reviewed, then deployed on Wikimedia servers
You download the source code from our code repository and make a change, then make a diff and attach it to an issue in Bugzilla. Experienced developers then review your code and either tell you how to improve it, or accept it into our code repository. You can see recent updates to our code repository at Special:Code/MediaWiki. Then there is an additional, more thorough code review step: someone has to mark your revision as "ok" or tell you how to fix it. Every few months, we take all the new code that has been marked "ok" and deploy it onto the Wikimedia servers. Then we fix any bugs that arise and release a new version of MediaWiki.

URLs that every developer should remember: http://mediawiki.org/wiki/Special:Code/MediaWiki/status/fixme and the like.

FIXME means: the commit can't be deployed to production sites because there's something wrong with it. But we really want people to test their work before they commit it, in order to keep trunk in a usable state. This helps with the code review process and ensures changes are deployed and released more regularly. Some areas of the codebase are covered by tests. See also Continuous integration.

You usually request commit access when you submit enough (good) patches (through Bugzilla) that it's more convenient for everyone to give you access. This might take a few weeks of dedicated effort, or months or years if you don't have time to submit patches very often.

Q: Is there a place in SVN where you can download a good working copy (without fixmes etc.)? A: No. But we do have SVN branches for stable releases (e.g. /mediawiki/branches/REL1_18), and we have the tarball releases as downloads. Our goal is to get trunk into an always-deployable state by getting closer and closer to continuous integration: getting tests to run automatically, and rejecting changes that break tests.
 * Recommendation: Checkout the latest branch that is being prepared for a release (right now this is branch 1_18).
 * Download from SVN
 * Tip: If you submit patches, they should be made based on trunk to avoid merge conflicts.
 * Do not submit patches based on a branch or a stable tarball download!
 * Also note that serious FIXMEs will not stay in trunk; they will be reverted. Trunk must always be stable enough to run a small wiki on (http://translatewiki.net also runs on the live trunk!)


 * Time for a 20-30 min break.

Preferences
The easiest way to play around with MediaWiki is to change user preferences. And if you're running your own MediaWiki site, you can change/customize default values, like the default skin, etc. Examples: enhanced recent changes, gadgets! http://mediawiki.org/wiki/Extension:Gadgets

Gadgets are a nice way for new developers (who know JavaScript) to start customizing their wiki experience, and eventually get involved in MediaWiki development. There's a detailed intro at Gadget kitchen and a detailed training document and video at Gadget kitchen/Training.

Q: How is the JavaScript from gadgets executed? A: The Gadgets extension is loaded on every page and checks whether the user has enabled the gadget in their preferences. If so, it loads the JavaScript via a <script> element. Gadgets are only editable/publishable by wiki administrators and are stored as wiki pages at MediaWiki:Gadget-{gadgetname}.js. Inside the script, a gadget maker can do things like "if ( wgPageName == 'Amsterdam' )" so that the script only runs on that page.

Starting with MediaWiki 1.17, you can use jQuery in Gadgets.
 * To "create" a new gadget the wiki administrator adds a list item on MediaWiki:Gadgets-definition.
 * Non-resourceloader gadgets all run in the global scope. This means that you should wrap your script in a closure to avoid naming conflicts and scope leakage.  This will be fixed in MediaWiki 1.19 (in 1.19+ gadgets are always loaded through ResourceLoader, which means each module executes in a local (new) scope by default).

Extensions
Manual:Extensions

Manual:Developing extensions. There's a separate extensions repository in SVN trunk.

Go to http://mediawiki.org/wiki/Manual and read it, and fix it if it's out of date.

In the SVN tree, there is an extensions directory. This is where extensions live. If you write one and commit it to the extensions directory, it will be translated by the translatewiki.net translators. Most of these extensions aren't suitable for the Wikimedia sites, either because they're specialized or unneeded (e.g. proofreading tools are only useful on Wikisource), or because they're insecure for sites like Wikipedia (for example, a "Who is online" extension is more of a social extension, not something for an encyclopedia). On any MediaWiki wiki, you can see the list of installed extensions by looking at that wiki's Special:Version page.

An extension should be considered fairly safe & stable if it's used on a Wikimedia production wiki (indicated by a box at the bottom of the extension page).

example: in SVN /trunk/extensions/examples/ http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/examples

http://mediawiki.org/wiki/Manual:Parser_functions

Special pages
Manual:Special pages

Magic words
Manual:Magic words

Magic words let you add your own constructs to extend wikitext. Extensions can introduce both magic words and tags.

There are three kinds of magic words: parser functions, variables, and behaviour switches.

 * Parser functions

Core MediaWiki includes a set of parser functions: Manual:Parser functions. Extensions can introduce their own parser functions; the most popular is Extension:ParserFunctions.

 * Variables

 * Behaviour switches (for example, __TOC__ places the table of contents at that position; extensions can add their own behaviour switches, like __NOINDEX__)
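As a conceptual illustration of parser functions (not MediaWiki's actual PHP API, where extensions register callbacks with the parser), here is a Python sketch of a registry that expands {{#name:arg}} calls; the function names are hypothetical:

```python
import re

# Hypothetical registry mapping a parser-function name to a callable.
PARSER_FUNCTIONS = {
    "lc": lambda arg: arg.lower(),
    "uc": lambda arg: arg.upper(),
}

def expand(wikitext):
    """Replace {{#name:arg}} calls with the registered function's output."""
    def repl(match):
        name, arg = match.group(1), match.group(2)
        func = PARSER_FUNCTIONS.get(name)
        return func(arg) if func else match.group(0)  # unknown: leave as-is
    return re.sub(r"\{\{#(\w+):([^}]*)\}\}", repl, wikitext)

print(expand("Visit {{#uc:haifa}} today"))  # -> Visit HAIFA today
```

The key idea is the same as in MediaWiki: the parser sees an unknown {{#...}} construct, looks the name up in a registry, and substitutes the callback's return value into the page.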

Tags
Manual:Tag extensions

New tags can be introduced by an extension (e.g. a tag created by an extension to show the recent posts of a blog inside a wiki page)

Tags cannot be nested, because the contents are handled by the tag function and not by the wikitext parser.
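A toy Python sketch (not MediaWiki's real PHP tag-hook API) can show why: the raw content between the tags is handed to the hook function verbatim, so the parser never gets a chance to process anything inside, nested tags included. The <blog> tag here is hypothetical:

```python
import re

def blog_tag_hook(content):
    # A hypothetical tag hook: it receives the RAW text between
    # <blog>...</blog> and returns HTML. The wikitext parser never
    # sees this content, which is why tags cannot be nested.
    return "<div class='blog'>" + content + "</div>"

TAG_HOOKS = {"blog": blog_tag_hook}

def parse(wikitext):
    def repl(m):
        name, inner = m.group(1), m.group(2)
        hook = TAG_HOOKS.get(name)
        return hook(inner) if hook else m.group(0)
    return re.sub(r"<(\w+)>(.*?)</\1>", repl, wikitext, flags=re.S)

print(parse("<blog>Latest posts here</blog>"))
```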

A special page is just a PHP file; you can output whatever HTML you like to do what you want with it. There is example code in the "examples" SVN subdirectory.

Hooks
Manual:Hooks

A hook calls a function whenever a specific event happens; it can do things like notifying the IRC channel when a specific page is changed.
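Conceptually it works like the Python sketch below (in MediaWiki itself, an extension appends a PHP callback to $wgHooks['SomeEvent'] and core fires the event at the right moment); the event name and callback here are hypothetical:

```python
# A minimal event/hook registry, mimicking how MediaWiki hooks behave.
hooks = {}

def register(event, callback):
    """Register a callback for an event (like appending to $wgHooks)."""
    hooks.setdefault(event, []).append(callback)

def run_hooks(event, *args):
    """Call every callback registered for this event, in order."""
    for callback in hooks.get(event, []):
        callback(*args)

notifications = []
# e.g. an extension that announces page edits (to IRC, in real life):
register("PageSaved", lambda title: notifications.append("saved: " + title))

run_hooks("PageSaved", "Haifa")
print(notifications)  # -> ['saved: Haifa']
```

Core code only knows it is firing an event; it has no idea which extensions are listening, which is what keeps extensions decoupled from core.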

MediaWiki core overview
Everything comes in through index.php, which dispatches to the MediaWiki class; that determines your action parameter, and the logic is handled in the Article class, which dispatches the various aspects.

The SpecialPage class covers all special pages: preferences, contributions, version, etc. It is self-contained, and an easy place to jump into the code.

There is a nice diagram of the structure of the DB schema.

Then: workshop! Bug triage or testing for people who don't want to contribute just yet, and Annoying Little Bug work for people who want to dive in and code. This would last for the rest of the 90 minutes, and if one of the developers leading it felt the need to burst into a few minutes of lecturing (because a few people were having the same problem), that would be fine.

During the coding/workshop, if anyone is ready to actually use trunk (to install or to suggest a patch), be ready to explain the directory structure, e.g. what's phase3? "phase3" is Magnus Manske's fault: it actually is the "third rewrite", and is historically called so.

MediaWiki history

Important Subversion directories:
 * http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3 "MediaWiki core"
 * http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions "MediaWiki extensions"
 * http://svn.wikimedia.org/viewvc/mediawiki/trunk/tools "Tools not written in or for MediaWiki but are somehow related (like a WordPress plugin for MediaWiki, or a dump script, etc.)"

File structure overview of MediaWiki core (http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/):

 * index.php: a small file that does some initialization and then hands off to WebStart. All article views, edits, actions, and special pages go through this file.
 * api.php
 * /skins/
 * /includes/: the PHP includes directory. Contains default settings, global functions, and all classes, such as the special pages, wiki page actions (read, edit, etc.), ResourceLoader, the API modules, and lots more.

Example case: a reading list for a user to track what they've read or not in a given category and to make an educational game which asks questions about the pages that have been read.
 * needs a hook to do something in the DB when a user reads something
 * needs a special page to show the user what they've read

You could do this as a gadget, but you wouldn't have access to the database, so it would be simpler to do it as a PHP extension.

Configuration
There are also things you can do to configure MediaWiki, as the site's administrator, to customize how it runs. See Manual:Configuration settings

Bots
Bots are not in the MediaWiki source tree: they're programs that have (for example) Wikipedia accounts and use the API to (for example) edit the text of pages or watch the recent changes. They are often written in Python (using the pywikibot framework), or in any programming language that can make HTTP requests (e.g. PHP with cURL against the API).

See Pywikipediabot/Basic use and http://mediawiki.org/wiki/API
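To make the idea concrete, here is a Python sketch that builds (but does not send) an API request for the latest wikitext of a page, using the standard action=query parameters; a real bot would fetch this URL and parse the JSON reply:

```python
from urllib.parse import urlencode

def api_url(page_title):
    """Build a MediaWiki API URL that asks for a page's current wikitext."""
    params = {
        "action": "query",       # the general read module
        "titles": page_title,    # which page(s) we want
        "prop": "revisions",     # we want revision data...
        "rvprop": "content",     # ...specifically the text
        "format": "json",
    }
    return "http://en.wikipedia.org/w/api.php?" + urlencode(params)

print(api_url("Haifa"))
```

Any language that can do an HTTP GET and parse JSON can drive a bot this way; pywikibot just wraps these calls in a convenient framework.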

For more information about the API or about bots on Wikipedia you can ask Roan or Krinkle.

Vandalism
In MediaWiki, everybody with the 'edit' right can make an edit. Users can use the 'undo' button on the history page to undo any previous edit.

There is little to no vandalism detection in core, but there are many good extensions to help with this. For example, Wikipedia uses CheckUser, Nuke, AbuseFilter, AntiBot, AntiSpoof, ConfirmEdit, GlobalBlocking, SimpleAntiSpam, SpamBlacklist, TitleBlacklist, TorBlock, etc.

Side track:
 * rollback state in database for edits (From Daniel)
 * Perhaps a new FlaggedRevisions state ? (ie. a "added vandalism" and "repairing" status)


Plugins to other systems
If you use the MediaWiki API you can create plugins that integrate Wikipedia or other MediaWiki installations into other apps or tools.

Example WordPress plugin: PhotoCommons
 * source code
 * Workflow: a dialog to enter search keywords, find images through the Wikimedia Commons API, click a thumbnail to insert it into a WordPress post/page.
 * Authors: Krinkle and Husky

Drupal plugin to parse wikitext and render through MediaWiki API
 * source code
 * Authors: Holger Motzkau (User:Prolineserver), Manuel Schneider (User:80686)

Start hacking!

 * Annoying Little Bug
 * https://bugzilla.wikimedia.org/buglist.cgi?query_format=advanced&keywords=easy&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&product=MediaWiki&list_id=20029
 * Installing MediaWiki
 * Developer hub
 * Manual:Code


 * Now the teacher has a choice -- code tour or individual hacking.

Code Tour
Now the teacher goes over the general architecture of MediaWiki on a code level, covering important tables and classes.

There is a lot of stuff in this diagram of MediaWiki's database tables! But you don't need to know most of it. We'll go over the important tables.

Tables

 * 'page' table

The page table is the centerpiece of lots of interactions.
 * page_id
   * Every page has a numerical ID, sometimes exposed in the interface.
   * Autoincrement: every page gets one as it is created.
 * page_namespace
   * Counterintuitively, the namespace of a page is a number, not a text field. (Surprising; you'd expect the text to be somewhere.)
   * The text is a configuration matter: pages in the user namespace have a "2" in that field (includes/Defines.php).
   * A separate file says that 2 means "User". This is for localisation, and it lets you rename namespaces.
   * If you add a new namespace, you have to give it a number; it should be over 15 for core.
   * For extensions adding namespaces: use something higher than 100; 0-99 are "reserved" as core namespaces.
   * Different extensions reserve different namespace ranges. Semantic MediaWiki, for example, uses 100-199.
   * Negative namespaces are special; don't try to use them.
   * Wikimedia-specific namespaces are not in the main code; see $wgExtraNamespaces. You can create your own.
   * Many languages have "Portal", but not all, and it's not in core.
   * Example: the Lithuanian Wikipedia puts articles that are lists into a different namespace.
   * How high can you go? The field is 32 bits, so roughly 2 or 4 billion namespaces are possible for a single wiki.
   * Every namespace should have an associated talk namespace. Subject namespaces are even numbers and talk namespaces are odd. If you break this convention, MediaWiki breaks!
 * page_title
   * The title itself, stored without a namespace prefix.
   * Example: User:Sumanah is stored as namespace=2, title=Sumanah.
 * page_restrictions
 * page_counter
 * page_is_redirect
   * Is this page a redirect?
 * page_is_new
   * Is it new?
 * page_random
 * page_touched
 * page_latest
   * Relates to the revision table; specifically, to a row in that table.
 * page_len
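To tie the page_namespace and page_title fields together, here is a small Python sketch of how a prefixed title maps to the stored pair (the namespace map is a tiny hypothetical subset of includes/Defines.php, and the real Title class handles many more cases):

```python
# A small subset of the namespace numbers defined in includes/Defines.php.
NAMESPACES = {"": 0, "Talk": 1, "User": 2, "User talk": 3, "Project": 4}

def split_title(prefixed):
    """Map a prefixed title to the (page_namespace, page_title) pair
    that the page table actually stores."""
    if ":" in prefixed:
        prefix, rest = prefixed.split(":", 1)
        if prefix in NAMESPACES:
            return NAMESPACES[prefix], rest
    return 0, prefixed  # no known prefix: main namespace

def talk_namespace(ns):
    # Subject namespaces are even and talk namespaces are odd,
    # so the talk partner of an even namespace is ns + 1.
    return ns if ns % 2 == 1 else ns + 1

print(split_title("User:Sumanah"))  # -> (2, 'Sumanah')
print(talk_namespace(2))            # -> 3
```

Storing the namespace as a number is what makes renaming and localising namespaces cheap: only the mapping changes, not every row in the page table.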


 * 'revision' table

The revision table records metadata for every revision, including the creation of a page.

 * rev_id
   * A unique ID for a revision; sometimes visible in the interface as "oldid".
 * rev_page
 * rev_text_id
   * Points to the text table, which holds the text of the edit: an ID, a blob of text, and some flags. In theory, just text; in practice it may be gzipped (then the gzip flag is set), stored externally, etc.
 * rev_comment
   * The edit summary.
 * rev_user
   * The user who made the edit.
 * rev_user_text
 * rev_timestamp
   * The timestamp.
 * rev_minor_edit
 * rev_deleted
 * rev_len
 * rev_parent_id


 * 'user' table


 * user_id
   * Autoincrement ID.
 * user_name
   * The user's name.

The rest of the stuff in a user record is all sorts of random metadata about a user.


 * 'recentchanges' table

This is a separate summary table, kept for performance reasons (the data could be inferred from other tables).

Classes
The best reference resource is http://svn.wikimedia.org/doc/
 * Doxygen documentation, similar to PHPdoc, but can be used in many languages.
 * Autogenerated documentation, comments start with /** instead of just /*


 * About a wiki page (Title class, WikiPage class)

Every page is uniquely identified in two ways:
 * its "page id", stored in page.page_id

or:
 * its "namespace & title" pair, which together are always unique. There can be several pages with the same title (Project:Foo, User:Foo, Category:Foo), but only one with a given namespace + title pair.
 * (A page like "Brion Vibber" could be a user page, article, category, or anything; the page title itself isn't enough to be unique within a wiki. You need the namespace to know exactly what the page is.)


 * Title class

Inside MediaWiki, pages are identified through instances of the "Title" class, not by passing around IDs or namespace/title combos. So if an extension hooks into core functionality, it will get the instance of the Title class for the current page (stored in the global $wgTitle) and can use methods like $myTitle->isRedirect() to see if the page is a redirect.

To create an instance of Title use any of the constructor methods such as Title::newFromId or Title::newFromText etc.

The Title class offers several static utility functions. The most used is "makeTitleSafe". Always use the safe methods when your input comes from users, or when you aren't sure it is sanitized. It may be OK to use the non-safe methods when your input comes directly from the database, though.

A Title constructor may return null; this indicates that the title is invalid. Always check for null!
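The pattern behind the safe constructors can be sketched in Python as follows (the validity rules here are hypothetical placeholders; Title::makeTitleSafe's real checks are far more involved):

```python
# Characters that a hypothetical title validator rejects outright.
BAD_CHARS = set("<>[]|{}#")

def make_title_safe(text):
    """Return a normalized title, or None if the text can't be a title."""
    text = text.strip()
    if not text or any(ch in BAD_CHARS for ch in text):
        return None
    # By default MediaWiki stores titles first-letter-capitalized.
    return text[0].upper() + text[1:]

title = make_title_safe("[[bad|input]]")
if title is None:  # always check before using the result!
    print("invalid title")
```

The caller's null check is the important part: the constructor validates, and every call site must be prepared for a None/null result.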

Q: We see getters, but no setters. How do we set values? A: You can't; Title objects would otherwise be mutable and unreliable. Besides, they represent a title, not a modifiable page.

On that, to actually touch the database / modify an actual wiki page, we use the WikiPage class.


 * WikiPage class

On a high level:

 $myTitle = Title::newFrom***( ... );
 $myArticle = new WikiPage( $myTitle );
 $myArticle->doEdit( ... );

Wrap-up

 * Q: There is an extension that calculates statistics -- how does it work?
 * A: The view counters are actually not an extension but part of MediaWiki core (they use the page counter field in the page table, page.page_counter).
 * Since large sites may not want to increment a database table field on every page view, it is possible to disable this feature (see $wgDisableCounters); Wikipedia has disabled it for performance.
 * There is also a hit counter table: whenever someone hits a page, a row is added there as a temporary buffer, which is periodically folded into the page counter.
 * Additional note from Sumana: hmm, core developers do not understand request-context??

What participants want/like
They find it informative, even if they already know some of the material. They find the database overview and code architecture behind Wikipedia interesting:


 * what makes it run
 * process of being able to interact with the code
 * bugtracking system
 * how to check out code
 * edit it
 * check it back in
 * what happens after that