Extension:Blahtex/Embedding Blahtex in MediaWiki

Blahtex is a program for converting TeX to MathML, written by Dmharvey. Its main purpose is to be used in MediaWiki to enable it to output mathematical expressions in MathML, in addition to the HTML and PNG output provided by texvc. This page describes the efforts to get this running.

If you want to have a look at the current code and play with it, you have two possibilities:
 * use the BlahtexWiki test site, or
 * install the code yourself, as described below.

How to install a MathML-enabled MediaWiki
Step 1: Get the source. The blahtex CVS repository at BerliOS contains a recent development version of MediaWiki patched to use blahtex (see for more details). You can read m:Help:Working with CVS for further guidance, but use cvs.blahtex.berlios.de instead of cvs.sourceforge.net as the server name, /cvsroot/blahtex instead of /cvsroot/wikipedia as the directory name, and blahtex instead of phase3 as the module name. If you are using the command line, the following two commands should do the trick:
 cvs -d:pserver:anonymous@cvs.blahtex.berlios.de:/cvsroot/blahtex login
 cvs -z3 -d:pserver:anonymous@cvs.blahtex.berlios.de:/cvsroot/blahtex co blahtex
When prompted for a password for anonymous, simply press the Enter key.

Alternatively, you can get the latest MediaWiki source as described in Download from SVN, apply the patch included in bug 3504, and get blahtex from blahtex.org.

Step 2: Install MediaWiki as usual. See Manual:Installation. You also need to compile and enable the texvc subsystem, see math/README for details.

Step 3: Compile Blahtex. Go to the blahtex directory and type make; again, see the README file for details.

Step 4: Enable Blahtex. Add the following line to LocalSettings.php:
 $wgBlahtex = './blahtex/blahtex';

Step 5: Enjoy! Your newly installed wiki is now able to output MathML. By default, it renders <math> ... </math> environments as HTML and PNG, as in a usual MediaWiki installation. You need to change your user preferences to make the wiki render the mathematics as MathML.

Step 6: Provide feedback. Pretty please ...

Result
That is all; you should now be able to play with a MathML-enabled MediaWiki! The BlahtexWiki test site gives you an idea of what to expect. The screenshot on the right shows you how the Archimedes article on BlahtexWiki is displayed in a web browser.

Issues

 * Extension:Blahtex/Bugs/Bugs in BlahtexWiki lists bugs in the Blahtex-MediaWiki interaction. There may also be bugs in blahtex itself and bugs in browser MathML support.
 * Not all pages served by MediaWiki are valid XHTML.
 * No distinction between displayed and inline mathematics is made.

The code
The definitive source of information is of course the source, which can be found on the BerliOS site and as a patch to MediaWiki in bug 3504. The descriptions of the code below were probably correct at some point in time, but may no longer be accurate.

The source repository contains a recent development version of MediaWiki and blahtex 0.4.

The MediaWiki code has to be changed in a number of places: if the user selects MathML in the preferences, we need to ensure that MathML is generated, that the MathML survives unscathed, and (because MathML can only be embedded in an XHTML document) we need to make MediaWiki emit XHTML.

Generating MathML
In the vanilla version of MediaWiki, the contents of the <math> tags are passed to texvc (if $wgUseTex is set), and texvc attempts to convert them to HTML, PNG and MathML. The results are stored in the math table (so that texvc does not have to be invoked again the next time the page is viewed), and the PNG is stored in a file.

In the patched version of MediaWiki, there is a new configuration setting: $wgBlahtex. If this variable is set, it should contain the path to the blahtex executable. In that case, MediaWiki will run both texvc and blahtex. It will store the HTML and PNG generated by texvc, and the MathML generated by blahtex. However, if texvc cannot parse the input (which is quite possible, as blahtex can handle more TeX commands than texvc), then blahtex will also be used to generate the PNG. Again, the results are stored in the math table.
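
The flow described above can be sketched as follows. This is an illustration in Python, not the PHP code from the patch: the function render_math, the two runner callables and the dictionary standing in for the math table are all hypothetical names introduced for this example.

```python
def render_math(tex, run_texvc, run_blahtex, cache):
    """Return a dict with 'html', 'png' and 'mathml' entries (None if unavailable).

    run_texvc(tex) and run_blahtex(tex, want_png) are hypothetical callables
    that return a dict of outputs, or None when the input cannot be parsed.
    cache is a plain dict standing in for the math table.
    """
    if tex in cache:
        # Already rendered: neither program has to be invoked again.
        return cache[tex]

    texvc_out = run_texvc(tex)
    # blahtex is asked for a PNG only when texvc could not handle the input.
    blahtex_out = run_blahtex(tex, want_png=texvc_out is None)

    result = {"html": None, "png": None, "mathml": None}
    if texvc_out is not None:
        result["html"] = texvc_out.get("html")
        result["png"] = texvc_out.get("png")
    if blahtex_out is not None:
        result["mathml"] = blahtex_out.get("mathml")
        if texvc_out is None:
            result["png"] = blahtex_out.get("png")

    cache[tex] = result
    return result
```

The point of the sketch is the asymmetry: texvc remains the source of HTML and PNG whenever it succeeds, and blahtex only takes over PNG generation as a fallback.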

Which format (HTML, PNG or MathML) is actually sent to the user depends on the preferences and on what texvc and blahtex generate. The main difference from the vanilla version is that for some inputs, only MathML will be generated. In that case, the user will get MathML regardless of the user preference. This is not ideal, since many browsers will make a mess of it (typically, non-MathML-aware browsers will ignore all MathML tags), but the alternative is to generate an error for users who do not have MathML in their preferences, while users with MathML can view the output without realizing that there is a problem.
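
A minimal sketch of this selection rule, in Python rather than the PHP of the patch. The function name choose_format is hypothetical, and the fallback order used when the preferred format is missing (PNG, then HTML, then MathML) is an assumption made for the example; the text above only fixes the case where MathML is the sole format available.

```python
# Assumed fallback order when the preferred format was not generated.
FALLBACK_ORDER = ("png", "html", "mathml")

def choose_format(preference, available):
    """Pick the format to send, or None if nothing was generated.

    preference: 'html', 'png' or 'mathml' (the user's setting).
    available:  set of formats that texvc and blahtex actually produced.
    """
    if preference in available:
        return preference
    for fmt in FALLBACK_ORDER:
        if fmt in available:
            # With only MathML available, this returns 'mathml'
            # regardless of the user's preference.
            return fmt
    return None  # the caller should display an error message
```

For example, choose_format("png", {"mathml"}) yields "mathml", which is the case the paragraph above calls not ideal but preferable to an error.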

An error message will only be displayed if neither texvc nor blahtex can generate any output. In that case, the error message of blahtex will be taken, as it is generally more informative. New pages like MediaWiki:Math AmbiguousInfix [default content: Ambiguous placement of "$1" (try using additional braces "{ ... }" to disambiguate)] are created so that admins can customize and translate these messages.
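
To illustrate how a message such as Math AmbiguousInfix is filled in, here is a small Python sketch of "$1"-style placeholder substitution. It is not MediaWiki's message code; format_message is a hypothetical helper, and the only thing taken from the text above is the default message content.

```python
import re

def format_message(template, *args):
    """Replace $1, $2, ... in template with the corresponding argument.

    Placeholders without a matching argument are left untouched.
    """
    def substitute(match):
        index = int(match.group(1)) - 1
        if 0 <= index < len(args):
            return str(args[index])
        return match.group(0)

    # A single pass, so text inserted for $1 is never re-scanned for $2.
    return re.sub(r"\$(\d+)", substitute, template)

DEFAULT_AMBIGUOUS_INFIX = (
    'Ambiguous placement of "$1" '
    '(try using additional braces "{ ... }" to disambiguate)'
)
```

Calling format_message(DEFAULT_AMBIGUOUS_INFIX, r"\over") inserts the offending command in place of $1; an admin who translates the message only has to keep the $1 marker.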

Guiding the MathML through the rest of the code
There are two dangers for the generated MathML before it reaches the output routine (OutputPage.php). Firstly, the Sanitizer (specifically, normalizeCharReferences, called from Parser.php) escapes all character entity references it does not know about (for instance, it replaces &af; by &amp;af;). This can be resolved by using numeric character references. Unfortunately, some MathML entities, like &Ascr;, lie in the Supplementary Multilingual Plane, and browsers (specifically, Gecko-based browsers) have problems with them if numeric character references are used. Therefore, we emit these as character entity references, and we add them to $wgHtmlEntities so that the Sanitizer leaves them alone.
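
The following Python sketch shows the idea on two entities. It is a simplified model, not the Sanitizer or blahtex code: the function names and the set known_entities (standing in for $wgHtmlEntities) are hypothetical, and the table contains only the two entities mentioned above.

```python
import re

# Code points of the two example entities from the MathML entity set.
MATHML_ENTITIES = {"af": 0x2061, "Ascr": 0x1D49C}

def emit_entity(name, known_entities):
    """Emit a numeric reference for BMP characters, a named one otherwise."""
    code_point = MATHML_ENTITIES[name]
    if code_point <= 0xFFFF:
        return f"&#x{code_point:X};"
    # Plane-1 character: keep the name and whitelist it for the sanitizer.
    known_entities.add(name)
    return f"&{name};"

def normalize_char_references(text, known_entities):
    """Escape the ampersand of every named entity that is not whitelisted."""
    def fix(match):
        name = match.group(1)
        if name in known_entities:
            return match.group(0)
        return "&amp;" + name + ";"

    # Numeric references such as &#x2061; do not match and pass through.
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", fix, text)
```

With this arrangement, &af; is emitted as a numeric reference that the sanitizer step ignores, while &Ascr; survives by name because it has been added to the whitelist.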

Secondly, MediaWiki uses HTML Tidy to clean up the HTML, if $wgUseTidy is set. This is needed because the HTML generated by MediaWiki is not always well-formed, especially if HTML is embedded in the wikitext. Unfortunately, HTML Tidy does not understand MathML (this is bug 3504). This is resolved with a rather ugly hack: the MathML is stripped out before the HTML is passed to HTML Tidy, and plugged back in afterwards (see the patch included in bug 3504). It seems that a less ugly solution would require some hefty changes in the parser.
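
The strip-and-restore hack can be sketched in a few lines of Python. This is an illustration of the technique only, not the code from the patch: the function name, the comment-style placeholder and the tidy callable are assumptions made for the example, and a real placeholder would have to be something HTML Tidy is known to leave untouched.

```python
import re

MATH_RE = re.compile(r"<math\b.*?</math>", re.DOTALL)
PLACEHOLDER_RE = re.compile(r"<!--mathml-placeholder-(\d+)-->")

def tidy_preserving_mathml(html, tidy):
    """Run tidy(html) while keeping every <math>...</math> block verbatim.

    tidy is any callable that takes and returns an HTML string.
    """
    saved = []

    def stash(match):
        # Remember the MathML and leave a numbered marker in its place.
        saved.append(match.group(0))
        return f"<!--mathml-placeholder-{len(saved) - 1}-->"

    stripped = MATH_RE.sub(stash, html)
    cleaned = tidy(stripped)
    # Plug the saved MathML back in where the markers ended up.
    return PLACEHOLDER_RE.sub(lambda m: saved[int(m.group(1))], cleaned)
```

The cleanup step never sees the MathML, so it cannot damage it, at the price of trusting that the markers come through unchanged.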

Emitting XHTML
One major problem is that MathML can only be used in XHTML documents, not in HTML documents; which of the two a page counts as is determined by the media type specified in the HTTP Content-Type header. MediaWiki serves its pages as HTML, even though its output is usually XHTML, or at least close to it.

So, if the user selects MathML in the preferences, then an XML declaration is added to the page and the Document Type Declaration is changed from "XHTML 1.0 Transitional" to "XHTML 1.1 plus MathML 2.0". The "Content-Type" header is set as follows:
 * If the browser supports XHTML (as determined by the HTTP Accept header sent by the browser), then the media type is set to "application/xhtml+xml".
 * If the browser does not support XHTML, then it makes no sense to send it XHTML. Instead, we set the media type to "text/html", so that the browser can at least display part of the document. Both Internet Explorer (without MathPlayer) and Safari do a decent job in this case, so that the user can access the preferences page to switch MathML off without problems.
 * As a special case, if the MathPlayer plugin is detected, then the media type is set to "application/xhtml+xml" but without specifying the character encoding, because MathPlayer doesn't work if the character encoding is specified.

The parser cache only stores the (X)HTML, but not the HTTP headers, so this does not interfere with the cache. It does not interfere with the Squid cache either, because the Squids are only employed for users without accounts and they do not have MathML set in the default preferences.
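
The three cases for the Content-Type header can be summarised in a short Python sketch. It is not the code in OutputPage.php: the function name and parameters are hypothetical, how MathPlayer is detected is left to the caller, the utf-8 default is an assumption, and q-values in the Accept header are ignored for simplicity.

```python
XHTML_TYPE = "application/xhtml+xml"

def choose_content_type(accept_header, has_mathplayer=False, charset="utf-8"):
    """Return the Content-Type value for a user who has selected MathML."""
    if has_mathplayer:
        # Special case: no charset parameter when MathPlayer is present.
        return XHTML_TYPE

    # Keep only the media types, dropping parameters such as ";q=0.9".
    accepted = [
        part.split(";")[0].strip().lower()
        for part in accept_header.split(",")
    ]
    if XHTML_TYPE in accepted:
        return f"{XHTML_TYPE}; charset={charset}"
    # The browser did not announce XHTML support: fall back to HTML.
    return f"text/html; charset={charset}"
```

A browser that sends "text/html,application/xhtml+xml" in its Accept header gets the XHTML media type, while one that sends only "text/html, */*" gets "text/html" and can still reach the preferences page.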

List of changes by file
The MediaWiki code is changed in the following places:
 * includes/DefaultSettings.php : add $wgBlahtex variable
 * includes/ImageGallery.php : XHTML fix from bug 5005
 * includes/Math.php : many changes, as described in
 * includes/OutputPage.php : adjust Content-Type header
 * includes/Parser.php : tidy hack from bug 3504
 * includes/Sanitizer.php : add plane-1 entities from MathML
 * includes/SkinTemplate.php : set xmlheaders, doctype, dtd and mimetype depending on user prefs
 * languages/Language.php : blahtex error messages
 * skins/MonoBook.php : include xmlheaders, doctype and dtd