Extension:Collection/XML Bridge

Wiki syntax, due to its lack of formalization and “ad hoc” nature, is not well suited for text transformation to other formats. It is desirable to implement support for an intermediate format based on XML, which will make it possible to use standard XML parsing and transformation libraries on the source content. While MediaWiki's native parser exports to XHTML Transitional, the conversion from wiki syntax to XHTML is lossy: information about the templates used, the parameters for extensions and images, and so on, is not preserved. This makes many conversions impossible, because the information they need is no longer present.

It is therefore planned to develop software that converts MediaWiki articles to an XHTML-based representation. XHTML is well suited as a basis for deriving other formats such as PDF or ODF.

As much semantic information from the wiki source text as possible will be preserved by using XHTML features such as namespaces.
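As an illustration of how a namespace can carry wiki-level information alongside ordinary XHTML, here is a minimal sketch using only Python's standard library. The namespace URI, the `template` and `param-*` attribute names, and the `template_span` helper are all hypothetical and invented for this example; they are not the format that xhtmlwriter.py emits.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Hypothetical namespace URI, invented for this sketch only.
MW_NS = "http://example.org/mw-sketch"

# Serialize the hypothetical namespace with the readable prefix "mw".
ET.register_namespace("mw", MW_NS)


def template_span(name, params, rendered_text):
    """Wrap rendered template output in a span that records where it came from."""
    span = ET.Element(f"{{{XHTML_NS}}}span")
    # Namespaced attributes keep the template name and its parameters,
    # which plain XHTML output would otherwise lose.
    span.set(f"{{{MW_NS}}}template", name)
    for key, value in sorted(params.items()):
        span.set(f"{{{MW_NS}}}param-{key}", value)
    span.text = rendered_text
    return span


span = template_span("Convert", {"1": "5", "2": "km"}, "5 km")
xml_text = ET.tostring(span, encoding="unicode")
print(xml_text)
```

A consumer that does not know the extra namespace can ignore those attributes and still treat the element as a normal XHTML span, while a converter that does know it can recover the template call.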

The transformation to an XHTML-based format that preserves semantic information will enable a vast number of uses by programmers, and will also allow a long-term transition to XML as a backend storage format for wiki articles.

Development is assigned to PediaPress and funded by the Commonwealth of Learning.

Current Status
An initial alpha code release is available as part of mwlib, the Python MediaWiki library (see xhtmlwriter.py). Feel free to use and comment on it.

Although this code still lacks some features, it may be a good starting point for developing alternative XML output formats.

There is a Google group for support and discussion of mwlib and derived applications.

See this page for installation instructions.

Implementations
The XML is generated from the parse tree produced by the mwlib MediaWiki-markup parsing library.
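The general idea of producing XML by walking a parse tree can be sketched as follows. The `Node` class, the node kinds, and the kind-to-tag mapping are hypothetical stand-ins invented for this example; they do not reflect mwlib's actual parse-tree classes or the writers' real output.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


@dataclass
class Node:
    """Hypothetical parse-tree node: a kind, optional text, and child nodes."""
    kind: str
    text: str = ""
    children: list = field(default_factory=list)


# Hypothetical mapping from node kinds to XHTML element names.
TAG_FOR_KIND = {"article": "div", "section": "div", "paragraph": "p", "emphasis": "em"}


def write_xml(node, parent=None):
    """Recursively turn a Node tree into an ElementTree element."""
    tag = TAG_FOR_KIND.get(node.kind, "span")
    elem = ET.Element(tag) if parent is None else ET.SubElement(parent, tag)
    # Record the original node kind so it is not lost in the output.
    elem.set("class", node.kind)
    elem.text = node.text or None
    for child in node.children:
        write_xml(child, elem)
    return elem


tree = Node("article", children=[
    Node("paragraph", "Hello ", [Node("emphasis", "world")]),
])
root = write_xml(tree)
print(ET.tostring(root, encoding="unicode"))
```

Each output format would then correspond to a different mapping and a different set of emitted elements over the same tree.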

There are currently three implementations:

MWXHTML
This implementation is based on XHTML 1.0 Transitional, extended by (unapproved) microformats where necessary.

This is needed to support the presentational HTML 4.01 elements that MediaWiki allows in wikitext.

A future implementation could be based on XHTML1.1 strict plus MathML.

You may want to have a look at the proposed XML format at Extension:XML Bridge/MWXHTML or at the examples.

DocBook
This export emits DocBook XML V4.5.

Open Document
OpenDocument is also based on XML. See the OpenDocument export page for more information.

Development & Evaluation
xhtmlwriter.py is part of the mwlib Python library. See this page for installation instructions.

xml-server.py
The sandbox/ directory contains an xml-server.py application that acts as a proxy for a MediaWiki installation, converting wikitext to XHTML as you browse. The proxied MediaWiki must support the new API.

The XML dialect, wiki site, and article are encoded in the URL like this:


 * http://localhost:8000/mwxhtml/mediawiki.org/w/Extension:XML_Bridge
 * http://localhost:8000/mwxhtml/en.wikipedia.org/w/Mediawiki
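The URL layout shown above (dialect, then wiki site, then a "w/" segment, then the article name) can be taken apart with a few lines of Python. This is only a sketch of the scheme as it appears in the two example URLs; the `split_bridge_url` function is hypothetical and is not how xml-server.py is known to parse requests.

```python
from urllib.parse import urlsplit, unquote


def split_bridge_url(url):
    """Split a proxy URL into (dialect, wiki site, article name)."""
    path = urlsplit(url).path.lstrip("/")
    # First two path segments: XML dialect and wiki host.
    dialect, site, rest = path.split("/", 2)
    # Assumption: the remainder is "w/<article>", as in the examples above.
    _prefix, _, article = rest.partition("/")
    return dialect, site, unquote(article)


print(split_bridge_url(
    "http://localhost:8000/mwxhtml/mediawiki.org/w/Extension:XML_Bridge"
))
```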

mw-render
mw-render can be used to generate XML documents from a list of articles.

To export the article "Articlename" from the English Wikipedia, one can simply use these commands:

 mw-render -c :en -o out.xml -w docbook Articlename
 mw-render -c :en -o out.xml -w xhtml Articlename
 mw-render -c :en -o out.xml -w odf Articlename
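The three invocations differ only in the writer name passed to -w, so they are easy to generate from a script. The sketch below builds the argument lists using only the flags shown above (-c, -o, -w) and prints them without running anything; the `render_command` helper and its default values are hypothetical conveniences for this example.

```python
import shlex

# Writer names taken from the commands above.
WRITERS = ("docbook", "xhtml", "odf")


def render_command(article, writer, config=":en", output="out.xml"):
    """Build the mw-render argument list for one article and one writer."""
    return ["mw-render", "-c", config, "-o", output, "-w", writer, article]


for writer in WRITERS:
    # Print a shell-ready line; pass the list to subprocess.run() to execute it.
    print(shlex.join(render_command("Articlename", writer)))
```

Because every command above writes to out.xml, a script that runs all three would probably want to pass a different `output` value per writer.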

Examples
Have a look at Extension:XML_Bridge/Examples to see the output of this extension.

Long Term Goal
The long-term goal is a solid XML export/import that makes it possible to replace MediaWiki markup with an XML representation. This may coincide with WYSIWYG editing in MediaWiki.

Steps toward this goal:
 * initial release of the XML exporter code
 * develop an XML->mw-markup converter so one can convert back and forth
 * discuss and incrementally improve the XML markup
 * discuss whether certain uses of HTML styling and templates can be labeled deprecated
 * check edits and notify users if they use wrong or deprecated markup
 * fix or remove all broken or deprecated markup, HTML styling, and inappropriate template usage
 * switch to XML
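To give a feel for the reverse direction (XML back to wiki markup), here is a minimal sketch that handles only bold, italic, and external links in a small XHTML-like fragment. It is an illustration of the idea under those narrow assumptions, not an existing converter; the `to_wikitext` function and the element set it covers are chosen for this example.

```python
import xml.etree.ElementTree as ET

# Inline elements that map to symmetric wiki markup delimiters.
WRAPPERS = {"b": "'''", "i": "''"}


def to_wikitext(elem):
    """Convert a tiny subset of XHTML-like XML back to wiki markup."""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(to_wikitext(child))
        # Text following a child element is stored in its tail.
        parts.append(child.tail or "")
    inner = "".join(parts)
    if elem.tag in WRAPPERS:
        mark = WRAPPERS[elem.tag]
        return f"{mark}{inner}{mark}"
    if elem.tag == "a":
        # External link syntax: [url label]
        return f"[{elem.get('href', '')} {inner}]"
    # Unknown elements contribute only their content.
    return inner


xml_source = "<p>A <b>bold</b> and <i>italic</i> word.</p>"
print(to_wikitext(ET.fromstring(xml_source)))
```

A real round-trip converter would also have to deal with templates, tables, images, and nesting rules, which is where most of the difficulty lies.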

Templates

 * Currently it seems impossible to correctly mark all uses of templates within the XML output.

Related Projects

 * Extension:Collection - works in conjunction with this XML export
 * Extension:OpenDocument_Export
 * Extension:Wiki2xml (abandoned)
 * DocBook_XML_export (never started)
 * Extension:Open_Office_Export (abandoned)
 * Extension:Data Transfer
 * Connexions XML Language
 * WikiXML: XML format
 * An XML Interchange Format for Wiki Creole 1.0
 * DocBook Wiki
 * about XHTML produced by MediaWiki
 * Wikitext Standard ... describe and formalize a 1.0 version of the Wikitext language, based on what is used currently. (last edit: 29 June 2007)
 * WikiMarkup Standard discusses ways to allow visitors from one wiki engine to edit pages on other wikis without having to learn their WikiSyntax. (last edit: May 10, 2008 active discussion)
 * page collecting info on "wiki conversion" (last modified 15:40, 21 Jun 2006.)
 * Wikipedia DTD (last real edit 9 April 2006)