Extension:Collection/XML Bridge

Introduction
Wiki syntax, due to its lack of formalization and “ad hoc” nature, is not well suited for transformation to other formats. It is desirable to implement support for an intermediate format based on XML, which will make it possible to use standard XML parsing and transformation libraries on the source content. While MediaWiki's native parser exports to XHTML Transitional, the conversion from wiki syntax to XHTML is lossy: information about the templates used, the parameters for extensions and images, and so on, is not preserved. This makes many conversions impossible, because the information they need is no longer present.

It is therefore planned to develop software that converts MediaWiki articles to an XHTML-based representation. XHTML is well suited as a basis for deriving other formats such as PDF or ODF.

As much semantic information from the wiki source text as possible will be preserved by using XHTML features such as namespaces.

The transformation to an XHTML-based format that preserves semantic information will enable a vast number of uses by programmers, and will also allow a long-term transition to XML as a backend storage format for wiki articles.

Development is co-funded by the Commonwealth of Learning.

Long Term Goal
The long-term goal is a solid XML export/import that makes it possible to replace MediaWiki markup with an XML representation. This may coincide with WYSIWYG editing in MediaWiki.

Steps toward this goal:
 * initial release of the XML exporter code
 * develop an XML-to-MediaWiki-markup converter so that content can be converted back and forth
 * discuss and incrementally improve the XML markup
 * discuss whether certain uses of HTML styling and templates can be labeled deprecated
 * check edits and notify users when they use wrong or deprecated markup
 * fix or remove all broken or deprecated markup, HTML styling, and inappropriate template usage
 * switch to XML

Current Status
Testing - no code released yet.

MediaWiki specific XML Language
Don’t Invent XML Languages

DocBook
DocBook is a very large markup language. A more abbreviated version, Simplified DocBook, removes a number of redundant elements. DocBook NG schema (customizable namespaces) is under development.

Conclusion: DocBook is overly complicated, yet it still lacks the features needed for a lossless representation of MediaWiki markup.

XHTML + Additional namespace
XHTML is well supported by many applications and libraries and can be mixed with other namespaces (XHTML remains if the additional namespaces are stripped).

We considered combining XHTML with a proprietary, MediaWiki-specific namespace (xmlns:mwx) to get the best of both worlds: compatibility with existing tools and a lossless representation.

 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> 

For example, a category link could be written as:

Extensions

Other suitable namespaces, such as MathML, can be included.

Unfortunately, this requires patching the XHTML DTD in order to validate properly.
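The following is a minimal sketch of the namespace-mixing idea, not the extension's actual output format. It assumes a made-up namespace URI for the mwx prefix and a hypothetical `mwx:category` element with a `name` attribute; neither is defined by this page. It shows that a standard XML library can read the MediaWiki-specific information, and that plain XHTML remains once the foreign namespace is stripped.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Assumption: placeholder URI; this page only names the prefix "mwx".
MWX_NS = "http://example.org/mwx"

# Serialize XHTML as the default namespace and the extra one as "mwx:".
ET.register_namespace("", XHTML_NS)
ET.register_namespace("mwx", MWX_NS)


def build_paragraph() -> ET.Element:
    """Build an XHTML paragraph carrying one MediaWiki-specific element."""
    p = ET.Element(f"{{{XHTML_NS}}}p")
    p.text = "Article text."
    # Hypothetical element and attribute names for a category link.
    cat = ET.SubElement(p, f"{{{MWX_NS}}}category", {"name": "Physics"})
    cat.tail = " More text."
    return p


def strip_foreign(elem: ET.Element) -> None:
    """Remove all non-XHTML descendants in place, keeping surrounding text."""
    prev = None
    for child in list(elem):
        if child.tag.startswith(f"{{{XHTML_NS}}}"):
            strip_foreign(child)
            prev = child
        else:
            # Re-attach the text that followed the removed element.
            tail = child.tail or ""
            if prev is None:
                elem.text = (elem.text or "") + tail
            else:
                prev.tail = (prev.tail or "") + tail
            elem.remove(child)


if __name__ == "__main__":
    paragraph = build_paragraph()
    print(ET.tostring(paragraph, encoding="unicode"))
    strip_foreign(paragraph)
    print(ET.tostring(paragraph, encoding="unicode"))
```

A consumer that understands the extra namespace can query it directly, while a consumer that only knows XHTML can call `strip_foreign` first and continue with ordinary XHTML tooling.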

XHTML + Microformats
Use Microformats to semantically annotate generated XHTML.

some heading some page within the same wiki
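As a rough illustration of the microformat approach, the sketch below annotates plain XHTML with `class` attributes and reads them back with a standard XML parser. The class names `wiki-heading` and `wiki-internal-link`, the `href` value, and the surrounding markup are assumptions made for this example only; they are not an established microformat and not the extension's output.

```python
import xml.etree.ElementTree as ET

# Assumption: hypothetical class names and markup, for illustration only.
SAMPLE = """<div xmlns="http://www.w3.org/1999/xhtml">
  <h2 class="wiki-heading">some heading</h2>
  <a class="wiki-internal-link" href="/wiki/Some_page">some page within the same wiki</a>
</div>"""


def find_by_class(root: ET.Element, class_name: str) -> list:
    """Return all elements whose space-separated class list contains class_name."""
    return [
        el for el in root.iter()
        if class_name in el.get("class", "").split()
    ]


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    for el in find_by_class(root, "wiki-internal-link"):
        print(el.get("href"), el.text)
```

Because the annotations live in ordinary attributes, the document stays plain XHTML and needs no DTD changes; the trade-off is that the semantics rest on naming conventions rather than on a schema.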

Discussion on microformats in MediaWikis

'' There's a recurring mild flame war on the xml-dev mailing list about when one should use attributes and when one should use elements. There's a slightly hotter one about whether one should ever use attributes at all. The bottom line is that it's really up to you. Do what feels right for your application. Most developers prefer to use attributes for metadata as opposed to the data itself, but this is a very rough rule of thumb at best. Of course, what's data and what's metadata depends heavily on who is reading your documents for what purpose. One way to determine whether information is metadata or not is ask yourself whether a person reading the text would want to see it.''

Templates

 * Currently it seems impossible to correctly mark all uses of templates within the XML output.

Related Projects

 * Extension:Wiki2xml (abandoned)
 * DocBook_XML_export (never started)
 * Extension:Open_Office_Export (abandoned)
 * Extension:Data Transfer
 * Connexions XML Language
 * WikiXML: XML format
 * An XML Interchange Format for Wiki Creole 1.0