DocBook XML export

This is a proposal page opened some years ago. See Extension:XML_Bridge for a current implementation.

MediaWiki 1.1.0 has a feature for exporting articles to a simplified XML format which wraps the raw wikitext with revision metadata (author, timestamp, comment, title). This proposal is for supplementing this feature with an export to DocBook XML.

Rationale
DocBook XML is a standard for marking up books and articles in XML. The original standard began in the early 1990s, as an interchange format for printers and desktop-publishing software, and was tuned for creating software documentation. It has become a de facto standard for marking up formatted text documents of all kinds.

Using a standard XML markup for MediaWiki means that we can leverage other work in the field of document processing. A number of existing tools, both open source and proprietary, convert DocBook XML to other formats, such as Rich Text Format, PostScript, and PDF. Some word processors, such as OpenOffice, now support DocBook as an input and output format.

DocBook is a very large markup language: its Document Type Definition (DTD) defines more than 400 elements. A smaller subset, Simplified DocBook, removes a number of redundant elements and would probably be sufficient for MediaWiki articles.

Comment: I really like this idea, but perhaps the target ought to be the new DocBook NG schema under development. One of its benefits is easy customizability. It is also namespaced and does away with doctype declarations.

Paragraph blocks

 * paragraph → <para>
 * italics → <emphasis>
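
As an illustration of the two mappings above, here is a minimal Python sketch. It assumes paragraphs are separated by blank lines and italics are written with doubled apostrophes. The function name wikitext_to_paras and the ITALIC pattern are hypothetical and not part of MediaWiki; a real exporter would hook into the wikitext parser rather than use regular expressions.

```python
import re
from xml.sax.saxutils import escape

# Simplified, hypothetical pattern: matches ''italic'' but not '''bold'''.
ITALIC = re.compile(r"(?<!')''(?!')(.+?)(?<!')''(?!')")


def wikitext_to_paras(wikitext: str) -> str:
    """Turn blank-line-separated paragraphs into DocBook para elements."""
    paras = []
    for block in re.split(r"\n\s*\n", wikitext.strip()):
        if not block.strip():
            continue  # skip empty input
        # Collapse internal line breaks, then escape &, < and > for XML.
        text = escape(" ".join(block.split()))
        # Wrap ''...'' runs in emphasis elements.
        text = ITALIC.sub(r"<emphasis>\1</emphasis>", text)
        paras.append(f"<para>{text}</para>")
    return "\n".join(paras)
```

Bold text, links, and templates are left untouched by this sketch; each would need its own mapping.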

Section headings

 * =heading 1= → <sect1>
 * ==heading 2== → <sect2>
 * ===heading 3=== → <sect3>
 * ====heading 4==== → <sect4>
 * =====heading 5===== → <sect5>

An alternative is to use the recursive <section> element, which DocBook currently prefers over the numbered sect1 to sect5 elements. However, this might be difficult to accomplish with the current wikitext parsing structure.
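
To give a feel for what the recursive approach involves, the following Python sketch nests section elements using a stack of open heading levels. It works line by line on raw wikitext, which is an assumption made for illustration and not how the MediaWiki parser operates. The names HEADING and wikitext_to_sections are hypothetical.

```python
import re
from xml.sax.saxutils import escape

# A heading line: the same run of "=" on both sides of the title.
HEADING = re.compile(r"^(=+)\s*(.+?)\s*\1\s*$")


def wikitext_to_sections(wikitext: str) -> str:
    """Nest DocBook section elements according to wikitext heading depth."""
    out = []
    open_levels = []  # heading levels whose section is still open
    for line in wikitext.splitlines():
        match = HEADING.match(line)
        if match:
            level = len(match.group(1))
            # Close every open section at the same or a deeper level.
            while open_levels and open_levels[-1] >= level:
                open_levels.pop()
                out.append("</section>")
            open_levels.append(level)
            out.append(f"<section><title>{escape(match.group(2))}</title>")
        elif line.strip():
            out.append(f"<para>{escape(line.strip())}</para>")
    # Close whatever is still open at the end of the article.
    out.extend("</section>" for _ in open_levels)
    return "\n".join(out)
```

The same stack could emit the numbered elements instead by using its depth to choose the element name. Note that a heading with no following text yields a section containing only a title, which a validating DocBook parser may reject.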

Lists

 * *list → <itemizedlist>
 * #list → <orderedlist>
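
The list mapping can be sketched in Python as follows. This is a simplified illustration that handles single-level lists only. It wraps each item in listitem and para elements, as DocBook list items require block content. The names LIST_TAGS and wikitext_lists_to_docbook are hypothetical.

```python
from xml.sax.saxutils import escape

# Wikitext list marker -> DocBook list element.
LIST_TAGS = {"*": "itemizedlist", "#": "orderedlist"}


def wikitext_lists_to_docbook(wikitext: str) -> str:
    """Convert single-level * and # lists to DocBook list markup."""
    out = []
    current = None  # marker of the list that is currently open, if any
    for line in wikitext.splitlines():
        marker = line[:1]
        if marker in LIST_TAGS:
            if marker != current:
                # Switching list type: close the old list, open the new one.
                if current:
                    out.append(f"</{LIST_TAGS[current]}>")
                out.append(f"<{LIST_TAGS[marker]}>")
                current = marker
            item = escape(line[1:].strip())
            out.append(f"<listitem><para>{item}</para></listitem>")
        else:
            # Any non-list line ends the open list.
            if current:
                out.append(f"</{LIST_TAGS[current]}>")
                current = None
            if line.strip():
                out.append(f"<para>{escape(line.strip())}</para>")
    if current:
        out.append(f"</{LIST_TAGS[current]}>")
    return "\n".join(out)
```

Nested lists (for example lines starting with ** or *#) would need a stack of open lists, much like nested section headings.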

User interface
The user interface for exporting to DocBook would be similar to the "Printable version" link currently found on MediaWiki pages. "Save as XML" or "Save as DocBook" would be a separate link that exports the current article to DocBook XML.

It may also be useful to offer DocBook dumps instead of the SQL dumps currently used for Wikipedia. Such dumps would include author credits and other information necessary to conform with copyleft licenses.
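
One possible way to carry author credits is sketched below in Python, using the standard xml.etree.ElementTree module. It is an assumption rather than part of the proposal: each revision's author, timestamp, and comment is recorded in a DocBook 4.x revhistory block inside articleinfo (DocBook 5 names that wrapper info instead). The function name build_article and the tuple layout are hypothetical, and putting a user name in authorinitials is a simplification.

```python
import xml.etree.ElementTree as ET


def build_article(title, revisions, body_paras):
    """Build a DocBook article with one revision entry per wiki edit.

    revisions: list of (author, timestamp, comment) tuples, oldest first.
    body_paras: list of plain-text paragraphs for the article body.
    """
    article = ET.Element("article")
    info = ET.SubElement(article, "articleinfo")
    ET.SubElement(info, "title").text = title
    history = ET.SubElement(info, "revhistory")
    for number, (author, timestamp, comment) in enumerate(revisions, start=1):
        rev = ET.SubElement(history, "revision")
        ET.SubElement(rev, "revnumber").text = str(number)
        ET.SubElement(rev, "date").text = timestamp
        ET.SubElement(rev, "authorinitials").text = author
        ET.SubElement(rev, "revremark").text = comment
    for text in body_paras:
        ET.SubElement(article, "para").text = text
    # ElementTree escapes &, < and > in text content on serialization.
    return ET.tostring(article, encoding="unicode")
```

A dump of many articles could wrap such article elements in a single book or set element, but that choice is left open here.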

Content cleanup
A fair bit of the HTML in the current content won't validate as XHTML or XML. A possible solution is to run tidy -asxhtml on newly saved or previewed content. Tidy would need to leave the special wiki tags alone; its configuration file has options for declaring custom tags (for example new-inline-tags and new-blocklevel-tags). This would need a fair bit of testing.
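
The cleanup step might look roughly like the Python sketch below. It first checks whether a fragment is already well-formed XML, then builds (but does not run) a tidy command line that declares custom tags. The helper names and the example tag names nowiki and math are assumptions for illustration. Running the command requires the tidy binary to be installed, for example via subprocess.run with the fragment on standard input.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as XML inside a dummy root."""
    try:
        ET.fromstring(f"<root>{fragment}</root>")
        return True
    except ET.ParseError:
        return False


def tidy_command(inline_tags=(), block_tags=()):
    """Build a tidy argument list that converts input to XHTML.

    inline_tags and block_tags name custom elements that tidy should
    accept instead of reporting them as unknown.
    """
    cmd = ["tidy", "-asxhtml", "-quiet"]
    if inline_tags:
        cmd += ["--new-inline-tags", ",".join(inline_tags)]
    if block_tags:
        cmd += ["--new-blocklevel-tags", ",".join(block_tags)]
    return cmd
```

Only fragments for which is_well_formed returns False would need to go through tidy, which keeps the cost of the extra pass down on save and preview.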

Related Projects

 * Extension:XML_Bridge - implements a DocBook export alongside other XML dialects
 * Wiki2XML - Separate (abandoned) web-based conversion utility that converts MediaWiki wikitext to various formats, including DocBook. PHP source code is in Subversion. It ignores any blocks of preformatted text.
 * wt2db - WikiText to DocBook. A Linux utility, with source code.
 * Html2DocBook