DocBook XML export

This page was moved from MetaWiki.

This is a proposal page opened some years ago. See Extension:XML_Bridge for a current implementation.

MediaWiki 1.1.0 can export articles to a simplified XML format that wraps the raw wikitext with revision metadata (author, timestamp, comment, title). This proposal would supplement that feature with an export to DocBook XML.

Rationale

DocBook is a standard for marking up books and articles, originally defined in SGML and now also in XML. It began in the early 1990s as a documentation interchange format and was tuned for creating software documentation. It has since become a de facto standard for technical documentation and is also used for other kinds of formatted text.

Using a standard XML markup for MediaWiki means that we can build on existing work in document processing. A number of tools, both open source and proprietary, convert DocBook XML to other formats such as Rich Text Format, PostScript, and PDF. Some word processors, such as OpenOffice, now support DocBook as an input and output format.

DocBook is a very large markup language: its Document Type Definition (DTD) defines more than 400 elements. A smaller subset, Simplified DocBook, removes a number of redundant elements and would probably be sufficient for MediaWiki articles.

Comment: I really like this idea, but perhaps the target ought to be the new DocBook NG schema under development, one of the benefits of which is really easy customizability. It's also namespaced, and allows doing away with doctype declarations.

Design

Mapping Wikitext to DocBook elements

Paragraph blocks

  • paragraph → <para>
  • ''italics'' → <emphasis>
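
The two mappings above could be sketched roughly as follows. This is only an illustration in Python, not the proposed implementation: the function names are hypothetical, paragraphs are assumed to be separated by blank lines, and bold ('''...''') and all other inline markup are deliberately not handled.

```python
import re
from xml.sax.saxutils import escape

# Matches ''italic'' spans; bold (''') is NOT handled in this sketch.
ITALIC_RE = re.compile(r"''(.+?)''")


def inline_to_docbook(text):
    """Escape XML special characters, then map ''italics'' to <emphasis>."""
    escaped = escape(text)  # escapes &, < and >; apostrophes are left alone
    return ITALIC_RE.sub(r"<emphasis>\1</emphasis>", escaped)


def paragraphs_to_docbook(wikitext):
    """Treat blank-line-separated blocks as paragraphs and wrap each in <para>."""
    blocks = re.split(r"\n\s*\n", wikitext.strip())
    return "\n".join(
        "<para>" + inline_to_docbook(" ".join(block.split())) + "</para>"
        for block in blocks
        if block.strip()
    )
```

Escaping happens before the italics substitution so that the generated <emphasis> tags are not themselves escaped.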

Section headings

  • =heading 1= → <sect1>
  • ==heading 2== → <sect2>
  • ===heading 3=== → <sect3>
  • ====heading 4==== → <sect4>
  • =====heading 5===== → <sect5>

An alternative is to use the recursive <section>, which is currently preferred for DocBook. However, this might be difficult to accomplish with the current Wikitext parsing structure.
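
To show what the recursive <section> alternative involves, here is a minimal Python sketch that nests sections with a stack of open heading levels. It is a standalone line-based pass, not tied to the actual MediaWiki parser, and the function name is hypothetical. Every non-heading line is simply wrapped in <para>, and the output is well-formed but not guaranteed to be valid DocBook (for example, a heading with no content produces an empty section).

```python
import re
from xml.sax.saxutils import escape

# =Title= through =====Title=====, with the same number of "=" on both sides.
HEADING_RE = re.compile(r"^(={1,5})\s*(.+?)\s*\1\s*$")


def headings_to_sections(wikitext):
    """Convert wiki headings to nested <section> elements."""
    out = []
    open_levels = []  # heading levels of the <section> elements still open
    for line in wikitext.splitlines():
        match = HEADING_RE.match(line)
        if match:
            level = len(match.group(1))
            # Close every open section at the same or a deeper level.
            while open_levels and open_levels[-1] >= level:
                open_levels.pop()
                out.append("</section>")
            open_levels.append(level)
            out.append("<section><title>%s</title>" % escape(match.group(2)))
        elif line.strip():
            out.append("<para>%s</para>" % escape(line.strip()))
    # Close whatever is still open at the end of the article.
    while open_levels:
        open_levels.pop()
        out.append("</section>")
    return "\n".join(out)
```

The stack is what the fixed <sect1> to <sect5> mapping avoids: with numbered elements each heading can be translated on its own, whereas recursive sections require knowing which earlier headings are still open.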

Lists

  • *list → <itemizedlist>
  • #list → <orderedlist>
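
A rough Python sketch of the list mapping follows. It assumes one list item per line and handles only a single level of nesting: nested markers such as ** or *# are outside its scope. The function name is hypothetical, and wrapping each item in <listitem><para> reflects the fact that DocBook list items contain block elements rather than bare text.

```python
from xml.sax.saxutils import escape

# Wiki list marker -> DocBook list element.
LIST_TAGS = {"*": "itemizedlist", "#": "orderedlist"}


def lists_to_docbook(wikitext):
    """Convert flat * and # lists; other non-blank lines become <para>."""
    out = []
    current = None  # DocBook list element currently open, if any
    for line in wikitext.splitlines():
        tag = LIST_TAGS.get(line[:1])
        if tag != current:
            # The list type changed (or a list started/ended): close and reopen.
            if current:
                out.append("</%s>" % current)
            if tag:
                out.append("<%s>" % tag)
            current = tag
        if tag:
            item = escape(line[1:].strip())
            out.append("<listitem><para>%s</para></listitem>" % item)
        elif line.strip():
            out.append("<para>%s</para>" % escape(line.strip()))
    if current:
        out.append("</%s>" % current)
    return "\n".join(out)
```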

User interface

The user interface for exporting to DocBook would be similar to the "Printable version" link currently found on MediaWiki pages. "Save as XML" or "Save as DocBook" would be a separate link that exports the current article to DocBook XML.

It may also be useful to provide DocBook dumps in place of the SQL dumps currently used for Wikipedia. These dumps would include author credits and other information needed to comply with copyleft licenses.
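
One way the author credits might be carried is sketched below in Python. The proposal does not say which DocBook elements to use; this sketch assumes the DocBook 4.x <articleinfo> and <revhistory> elements and puts the wiki user name in <authorinitials>, which is a simplification. The function name and the shape of the revision dictionaries are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_article_info(title, revisions):
    """Build an <articleinfo> element with one <revision> per wiki revision.

    `revisions` is a list of dicts with "author", "timestamp" and an
    optional "comment" key (an assumed shape, not the real export schema).
    """
    info = ET.Element("articleinfo")
    ET.SubElement(info, "title").text = title
    history = ET.SubElement(info, "revhistory")
    for number, rev in enumerate(revisions, start=1):
        node = ET.SubElement(history, "revision")
        ET.SubElement(node, "revnumber").text = str(number)
        ET.SubElement(node, "date").text = rev["timestamp"]
        ET.SubElement(node, "authorinitials").text = rev["author"]
        if rev.get("comment"):
            ET.SubElement(node, "revremark").text = rev["comment"]
    # ElementTree escapes &, < and > in text nodes when serializing.
    return ET.tostring(info, encoding="unicode")
```

The same metadata that the existing XML export attaches to each revision (author, timestamp, comment, title) is enough to fill these elements.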

Content cleanup

A fair bit of the HTML in the current content won't validate as XHTML/XML. A possible solution is to apply tidy -asxhtml to newly saved or previewed content. The special wiki tags would need to be ignored by tidy; some options in tidy.conf allow custom tags to be configured. This would need a fair bit of testing.
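
As a sketch of the custom-tag configuration, the Python helpers below generate a small tidy configuration file and the matching command line. The new-inline-tags and new-blocklevel-tags options and the -config, -asxhtml and -quiet flags are standard HTML Tidy features, but the helper names are hypothetical, and which wiki tags to declare (and whether each is inline or block-level) is an assumption that would need the testing mentioned above.

```python
def build_tidy_config(inline_tags, block_tags):
    """Return tidy configuration text declaring custom (wiki) tags."""
    lines = []
    if inline_tags:
        lines.append("new-inline-tags: " + ", ".join(inline_tags))
    if block_tags:
        lines.append("new-blocklevel-tags: " + ", ".join(block_tags))
    return "\n".join(lines) + "\n"


def tidy_command(config_path):
    """Return the argument list for running tidy with that configuration."""
    return ["tidy", "-config", config_path, "-asxhtml", "-quiet"]


# Assumed example: treat <nowiki> and <math> as inline, <gallery> as block-level.
EXAMPLE_CONFIG = build_tidy_config(["nowiki", "math"], ["gallery"])
```

The command list could be passed to subprocess.run with the page HTML on standard input. Declaring a tag only stops tidy from rejecting it; the content inside the tag is still parsed as HTML, so raw wikitext inside <nowiki> may still be altered.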


Related Projects

See also