DocBook XML export

From mediawiki.org
This is a proposal page opened some years ago. See Extension:XML Bridge for a current implementation.

MediaWiki 1.1.0 has a feature for exporting articles to a simplified XML format that wraps the raw wikitext with revision metadata (author, timestamp, comment, title). This proposal would supplement that feature with an export to DocBook XML.

Rationale

DocBook XML is a standard for marking up books and articles in XML. The original standard began in the early 1990s as an interchange format for printers and desktop-publishing software, and was tuned for creating software documentation. It has since become a de facto standard for marking up formatted text documents of all kinds.

Using a standard XML markup for MediaWiki means that we can leverage other work in the field of document processing. A number of existing tools, both open source and proprietary, convert DocBook XML to other formats such as Rich Text Format, PostScript, and PDF. Some word processors, such as OpenOffice, now support DocBook as an input and output format.

DocBook is a very large markup language: its Document Type Definition (DTD) defines more than 400 elements. A more compact version, Simplified DocBook, removes a number of redundant elements and would probably be sufficient for MediaWiki articles.

Comment: I really like this idea, but perhaps the target ought to be the new DocBook NG schema under development, one benefit of which is that it is very easy to customize. It is also namespaced and does away with doctype declarations.

Design

Mapping Wikitext to DocBook elements

Paragraph blocks

  • paragraph → <para>
  • ''italics'' → <emphasis>
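
The paragraph mapping above can be sketched as follows. This is a minimal illustration in Python, not part of the proposal itself: the function name `paragraphs_to_docbook` is hypothetical, paragraphs are assumed to be separated by blank lines, and other inline markup (such as '''bold''') is not handled.

```python
import re
from xml.sax.saxutils import escape

# Matches ''italic'' spans; bold (''') and nested markup are out of scope here.
ITALIC = re.compile(r"''(.+?)''")


def paragraphs_to_docbook(wikitext: str) -> str:
    """Wrap blank-line-separated paragraphs in <para> elements."""
    paras = []
    for block in re.split(r"\n\s*\n", wikitext.strip()):
        if not block:
            continue
        # Escape &, < and > first so the output stays well-formed XML,
        # then convert the (unescaped) '' markers to <emphasis>.
        text = escape(" ".join(block.split()))
        text = ITALIC.sub(r"<emphasis>\1</emphasis>", text)
        paras.append(f"<para>{text}</para>")
    return "\n".join(paras)


if __name__ == "__main__":
    print(paragraphs_to_docbook("Hello ''world''.\n\nA & B"))
```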

Section headings

  • =heading 1= → <sect1>
  • ==heading 2== → <sect2>
  • ===heading 3=== → <sect3>
  • ====heading 4==== → <sect4>
  • =====heading 5===== → <sect5>

An alternative is to use the recursive <section> element, which DocBook currently prefers. However, this might be difficult to accomplish with the current wikitext parsing structure.
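
As a rough illustration of the recursive approach, the sketch below keeps a stack of open heading levels and closes sections whenever a heading of the same or a higher level appears. It is written in Python and works on plain lines of text; it does not reflect how the MediaWiki parser is structured, the name `headings_to_sections` is hypothetical, and the output is not guaranteed to validate against the DocBook DTD (for example, a paragraph may follow a nested section).

```python
import re
from xml.sax.saxutils import escape

# "==Title==" style headings: the same run of "=" on both sides.
HEADING = re.compile(r"^(=+)\s*(.*?)\s*\1\s*$")


def headings_to_sections(lines):
    """Turn wikitext headings into nested <section> elements."""
    out = []
    stack = []  # heading levels of the sections that are currently open
    for line in lines:
        match = HEADING.match(line)
        if not match:
            if line.strip():
                out.append(f"<para>{escape(line.strip())}</para>")
            continue
        level = len(match.group(1))
        # Close every open section at the same or a deeper level.
        while stack and stack[-1] >= level:
            stack.pop()
            out.append("</section>")
        stack.append(level)
        out.append("<section>")
        out.append(f"<title>{escape(match.group(2))}</title>")
    # Close whatever is still open at the end of the article.
    out.extend("</section>" for _ in stack)
    return out


if __name__ == "__main__":
    print("\n".join(headings_to_sections(["=A=", "text", "==B==", "=C="])))
```

Because the nesting depth comes from the stack rather than from the tag name, the same code handles any heading level without needing separate <sect1> to <sect5> cases.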

Lists

  • *list → <itemizedlist>
  • #list → <orderedlist>
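
A minimal sketch of the list mapping, again in Python and under simplifying assumptions: only single-level lists are handled (nested markers such as ** are not), and the function name `lists_to_docbook` is hypothetical. Each item is wrapped in <listitem><para>, since DocBook list items hold block-level content.

```python
from xml.sax.saxutils import escape

# Wikitext list marker -> DocBook list element.
LIST_TAGS = {"*": "itemizedlist", "#": "orderedlist"}


def lists_to_docbook(lines):
    """Convert single-level * and # lists to DocBook list elements."""
    out = []
    open_tag = None  # list element currently open, if any
    for line in lines:
        tag = LIST_TAGS.get(line[:1])
        if tag != open_tag:
            # The list type changed or the list ended: close and reopen.
            if open_tag:
                out.append(f"</{open_tag}>")
            if tag:
                out.append(f"<{tag}>")
            open_tag = tag
        if tag:
            item = escape(line[1:].strip())
            out.append(f"<listitem><para>{item}</para></listitem>")
        elif line.strip():
            out.append(f"<para>{escape(line.strip())}</para>")
    if open_tag:
        out.append(f"</{open_tag}>")
    return out


if __name__ == "__main__":
    print("\n".join(lists_to_docbook(["*a", "*b", "#c", "end"])))
```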

User interface

The user interface for exporting to DocBook would be similar to the "Printable version" link currently found on MediaWiki pages. "Save as XML" or "Save as DocBook" would be a separate link that exports the current article to DocBook XML.

It may also be useful to provide DocBook dumps instead of the SQL dumps that are currently used for Wikipedia. These would include author credits and other information necessary to conform with copyleft licenses.

Content cleanup

A fair bit of the HTML in the current content won't validate as XHTML/XML. One possible solution is to apply tidy -asxhtml to newly saved or previewed content. Tidy would need to ignore the special wiki tags; some options in tidy.conf allow custom tags to be configured. This would need a fair bit of testing.
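
The cleanup step might look roughly like the Python sketch below, which assumes HTML Tidy is installed as `tidy` on the path. The helper names (`tidy_config`, `tidy_command`, `clean_html`) are hypothetical, and the tag passed in the usage example is chosen only for illustration; which wiki tags would actually need declaring is an open question. The `new-inline-tags` option is one of Tidy's settings for declaring custom tags.

```python
import subprocess


def tidy_config(custom_inline_tags):
    """Return the text of a Tidy config file that declares custom inline tags."""
    return "\n".join([
        "output-xhtml: yes",
        "new-inline-tags: " + ", ".join(custom_inline_tags),
    ]) + "\n"


def tidy_command(config_path):
    """Build the Tidy command line: quiet, XHTML output, given config file."""
    return ["tidy", "-q", "-asxhtml", "-config", config_path]


def clean_html(html, config_path):
    """Run Tidy on an HTML string and return the cleaned XHTML."""
    result = subprocess.run(
        tidy_command(config_path),
        input=html,
        capture_output=True,
        text=True,
    )
    # Tidy exits with 0 on success, 1 on warnings and 2 on errors.
    if result.returncode > 1:
        raise RuntimeError(result.stderr)
    return result.stdout


if __name__ == "__main__":
    # Illustrative only: declare <nowiki> as a custom inline tag.
    print(tidy_config(["nowiki"]))
    print(tidy_command("tidy.conf"))
```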

