Extension:XML Bridge

From MediaWiki.org

XML Bridge

Release status: unstable
Implementation: Data extraction
Description: Converts MediaWiki markup to XHTML
Author(s): PediaPress
Download: no link
Note: This page is about software currently in development. No software has been released yet.



Introduction

Wiki syntax, due to its lack of formalization and “ad hoc” nature, is not well suited for text transformation to other formats. It is desirable to implement support for an intermediate format based on XML, which will make it possible to use standard XML parsing and transformation libraries on the source content. While MediaWiki's native parser exports to XHTML Transitional, the conversion from wiki syntax to XHTML is lossy: information about the templates used, the parameters for extensions and images, and so on, is not preserved. This makes many conversions impossible, because the information they need is no longer present.

It is therefore planned to develop software that converts MediaWiki articles to an XHTML-based representation. XHTML is well suited as a source from which to derive other formats such as PDF or ODF.

As much semantic information from the wiki source text as possible will be preserved by using XHTML features such as namespaces.

The transformation to an XHTML-based format that preserves semantic information will enable a vast number of uses by programmers, and will also allow a long-term transition to XML as a backend storage format for wiki articles.


Development is assigned to PediaPress and funded by the Commonwealth of Learning.

Immediate Goal

Release an XML exporter at all!

There is potential for lengthy discussions on:

  • XML dialect used (DocBook, ODF, XHTML, ...)
  • whether to develop and use a proprietary Wikipedia DTD
  • limitations (parser and representation) resulting from
    • broken markup
    • arbitrary HTML
    • proper detection of template boundaries

Our goal is to get an initial release out as soon as possible and to initiate these discussions, hopefully leading to better future implementations.

Once this is available, developers can:

  • use the XML for various envisioned applications
  • transform it (e.g. using XSLT) to other XML formats
  • clone the code to directly generate the desired markup
  • start discussing the flaws in the current MediaWiki markup specification/conventions and how to fix them
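
As a rough illustration of the first two points, the sketch below uses Python's standard xml.etree.ElementTree module to pull all links out of an XHTML document. The sample document and the function name list_links are assumptions made for illustration only; no exporter output has been released yet, so the real format may differ.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical exporter output; the real format is not released yet.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <h2>some heading</h2>
    <p><a href="SomePage">some page within the same wiki</a></p>
  </body>
</html>"""


def list_links(xhtml_text):
    """Return (href, link text) pairs for every <a> element in the document."""
    root = ET.fromstring(xhtml_text)
    return [
        (a.get("href"), "".join(a.itertext()))
        for a in root.iter("{%s}a" % XHTML_NS)
    ]


links = list_links(SAMPLE)
for href, text in links:
    print(href, "->", text)
```

Because the input is well-formed XML, no wiki-specific parser is needed on the consuming side; any standard XML library should do.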

Long Term Goal

The long-term goal is a solid XML export/import that allows the MediaWiki markup to be replaced with an XML representation. This may coincide with WYSIWYG editing in MediaWiki.

Steps toward this goal:

  • initial release of the XML exporter code
  • develop an XML-to-MediaWiki-markup converter so that one can convert back and forth
  • discuss and incrementally improve the XML markup
  • discuss whether certain uses of HTML styling and templates can be labeled deprecated
  • check edits and notify users who use wrong or deprecated markup
  • fix or remove all broken or deprecated markup, HTML styling, and inappropriate template usage
  • switch to XML
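
The back-conversion step could look roughly like the following sketch, which turns one XHTML paragraph back into wiki markup. It is only an illustration: the function names are hypothetical, the class name mwx.link.internal is borrowed from the microformat example further down this page, and only internal links are handled.

```python
import xml.etree.ElementTree as ET


def link_to_wikitext(a):
    """Render an internal-link element as [[Target]] or [[Target|label]]."""
    target = a.get("href")
    label = "".join(a.itertext()).strip()
    if label == target:
        return "[[%s]]" % target
    return "[[%s|%s]]" % (target, label)


def paragraph_to_wikitext(xhtml_text):
    """Convert a single <p> element to wiki markup (internal links only)."""
    p = ET.fromstring(xhtml_text)
    parts = [p.text or ""]
    for child in p:
        classes = child.get("class", "").split()
        if child.tag == "a" and "mwx.link.internal" in classes:
            parts.append(link_to_wikitext(child))
        else:
            # Unknown elements are flattened to their text in this sketch.
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")
    return "".join(parts)


wikitext = paragraph_to_wikitext(
    '<p>See <a href="SomePage" class="mwx.link.internal">some page</a>'
    " for details.</p>"
)
print(wikitext)
```

A real converter would have to cover every construct the exporter emits; otherwise the round trip would lose information in the other direction.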

Current Status

Testing - no code released yet.

Implementation

We plan an initial implementation based on XHTML extended by Microformats where necessary.

You may want to look at our proposed XML format: Extension:XML Bridge/MWXHTML

The XML is generated from the parse tree produced by the mwlib MediaWiki markup parsing library.
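
To make the idea of generating XML from a parse tree concrete, here is a minimal sketch. It does not use mwlib: the Node class and its kinds are hypothetical stand-ins for whatever node types the real parse tree provides, and the class names are taken from the microformat example further down this page.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical parse-tree node; NOT the actual mwlib data structure."""
    kind: str            # "section", "paragraph", "link" or "text"
    value: str = ""      # heading, link target, or literal text
    children: list = field(default_factory=list)


def append_text(parent, text):
    """Append text after parent's last child, or as its leading text."""
    if len(parent):
        last = parent[-1]
        last.tail = (last.tail or "") + text
    else:
        parent.text = (parent.text or "") + text


def emit(node, parent):
    """Recursively serialize a node as a child of the XHTML element parent."""
    if node.kind == "text":
        append_text(parent, node.value)
        return
    if node.kind == "section":
        el = ET.SubElement(
            parent, "div", {"class": "mwx.section", "title": node.value})
        ET.SubElement(el, "h2").text = node.value
    elif node.kind == "link":
        el = ET.SubElement(
            parent, "a", {"href": node.value, "class": "mwx.link.internal"})
    else:
        el = ET.SubElement(parent, "p")
    for child in node.children:
        emit(child, el)


tree = Node("section", "some heading", [
    Node("paragraph", children=[
        Node("link", "SomePage",
             [Node("text", "some page within the same wiki")]),
    ]),
])
body = ET.Element("body")
emit(tree, body)
xhtml = ET.tostring(body, encoding="unicode")
print(xhtml)
```

Building the output through an XML library rather than by string concatenation keeps the result well-formed, since escaping and tag nesting are handled by the serializer.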

Considered Implementation Options

MediaWiki specific XML Language

Don’t Invent XML Languages - at least for now.

DocBook

DocBook is a very large markup language. A more abbreviated version, Simplified DocBook, removes a number of redundant elements. DocBook NG schema (customizable namespaces) is under development.

Conclusion: DocBook is overly complicated, yet it still lacks features needed for a lossless representation of MW markup.


XHTML

XHTML is well supported by many applications and libraries. XHTML can be mixed with other namespaces (plain XHTML remains if the content from other namespaces is stripped). MW markup currently allows HTML and even CSS styles to be mixed in, so a lossless XML representation would need to support a subset of the XHTML specification.

MW markup expresses semantics (e.g. sections) that are not supported by XHTML 1.0. Hence pure XHTML is not sufficient.

XHTML + Additional namespace

We considered combining XHTML with a proprietary MediaWiki-specific namespace (xmlns:mwx), using the best of both worlds (compatibility with existing tools and a lossless representation).

For example, a category link could be written as:

<a href="Kategorie:Extensions" mwx:linktype="category">Extensions</a>

Other suitable namespaces, such as MathML, can be included.
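
The category link shown above could be read back with any namespace-aware XML parser. The following sketch uses Python's standard library; note that the namespace URI bound to the mwx prefix is an assumption (this page does not define one), as is the wrapping paragraph element that carries the namespace declarations.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Assumed namespace URI for the mwx prefix; the page does not define one.
MWX_NS = "http://example.org/mwx"

SAMPLE = (
    '<p xmlns="%s" xmlns:mwx="%s">'
    '<a href="Kategorie:Extensions" mwx:linktype="category">Extensions</a>'
    "</p>" % (XHTML_NS, MWX_NS)
)


def category_links(xhtml_text):
    """Return the href of every link whose mwx:linktype is 'category'."""
    root = ET.fromstring(xhtml_text)
    linktype = "{%s}linktype" % MWX_NS
    return [
        a.get("href")
        for a in root.iter("{%s}a" % XHTML_NS)
        if a.get(linktype) == "category"
    ]


categories = category_links(SAMPLE)
print(categories)
```

Tools that do not know the mwx namespace can simply ignore the extra attribute and still treat the element as an ordinary XHTML link.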

This still requires inventing and adding a new XML language, which is considered harmful (see above).

XHTML + Microformats

Use Microformats to semantically annotate generated XHTML.

<div class="mwx.section" title="some heading">
  <h2>some heading</h2>
  <p>
    <a href="SomePage" class="mwx.link.internal">some page within the same wiki</a>
  </p>
</div>
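
A consumer could recover the annotated semantics from such output by inspecting class attributes. The sketch below does this with Python's standard library for the fragment above; the helper names are hypothetical, and the class names are simply those used in the example, which may change in the planned implementation.

```python
import xml.etree.ElementTree as ET

# The fragment from the example above (no namespace declaration, as shown).
SAMPLE = """<div class="mwx.section" title="some heading">
  <h2>some heading</h2>
  <p>
    <a href="SomePage" class="mwx.link.internal">some page within the same wiki</a>
  </p>
</div>"""


def has_class(element, name):
    """True if name is one of the space-separated class tokens."""
    return name in element.get("class", "").split()


def sections_with_links(xhtml_text):
    """Map each section title to the internal link targets found inside it."""
    root = ET.fromstring(xhtml_text)
    result = {}
    for div in root.iter("div"):
        if has_class(div, "mwx.section"):
            result[div.get("title")] = [
                a.get("href")
                for a in div.iter("a")
                if has_class(a, "mwx.link.internal")
            ]
    return result


sections = sections_with_links(SAMPLE)
print(sections)
```

In this sketch, links in nested sections would also be reported for the enclosing section, because div.iter("a") visits all descendants.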


Discussion on microformats in MediaWikis

See the planned implementation: Extension:XML_Bridge/MWXHTML

Open Issues

Templates

  • Currently it seems impossible to correctly mark all uses of templates within the XML output.


Related Projects

See also
