Markup spec/DTD

Introduction

This is a draft of the Wikipedia DTD, an interchangeable XML representation of the content of Wikipedia articles. It contains elements for the content of an article (Wikitext DTD). It is also a contribution towards a formal wikitext standard, which is still lacking. Once there is such a standard, multiple software suites (some based on MediaWiki, some not) could support it, and each Wikipedia could choose between them while supporting the same portable open content.

Important notes

The Wikipedia DTD is an interchange format. It is not meant for writing articles in, nor as a replacement for a database!

Up to now I (Jakob Voss) am the only author of the Wikipedia DTD, but it is meant to be a collaborative work in progress, so feel free to contribute; you should at least know the basics of XML, though. Please add comments on Talk:Wikipedia DTD. I am also working on simple WikipediaDTD-to-HTML and WikipediaDTD-to-Wikitax scripts that will be made public soon. The most important prerequisite for using the Wikipedia DTD in real life is a Wikitax-to-XML parser. Writing one is quite complicated because wiki syntax can be mixed with invalid HTML.

On compatibility: up to now there are Wikipedia articles whose content will not fit into wikiarticle.dtd because it contains broken or ugly HTML. Some elements that are still allowed in Wikitax would have to be removed (for instance HTML-coded headings and horizontal lines, or the font element). This is a topic for discussion.

Still missing parts: table, dl, pre, div, ruby, font, var, and many attributes.

Advantages

You could also use an XML representation to

  • produce valid XHTML
  • produce printable output (for instance PDF with FO), see also de:Automatisierte PDF-Erstellung
  • export Wikitax to other formats (for instance w:DocBook)
  • export other formats to Wikitax (for instance OpenOffice files)
  • perform automatic analysis on structure and layout of articles
  • exploit better XML-based tools to define and extend wikitax
  • integrate specialized DTDs for "who" (m:person_DTD), "when and where" (m:spacetime_DTD) easily, as long as the tag spaces are distinct

Wikipedia articles

Root element

Up to now the Wikipedia DTD is only meant for single articles, so the root element is article. An article consists of a metadata section containing the title and other information, followed by the article content itself, which may be Wikitax, text or a redirect.

<!ELEMENT article (meta, (wikitax | text | redirect))>
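As a rough illustration, a minimal document that should be valid against this content model could look like the sketch below. The article name and text are invented, and the file name wikiarticle.dtd in the DOCTYPE is taken from the compatibility note above; how the DTD is actually published is an assumption.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Assumption: the DTD is available locally as wikiarticle.dtd -->
<!DOCTYPE article SYSTEM "wikiarticle.dtd">
<article>
  <!-- meta comes first; only the title is required -->
  <meta>
    <title article="Example article"/>
  </meta>
  <!-- exactly one of wikitax, text or redirect follows -->
  <text>This is <b>structured</b> article content.</text>
</article>
```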

Linking model

Links (to other articles) are one of the most important things in Wikipedia. In most cases they simply consist of an article name, but they may also include a namespace. Wikipedia has 9 namespaces: default (none), talk, user, user-talk, wikipedia, wikipedia-talk, image, image-talk and special. The names of the namespaces may differ between languages, but the namespaces themselves remain the same. For instance, there is no namespace Diskussion; it is only the German name for the talk namespace. Local names are not part of the Wikipedia DTD.

I prefer to model the talk property separately from the namespace.

<!ENTITY % possible-namespaces "(special | user | wikipedia | image)">

<!ENTITY % local-link-model " 
        talk (talk) #IMPLIED
        namespace %possible-namespaces; #IMPLIED
        article CDATA #REQUIRED
">

The attributes of the local-link-model parameter entity form a full local link destination.
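The fragments below sketch how these three attributes might combine. They use the link element, which is defined further down in the Links section; the article names and link texts are made up for illustration.

```xml
<!-- Default namespace: only the required article attribute is given -->
<link article="XML">XML</link>

<!-- Talk page of the same article: the talk flag is set, no namespace -->
<link talk="talk" article="XML">discussion about XML</link>

<!-- A page in the user-talk namespace: namespace and talk flag combined -->
<link namespace="user" talk="talk" article="Jakob Voss">Jakob's talk page</link>
```

Splitting the talk flag from the namespace means four namespace values plus one flag cover all nine namespaces listed above.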

Metadata elements

The metadata section contains information about the title, status, version history and interwiki links. Only the title is obligatory. Elements may be added later for copyright information, category links and other material that is more Dublin Core-style metadata than article content.

<!ELEMENT meta (title, status?, interwiki*, history?)>

Title

The title of an article does not change, so it is not part of the article history. Since a title may include a namespace, it is easiest to specify the full title as a link to the article itself. The interwiki attribute may specify the wiki the article comes from (normally a language).

<!ELEMENT title EMPTY>
<!ATTLIST title      
        interwiki NMTOKEN #IMPLIED
        %local-link-model;
>

Interwiki links

There are several reasons why interwiki links belong in the metadata section. Interwiki links are relations of an article (maybe there will be other relation types in the future), in contrast to normal links in the article content, which may mean several things. At the same time, there may be links to other Wikipedias inside the article content. Do not confuse the two. Interwiki links use the same #Linking model as other links, but they must specify a known language.

<!ELEMENT interwiki EMPTY>
<!ATTLIST interwiki  
        language NMTOKEN #REQUIRED
        %local-link-model;
>
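A metadata section with a title and two interwiki links might look like the following sketch. The article names and language codes are only illustrative, and the assumption that language codes such as "de" or "fr" are the expected attribute values is mine.

```xml
<meta>
  <!-- the article's own full title; interwiki names the wiki it comes from -->
  <title interwiki="en" article="Germany"/>
  <!-- the corresponding articles in other language editions -->
  <interwiki language="de" article="Deutschland"/>
  <interwiki language="fr" article="Allemagne"/>
</meta>
```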

Article Status

The status element contains status information such as whether the article is protected, whether a table of contents should be shown, etc. Since the status may change through edits, it is also part of the article history.

<!ELEMENT status EMPTY>
<!ATTLIST status
        protected (protected) #IMPLIED
        counter CDATA #IMPLIED
        notoc (notoc) #IMPLIED
>

Note: At present the database only stores the counter (the number of times a page has been viewed) for the current version. This should be changed in the future to get more usage information (for instance to see how often a page has been viewed since the last edit or to detect edit wars automatically).

Version history

The version history simply contains a number of edits.

<!ELEMENT history (edit)+>

Each edit contains the edit information (user, timestamp...) and the status, interwiki links and content of the article after the edit. The article content is optional: if it is not provided, either nothing changed or the content was simply left out because it is not of interest.

<!ELEMENT edit (status?, interwiki*, (text | redirect)?)>
<!ATTLIST edit
        user CDATA #REQUIRED
        comment CDATA #IMPLIED
        timestamp CDATA #IMPLIED
        minor (minor) #IMPLIED
>

An example of a timestamp is

2002-02-25T15:43:11Z

The format is YYYY-MM-DDThh:mm:ssZ (ISO 8601), where YYYY is the year, MM the month, DD the day, hh the hour (24-hour clock), mm the minute and ss the second. T is a separator between date and time, and the trailing Z states that the time is given in Coordinated Universal Time (UTC). Z is spoken as "Zulu", its code word in the international code of signals: in the military time zone system, zones are designated by letters (J is not used for a zone), and Z denotes the zone with zero offset. For practical purposes UTC corresponds to Greenwich Mean Time (GMT); it is the standard time used in navigation and on the Internet.
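Putting the pieces together, a version history might be serialized roughly as follows. User names, comments and contents are invented for illustration.

```xml
<history>
  <edit user="ExampleUser" comment="initial version"
        timestamp="2002-02-25T15:43:11Z">
    <text>First draft.</text>
  </edit>
  <!-- a minor edit that also changed the status -->
  <edit user="AnotherUser" comment="typo"
        timestamp="2002-02-26T09:05:00Z" minor="minor">
    <status notoc="notoc"/>
    <text>First draft, corrected.</text>
  </edit>
  <!-- an edit without content: nothing changed, or the content was left out -->
  <edit user="ExampleUser" timestamp="2002-03-01T12:00:00Z"/>
</history>
```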

Redirects

Either an article contains a redirect or text. Redirects are links to articles in the same wiki.

<!ELEMENT redirect EMPTY>
<!ATTLIST redirect   
        %local-link-model;
>
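For example, an article that only redirects to another one could be written like this (both article names are hypothetical):

```xml
<article>
  <meta>
    <title article="WWW"/>
  </meta>
  <!-- the redirect takes the place of text; its target uses the linking model -->
  <redirect article="World Wide Web"/>
</article>
```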

Wikitax

Since there is no Wikitax-to-XML parser yet, the article content can also be transferred as Wikitax. Since Wikitax depends on the MediaWiki software and might change, version information should be provided.

<!ELEMENT wikitax (#PCDATA)>
<!ATTLIST wikitax 
  version CDATA #REQUIRED
>
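A sketch of an article carrying raw Wikitax is shown below. The version value is a placeholder rather than a real release number, and wrapping the content in a CDATA section is only one option; it avoids having to escape characters such as < and & in the wiki source.

```xml
<article>
  <meta>
    <title article="Example article"/>
  </meta>
  <!-- version value is a placeholder, not a real release number -->
  <wikitax version="x.y"><![CDATA[
== Heading ==
Some '''bold''' text and a [[link]].
]]></wikitax>
</article>
```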

Wikitext

An XML representation of article content is the core of Wikipedia DTD. You can also use this part alone.

<!ENTITY % wikitext-block "ul | ol | center | blockquote | pbr | hr | h1 | h2 | h3 | h4 | h5 | h6">
<!ENTITY % wikitext-inline-format "b | i | sub | sup | big | small | tt | u | br | nowiki">
<!ENTITY % wikitext-inline-special "math | wikivar | link | reference | url">

<!ENTITY % wikitext-inline "%wikitext-inline-format; | %wikitext-inline-special;">

Some elements that are not yet defined are still missing from these parameter entities.

<!ELEMENT text (#PCDATA | %wikitext-block; | %wikitext-inline;)*>

Block elements

Headings, horizontal line

In contrast to HTML there are no attributes.

<!ELEMENT h1 (#PCDATA | %wikitext-inline;)*>
<!ELEMENT h2 (#PCDATA | %wikitext-inline;)*>
<!ELEMENT h3 (#PCDATA | %wikitext-inline;)*>
<!ELEMENT h4 (#PCDATA | %wikitext-inline;)*>
<!ELEMENT h5 (#PCDATA | %wikitext-inline;)*>
<!ELEMENT h6 (#PCDATA | %wikitext-inline;)*>

<!ELEMENT hr EMPTY>

Indented lines

<!ELEMENT indent (#PCDATA | %wikitext-inline;)*>
<!ATTLIST indent
  depth CDATA '1'
>

Lists

To avoid mixing #PCDATA with sublists, we define oli as a list item followed by an ordered sublist (li + ol) and uli as a list item followed by an unordered sublist (li + ul).

<!ELEMENT ol (li | oli | uli)+>
<!ELEMENT ul (li | oli | uli)+>
<!ELEMENT oli (li, ol)>
<!ELEMENT uli (li, ul)>

<!ELEMENT li (#PCDATA | %wikitext-inline;)*>
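As an illustration, an unordered list in which one item carries an ordered sublist might be written as in the sketch below. It assumes the reading given above, namely that oli wraps a list item together with its ordered sublist; the item texts are invented.

```xml
<ul>
  <li>a plain item</li>
  <!-- oli groups an item with the ordered sublist that belongs to it -->
  <oli>
    <li>an item with numbered sub-steps</li>
    <ol>
      <li>first step</li>
      <li>second step</li>
    </ol>
  </oli>
  <li>another plain item</li>
</ul>
```

Because the sublist sits next to the li rather than inside it, li itself stays a pure inline container.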

TODO:

  • attributes: "type", "start", "value", "compact",
  • definition lists

Tables

TODO

  • elements: table, tr, td, th
  • attributes: "summary", "width", "border", "frame", "rules", "cellspacing", "cellpadding", "valign", "char", "charoff", "colgroup", "col", "span", "abbr", "axis", "headers", "scope", "rowspan", "colspan"

center, blockquote

<!ELEMENT center (#PCDATA | %wikitext-inline;)*>
<!ELEMENT blockquote (#PCDATA | %wikitext-inline;)*>
<!ATTLIST blockquote
        cite CDATA #IMPLIED
>

pre, div

TODO (what is allowed inside? - div as block and inline?)

Paragraph breaks

Wikitax provides a way to separate paragraphs: just add an empty line. In the Wikitext DTD this is represented by the empty element <pbr/> (paragraph break). The possibility of creating paragraphs with <p> should be abolished, because it leads to broken XML and because we should reduce the number of allowed HTML tags.

<!ELEMENT pbr EMPTY>
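A short sketch of how three paragraphs, separated by empty lines in Wikitax, might look in the XML representation (the sentences are placeholders):

```xml
<!-- each <pbr/> stands for one empty line in the wiki source -->
<text>First paragraph, written as plain character data.<pbr/>Second paragraph, with <i>inline</i> markup.<pbr/>Third paragraph.</text>
```

Note that pbr is a separator rather than a container, so the paragraphs themselves are not wrapped in elements.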

Inline elements

Wikitext special elements

TODO: nowiki, media

nowiki parts

<!ELEMENT nowiki (#PCDATA)>

Links

See #Linking_model for details.

<!ELEMENT link (#PCDATA | %wikitext-inline-format;)*>
<!ATTLIST link
        interwiki NMTOKEN #IMPLIED
        %local-link-model;
>

Math

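The image attribute may provide an image representation of the formula. Because it is declared with type ENTITY, its value must be the name of an unparsed entity declared in the DTD.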

<!ELEMENT math (#PCDATA)>
<!ATTLIST math 
  image ENTITY #IMPLIED
>
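The sketch below shows one way such an image might be attached. An attribute of type ENTITY refers to an unparsed entity, which in turn needs a notation; here both are declared in the internal subset of the document. The entity name, file name and notation identifier are hypothetical, as is the file name wikiarticle.dtd for the external subset.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article SYSTEM "wikiarticle.dtd" [
  <!-- hypothetical notation and unparsed entity for the rendered formula -->
  <!NOTATION png SYSTEM "image/png">
  <!ENTITY emc2 SYSTEM "emc2.png" NDATA png>
]>
<article>
  <meta>
    <title article="Example article"/>
  </meta>
  <!-- image names the entity declared above; the content is the formula source -->
  <text>A famous formula: <math image="emc2">E = mc^2</math></text>
</article>
```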

URL

<!ELEMENT url (#PCDATA | %wikitext-inline-format;)*>
<!ATTLIST url
  href CDATA #REQUIRED
>

Reference

<!ELEMENT reference EMPTY>
<!ATTLIST reference
   system (email | RFC | ISBN) #REQUIRED
   value CDATA #IMPLIED
>

Images and other media files

<!ELEMENT media EMPTY>
<!ATTLIST media
   name CDATA #REQUIRED
   data ENTITY #IMPLIED
>
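The following fragments sketch how the url, reference and media elements declared above might be used. The address, values and file name are examples only. Note that media is not yet part of any content model in this draft, so where it may appear is still open.

```xml
<!-- external link with formatted link text -->
<url href="http://www.mediawiki.org/">the <i>MediaWiki</i> site</url>

<!-- references: the system attribute says how to interpret the value -->
<reference system="RFC" value="2396"/>
<reference system="email" value="someone@example.org"/>

<!-- a media file identified by name only; the data attribute is optional -->
<media name="Example.png"/>
```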

bold/italic

strong and em will never be used in the right way, so use b and i instead. No attributes are allowed.

<!ELEMENT b (#PCDATA | i | big | small | sub | sup | tt | u | br | %wikitext-inline-special;)*>
<!ELEMENT i (#PCDATA | b | big | small | sub | sup | tt | u | br | %wikitext-inline-special;)*>

Several HTML tags

Several HTML tags are also allowed in the Wikitext DTD, but most of them are simplified in some way (for instance, they have no or fewer attributes). These tags are not HTML: they resemble the HTML tags of the same name but are not identical to them.

<!ELEMENT tt (#PCDATA | b | i | big | small | sub | sup | u | br | %wikitext-inline-special;)*>
<!ELEMENT u (#PCDATA | b | i | big | small | sub | sup | tt | br | %wikitext-inline-special;)*>
<!ELEMENT sub (#PCDATA | %wikitext-inline;)*>
<!ELEMENT sup (#PCDATA | %wikitext-inline;)*>
<!ELEMENT big (#PCDATA | %wikitext-inline;)*>
<!ELEMENT small (#PCDATA | %wikitext-inline;)*>
<!ELEMENT br EMPTY>

TODO: ruby-tags

Variables

Some dynamic variables can be used in Wikitax as {{VARNAME}}.

<!ELEMENT wikivar EMPTY>
<!ATTLIST wikivar
  name (CURRENTMONTH | CURRENTMONTHNAME | CURRENTDAY | CURRENTDAYNAME |
        CURRENTYEAR | CURRENTTIME | NUMBEROFARTICLES)
  #REQUIRED
>
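For instance, the Wikitax line "This wiki currently has {{NUMBEROFARTICLES}} articles." might be represented as in this sketch (the sentence itself is invented):

```xml
<!-- the variable name moves into the name attribute of an empty wikivar element -->
<text>This wiki currently has <wikivar name="NUMBEROFARTICLES"/> articles.</text>
```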

Open questions

  • div (yes), font (no), var (why).

Why use a container? Use something like <para/> instead. Divs are usually used for content formatting, not structuring (at least I believe they should not be used for this). One container for paragraphs is absolutely sufficient.

  • What to do with/where to allow universal HTML attributes (id, class, name, style)?

You should never allow id, class or style, since all of them are used to refer to a CSS stylesheet. The Wikipedia DTD must not care about content formatting! The XML should be treated as a data container only. If you want special formatting, use an XSL stylesheet.

  • remove id and name, allow class and style at some elements
  • Should we create a custom DTD? Would DocBook or Simplified DocBook work? If so, would it be better to use a standard format rather than make up our own?

If DocBook would work, it would definitely be a wiser choice than reinventing the wheel. There are many utilities and stylesheets for working with DocBook content. If DocBook alone is not enough, try to use it as a base and define only the extensions in the DTD.

  • Or, create a DTD with an ontology optimized for MediaWiki, but with a mapping to (Simplified) DocBook. Have it both ways.