User:GorillaWarfare/pandoc

From mediawiki.org

Per discussion at bug 46517, I've been testing pandoc as an option for converting wiki pages to other formats. Pandoc can convert text to and from several potentially useful formats, including wikimarkup, HTML, LaTeX, markdown, and plaintext. It has serious difficulty in several areas, which I'm documenting here for later reference.

Wikimarkup → LaTeX

I ran pandoc with the following command: pandoc -f mediawiki -t latex -s input.wiki -o output.tex
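To repeat the same conversion over several pages, the command can be wrapped in a small script. This is a minimal Python sketch, not the setup used for the tests on this page. The helper names pandoc_command and convert are hypothetical, and it assumes pandoc is installed and on the PATH.

```python
import subprocess
from pathlib import Path
from typing import List


def pandoc_command(source: Path, target: Path) -> List[str]:
    """Return the argument list for the pandoc invocation quoted above."""
    return [
        "pandoc",
        "-f", "mediawiki",  # input format: MediaWiki markup
        "-t", "latex",      # output format: LaTeX
        "-s",               # standalone: emit a complete document
        str(source),
        "-o", str(target),
    ]


def convert(source: Path) -> subprocess.CompletedProcess:
    """Run pandoc on one file, writing <name>.tex next to the input."""
    target = source.with_suffix(".tex")
    # check=False: a failed conversion is a result to record, not an exception.
    return subprocess.run(
        pandoc_command(source, target),
        capture_output=True,
        text=True,
        check=False,
    )
```

Calling convert(Path("input.wiki")) returns a CompletedProcess whose returncode and stderr can be logged for each page.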

General issues

Pandoc has some issues that affect almost every article that it tries to parse:

  • Images aren't included in articles. The documentation suggests that images will be downloaded if the standalone flag is set, but they are not. LaTeX attempts to find them in the directory from which it's building, and when it's unable to do so, the build fails.
  • Many accented/special characters aren't recognized.
  • Templates are either omitted completely or leave fragments of their syntax in the output.
  • No attribution is included.
  • Footnote superscripts appear in the text, and the numbers appear at the bottom of the page, but the footnotes themselves are almost always incomplete or empty.
  • Piped and non-piped links are formatted differently from each other.
  • Links appear to be relative filepaths, which is not feasible considering how many links occur in each article, each linked article, and so on.
  • Endnotes might be more appropriate than footnotes, considering that the footnotes can often take up the better part of a page.
  • Categories are displayed as raw text at the end of the page.
  • There are lots of issues with tables:
    • They run off the edge of the page if they're too wide.
    • LaTeX formatting is often broken, causing the document to fail to build.
    • They're centered, which is more of a stylistic choice I disagree with.
    • There are no cell dividing borders, which is really important for some tables (particularly those with cells that span multiple rows/columns)
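Two of the issues above (missing images and categories shown as raw text) could be worked around by pre-processing the wikitext before it reaches pandoc. The following Python sketch is only an illustration under assumptions: the function names are hypothetical, the regular expressions cover only the common [[File:...]], [[Image:...]] and [[Category:...]] forms, and it is an assumption that placing the listed files in the build directory is enough for LaTeX to find them.

```python
import re
from typing import Dict, List

# Matches [[File:Name.jpg|thumb|caption]] and the older [[Image:...]] prefix.
IMAGE_LINK = re.compile(r"\[\[\s*(?:File|Image)\s*:\s*([^|\]]+)", re.IGNORECASE)

# Matches [[Category:Name]] and [[Category:Name|sort key]], plus one trailing newline.
# [[:Category:Name]] (leading colon) is a visible link and is left alone.
CATEGORY_LINK = re.compile(r"\[\[\s*Category\s*:[^\]]*\]\]\n?", re.IGNORECASE)


def image_names(wikitext: str) -> List[str]:
    """List the file names used in image links, in order, without duplicates."""
    seen: Dict[str, None] = {}
    for match in IMAGE_LINK.finditer(wikitext):
        seen.setdefault(match.group(1).strip())
    return list(seen)


def strip_categories(wikitext: str) -> str:
    """Remove category assignments so they do not appear as raw text."""
    return CATEGORY_LINK.sub("", wikitext)
```

The list returned by image_names could be used to fetch the files into the build directory before running LaTeX; strip_categories would be applied to the input before conversion.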

Specific page tests

Bolded issues in the "Results" column prevented a successful build. Word count is prose size only; text in tables, templates, etc. is omitted from this count.

| Page | Attributes | Length | Built? | Results | Parsoid |
| --- | --- | --- | --- | --- | --- |
| John George Herriot | Very short article. No images. Only a stub template. No inline citations; just a list of references. | 20 words | Both pandoc and LaTeX build. | One reference appears empty. Stub template omitted. | 0 roundtrip differences. |
| Banksia violacea | {{Taxobox}}, {{clade}}. Multiple indentation. Images. | 1537 words | Pandoc builds, LaTeX does not. | Images not found. Fragment of image wikitext remains in the text. The taxobox and clade are omitted. Multiple indentation works fine. | 0 roundtrip differences. |
| Acetic acid | {{Chembox}}, chemical formulas in text. Images. | 3815 words | Pandoc builds, LaTeX does not. | Images not found. Special characters aren't compatible. No support for SVGs. Unknown control sequence. Fragment of a link remains in the text. Pronunciation information and chembox omitted. Extra bullet in external links list. Chemical formulas seem okay except where they include special characters; should be tested more. Tables are okay. | 0 roundtrip differences. |
| CFM International CFM56 | Aircraft infoboxes, {{jetspecs}}. More extensive tables. Images. | 5143 words | Neither pandoc nor LaTeX builds. | Spaces before IBX params cause pandoc to fail completely. Images not found. Part of a sentence is inexplicably missing. Large tables overhang page margin. Unit conversion template syntax appears in the output. Jetspecs and infobox templates omitted. Broken external link formatting. Awkward spacing when sections and subsections are adjacent. Adjacent reference numbers in text look like one number. | 36 syntactic differences. |
| List of fictional doctors | Hatnotes. Long, but not wide, tables. Red links. No images. | 128 words | Neither pandoc nor LaTeX builds. | A missing pipe and a space before a pipe in table syntax cause pandoc to fail. MAJOR LaTeX compilation issues, to the point where it was not worth fixing them all by hand for this analysis. Misplaced \noalign, \cr, & all over the place. Links are sometimes split between two lines, causing the \href{} command to break. Some brackets for wikilinks remain in the text. Hatnotes omitted. | |
| Modern Family (season 1) | Large (in both dimensions) and specially-formatted tables. {{Quote box}} and {{start date}}. Large number of references. Images. | 1920 words | Both pandoc and LaTeX build. | Fragment of image markup appears in output. The large, formatted table in the Episodes section is missing entirely, except for a chunk of wikimarkup that is displayed raw. Ratings table runs off the edge of the page, and much of the table formatting raises errors. {{Quote box}} is omitted. {{Start date}} is broken and showing raw syntax. One page consists entirely of footnotes. Footnotes on another page run off the end. Two of the external links bullets are empty, and one of the bullets appears to have a number overlapping it. | |
| Lazy evaluation | <source>, <syntaxhighlight> and <code> tags. Leading spaces to limit wiki markup/display monospace. No images. | 1388 words | Both pandoc and LaTeX build. | The <source> tags look lovely; they even have syntax highlighting! Leading spaces are ignored completely, which is very confusing. <ref>raw url</ref> is very broken, leaving the URL and some fragments of wikimarkup in the output. Colon-indenting only indents the first sentence, not the whole block. <syntaxhighlight> tags are forced onto their own lines, aren't highlighted, and run off the page if the code is too long. | |
| Babel (album) | {{Album ratings}}, {{track listing}}, {{Certification Table Entry}} and associated templates. Lists, tables, tables with multi-row cells. Images. | 1018 words | Both pandoc and LaTeX build. | {{Album ratings}}, {{track listing}}, and certification templates omitted. Tables split across several pages are not handled well. Lack of cell-dividing borders makes the multipart table hard to understand. Broken external link. Interlanguage link is shown. | |
| Molly | Disambiguation page. Red links. No images. | 4 words | Both pandoc and LaTeX build. | Red links are treated the same as blue links. Interlanguage link is shown. | |
| Chinese classifier | Chinese characters. Tables with Chinese characters. Colors declared with <span> tags. {{lang}}, {{du}}, {{multiple image}} templates. Harvard citations. | 5478 words | Neither pandoc nor LaTeX builds. | \|-style=... causes the pandoc build to fail. Once those are fixed, Chinese characters cause LaTeX to fail so much that it's not worth fixing by hand. From looking at the raw TeX, the colorful table is declared with far too many columns. There are a number of lists that consist only of empty bullet points. | |
| Ulysses (poem) | <poem> tags. Images. | 3413 words | Both pandoc and LaTeX build. | Images are missing, and LaTeX appears to try to replace them with alt text in a very inelegant way. HTML comment is replaced with a space (causing a space between a word and a comma). <poem> tags are ignored altogether. Indented text in poem tags is formatted as monospace text. | |
| Pi | Mathematical formulas, sometimes in tables or quote boxes. {{pi}}, {{sfrac}}, {{gaps}}, {{math}}, {{multiple image}}. Harvard refs. Images. | 6279 words | Both pandoc and LaTeX build. | All templates, including {{pi}}, omitted, which is quite an issue. Again, images are missing and have odd alt text. Some simpler mathematical formulas typeset nicely; others show up as raw LaTeX, oddly enough. ≈ is ignored completely. Whole chunks of raw LaTeX appear in various places. Tables run off the edge of the page. Raw image and ref syntax appears once in a while. | |
| British monarchs' family tree | Extensive use of {{chart}}. | 35 words | Both pandoc and LaTeX build. | Entire {{chart}} structure omitted, meaning the article is more or less devoid of its content. | |
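The "Built?" column uses three fixed phrases. If the tests were automated, those phrases could be derived from the exit codes of the two tools. This is a hypothetical Python sketch: the function name build_status is an assumption, and it assumes both tools signal failure with a non-zero exit code. The results in the table were recorded by hand, in some cases after fixing the input manually.

```python
from typing import Optional


def build_status(pandoc_rc: int, latex_rc: Optional[int]) -> str:
    """Map two exit codes to the wording used in the "Built?" column.

    latex_rc is None when LaTeX was never run because pandoc wrote no output.
    """
    if pandoc_rc != 0:
        # Without a .tex file there is nothing for LaTeX to build.
        return "Neither pandoc nor LaTeX builds."
    if latex_rc == 0:
        return "Both pandoc and LaTeX build."
    return "Pandoc builds, LaTeX does not."
```

A page counts as fully built only when both exit codes are zero, so a document that compiles while still raising table formatting errors would need a stricter check than this.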

Better solution

Essentially I made a better solution here http://de.wikibooks.org/wiki/Benutzer:Dirk_Huenniger/wb2pdf Dirk Hünniger (talk) 06:25, 16 May 2013 (UTC)