Offline content generator/2013 Code Sprint/Brainstorming


Let's rewrite the stack to make it more maintainable for ops.

  • MobileFrontend with minor tweaks to support hardcopy output (it could be defined as a device type), combined with something like dompdf (http://code.google.com/p/dompdf/), could make for a much faster and easier-to-support book / article PDF export feature.
  • If a PDF renderer supports CSS (which it of course should), you should be able to achieve what you need with a print stylesheet, without resorting to DOM transformations. The mobile site uses HTML transformations mostly for legacy reasons; we're working on minimising them to the point of total extinction for HTML user agents.
  • There's been some progress integrating BookType [dead link] into MediaWiki for this very purpose. ... It mostly does what one would want -- the only potential bump in the path that we found was its need for [...] 23+, which at the time PhantomJS didn't use. In terms of rendering, though, large tables are evil -- so are sidebars (and "see also" templates at the bottom of the page). The print view needs improvement here.
  • Replacing mwlib: Parsoid? Yeah, we have been talking about that since somebody from the book.js team visited the office a while ago. The rich RDFa info in Parsoid HTML makes it easy to massage the rendering for print (using DOM transformations and/or CSS; a rough sketch of such a DOM massage follows this list). The Kiwix folks already customize the Parsoid DOM for offline ZIM files and used this to export Wikivoyage during Wikimania. We could try an HTML-only print pipeline if at all possible. We should give book.js / booktype a more serious try!
  • One recommendation on this is some kind of embedded WebKit that can do HTML->PDF directly, either using the existing MediaWiki output or doing something via Parsoid. Brion did a little experimentation a few years ago using the WebKit version bundled with Qt4, which made it very easy to prototype since PDF output for printing was already built in. One issue to consider is bundling a large number of articles into one output document -- when it runs into hundreds of pages, he's not sure how well the web rendering engine will handle it. Will CPU and memory usage be reasonable, or go through the roof? Or is bundling multiple articles in one PDF actually a feature we want to keep? In the worst case, we could make sure adjacent articles have page boundaries, render each document separately, force the page numbers, and stitch the PDFs together (a sketch of this per-article approach is also below); book.js is capable of supporting this type of book generation for exactly this reason.
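
To make the Parsoid option a bit more concrete, here is a minimal Python sketch (requests + lxml) of fetching Parsoid HTML and stripping pieces that don't belong in print. The endpoint URL and the navbox/edit-section class selectors are assumptions for illustration only, not the actual pipeline.

  import requests
  from lxml import html
  
  # Assumed Parsoid endpoint; the real service URL would differ.
  PARSOID_HTML = "http://localhost:8000/localhost/v3/page/html/{title}"
  
  def fetch_parsoid_html(title):
      resp = requests.get(PARSOID_HTML.format(title=title))
      resp.raise_for_status()
      return resp.text
  
  def massage_for_print(doc_text):
      doc = html.fromstring(doc_text)
      # Drop navigation boxes and edit-section links; the class names here
      # are illustrative -- a real print profile would define the list.
      for node in doc.xpath('//*[contains(@class, "navbox") or '
                            'contains(@class, "mw-editsection")]'):
          node.getparent().remove(node)
      # Parsoid's RDFa markup (e.g. typeof="mw:Transclusion" plus data-mw)
      # could be inspected here to drop specific sidebar templates as well.
      return html.tostring(doc, encoding="unicode")
  
  if __name__ == "__main__":
      print(massage_for_print(fetch_parsoid_html("Book")))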
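And a sketch of the worst-case plan from the last bullet: render each article to its own PDF with Qt4's bundled WebKit (roughly what Brion prototyped) and stitch the parts together afterwards. PyQt4 and PyPDF2 are stand-ins chosen for the sketch; forcing continuous page numbers across the parts is left out.

  import sys
  from PyQt4 import QtCore, QtGui, QtWebKit
  from PyPDF2 import PdfFileMerger
  
  def render_article(url, out_path):
      """Load one article in QtWebKit and print it to its own PDF file."""
      view = QtWebKit.QWebView()
      loop = QtCore.QEventLoop()
      view.loadFinished.connect(loop.quit)
      view.load(QtCore.QUrl(url))
      loop.exec_()  # block until the page has finished loading
      printer = QtGui.QPrinter()
      printer.setOutputFormat(QtGui.QPrinter.PdfFormat)  # Qt's built-in PDF output
      printer.setOutputFileName(out_path)
      view.print_(printer)
  
  def build_book(urls, out_path):
      # A QApplication must exist before any QtWebKit widgets are created.
      app = QtGui.QApplication(sys.argv)
      parts = []
      for i, url in enumerate(urls):
          part = "article-%d.pdf" % i
          render_article(url, part)
          parts.append(part)
      # Stitch the per-article PDFs into one book; fixing up page numbers
      # would still have to happen in the print CSS or in a later pass.
      merger = PdfFileMerger()
      for part in parts:
          merger.append(part)
      merger.write(out_path)
      merger.close()
  
  if __name__ == "__main__":
      build_book(["https://en.wikipedia.org/wiki/PDF"], "book.pdf")

Rendering per article keeps WebKit's memory use bounded no matter how large the collection gets, which is the main worry raised above.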