Offline content generator/2013 Code Sprint

Things to address:


 * 1) the Collection extension which is used to manage sets of articles in the Book: namespace and its various equivalents on different projects;
 * 2) the support for print-on-demand books rendered via PediaPress itself;
 * 3) the support for download of collections in ODT, EPUB and ZIM format;
 * 4) the support for download of individual pages or collections in PDF format.

The straight PDF export relies on the PediaPress mwlib parsing library -- it would be good to replace this. If the EPUB and ZIM formats simply bundle output from the API, then they are less problematic. The Collection extension is a useful tool, although it could use a UX overhaul at some point. Print-on-demand is old-school, but nice to have for a while longer.

One nice feature of the PediaPress PDF library is that it renders images in print resolution rather than screen resolution. That's a feature to retain in whatever future approach we adopt.


 * Higher-resolution imagery support is something we've added for screens in the meantime -- for "retina" quality displays on mobile & desktop web browsers. It should actually be really easy to make sure that the high-res images get used for print-targeted output wherever we already support them on-screen. In a few places like math, we need to do some more work (either using the MathJax mode within the offscreen HTML rendering, or changing the texvc backend to render high-res images as well).
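
As a rough illustration of the resolution point above, here is a minimal sketch of how a print pipeline might decide what pixel width to request for an image. It relies on the CSS convention that 1 inch equals 96 CSS pixels; the 300 dpi target and the function name `print_pixel_width` are assumptions made for this example, not part of any existing MediaWiki code.

```python
import math
from typing import Optional

CSS_PX_PER_INCH = 96  # CSS defines 1in as 96px


def print_pixel_width(css_width_px: int,
                      print_dpi: int = 300,  # assumed print target, not from the notes
                      original_width_px: Optional[int] = None) -> int:
    """Return the pixel width to request so that an image laid out at
    css_width_px on the page is printed at roughly print_dpi."""
    wanted = math.ceil(css_width_px * print_dpi / CSS_PX_PER_INCH)
    if original_width_px is not None:
        # Never ask for more pixels than the uploaded original has.
        wanted = min(wanted, original_width_px)
    return wanted


if __name__ == "__main__":
    # A 220px-wide thumbnail, with and without a small original.
    print(print_pixel_width(220))
    print(print_pixel_width(220, original_width_px=500))
```

The same scaling idea is what "retina" screen support already does with fixed 1.5x and 2x factors; print simply needs a larger factor.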

Let's rewrite the stack to make it more maintainable for ops.
 * If we're looking at what our users are asking for, then it's Indic and complex language support. This came up at Wikimania from multiple volunteers.


 * MobileFrontend with minor tweaks to support hardcopy output (it could be defined as a device type) combined with something like dompdf (http://code.google.com/p/dompdf/) could make for a much faster and easier-to-support book / article PDF export feature.


 * If a PDF renderer supports CSS (which it of course should), you should be able to achieve what you need with a print stylesheet, without resorting to DOM transformations. The mobile site uses HTML transformations mostly for legacy reasons; we're working on minimising them to the point of total extinction for HTML user agents.


 * There's been some progress integrating BookType into MediaWiki for this very purpose. ... mostly does what one would want -- the only potential bump in the path that we found was its need for from 23+ which at the time PhantomJS didn't use. In terms of rendering, though, large tables are evil -- and so are sidebars (or "see also" templates at the bottom of the page). The print view needs to handle these better.


 * Replacing mwlib: Parsoid? Yeah, we have been talking about that since somebody from the book.js team visited the office a while ago. The rich RDFa info in Parsoid HTML makes it easy to massage the rendering for print (using DOM transformations and/or CSS). The Kiwix folks already customize the Parsoid DOM for offline ZIM files and used this to export Wikivoyage during Wikimania. We could try an HTML-only print pipeline if at all possible. We should give book.js / BookType a more serious try!
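
To make the "DOM transformations" idea a bit more concrete, here is a minimal sketch that drops print-hostile elements from a page before handing it to a renderer. It assumes the input is well-formed XHTML that the standard library XML parser can read, and the class names `navbox` and `sidebar` are placeholders chosen for illustration; a real pipeline would more likely key off the RDFa attributes in Parsoid output.

```python
import xml.etree.ElementTree as ET

# Hypothetical class names for elements that render badly in print.
PRINT_HOSTILE_CLASSES = {"navbox", "sidebar"}


def _drop(parent: ET.Element, child: ET.Element) -> None:
    """Remove child from parent, keeping any text that followed it."""
    tail = child.tail or ""
    index = list(parent).index(child)
    if index > 0:
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + tail
    else:
        parent.text = (parent.text or "") + tail
    parent.remove(child)


def strip_for_print(xhtml: str) -> str:
    """Return the document with print-hostile elements removed."""
    root = ET.fromstring(xhtml)
    # Snapshot the tree first so removals do not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            classes = set(child.get("class", "").split())
            if classes & PRINT_HOSTILE_CLASSES:
                _drop(parent, child)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    page = ('<div><p>Body</p>'
            '<table class="navbox"><tr><td>links</td></tr></table>'
            'after</div>')
    print(strip_for_print(page))
```

Where the renderer honours print stylesheets, hiding the same elements with CSS would achieve a similar result without touching the DOM.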


 * One recommendation on this is some kind of embedded WebKit that can do HTML->PDF directly, either using the existing MediaWiki output or doing something via Parsoid. Brion did a little experimentation a few years ago using the WebKit version bundled with Qt4, which made it very easy to prototype as the PDF output for printing support was already built-in. One issue to consider is bundling a large number of articles into one output document -- when it runs into hundreds of pages he's not sure how well the web rendering engine will handle it. Will CPU and memory usage be reasonable or go through the roof? Or is bundling multiple articles in one PDF actually a feature we want to keep? In the worst case, we could make sure adjacent articles have page boundaries, render each document separately, force the page numbers, and stitch the PDFs together... (book.js is capable of supporting this type of book generation for exactly this reason.)
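
The "force the page numbers and stitch" fallback mostly comes down to bookkeeping: once each article has been rendered on its own and its page count is known, the starting page of every article in the combined book follows directly. A minimal sketch of that step is below; the function name `assign_page_ranges` and the `start_on_odd` option are assumptions for illustration, and the actual rendering and PDF merging (which would need a real PDF tool) are deliberately left out.

```python
from typing import List, Tuple


def assign_page_ranges(page_counts: List[Tuple[str, int]],
                       first_page: int = 1,
                       start_on_odd: bool = False) -> List[Tuple[str, int, int]]:
    """Given (title, page_count) pairs in book order, return
    (title, first_page, last_page) for each article.

    With start_on_odd=True every article begins on an odd (right-hand)
    page, which implies a blank page after any article that ends on an
    odd page.
    """
    ranges = []
    next_page = first_page
    for title, count in page_counts:
        if start_on_odd and next_page % 2 == 0:
            next_page += 1  # skip one page so the article starts on an odd page
        ranges.append((title, next_page, next_page + count - 1))
        next_page += count
    return ranges


if __name__ == "__main__":
    counts = [("Article A", 3), ("Article B", 5), ("Article C", 2)]
    for title, first, last in assign_page_ranges(counts):
        print(f"{title}: pages {first}-{last}")
```

Each article could then be re-rendered (or stamped) with its forced starting page number, which keeps the memory cost per render bounded by the longest single article rather than the whole book.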


 * There are severe Collection bugs, e.g. https://bugzilla.wikimedia.org/show_bug.cgi?id=47575#c3, which requires a fix for https://bugzilla.wikimedia.org/show_bug.cgi?id=47867