Offline content generator/2013 Code Sprint

IRC Channel: #mediawiki-pdfhack on irc.freenode.net

Git Repository: https://gerrit.wikimedia.org/r/mediawiki/extensions/Collection/OfflineContentGenerator.git

Labs Test Instance: http://mwalker-enwikinews.instance-proxy.wmflabs.org/

Primary participants

 * Matt Walker
 * Max Semenik
 * Brad Jorsch
 * C. Scott Ananian
 * potentially, Jeff Green
 * E. Engelhart

Goals

 * Primary goal
   * Resolve the dependency on mwlib and the PediaPress rendering setup for PDF generation, both for single-page documents and for collections built with the Collection extension. The minimum viable release is an additional PDF output option available via the Collection extension, which could potentially allow the old rendering pipeline to be phased out. Rationale: the current service and architecture are not maintainable, and we would like to clean things up before moving out of our old Tampa data center into our new primary data center.


 * Stretch goals
   * Continued support for the PediaPress print-on-demand service, completely separate from Wikimedia's internal service.
   * Fully replace the old pipeline in the course of the sprint.
   * Support for other formats. Highest priority: ZIM for offline use.
   * Improvements to PDF layout.

Requirements

 * Functional requirements
   * Needs to integrate with the Collection extension.
   * Needs to append legally required licensing information.
   * Needs to include images at print-level resolution.


 * Architectural questions
   * Use Parsoid HTML5 output and PhantomJS for PDF generation? (Spec here: https://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec)
     * Need to parse collections (lists of articles in plaintext format) to aggregate potentially multiple Parsoid HTML files into one.
     * Apply some nice transformations.
     * Ideally, get author names and image credits.
     * Prepend and append some material (maybe a TOC).
     * PhantomJS with its rasterize.js example script can produce basic PDF output from HTML input.
   * Serve individual files through MediaWiki, as Collection currently does?
   * Hardware requirements: what is the current system load? Provision VMs to test in Labs?
   * How do we make the service fully puppetized and deployable? What are the key dependencies? What about security aspects, e.g. private wikis?
   * Caching strategies for PDFs once generated?
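
The aggregation step raised in the first architectural question could look roughly like the sketch below. It is only an illustration, not the sprint's implementation. It assumes a simplified plaintext collection format (";Chapter" lines followed by ":[[Article]]" lines), and it assumes the per-article HTML bodies have already been fetched from Parsoid into a dictionary. The names parse_collection, build_document, bodies and license_html are hypothetical.

```python
import html
import re
from dataclasses import dataclass, field


@dataclass
class Chapter:
    title: str
    articles: list = field(default_factory=list)


# Matches ":[[Article]]" or ":[[Article|label]]" (assumed collection syntax).
ARTICLE_RE = re.compile(r"^:\s*\[\[([^\]|]+)(?:\|[^\]]*)?\]\]\s*$")


def parse_collection(text):
    """Parse the assumed plaintext collection format into chapters."""
    # Articles listed before any chapter heading land in an untitled chapter.
    chapters = [Chapter(title="")]
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith(";"):
            chapters.append(Chapter(title=line[1:].strip()))
            continue
        match = ARTICLE_RE.match(line)
        if match:
            chapters[-1].articles.append(match.group(1).strip())
    # Drop chapters that ended up with no articles.
    return [c for c in chapters if c.articles]


def build_document(chapters, bodies, license_html):
    """Join per-article HTML into one document with a TOC and a licensing footer."""
    toc, sections = [], []
    count = 0
    for chapter in chapters:
        for title in chapter.articles:
            count += 1
            anchor = f"article-{count}"
            safe = html.escape(title)
            toc.append(f'<li><a href="#{anchor}">{safe}</a></li>')
            # bodies[title] is assumed to be already-sanitized article HTML.
            sections.append(
                f'<section id="{anchor}"><h1>{safe}</h1>{bodies[title]}</section>'
            )
    return (
        '<!DOCTYPE html><html><head><meta charset="utf-8"></head><body>'
        f"<nav><ol>{''.join(toc)}</ol></nav>"
        + "".join(sections)
        + f"<footer>{license_html}</footer></body></html>"
    )
```

The combined HTML file could then be handed to PhantomJS, for example through its rasterize.js example script, to produce the PDF. Author names and image credits would still need to be collected separately and added to the footer.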

Longer term questions

 * Issues ownership
 * Follow-up sprint(s) consistent with the Tampa (TPA) migration timeline
 * Cross-datacenter failover?

Docs

 * Little bits of documentation of the old setup here: PDF Servers
 * Jeff's prior cleanup effort: mwlib
 * mwoffline: a Node.js-based solution that uses Parsoid output to generate ZIM files
 * Tentative Architecture
 * PDF rendering/Brainstorming