Offline content generator/2013 Code Sprint

Start:	2013-11-12
End:	2013-12
Team members:	Matt Walker, Max Semenik, Brad Jorsch
Lead:	C. Scott Ananian

IRC Channel: #mediawiki-pdfhack on irc.freenode.net
Git Repository: https://gerrit.wikimedia.org/r/mediawiki/extensions/Collection/OfflineContentGenerator.git
Labs Test Instance: http://mwalker-enwikinews.instance-proxy.wmflabs.org/

Primary participants

Remove the dependency on mwlib and the PediaPress rendering setup for PDF generation, both for single-page documents and for collections built with the Collection extension. The minimum viable release would be an additional PDF output option available via the Collection extension, so that the old rendering pipeline can potentially be phased out. Rationale: the current service and architecture are not maintainable, and we'd like to clean things up before moving out of our old Tampa data center into our new primary data center.

Continued support for the PediaPress print-on-demand service, completely separate from Wikimedia's internal service:
- /print-on-demand service
- Deprecating print-on-demand functionality

Use Parsoid HTML5 output and PhantomJS for PDF generation? (Spec here: https://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec)
- need to parse collections (lists of articles in plaintext format) to aggregate potentially multiple Parsoid HTML files into one.
- apply some nice transformations
- ideally collect author names and image credits
- prepend and append extra material (maybe a table of contents)
- PhantomJS with its rasterize.js example script can produce basic PDF output from HTML input
- serve individual files through MediaWiki, as Collection currently does?
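
A rough sketch of the steps above, to make the idea concrete. Everything here is an assumption rather than a decided design: the collection is assumed to be plaintext with one `:[[Title]]` entry per line, the Parsoid output for each article is assumed to be already fetched and reduced to its body HTML, and the names `parse_collection`, `merge_articles` and `rasterize_command` are hypothetical. The last helper only builds the command line for PhantomJS's rasterize.js example script (URL, output file, paper format); it does not run it.

```python
import html
import re
from typing import List, NamedTuple


class Article(NamedTuple):
    title: str
    body_html: str  # assumed: inner HTML of the Parsoid <body>, fetched elsewhere


# Assumed entry format: ":[[Title]]" or ":[[Title|Display text]]", one per line.
ENTRY_RE = re.compile(r"^:\[\[([^\]|]+)(?:\|[^\]]*)?\]\]\s*$")


def parse_collection(text: str) -> List[str]:
    """Return the article titles listed in a plaintext collection, in order."""
    titles = []
    for line in text.splitlines():
        match = ENTRY_RE.match(line.strip())
        if match:
            titles.append(match.group(1).strip())
    return titles


def merge_articles(book_title: str, articles: List[Article]) -> str:
    """Combine per-article HTML into one document with a prepended TOC."""
    toc_items = []
    sections = []
    for index, article in enumerate(articles, start=1):
        anchor = f"article-{index}"
        safe_title = html.escape(article.title)
        toc_items.append(f'<li><a href="#{anchor}">{safe_title}</a></li>')
        sections.append(
            f'<section id="{anchor}"><h1>{safe_title}</h1>\n'
            f"{article.body_html}\n</section>"
        )
    parts = [
        "<!DOCTYPE html>",
        f'<html><head><meta charset="utf-8"><title>{html.escape(book_title)}</title></head>',
        "<body>",
        '<nav id="toc"><ol>',
        *toc_items,
        "</ol></nav>",
        *sections,
        "</body></html>",
    ]
    return "\n".join(parts)


def rasterize_command(html_path: str, pdf_path: str, paper: str = "A4") -> List[str]:
    """Build (but do not run) the PhantomJS rasterize.js invocation."""
    return ["phantomjs", "rasterize.js", html_path, pdf_path, paper]
```

Author names and image credits are not handled here; they would have to be appended as an extra section once we know where to get them from.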
Hardware requirements; current system load? Provision VMs to test in Labs?
How to make service fully puppetized and deployable? Key dependencies? Security aspects e.g. private wikis?
Caching strategies for PDFs once generated?
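
One possible answer to the caching question, as a sketch only: key each generated PDF by a hash of what went into it, so an unchanged collection can be served from cache and any new revision produces a new key. This assumes a revision ID is available for every article in the collection; `pdf_cache_key` and the `renderer_version` parameter are hypothetical names, not part of any existing code.

```python
import hashlib
import json
from typing import List, Tuple


def pdf_cache_key(
    entries: List[Tuple[str, int]],  # (article title, revision ID), in book order
    renderer_version: str,           # bump to invalidate everything after a renderer change
    paper: str = "A4",
) -> str:
    """Return a hex SHA-256 digest identifying one rendered PDF."""
    payload = json.dumps(
        {
            "entries": [[title, revision] for title, revision in entries],
            "renderer": renderer_version,
            "paper": paper,
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Article order is part of the key on purpose, since reordering a collection changes the resulting PDF.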