User:Bmansurov (WMF)/Alternative way of generating PDF books

= Alternative way of generating PDF books =

Extension:Collection allows users to create books from wiki pages. Proton PDF Generator (PPG, see below) is a back-end for the extension that lets users download books as PDF files. The PDF file itself is generated by Extension:ElectronPdfService; Extension:Collection is the glue between these two services.

Before settling on this solution, we also looked at `wkhtmltopdf` and `Vivliostyle`. Neither fit our needs, for one reason or another (see the Related Links section below).

== Bird's eye view ==
Our plan for creating this back-end is as follows (the steps that led us here are omitted; see the Related Links section below for context):
 * Build out concatenation - Build out article concatenation according to requirements for books
 * Create a script to post-process PDF and add page numbers and table of contents - Create a library to post-process PDF and add page numbers and table of contents
 * Spike on how to expose HTML concatenation and PDF post-processing scripts - [Spike] How should the HTML concatenation and PDF post-processing scripts be exposed for use by Extension:Collection
 * Expose HTML concatenation and PDF post-processing scripts - Expose HTML concatenation and PDF post-processing scripts
 * Hook up the concatenation and post-processing scripts to Extension:Collection and generate the final PDF - Use PDF post-processing library to generate final PDF
 * Create an option to download books using the new back-end - Add an option in Special:Book to download PDFs generated by ElectronPdfService

== Proton PDF Generator ==
PPG does two things: it creates a single HTML file from a list of articles (we'll call this 'concatenation' for short), and it adds a table of contents and page numbers to a PDF file (we'll call this 'PDF post-processing' for short). See the sections below for details.

=== HTML concatenation ===

 * The required changes were first listed in https://phabricator.wikimedia.org/T163272 (see the section called 'Outcomes' in the task description). A modified list is given at https://phabricator.wikimedia.org/T171964#3514554. The main difference is that the former assumed `wkhtmltopdf` for PDF generation, whereas the latter assumes ElectronPdfService.
 * We decided to implement this feature in Python in https://phabricator.wikimedia.org/T171964#3530721. Python was chosen because the PDF post-processing (see the section below) needs to be done in Python, so it made sense to bundle concatenation with PDF post-processing and make both features available as PPG. A proof-of-concept patch in PHP (with hacks) exists at https://gerrit.wikimedia.org/r/#/c/361453/6. Doing concatenation inside Extension:Collection would require a lot of upfront work, because the extension would first need tidying up, which hasn't been done for many years. See [Spike - 8 hrs] Where should article concatenation be implemented? for details on why JavaScript and PHP were not chosen.
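
To make the idea of concatenation concrete, here is a minimal Python sketch. It is not PPG's actual code: the function name `concatenate`, the `(title, body_html)` input shape, and the `article-N` ids are assumptions made for illustration, and the real requirements for books are the ones listed in the Phabricator tasks above.

```python
from html import escape


def concatenate(articles, book_title):
    """Join article bodies into one HTML document.

    `articles` is a list of (title, body_html) pairs. This input shape is
    hypothetical; body_html is assumed to be an already-rendered fragment.
    """
    sections = []
    for index, (title, body_html) in enumerate(articles, start=1):
        # One <section> per article, in book order, with an escaped heading.
        sections.append(
            f'<section id="article-{index}" class="article">\n'
            f"<h1>{escape(title)}</h1>\n"
            f"{body_html}\n"
            "</section>"
        )
    return (
        "<!DOCTYPE html>\n"
        '<html>\n<head>\n<meta charset="utf-8">\n'
        f"<title>{escape(book_title)}</title>\n"
        "</head>\n<body>\n" + "\n".join(sections) + "\n</body>\n</html>\n"
    )
```

Giving each article its own section with a stable id is one possible way to keep article boundaries recoverable in the single HTML file.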

=== PDF post-processing ===
Once the HTML is concatenated, it's fed to Extension:ElectronPdfService, which outputs a PDF file. We then modify the PDF file with the Python library `pdfrw` to add page numbers and a table of contents with page numbers. We also looked at PHP libraries but found nothing comparable to `pdfrw`. See https://phabricator.wikimedia.org/T168871 for more details.
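
One detail of this step is that inserting a table of contents at the front shifts every article's page number. The sketch below shows only that arithmetic, using the standard library; it does not touch the PDF and does not use `pdfrw`. The function name `build_toc`, the `(title, page_count)` input, and both default parameters are assumptions made for illustration, not values taken from PPG.

```python
import math


def build_toc(chapters, entries_per_toc_page=40, front_matter_pages=1):
    """Return (title, first_page) pairs for a table of contents.

    `chapters` is a list of (title, page_count) pairs in book order.
    Both defaults are hypothetical: 40 entries per TOC page and a
    single front-matter page (for example a cover).
    """
    # The TOC occupies at least one page, even for an empty book.
    toc_pages = max(1, math.ceil(len(chapters) / entries_per_toc_page))
    # The first article starts right after the front matter and the TOC.
    next_page = front_matter_pages + toc_pages + 1
    toc = []
    for title, page_count in chapters:
        toc.append((title, next_page))
        next_page += page_count
    return toc
```

For example, with one cover page and a one-page table of contents, three articles of 3, 2 and 5 pages would start on pages 3, 6 and 8.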

== Related Links ==

 * Parent task - [EPIC] (Proposal) Replicate core OCG features and sunset OCG service
 * Why not `wkhtmltopdf` - Architecture of new rendering backend for Extension:Collection
 * Why `wkhtmltopdf` - [Spike 6hrs] Investigate ability of wkhtmltopdf to render single articles
 * Why not `vivliostyle` - [Spike 6hrs] Investigate ability of vivliostyle to render single articles