User:Bmansurov (WMF)/Alternative way of generating PDF books

Extension:Collection allows users to create books from wiki pages. Proton PDF Generator (PPG) is a back-end for the extension that lets users download those books in PDF format.

The PDF file itself is generated by Extension:ElectronPdfService. Extension:Collection is the glue between PPG and that service.

Before coming up with this solution, we also looked at `wkhtmltopdf` and `Vivliostyle`, neither of which fit our needs for one reason or another (see below for links).

Bird's eye view

Our plan for creating this back-end is as follows (the steps that took us here have been omitted; see the related links section below for context):

Build out concatenation - Build out article concatenation according to requirements for books
Create a script to post-process PDF and add page numbers and table of contents - Create a library to post-process PDF and add page numbers and table of contents
Spike on how to expose HTML concatenation and PDF post-processing scripts - [Spike] How should the HTML concatenation and PDF post-processing scripts be exposed for use by Extension:Collection
Expose HTML concatenation and PDF post-processing scripts - Expose HTML concatenation and PDF post-processing scripts
Hook up the concatenation and post-processing scripts to Extension:Collection and generate the final PDF - Use PDF post-processing library to generate final PDF
Create an option to download books using the new back-end - Add an option in Special:Book to download PDFs generated by ElectronPdfService

Proton PDF Generator

PPG is a yet-to-be-built service that does two things:

creates a single HTML file from a list of articles (we'll call this 'concatenation' for short)
adds the table of contents and page numbers to a PDF file (we'll call this 'PDF post-processing' for short).

See the sections below for more details.

HTML concatenation

The required changes were first listed in https://phabricator.wikimedia.org/T163272 (see the section called 'Outcomes' in the task description). A modified list is given at https://phabricator.wikimedia.org/T171964#3514554. The main difference is that in the former we were thinking about using `wkhtmltopdf` for PDF generation, whereas in the latter we're thinking about using Extension:ElectronPdfService.
We decided to implement this feature in Python in https://phabricator.wikimedia.org/T171964#3530721. Python was chosen because the PDF post-processing (see the section below) needs to be done in Python, so it made sense to bundle concatenation with PDF post-processing and make both features available as PPG. A proof-of-concept patch in PHP (with hacks) exists at https://gerrit.wikimedia.org/r/#/c/361453/6. Doing concatenation in Extension:Collection would require a lot of upfront work, because the extension first needs tidying up, which hasn't been done for many years. See [Spike - 8 hrs] Where should article concatenation be implemented? for details about why JavaScript and PHP were not chosen.
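
As a rough illustration of what 'concatenation' means here, the sketch below joins already-fetched article bodies into a single HTML document. It is not the PPG implementation: the function name `concatenate_articles`, the `article-N` ids, and the input shape (a list of title and body pairs) are all assumptions made for this example, and the real requirements in the tasks linked above cover much more than this.

```python
from html import escape


def concatenate_articles(book_title, articles):
    """Join (title, body_html) pairs into one HTML document.

    Hypothetical helper, not part of PPG. Each article body is wrapped
    in a <section> with a predictable id so that a later step could
    link to it from a table of contents.
    """
    sections = []
    for index, (title, body_html) in enumerate(articles, start=1):
        # Titles are escaped; bodies are assumed to be HTML already.
        sections.append(
            '<section id="article-{0}">\n<h1>{1}</h1>\n{2}\n</section>'.format(
                index, escape(title), body_html
            )
        )
    return (
        "<!DOCTYPE html>\n"
        '<html>\n<head>\n<meta charset="utf-8">\n'
        "<title>{0}</title>\n</head>\n<body>\n{1}\n</body>\n</html>\n"
    ).format(escape(book_title), "\n".join(sections))
```

Calling `concatenate_articles("My book", [("Earth", "<p>...</p>"), ("Moon", "<p>...</p>")])` returns one HTML string with two sections in book order, which is the kind of input a single HTML-to-PDF rendering pass expects.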

PDF post-processing

Once the HTML is concatenated, it's fed to Extension:ElectronPdfService, which outputs a PDF file. We then modify that PDF file to add page numbers and a table of contents (with page numbers) using the Python library `pdfrw`. We also looked at PHP libraries but found nothing comparable to `pdfrw`. See https://phabricator.wikimedia.org/T168871 for more details.
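
The bookkeeping behind a table of contents with page numbers can be sketched independently of any PDF library. The example below assumes we already know how many pages each article occupies in the generated PDF (how that is measured is not shown) and that the table of contents is placed before the articles. The helper names `build_toc` and `format_toc` are hypothetical, and the step that actually writes the text into the PDF with `pdfrw` is deliberately left out.

```python
def build_toc(articles, toc_pages=1):
    """Return (title, start_page) pairs for a book.

    `articles` is a list of (title, page_count) pairs in book order.
    Page 1 is the first page of the final PDF, and the table of
    contents is assumed to occupy the first `toc_pages` pages.
    """
    entries = []
    next_page = toc_pages + 1
    for title, page_count in articles:
        entries.append((title, next_page))
        next_page += page_count
    return entries


def format_toc(entries, width=40):
    """Render TOC entries as dotted lines, e.g. 'Title ..... 12'."""
    lines = []
    for title, page in entries:
        page_text = str(page)
        # Always keep at least two dots, even for very long titles.
        dots = "." * max(2, width - len(title) - len(page_text) - 2)
        lines.append("{0} {1} {2}".format(title, dots, page_text))
    return "\n".join(lines)
```

One design point this makes visible: inserting the table of contents shifts every article's page number by `toc_pages`, so the entries have to be computed with that offset already applied.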

Related Links