User:Bmansurov (WMF)/Alternative way of generating PDF books
Extension:Collection allows users to create books from wiki pages. Proton PDF Generator (PPG) is a back-end for the extension that lets users download books in PDF format.
Generation of the PDF file is done via Extension:ElectronPdfService. Extension:Collection is the glue between these two services.
Before coming up with this solution, we also looked at `wkhtmltopdf` and `Vivliostyle`, neither of which fit our needs for one reason or another (see below for links).
Bird's eye view
Our plan for creating this back-end is as follows (the steps that took us here have been omitted; see the related links section below for context):
- Build out concatenation - Build out article concatenation according to requirements for books
- Create a script to post-process PDF and add page numbers and table of contents - Create a library to post-process PDF and add page numbers and table of contents
- Spike on how to expose HTML concatenation and PDF post-processing scripts - [Spike] How should the HTML concatenation and PDF post-processing scripts be exposed for use by Extension:Collection
- Expose HTML concatenation and PDF post-processing scripts
- Hook up the concatenation and post-processing scripts to Extension:Collection and generate the final PDF - Use PDF post-processing library to generate final PDF
- Create an option to download books using the new back-end - Add an option in Special:Book to download PDFs generated by ElectronPdfService
Proton PDF Generator
PPG is a yet-to-be-built service that does two things:
- creates a single HTML file from a list of articles (we'll call this 'concatenation' for short)
- adds the table of contents and page numbers to a PDF file (we'll call this 'PDF post-processing' for short).
See the sections below for more details.
HTML concatenation
- The required changes were first listed in https://phabricator.wikimedia.org/T163272 (see the section called 'Outcomes' in the task description). A modified list is given at https://phabricator.wikimedia.org/T171964#3514554. The main difference is that in the former we were thinking about using `wkhtmltopdf` for PDF generation, while in the latter we're thinking about using 'ElectronPdfService'.
- In https://phabricator.wikimedia.org/T171964#3530721 we decided to implement this feature in Python, because the PDF post-processing (see the section below) needs to be done in Python. It therefore made sense to bundle concatenation with PDF post-processing and make both features available as PPG. A proof-of-concept PHP patch (with hacks) exists at https://gerrit.wikimedia.org/r/#/c/361453/6. Doing concatenation in Extension:Collection would require a lot of upfront work, because the extension has not been tidied up for many years. See [Spike - 8 hrs] Where should article concatenation be implemented? for details about why JavaScript and PHP were not chosen.
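As a rough illustration of what 'concatenation' means here, the following is a minimal Python sketch that joins already-rendered article bodies into a single HTML document. The function name `concatenate_articles`, the `(title, body_html)` input shape, and the `section`/`id` markup are assumptions made for this example; they are not taken from the tasks linked above, and the real requirements for books are in those tasks.

```python
from html import escape


def concatenate_articles(book_title, articles):
    """Join (title, body_html) pairs into one HTML document.

    Hypothetical helper: `articles` is a list of (title, body_html) tuples,
    where body_html is assumed to be already-rendered, trusted article HTML.
    """
    sections = []
    for index, (title, body_html) in enumerate(articles, start=1):
        # Each article gets its own section with a stable id, so later
        # steps can tell where one article ends and the next begins.
        sections.append(
            '<section class="article" id="article-{n}">\n'
            '<h1>{title}</h1>\n'
            '{body}\n'
            '</section>'.format(n=index, title=escape(title), body=body_html)
        )
    return (
        '<!DOCTYPE html>\n'
        '<html>\n<head>\n<meta charset="utf-8">\n'
        '<title>{title}</title>\n</head>\n<body>\n{sections}\n</body>\n</html>\n'
    ).format(title=escape(book_title), sections='\n'.join(sections))


if __name__ == '__main__':
    print(concatenate_articles('Sample book', [
        ('First article', '<p>Body of the first article.</p>'),
        ('Second article', '<p>Body of the second article.</p>'),
    ]))
```

Titles are escaped because they are plain text, while article bodies are inserted as-is on the assumption that they are already valid HTML.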
PDF post-processing
Once the HTML is concatenated, it is fed to Extension:ElectronPdfService, which outputs a PDF file. We then modify that PDF to add page numbers and a table of contents (with page numbers) using the Python library `pdfrw`. We also looked at PHP libraries but found nothing comparable to `pdfrw`. See https://phabricator.wikimedia.org/T168871 for more details.
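The bookkeeping behind a table of contents with page numbers can be sketched independently of any PDF library. The example below is only an illustration under stated assumptions: it assumes each article's page count is already known (in practice it would have to be read from the generated PDF), and that the table of contents is placed at the front and occupies `toc_pages` pages. The names `build_toc` and `format_toc` are hypothetical, and the code does not read or write PDF files or use `pdfrw`.

```python
def build_toc(articles, toc_pages=1):
    """Return (title, first_page) pairs for a book.

    `articles` is a list of (title, page_count) tuples in reading order.
    Assumption: the table of contents comes first and takes `toc_pages`
    pages, so the first article starts on page toc_pages + 1.
    """
    entries = []
    next_page = toc_pages + 1
    for title, page_count in articles:
        if page_count < 1:
            raise ValueError('page_count must be at least 1: %r' % title)
        entries.append((title, next_page))
        next_page += page_count
    return entries


def format_toc(entries, width=40):
    """Render entries as dotted lines, e.g. 'Title ........ 12'."""
    lines = []
    for title, page in entries:
        page_text = str(page)
        # Keep at least two dots even when the title is long.
        dots = '.' * max(2, width - len(title) - len(page_text) - 2)
        lines.append('{} {} {}'.format(title, dots, page_text))
    return lines


if __name__ == '__main__':
    book = [('Alpha', 3), ('Beta', 1), ('Gamma', 2)]
    for line in format_toc(build_toc(book)):
        print(line)
```

With a one-page table of contents, the three sample articles start on pages 2, 5, and 6 respectively.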
Related Links
- Parent task - [EPIC] (Proposal) Replicate core OCG features and sunset OCG service
- Why not `wkhtmltopdf` - Architecture of new rendering backend for Extension:Collection
- Why `wkhtmltopdf` - [Spike 6hrs] Investigate ability of wkhtmltopdf to render single articles
- Why not `vivliostyle` - [Spike 6hrs] Investigate ability of vivliostyle to render single articles