Setup a render server on Ubuntu 14.04 LTS

From mediawiki.org
The Collection extension has been replaced by the ElectronPdfService extension; for details, read Reading/Web/PDF Functionality. You can still use Collection, and this document describes how to set up your own render server on Ubuntu Server 14.04 LTS (64-bit). The same steps were tried on Ubuntu Server 16.04 LTS, but with mixed and unreliable results.

The Collection extension lets you download single pages or generate complete books from articles on your wiki. It is also used on MediaWiki and the Wikipedia sites. The Pdf Export extension can also be used to output pages to pdf. These extensions can use a public render server, such as the one at PediaPress used by MediaWiki and the Wikipedia websites. If you can't or don't want to use a public render server, it is relatively easy to set up your own. Documentation on the subject exists, but depending on what you want your render server to do, you need to combine many sources on the web and several Talk pages on MediaWiki. Because of this, setting up your own render server can be a little frustrating. This document is an attempt to make the process clearer and hopefully a little less frustrating.

It is possible to set up a render server on many different Linux distros or even on an MS Windows box, but even with just some basic Linux know-how it is "easier" to use a Linux distro. If you are "forced" into an MS Windows environment and can't set up a Linux box, you can use VirtualBox or any other virtualization software on the Windows host to virtualize your Linux distro. You will still be able to run your own render server; this works just fine.

Ubuntu Server 14.04 LTS (64bit) was used to write this documentation.

What you get is your own render server that outputs pdf files from a MediaWiki wiki.

Install Ubuntu

Install a clean version of Ubuntu Server 14.04 LTS (64-bit). After the installation has finished and Ubuntu has restarted, your "render server" needs to be connected to the internet to install the rest of the software. With a direct connection all is fine, but if you have to go through a proxy server, one way of setting this up is to use the apt.conf file located in /etc/apt/. Type:

  1. sudo nano /etc/apt/apt.conf
    
  2. Add the following line with your own user and proxy setup:
    Acquire::http::proxy "http://username:password@proxyname:portnumber";
    
  3. Then save the file.
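
The proxy line from step 2 can also be generated, which helps when the password contains characters such as "@" or ":" that would otherwise break the URL. The following is only a sketch in Python 3 (not the Python 2.7 that mwlib runs on); the function name apt_proxy_line is made up for this illustration, and it assumes that apt decodes percent-encoded credentials again.

```python
from urllib.parse import quote


def apt_proxy_line(username, password, host, port):
    """Return an Acquire::http::proxy line for /etc/apt/apt.conf."""
    # Percent-encode the credentials so '@', ':' and similar characters
    # cannot be mistaken for URL delimiters. That apt decodes them again
    # is an assumption of this sketch.
    user = quote(username, safe="")
    secret = quote(password, safe="")
    return 'Acquire::http::proxy "http://{}:{}@{}:{}";'.format(
        user, secret, host, port
    )
```

Calling apt_proxy_line("username", "password", "proxyname", 8080) yields a line with the same layout as the one in step 2, ready to paste into apt.conf.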

Update Ubuntu

After Ubuntu has restarted type:

  • sudo apt update
    

After the package list is updated type:

  • sudo apt upgrade
    

When finished you will probably need to restart Ubuntu; if not, it will not hurt to restart anyway.

Install Python and other dependencies

To run your render server you need Python and some other dependencies. Python is included in the default installation of the server version of Ubuntu 14.04, but to make sure that Python and the other dependencies are installed, paste the code below into the command line[1]. You may not need all of these packages, but this set is known to work:

sudo apt install -y gcc g++ make python python-pip python-dev\
 python-virtualenv libjpeg-dev libz-dev libfreetype6-dev liblcms-dev\
 libxml2-dev libxslt-dev ocaml-nox git-core python-imaging python-lxml\
 texlive-latex-recommended ploticus dvipng imagemagick pdftk

When finished you probably don't need to restart Ubuntu but it will not hurt to restart anyway.
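
To confirm that the installation above succeeded, you can check that the command-line tools are on the PATH. This is a minimal sketch in Python 3; the executable names in TOOLS are assumptions about what the packages install (for example, that imagemagick provides convert), so adjust the list to your system.

```python
import shutil

# Executable names are assumptions about what the packages above install.
TOOLS = ["python", "pip", "gcc", "make", "git", "latex", "dvipng", "convert", "pdftk"]


def missing_tools(names, which=shutil.which):
    """Return the names for which no executable is found on the PATH."""
    # 'which' is a parameter so the lookup can be replaced when testing.
    return [name for name in names if which(name) is None]
```

An empty list from missing_tools(TOOLS) means every listed tool was found; any name that is returned points at a package that still needs to be installed.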

Install mwlib & mwlib.rl

mwlib provides a library for parsing MediaWiki articles and converting them to different output formats[2].
You need to be root to install mwlib so use sudo.

  1. sudo pip install -i http://pypi.pediapress.com/simple/ mwlib
    
  2. sudo pip install -i http://pypi.pediapress.com/simple/ mwlib.rl
    
  3. sudo pip install gevent==1.0
    
  • mwlib is the core functionality, provides a parser.[3]
  • mwlib.rl generates PDF files from MediaWiki articles. This is what is being used on Wikipedia in order to generate PDF output.[3]

Math formulas support

If you are using the Math or MathJax extensions and you want to render formulas in your pdf output you also need to install support files for MediaWiki. You probably don't need all of it but it works. Run:

  1. sudo apt install mediawiki-math
    

Create a cache directory

Before you can start your render server you need to create a cache directory and make it accessible for your Ubuntu user.

  1. sudo mkdir /data
    
  2. sudo mkdir /data/mwcache
    
  3. sudo chown -R username:username /data/mwcache
    
    Note: Replace username with the Ubuntu user under which the render server will run.
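
Before starting the server it can be useful to confirm that the cache directory really is usable by the user who will run it. The sketch below (Python 3, run as that user) is only an illustration; the function name cache_dir_problem is hypothetical.

```python
import os


def cache_dir_problem(path):
    """Return a short description of what is wrong with path, or None."""
    if not os.path.isdir(path):
        return "not a directory: " + path
    # The render server needs to list, read and write files in the cache.
    if not os.access(path, os.R_OK | os.W_OK | os.X_OK):
        return "current user cannot read and write: " + path
    return None
```

For the setup above you would call cache_dir_problem("/data/mwcache"); a result of None means the directory exists and is accessible.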

Start your server

  • nserve is an HTTP server. The Collection extension talks to that program directly. nserve uses at least one mw-qserve instance in order to distribute and manage jobs[4]. Please note that nserve does not allow your MediaWiki base_url (equivalent to $wgServer in LocalSettings.php) to be set to localhost or to an IP address beginning with 127.0. or 192.168.
  • mw-qserve is a job queue server used to distribute and manage jobs. You should start one mw-qserve instance for each machine that is supposed to render pdf files. Unless you’re operating the Wikipedia installation, one machine should suffice[4].
  • nslave pulls new jobs from exactly one mw-qserve instance and calls the mw-zip and mw-render programs in order to download article collections and convert them to different output formats. nslave uses a cache directory to store the generated documents. nslave also starts an internal http server serving the content of the cache directory[4].
  • postman uploads zip collections to PediaPress in case someone wants to order printed books. You should start one instance for each mw-qserve instance[4].
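
The base_url restriction mentioned for nserve can be checked in advance. The sketch below (Python 3) only restates the rule as described above; it is not nserve's own code, and the function name is made up for this illustration.

```python
from urllib.parse import urlparse


def is_rejected_base_url(base_url):
    """Return True if base_url matches the restriction described for nserve."""
    # urlparse only finds a hostname when the URL includes a scheme such
    # as http://; without one the hostname is None and nothing matches.
    host = urlparse(base_url).hostname or ""
    return host == "localhost" or host.startswith(("127.0.", "192.168."))
```

A wiki reachable as http://192.168.1.5/w/ would be flagged, while the same wiki addressed by its computer name would not.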

To start your render server use:

nserve & mw-qserve & nslave --cachedir /data/mwcache

When you are using a print-on-demand service:

nserve & mw-qserve & nslave --cachedir /data/mwcache & postman --cachedir /data/mwcache
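
The two command lines above can also be assembled and started from a small Python 3 script, which is one possible starting point for launching the server automatically. This is a sketch only: the function names are hypothetical, and the only option used is the --cachedir option shown above.

```python
import subprocess


def render_server_commands(cachedir, print_on_demand=False):
    """Return one argument list per process that has to be started."""
    commands = [
        ["nserve"],
        ["mw-qserve"],
        ["nslave", "--cachedir", cachedir],
    ]
    if print_on_demand:
        # postman is only needed when books are ordered in print.
        commands.append(["postman", "--cachedir", cachedir])
    return commands


def start_render_server(cachedir, print_on_demand=False):
    """Start every process and return the list of Popen objects."""
    return [
        subprocess.Popen(command)
        for command in render_server_commands(cachedir, print_on_demand)
    ]
```

Calling start_render_server("/data/mwcache") starts the same three programs as the first command line; keep the returned objects and call wait() on them if the script should stay alive while the server runs.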

The "best" option is to have the render server start automatically when Ubuntu boots. There is documentation on how to do this, but it has not been tried or tested yet[5].

Test your server

To test if your render server is working enter the following commands in the terminal screen[5]:

  1. mw-zip -c :en -o test.zip Formula NK-33 Jupiter
    
  2. mw-render -c test.zip -o test.pdf -w rl
    

This test uses the Formula, NK-33 and Jupiter pages on the English Wikipedia site, but you can request any page you like. When a page name consists of multiple words, put it in quotation marks, like "Jimi Hendrix". When the test has finished there should be a pdf file called test.pdf in the directory where you ran the commands (normally the home directory of your user). Check that the pdf output is correct.
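
The quoting rule for multi-word page names is easy to get wrong by hand. The sketch below (Python 3) builds the two test commands from a list of page titles and prints them with shell quoting applied; it uses only the options shown in the commands above, and the function names are made up for this illustration.

```python
import shlex


def render_test_commands(titles, wiki=":en", zip_name="test.zip", pdf_name="test.pdf"):
    """Return the mw-zip and mw-render argument lists for the given titles."""
    mw_zip = ["mw-zip", "-c", wiki, "-o", zip_name] + list(titles)
    mw_render = ["mw-render", "-c", zip_name, "-o", pdf_name, "-w", "rl"]
    return mw_zip, mw_render


def as_shell(args):
    """Join arguments into one command line, quoting where the shell needs it."""
    return " ".join(shlex.quote(arg) for arg in args)
```

For the titles Formula, NK-33 and Jimi Hendrix, as_shell leaves the first two untouched and wraps only the multi-word title in quotes. The argument lists can also be passed to subprocess directly, in which case no quoting is needed at all.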

Setup the Collection extension

To enable the Collection extension on your wiki, follow the instructions on the Collection extension page.

LocalSettings.php

Because you are running your own render server, you have to let your wiki know where to find it. This means you have to put some extra configuration parameters into LocalSettings.php[6].

Note: If you don't alter any of the defaults used by the Collection extension and your wiki can be accessed from the Internet, your pages will be rendered by the PediaPress render server: http://tools.pediapress.com/mw-serve/.

/** During testing I did notice that if your Wiki server has a 
"home" IP address starting with 192.168.... and you use this address 
in: $wgServer = '192.168...' 
you get "bad base URL" faults when trying to create a book or render
a page. I did not dive into the subject, but nserve does not allow a 
base URL set to localhost or to an IP beginning with 127.0. or 192.168., 
probably for security reasons. 
If possible use your computer name for $wgServer and not the IP address*/
$wgServer = 'yourcomputername';


/** Collection extension to create books.
There are a lot of default values for parameters that can be overwritten.
Look in /extensions/Collection/Collection.php for "all" of them.
Below is just an example setup. */
require_once("$IP/extensions/Collection/Collection.php");


/** List of available download formats. You can enable other formats 
but they probably need some setting up and/or tuning to work properly. */		
$wgCollectionFormats = array(
    'rl' => 'PDF', # enabled by default
#    'odf' => 'ODT',
#    'docbook' => 'DocBook XML',
#    'xhtml' => 'XHTML 1.0 Transitional',
#    'epub' => 'e-book (EPUB)',
#    'zim' => 'Kiwix (OpenZIM)',
);


/** Set $wgCollectionPODPartners to false to disable print-on-demand service. Default is:
$wgCollectionPODPartners = array(
	'pediapress' => array(
		'name' => 'PediaPress',
		'url' => 'http://pediapress.com/',
		'posturl' => 'http://pediapress.com/api/collections/',
		'infopagetitle' => 'coll-order_info_article',
	),
);*/	
$wgCollectionPODPartners = false;


/** Maximum no. of articles in a book */	
$wgCollectionMaxArticles = 500; # default value

	
/** URL of your render server. 
Default is: $wgCollectionMWServeURL = 'http://tools.pediapress.com/mw-serve/';*/
$wgCollectionMWServeURL = 'http://yourrenderserverurlorip/cache';


/** When your Wiki is private (log in mandatory to read) you should 
create a render server user. With these credentials the render server 
is able to access your wiki. */	
$wgCollectionMWServeCredentials = "Renderserveruser:renderserveruserpassword";


/** If you are using $wgUploadPath = "/yourwikiname/img_auth.php"; with
a .htaccess file in your images directory to prevent direct access to the 
files on your wiki you need to make an exception for your render server. 
On a "normal" wiki files are served by the web server directly but when 
you use img_auth.php and there is a .htaccess file in your image directory 
which is set up correctly this is not possible when users are not logged in. 
I am not sure if the below option is safe for a wiki hosted on the internet 
and it probably can be set up in a different way but like this it is easy 
and it works. When the below code is located after
$wgUploadPath = '/yourwikiname/img_auth.php'; 
it will overwrite $wgUploadPath only when your render server is accessing 
your wiki, bypassing img_auth.php. When you know a better solution please 
feel free to document this. 
Note that REMOTE_ADDR holds the address of the machine making the request, 
so the value compared below has to be the address from which the render 
server connects (the wiki's own address only when both run on one machine). */
if ( $_SERVER["REMOTE_ADDR"] == 'your.wiki.ip.address' ) {
	$wgUploadPath = '/yourwikiname/yourimagedirectory';
}


/** User rights for saving books. Set the permissions to your liking */		
$wgGroupPermissions['*']['collectionsaveasuserpage'] = false;
$wgGroupPermissions['*']['collectionsaveascommunitypage'] = false;
$wgGroupPermissions['user']['collectionsaveasuserpage'] = true;
$wgGroupPermissions['user']['collectionsaveascommunitypage'] = true;

Save books

If you grant users permission to save books, you need to create a template. With this template it is possible to generate a book with "one click"[5]:

  • Template:Saved book with the following content:
<includeonly><span class="plainlinks" title="Download a PDF version of this book, optimized for A4 paper (8.3 × 11.7 in, 210 × 297 mm).">
[{{fullurl:Special:Book|bookcmd=render_collection&colltitle={{FULLPAGENAMEE}}&writer=rl}} Download PDF]</span></includeonly>

When users save books they will then be placed in Category:Books.
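
To see what kind of link the Saved book template produces, the query string can be rebuilt outside the wiki. The sketch below (Python 3) assumes the link has the usual index.php?title=... form; the wiki address and the book page name in the usage note are hypothetical, and the function name is made up for this illustration.

```python
from urllib.parse import urlencode


def render_link(index_php_url, book_page):
    """Return a download link for a saved book page, rendered with 'rl'."""
    params = {
        "title": "Special:Book",
        "bookcmd": "render_collection",
        # Page names in URLs use underscores instead of spaces.
        "colltitle": book_page.replace(" ", "_"),
        "writer": "rl",
    }
    # Keep ':' and '/' readable, as they commonly appear in wiki URLs.
    return index_php_url + "?" + urlencode(params, safe=":/")
```

For example, render_link("http://wiki.example.org/index.php", "User:Alice/Books/My book") gives a URL that can be opened in a browser to check that the render server responds to a saved book.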

Conditional inclusion of content

You can add functionality to your wiki that shows or hides content. This means you can prevent content on your wiki from being printed or, the other way around, make content show up only in the printed output and not on your wiki. You need to create two templates and add some code to your local MediaWiki:Common.css[5]:

  • Template:Hide in print with the following content:
<includeonly><div class="noprint">{{{1|}}}</div></includeonly>
  • Template:Only in print with the following content:
<includeonly><div class="onlyinprint">{{{1|}}}</div></includeonly>
  • In MediaWiki:Common.css:
/**
* Hide text in the onlyinprint class. Is used for the collection extension
*/
.onlyinprint {display: none}

For example: {{Hide in print|This text will not be shown in the print.}} or {{Only in print|This text will only be shown in the print.}}.

Setup mwlib

Pdf output style

When your render server is working it uses the default style for the pdf output. The default style options are loaded from pdfstyles.py. If you are happy with the default setup you can leave things alone, but if you want to change page breaks, footers and other settings you should create a Python script called customconfig.py. It is possible to hack pdfstyles.py, but creating customconfig.py is the better option: when there are updates, pdfstyles.py could get overwritten and you would lose your changes. You can open pdfstyles.py to check which parameters can be set, so you know what to put into customconfig.py. Both files are located in /usr/local/lib/python2.7/dist-packages/mwlib/rl/. To create customconfig.py, open the terminal screen, copy and paste the following command into it and press enter:

  1. sudo nano /usr/local/lib/python2.7/dist-packages/mwlib/rl/customconfig.py
    
  2. Copy and paste the example code below into the document and save it:
######### PAGE CONFIGURATION
page_break_after_article = True # Turn Off/On page break after each article
show_article_attribution = False   # Show/Hide article source and contributors
pagefooter = 'This text is printed in the footer of all article pages' 
titlepagefooter = 'This text is only printed in the footer of the title page'
title_page_image = '/data/yourimagename.png' # Path of an image that is to be displayed on the title page

######### TEXT CONFIGURATION
import os
os.environ['MATH_RESOLUTION'] = "120"

######### IMAGE CONFIGURATION
link_images = False

Note: It is very easy to break things when the content of customconfig.py is not correct. In the example above the render server will look for /data/yourimagename.png, but that image will most likely not exist in that location on your render server. When you have created a book with a title page, the faulty path will break the pdf rendering process. If you are testing different options and settings, do this step by step: read the output on your render server to see what is happening and check the pdf output in between, so that when things are not working you "know" what caused the problem.
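
The title page image problem described in the note can be caught before rendering. The sketch below loads a customconfig.py with the standard runpy module and reports a title_page_image that does not exist. It is an illustration only: the function name is hypothetical, and loading the file executes it, so side effects such as setting MATH_RESOLUTION happen as well.

```python
import os
import runpy


def missing_title_image(config_path):
    """Return the configured title page image if the file is missing, else None."""
    # run_path executes the config file and returns its global names.
    settings = runpy.run_path(config_path)
    image = settings.get("title_page_image")
    if image and not os.path.isfile(image):
        return image
    return None
```

Run it against the path used in step 1, /usr/local/lib/python2.7/dist-packages/mwlib/rl/customconfig.py; any returned path is an image that has to be created or corrected before a book with a title page is rendered.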

Math formulas

When you use math formulas on your wiki and you have set up your render server to render them, you might have noticed that the quality (resolution) of the formulas is not "stellar". It is not clear how the render server from PediaPress is set up, but it renders formulas with higher quality than the "default" render server documented on this page. If you use the MathJax extension you will be especially "disappointed", because MathJax renders very high quality formulas in your browser compared to the formulas rendered with the Math extension.

[Figure: math formulas as shown in your browser (Firefox 23.0.1, same scale). Panels: rendered with MathJax; rendered with Math.]

To get the pdf result below, the following setup is used in customconfig.py for mwlib:

######### TEXT CONFIGURATION
import os
os.environ['MATH_RESOLUTION'] = "120"

[Figure: math formulas in pdf output (same scale). Panels: pdf on render server (as documented); pdf by PediaPress via the English Wikipedia.]

It must be possible to increase the quality of the formulas in the pdf output, because PediaPress does it. I did not find any documentation on how to do this, so maybe somebody else has ideas on this subject.

References