Offline content generator/Architecture

General Overview
As shown in the diagram, MediaWiki sits between the render servers and the public internet. The Collection extension is the portal to the backend and follows a 'render, render status, download document' workflow. If it determines that a document needs to be rendered, it can push the new job to any render server, which in turn pushes it onto a queue in Redis for eventual pickup. Status updates are obtained by again querying any server, which retrieves the status from Redis.

Render servers have three main processes: a frontend, a render client, and a garbage collector. The frontend is an HTTP server and is the public interface of the server. Render clients do the actual work and opportunistically pick up jobs from Redis. Finally, the garbage collector cleans up after failed jobs, marking their status as failed in Redis and clearing the local scratch space.

To do the actual work, a render client takes a job off the Redis FIFO queue. On successful completion, the job produces a bundle file and a final rendered document, both of which are stored temporarily in Swift for a couple of days. The bundle file is kept in case another render job appears for the same content but a different final format.

At any time after the work has been completed, provided the file has not expired from Redis, the frontend can be instructed to stream the file through MediaWiki and down to a user. The file is served from MediaWiki with cache control headers so that it can be stored for a longer term in Varnish.

As all jobs have a unique hash (Collection ID) created from, among other things, the article revision IDs, cache invalidation happens automatically on new requests with new text content. However, changes to templates, images, or any other content that do not update revision IDs will not produce a new hash, and thus will not be re-rendered on request unless a manual purge is issued.
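
A minimal sketch of how such a hash behaves. It assumes SHA-1 over a canonical JSON string of the output format and the (title, revision ID) pairs; the digest actually used and the other inputs to the hash are not specified here, and `collection_id` is a hypothetical name.

```python
import hashlib
import json


def collection_id(fmt, articles):
    """Hash the output format and (title, revid) pairs into a job ID.

    Assumption: SHA-1 over canonical JSON. The real service may hash
    additional inputs or use a different digest.
    """
    payload = json.dumps(
        [fmt, [[title, revid] for title, revid in articles]],
        separators=(",", ":"),
    )
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


old = collection_id("pdf", [("Main Page", 1001), ("Help", 2002)])
same = collection_id("pdf", [("Main Page", 1001), ("Help", 2002)])
edited = collection_id("pdf", [("Main Page", 1005), ("Help", 2002)])

print(old == same)    # True: identical revisions map to the cached render
print(old == edited)  # False: a new revision ID produces a new job
```

A template or image change leaves every revision ID untouched, so the ID stays the same, which is why such changes need a manual purge.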

The Render Server (a.k.a. offline content generator)
The render server hosts a Node.js process which forks itself several times to spawn subcomponents. The initial thread is kept as a coordinator and is capable of restarting threads on demand, or if any thread unexpectedly dies. It can be run standalone, logging to the console, or as a service with logs routed to syslog.

Render Frontend
The frontend is an HTTP server capable of accepting new jobs, obtaining status updates of pending and running jobs, and streaming final rendered content back to the requester.

API (command=?)

 * render: Places new jobs (and the job metadata) into Redis
   * Creates the job IDs by preprocessing the render request into (format, [(title, revid), ...]) and SHA-ing that string
 * bookcmd=download: Returns completed documents to MediaWiki
   * Or does a 302 redirect to the server that does have the content, if this server did not render it
 * bookcmd=render_status: Queries the Redis server for the current status of the job
 * bookcmd=zip_post: Pushes a render job to Redis with a special 'zip' format? Which will then also have the client upload the results? This would be used if we needed this service to natively support the PediaPress intermediate format.
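
The commands above can be sketched as a single dispatch function. This is an illustration under assumptions, not the service's code: plain dictionaries stand in for Redis, the request is assumed to already carry its job ID as `collection_id`, and every status code other than the 302 redirect is a guess.

```python
def handle_request(params, statuses, pending, this_host):
    """Return (http_status, payload) for one frontend request."""
    command = params["command"]
    job_id = params["collection_id"]

    if command == "render":
        if job_id not in statuses:  # only queue work that is not already known
            statuses[job_id] = {"status": "pending", "host": None}
            pending.append(job_id)
        return 200, {"collection_id": job_id}

    job = statuses.get(job_id)
    if job is None:
        return 404, {"error": "unknown job"}
    if command == "render_status":
        return 200, {"status": job["status"]}
    if command == "download":
        if job["status"] != "completed":
            return 404, {"error": "not rendered yet"}
        if job["host"] != this_host:
            # Another server holds the file: redirect the caller there.
            location = "http://%s/?command=download&collection_id=%s" % (
                job["host"], job_id)
            return 302, {"Location": location}
        return 200, {"path": job["path"]}
    return 400, {"error": "unknown command"}


statuses = {"abc": {"status": "completed", "host": "render2", "path": "/tmp/abc.pdf"}}
pending = []
print(handle_request({"command": "download", "collection_id": "abc"},
                     statuses, pending, "render1"))
print(handle_request({"command": "render", "collection_id": "new"},
                     statuses, pending, "render1"))
```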

Render Client
The render pipeline has three broad stages: getting the job from Redis, spidering the site to produce an intermediate file with all resources, and then rendering the output.
 * Takes jobs from Redis when free
 * Spidering
   * Pulls each title from Parsoid
   * Processes all downloaded RDF for external resources such as images
   * Downloads resources
   * Rewrites RDF to point to the local resource (I don't think rewriting is necessary; the renderer can do that if needed. cscott (talk) 16:36, 14 November 2013 (UTC))
 * Rendering
   * Processes the RDF as required for the output format
   * Runs pages through a compositor such as LaTeX/PhantomJS, producing intermediate pages
   * Performs final compositing of all parts (adds title page, table of contents, page numbers, merges intermediates, etc.)
   * Saves the final file to a local/remote disk
 * Updates the Redis entry for the job while in progress and when complete
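
The stages above can be sketched end to end with stand-ins for the external pieces. Everything here is hypothetical: `fetch_page` replaces the request to Parsoid, `render` is a placeholder for the real compositor, and an in-memory queue and dictionary replace Redis.

```python
from collections import deque


def spider(titles, fetch_page, fetch_resource):
    """Build the intermediate bundle: every page plus the resources it uses."""
    bundle = {"pages": {}, "resources": {}}
    for title in titles:
        page = fetch_page(title)  # stand-in for a request to Parsoid
        bundle["pages"][title] = page["rdf"]
        for url in page["resources"]:
            if url not in bundle["resources"]:  # download each resource once
                bundle["resources"][url] = fetch_resource(url)
    return bundle


def render(bundle, fmt):
    """Placeholder for the compositor: a title page followed by each page."""
    parts = ["[title page]"] + list(bundle["pages"].values())
    return fmt + ":" + "|".join(parts)


def run_one_job(pending, statuses, fetch_page, fetch_resource):
    """Take the oldest pending job, run both stages, and record progress."""
    if not pending:
        return None
    job_id = pending.popleft()  # FIFO: oldest job first
    job = statuses[job_id]
    job["status"] = "running"
    bundle = spider(job["titles"], fetch_page, fetch_resource)
    job["output"] = render(bundle, job["format"])
    job["status"] = "completed"
    return job_id


# Fake fetchers so the sketch runs without network access.
pages = {"A": {"rdf": "<p>A</p>", "resources": ["a.png"]},
         "B": {"rdf": "<p>B</p>", "resources": ["a.png", "b.png"]}}
downloads = []


def fake_resource(url):
    downloads.append(url)
    return b"bytes of " + url.encode("utf-8")


pending = deque(["job1"])
statuses = {"job1": {"status": "pending", "format": "pdf", "titles": ["A", "B"]}}
finished = run_one_job(pending, statuses, pages.__getitem__, fake_resource)
print(finished, statuses["job1"]["status"], downloads)
```

Keeping the bundle separate from the rendered output is what lets a later job for another format skip the spidering stage.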

Garbage Collector
Every so often:
 * Go through all keys in the Redis server and remove old jobs / files (older than 7 days?)
 * Also clean up intermediate results and output PDFs?
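
A rough sketch of one such sweep, using in-memory dictionaries in place of Redis and the scratch space. The 7-day limit is the tentative figure from the list above, and the `updated` field and function name are assumptions.

```python
import time

SEVEN_DAYS = 7 * 24 * 60 * 60  # tentative retention period, in seconds


def collect_garbage(statuses, files, now=None, max_age=SEVEN_DAYS):
    """Drop job entries, and any files kept for them, older than max_age."""
    now = time.time() if now is None else now
    removed = []
    for job_id in list(statuses):  # copy the keys: the dict shrinks as we go
        if now - statuses[job_id]["updated"] > max_age:
            files.pop(job_id, None)  # intermediate results and output
            del statuses[job_id]
            removed.append(job_id)
    return removed


now = 1_000_000_000
statuses = {
    "old": {"updated": now - 8 * 24 * 60 * 60},
    "fresh": {"updated": now - 60},
}
files = {"old": "/scratch/old.pdf", "fresh": "/scratch/fresh.pdf"}
print(collect_garbage(statuses, files, now=now))
```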

Redis
Redis has a pending job queue: a list of job IDs awaiting a render client. Each job ID also has a persistent entry keyed on the ID with a JSON blob for contents. The blob contains
 * Should also contain the render request
 * Date last updated
 * The current job status (pending, running, completed...)
 * Some substructure containing 'percentage complete', 'current title', etc
 * The render server responsible for the job
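
For illustration, a blob with the fields listed above might look like the following. All field names and values are invented for this sketch; the actual schema is not defined here.

```python
import json

status = {
    "request": {"format": "pdf", "articles": [["Main Page", 1001]]},
    "updated": 1384446960,  # date last updated, as a Unix timestamp
    "status": "running",    # pending, running, completed...
    "progress": {"percent": 40, "current_title": "Main Page"},
    "host": "render1.example",  # render server responsible for the job
}

blob = json.dumps(status, sort_keys=True)  # the string stored under the job ID
restored = json.loads(blob)
print(restored["status"], restored["progress"]["percent"])
```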

Redis Server
There are three classes of objects that will be stored in Redis: a FIFO job pending list, job status objects, and collection metadata (metabook) objects. Jobs are inserted into Redis in such a way that no contention can occur. Once in Redis, only the client responsible for the job may modify the status object.

Jobs are inserted by the frontend into the pending list via the  command, and removed by clients using the   command. At the same time as the frontend is injecting the job into the pending list, it will also  the initial job status object (keyed on Collection ID) with a status of pending. Atomicity is ensured on job push by performing a  on the collection ID and then wrapping the   and   within a   /   block. This has the implication that all status keys and the pending list are on the same server.
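
The Redis command names are missing from the paragraph above, so the following sketch does not use a Redis client. It only models the intended behaviour in memory, with a lock standing in for the transaction: the status object and the queue entry are created together or not at all, and a job that already exists is not queued twice. `JobStore` and its methods are hypothetical names.

```python
import threading
from collections import deque


class JobStore:
    """In-memory stand-in for the pending list and the status objects."""

    def __init__(self):
        self._lock = threading.Lock()  # plays the role of the transaction
        self.pending = deque()
        self.status = {}

    def push_job(self, collection_id, request):
        """Create the status object and queue the job as one step."""
        with self._lock:
            if collection_id in self.status:
                return False  # someone else already queued this job
            self.status[collection_id] = {"status": "pending", "request": request}
            self.pending.append(collection_id)
            return True

    def take_job(self):
        """Remove and return the oldest pending job ID, or None if empty."""
        with self._lock:
            return self.pending.popleft() if self.pending else None


store = JobStore()
print(store.push_job("abc", {"format": "pdf"}))  # True
print(store.push_job("abc", {"format": "pdf"}))  # False: duplicate push
print(store.take_job())                          # abc
```

Because one lock guards both structures, this mirrors the constraint noted above that the status keys and the pending list live on the same server.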

Job Status Objects
Status

Intermediate Format
csa: my current plan is to use PDF_rendering/print-on-demand_service for this.