Topic on Talk:Offline content generator/Architecture

A few questions

Cscott (talkcontribs)

Are we planning to do both spidering and rendering in one job? It might be useful to split these into separate job queues.
Since downloading image resources seems to be the most resource-intensive task, and images don't change nearly as often as article text, should there be additional (on-disk?) caching inserted here? Or are we just sharing the image caches of the general web front end? (In theory we could cache the complete spider bundle and use it to save time fetching resources when we only need to update, e.g., article text.)
What will the progress update message formats be?
Should the completed output file (PDF, etc.) be added to the Redis job info? I'm not quite sure how that works.

Mwalker (WMF) (talkcontribs)

We agreed in this morning's standup that we should explicitly split spidering and rendering into two jobs. The spider will then inject a job into a 'pending render' queue with a URL to its endpoint for the 'post_zip' command.
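
To make the handoff concrete, here is a minimal sketch of the two-queue flow, assuming in-memory deques as stand-ins for the real job queues. The function names, the job fields, and the shape of the 'post_zip' URL are all hypothetical; none of them come from the actual implementation.

```python
import json
from collections import deque

# Hypothetical in-memory stand-ins for the two job queues.
spider_queue = deque()
pending_render_queue = deque()


def run_spider_job(job, spider_endpoint):
    """Finish a (pretend) spider job and hand off to the render queue."""
    # Real spidering would fetch article text and images here.
    render_job = {
        "id": job["id"],
        "format": job["format"],
        # Assumed URL layout for the spider's 'post_zip' command.
        "post_zip_url": f"{spider_endpoint}/?command=post_zip&id={job['id']}",
    }
    # Serialize the job so the queue only ever holds small strings.
    pending_render_queue.append(json.dumps(render_job))
    return render_job


def next_render_job():
    """Pop the oldest pending render job, or return None if there is none."""
    if not pending_render_queue:
        return None
    return json.loads(pending_render_queue.popleft())
```

A renderer would call `next_render_job()`, fetch the bundle from `post_zip_url`, and produce the output. Because the queue entry carries only a URL, the bundle itself never passes through the queue.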

The idea of a lookaside image cache has merit, but let's not do that right now. We can implement it later if we need it, which should be easier once we have a separate spider/render workflow.

Progress update messages: I haven't looked into them much yet. The format will have to be very similar to the old one, which I haven't yet examined closely.

I would argue against putting large things in Redis; it wasn't really designed for large binary objects. The current plan of hosting them on disk or in Varnish seems workable.
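
As a rough illustration of keeping the binary out of Redis, the sketch below writes the rendered file to disk and records only a small pointer in the job info. A plain dict stands in for the Redis job record, and the field names and file layout are assumptions made for this example.

```python
import hashlib
import os

# Stand-in for the Redis job info; the record layout is hypothetical.
job_info = {}


def store_output(job_id, data, out_dir, extension="pdf"):
    """Write rendered output to disk and record only its location."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"{job_id}.{extension}")
    with open(path, "wb") as fh:
        fh.write(data)
    # Only small metadata goes into the job record, never the file bytes.
    job_info[job_id] = {
        "status": "complete",
        "output_path": path,
        "size": len(data),
        "sha1": hashlib.sha1(data).hexdigest(),
    }
    return job_info[job_id]
```

The front end can then look up `output_path` for a finished job and serve the file from disk, or from Varnish in front of it.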

Anomie (talkcontribs)

The disadvantage of a lookaside image cache is that then we have to deal with cache invalidation, which can be a pain.