Talk:Offline content generator/Architecture

Simple setup for casual MediaWiki users

Saper (talkcontribs)

As one of the few people who actually cared about Extension:Collection bugs in the recent past:

I understand that some developers fell in love with node.js (is MediaWiki going to be rewritten in JavaScript for phase four?), but please, please keep it simple to install for Jimmy the casual MediaWiki user, who has PHP and maybe Python installed on his Linux/Windows box.

I've seen some people reporting bugs against their own mwlib installations, as some organisations don't want to use the public PediaPress renderer.

I understand that there are some performance issues right now - it would certainly be beneficial to explain at the beginning of the document "why we are doing this" and "what options we have".

I also don't think that having Wikipedia use the cool nodejs/redis solution while leaving Jimmy the casual MediaWiki user with mwlib is viable - users will be frustrated that they don't get the same output on their wikis as on Wikipedia, and there will be lots of unnecessary troubleshooting work. I'm not sure how replicating the WMF setup will be possible for a moderately advanced user who is not living in a Puppet world.

Mwalker (WMF) (talkcontribs)

For the class of user that would be installing and running their own mwlib installs, the nodejs and redis queue system would require a similar amount of complexity, or less. We're moving away from the mwlib system for several reasons -- one of which is how difficult it is to run in production.

The primary dependencies as proposed, Node and Redis, are available as Ubuntu packages (we're going to run node 0.10 in production only so that the WMF cluster runs a unified version). Installation is then a matter of pulling the git repository of the new renderer and starting the node process -- we will most likely provide an upstart script so that daemonization is easy.

We will continue to use standard Ubuntu packages for additional binary dependencies like PhantomJS, pdftk, pngcrush, and LaTeX. For those packages which must be backported (if any; we don't currently have any beyond node), the package will be in the gerrit repo for download. Ideally we will eventually provide a Debian package of this solution.

Maintaining compatibility with the existing internal API is a design requirement. That means that any user who chooses to use the mwlib solution provided by PediaPress will be able to do so. (And in fact the WMF will be using that same service to provide the on demand book printing.)

I chose not to use Python in this case for the render server because our backend renderer is PhantomJS, which is controlled via JavaScript. Using Node means that we have all our code in one language. Additionally, there is no render system which would be purely a drop-in with just PHP/Python unless we pushed rendering down into the user's browser -- which we don't wish to do.

Mwalker (WMF) (talkcontribs)

You are correct, though, that this solution does require Parsoid. I feel that it is a reasonable requirement; more and more features for MediaWiki are requiring it (VisualEditor and Flow are the big ones).

In the bigger context, something has to parse the wikitext into usable output. We can't just take the output from api.php?action=render, because it doesn't provide enough semantic information (I have no idea what that API is designed to be used for, but it's clearly not this). Maybe in the future we will be able to use a similar native API call, but I only have till the 22nd to come up with something usable for the WMF.

Anomie (talkcontribs)

The API doesn't have an "action=render", so I'm not sure what you're talking about there.

Jeremyb (talkcontribs)

maybe index.php?action=render ?

Mwalker (WMF) (talkcontribs)

Ah, no; I misremembered -- it's action=parse, which appears to give the HTML output of the PHP parser. Which is great, but it is of course missing the RDFa markup -- and you have to traverse it looking for specific classes to remove things like edit links and the table of contents.
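
For illustration, a rough sketch of that call (the api.php endpoint and the action=parse/format=json parameters are the standard MediaWiki API; the edit-link stripping at the end is only an example of the kind of traversal meant here, and a real implementation would walk the DOM rather than use a regex):

    // Sketch: fetch PHP-parser HTML for a page via action=parse.
    async function fetchParsedHtml(apiBase: string, title: string): Promise<string> {
      const url = `${apiBase}/api.php?action=parse&format=json` +
        `&page=${encodeURIComponent(title)}`;
      const res = await fetch(url);
      const data = await res.json();
      let html: string = data.parse.text['*']; // HTML from the PHP parser, no RDFa

      // Illustrative clean-up only: drop section edit links by class.
      // A proper implementation would parse the HTML and remove nodes
      // such as the edit-section spans and the table of contents.
      html = html.replace(/<span class="mw-editsection">[\s\S]*?<\/span>/g, '');
      return html;
    }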

Anomie (talkcontribs)

Of course, since the PHP parser doesn't generate RDFa markup in the first place. That's something that was introduced in Parsoid due to the needs of VE.

GWicke (talkcontribs)

Not just VE. The intention was always to expose all semantic information so that any kind of client can easily extract and manipulate it.

Cscott (talkcontribs)

If the user has PHP and "one other scripting language" installed, it doesn't seem to make a compelling difference whether that "other scripting language" is python or node.

That said, currently the real barrier to Jane Wikipedia is actually all the *other* stuff needed for PDF rendering: fonts, LaTeX, Python extensions, ImageMagick, pngcrush, etc. There are lots of issues here, but rest assured we're not deliberately trying to make the system harder to install.

Saper (talkcontribs)

As a person who actually needed to compile v8 to get node working right, I disagree that it is easy.

Node also has an interesting way of plugging in its modules (quite cool in my opinion, but sometimes confusing), and I don't believe that installing the necessary npm packages via standard OS distribution means is workable in the long term.

Reply to "Simple set up for casual MediaWiki users"

PDF creation needs to support all scripts

GerardM (talkcontribs)

The current version that creates PDF does not work for many scripts. Any new implementation should support all the scripts we support.

Kelson (talkcontribs)

I agree; ResourceLoader should be correctly exported for offline usage.

Mwalker (WMF) (talkcontribs)

Guh wah! It is a design goal that we support all of the languages that the WMF itself runs sites for (and hopefully really any in the basic and extended multilingual planes).

However... JavaScript is not yet on the table, because at the moment we're only looking at supporting static content. Kelson, does Kiwix actually run JS? Does your bundler already do this?

Kelson (talkcontribs)

Kiwix is based on the Mozilla runtime, and Kiwix for Android on the native WebView. Both provide a JS runtime. In the past, I used to generate ZIM files with working JavaScript inside. But now, with ResourceLoader/Parsoid, this is broken. I have tried to fix this, without success so far. Adam W. from the WMF office has also investigated this for Kiwix using PhantomJS; I'm sure he can give you some more useful information.

Reply to "PDF creation needs to support all scripts"
A few questions

Cscott (talkcontribs)
  • Are we planning to do both spidering and rendering in one job? It might be useful to split these into separate job queues?
  • Since image resource downloading seems to be the most resource-intensive task, and images don't change nearly as often as article text, should there be additional (on-disk?) caching inserted here? Or are we just sharing the image caches of the general web front end? (In theory we could cache the complete spider bundle, and use that to save time fetching resources when we just need to update (e.g.) article text.)
  • Progress update message formats?
  • Should the completed output file (PDF, etc.) be added to the Redis job info? I'm not quite sure how that works.
Mwalker (WMF) (talkcontribs)
  • We agreed in this morning's standup that we should explicitly split spidering and rendering into two jobs. The spider will then inject a job into a 'pending render' queue with a URL to its endpoint for the 'post_zip' command (a rough sketch follows after this list).
  • The idea of a lookaside image cache has merit, but let's not do that right now. We can implement it later if we need it (which should be easier once we have a separate spider/render workflow).
  • Progress update messages: I haven't looked into this too much. It'll have to be very similar to the old format, which I haven't yet looked at closely.
  • I would argue against putting large things in Redis; it wasn't really designed for large binary objects. The current plan of hosting them on disk, or in Varnish, seems workable.
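
A rough sketch of what that hand-off could look like, assuming a simple Redis list as the 'pending render' queue (the queue name, job fields, and node-redis client usage are illustrative assumptions, not the actual OCG job format):

    // Sketch: the spider enqueues a render job; a render worker pops it.
    // Queue name and job fields below are assumptions for illustration.
    import { createClient } from 'redis';

    interface RenderJob {
      collectionId: string; // id of the spidered bundle
      writer: string;       // target format, e.g. 'rl' for PDF
      postZipUrl: string;   // URL of the spider's endpoint for the 'post_zip' command
    }

    export async function enqueueRenderJob(job: RenderJob): Promise<void> {
      const client = createClient({ url: 'redis://localhost:6379' });
      await client.connect();
      await client.lPush('pending_render', JSON.stringify(job));
      await client.quit();
    }

    export async function takeNextRenderJob(): Promise<RenderJob | null> {
      const client = createClient({ url: 'redis://localhost:6379' });
      await client.connect();
      // Block for up to 30 seconds waiting for a job to appear.
      const res = await client.brPop('pending_render', 30);
      await client.quit();
      return res ? (JSON.parse(res.element) as RenderJob) : null;
    }
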
Anomie (talkcontribs)

The disadvantage of a lookaside image cache is that then we have to deal with cache invalidation, which can be a pain.

Reply to "A few questions"

What is the Round Robin (and why do we need a garbage collector)

Mwalker (WMF) (talkcontribs)

Gabriel has suggested we use a similar front end to Parsoid's, i.e. Varnish. This would also offload the caching for bookcmd=download requests to the Varnish layer.

Basically it would look like:

                            +------------------------------+
                            | Two-layer Varnish boxes      |
MediaWiki --> LVS Boxes --> |  Frontend -> CARP -> Backend | --> LVS Boxes --> Render Servers
                            +------------------------------+

We could set a cache control of a couple of days on rendered output, use the standard Varnish LRU purge, and manually issue a purge when a forced render comes in from MediaWiki.
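
For illustration, a minimal sketch of the manual purge side of that (this assumes the Varnish VCL is configured to accept PURGE requests; the host and path are placeholders):

    // Sketch: invalidate a cached rendered document when MediaWiki
    // forces a re-render. Assumes Varnish accepts PURGE in its VCL.
    import * as http from 'http';

    function purgeRenderedDoc(varnishHost: string, docPath: string): Promise<number> {
      return new Promise((resolve, reject) => {
        const req = http.request(
          { host: varnishHost, port: 80, method: 'PURGE', path: docPath },
          (res) => resolve(res.statusCode ?? 0)
        );
        req.on('error', reject);
        req.end();
      });
    }

    // e.g. purgeRenderedDoc('render-cache.example.org', '/collection/abc123.pdf');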

-- However --

Because we wish to be backwards compatible with the old setup, we must always first issue a 'render' command to the backend. The only way for it to then know whether something has already been rendered is to find it in Redis. So we'll still have to garbage-collect the Redis stuff...

GWicke (talkcontribs)

Adding the capability to speak HTTP to the collection extension might not be that hard and would not break the existing interface.

For client-side 'async' rendering without PHP timeout concerns, do an ajax HEAD request to kick off the parse. Then do some ajax polling with HEAD requests and the Cache-Control: only-if-cached header set. Reveal the link when the render is ready.

Alternatively, kick off the render as above and reveal a link to the file using an external IP for the PDF service (no short PHP timeout). Varnish will then collapse the requests for you. The spinner can still be shown here as well.
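
A rough sketch of that polling pattern from the client side (the URL, interval, and retry count are placeholders; the assumption is that the cache answers a HEAD with only-if-cached only once it already holds the rendered file):

    // Sketch: kick off the render with a HEAD request, then poll with
    // Cache-Control: only-if-cached until the cache answers from its store.
    async function waitForRenderedDoc(
      docUrl: string, intervalMs = 2000, maxTries = 150
    ): Promise<boolean> {
      // First HEAD request starts the (possibly slow) render behind Varnish.
      fetch(docUrl, { method: 'HEAD' }).catch(() => { /* render kicked off */ });

      for (let i = 0; i < maxTries; i++) {
        await new Promise((r) => setTimeout(r, intervalMs));
        const res = await fetch(docUrl, {
          method: 'HEAD',
          // Ask the cache to answer only if it already holds the object.
          headers: { 'Cache-Control': 'only-if-cached' },
        });
        if (res.ok) return true; // cached copy exists; reveal the download link
      }
      return false; // give up; fall back to an error message or a direct link
    }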

A nice feature of using Varnish is that you get

  • timeouts
  • bounded queuing
  • quick render start

basically for free.

PhantomJS is also not as slow as mwlib. The HTML5 spec renders to 508 A4 pages in about 20 seconds. That document has few, low-resolution graphics, but it does not seem inconceivable that even super-large books with 150 dpi images will render in less time than even the PHP timeout.

Reply to "What is the Round Robin (and why do we need a garbage collector)"
There are no older topics