Job queue redesign
Each wiki carries a "job" table that queues actions for future background processing:
-- Jobs performed by parallel apache threads or a command-line daemon
CREATE TABLE /*$wgDBprefix*/job (
  job_id int unsigned NOT NULL auto_increment,

  -- Command name
  -- Limited to 60 to prevent key length overflow
  job_cmd varbinary(60) NOT NULL default '',

  -- Namespace and title to act on
  -- Should be 0 and '' if the command does not operate on a title
  job_namespace int NOT NULL,
  job_title varchar(255) binary NOT NULL,

  -- Any other parameters to the command
  -- Presently unused, format undefined
  job_params blob NOT NULL,

  PRIMARY KEY job_id (job_id),
  KEY (job_cmd, job_namespace, job_title)
) /*$wgDBTableOptions*/;
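The push/run cycle this table supports can be sketched as follows. This is a minimal illustration using Python's sqlite3 with a simplified schema; the real implementation is PHP against MySQL, and the job names are just examples.

```python
# Minimal sketch of the job queue's push/run cycle.
# Uses sqlite3 for illustration; MediaWiki's real code is PHP/MySQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE job (
        job_id INTEGER PRIMARY KEY AUTOINCREMENT,
        job_cmd TEXT NOT NULL DEFAULT '',
        job_namespace INTEGER NOT NULL,
        job_title TEXT NOT NULL,
        job_params BLOB NOT NULL DEFAULT x''
    )
""")

def push(cmd, namespace, title, params=b""):
    # New jobs go on the end; duplicates are allowed for insert speed.
    conn.execute(
        "INSERT INTO job (job_cmd, job_namespace, job_title, job_params) "
        "VALUES (?, ?, ?, ?)", (cmd, namespace, title, params))

def pop():
    # Pull the front item; on completion, delete it and all duplicates.
    row = conn.execute(
        "SELECT job_id, job_cmd, job_namespace, job_title FROM job "
        "ORDER BY job_id LIMIT 1").fetchone()
    if row is None:
        return None
    job_id, cmd, ns, title = row
    # ... load the associated class and execute the job here ...
    conn.execute(
        "DELETE FROM job WHERE job_cmd = ? AND job_namespace = ? "
        "AND job_title = ?", (cmd, ns, title))
    return (cmd, ns, title)

# Two edits to the same template queue duplicate refreshLinks jobs:
push("refreshLinks", 0, "Some_page")
push("refreshLinks", 0, "Some_page")
push("enotifNotify", 0, "Other_page")
print(pop())  # runs refreshLinks once; the delete removes both copies
print(conn.execute("SELECT COUNT(*) FROM job").fetchone()[0])  # 1 left
```

Note that the duplicate never runs twice: completing one copy deletes every row with the same (cmd, namespace, title) triple.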
New jobs are pushed onto the end of the queue at runtime; multiple duplicates of a single job may be pushed.
A later queue run (either from runJobs.php invoked on the command line or from cron, or as a random "background operation" during a web hit) pulls an item from the front, loads the associated class, executes it, and on completion deletes it along with any duplicates in the queue.

Current job types include:
- Re-renders pages to update their link tables after a template they used was edited
- Same thing, but optimized for high-use templates (pulls the list of pages to update at job-run time instead of job-insert time)
- Updates page_touched timestamps and clears pages from the HTTP proxy or HTML file caches, if in use. Again, used to hit pages for indirect updates via templates.
- Sends e-mails of page change notifications, if configured. This is relatively time-critical.
- Automatically tidies up double-redirects that might have been created by a page rename to point to the final target.
- For the RenameUser extension -- updates *_user_text fields in various tables with a user's new name. Delays here are highly visible to users.

Issues with the current design:
- Lack of a timestamp in the queue entry means we can't easily get a sense of how lagged our operations are.
- Duplicate-entry model allows the queue to grow disproportionately large when things happen like multiple edits to a widely-used template; it becomes difficult to tell how many actual jobs there are to do.
- This model was chosen over preventing duplicate inserts in order to maximize insert speed: no unique index is required, and we don't have to check for duplicates manually at insert time.
- There's not a good way to get a summary of what's currently in the table; if there are a million entries, querying them to get a breakdown is pretty slow.
- There's no prioritization, making it wildly unsuitable for things like e-mail notifications, which should go out within a couple of minutes -- if we spend six hours updating link tables after a template change, we shouldn't be pushing everything else back.
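One possible direction, sketched here against sqlite3 purely for illustration: an insertion timestamp for lag tracking, a unique key for deduplication at insert time, and a priority column. The new column names (job_timestamp, job_priority) are hypothetical, not decided.

```python
# Hedged sketch of a revised schema addressing the issues above.
# job_timestamp and job_priority are illustrative names, nothing more.
import sqlite3, time

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE job (
        job_id INTEGER PRIMARY KEY AUTOINCREMENT,
        job_cmd TEXT NOT NULL,
        job_namespace INTEGER NOT NULL,
        job_title TEXT NOT NULL,
        job_params BLOB NOT NULL DEFAULT x'',
        job_timestamp INTEGER NOT NULL,          -- insertion time, for lag
        job_priority INTEGER NOT NULL DEFAULT 0, -- e.g. e-mail > link refresh
        UNIQUE (job_cmd, job_namespace, job_title)  -- dedup at insert time
    )
""")

now = int(time.time())
jobs = [("refreshLinks", 0, "Template_page", now - 3600, 0),
        ("refreshLinks", 0, "Template_page", now - 3600, 0),  # duplicate
        ("enotifNotify", 0, "Watched_page", now - 60, 10)]
for cmd, ns, title, ts, prio in jobs:
    # INSERT OR IGNORE collapses duplicates instead of queueing them twice.
    conn.execute(
        "INSERT OR IGNORE INTO job "
        "(job_cmd, job_namespace, job_title, job_timestamp, job_priority) "
        "VALUES (?, ?, ?, ?, ?)", (cmd, ns, title, ts, prio))

# A summary breakdown no longer has to scan piles of duplicate rows:
print(conn.execute(
    "SELECT job_cmd, COUNT(*) FROM job GROUP BY job_cmd").fetchall())

# Lag: age of the oldest queued job.
oldest = conn.execute("SELECT MIN(job_timestamp) FROM job").fetchone()[0]
print(now - oldest)  # 3600

# Priority: e-mail notifications jump ahead of bulk link refreshes.
print(conn.execute("SELECT job_cmd FROM job "
                   "ORDER BY job_priority DESC, job_id LIMIT 1").fetchone()[0])
```

The trade-off is the one the original design avoided: the unique index slows inserts, so this would need benchmarking before adoption.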
On the Wikimedia setup, we have a number of app servers which additionally serve as "job queue runners", cycling through the various wikis running runJobs.php. There are some issues here as well:
- It can take a while to cycle through 700+ separate sites; while you're running through empty queues, you're not processing a site with a giant queue.
- Wikia, with a much larger number of smaller wikis, has been experimenting with a single shared job queue table to make this easier to handle.
- runJobs.php has been known to hang sometimes when database servers get rearranged. Needs better self-healing?
Job queue runners log activity to files in /home/wikipedia/logs/jobqueue/. These are sometimes useful, but it'd be nice to have a more openly accessible "top"-style view:
- Web-accessible list of which jobs are being run on which wikis on which queue runners, with count, rate & lag info
- Processing rate per machine in Ganglia
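The aggregation such a view needs could look like the sketch below. The (wiki, command, runner, timestamp) record format is hypothetical, since the actual runner log format isn't specified here; a real monitor would parse it out of the jobqueue logs.

```python
# Hedged sketch of the aggregation behind a "top"-style queue-runner view.
# The record format is hypothetical; real runner logs may differ.
from collections import Counter

# (wiki, job_cmd, runner, finished_at) tuples, as a monitor might collect them
records = [
    ("enwiki", "refreshLinks", "srv101", 1000.0),
    ("enwiki", "refreshLinks", "srv101", 1001.5),
    ("dewiki", "enotifNotify", "srv102", 1002.0),
]

def summarize(records, window=60.0, now=1002.0):
    # Count jobs per (wiki, cmd, runner) in the window and derive a rate.
    recent = [r for r in records if now - r[3] <= window]
    counts = Counter((wiki, cmd, runner) for wiki, cmd, runner, _ in recent)
    return {key: (n, n / window) for key, n in counts.items()}

for (wiki, cmd, runner), (n, rate) in sorted(summarize(records).items()):
    print(f"{runner}  {wiki:8s} {cmd:14s} count={n} rate={rate:.3f}/s")
```

Per-machine rates computed this way could also be pushed straight into Ganglia as custom metrics.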