Talk:Requests for comment/Job queue redesign

If this redesign was carried out, please add a note saying that this is the new design as of version # 1.xxx. Thanks.

Sumana Harihareswara, Wikimedia Foundation Volunteer Development Coordinator 20:49, 1 February 2012 (UTC)

Idea dump: HTTP-based lightweight jobs, async job runner

 * jobs can have a handful of priorities
 * each job has an insertion timestamp
 * default job is just a method (GET/POST), a URL and a timestamp, which is deduplicated and run directly by a job runner
 * only uses minimal (http client) resources on the job runner
 * uses normal http status codes for error handling

The low overhead might make it feasible to do full url-based deduplication. Doing HTTP requests is fairly generic and can automatically run a request in the context of a given wiki. Jobs would basically be executed with GET or POST requests to some API end points, which are assumed to be idempotent. Semantically POSTs might be better, but POST data in the job queue would need to be limited.
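The queue described above can be sketched in a few lines. This is only an illustration of the idea, not an existing MediaWiki component: the class names, the "lower number is more urgent" priority convention, the example URLs and the status-code mapping in `outcome_for_status` are all assumptions made for the example.

```python
import heapq
import itertools
import time
from dataclasses import dataclass, field


@dataclass(order=True)
class HttpJob:
    # Only the first three fields take part in heap ordering.
    priority: int          # lower number = more urgent (assumption)
    inserted_at: float     # insertion timestamp
    seq: int               # tie-breaker so otherwise equal jobs stay FIFO
    method: str = field(compare=False)
    url: str = field(compare=False)


class DedupJobQueue:
    """In-memory queue keeping at most one pending job per (method, URL)."""

    def __init__(self):
        self._heap = []
        self._pending = set()            # (method, url) keys currently queued
        self._seq = itertools.count()

    def push(self, method, url, priority=1, now=None):
        """Queue a job; return False if an identical job is already pending."""
        key = (method.upper(), url)
        if key in self._pending:
            return False
        ts = time.time() if now is None else now
        heapq.heappush(
            self._heap, HttpJob(priority, ts, next(self._seq), key[0], url)
        )
        self._pending.add(key)
        return True

    def pop(self):
        """Return the most urgent, oldest job, or None if the queue is empty."""
        if not self._heap:
            return None
        job = heapq.heappop(self._heap)
        self._pending.discard((job.method, job.url))
        return job

    def __len__(self):
        return len(self._heap)


def outcome_for_status(status):
    """Hypothetical mapping from HTTP status code to job outcome."""
    if 200 <= status < 300:
        return "done"
    if status == 429 or status >= 500:
        return "retry"                   # server-side or rate-limit problem
    return "drop"                        # other 4xx: retrying will not help
```

Because the full URL is the deduplication key, two pushes of the same request collapse into one pending job, and the job can be queued again once it has been popped.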

For HTTP, the job runner would limit the number of concurrent requests per host to something like 50-100. SPDY would reduce the overhead in terms of connections.
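A per-host cap like this could be sketched with one semaphore per host. The names `PerHostLimiter` and `run_all` are hypothetical, and `send` stands in for whatever HTTP client coroutine the runner would actually use; the default of 50 is simply the lower end of the range mentioned above.

```python
import asyncio
from urllib.parse import urlsplit


class PerHostLimiter:
    """Caps the number of in-flight requests per host."""

    def __init__(self, max_per_host=50):
        self._max = max_per_host
        self._semaphores = {}            # host -> asyncio.Semaphore

    def _semaphore_for(self, url):
        host = urlsplit(url).netloc
        if host not in self._semaphores:
            self._semaphores[host] = asyncio.Semaphore(self._max)
        return self._semaphores[host]

    async def run(self, url, send):
        """Await send(url) once a slot for the URL's host is free."""
        async with self._semaphore_for(url):
            return await send(url)


async def run_all(urls, send, max_per_host=50):
    """Run send(url) for every URL, respecting the per-host cap."""
    limiter = PerHostLimiter(max_per_host)
    return await asyncio.gather(*(limiter.run(u, send) for u in urls))
```

Requests to different hosts do not block each other; only requests to the same host wait for a free slot.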
 * How does "url-based" deduplication work? You mean include a hash in the URL? Why not have it in the payload? Aaron (talk) 00:09, 27 November 2013 (UTC)
 * On further reflection, it might actually be more efficient to make HTTP requests idempotent and cheap wherever possible at the end points. That way jobs would be executed, but would usually not result in any serious duplicate work. Keeping that state at the end point frees the job queue from needing to keep track of it, and also catches duplicate requests from other sources. -- Gabriel Wicke (GWicke) (talk) 22:16, 7 January 2014 (UTC)
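One way to picture an end point that is idempotent and cheap is a handler that remembers the newest job timestamp it has processed per resource and skips anything not newer. This is a minimal sketch under that assumption; `IdempotentEndpoint`, `do_work` and the in-memory dictionary are hypothetical stand-ins for real end point code and persistent state.

```python
class IdempotentEndpoint:
    """Skips requests already covered by an earlier run for the same resource."""

    def __init__(self, do_work):
        self._do_work = do_work
        self._processed_up_to = {}       # resource -> newest job timestamp handled

    def handle(self, resource, job_timestamp):
        """Run the work if needed; return True if work was actually done."""
        last = self._processed_up_to.get(resource)
        if last is not None and job_timestamp <= last:
            return False                 # duplicate or stale request: cheap no-op
        self._do_work(resource)
        self._processed_up_to[resource] = job_timestamp
        return True
```

With this shape the job queue does not need to track duplicates itself: a repeated request still reaches the end point, but costs only a dictionary lookup.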

Monitoring discussion: 2014-05-06
Notes copied/adapted from http://etherpad.wikimedia.org/p/ScopingJobQueue


 * Nik: Nobody in the room could describe all of the things the job queue does!
 * Tim: We know about the things that are user visible, as those get reported as bugs. I'm talking to Aaron about how to make the job queue run faster. We might be able to make it run 3 times faster. HHVM will make the job queue run faster too. The limit on thread count (number of workers) is determined by what the database can handle rather than by what resources are available to run the job queue. If we want to speed it up, that might need a discussion involving Sean Pringle. Another issue: users don't understand what's going on. Maybe we need a feature for this? A message to show after this?
 * Dan: That's worth thinking about!
 * Rob: There was a time when we didn't have any monitoring on the job queue; if things went bad, the only way we'd know was if somebody was lucky enough to look in the right place or if someone complained that things weren't working (e.g. categories are broken). We'd investigate and say "Ah, job queue is long!". Eventually we put some (admittedly crude) monitoring in place for the queue. We decided to focus only on enwiki in the hope that it was a reasonable proxy that would reliably indicate problems on other wikis, and this did help us pick up on a few times when the job queue did legitimately break. Aaron improved the job queue to make it more robust so the old problems didn't occur as much. Now the job queue is a victim of its own success: it is more versatile and robust, so other people are using it for more things. The delays are now caused not by things being broken but by the queue being used so much; it is a performance issue. As of right now, Faidon has turned the monitoring off because nobody pays attention to it.
 * Aaron: Parsoid must be slower than refresh links jobs.
 * Nik: Parsoid jobs explode in number. Refresh links jobs sometimes spawn more jobs, whereas Parsoid doesn't do that, so the refresh links number may not accurately represent the number of jobs.
 * Aaron: Both Parsoid and refreshLinks should have deduplication.

Current monitoring using showJobs.php script:

 $ mwscript showJobs.php --wiki=enwiki --group
 refreshLinks: 22706 queued;
 htmlCacheUpdate: 166 queued;
 cirrusSearchLinksUpdate: 2 queued; (Nik/Chad)
 cirrusSearchLinksUpdatePrioritized: 0 queued; (Nik/Chad)
 cirrusSearchLinksUpdateSecondary: 628 queued; (Nik/Chad)
 MassMessageJob: 0 queued; (Legoktm)
 ParsoidCacheUpdateJobOnEdit: 0 queued; (Gabriel)
 ParsoidCacheUpdateJobOnDependencyChange: 264971 queued; (Gabriel)
 MWEchoNotificationEmailBundleJob: 0 queued;
 webVideoTranscode: 0 queued; (Gilles)

Possible ways forward:

 * 1) Better monitoring strategy. We need to work out and document how you look at what's going on with the job queue. How do you work out what's normal and what's not? Break the monitoring up per job type so people know who's breaking things and why? How do we do this monitoring? Does this go into Ganglia, Graphite, some other place? Aaron feels that Graphite is feasible and the best option.
 * 2) Do nothing and wait for HHVM to improve things. Probably wouldn't speed things up much at all, since the job queue is limited not by CPU but by database issues.
 * 3) Aaron and Tim think they can speed up the queue by a factor of 3. Allocate a couple of the new database slaves to the job queue so the job queue can go faster? We need to talk to Sean Pringle about this.
 * 4) There's a bug where the number of runners gradually goes down over time. Aaron wrote a patch and merged it.
 * 5) Implement a feature that tells users that some things sometimes take some time due to the job queue?

Outcomes and actionables:

Highest priority items: 1 and 4.

 * 1) We should implement per-job monitoring. This will be put in Graphite, with an Icinga alert (Aaron + Ori). Get this set up, then pick sensible thresholds.
 * 2) Set up monitoring for job execution rate.
 * 3) We should change monitoring to track the amount of actual refreshLinks work that will be done rather than the number of jobs. (Aaron)
 * 4) Document actions to take when monitoring yells at us (on mediawiki.org?). (Per-job responsibility: the owner of each job should write what to do when the number of those jobs gets too high.) Aaron will provide a list of all possible jobs to Dan, filled in with what he knows about who's responsible.
 * 5) Investigate whether it makes sense to develop a feature to tell the users that things may happen asynchronously. (Dan)
 * 6) Make it so that all jobs include the time at which the action that spawned the job was taken.
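As a rough illustration of per-job monitoring, the sketch below parses output shaped like the showJobs.php listing quoted in these notes, formats the counts as lines in Graphite's plaintext protocol (`<path> <value> <timestamp>`), and flags job types above a threshold. The function names, the `jobqueue.enwiki` metric prefix and the threshold values are assumptions for the example, not an agreed design; actually sending the lines to a Graphite server is left out.

```python
import re

# Matches lines such as "refreshLinks: 22706 queued; (owner)".
QUEUE_LINE = re.compile(r"^\s*(\w+):\s*(\d+)\s+queued")


def parse_show_jobs(output):
    """Map job type -> queued count from showJobs.php --group style output."""
    counts = {}
    for line in output.splitlines():
        match = QUEUE_LINE.match(line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


def graphite_lines(counts, timestamp, prefix="jobqueue.enwiki"):
    """Format counts as Graphite plaintext lines, sorted by job type."""
    return [
        f"{prefix}.{job}.queued {queued} {int(timestamp)}"
        for job, queued in sorted(counts.items())
    ]


def over_threshold(counts, thresholds):
    """Return job types whose queue length exceeds their (hypothetical) threshold."""
    return sorted(
        job for job, queued in counts.items()
        if queued > thresholds.get(job, float("inf"))
    )
```

Keeping one metric per job type is what lets the owner of each job see, and be alerted on, only their own backlog.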