User:Daniel Kinzler (WMDE)/Job Queue

This page gives a brief overview of current issues with the JobQueue, to the author's best but incomplete knowledge, and was updated based on the IRC discussion on September 13. The purpose of this page is to provide a starting point for a discussion about the future of the JobQueue mechanism.

Observations:
 * Latest instance of the JQ exploding: https://phabricator.wikimedia.org/T173710


 * With 600k refreshLinks jobs in the backlog of commonswiki, only 7k refreshLinks jobs got processed in a day (not including those run manually from terbium).


 * For wikis with just a few thousand pages, we sometimes see millions of UpdateHtmlCache jobs sitting in the queue.


 * Jobs that were triggered months ago were found to still be failing and retrying


 * Selection of the target wiki in the job runner is random; it does not depend on which wikis have the most jobs pending. Wikis with no pending jobs also get selected, causing overhead (and even an explicit delay? Or is the job runner idling, waiting a bit for jobs to appear?)


 * There is one queue per target wiki.

Issues and considerations:
 * We have lots of jobs that do nothing (e.g. RefreshLinks for a page that doesn't have links - but we don't know that in advance)
 * Jobs re-trying indefinitely: https://phabricator.wikimedia.org/T73853
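The usual mitigation for indefinite retries is a bounded retry policy with exponential backoff. A minimal sketch (the function name, numbers, and give-up behaviour are illustrative, not MediaWiki's actual policy):

```python
def retry_delay(attempt, base=60, cap=86400, max_attempts=5):
    """Return the delay in seconds before the next retry attempt,
    or None to give up. Exponential backoff, capped at one day."""
    if attempt >= max_attempts:
        return None  # stop retrying; drop or dead-letter the job
    return min(cap, base * 2 ** attempt)


# First retry after 60s, then 120s, 240s, ... up to the cap; after
# max_attempts the job is abandoned instead of retrying forever.
print(retry_delay(0))  # 60
print(retry_delay(3))  # 480
print(retry_delay(5))  # None
```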


 * Deduplication
 * The mechanism is obscure/undocumented. Some jobs rely on rootJob parameters, others use custom logic.
 * Batching prevents deduplication. When and how should jobs do batch operations? Can we automatically break up small batches?
 * Delaying jobs may improve deduplication, but support for delayed jobs is limited/obscure.
 * Custom coalescing could improve the chance for deduplication.
 * Kafka / changeprop WIP lib for dedup, delay, rate limiting & retry processing: https://github.com/wikimedia/budgeteer
 * If a lot of jobs are being queued from a given wiki, it makes sense to defer those jobs for a while so that deduplication can take effect
 * Fix for recent dedupe issue: https://github.com/wikimedia/mediawiki/commit/cb7c910ba72bdf4c2c2f5fa7e7dd307f98e5138e
 * mobrovac: an LRU list of job signatures recently seen locally, kept in memory. That would be quick-and-dirty dedupe before push.
 * <_joe_> So regarding deduplication, I am unsure how effective it is, because there is actually no way to tell right now
 * https://grafana.wikimedia.org/dashboard/db/job-queue-rate?panelId=7&fullscreen&orgId=1 (gwicke, 22:00:30)
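mobrovac's in-memory LRU suggestion could look roughly like this (a sketch; the class name and the signature scheme are invented for illustration and do not match MediaWiki's actual deduplication keys):

```python
from collections import OrderedDict


class SignatureLru:
    """Fixed-size LRU of recently seen job signatures. Jobs whose
    signature was seen recently are dropped before being pushed."""

    def __init__(self, capacity=10000):
        self.capacity = capacity
        self._seen = OrderedDict()

    def seen_recently(self, signature):
        """Return True if the signature was seen recently; record it otherwise."""
        if signature in self._seen:
            self._seen.move_to_end(signature)  # refresh recency
            return True
        self._seen[signature] = True
        if len(self._seen) > self.capacity:
            self._seen.popitem(last=False)  # evict least recently seen
        return False


def job_signature(job_type, wiki, params):
    # A deduplicatable job is identified by its type, target wiki, and
    # deduplication-relevant parameters (timestamps excluded).
    return (job_type, wiki, tuple(sorted(params.items())))


lru = SignatureLru(capacity=1000)
sig = job_signature("refreshLinks", "commonswiki", {"page": "File:A.jpg"})
print(lru.seen_recently(sig))  # False: first time, job would be pushed
print(lru.seen_recently(sig))  # True: duplicate, job would be dropped
```

Being purely local and in-memory, this only catches duplicates produced by the same process within the LRU window - hence "quick-and-dirty" dedupe before push, not a replacement for queue-level deduplication.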


 * Scope and purpose of some jobs is unclear. E.g. UpdateHtmlCache invalidates the parser cache, and RefreshLinks re-parses the page - but it does not trigger an UpdateHtmlCache, which it probably should.


 * The throttling mechanism does not take into account the nature and run-time of different job types.
 * Video transcode jobs for TimedMediaHandler vary greatly in run time, from a couple of minutes to many hours. They can be divided into smaller chunks, but how do we best handle a large influx of small chunks from one (or many) upload(s)?
 * Alternative: make the throttling mechanism understand job length variability and use this to help balance longer vs. shorter jobs
 * ex: https://github.com/wikimedia/budgeteer, part of changeprop / kafka effort.
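A cost-based throttle of the kind budgeteer aims at could be sketched as a token bucket in which each job consumes budget proportional to its estimated run time, so a few long transcodes draw down as much budget as many short chunks (illustrative only; this is not budgeteer's actual API):

```python
import time


class CostBudget:
    """Token-bucket budget: tokens refill at a fixed rate, and each
    job consumes tokens proportional to its estimated cost."""

    def __init__(self, tokens_per_sec, burst):
        self.rate = tokens_per_sec
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    def try_acquire(self, cost):
        """Spend `cost` tokens if available; otherwise refuse."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller should delay or requeue the job


budget = CostBudget(tokens_per_sec=1.0, burst=10.0)
print(budget.try_acquire(cost=8.0))  # True: fits within the burst
print(budget.try_acquire(cost=8.0))  # False: budget exhausted, defer the job
```

With per-(job type, wiki) buckets, a flood of transcode chunks from one upload would quickly exhaust that bucket while other job types keep running.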


 * Scheduling
 * Issue: Random per-wiki processing can run up large backlogs in individual wikis.
 * Having fairness of scheduling between wikis was a deliberate design decision, IMHO important and useful
 * The end use case that should remain is that if a wiki is dormant and I schedule one job there, it should run nearly instantly no matter what.
 * Scaling is achieved by running more cron jobs.
 * Having a single queue for all wikis would mean wikis with a large backlog get more job runner time. But maybe too much, starving small wikis?
 * this is roughly the plan for new Kafka system...
 * Can add rate limiting per job type & wiki if needed to prevent abuse. (What happens to rate-limited jobs? Are they retried after an interval, or requeued at the end? -- They are scheduled for delayed execution.)
 * For Wikibase change dispatching (which predates the job queue), we look at which wikis have the most changes pending, pick a set of the most-lagged ones, and then randomly pick one from that set. This makes it more likely that heavily-lagged targets get processed, without starving others.
 * Overhead of selecting/switching target wikis for the JobRunner
 * _joe_: we should confirm whether the problem is the "wasting of time" on subjectively unimportant jobs, or the cycles wasted on checking/switching wikis. The former might be a hard sell.
 * I believe it is spending most of its time waiting for replication lag. A job queue write is not complete until after we wait for all slaves to have replicated the write.
 * Waiting for replication makes sense. DB throughput is a hard limit on job execution, and should be. Batching can improve that, but batching kills deduplication.
 * Maybe the runner could keep track of the average execution time per job type, and consider that value for scheduling fairness, so that a large job on one wiki would count for many small jobs on another wiki
 * https://github.com/wikimedia/budgeteer
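The Wikibase-style selection described above (take the most-lagged wikis, then pick one of them at random) can be sketched as follows (function and variable names are illustrative, not the actual dispatcher code):

```python
import random


def pick_target_wiki(pending_by_wiki, top_k=3, rng=random):
    """Pick the next wiki to process: take the top_k wikis with the
    most pending work, then choose one of those at random. This
    favours heavily-lagged wikis without deterministically starving
    the rest of the candidate set."""
    candidates = sorted(pending_by_wiki, key=pending_by_wiki.get, reverse=True)[:top_k]
    return rng.choice(candidates)


pending = {"commonswiki": 600_000, "enwiki": 40_000, "dewiki": 9_000, "kswiki": 1}
# Only the three most-lagged wikis are candidates here; kswiki is skipped
# until it climbs into the top_k (or until an idle runner picks it up).
print(pick_target_wiki(pending, top_k=3))
```

Note the trade-off against the "dormant wiki, one job, runs nearly instantly" use case above: a pure top-k pick would delay tiny backlogs, so some fallback (e.g. occasionally sampling from all non-empty wikis) would still be needed.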


 * For third parties, I think we should do like WordPress and have a cron.php which you hit from cron with curl
 * The stock MediaWiki job runner (maintenance/runJobs.php) invokes the JobRunner class directly, not over HTTP. For cache and config consistency, we should consider standardising on Special:RunJobs over HTTP.
 * Kafka-based JQ is being tested by Services. Generally saner. Should improve ability to track causality (which job got triggered by which other job). T157088
 * From a wiki user's POV, it should be possible to see the backlog for a wiki, so that you could know "all changes made before have taken effect" or "wait two weeks for all transclusions to update"
 * [with Kafka] there is a combination of concurrency limiting and cost-based rate limiting, with cost typically dominated by execution cost
 * It sounds like the new stack performs an HTTP call to MediaWiki/rpc for each individual job, whereas the current model does it per batch (wiki + job type + batch limits)

Documentation:
 * No support for recurrent jobs. Should we keep using cron?
 * Limited visibility into job queue contents, difficult to inspect
 * Manual:Job queue
 * JobQueue (Doxygen)
 * Aaron's slides (DropBox)
 * Job Queue Health (Grafana)
 * IRC discussion notes 2017-09-13
 * IRC minutes/log 2017-09-13