User:Daniel Kinzler (WMDE)/Job Queue

This page gives a brief overview of current issues with the JobQueue, to the author's best but incomplete knowledge, and was updated based on the IRC discussion on September 13. The purpose of this page is to provide a starting point for a discussion about the future of the JobQueue mechanism.

Observations:

  • Latest instance of the JQ exploding: https://phabricator.wikimedia.org/T173710
  • With 600k refreshLinks jobs in the backlog on commonswiki, only 7k refreshLinks jobs got processed in a day (not including those run manually from terbium).
  • For wikis with just a few thousand pages, we sometimes see millions of htmlCacheUpdate jobs sitting in the queue.
  • Jobs that were triggered months ago were found to continue failing and re-trying
  • Selection of the target wiki in the job runner is random; it does not depend on which wikis have the most jobs pending. Wikis with no jobs pending get selected and cause overhead (and even an explicit delay? Or is the job runner idling, waiting for jobs for a bit?).
  • There is one queue per target wiki.
  • We have lots of jobs that do nothing (e.g. refreshLinks for a page that doesn't have links - but we don't know that in advance)

Issues and considerations:

  • Jobs re-trying indefinitely: https://phabricator.wikimedia.org/T73853
  • Deduplication 
    • The mechanism is obscure/undocumented. Some job types rely on rootJob parameters, others use custom logic.
    • Batching prevents deduplication. When and how should jobs do batch operations? Can we automatically break up small batches?
    • Delaying jobs may improve deduplication, but support for delayed jobs is limited/obscure.
    • Custom coalescing could improve the chance for deduplication.
    • Kafka / changeprop WIP lib for dedup, delay, rate limiting & retry processing: https://github.com/wikimedia/budgeteer
    • if there are a lot of jobs being queued from a given wiki, it makes sense to defer those jobs for a while so that deduplication can take effect
    • Fix for recent dedupe issue: https://github.com/wikimedia/mediawiki/commit/cb7c910ba72bdf4c2c2f5fa7e7dd307f98e5138e
    • <DanielK_WMDE__> mobrovac: a LRU list of job signatures recently seen locally, kept in memory. that would be quick-and-dirty dedupe before push.
      • A sketch of this pre-push dedupe idea is included after this list.
    • <_joe_> So regarding deduplication, I am unsure how effective it is, because there is actually no way to tell right now
      • https://grafana.wikimedia.org/dashboard/db/job-queue-rate?panelId=7&fullscreen&orgId=1 (gwicke, 22:00:30)
  • Scope and purpose of some jobs is unclear. E.g. htmlCacheUpdate invalidates the parser cache, and refreshLinks re-parses the page - but it does not trigger an htmlCacheUpdate, which it probably should.
  • The throttling mechanism does not take into account the nature and run-time of different job types.
    • Video transcode jobs for TimedMediaHandler vary widely in run time, from just a couple of minutes to many hours. They can be divided up into smaller chunks, but how best to handle a large influx of small chunks from one (or many) upload(s)?
      • Alternative: make the throttling mechanism understand job-length variability and use this to help balance longer vs. shorter jobs.
        • e.g. https://github.com/wikimedia/budgeteer, part of the changeprop / Kafka effort.
  • Scheduling
    • Issue: Random per-wiki processing can run up large backlogs in individual wikis.
      • <TimStarling> having fairness of scheduling between wikis was a deliberate design decision, IMHO important and useful
      • <Krinkle> The end use case that should remain is that if a wiki is dormant and I schedule 1 job there, it should run nearly instantly no matter what.
    • Scaling is achieved by running more cron jobs.
    • Having a single queue for all wikis would mean wikis with a large backlog get more job runner time. But maybe too much, starving small wikis?
      • This is roughly the plan for the new Kafka system...
      • Can add rate limiting per job type & wiki if needed to prevent abuse. (What happens to rate-limited jobs? Are they retried after an interval, or requeued at the end? -- They are scheduled for delayed execution.)
        • A token-bucket sketch of per-type, per-wiki rate limiting is included after this list.
    • For Wikibase, in change dispatching (pre job queue), we look at which wikis have the most changes pending, pick a set of the most-lagged ones, and then randomly pick one from that set. This makes it more likely for heavily lagged targets to be processed, without starving others (a sketch of this selection scheme is included after this list).
    • Overhead of selecting/switching target wikis for the JobRunner
      • <Krinkle> _joe_: we should confirm then if the problem is the "wasting of time" on subjective unimportant jobs, or the waste on cycles checking/switching wikis. The former might be a hard sell.
      • <Krinkle> I believe it is spending most time waiting for replag. A job queue write is not complete until after we wait for all slaves to have replicated the write.
        • Waiting for replication makes sense: DB throughput is a hard limit on job execution, and should be. Batching can improve that, but batching kills deduplication.
    • Maybe the runner could keep track of the average execution time per job type and consider that value for scheduling fairness, so a large job on one wiki would count as many small jobs on another wiki (see the cost-aware scheduling sketch after this list).
  • <TimStarling> for third parties I think we should do like wordpress and have a cron.php which you hit from cron with curl
    • The stock MediaWiki job runner (maintenance/runJobs.php) invokes the JobRunner class directly, not over HTTP. For cache and config consistency, we should consider standardising on Special:RunJobs over HTTP.
  • Kafka-based JQ is being tested by Services. Generally saner. Should improve ability to track causality (which job got triggered by which other job). T157088
    • <Platonides> from a wiki user POV, it should be possible to see the backlog for a wiki, so that you could know "all changes made before <two weeks> have taken effect" or "wait two weeks for all transclusions to update"
    • [with kafka] there is a combination of concurrency limiting, and cost-based rate limiting; with cost being typically dominated by execution cost
    • <Krinkle> It sounds like the new stack performs an HTTP call to MediaWiki/rpc for each individual job, whereas the current model does it per batch (wiki+job type+batch limits)
  • No support for recurrent jobs. Should we keep using cron?
  • Limited visibility into job queue contents, difficult to inspect
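
Illustrative sketches (Python, for illustration only; class and function names below are hypothetical, not actual MediaWiki, changeprop or budgeteer APIs):

A minimal sketch of the quick-and-dirty pre-push dedupe idea quoted in the deduplication section: an in-memory LRU set of recently seen job signatures, consulted before pushing a job. It assumes a job's deduplication signature can be derived from its type, target wiki and parameters; it only catches exact duplicates pushed from the same process, and would complement rather than replace the rootJob mechanism.

<syntaxhighlight lang="python">
from collections import OrderedDict

class RecentJobSignatures:
    """Fixed-size LRU set of job signatures recently seen by this process."""

    def __init__(self, capacity=10000):
        self.capacity = capacity
        self.seen = OrderedDict()  # signature -> None, ordered by recency

    def check_and_add(self, signature):
        """Return True if the job should be pushed, False if it is a recent duplicate."""
        if signature in self.seen:
            self.seen.move_to_end(signature)   # refresh recency
            return False
        self.seen[signature] = None
        if len(self.seen) > self.capacity:
            self.seen.popitem(last=False)      # evict the least recently seen entry
        return True

def job_signature(job_type, wiki, params):
    # Hypothetical signature: job type + target wiki + the parameters relevant for dedup.
    return (job_type, wiki, tuple(sorted(params.items())))

# Usage: drop duplicates locally before handing jobs to the real push mechanism.
recent = RecentJobSignatures(capacity=5000)
for job in [("refreshLinks", "commonswiki", {"page": "File:Example.jpg"}),
            ("refreshLinks", "commonswiki", {"page": "File:Example.jpg"})]:
    action = "push" if recent.check_and_add(job_signature(*job)) else "skip duplicate"
    print(action, job)
</syntaxhighlight>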
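
A sketch of per-job-type, per-wiki rate limiting as mentioned under scheduling: a token bucket that, instead of dropping a rate-limited job, returns the delay after which it should be scheduled for delayed execution. The rate and burst values are placeholders; this is not how changeprop or budgeteer actually implement it.

<syntaxhighlight lang="python">
import time

class PerTypePerWikiRateLimiter:
    """Token bucket per (job type, wiki): allow up to `rate` jobs per second,
    with bursts of up to `burst` jobs."""

    def __init__(self, rate=10.0, burst=50.0):
        self.rate, self.burst = rate, burst
        self.tokens = {}    # remaining tokens per (job_type, wiki)
        self.updated = {}   # time of the last refill per (job_type, wiki)

    def acquire(self, job_type, wiki):
        key = (job_type, wiki)
        now = time.monotonic()
        last = self.updated.get(key, now)
        # Refill tokens for the elapsed time, capped at the burst size.
        self.tokens[key] = min(self.burst,
                               self.tokens.get(key, self.burst) + (now - last) * self.rate)
        self.updated[key] = now
        if self.tokens[key] >= 1.0:
            self.tokens[key] -= 1.0
            return 0.0                                   # run now
        return (1.0 - self.tokens[key]) / self.rate      # delay before re-queueing

# A delay of 0.0 means "execute now"; anything else means "schedule for delayed
# execution after that many seconds" rather than dropping the job.
limiter = PerTypePerWikiRateLimiter(rate=2.0, burst=5.0)
print(limiter.acquire("webVideoTranscode", "commonswiki"))
</syntaxhighlight>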
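
A sketch of the Wikibase-style target selection described above: take the wikis with the most pending work, then pick one of them at random, so heavily lagged wikis are served often without completely starving the others, and idle wikis are never selected. The pool size of 3 is an arbitrary illustration.

<syntaxhighlight lang="python">
import random

def pick_target_wiki(pending_counts, pool_size=3):
    """Take the pool_size wikis with the most pending jobs/changes, then pick
    one of them at random."""
    backlogged = [wiki for wiki, count in pending_counts.items() if count > 0]
    candidates = sorted(backlogged, key=pending_counts.get, reverse=True)[:pool_size]
    return random.choice(candidates) if candidates else None

# commonswiki is always in the candidate pool, but dewiki and enwiki still get
# a fair share; nlwiki, with an empty backlog, is never selected, so no runner
# time is wasted switching to it.
pending = {"commonswiki": 600000, "dewiki": 1200, "enwiki": 800, "nlwiki": 0}
print([pick_target_wiki(pending) for _ in range(5)])
</syntaxhighlight>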
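
A sketch of cost-aware scheduling fairness as suggested above: keep a moving average of execution time per job type, charge each wiki the estimated cost of the jobs it runs, and give the runner to the least-charged wiki, so that one long transcode job counts as many short refreshLinks jobs. The moving-average weight and the default cost of one second are arbitrary placeholders.

<syntaxhighlight lang="python">
from collections import defaultdict

class CostAwareScheduler:
    """Fairness by execution cost rather than by job count."""

    def __init__(self, alpha=0.2):
        self.alpha = alpha                          # weight of the newest runtime sample
        self.avg_cost = defaultdict(lambda: 1.0)    # estimated seconds per job type
        self.charged = defaultdict(float)           # estimated cost charged to each wiki

    def pick_wiki(self, wikis_with_pending_jobs):
        # Run jobs for the wiki that has consumed the least (estimated) runner time.
        return min(wikis_with_pending_jobs, key=lambda wiki: self.charged[wiki])

    def charge(self, wiki, job_type):
        # Charge the wiki for one job of this type at its current estimated cost.
        self.charged[wiki] += self.avg_cost[job_type]

    def observe_runtime(self, job_type, runtime_seconds):
        # Fold an observed runtime into the per-type moving average.
        self.avg_cost[job_type] = ((1 - self.alpha) * self.avg_cost[job_type]
                                   + self.alpha * runtime_seconds)

# After observing that transcodes are expensive, one transcode charged to
# commonswiki outweighs several refreshLinks jobs charged to dewiki.
sched = CostAwareScheduler()
sched.observe_runtime("webVideoTranscode", 1800.0)
sched.observe_runtime("refreshLinks", 0.2)
sched.charge("commonswiki", "webVideoTranscode")
for _ in range(5):
    sched.charge("dewiki", "refreshLinks")
print(sched.pick_wiki(["commonswiki", "dewiki"]))   # -> dewiki
</syntaxhighlight>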

Documentation: