Topic on Extension talk:Replace Text

Suggestion: adding an option to the replaceAll.php script to add changes to the job queue, like the special page

MyWikis-JeffreyWang (talkcontribs)

While using Replace Text to make a massive number of changes to a wiki (25,000+ pages), the database would often die because of the stress it was being put under.

I tried using both the CLI script and Special:ReplaceText. The problem with the CLI script is that if it dies, it has to start over when re-run, redoing the SELECT SQL query that finds all the pages to replace. If the CLI script keeps dying, repeating this SQL query becomes really inefficient, especially if you increase the limit from 250 to, say, 1000.

The benefit of the job queue is that if it dies, it can pick up right where it left off. Furthermore, the SELECT query only needs to run once. The SELECT query usually won't die, and even if it does, a single query is more likely to succeed on a retry than 250+ consecutive changes to the database.

Perhaps it would be better to add an option to the CLI script that adds the pages to be replaced to the job queue, just like the special page does?

Sorry if this is the wrong place to make the suggestion; I can move it to Phabricator or somewhere else if that's preferred.

Yaron Koren (talkcontribs)

Sorry for the delay. This is certainly an option... one other option is to add some flag to replaceAll.php that would let you set a "wait" period between one replacement and the next - say, one second. (Which, for 25,000 pages, would add about 7 hours to the overall running time.) Or, the script could run replacements in batches of 100 and then wait some time between each batch and the next. Any thoughts on those different approaches?
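
To make those two ideas concrete, the invocations might look something like this - purely illustrative, since neither flag exists in replaceAll.php today:

    # Hypothetical flags, for illustration only - neither exists yet.
    # Pause ~1 second between replacements (roughly 7 extra hours for 25,000 pages):
    php extensions/ReplaceText/replaceAll.php 'old text' 'new text' --sleep 1

    # Or: run batches of 100 replacements and pause 10 seconds between batches:
    php extensions/ReplaceText/replaceAll.php 'old text' 'new text' --batch-size 100 --sleep 10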

MyWikis-JeffreyWang (talkcontribs)

Thanks Yaron for the reply! I was able to get through 25,000+ changes in less than 7 hours. Here's my experience:

  • Trying to do a SELECT for all 25,000+ changes to be made also crashed the DB. So maybe doing one big SELECT won't work either.
  • SELECTing 1000-2000 at a time worked well. The limitation is that you need to run the job queue before doing another SELECT, but this still works decently well.
  • The job queue would still often crash, but at least the jobs that had already run did not have to be repeated.

I ended up just setting $wgReplaceTextResultsLimit to 1000, entering the same query into the web form each time, and then running the job queue. The most annoying part was entering the same query 10+ times into the form.

I think running replacements in smaller batches would be the most prudent approach. This is what Special:ReplaceText already does anyway, and it works really well.

Here's some pseudocode I thought of, which would've worked perfectly in my case (a rough shell sketch follows the list):

  • Add a flag to specify how many replacements (let's call this amount n) to do at a time (essentially an equivalent of $wgReplaceTextResultsLimit for the replaceAll.php script)
  • SELECT the first n
  • run php maintenance/runJobs.php
    • if it exits with an error, sleep(5) and then call php maintenance/runJobs.php again
    • keep retrying until it returns exit code 0 and the job queue is finished
  • Loop back, SELECT the next n, and repeat until the SELECT query returns nothing
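
As a shell wrapper, that might look roughly like the sketch below. To be clear about the assumptions: the --limit flag is hypothetical (replaceAll.php doesn't have it today), the sketch assumes replaceAll.php enqueues its replacements as jobs instead of writing them directly, and it assumes nothing else is sitting in the job queue. runJobs.php and showJobs.php are the standard MediaWiki maintenance scripts.

    #!/bin/bash
    # Hypothetical driver loop; 'old text' and 'new text' are placeholders.
    N=1000
    while true; do
        # Enqueue up to N replacement jobs (--limit is the hypothetical flag;
        # --yes skips the interactive confirmation prompt).
        php extensions/ReplaceText/replaceAll.php 'old text' 'new text' --yes --limit "$N"

        # showJobs.php prints the number of queued jobs. If the SELECT found
        # nothing, no jobs were enqueued and we're done (this assumes no
        # unrelated jobs are queued at the same time).
        if [ "$(php maintenance/showJobs.php)" -eq 0 ]; then
            break
        fi

        # Drain the queue, retrying after a short pause whenever runJobs.php
        # dies partway through; jobs that already ran are not repeated.
        until php maintenance/runJobs.php; do
            sleep 5
        done
    done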

The goal here is to minimize the number of times that php extensions/ReplaceText/replaceAll.php has to be called from the shell, or at the very least to make sure it can handle sudden interruptions.

Yaron Koren (talkcontribs)

It's good to know that even the SELECT query by itself can crash the server. Given that, maybe the best solution is just to add a "limit" flag to the script, to let you limit each run to some number of replacements? Would using the job queue still be necessary at that point?

MyWikis-JeffreyWang (talkcontribs)

I would still prefer the job queue. Let's say we SELECT 100 pages, run through 25 changes, and then the script crashes. With the job queue, we can just finish the remaining 75 changes. Without it, we would SELECT another 100, presumably including those same 75, which wastes the computing effort from the first SELECT. TL;DR: not using the job queue means more redundant SELECTing. Though I suppose it would not make much of a difference if the limit is set to a reasonably low number, so it is probably fine to skip the job queue, since that would be easier to code. In that case, my only hope is that the script can loop the select/replace of n pages each time (n = limit flag), so there's no need to call the script 100 times by hand - something like the sketch below.
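
For illustration, here is the behavior I'm hoping for, written as an external shell loop - ideally this loop would live inside the script itself. Both the --limit flag and the "Replaced N pages" summary line the loop greps for are made up:

    #!/bin/bash
    # Hypothetical sketch: replace up to N pages per run, looping until a
    # run reports that nothing was left to replace.
    N=100
    while true; do
        out=$(php extensions/ReplaceText/replaceAll.php 'old text' 'new text' --yes --limit "$N")
        echo "$out"
        # Made-up output format: stop once a run replaces zero pages.
        echo "$out" | grep -q '^Replaced 0 pages' && break
    done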

Yaron Koren (talkcontribs)

Oh, that's very interesting - it sounds like it's really just the SELECT that's causing the problem for you. If so, that might make the solution very easy - no need for a new "limit" parameter, just have the script do, say, 100 replacements at a time, and keep going until there are no more replacements to be made, rather than trying to do them all at once. Would that work? And if so, what's a reasonable number of replacements to do each time?
