User:Pinkgothic/DPL

From MediaWiki.org

Scope

Objective

A significant performance gain on wikis that use the third-party DynamicPageList (DPL) extension.

Solution

  1. In includes/LinksUpdate.php find the method queueRecursiveJobs().
  2. Wrap the RefreshLinksJob2 call in an if statement so that it runs only when the edited page is in the Template namespace; for any other page, queue a plain RefreshLinksJob instead:
function queueRecursiveJobs() {
    global $wgUpdateRowsPerJob;
    wfProfileIn( __METHOD__ );

    // Only fan out to the transcluding pages when the edited page is a template.
    $recurse = ( $this->mTitle->getNamespace() === NS_TEMPLATE );
    $cache = $this->mTitle->getBacklinkCache();
    $batches = $cache->partition( 'templatelinks', $wgUpdateRowsPerJob );
    if ( !$batches ) {
        wfProfileOut( __METHOD__ );
        return;
    }
    $jobs = array();
    foreach ( $batches as $batch ) {
        list( $start, $end ) = $batch;
        $params = array(
            'start' => $start,
            'end' => $end,
        );
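        if ( $recurse ) {
            // Template edit: refresh the links of the pages in this backlink batch.
            $jobs[] = new RefreshLinksJob2( $this->mTitle, $params );
        } else {
            // Any other page: only refresh the edited page's own links.
            $jobs[] = new RefreshLinksJob( $this->mTitle, $params );
        }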
    }
    Job::batchInsert( $jobs );

    wfProfileOut( __METHOD__ );
}

You're done. Wasn't that easy?

Background

a/k/a "...wait a second, exactly why the hell does that work?!"

MediaWiki 101

One thing you have to know about MediaWiki is that, aside from saving the text of every page when you hit the magical [Save page] button, it also parses out any internal links you've put into the page and stores this information in a separate table in the database. That makes several essential functions in MediaWiki's user interface significantly faster, and makes things like 'Wanted pages' and the Toolbox's What links here? and Related changes possible.

You may also be aware that MediaWiki lets you use templates - pages that you can include in the flow of another page.

Now, imagine you have a template that contains an internal link, and you've included it on a dozen other pages. Sensibly, MediaWiki assumes that the links on those dozen pages are important to you in their entirety, so the internal link your template provides counts as a link from each of the dozen pages as well. Less cryptically stated: assuming your template is uninventively called Template:Link, you have it included on the pages A, B, C and D, and the template links to the (equally uninventively titled) page Foo, MediaWiki will want to store the following information for you:

            A --> Foo
            B --> Foo
            C --> Foo
            D --> Foo
Template:Link --> Foo

Additionally, it will store in the templatelinks table that the pages A, B, C and D are using the template 'Link'.

So, taking a look at your page Foo's What links here? will list:

* A
* B
* C
* D
* Template:Link

And looking at your template's What links here? will list:

* A (transclusion)
* B (transclusion)
* C (transclusion)
* D (transclusion)
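
The bookkeeping above can be sketched as a toy model in Python. This is not MediaWiki's code or schema: the two sets merely stand in for the link table and the templatelinks table, and the helper name what_links_here is made up for illustration.

```python
# Toy model of the link bookkeeping described above (illustrative only).
# Each record is a (source page, target page) pair.

TEMPLATE = "Template:Link"
USERS = ["A", "B", "C", "D"]  # pages that include the template

# Stand-in for the table of internal links: the template links to Foo,
# and so does every page that includes the template.
pagelinks = {(TEMPLATE, "Foo")} | {(page, "Foo") for page in USERS}

# Stand-in for the templatelinks table: which page transcludes what.
templatelinks = {(page, TEMPLATE) for page in USERS}


def what_links_here(target):
    """Return the backlinks of `target`, marking transclusions like the UI does."""
    links = {src for (src, dst) in pagelinks if dst == target}
    transclusions = {src for (src, dst) in templatelinks if dst == target}
    lines = [f"{src} (transclusion)" for src in sorted(transclusions)]
    lines += sorted(links - transclusions)
    return lines


print(what_links_here("Foo"))
print(what_links_here(TEMPLATE))
```

The point of the model is only that one template link fans out into one link record per including page, which is what makes a template edit expensive.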

What happens if you change the link to 'Bar'? All the 'Foo' records that exist because of Template:Link have to change. In other words, if you change the link on the template, MediaWiki detects a change to the template and assumes it has to rebuild the link table entries for the pages that include it. MediaWiki recognises that this is potentially a costly process that shouldn't slow you down in your editing, though, so it only rebuilds the links for the template itself right away and queues the link updates for the other pages for later. Since the template is used on four pages, there are four jobs - one for each page.

Assuming you haven't changed the job run rate, each time someone visits a page on your wiki, one job is taken care of. That means that as far as the page Bar is concerned, right after you hit [Save page] on the initial edit, only the template is linking to it:

Template:Link --> Bar

The next time a page loads, the first job runs, and Bar knows of one more link to it:

            A --> Bar
Template:Link --> Bar

The next time a page loads, the next job runs...

            A --> Bar
            B --> Bar
Template:Link --> Bar

And so on and so forth until all our four jobs have completed.
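
The step-by-step picture above can be replayed with a small simulation. It is a sketch under the assumptions stated in the text (one job per including page, one job run per page load); the function names are hypothetical and nothing here is MediaWiki's actual job queue code.

```python
from collections import deque

# Toy job queue: one refreshLinks job per page that includes the template.
# Assumes the default behaviour described above: one job per page load.


def save_template(pagelinks, users, template, old, new):
    """Re-point the template's own link right away; queue everything else."""
    pagelinks.discard((template, old))
    pagelinks.add((template, new))
    return deque(users)  # one pending job per including page


def page_load(pagelinks, queue, old, new):
    """Run a single queued job, if any, as a side effect of a page view."""
    if queue:
        page = queue.popleft()
        pagelinks.discard((page, old))
        pagelinks.add((page, new))


def backlinks(pagelinks, target):
    """Sorted list of pages currently recorded as linking to `target`."""
    return sorted(src for (src, dst) in pagelinks if dst == target)


users = ["A", "B", "C", "D"]
pagelinks = {(p, "Foo") for p in users + ["Template:Link"]}

queue = save_template(pagelinks, users, "Template:Link", "Foo", "Bar")
history = [backlinks(pagelinks, "Bar")]  # state right after the save
while queue:
    page_load(pagelinks, queue, "Foo", "Bar")
    history.append(backlinks(pagelinks, "Bar"))

for step in history:
    print(step)
```

Each printed line corresponds to one of the states of Bar's backlinks shown above: first only the template, then one more page per page load.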

MediaWiki determines the links by building the whole page and parsing out internal links. [how?] So far, so good!

In closing, know that MediaWiki also allows you to transclude pages that aren't templates. Keep that in mind. It'll be important in the next section.

DPL 101

The third-party DynamicPageList (DPL) extension lets you create dynamic lists of an almost entirely arbitrary subset of your wiki's pages on any other wiki page. The invocation of a DPL query is fairly complex and quite powerful, but an example should be enough to understand the basics:

<DPL>
category = All extensions
includepage = {Extension}:version:mediawiki
allowcachedresults = true
table = class="wikitable sortable" style="vertical-align:top",Extension,Version,Runs on mediawiki
</DPL>

A query like that would pull all pages in the category "All extensions", then parse them for the template "Extension" and determine the values of its parameters 'version' and 'mediawiki'. For example, while parsing the page Extension:DynamicPageList (third-party) (which is in the "All extensions" category), it would find these lines:

[Image: DPL-query-template-example (for pinkgothic's DPL userpage).png]

The generated table would look something like this:

Extension                               | Version | Runs on mediawiki
...
Extension:DynamicPageList (third-party) | 1.8.9   | 1.7 .. 1.16+
...

A query like that can take DPL several seconds to execute. Since the authors of DPL know this, they allow DPL to be cached along with the rest of the page. Great!

The problem is that rather than parsing the page itself and speaking directly with the database, DPL uses MediaWiki's transclusion ability to generate each row. In other words, if mediawiki.org had DPL installed and you placed the above DPL query onto a page, the What links here? of Extension:DynamicPageList_(third-party) would show, amongst everything else:

* Your DPL query page (transclusion)

As well as, if you tell WhatLinksHere not to show transclusions:

* Your DPL query page

The transclusion is the row itself - the values we've pulled from the content. As far as MediaWiki is concerned, that's a full page transclusion, because of the way DPL implements it. The link is simply the link in the first column to the actual page, as one would expect.

The reason DPL uses transclusion is to allow sane cache handling. While MediaWiki doesn't know which parts of the transcluded page are displayed by DPL, it does know that the page is transcluded, so each change to the transcluded page invalidates the cached DPL table. That way, DPL is on the safe side.

The less obvious side-effect, however, is that the links for the DPL query page have to be rebuilt as well. Remember, for MediaWiki, page transclusion is essentially the same as template transclusion, and it has no way of knowing that DPL only wants to display a part of the page. In consequence, MediaWiki doesn't know that a link you change somewhere in the depths of the page won't appear on the DPL query page. It assumes the link could appear there (a reasonable stance, since the data it has doesn't let it be any more precise in a performant way) and thus queues the DPL query page for a refreshLinks job.
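
To make the mismatch concrete, here is a minimal sketch in Python. The page text is invented (the real template call is only known from the example values 1.8.9 and 1.7 .. 1.16+), the helper names are hypothetical, and the regular expressions are a deliberately naive stand-in for what DPL and the MediaWiki parser really do.

```python
import re


def template_param(wikitext, template, param):
    """Value of `param` in the first call to `template`, or None.

    Deliberately naive: assumes no nested templates or links inside the call.
    """
    call = re.search(r"\{\{\s*" + re.escape(template) + r"\b(.*?)\}\}", wikitext, re.S)
    if call is None:
        return None
    value = re.search(r"\|\s*" + re.escape(param) + r"\s*=([^|}]*)", call.group(1))
    return value.group(1).strip() if value else None


def dpl_row(title, wikitext):
    """What a query like includepage = {Extension}:version:mediawiki would show."""
    return (
        title,
        template_param(wikitext, "Extension", "version"),
        template_param(wikitext, "Extension", "mediawiki"),
    )


def internal_links(wikitext):
    """Targets of [[...]] links, ignoring labels and section anchors."""
    return set(re.findall(r"\[\[([^\]|#]+)", wikitext))


TITLE = "Extension:DynamicPageList (third-party)"
before = "{{Extension\n|version=1.8.9\n|mediawiki=1.7 .. 1.16+\n}}\nSee [[Foo]] for details."
after = before.replace("[[Foo]]", "[[Bar]]")  # an edit deep in the page body

print(dpl_row(TITLE, before) == dpl_row(TITLE, after))  # compares the DPL rows
print(internal_links(before), internal_links(after))    # the page's own links
```

The row DPL would display is identical before and after the edit, while the queried page's own links differ. MediaWiki only sees "a transcluded page changed", so it has to queue the DPL query page either way.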

The road to disaster

Let's assume you have a hundred pages that all use the template 'Document', and you use DPL on several other wiki pages to group these documents in some way. Maybe the documents exist as the, well, documentation for a software product, and your wiki has one page per software release. On each of those pages, you want to show a list of documents associated with that release.

So you might have the following DPL call on the page Release 0.1:

<DPL>
category = Documents
includepage = {Document}:version
includematch = /\|version=0\.1/s
table = class="wikitable sortable" style="vertical-align:top",Document,Version
allowcachedresults = true
</DPL>

And then you also have an All releases page (why not, right?):

<DPL>
category = Documents
includepage = {Document}:version
table = class="wikitable sortable" style="vertical-align:top",Document,Version
allowcachedresults = true
</DPL>

And then you have another page Security documents that lists documents by whether or not they are about security topics:

<DPL>
category = Documents
includepage = {Document}:version
includematch = /\|security=yes/s
table = class="wikitable sortable" style="vertical-align:top",Document,Version
allowcachedresults = true
</DPL>

And maybe a few more in that vein, so that at the end of the day, each of your hundred document pages appears in a handful of DPL queries.

What happens on a queried page's edit

We already know what happens when one of the documents is edited: all DPL query pages that contain it in their results have their links queued for rebuilding.

For our example set, that means that making a change (any change!) to Document001 (which so happens to be relevant for version 0.1 of your software project) will queue Release 0.1, All releases, Security documents and all those other DPL queries relevant to the page for a trip to refreshLinks.

So, from that edit's moment on, the DPL query pages are all in the job queue table.
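
As a rough sketch of that fan-out, the three example queries can be reduced to their includematch patterns and checked against some invented page text. The document contents and function names below are hypothetical, and the model assumes, as described above, that a DPL query page is queued exactly when the edited document is in its result set.

```python
import re

# Hypothetical document pages; the wikitext is invented for illustration.
documents = {
    "Document001": "{{Document|version=0.1|security=yes}} Text about [[Login]].",
    "Document002": "{{Document|version=0.2|security=no}} Text about [[Export]].",
}

# The three DPL query pages from the example, reduced to their includematch
# pattern (None means no includematch, so every document matches).
# The /s flag of the DPL patterns corresponds to re.S here.
queries = {
    "Release 0.1": r"\|version=0\.1",
    "All releases": None,
    "Security documents": r"\|security=yes",
}


def query_pages_showing(title):
    """DPL query pages whose result set contains the given document."""
    text = documents[title]
    return sorted(
        page
        for page, pattern in queries.items()
        if pattern is None or re.search(pattern, text, re.S)
    )


def jobs_queued_by_edit(title):
    """One refreshLinks job per DPL query page that transcludes the document."""
    return query_pages_showing(title)


print(jobs_queued_by_edit("Document001"))
print(jobs_queued_by_edit("Document002"))
```

Note that the kind of edit never enters the calculation: fixing a typo in Document001 queues exactly the same pages as changing its version number.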

What happens on page load

On page load, MediaWiki helpfully takes the first job out of the job queue. Let's say it's the one for the page Release 0.1. It wants to find all internal links in the page, so it parses the entire page, which bypasses the cache [really?] and triggers DPL to struggle through its sluggish table building routine.

Meanwhile, the person viewing the page doesn't know this. The person just notices that the page is taking several seconds to respond, for no discernible reason.

Why does the job queue do this? Well, to quote what the job queue was designed for:

The job queue is designed to hold many short tasks using batch processing.

Emphasis mine (on "short"). DPL table regeneration is not a short task by any definition of the word. It shouldn't be in there. The job queue can't know it's being abused, though. Refreshing page links usually doesn't do anything out of the ordinary. A non-DPL-query refreshLinks takes a fraction of a second!

But each edit to a DPL-queried page, regardless of whether the change will show up in the DPL result table(s) or not, triggers a job for each of the relevant DPL query pages to have its links refreshed. That is to say, after every edit to a DPL-queried page, your wiki will respond slowly for as many page loads as it takes to work through those jobs. If your page is in the result set of ten DPL query pages, it'll take ten page loads for things to return to normal... right up until the next change to one of the queried pages triggers the next inevitable job cascade. (All of this assumes your $wgJobRunRate is unchanged at the default of 1 job per page load.)
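
The arithmetic is simple enough to write down. The following is a back-of-the-envelope sketch, not a measurement: the function names are hypothetical, and the second function models only the rule the fix introduces (fan out to transcluding pages for template edits, not for other pages).

```python
import math

NS_MAIN = 0       # MediaWiki's namespace number for ordinary pages
NS_TEMPLATE = 10  # ... and for templates


def slow_page_loads(queued_dpl_pages, job_run_rate=1):
    """Page loads it takes to drain the queue at `job_run_rate` jobs per load."""
    return math.ceil(queued_dpl_pages / job_run_rate)


def dpl_jobs_for_edit(edited_namespace, transcluding_dpl_pages, patched):
    """DPL re-render jobs caused by one edit, before and after the fix.

    Models only the rule that the patched code fans out to transcluding
    pages when the edited page is a template, and not otherwise.
    """
    if patched and edited_namespace != NS_TEMPLATE:
        return 0
    return transcluding_dpl_pages


jobs_before = dpl_jobs_for_edit(NS_MAIN, 10, patched=False)
jobs_after = dpl_jobs_for_edit(NS_MAIN, 10, patched=True)
print(slow_page_loads(jobs_before), slow_page_loads(jobs_after))
```

For the ten-query example, the unpatched wiki pays for ten slow page loads per edit, while the patched one queues no DPL re-renders for an edit to an ordinary page. Edits to real templates still fan out as before.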

The saddest part of this effect is twofold:

  1. In most cases, it's effectively a null edit. You're probably aggregating indices and overviews with DPL, so the bulk of each queried page's content isn't part of any query. However, that bulk is where most of the changes happen. If a change happens in the non-queried portion of a page, the job runs for naught - the links on the DPL result page don't change at all!
  2. In the cases it isn't, presumably, no one cares. If you put a link into some aspect of your page being listed by DPL, the information of semantic importance is the link from the queried page to wherever you're linking to. That the index now also contains this link is probably just accidental or at the very most simply incidental. There's no semantic information being conveyed here.

It's at that point you might flirt with the idea of simply refusing to schedule the refreshLinks job for DPL's transclusions.

And that's what my fix hopes to achieve.

However: My fix simply denies refreshLinks to any non-template transclusions. Thus, it does have side-effects:

Side effects

If your wiki encourages non-Template page transclusion, you'll want to run refreshLinks.php as a cron job (crontab -e, then enter 0 * * * * php /path/to/your/mediawiki/maintenance/refreshLinks.php for an hourly run). This is still a better solution than using the job queue for this purpose.

Additionally, even with the refreshLinks.php cron job, newly created pages that are picked up by DPL queries will not have the links from those DPL queries registering in their 'What links here?', and deleted pages that were once picked up by DPL queries will show up in the wanted pages list. That is because refreshLinks.php does not execute extension hooks [thanks, Bawolff!]. You can fix that by doing a null edit on the DPL query pages that are affected. A patch for this use-case is in the works, see this page's discussion.