User talk:Pinkgothic/DPL

From MediaWiki.org
Jump to navigation Jump to search

Patch discussion[edit]

"If your wiki encourages non-Template page transclusion, you'll want to run refreshLinks.php as a cron job (crontab -e, enter 0 * * * * /path/to/your/mediawiki/maintenance/refreshLinks.php for an hourly run). This is still a better solution than using the job queue for this purpose. Seems like better advice would be to just run the job queue via runJobs.php and disable running jobs from web interface. Also, wouldn't simply using the dplcache parameter accomplish everything you do here, but without the side effects of making the job queue not work for non-templates? Bawolff 21:12, 28 January 2011 (UTC)

Hi, Bawolff, sorry for the late reply, it didn't notify me of the change to the discussion page! (Found out why; apparently I set my preferences not to ages ago. Epic fail! *harakiri*)
We use the cache parameter on all our DPLs, it doesn't make a difference to this scenario. mediawiki seems to bypass that setting entirely when it rebuilds the links.
Regarding your runJobs.php comment, that may actually be necessary.* The if-statement comes with the downside that the main link DPL generates is also lost (the not-(transclusion) one). :[ I'm still trying to figure out exactly why that is or how to fix it. I think that might be a case of the cached-results setting actually being a bad idea. v_v I'm going to see what I can find out there.
Point being, refreshLinks is unfortunately pointless. I'm not sure why that is, but it doesn't seem to do anything at all. :|
* The reason I didn't suggest runJobs.php, by the way, is because the jobs put into the Job Queue are largely null edits to begin with. My aim was to get rid of the unnecessary overhead. refreshLinks.php on our wiki would run for about seven minutes and then be done. A runJobs.php cron would... actually probably never terminate, since at least during mid-day we'd get in jobs faster than the jobs were finishing (back when I was trying to figure this problem out, I was manually triggering runJobs.php to see if taking jobs off their request-trigger was speeding things up considerably (it was!), that ran absolutely forever - it did terminate, but usually well after running for an hour). Additionally, running other jobs, like "proper" transclusion (read as: templates) I still wanted to have web-triggered.
-pinkgothic 10:33, 10 February 2011 (UTC)
Turns out the problem isn't my fix (directly). Commenting the if-Statement back out still has the same issue with newly created pages - but a DPL-query page refreshLinks2-Job evidently eventually fixes it. So each time an established page was edited, we got the new pages in. Now we just get the effect when we null-edit the DPL-query page(s). &action=purge does nothing, interestingly enough. *headdesk*
Guess it's back to the drawing board. -pinkgothic 11:03, 10 February 2011 (UTC)
I'm not really sure how DPL does caching, but I do know it invents its own way instead of using the mediawiki way, and thus its possible its caching might not be implemented that well (it looks as if it wouldn't work well on multi-server setups, it looks kind of scary when considered with apache content negotiation). refreshLinks.php doesn't render extension hooks (including DPL), which is why it might be faster then the jobQueue. The jobQueue does some other stuff to (like email notification) which might slow it down compared to refresh links, but really should not. Bawolff 13:41, 10 February 2011 (UTC)

*nods* "refreshLinks.php doesn't render extension hooks (including DPL)" - That'll be why it doesn't heal the DPL query pages, and indeed why it's faster. (After doing some math, refreshLinks.php on our wiki, if it loaded extension hooks, would take at least ~15 minutes, which is more than the seven-ish I've observed.) And presuming purge does nothing to the link list?

Anyway, I went back to the drawing board and came up with this on short notice (it desperately needs cleaning (read as proper mediawiki DB class usage, more OO, et cetera) so it sucks less, but I needed something to patch things up rather quickly *grimace*):

(works for mediawiki 1.17.0 with DPL 1.8.9; not sure about other versions)

<?php
$_SERVER['REQUEST_METHOD'] = 'GET';

$directoryToAdd = explode(DIRECTORY_SEPARATOR, dirname(__FILE__));
array_pop($directoryToAdd);
$directoryToAdd = implode(DIRECTORY_SEPARATOR, $directoryToAdd);

chdir($directoryToAdd);
putenv('MW_INSTALL_PATH=' . $directoryToAdd);
set_include_path($directoryToAdd . PATH_SEPARATOR . get_include_path());

include(dirname(__FILE__) . '/../includes/WebStart.php'); // why are we WebStarting if we want to CLI? FIXME

if ($wgDBtype !== 'mysql') {
    throw new InvalidArgumentException('fixDPL.php only supports mysql');
}
//$wgDBport = "5432";

$dbConnection = mysql_connect($wgDBserver, $wgDBuser, $wgDBpassword);
mysql_select_db($wgDBname, $dbConnection) or die('Could not connect to database.');

$dplQueryTemplate = 'Extension_DPL';

$result = mysql_query(
    'SELECT '
        . 'p.page_title AS title, p.page_namespace AS namespace'
    . ' FROM '
        . $wgDBprefix . 'page AS p LEFT OUTER JOIN ' . $wgDBprefix . 'templatelinks AS tl ON tl.tl_from=p.page_id'
    . ' WHERE '
        . 'tl.tl_title=\'' . mysql_real_escape_string($dplQueryTemplate) . '\''
    . ' ORDER BY '
        . 'p.page_title ASC',
    $dbConnection
);

while ($row = mysql_fetch_assoc($result)) {
    $map = array(
        'job_namespace' => $row['namespace'],
        'job_title'     => $row['title'],
        'job_params'    => 's:0:""',
        'job_cmd'       => 'refreshLinks'
    );
    foreach ($map as $c => $v) {
        if ( !is_numeric($v) || ( ((string)(int)$v) !== $v) ) {
            $map[$c] = "'" . mysql_real_escape_string($v) . "'";
        }
        $map[$c] = $c . '=' . $map[$c];
    }
    $sql = 'INSERT INTO ' . $wgDBprefix . 'job SET ' . implode(', ', $map);
    echo $sql . PHP_EOL;
    mysql_query($sql, $dbConnection);
}

echo 'Executing runJobs.php [' . date('Y-m-d H:i:s') . ']...' . PHP_EOL;
ob_flush();
$wgCommandLineMode = true;
include dirname(__FILE__) . '/runJobs.php';
$wgShowExceptionDetails = true;
$runJobs = new RunJobs();
$runJobs->execute();
echo 'Done with runJobs.php! [' . date('Y-m-d H:i:s') . ']' . PHP_EOL;

It's a hackish thing that requires DPL query pages to be in a category 'DPL_query_pages', because that was quicker to do than change DPL around to include the category automatically or create another table to track DPL-usage (or rather, quicker than trying to understand mediawiki extensions quick enough to do either - I literally have zero experience in writing one of my own). DPL auto-tracks itself now.

This solution is very similar to your suggestion to cronjob runJobs.php - except that it collects and executes the DPL-specific jobs separately, so other jobs can still run through the web, and you don't end up with a ridiculously long job queue in the meanwhile. It has the caveat that some of its jobs will be caught via the web if the cronjob doesn't run, say, at night (providing your wiki is in local-timezone use only).

I'm going to have to look into writing a little patch for DPL, though, obviously I can't release our maintenance/fixDPL.php, that doesn't work 'out of the box'. :)

By the way, I found 76 DPL query pages on our wiki, clustered about eight templates (with includeonly-categories). The associated categories contain the following amount of pages: 38, 81, 199, 238, 268, 280, 322, 941. So that's the exact numbers at time of writing of what was killing our wiki. *cringechuckle* We really can't live without the if-NS_TEMPLATE fix, sadly.

-pinkgothic 15:01, 10 February 2011 (UTC)

Caching sidenote[edit]

Hmm, the caching code in DPL is really weird. When it caches the dpl, it disables mediawiki's normal cache for the rest of the page. (The caching code might also allow an attacker to execute code on the server for some configurations of apache). Bawolff 19:48, 10 February 2011 (UTC)
*grimaces* Ouch, that doesn't sound promising. Thanks for the heads-up! Means I should take a look at it, I was supposed to analyse some of DPL's security vulnerabilities anyway. Oh, to have time for things! v.v </truckload of woe>
I'll probably get back to this tomorrow (the 'provide a proper patch for new/deleted pages' part), providing I can sneak in some time for it then.
-pinkgothic 20:04, 10 February 2011 (UTC)
The issue probably would be easy to fix. The issue is that it asks the user for the cache filename (really it should do something like md5sum of the dpl so its not a user issue. Really it should use mediawiki's internal cache support.). It doesn't validate this filename properly, allowing user to put it in any directory they want. Furthermore, default is web accessible. It appends .txt to the end of the user's filename, however apache takes all extensions into account depending on how things are setup foo.pl.txt might be interpreted as a perl script (or other scripting language). The XSS issues would be significantly harder to fix. Bawolff 22:00, 10 February 2011 (UTC)

Patch discussion #2[edit]

Issue: DPL has no way of "What links here"-ing new pages that match its queries until its cache is rebuilt, nor healing the "What links here" of pages that are deleted. The same is true for pages that are changed or moved in a way that has them be freshly in- or excluded by a DPL query. (This is only really obvious after the patch in the article because it deliberately cuts down on cache-rebuilding, but it is an inherit problem.)

Related Good Idea TM: Fixing the caching as per Bawolff. Should allow reverting the patch from the article, thus making it a DPL-internal fix only, but it doesn't address DPL's new/deleted page issue.

Presumably easy but still hackish solution: The manual categorisation becomes automatic by going into the <DPL> parsing routine. Fixing is still done by a cronjob (see horrible file in the first part the discussion). omgnoez. (This is actually no longer an issue.)

"Proper" solution a/k/a overkill, but I had to:

After spending some time thinking about it, it strikes me as the cleanest solution to have two more tables:

dpl_links
+-----------+--------------+
| dpll_from | dpll_page_id |
+-----------+--------------+
|         1 | 245          | title 'Foobartastic'
|         1 | 11           | title 'Foobar itself'
|         2 | 71           | in category ID#6
+-----------+--------------+
dpl
+--------+---------+-------------+-------------+-------------+---------------+
| dpl_id | page_id | category_id | template_id | title_match | content_match |
+--------+---------+-------------+-------------+-------------+---------------+
|      1 |      12 |        NULL |        NULL | Foobar%     |          NULL |
|      2 |      12 |           6 |        NULL |        NULL |          NULL |
+--------+---------+-------------+-------------+-------------+---------------+

<DPL> parsing determines if the DPLs on the page changed at all; if yes, deletes its page_id entries from the dpl table and the corresponding entries in dpl_links, then adds the 'new' ones. This would be very annoying if you have several DPLs on the page and you delete one; maybe some clever comparison routine can only delete the DPLs that were changed/deleted. I'm not thinking at that level of detail yet (I haven't looked at the parsing code yet, maybe that's even easier than doing a wipe).

Editing a page on the wiki as well as creating a new one would trigger a query to the effect of -

SELECT * FROM dpl WHERE (category_id IS NULL OR category_id IN (page_category_ids))
                    AND (template_id IS NULL OR template_id IN (page_template_ids))
                    AND (title_match IS NULL OR 'page_title' LIKE title_match); -- ...is this even possible?

Then having something like...

$db->query('DELETE FROM dpl_links WHERE dpll_page_id = ...'); // on edit: WHERE dpl_id NOT {...has title_match...}
foreach ($rows as $row) {
    if (empty($row['content_match']) || preg_match($row['content_match'], page_content)) {
        $db->query('INSERT INTO dpl_links ...');
    }
}

...happen. (Table and query/routine can be expanded in regards to other sensible DPL query parameters, those were the ones I knew of off the top of my head.)

Deleting a page would simply remove it from dpl_links.

Moving a page could cut down the check to title_match-es only.

What Links Here would be expanded to connect the page_id in dpl with the dpll_page_id in the dpl_links table.

The whole setup is strongly related to transclusion, but also deals with new pages and pages changed to be freshly included in DPL queries, which the DPL-transclusion concept can't pick up on. It's quite possible this is horrible, performance wise, and thus not an option; and there is only so much indices could do to help, too. But it seems cleanest from an architectural point of view.

Blagh.

-pinkgothic 11:10, 9 March 2011 (UTC)

I agree that having a links table and invalidating cache seems the Right Way to do things. (In fact, I was planning to try to add a similar feature to the more simple Wikimedia version one of these days), however this approach may not work with many of third party dpl's features. As you said yourself the title_match thing seems dubious if its even possible, and would definitely be inefficient. -Bawolff 01:03, 10 March 2011 (UTC)