Topic on Manual talk:GenerateSitemap.php

This script will generate errors on many wikis

Ppehrson (talkcontribs)

Because the URLs are not HTML-sanitized, Google will reject the sitemaps if they contain unescaped HTML characters.


You simply need to adapt the script to sanitize the URLs further.


At line 384, change:

$entry = $this->fileEntry( $title->getCanonicalURL(), $date, $this->priority( $namespace ) );

to:

$entry = $this->fileEntry( $this->encodeUrl( $title->getCanonicalURL() ), $date, $this->priority( $namespace ) );

$title = htmlentities($title);

Before the private function open, around line 424, add:

   private function encodeUrl($url) {
       return str_replace(array('(',')','$','&','\,'@','*','#'),array('%28','%29','%24','%26','%27','%40','%2A','%23'), $url);

//return $url;

   }


This sanitizes the URLs to match the official sitemap rules. Without this code your generator FAILS the tests, especially if someone enters special characters in a wiki title, such as a dollar sign, an asterisk, parentheses, or apostrophes.

Klaugust (talkcontribs)

Hello!

I tried this but got:

PHP Parse error:  syntax error, unexpected token "@", expecting ")" in maintenance/generateSitemap.php on line 434

I also think that a ' is missing after the backslash \ in the array, but I tried adding it and got the same error.
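
For reference, here is a minimal standalone sketch of what the helper above appears to intend. It assumes the broken array element was meant to be an escaped apostrophe ('\''), which matches the %27 in the replacement list. Writing '\','@' instead would still fail to parse, because \' escapes the quote and the string then swallows the comma, which may explain why the error persisted. The function name encodeSitemapUrl is hypothetical, and this is not the code shipped in generateSitemap.php.

```php
<?php
// Hypothetical standalone helper (not part of MediaWiki): percent-encodes
// a fixed set of characters, mirroring the list proposed in the post above.
function encodeSitemapUrl( string $url ): string {
	// Assumption: the fifth element was meant to be an apostrophe, written '\''.
	$search = [ '(', ')', '$', '&', '\'', '@', '*', '#' ];
	$replace = [ '%28', '%29', '%24', '%26', '%27', '%40', '%2A', '%23' ];
	return str_replace( $search, $replace, $url );
}

echo encodeSitemapUrl( "https://example.org/wiki/Rock_'n'_Roll_(band)" ), "\n";
```

Note that a blanket replacement like this would also encode any & or # that is a legitimate part of a query string or fragment, so it only suits URLs where those characters occur in the title alone.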

Kghbln (talkcontribs)

Thanks for describing the issue and providing a solution. Honestly, from experience gained by sticking around here for a while, I believe that you should file a task at Phabricator to address this and ideally provide a patch to be merged. Otherwise only a few people will notice it, which is kinda sad.

Ciencia Al Poder (talkcontribs)

Can someone back up the claim that the URLs aren't correctly encoded?

From the source code, the URL is passed to PHP's htmlspecialchars function, which encodes XML-problematic characters. On the other hand, the URLs are URL-encoded. They're generated from Title::getCanonicalURL(), which builds on Title::getLocalURL(). If you look at the source code, the "dbKey" is returned from wfUrlencode(), which correctly URL-encodes any non-ASCII characters or special URL characters like ? and #.
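
The two layers of encoding described above can be illustrated with plain PHP. This is a minimal sketch, not MediaWiki's code: it uses PHP's built-in urlencode() as a stand-in for wfUrlencode() (which, as an assumption here, may leave some additional characters unescaped), and the example.org base URL and the function name sitemapLoc are hypothetical.

```php
<?php
// Hypothetical stand-in for the two steps: URL-encode the title,
// then XML-escape the whole URL before writing it into the sitemap.
function sitemapLoc( string $base, string $title ): string {
	// Step 1: percent-encode characters that are special in URLs (?, #, & ...).
	$url = $base . urlencode( $title );
	// Step 2: escape characters that are special in XML (&, <, >, quotes).
	return htmlspecialchars( $url, ENT_QUOTES | ENT_XML1 );
}

echo sitemapLoc( 'https://example.org/index.php?action=view&title=', 'C#_FAQ?' ), "\n";
```

In this sketch the # and ? in the title come out as %23 and %3F, while the & separating the query parameters becomes &amp;, which is the form an XML sitemap expects.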
