Manual talk:GenerateSitemap.php


It appears that much of the discussion below is outdated. Use the --server= option to get Google Webmaster Tools to accept your sitemap.

ATTN: in MW 1.16 the sitemap generated by the script does not work with Google. You need to patch the script with this first.

Google webmaster tools

Google Webmaster Tools wants the full URL of each sitemap .gz file that is listed in the index file. This doesn't happen even when I pass the --server parameter to the script, so this code has to be modified:

        function indexEntry( $filename ) {
                return
                        "\t<sitemap>\n" .
                        "\t\t<loc>$filename</loc>\n" .
                        "\t\t<lastmod>{$this->timestamp}</lastmod>\n" .
                        "\t</sitemap>\n";
        }

Add your site URL before the $filename. After that, Google will not complain about an invalid URL in the sitemap index file. I hope the MediaWiki developers address this problem.

Note from BarkerJr: This is the error specified above: "We've detected that a Sitemap you've listed doesn't include the full URL." -BarkerJr 11:56, 15 August 2008 (UTC)

FYI, you should set the URL to the location where the sitemap is saved. I save mine in domain.com/sitemap, so my setup is:
        function indexEntry( $filename ) {
                return
                        "\t<sitemap>\n" .
-                       "\t\t<loc>$filename</loc>\n" .
+                       "\t\t<loc>http://domain.com/sitemap/$filename</loc>\n" .
                        "\t\t<lastmod>{$this->timestamp}</lastmod>\n" .
                        "\t</sitemap>\n";
-- Ipstenu 14:00, 19 September 2008 (UTC)
Just wanted to say the above worked for me. I don't know why it didn't work before, as my generatesitemap.php had the URL path in it like below. If I generated the sitemap manually it would have the path to the files, but not when generatesitemap.php was run by a cronjob.
	function indexEntry( $filename ) {
		return
			"\t<sitemap>\n" .
			"\t\t<loc>{$this->urlpath}$filename</loc>\n" .
			"\t\t<lastmod>{$this->timestamp}</lastmod>\n" .
			"\t</sitemap>\n";
	}
After I changed the code to include the URL of the directory where my sitemaps are stored it worked. Below is my cronjob if anyone needs help with that. I'm hosting on HostGator.
/usr/local/bin/php /home/[userdirectory]/public_html/[websitedirectory]/w/maintenance/generateSitemap.php \
--fspath /home/[userdirectory]/public_html/[websitedirectory]/w/sitemap \
--urlpath http://www.website.org/w/sitemap
--Fox15rider - www.TrailWIKI.org 17:07, 23 August 2012 (UTC)
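The change discussed in this thread can be sketched outside the script as a small standalone function. This is only an illustration under assumptions: `sitemapIndexEntry()` and its `$baseUrl` parameter are hypothetical names, not part of generateSitemap.php, and the file name and timestamp are made-up sample values.

```php
<?php
// Hypothetical standalone helper, not part of generateSitemap.php.
// $baseUrl is the public URL of the directory that holds the sitemap files.
function sitemapIndexEntry( $baseUrl, $filename, $timestamp ) {
	// Join with exactly one slash, whether or not $baseUrl ends with one.
	$loc = rtrim( $baseUrl, '/' ) . '/' . ltrim( $filename, '/' );
	return
		"\t<sitemap>\n" .
		"\t\t<loc>" . htmlspecialchars( $loc ) . "</loc>\n" .
		"\t\t<lastmod>$timestamp</lastmod>\n" .
		"\t</sitemap>\n";
}

// Sample values only.
echo sitemapIndexEntry(
	'http://domain.com/sitemap',
	'sitemap-wiki-NS_0-0.xml.gz',
	'2008-09-19T14:00:00Z'
);
```

Passing the base URL as a parameter avoids hard-coding the domain in the script, which is the part of the quick fix that breaks when the wiki moves.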

Example of usage

I've installed MediaWiki on a separate subdomain, and have set up a cronjob to automatically update the sitemap every hour:

Run:

crontab -e

Create a line that looks something like this:

0 * * * * /usr/local/bin/php /home/httpd/public_html/wiki/maintenance/generateSitemap.php wiki.mydomain.com --fspath /home/httpd/public_html/wiki/

Go to Google Webmasters, add your site (e.g. wiki.mydomain.com) and then add the sitemap (e.g. sitemap-index-foo_bar.xml)

On a local Windows box using XAMPP, the command would look something like this:

C:\xampp\php\php.exe c:\mediawiki-1.14.0\maintenance\generateSitemap.php wikisubdomain.mydomain.com --fspath C:\server\www_public_dir\ --server "wikisubdomain.mydomain.com"

The parts of the command are:

  1. initiation of the php executable file/interpreter
  2. (the first argument for the command) the php script to be executed (in this case generateSitemap.php)
  3. the fspath argument and its value, which tell the script the filesystem path where the sitemap needs to go (on the local machine)
  4. the server argument and its value, which tell the script what to use in place of "localhost" if the name cannot be resolved

Manual

Options

--help

show this message

--fspath=<path>

The file system path to save to, e.g. /tmp/sitemap/

--server=<server>

The protocol and server name to use in URLs, e.g.
http://en.wikipedia.org. This is sometimes necessary because
server name detection may fail in command line scripts.
You know you need to use this when the hostname in the sitemap.xml files shows up as "localhost". Use the domain name only, without the protocol prefix (e.g. "http") and without a trailing slash ("/")

--compress=[yes|no]

compress the sitemap files, default yes

--Subfader 11:53, 19 March 2008 (UTC)
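The --server description above mentions two conventions, one with the protocol and one without, and this page does not settle which one a given version of the script expects. As a hedged illustration only, a wrapper could normalise whichever form it is given before building URLs. `normalizeServer()` and its `$defaultScheme` parameter are hypothetical and do not exist in generateSitemap.php.

```php
<?php
// Hypothetical helper for illustration; generateSitemap.php does not contain it.
function normalizeServer( $server, $defaultScheme = 'http' ) {
	$server = trim( $server );
	// Add a protocol only when none was given.
	if ( !preg_match( '#^[a-z][a-z0-9+.-]*://#i', $server ) ) {
		$server = $defaultScheme . '://' . $server;
	}
	// Drop trailing slashes so that paths can be appended with a single "/".
	return rtrim( $server, '/' );
}

echo normalizeServer( 'wiki.mydomain.com' ), "\n";
echo normalizeServer( 'http://en.wikipedia.org/' ), "\n";
```

The sitemap protocol itself requires every `<loc>` value to be a full URL including the protocol, so a value that ends up in the XML needs a protocol one way or another.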

Patch for enabled Server

Index: generateSitemap.php
===================================================================
--- generateSitemap.php	(revision 35805)
+++ generateSitemap.php	(working copy)
@@ -47,7 +47,7 @@
 	 *
 	 * @var string
 	 */
-	var $path;
+	var $server;
 
 	/**
 	 * Whether or not to use compression
@@ -143,14 +143,14 @@
 	 * @param string $path The path to append to the domain name
 	 * @param bool $compress Whether to compress the sitemap files
 	 */
-	function GenerateSitemap( $fspath, $compress ) {
+	function GenerateSitemap( $fspath, $server, $compress ) {
 		global $wgScriptPath;
 
 		$this->url_limit = 50000;
-		$this->size_limit = pow( 2, 20 ) * 10;
+		$this->size_limit = pow( 2, 20 ) * 10;	
 		$this->fspath = isset( $fspath ) ? $fspath : '';
 		$this->compress = $compress;
-
+		$this->server = $server;
 		$this->stderr = fopen( 'php://stderr', 'wt' );
 		$this->dbr = wfGetDB( DB_SLAVE );
 		$this->generateNamespaces();
@@ -233,7 +233,6 @@
 		global $wgContLang;
 
 		fwrite( $this->findex, $this->openIndex() );
-
 		foreach ( $this->namespaces as $namespace ) {
 			$res = $this->getPageRes( $namespace );
 			$this->file = false;
@@ -250,9 +249,10 @@
 						$this->close( $this->file );
 					}
 					$filename = $this->sitemapFilename( $namespace, $smcount++ );
+					$server= $this->server;
 					$this->file = $this->open( $this->fspath . $filename, 'wb' );
 					$this->write( $this->file, $this->openFile() );
-					fwrite( $this->findex, $this->indexEntry( $filename ) );
+					fwrite( $this->findex, $this->indexEntry( $filename, $server ) );
 					$this->debug( "\t$filename" );
 					$length = $this->limit[0];
 					$i = 1;
@@ -366,10 +366,10 @@
 	 *
 	 * @return string
 	 */
-	function indexEntry( $filename ) {
+	function indexEntry( $filename, $server )	{
 		return
 			"\t<sitemap>\n" .
-			"\t\t<loc>$filename</loc>\n" .
+			"\t\t<loc>$server$filename</loc>\n" .
 			"\t\t<lastmod>{$this->timestamp}</lastmod>\n" .
 			"\t</sitemap>\n";
 	}
@@ -469,6 +469,6 @@
 	$wgServer = $options['server'];
 }
 
-$gs = new GenerateSitemap( @$options['fspath'], @$options['compress'] !== 'no' );
+$gs = new GenerateSitemap( @$options['fspath'], @$options['server'], @$options['compress'] !== 'no' );
 $gs->main();

Meta tags and priority

I use an extension to change meta tags (keywords, description, priority, and robots for follow and index). Is it possible to force the priority and index using those tags? --Eloy 00:39, 18 June 2008 (UTC)

Priorities

It'd be nice if there were more fine-tuning of the priorities it chooses. I'd like newer and more popular articles to have higher priorities. Right now all the priorities do is make regular articles get checked more often than talk and user pages, etc. Also, Google now complains if a sitemap has the same priority for every page. -Nais 21:10, 2 July 2008 (UTC)

I agree! AFAIK when all the pages in sitemaps have the same priority the site itself won't be ranked as high as it should be (as it's more difficult for googlebots to judge what's important and what's not). Is there any kind of solution to rank pages with most edits and newest pages higher than older ones? --83.145.207.200 15:35, 11 November 2008 (UTC)
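Nothing on this page says the script can do this, but the idea of ranking recent and frequently edited pages higher can be sketched as a scoring function. Everything here is an assumption made for illustration: the function name, the one-year recency window, the 100-edit cap and the equal weighting are arbitrary choices, not behaviour of generateSitemap.php.

```php
<?php
// Hypothetical scoring, not something generateSitemap.php does.
function sketchPriority( $daysSinceLastEdit, $editCount ) {
	// Recency: 1.0 for a page edited today, falling to 0 after a year.
	$recency = max( 0.0, 1.0 - $daysSinceLastEdit / 365 );
	// Activity: grows with the edit count and saturates at 100 edits.
	$activity = min( 1.0, $editCount / 100 );
	// Blend both and keep the result within the 0.1 to 1.0 range.
	$score = 0.1 + 0.9 * ( 0.5 * $recency + 0.5 * $activity );
	// Sitemap priorities are written with one decimal place.
	return number_format( $score, 1, '.', '' );
}

echo sketchPriority( 0, 100 ), "\n";  // busy page edited today
echo sketchPriority( 365, 0 ), "\n";  // untouched for a year
```

The edit count and last-edit age would have to come from the wiki database; how to fetch them is left out of this sketch.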

How I fixed sitemap

So here is how I fixed it - http://forum.appropedia.org/blog/finally-working-mediawiki-sitemap. This is based on the fix from OLPC - http://wiki.laptop.org/go/SEO_for_the_OLPC_wiki/sitemapgen.

Good luck, --LRG 03:49, 24 August 2008 (UTC)

Patch for enabled Server not in trunk?

I tried using the generateSitemap.php script with Google Webmaster Tools, and Google came back with the following error:

 We've detected that a Sitemap you've listed doesn't include the full URL 

I used the comment "Patch for enabled Server" above to patch the generateSitemap.php, except I added a '/' between the server and filename vars:

        function indexEntry( $filename, $server ) {
                return
                        "\t<sitemap>\n" .
-                       "\t\t<loc>$server$filename</loc>\n" .
+                       "\t\t<loc>$server/$filename</loc>\n" .
                        "\t\t<lastmod>{$this->timestamp}</lastmod>\n" .
                        "\t</sitemap>\n";

It seems like this change would be useful in the main branch, except that this code will only work if you put the sitemap in the root directory of the server, since there is no way to tell the script what the URL is for the fspath parameter. --Cnovak 21:26, 15 January 2009 (UTC)

I made a version with an smpath (sitemap-path) parameter so that this can be changed. I'll post it here in the next few days. --DaSch 00:07, 16 January 2009 (UTC)
By the way, I've put this patch into MediaWiki Bugzilla, but they didn't care about it. That's why it's not in trunk. If you put a / after your server name when starting the script, you do not have to change this. --DaSch 00:09, 16 January 2009 (UTC)

What's the URL, then??

With the "Extension:Google Sitemap" the URL I have to give Google for my sitemap is "http://www.pop-cult.net/Wikitainment/sitemap.xml", but what if I want to use the default sitemap that my mediawiki has?

Btw, is there a way for any of those sitemaps to list more than 500 articles?

I also hate that the "Extension:Google Sitemap" is listing categories and user pages, I only want it to list the regular articles.--187.147.10.114 01:33, 31 March 2009 (UTC)

Bug: Redirect pages are listed

Moved articles continue to be listed in the generated sitemaps, a problem with regard to duplicate content issues for SEO.

It's more of a problem if using the headers extension which changes redirect pages to 301 redirects to the new location. Google throws up a "warning" message when it finds this, wanting you to only list the destination page.

Senseless to list redirect pages. Please modify the script to skip entries if they're redirects.

203.184.10.37 06:18, 7 June 2009 (UTC)

Yes, I noticed that too. Here be patch for mediawiki-1.15.1:
245a246
>                               'page_is_redirect',
284a286,288
>                               if ($row->page_is_redirect) {
>                                       continue;
>                               }
--Gutza 19:08, 27 September 2009 (UTC)
This bug doesn't seem to have been fixed. I'm using the 1.22.4 release with the "skip-redirects" option set to true and it still generates the sitemap with the redirect pages. I modified line 170 of the generateSitemap file, replacing the last "false" with "true", and that seems to have fixed the issue.--Spaceeinstein (talk) 06:56, 22 March 2014 (UTC)
As of 24.7.2016 (MediaWiki 1.27) this seems to be fixed. I changed the options section accordingly. --Stefahn (talk) 19:56, 24 July 2016 (UTC)
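The idea behind the 1.15.1 patch above, shown as a standalone sketch. The rows are hypothetical stand-ins for the database result; only the `page_is_redirect` column name is taken from the patch.

```php
<?php
// Hypothetical rows standing in for the database result; only the
// page_is_redirect column name comes from the patch above.
$rows = array(
	(object)array( 'page_title' => 'Main_Page', 'page_is_redirect' => 0 ),
	(object)array( 'page_title' => 'Old_name', 'page_is_redirect' => 1 ),
	(object)array( 'page_title' => 'New_name', 'page_is_redirect' => 0 ),
);

$listed = array();
foreach ( $rows as $row ) {
	// Same check as in the patch: redirects are left out of the sitemap.
	if ( $row->page_is_redirect ) {
		continue;
	}
	$listed[] = $row->page_title;
}

echo implode( "\n", $listed ), "\n";
```

The check only works if the query actually selects `page_is_redirect`, which is why the patch adds that column in its first hunk.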

A NOOB needs help

I want to submit a sitemap to Google, but I am very new at this and can't decide whether to use the extension or the native sitemap. I would like to keep it native, but I need some explicit instructions for an (almost) complete noob. I am using SiteGround to host my site and have added several extensions to Debatrix, and to date I have done one MySQL query with phpMyAdmin (to recover a lost password).

This page doesn't make a lot of sense, can anyone point me in the right direction? --Jake4d 04:00, 1 July 2009 (UTC)

Server Option no longer working?

It seems that GenerateSitemap.php started to ignore the server option (--server="www.example.com"). Can anyone confirm this problem? --80.242.196.24 17:13, 4 September 2009 (UTC)

Works for me. I'm specifying it with --server "www.example.com" and it behaves as expected. --81.181.249.130 11:14, 29 September 2009 (UTC)

Output to stderr

If debugging output goes to stderr, where do the errors go? I simply replaced "stderr" with "stdout" throughout the script, but I'm curious about the rationale... --Gutza 11:15, 29 September 2009 (UTC)


Error message?

Total noob here.

Tried running this and got "Parse error: syntax error, unexpected T_STRING, expecting T_OLD_FUNCTION or T_FUNCTION or T_VAR or '}' in generateSitemap.php on line 167".

Help??

  • I got it. You need to be using PHP 5.


A BUG FIX to work with the Google Webmaster Tools

In 1.16 (and possibly later) you need to make the following changes for the sitemap to work with Google Webmaster Tools:

--- generateSitemap.php.bak	2010-09-09 15:50:08.000000000 -0500
+++ generateSitemap.php	2010-09-09 16:53:33.000000000 -0500
@@ -74,6 +74,14 @@
 	var $compress;
 
 	/**
+ 	 * The server URL to prepend to the filename
+ 	 * Nicola Asuni 2010-05-30
+ 	 *
+ 	 * @var string
+ 	 */
+	var $server;
+
+	/**
 	 * The number of entries to save in each sitemap file
 	 *
 	 * @var array
@@ -147,6 +155,7 @@
 		$this->size_limit = pow( 2, 20 ) * 10;
 		$this->fspath = self::init_path( $this->getOption( 'fspath', getcwd() ) );
 		$this->compress = $this->getOption( 'compress', 'yes' ) !== 'no';
+		$this->server = $this->getOption( 'server', '/' );
 		$this->dbr = wfGetDB( DB_SLAVE );
 		$this->generateNamespaces();
 		$this->timestamp = wfTimestamp( TS_ISO_8601, wfTimestampNow() );
@@ -290,7 +299,7 @@
 					$filename = $this->sitemapFilename( $namespace, $smcount++ );
 					$this->file = $this->open( $this->fspath . $filename, 'wb' );
 					$this->write( $this->file, $this->openFile() );
-					fwrite( $this->findex, $this->indexEntry( $filename ) );
+					fwrite( $this->findex, $this->indexEntry( $this->server.'/'.$filename ) );
 					$this->output( "\t$this->fspath$filename\n" );
 					$length = $this->limit[0];
 					$i = 1;
@@ -405,10 +414,11 @@
 	 * @return string
 	 */
 	function indexEntry( $filename ) {
+ 		$filename = preg_replace('/[\t\r\n\s]+/i', '', $filename); // Nicola Asuni 2010-05-30
 		return
 			"\t<sitemap>\n" .
 			"\t\t<loc>$filename</loc>\n" .
-			"\t\t<lastmod>{$this->timestamp}</lastmod>\n" .
+			"\t\t<lastmod>".$this->timestamp."</lastmod>\n" .
 			"\t</sitemap>\n";
 	}
 
@@ -444,11 +454,14 @@
 	 * @return string
 	 */
 	function fileEntry( $url, $date, $priority ) {
+ 		$url = preg_replace('/http:\/\/.+?\//', '', $url);
+		$url = $this->server."/".$url; 
+ 		$url = preg_replace('/[\t\r\n\s]+/i', '', $url); // Nicola Asuni 2010-05-30
 		return
 			"\t<url>\n" .
-			"\t\t<loc>$url</loc>\n" .
-			"\t\t<lastmod>$date</lastmod>\n" .
-			"\t\t<priority>$priority</priority>\n" .
+			"\t\t<loc>".$url."</loc>\n" .
+			"\t\t<lastmod>".$date."</lastmod>\n" .
+			"\t\t<priority>".$priority."</priority>\n" .
 			"\t</url>\n";
 	}

Original fix by Nicolaasuni 10:21, 30 May 2010 (UTC)

Sicvolo 22:04, 9 September 2010 (UTC)

I could really use some clarity here. For one, is this a complete swap of this code for the old one, or are we making changes within the old code according to this code? If the latter, how do we know what to change, by the colors? By a line by line comparison by people who aren't coders? Then when I followed the link to download, I see that it has a much more recent date than this patch, January of 2011, to be exact. So does that mean we can just use the Jan 2011 version and forget the patch? I don't know, because people aren't documenting and updating this page. Your help would be greatly appreciated. Natcolley 22:10, 25 January 2011 (UTC)

I have put the source code of the complete file here. It is patched as advised by Nicolaasuni and works fine for me. Cheers --[[kgh]] 10:48, 15 February 2011 (UTC)
This patched file ignores --urlpath and generates an erroneous index XML. I've made another one to fix the problem. --Superxain (talk) 05:54, 17 August 2012 (UTC)
Thanks, I had "Wrong URL format" errors. This fix worked like a charm and returned no errors at all. --Subfader (talk) 16:29, 29 July 2012 (UTC)
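For readers trying to follow what the fileEntry() change in the patch above does: it strips the scheme and host from each page URL and prepends the --server value. The regular expression in the patch only matches URLs that start with http://. A rough equivalent using parse_url(), which also handles https, could look like the sketch below. `rebaseUrl()` is a hypothetical name and the URLs are sample values.

```php
<?php
// Hypothetical helper for illustration; it is not part of generateSitemap.php.
function rebaseUrl( $url, $server ) {
	// parse_url() returns null for a component that is absent.
	$path = parse_url( $url, PHP_URL_PATH );
	$query = parse_url( $url, PHP_URL_QUERY );

	// Join server and path with exactly one slash.
	$rebased = rtrim( $server, '/' ) . '/' . ltrim( (string)$path, '/' );
	if ( is_string( $query ) && $query !== '' ) {
		$rebased .= '?' . $query;
	}
	return $rebased;
}

echo rebaseUrl( 'http://localhost/wiki/Main_Page', 'http://www.example.org' ), "\n";
echo rebaseUrl( 'https://localhost/w/index.php?title=Foo', 'http://www.example.org/' ), "\n";
```

A URL written into a sitemap still has to be XML-escaped afterwards (for example `&` in a query string), which this sketch leaves to the caller.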

Fix for 1.16b3 and server

I'm not sure why, but after upgrading to 1.16b3 the sitemap generation always puts "http://localhost" at the beginning of the URL (despite the --server option). Here's a quick, dirty fix:
change

$entry = $this->fileEntry( $title->getFullURL(), $date, $this->priority( $namespace ) );

to

$entry = $this->fileEntry( str_replace('http://localhost', $this->server, $title->getFullURL()), $date, $this->priority( $namespace ) );

Sauron 06:25, 19 June 2010 (UTC)

I see something similar on 1.16.0. Regardless of what I use as the --server setting, it's not picked up by the script. So even the replacement suggested by you is not doing the trick, as I then end up with a blank string. I replaced $this->server with the actual address to get it to work.

I've made a generatesitemap script that supports --server setting.--Superxain (talk) 05:57, 17 August 2012 (UTC)

I confirm that --server doesn't work even on trunk revision 77884.

What about 1.16.2?

It seems 1.16.2 has not been patched, so should we consider it broken too?

In case it has not been patched, I put the source code of the patched file here. It worked for me with 1.16.0 and 1.16.1. It should also work for 1.16.2, though I have not tried it so far. Cheers --[[kgh]] 10:50, 15 February 2011 (UTC)

Settings that work for me

I spent one day figuring out how to set up generateSitemap.php. For me this works:

generateSitemap.php --server=http://www.yoursite.org --fspath=../sitemap/

Please note that I didn't use " or ' before and after the options. I used ../sitemap/ (.. means directory up) because it didn't work with absolute paths (for example: /home/site/www/sitemap/ or /www/sitemap).

The only problem is that in sitemap-index-yoursite.xml the paths of the files are broken (http://www.yoursite.org///sitemap-yoursite-NS_3-0.xml.gz). To correct this error you have to modify the PHP code or submit the xml.gz files one by one. Disabling compression isn't necessary because Google Webmaster Tools can read compressed files.
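One possible shape for the PHP change mentioned above is to collapse the repeated slashes before the URL is written to the index. This is a sketch under assumptions, not the script's own code: `collapseSlashes()` is a hypothetical helper, and it would also alter a `//` that appears legitimately in a query string.

```php
<?php
// Hypothetical helper, not part of generateSitemap.php.
function collapseSlashes( $url ) {
	// Split off the scheme so that "http://" keeps its double slash.
	$parts = explode( '://', $url, 2 );
	if ( count( $parts ) === 2 ) {
		return $parts[0] . '://' . preg_replace( '#/{2,}#', '/', $parts[1] );
	}
	// No scheme present: collapse every run of slashes.
	return preg_replace( '#/{2,}#', '/', $url );
}

echo collapseSlashes( 'http://www.yoursite.org///sitemap-yoursite-NS_3-0.xml.gz' ), "\n";
```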

If running php scripts with options is not working, make a php file in the same directory where generateSitemap.php is, for example generateSitemap2.php and write this:

<?php
   // Call the PHP interpreter explicitly (assumes the php binary is on the PATH).
   system("php generateSitemap.php --server=http://www.yoursite.org --fspath=../sitemap/");
?>

It is enough to run generateSitemap2.php without any options, and this will run generateSitemap.php with options.

Support for <changefreq>?

Can this script output the <changefreq> XML element of a sitemap? Thanks. -Dangrec 06:39, 22 January 2012 (UTC)
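For reference, this is roughly what a `<url>` entry with `<changefreq>` looks like in the sitemap protocol. The sketch does not answer whether the script supports it. `urlEntryWithChangefreq()` and its `$changefreq` parameter are hypothetical, and the fallback value is an arbitrary choice.

```php
<?php
// Hypothetical variant of a <url> entry builder; the $changefreq parameter
// is an assumption and is not an option of generateSitemap.php.
function urlEntryWithChangefreq( $url, $date, $priority, $changefreq ) {
	// Values allowed by the sitemap protocol.
	$allowed = array( 'always', 'hourly', 'daily', 'weekly', 'monthly', 'yearly', 'never' );
	if ( !in_array( $changefreq, $allowed, true ) ) {
		$changefreq = 'weekly'; // arbitrary fallback for this sketch
	}
	return
		"\t<url>\n" .
		"\t\t<loc>$url</loc>\n" .
		"\t\t<lastmod>$date</lastmod>\n" .
		"\t\t<changefreq>$changefreq</changefreq>\n" .
		"\t\t<priority>$priority</priority>\n" .
		"\t</url>\n";
}

echo urlEntryWithChangefreq( 'http://www.example.org/wiki/Main_Page', '2012-01-22T06:39:00Z', '1.0', 'daily' );
```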

Unfortunately this sitemap does not work on my site.

Google Webmaster Tools says:


Errors
Invalid URL
This is not a valid URL. Please correct it and resubmit.
14

  • URL: sitemap-moegirlo_wiki-NS_0-0.xml.gz, Parent tag: sitemap, Tag: loc, 4, Mar 27, 2012
  • URL: sitemap-moegirlo_wiki-NS_1-0.xml.gz, Parent tag: sitemap, Tag: loc, 8, Mar 27, 2012
  • URL: sitemap-moegirlo_wiki-NS_2-0.xml.gz, Parent tag: sitemap, Tag: loc, 12, Mar 27, 2012

MediaWiki 1.19: can't get GenerateSitemap.php to work

PHP Fatal error: Call to undefined function mysql_error() in /includes/db/DatabaseMysql.php on line 305

--Zoglun (talk) 04:41, 18 July 2012 (UTC)

It seems to be adding deleted pages to the sitemaps

Is this the expected behaviour? If so, I think this is a serious bug. I'm running the version that comes with MW 1.17.4.

Thanks

How to add main namespaces to the sitemap

I've run the PHP script and it seems that the main namespaces were not indexed, e.g. www.xxx.com/q2a/. I'd prefer not to have to create a custom namespace for them. How do I index these pages? --Tech (talk) 08:45, 15 July 2013 (UTC)

How to generate sitemap for all users

I have more than 100 users on my site, but the sitemap is only generated for 3 users. How can I fix this? —The preceding unsigned comment was added by 122.164.234.236 (talk • contribs) 07:28, 27 December 2013 (UTC)

Sitemap URLs when using short URLs

When a wiki is set up to rewrite URLs to short URLs, it grabs the sitemap URLs, treats them as page URLs and reports "(There is currently no text in this page)." Nowhere is it documented what the URLs should be in this case, and there is no indication that any configuration changes need to be made.

DB connection error: No database connection

I got this error: DB connection error: No database connection. Could this be related to using PostgreSQL?

MediaWiki version: 1.19.3
PHP version: 5.4.21
PostgreSQL version: 9.2.3

Manual:generateSitemap.php not working (Solved)

Hi, I am new to MediaWiki. My articles (e.g. http://artistopedia.com/wiki/Raftaar) are not showing in Google search results. I have configured short URLs using .htaccess.

I read somewhere that I need to submit a sitemap to Google, so I tried doing this (Manual:GenerateSitemap.php), but unfortunately it does not look like it worked. I am getting this (http://artistopedia.com/sitemap/) as a result.

Can anybody please help?

Solution

It actually worked. I didn't know that "generatesitemap.php" generates a sitemap index instead of sitemap.xml. It's almost the same thing: you can go ahead and submit the URL of the .xml file under yourdomain.com/sitemap to Google Webmaster Tools.