User:NeilK/Sitemaps

Actions taken

late February 2011 - Danese asked me to look into this, discussed this with Wikimedia veterans, started this wiki page
Google Image Search meeting on February 28, 2011. They explained things we already knew (with the exception of Google's new extensions to sitemaps, including licensing). Asked the GIS non-technical rep to ping us again later.
Tried to spec it out. Originally thought we could use the new extensions provided, such as licensing, but realized that digging that info out of wikitext is not feasible at scale. Such info needs to be in the db first, or we need a cleverer strategy for generating sitemaps.
early March 2011 GIS non-technical rep pings again, asking if we've assigned this work
floated idea on Commons-L, got some feedback
User:Tfinc mentions that TinEye (reverse image search engine) is very keen to get Sitemaps on Commons. This would be a good thing for our community for the whole copyvio research issue, as well as being a sort of "registrar" for images on Commons.... However, they seem to have their own TinEye "imagemap" XML format which is similar but not quite the same as regular sitemaps, and once again not like the Google format.
- Since we have three different(?) formats (Sitemaps, Sitemaps + Google extensions, TinEye), this is starting to look like we should get someone to actually do work on this.
- emailed TinEye to ask if they are continuing to use their own format, or can we do Sitemaps + Google Extensions only
April 2011 pinged ops again to see how this is progressing; they gave me more rights on RT so I can give them more info about how to enable this

Resources

Manual:GenerateSitemap.php is the standard way to do this.

The manual page indicates these are not compatible with Google as of 1.16, but a patch can fix that. Did this get fixed in 1.17? Manual talk:GenerateSitemap.php#A BUG FIX to work the Google Webmaster Tools

Yes, it was fixed in rev:75650. Max Semenik 21:48, 25 February 2011 (UTC)

There are a number of other tools: Extension:Google Sitemap (may be obsolete)

This user created another script... it is unclear why he thought it was necessary to write his own, perhaps this works better with multiple sites. User:DaSch/generateSitemap.php

There is a specialized extension for creating sitemaps with Google's "news" protocol extensions. Extension:GoogleNewsSitemap. Seems to work more like an RSS feed, generating a crawlable sitemap of the most recent items. Also see bug 21919 where people complain about getting this deployed.

Questions

How well do these tools scale? Is it feasible to redump all the titles on a frequent basis, or should we go to a more incremental strategy?

The standard sitemaps script is well-written, but selects all pages in alphabetical order and then iterates, writing entries as it goes. For enwiki this is obviously going to take a longish time. (Jens says there was no issue, it was run from a cronjob and completed just fine, at least in late 2007. enwiki was about half the size it is now, although there are more wikis generally today.)

Still, this is very inefficient, since almost all the results are the same as in the previous run anyway, and search engines will catch a 404 on the rare occasion they look up a page and it's gone. So mostly it's pointless to iterate through them all.

Interestingly, the sitemaps script sets a different priority for every namespace. It creates a new sitemap for every namespace (identified by number, which is the only sane way given that namespace names sometimes differ between languages or wikis) and then creates one index file for them all.
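
To make the layout concrete, here is a rough Python sketch of the one-sitemap-per-namespace plus one-index-file arrangement. It is not the generateSitemap.php code: the file naming pattern, the priority values and the function names are all assumptions made for illustration. The only things taken from the sitemaps.org protocol are the sitemapindex/sitemap/loc element names and the default priority of 0.5.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Hypothetical per-namespace priorities, keyed by namespace number
# (0 = main, 1 = talk, 6 = file). The real script's values may differ.
NAMESPACE_PRIORITY = {0: "1.0", 1: "0.1", 6: "0.8"}


def priority_for(namespace):
    """Return the priority for a namespace number (protocol default is 0.5)."""
    return NAMESPACE_PRIORITY.get(namespace, "0.5")


def sitemap_filename(wiki, namespace, part=0):
    """Name a sitemap file by namespace number, not by localized name."""
    # Assumed naming pattern, for illustration only.
    return f"sitemap-{wiki}-NS_{namespace}-{part}.xml"


def build_index(base_url, wiki, namespaces):
    """Return sitemap index XML listing one sitemap file per namespace."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for ns in sorted(namespaces):
        entry = ET.SubElement(root, "sitemap")
        ET.SubElement(entry, "loc").text = (
            f"{base_url}/{sitemap_filename(wiki, ns)}"
        )
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_index("http://example.org/sitemap", "commonswiki", [0, 6]))
```

Keying everything on the namespace number means the same code works unchanged on wikis where the namespace names are localized.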

See Ideas below for a more incremental approach.

History

Consensus from Brion Vibber, Ariel T. Glenn, et al., is that we used to run Sitemaps but haven't since 2008. bug 13693 suggests the exact date was 2007-12-27.

Brion believes the standard generateSitemap.php script was the one being used.

It is unclear why we stopped. Brion believes that Jens Frank (JeLuF on IRC) was the one in charge of this. E-mailing him to find out.

Jens replies:

Google was not really using it. They apparently also had some special engine to crawl wikipedia, so there wasn't a real need for it.
- My note: However, this says nothing about other search engines.
Additionally, there were some problems with the configuration. If I remember correctly, there were some issues with the location of the sitemaps and our disk layout. We have only one DocumentRoot per project, so e.g. de.wikipedia.org and en.wikipedia.org share the same /sitemap/ directory and - even worse - they share the file that Google needs to verify that I'm allowed to generate sitemaps for *.wikipedia.org. This was reported to Google, but they never came back to us - probably because they have their special crawler already.
- My note: But this could be fixed with a simple change to the script?
  - Not that easy. Lots of scripts assume that there's one directory per project. And it would increase the size of our installation dramatically. We already have problems with the rollout of changes (aka "scap") since it takes too long and the apache farm is running different software releases during the scap.

Ideas

Dumps are NOT regular again yet, so perhaps this should be decoupled from that...

It is also sometimes important to be timely. It would be nice if we had a script to run hourly (or more frequently) to append new articles to the last leaf of a tree of sitemap index files (or entire new leaves as appropriate). Then we can regenerate the entire tree now and then, perhaps at the same moment as a dump, to get rid of deleted pages.
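
A minimal sketch of the "append to the last leaf" idea, in Python. It models the leaves as in-memory lists of URLs rather than files, and the function name is made up for illustration. The 50,000 figure is the per-file URL limit from the sitemaps.org protocol. The function returns which leaves changed, so an hourly job would only need to rewrite those files and, if a new leaf appeared, the index.

```python
MAX_URLS_PER_SITEMAP = 50000  # per-file limit from the sitemaps.org protocol


def append_urls(leaves, new_urls, max_per_leaf=MAX_URLS_PER_SITEMAP):
    """Append new_urls to the last leaf, opening new leaves when it is full.

    leaves is a list of lists of URLs and is modified in place.
    Returns the sorted indexes of the leaves that changed.
    """
    touched = set()
    for url in new_urls:
        # Start a new leaf if there is none yet or the last one is full.
        if not leaves or len(leaves[-1]) >= max_per_leaf:
            leaves.append([])
        leaves[-1].append(url)
        touched.add(len(leaves) - 1)
    return sorted(touched)


if __name__ == "__main__":
    leaves = [["a", "b"], ["c"]]
    changed = append_urls(leaves, ["d", "e", "f"], max_per_leaf=2)
    print(leaves, changed)
```

Deleted pages are never removed by this approach, which is why the occasional full regeneration is still needed.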

Change frequency

Question: does Google allow changes to the change-frequency of an item? It might be interesting to set change-frequency to items based on the date of their last edit. This presumes that edits come in "waves".

Note: the generateSitemap.php script sets lastmod, but does not set change-frequency. Presumably we would have to dip into article history to do a true guess at frequency of changes, which would be hella slow. But we could also take a wild guess based on the difference between script run time and the last modification time. Or, maybe that's what Google does anyway, so we should just let them handle it.
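
The "wild guess" could look something like the following Python sketch. The thresholds are arbitrary assumptions, not anything Google or the script specifies; only the returned strings are values that the sitemaps.org protocol defines for changefreq.

```python
from datetime import datetime, timedelta


def guess_changefreq(last_modified, now):
    """Guess a changefreq value from the age of the last edit.

    Assumes a recently edited page is likely to be edited again soon.
    The cut-off points below are illustrative only.
    """
    age = now - last_modified
    if age < timedelta(hours=1):
        return "hourly"
    if age < timedelta(days=1):
        return "daily"
    if age < timedelta(weeks=1):
        return "weekly"
    if age < timedelta(days=30):
        return "monthly"
    return "yearly"


if __name__ == "__main__":
    run_time = datetime(2011, 3, 1, 12, 0)
    print(guess_changefreq(run_time - timedelta(days=3), run_time))
```

This needs only the last modification time the script already reads for lastmod, so it avoids touching article history.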

Sitemap formats

Major differences:

The primary item in Google's format is the URL, which can contain many images or videos. The sitemap format is also designed to give the search engine a sense of what items should be recrawled and when.

The primary item tracked in TinEye's format is an image, which may have a page. There is no recrawling in TinEye; they never delete or replace an image from their archive. This suggests we might want to delay giving data to TinEye until we're sure it's correct (maybe let it sit for a few weeks?)

* = required

grouping	mediawiki page table	mediawiki image table	mediawiki wikitext (this is hard)	Google Sitemap+Image ext.+Video ext.	TinEye Imagemap
x				*urlset	*tineye-list
x				*urlset/url
	derive url for page			*urlset/url/loc	*tineye-list/image/page-url
		last file update?		urlset/url/lastmod
		delve into history?		urlset/url/changefreq
				urlset/url/priority
x				urlset/url/image:image/	*tineye-list/image
		derive from title?			*tineye-list/image/id
		derive url for image		urlset/url/image:image/image:loc	*tineye-list/image/image-url (for img scaled between 600x600 and 1200x1200)
			description	urlset/url/image:image/image:caption
			any {{Location}} template	urlset/url/image:image/image:geo_location (format?)
page title				urlset/url/image:image/image:title
			any template in Category:License (until this moves to db)	urlset/url/image:image/image:license
			a URL contained within infobox author field, esp if to a MediaWiki user page?		*tineye-list/image/author-id
			Infobox author field		*tineye-list/image/author-name
					tineye-list/image/keywords [lang=]
		related to img_width but needs to be derived from scaled version given to TinEye (they demand it fit within 1200x1200)			tineye-list/image/metadata/width
x					tineye-list/image/metadata
		related to img_height but needs to be derived from scaled version given to TinEye (they demand it fit within 1200x1200)			tineye-list/image/metadata/height
		related to img_size but needs to be derived from scaled version given to TinEye (they demand it fit within 1200x1200)			tineye-list/image/metadata/file-size
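
For the width and height rows above, the scaled dimensions can be estimated from img_width and img_height. The Python sketch below assumes plain proportional scaling with rounding to the nearest pixel; whether that matches what the thumbnailer actually produces is not verified here, and the function name is hypothetical. The file-size value cannot be derived this way and would have to come from the scaled file itself.

```python
def fit_within(width, height, max_side=1200):
    """Scale (width, height) down proportionally to fit a max_side box.

    Images that already fit are returned unchanged (no upscaling).
    """
    if width <= max_side and height <= max_side:
        return width, height
    scale = min(max_side / width, max_side / height)
    # Never let a very thin image collapse to zero pixels on one side.
    return max(1, round(width * scale)), max(1, round(height * scale))


if __name__ == "__main__":
    print(fit_within(12000, 8000))
```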

Google

<?xml version="1.0" encoding="UTF-8"?>
 <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
  xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
 <url>
   <loc>http://commons.wikimedia.org/wiki/File:Limerick_-_Thomond_Bridge_-_geograph.org.uk_-_331738.jpg</loc>
   <image:image>
     <image:loc>http://upload.wikimedia.org/wikipedia/commons/a/aa/Limerick_-_Thomond_Bridge_-_geograph.org.uk_-_331738.jpg</image:loc>
     <image:caption>Limerick - Thomond Bridge Thomond Bridge is the most northern of the three road bridges that span the River Shannon at Limerick.</image:caption>
     <image:geo_location>Limerick, Ireland</image:geo_location>
     <image:title>File:Limerick - Thomond Bridge - geograph.org.uk - 331738.jpg</image:title>
     <image:license>http://creativecommons.org/licenses/by-sa/2.0/deed.en</image:license>
   </image:image>
   <image:image>
     <image:loc>http://example.com/photo.jpg</image:loc>
   </image:image>
 </url> 
</urlset>
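
As a rough illustration of how entries like the one above could be generated, here is a Python sketch using the standard library's xml.etree.ElementTree. The function names and parameters are made up for this example; the namespaces and element names are the ones from the sample above. Optional fields are simply omitted when no value is available, which fits the situation where caption and license cannot yet be dug out of wikitext.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMAGE_NS = "http://www.google.com/schemas/sitemap-image/1.1"

# Serialize with the same prefixes as the hand-written example.
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("image", IMAGE_NS)


def image_url_entry(page_url, image_url, caption=None, title=None,
                    license_url=None):
    """Build one <url> element containing a single <image:image>."""
    url = ET.Element(f"{{{SITEMAP_NS}}}url")
    ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = page_url
    image = ET.SubElement(url, f"{{{IMAGE_NS}}}image")
    ET.SubElement(image, f"{{{IMAGE_NS}}}loc").text = image_url
    # Same element order as the example: caption, title, license.
    optional = (("caption", caption), ("title", title),
                ("license", license_url))
    for tag, value in optional:
        if value:
            ET.SubElement(image, f"{{{IMAGE_NS}}}{tag}").text = value
    return url


def build_urlset(entries):
    """Wrap <url> elements in a <urlset> and return it as a string."""
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    urlset.extend(entries)
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    entry = image_url_entry(
        "http://example.com/wiki/File:Photo.jpg",
        "http://example.com/photo.jpg",
        title="File:Photo.jpg",
    )
    print(build_urlset([entry]))
```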

TinEye

<tineye-list 
   creation-date="2010-04-20T11:48:43.0Z"
   xmlns:xs="http://www.w3.org/2001/XMLSchema-instance"
   xs:noNamespaceSchemaLocation="http://www.tineye.com/contributing/imagemap.xsd"
>
   <image>
      <id>??? page id ? File:Limerick_-_Thomond_Bridge_-_geograph.org.uk_-_331738.jpg</id>
      <page-url>http://commons.wikimedia.org/wiki/File:Limerick_-_Thomond_Bridge_-_geograph.org.uk_-_331738.jpg</page-url>
      <image-url>http://upload.wikimedia.org/wikipedia/commons/a/aa/Limerick_-_Thomond_Bridge_-_geograph.org.uk_-_331738.jpg</image-url>
      <author-id>??? in this case author is not a user on Commons, it was a bot. We have author's URL? http://www.geograph.org.uk/profile/4335</author-id>
      <author-name>Colin Park</author-name>
      <keywords lang='en'>??? use categories??? use descriptions?? Limerick,Road bridges in Ireland</keywords>
   </image>
   <image>
      <id>SN08972345</id>
      <page-url>http://www.example.com/image/SN08972345</page-url>
      <image-url>https://comps.example.com/comp/SN08972345.jpg</image-url>
      <author-id>jdoe2</author-id>
      <author-name>Jack Doe</author-name>
      <metadata>
         <width>12000</width>
         <height>8000</height>
         <file-size>12373020</file-size>
      </metadata>
      <keywords lang='en-US'>dog, fire plug, sidewalk</keywords>
      <keywords lang='en-GB'>dog, fire hydrant, pavement</keywords>
      <keywords lang='de'>hund, hydrant, bürgersteig</keywords>
   </image>
</tineye-list>