User:NeilK/Sitemaps

Actions taken

 * late February 2011 - Danese asked me to look into this, discussed this with Wikimedia veterans, started this wiki page
 * Google Image Search meeting on February 28, 2011. They explained stuff we already knew (with the exception of Google's new extensions to sitemaps, including licensing). Asked the GIS non-technical rep to ping us again later.
 * Tried to spec it out. Originally thought we could use the new extensions provided, like licensing, but realized that digging that info out of wikitext is not feasible at scale. Such info needs to be in the db first, or we need a cleverer strategy for doing sitemaps.
 * early March 2011 - GIS non-technical rep pings again, asking if we've assigned this work
 * floated idea on Commons-L, got some feedback
 * User:Tfinc mentions that TinEye (reverse image search engine) is very keen to get Sitemaps on Commons. This would be a good thing for our community for the whole copyvio research issue, as well as being a sort of "registrar" for images on Commons.... However, they seem to have their own TinEye "imagemap" XML format which is similar but not quite the same as regular sitemaps, and once again not like the Google format.
 * Since we have three different(?) formats -- Sitemaps, Sitemaps + Google extensions, TinEye -- this is starting to look like we should get someone to actually do work on this.
 * emailed TinEye to ask whether they are continuing to use their own format, or whether we can provide Sitemaps + Google Extensions only

Resources
Manual:GenerateSitemap.php is the standard way to do this.

The manual page indicates these are not compatible with Google as of 1.16, but a patch can fix that. Did this get fixed in 1.17? Manual talk:GenerateSitemap.php
 * Yes, it was fixed in 75650. Max Semenik 21:48, 25 February 2011 (UTC)

There are a number of other tools: Extension:Google Sitemap (may be obsolete)

This user created another script... it is unclear why he thought it was necessary to write his own, perhaps this works better with multiple sites. User:DaSch/generateSitemap.php

Questions
How well do these tools scale? Is it feasible to redump all the titles on a frequent basis, or should we go to a more incremental strategy?

The standard sitemaps script is well-written, but selects all pages in alphabetical order and then iterates, writing entries as it goes. For enwiki this is obviously going to take a longish time. (Jens says there was no issue, it was run from a cronjob and completed just fine, at least in late 2007. enwiki was about half the size it is now, although there are more wikis generally today.)

Still, this is very inefficient, since almost all the results are the same from one run to the next, and search engines will just get a 404 on the rare occasion they look up a page and it's gone. So mostly it's pointless to iterate through them all.

Interestingly, the sitemaps script sets a different priority for every namespace. It creates a new sitemap for every namespace (identified by number, which is the only sane way given that namespace names sometimes differ between languages or wikis) and then creates one index file for them all.
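A minimal sketch of that layout (one sitemap per namespace number, plus one index pointing at all of them). This is not the generateSitemap.php code; the file names, the priority values and the `build_sitemaps` helper are assumptions made up for illustration.

```python
from collections import defaultdict
from xml.sax.saxutils import escape

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Hypothetical per-namespace priorities; the real script's values may differ.
PRIORITY = {0: "1.0", 6: "0.5"}
DEFAULT_PRIORITY = "0.1"


def build_sitemaps(pages, base_url):
    """pages: iterable of (namespace_number, url, lastmod) tuples.

    Returns {filename: xml_text}: one sitemap per namespace and one index.
    """
    by_ns = defaultdict(list)
    for ns, url, lastmod in pages:
        by_ns[ns].append((url, lastmod))

    files = {}
    for ns in sorted(by_ns):
        prio = PRIORITY.get(ns, DEFAULT_PRIORITY)
        entries = "".join(
            f"<url><loc>{escape(url)}</loc><lastmod>{lastmod}</lastmod>"
            f"<priority>{prio}</priority></url>\n"
            for url, lastmod in by_ns[ns]
        )
        # Files are keyed by namespace number, not by (localized) name.
        files[f"sitemap-NS_{ns}.xml"] = (
            '<?xml version="1.0" encoding="UTF-8"?>\n'
            f'<urlset xmlns="{SITEMAP_NS}">\n{entries}</urlset>\n'
        )

    # The index lists every per-namespace file written above.
    index_entries = "".join(
        f"<sitemap><loc>{escape(base_url + name)}</loc></sitemap>\n"
        for name in sorted(files)
    )
    files["sitemap-index.xml"] = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<sitemapindex xmlns="{SITEMAP_NS}">\n{index_entries}</sitemapindex>\n'
    )
    return files
```

Keying the files by namespace number means the same file name works on every language edition, whatever the namespace is called locally.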

See Ideas below for a more incremental approach.

History
Consensus from Brion Vibber, Ariel T. Glenn, et al., is that we used to run Sitemaps but haven't since 2008. suggests the exact date was 2007-12-27.

Brion believes the standard generateSitemap.php script was the one being used.

It is unclear why we stopped. Brion believes that Jens Frank (JeLuF on IRC) was the one in charge of this. E-mailing him to find out.

Jens replies:
 * Google was not really using it. They apparently also had some special engine to crawl wikipedia, so there wasn't a real need for it.
 * My note: However, this says nothing about other search engines.
 * Additionally, there were some problems with the configuration. If I remember correctly, there were some issues with the location of the sitemaps and our disk layout. We have only one DocumentRoot per project, so e.g. de.wikipedia.org and en.wikipedia.org share the same /sitemap/ directory and - even worse - they share the file that Google needs to verify that I'm allowed to generate sitemaps for *.wikipedia.org. This was reported to Google, but they never came back to us - probably because they have their special crawler already.
 * My note: But this could be fixed with a simple change to the script?
 * Not that easy. Lots of scripts assume that there's one directory per project. And it would increase the size of our installation dramatically. We already have problems with the rollout of changes (aka "scap") since it takes too long and the apache farm is having different software releases during the scap.

Ideas
Dumps are NOT regular again yet, so perhaps this should be decoupled from that...

It is also sometimes important to be timely. It would be nice if we had a script to run hourly (or more frequently) to append new articles to the last leaf of a tree of sitemap index files (or entire new leaves as appropriate). Then we can regenerate the entire tree now and then, perhaps at the same moment as a dump, to get rid of deleted pages.
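A rough sketch of the append step, assuming the leaves are held as plain in-memory lists of URLs (a real script would read and rewrite files). The `append_urls` helper is hypothetical; the 50,000 figure is the per-file URL limit from the sitemaps.org protocol.

```python
MAX_URLS_PER_SITEMAP = 50000  # per-file limit in the sitemaps.org protocol


def append_urls(leaves, new_urls, max_per_leaf=MAX_URLS_PER_SITEMAP):
    """Append new_urls to the last leaf, opening new leaves when it is full.

    leaves: list of lists of URLs, modified in place.
    Returns the sorted indexes of the leaves that changed and need rewriting.
    """
    touched = set()
    for url in new_urls:
        # Start a fresh leaf if there is none yet or the last one is full.
        if not leaves or len(leaves[-1]) >= max_per_leaf:
            leaves.append([])
        leaves[-1].append(url)
        touched.add(len(leaves) - 1)
    return sorted(touched)
```

Only the touched leaves (and the index, if a new leaf appeared) would need to be rewritten on each hourly run; deleted pages stay in older leaves until the next full regeneration.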

Change frequency
Question: does Google allow changes to the change-frequency of an item? It might be interesting to set change-frequency to items based on the date of their last edit. This presumes that edits come in "waves".

Note: the generateSitemap.php script sets lastmod, but does not set change-frequency. Presumably we would have to dip into article history to do a true guess at frequency of changes, which would be hella slow. But we could also take a wild guess based on the difference between script run time and the last modification time. Or, maybe that's what Google does anyway, so we should just let them handle it.
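The "wild guess" could look something like the sketch below: map the gap between script run time and last modification onto one of the changefreq values the sitemap protocol allows. The thresholds and the `guess_changefreq` helper are assumptions picked for illustration, not anything the existing script does.

```python
from datetime import datetime, timedelta

# Hypothetical cut-offs: a page edited within this long ago gets this label.
THRESHOLDS = [
    (timedelta(hours=1), "hourly"),
    (timedelta(days=1), "daily"),
    (timedelta(days=7), "weekly"),
    (timedelta(days=31), "monthly"),
]


def guess_changefreq(last_modified, now):
    """Guess a sitemap changefreq value from the age of the last edit."""
    age = now - last_modified
    for limit, label in THRESHOLDS:
        if age <= limit:
            return label
    # Anything older than the last threshold is assumed to change rarely.
    return "yearly"
```

This only needs the lastmod value the script already has, so it avoids any lookup in article history.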

Sitemap formats
Major differences:

The primary item in Google's format is the URL, which can contain many images or videos. The sitemap format is also designed to give the search engine a sense of what items should be recrawled and when.

The primary item tracked in TinEye's format is an image, which may have a page. There is no recrawling in TinEye; they never delete or replace an image from their archive. This suggests we might want to delay giving data to TinEye until we're sure it's correct (maybe let it sit for a few weeks?)
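One way to picture the "let it sit" idea: only export images that have existed for some settling period and have not been deleted in the meantime. The record fields, the three-week default and the `ready_for_tineye` helper are all hypothetical.

```python
from datetime import datetime, timedelta


def ready_for_tineye(images, now, settle=timedelta(weeks=3)):
    """Return URLs of images old enough to hand over and still present.

    images: iterable of dicts with 'url', 'uploaded' (datetime), 'deleted' (bool).
    """
    return [
        img["url"]
        for img in images
        if not img["deleted"] and now - img["uploaded"] >= settle
    ]
```

Since TinEye never removes an image, anything deleted as a copyvio during the settling period would simply never be sent.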

* = required

Google
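 <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
   <url>
     <loc>http://commons.wikimedia.org/wiki/File:Limerick_-_Thomond_Bridge_-_geograph.org.uk_-_331738.jpg</loc>
     <image:image>
       <image:loc>http://upload.wikimedia.org/wikipedia/commons/a/aa/Limerick_-_Thomond_Bridge_-_geograph.org.uk_-_331738.jpg</image:loc>
       <image:caption>Limerick - Thomond Bridge Thomond Bridge is the most northern of the three road bridges that span the River Shannon at Limerick.</image:caption>
       <image:geo_location>Limerick, Ireland.</image:geo_location>
       <image:title>File:Limerick - Thomond Bridge - geograph.org.uk - 331738.jpg</image:title>
       <image:license>http://creativecommons.org/licenses/by-sa/2.0/deed.en</image:license>
     </image:image>
     <image:image>
       <image:loc>http://example.com/photo.jpg</image:loc>
     </image:image>
   </url>
 </urlset>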

TinEye
 <tineye-list creation-date="2010-04-20T11:48:43.0Z" xmlns:xs="http://www.w3.org/2001/XMLSchema-instance" xs:noNamespaceSchemaLocation="http://www.tineye.com/contributing/imagemap.xsd" >
   <id>??? page id ? File:Limerick_-_Thomond_Bridge_-_geograph.org.uk_-_331738.jpg</id>
   <page-url>http://commons.wikimedia.org/wiki/File:Limerick_-_Thomond_Bridge_-_geograph.org.uk_-_331738.jpg</page-url>
   <image-url>http://upload.wikimedia.org/wikipedia/commons/a/aa/Limerick_-_Thomond_Bridge_-_geograph.org.uk_-_331738.jpg</image-url>
   <author-id>??? in this case author is not a user on Commons, it was a bot. We have author's URL? http://www.geograph.org.uk/profile/4335</author-id>
   <author-name>Colin Park</author-name>
   <keywords lang='en'>??? use categories??? use descriptions?? Limerick,Road bridges in Ireland</keywords>

   <id>SN08972345</id>
   <page-url>http://www.example.com/image/SN08972345</page-url>
   <image-url>https://comps.example.com/comp/SN08972345.jpg</image-url>
   <author-id>jdoe2</author-id>
   <author-name>Jack Doe</author-name>
   12000
   8000
   <file-size>12373020</file-size>
   <keywords lang='en-US'>dog, fire plug, sidewalk</keywords>
   <keywords lang='en-GB'>dog, fire hydrant, pavement</keywords>
   <keywords lang='de'>hund, hydrant, b&#252;rgersteig</keywords>
 </tineye-list>