Extension:ArchiveLinks/Project

Google Summer of Code project.


 * Design doc
 * User stories
 * Installation instructions

Potential partners
This project can be configured to use a local spider, but in the case of the Wikipedias it is desirable from many perspectives to have someone else do the archiving and serve the archived pages: fewer security issues to worry about, and fewer legal issues.

Requirements:
 * Willing to crawl any site we configure
 * Willing to crawl (eventually) all the links in Wikipedia (~16M)
 * Willing to crawl at a rate to keep up with our peak update speed (5 updates/sec)
 * Low-latency crawl
 * Reliable, fast serving of pages
 * No advertising
 * No fees to WMF or other organizations (ideally...)
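For a sense of scale, the two figures above can be combined in a rough back-of-envelope calculation (assuming a single crawler running continuously at a constant 5 links/sec): backfilling all existing links would take over a month.

```python
# Back-of-envelope: time to backfill all existing external links
# at the peak update rate quoted above. Assumes one crawler
# running continuously at a constant rate.
TOTAL_LINKS = 16_000_000   # ~16M external links in Wikipedia
RATE_PER_SEC = 5           # peak update speed, links/sec

seconds = TOTAL_LINKS / RATE_PER_SEC
days = seconds / 86_400    # seconds per day
print(f"{days:.0f} days")  # roughly 37 days
```

This is why a partner with an existing archive (rather than one starting from scratch) matters: ongoing updates are easy to keep up with, but the initial backfill dominates.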

Internet Archive
The Internet Archive is very interested in partnering with the WMF & Wikipedia. Their mission is to archive the web for history's sake, and they see our external links as a source of important, community-vetted, high-quality links. They are by far the largest archiving organization on the web and already have over 150 billion web pages indexed. This makes it highly likely that some version of a link will already exist in their archive if the site is down at the time the link is added.

Advantages

 * Currently working on improving index and archive availability speed
 * Willing to start indexing
 * Most likely has the best-developed spider, so archived pages should look the most like the original. They plan to use real browsers to do the actual archival, so all page elements will be downloaded, possibly with even rudimentary support for web video.
 * Recently finished crawling all Wikipedia external links, so all the old content is already archived and ready to go
 * Most likely to have many existing revisions of archived content, so readers can see how content has changed over time
 * By far the largest index; archiving the web is their sole mission, not a side project
 * Non-profit organization
 * Offices are in San Francisco, close to the Wikimedia Foundation offices, so direct face-to-face contact is easy

Disadvantages

 * System is not currently in place, although it is under development and could be available very soon
 * Crawl latency is currently weeks or months behind the present

Requirements

 * Willing to crawl any site we configure: yes
 * Willing to crawl (eventually) all the links in Wikipedia: they're already trying to do this
 * Willing to crawl at a rate to keep up with our peak update speed: yes
 * Low-latency crawl: working on it. Internet Archive crawl latency is still too slow (weeks or months), but they are starting a project called Archive-on-Demand to speed it up dramatically for some sites, such as Wikipedia.
 * Reliable, fast serving of pages: seems so; can we verify this with statistics?
 * No advertising: as a non-profit, much like Wikipedia, they will never run ads
 * No fees to WMF or other organizations (ideally...): none

Other benefits: their HQ is in the same city as the WMF (San Francisco), and many people in the two organizations know each other (NeilK, Eloquence).


 * Meeting, Internet Archive, 2011-07-14

Wikiwix
Wikiwix is an external search engine for Wikipedia. They have also expressed interest in providing this service and are currently archiving new external links from several Wikipedias, including the English and French ones. The French Wikipedia currently has a script deployed that links to their archived content.

Advantages

 * Out of the box ready solution
 * Currently in place for several wikis
 * Already supports low latency crawling
 * Has been heavily involved with the Wikipedia community and receptive to its concerns

Disadvantages

 * Existing archive is nowhere near as large as archive.org
 * It is unclear whether Wikiwix currently supports re-archiving pages; it appears that once a page is archived, a new version is never archived again
 * Is only archiving new links, not all links
 * A side project, not core to the organization's main mission of search
 * For-profit organization

Requirements

 * Willing to crawl any site we configure: possibly
 * Willing to crawl (eventually) all the links in Wikipedia (~16M): probably not
 * Willing to crawl at a rate to keep up with our peak update speed (5 updates/sec): yes
 * Low-latency crawl: yes
 * Reliable, fast serving of pages: most likely yes, though I'm not totally sure on reliability
 * No advertising: none at the current time, but they are a for-profit company
 * No fees to WMF or other organizations (ideally...): none at the current time

WebCitation.org
Webcitation.org is an archiving organization focused solely on on-demand archival. Their primary audience is scholarly publications, such as journals, seeking to prevent link rot in links used as references. Many Wikipedians have already been using their service to archive external links; however, nothing is archived automatically. In addition, their current infrastructure is not really set up for fully automatic archival.

Advantages

 * On demand archival is set up out of the box
 * There have already been numerous bot-based attempts at setting up archival with them on the English Wikipedia, so this is a familiar project
 * Since some Wikipedians have been using this service for years, some links will already be in the archive
 * Supports archiving a page multiple times in case the content of the page changes
 * Non-profit and their core mission is archival

Disadvantages

 * Infrastructure is currently inefficient and not set up for full-scale archival; everything operates primarily via email addresses, and an email is sent for every archival request. This does not scale well when tens of thousands of links are involved
 * May not be able to handle the load of running at full scale
 * Requires us to send archival requests to them instead of polling a feed and archiving everything on it
 * Since they have relied on manual archival requests from users many of the existing links on Wikipedia will not have already been archived.
 * Has been somewhat difficult to get in contact with webcitation.org, which could be a problem
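The request-model disadvantage above can be sketched concretely. Below is a minimal illustration of the two integration styles: a push model where we must issue one archival request per URL (WebCitation-style), versus a pull model where the archiver polls a feed of new external links. All endpoint URLs and parameter names here are hypothetical and do not reflect any real WebCitation or feed API.

```python
# Sketch of the two integration models. Endpoint URLs and
# parameter names are hypothetical, for illustration only.
import urllib.parse


def push_request_url(archive_endpoint, page_url, notify_email):
    """Push model (WebCitation-style): we must construct and send
    one archival request per external link, keyed to an email
    address, since their workflow is email-based."""
    query = urllib.parse.urlencode({"url": page_url, "email": notify_email})
    return f"{archive_endpoint}?{query}"


def pull_new_links(feed):
    """Pull model (preferred): the archiver polls a feed of newly
    added external links and archives everything on it in bulk."""
    return [entry["url"] for entry in feed]


# Push: one request URL per link we want archived.
req = push_request_url(
    "https://archive.example.org/archive",  # hypothetical endpoint
    "https://example.com/cited-page",
    "ops@example.org",
)

# Pull: the archiver consumes the whole feed in one sweep.
feed = [{"url": "https://example.com/a"}, {"url": "https://example.com/b"}]
urls = pull_new_links(feed)
```

The pull model puts the scaling burden on the archiver rather than on MediaWiki, which is why the requirement is phrased as "polling a feed" rather than per-URL requests.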

Requirements

 * Willing to crawl any site we configure: yes, but a separate request must be made for each URL
 * Willing to crawl (eventually) all the links in Wikipedia (~16M): possibly
 * Willing to crawl at a rate to keep up with our peak update speed (5 updates/sec): probably no
 * Low-latency crawl: yes
 * Reliable, fast serving of pages: most likely yes
 * No advertising: yes
 * No fees to WMF or other organizations (ideally...): no fees at the current time