Extension:ArchiveLinks/Project/partners

Requirements

This project can be configured to use a local spider, but for the Wikipedias it is desirable, for a number of reasons, to have another organization do the archiving and serve the archived pages: fewer security issues to worry about and fewer legal issues.

Must have:

  • Willing to crawl any site we configure
  • Willing to crawl (eventually) all the links in Wikipedia (~16M)
  • Willing to crawl at a rate to keep up with our peak update speed (5 updates/sec)
  • Low-latency crawl
  • Reliable, fast serving of pages
  • Allows different versions of a URL to be archived over time
  • No advertising

Nice to have:

  • No fees to WMF or other organizations
  • Allows "rearchival" (different versions over time)
  • No rearchival limits


Internet Archive

The Internet Archive is very interested in partnering with the WMF and Wikipedia. Their mission is to archive the web for history's sake, and they see our external links as a source of important, community-vetted, high-quality links. They are by far the largest archiving organization on the web and already have over 150 billion web pages indexed, so there is a high probability that some version of a link will already exist in their archive if the site is down at the time the link is added.

Advantages

  • Willing to start indexing
  • Most likely has the best-developed spider; pages should look the most like the original. They plan to use real browsers to do the actual archival, so all page elements will be downloaded, possibly even with rudimentary support for web video.
  • Recently finished crawling all Wikipedia external links, so all the old content is currently in place, archived and ready to go.
  • Most likely to have many existing revisions of archived content, so you can see how content has changed over time
  • By far the largest index; their mission is solely to archive the web, so this is not a side project
  • Non-profit organization
  • Offices in San Francisco, close to the Wikimedia Foundation offices, so direct face-to-face contact is easily possible (NeilK and Eloquence know people there)

Disadvantages

  • Fast crawl (archive-on-demand) is not currently in place, although it is under development and could be available very shortly

Must have

  • Willing to crawl any site we configure: yes
  • Willing to crawl (eventually) all the links in Wikipedia (~16M): they're already trying to do this
  • Willing to crawl at a rate to keep up with our peak update speed (5 updates/sec): yes
  • Low-latency crawl: working on it. Internet Archive crawl latency is still too slow (weeks or months), but they are starting a project called Archive-on-Demand to speed it up dramatically for some sites, such as Wikipedia.
  • Reliable, fast serving of pages: needs stats, although this is a very large and well-established site
  • No advertising: non-profit, much like Wikipedia; will never have ads

Nice to have

  • No fees to WMF or other organizations: none
  • Allows "rearchival" (different versions over time): yes, with a very easy linking API to get the version closest to a given timestamp (see the sketch after this list)
  • No rearchival limits: yes
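
As an illustration of the linking API mentioned above, here is a minimal sketch in Python. It assumes only the Wayback Machine's long-standing timestamp-based URL scheme (https://web.archive.org/web/<YYYYMMDDhhmmss>/<url>), which redirects to the capture closest to the requested timestamp; the function name and the example URL are hypothetical.

  WAYBACK_BASE = "https://web.archive.org/web"

  def closest_snapshot_url(url, timestamp):
      # Build a Wayback Machine link for the capture closest to `timestamp`.
      # `timestamp` is a YYYYMMDDhhmmss string; the Wayback Machine redirects
      # the request to whichever archived version is nearest to it.
      return "%s/%s/%s" % (WAYBACK_BASE, timestamp, url)

  # Hypothetical usage: link to the version of an external link as it
  # existed around the time it was added to a Wikipedia article.
  print(closest_snapshot_url("http://example.org/cited-page", "20110615000000"))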

Meetings

Wikiwix

Wikiwix is an external search engine for Wikipedia. They have also expressed interest in providing this service and are currently archiving new external links from several Wikipedias, including English and French. The French Wikipedia currently has a script deployed that links to their archived content.
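
The exact integration is up to the site script, but for illustration, here is a minimal sketch of how such a script could turn an external link into a pointer at Wikiwix's cache. The archive.wikiwix.com/cache/?url=... endpoint format is an assumption about their service, and the function name and example URL are hypothetical.

  from urllib.parse import quote

  def wikiwix_cache_url(url):
      # Assumed Wikiwix cache endpoint; the original URL is passed as a
      # percent-encoded query parameter (an assumption, not documented here).
      return "http://archive.wikiwix.com/cache/?url=" + quote(url, safe="")

  # Hypothetical usage: build the "archive" link a site script could place
  # next to each external link in an article.
  print(wikiwix_cache_url("http://example.org/cited-page"))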

Advantages

  • Out of the box ready solution
  • Currently in place for several wikis
  • Already supports low latency crawling
  • Has been heavily involved with the Wikipedia community and receptive to its concerns

Disadvantages

  • Small site
  • Is only archiving new links, not all links
  • Side project, not core to the organization's main mission of search
  • For-profit organization

Must have

  • Willing to crawl any site we configure: possibly
  • Willing to crawl (eventually) all the links in Wikipedia (~16M): probably not
  • Willing to crawl at a rate to keep up with our peak update speed (5 updates/sec): yes
  • Low-latency crawl: yes
  • Reliable, fast serving of pages: ??? needs reliable stats
  • No advertising: no ads at the current time, but Wikiwix is a for-profit company

Nice to have

  • No fees to WMF or other organizations: none at the current time
  • Allows "rearchival" (different versions over time): uncertain
  • No rearchival limits: uncertain

WebCitation.org

Webcitation.org is an archiving organization that focuses solely on on-demand archival. Their primary audience is scholarly publications, such as journals, that want to prevent link rot in links used as references. Many Wikipedians have already been using their service to archive external links; however, nothing is archived automatically. In addition, their current infrastructure is not really set up for fully automatic archival.

Advantages

  • On-demand archival is set up out of the box
  • There have already been numerous attempts by bots to set up archival with them on the English Wikipedia, so this is a familiar project
  • Since some Wikipedians have been using this service for years, some links will already be in the archive
  • Supports archiving a page multiple times in case the content of the page changes
  • Non-profit and their core mission is archival

Disadvantages

  • Infrastructure is currently inefficient and not set up for full-scale archival: everything primarily operates on email addresses, and an email is sent for every archival request. This does not scale well when tens of thousands of links are involved
  • May not be able to handle the load of running at full scale
  • Requires us to send archival requests to them instead of having them poll a feed and archive everything on it (a sketch of this push-style flow follows this list)
  • Since they have relied on manual archival requests from users, many of the existing links on Wikipedia will not already be archived
  • It has been somewhat difficult to get in contact with webcitation.org, which could be a problem
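
To make the push model above concrete, here is a minimal sketch in Python of what a per-URL archival request could look like. The endpoint and parameters are entirely hypothetical, since webcitation.org's actual interface is not documented here; the point is only that the wiki must issue one request per URL rather than the archiver pulling a feed of new external links.

  import urllib.request
  from urllib.parse import urlencode

  # Entirely hypothetical endpoint and parameters, used only to illustrate
  # the push model described above.
  ARCHIVE_ENDPOINT = "https://archiving-partner.example/archive"

  def request_archival(url, contact_email):
      # One HTTP request per URL we want archived; at our peak of ~5 new
      # external links per second this means ~5 such requests per second.
      data = urlencode({"url": url, "email": contact_email}).encode()
      with urllib.request.urlopen(ARCHIVE_ENDPOINT, data=data) as resp:
          return resp.read()

  request_archival("http://example.org/cited-page", "archival-bot@example.org")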

Must have

  • Willing to crawl any site we configure: yes, but a request must be made for each URL
  • Willing to crawl (eventually) all the links in Wikipedia (~16M): possibly
  • Willing to crawl at a rate to keep up with our peak update speed (5 updates/sec): probably not
  • Low-latency crawl: yes
  • Reliable, fast serving of pages: ??? needs stats
  • No advertising: yes

Nice to have

  • No fees to WMF or other organizations: none at the current time
  • Allows "rearchival" (different versions over time): yes
  • No rearchival limits: uncertain