Archived Pages


Goals

  • The Internet Archive wants to help fix broken outlinks on Wikipedia and make citations more reliable. As of October 2015, the English Wikipedia alone contains 734k links to web.archive.org.[1] Are there members of the community who can help build tools to get archived pages in appropriate places? If you would like to help, please discuss, annotate this page, and/or email alexis@archive.org.
  • More information is now available at http://blog.archive.org/2013/10/25/fixing-broken-links/. Legoktm is currently not working on the project due to time constraints.
  • Note: «As Yasmin AlNoamany showed in "Who and What Links to the Internet Archive", wikipedia.org is the biggest referrer to the Internet Archive».[1]
  • Readers should not click external links and see 404s.
  • All references should include a permanent link to avoid content drift.

Wayback API

To this end, we developed a new Wayback Availability API that answers whether a given URL is archived and currently accessible in the Wayback Machine. The API also has a timestamp option that returns the closest good capture to that date. For example,

GET http://archive.org/wayback/available?url=example.com

might return

{
    "archived_snapshots": {
        "closest": {
            "available": true,
            "url": "http://web.archive.org/web/20130919044612/http://example.com/",
            "timestamp": "20130919044612",
            "status": "200"
        }
    }
}

Please visit the API documentation page for details.
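
As a rough illustration, the following Python sketch (standard library only) queries the Availability API for a URL and an optional target date, then pulls the closest snapshot out of a response like the one above. The endpoint and the url/timestamp parameters are the ones described in this section; the function name and the simple handling around it are made up for the example.

import json
import urllib.parse
import urllib.request

WAYBACK_AVAILABLE = "http://archive.org/wayback/available"

def closest_snapshot(url, timestamp=None):
    """Return the closest archived snapshot URL for `url`, or None.

    `timestamp` is an optional YYYYMMDDhhmmss string; if given, the API
    returns the good capture closest to that date.
    """
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    query = WAYBACK_AVAILABLE + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(query) as response:
        data = json.load(response)
    closest = data.get("archived_snapshots", {}).get("closest", {})
    if closest.get("available"):
        return closest["url"]
    return None

# Example: the capture of example.com nearest to 19 September 2013.
print(closest_snapshot("example.com", "20130919"))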

IA is crawling Wikimedia outlinks

We are running specialized crawls to make this API more useful for the Wikipedia community:

  • As of 2019, the Internet Archive crawls all external links from all Wikimedia projects as soon as they are reported by EventStream, which includes new external links, citations and embeds (previously it followed the feeds on hundreds of IRC channels).
  • IA has been bulk-crawling external links periodically since 2011/2012. At some points, all links that existed at the time on some wikis, including the English Wikipedia, were archived.

Newly crawled URLs are generally available through the Wayback Machine within a few hours. 87% of the dead links found by the Internet Archive crawler on Wikipedia are archived.
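
For reference, this hedged sketch shows how a client could watch for the same events. It reads the public Wikimedia EventStreams feed, assumed here to be the page-links-change stream on stream.wikimedia.org, and yields newly added external links. The stream name and the field names (added_links, external, link, database, page_title) are assumptions about that schema and should be checked against the EventStreams documentation.

import json
import urllib.request

# Assumed public EventStreams endpoint for link-table changes.
STREAM = "https://stream.wikimedia.org/v2/stream/page-links-change"

def new_external_links():
    # Yield (database, page title, URL) for each newly added external link.
    # Minimal Server-Sent-Events parsing: "data:" lines are buffered until a
    # blank line ends the event, then the buffer is decoded as JSON.
    with urllib.request.urlopen(STREAM) as stream:
        buffer = []
        for raw in stream:
            line = raw.decode("utf-8").rstrip("\r\n")
            if line.startswith("data:"):
                buffer.append(line[len("data:"):].strip())
            elif not line and buffer:
                event = json.loads("".join(buffer))
                buffer = []
                for added in event.get("added_links", []):
                    if added.get("external"):
                        yield (event.get("database"),
                               event.get("page_title"),
                               added.get("link"))

# Example: print the first five newly added external links across all wikis.
for count, (wiki, title, url) in enumerate(new_external_links(), start=1):
    print(wiki, title, url)
    if count >= 5:
        break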

Implementation ideas

What useful tools/services can we develop on top of this? Please help come up with ideas and implementations. For instance:

  1. Create a visual format/style to include an archived link next to an external link. This could be a small icon similar to, and placed next to, the "external link" icon. This is most helpful for links that are often offline or totally dead (see 2.), but even for external links that are not dead it provides a snapshot in time (see 3.).
  2. Run bots to fix broken external links. When an external link is dead, query the Wayback Availability API to discover whether there is a working archived version of the page. If the page is available in the Wayback Machine, either a) rewrite the link to point directly to the archived version, or b) annotate the link to indicate that an archived version is available, per 1. A rough sketch of this check-and-rewrite logic follows this list.
  3. Make citations more time-specific. When someone cites content on the web, they are citing that URL as it exists at that moment in time. Best practice on the English Wikipedia is to include a "retrieved on <date>" field in the cite. It would be useful to update all citations to include an estimated date, guessing "retrieved on [revision-date]" when the editor failed to include it. This lets readers find the version of the page that was cited, even if it changes later on. For new citations, Wayback should have an archived version close to that date/time. For older citations, IA may or may not have one. But if an archived version does exist in the Wayback Machine, we could update the archive-link for that URL to point to the older version.
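
Below is a minimal sketch of the bot logic in idea 2 (it also uses the retrieved-on date from idea 3), assuming the closest_snapshot helper from the Wayback API example above. The liveness check, function names, and example URL are illustrative only; a real bot would also need retries, rate limiting, and the actual wiki edit.

import urllib.error
import urllib.request

def looks_dead(url, timeout=10):
    # Crude liveness check: a HEAD request that errors out (including
    # 4xx/5xx responses, which urlopen raises as HTTPError) counts as dead.
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout):
            return False
    except (urllib.error.URLError, OSError):
        return True

def fix_external_link(url, retrieved=None):
    # Return a replacement archive URL for a dead link, or None to leave
    # the link alone. `retrieved` is the citation's retrieved-on date
    # (YYYYMMDD...), so the chosen snapshot is close to what the editor saw.
    if not looks_dead(url):
        return None
    return closest_snapshot(url, retrieved)  # helper from the API sketch above

print(fix_external_link("http://example.com/gone", retrieved="20130919"))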

See also

  • Scripts
  • English Wikipedia
  • All projects
  • Other archives
  • Proposals