Archived Pages



  • The Internet Archive wants to help fix broken outlinks on Wikipedia and make citations more reliable. As of October 2015, the English Wikipedia alone contains 734k links to[1]. Are there members of the community who can help build tools to get archived pages into the appropriate places? If you would like to help, please discuss, annotate this page, and/or email
  • Readers should not click external links and see 404s.

Wayback API

To this end, we developed a new Wayback Availability API that answers whether a given URL is archived and currently accessible in the Wayback Machine. The API also has a timestamp option that returns the closest good capture to that date. For example, a request for a URL and timestamp might return

    "archived_snapshots": {
        "closest": {
            "available": true,
            "url": "",
            "timestamp": "20130919044612",
            "status": "200"

Please visit the API documentation page for details.
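As a sketch of how a tool might call this API: the endpoint and the JSON shape follow the example above, while the helper names (`build_request_url`, `parse_closest`) are ours for illustration, not part of the API.

```python
# Build an Availability API request and pick out the closest snapshot.
# The endpoint and response shape match the example above; the helper
# names are illustrative, not part of the API itself.
import urllib.parse

API = "https://archive.org/wayback/available"

def build_request_url(url, timestamp=None):
    """Request URL for the Availability API; timestamp is YYYYMMDDhhmmss."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return API + "?" + urllib.parse.urlencode(params)

def parse_closest(response_json):
    """Return the closest available snapshot dict from a response, or None."""
    closest = response_json.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest
    return None
```

Fetching `build_request_url("example.com", "20130919")` and passing the decoded JSON to `parse_closest` would yield the `"closest"` object shown above.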

IA is crawling Wikipedia outlinks

We are running specialized crawls to make this API more useful for the Wikipedia community:

  • IA is crawling all new external links, citations, and embeds made on Wikipedia pages in dozens of languages within a few hours of their creation or update.
  • IA has been bulk-crawling external links periodically since 2011/2012.

Newly crawled URLs are generally available through the Wayback Machine within a few hours. 87% of the dead links found by the Internet Archive crawler on Wikipedia are archived.

Implementation ideas

What useful tools/services can we develop on top of this? Please help come up with ideas and implementations. For instance:

  1. Create a visual format/style to include an archived link next to an external link. This could be a small icon, similar to and placed next to the "external link" icon. This is most helpful for links that are often offline or totally dead (see 2), and even for external links that are not dead, to provide a snapshot in time (see 3).
  2. Run bots to fix broken external links. When an external link is dead, query the Wayback Availability API to discover if there is a working archived version of the page. If the page is available in Wayback, either a) rewrite the link to point directly to the archived version, or b) annotate the link to indicate that there is an archived version available, per 1.
  3. Make citations more time-specific. When someone cites content on the web, they are citing that URL as it exists at that moment in time. Best practice on English Wikipedia is to include a "retrieved on <date>" field in the cite. It would be useful to update all citations to include an estimated date - guessing "retrieved on [revision-date]" when the editor failed to include it. This lets readers find the version of the page that was cited, even if it changes later on. For new citations, Wayback should have an archived version close to that date/time. For older citations, IA may or may not have one. But if an archived version does exist in the Wayback, we could update the archive-link for that URL to the older version.
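Idea 2 above can be sketched in a few lines. This is an illustration under assumptions, not an existing bot: `check` stands in for whatever dead-link test a bot uses, and `wayback_snapshot` wraps the Availability API described earlier.

```python
# Sketch of implementation idea 2: when an external link is dead, ask the
# Wayback Availability API for a working snapshot and point to it instead.
# The dead-link check and the overall flow are illustrative assumptions.
import json
import urllib.parse
import urllib.request

def wayback_snapshot(url, timestamp=""):
    """Query the Availability API; return the closest snapshot URL or None."""
    query = urllib.parse.urlencode({"url": url, "timestamp": timestamp})
    with urllib.request.urlopen("https://archive.org/wayback/available?" + query) as resp:
        data = json.load(resp)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None

def fix_dead_link(url, retrieved="", check=None, lookup=wayback_snapshot):
    """Return a snapshot URL if `url` is dead and one exists, else `url`.

    `check(url) -> bool` reports whether the link is dead (assumed helper);
    `retrieved` is the cite's retrieved-on date (YYYYMMDD), if known, so the
    snapshot closest to the cited version is preferred (idea 3)."""
    if check is not None and check(url):
        snapshot = lookup(url, retrieved)
        if snapshot:
            return snapshot
    return url
```

A bot would then either rewrite the wikitext link to the returned snapshot URL (option a) or, if the snapshot is only to be advertised, annotate the link per idea 1 (option b).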

See also

English Wikipedia
All projects
  • w:User:Dispenser/Checklinks - A tool to query, classify, and fix all external links in a page. Includes Wayback Machine integration.
  • w:de:Wikipedia:Defekte Weblinks/Botmeldung: GiftBot has been running since November 2015 on the German Wikipedia: it identifies broken links and posts notices on article talk pages and user talk pages, but makes no edits in the main namespace as of December 2015.
Other archives