Archived Pages

From MediaWiki.org

Goals

The Internet Archive wants to help fix broken outlinks on Wikipedia and make citations more reliable. Are there members of the community who can help build tools to get archived pages in appropriate places? If you would like to help, please discuss, annotate this page, and/or email alexis@archive.org.

More information is now available at http://blog.archive.org/2013/10/25/fixing-broken-links/. Legoktm is currently not working on the project due to time constraints.

Note: «As Yasmin AlNoamany showed in "Who and What Links to the Internet Archive", wikipedia.org is the biggest referrer to the Internet Archive».[1]

Wayback API

To this end, we developed a new Wayback Availability API that reports whether a given URL is archived and currently accessible in the Wayback Machine. The API also has a timestamp option that returns the closest good capture to a given date. For example,

GET http://archive.org/wayback/available?url=example.com

might return

{
    "archived_snapshots": {
        "closest": {
            "available": true,
            "url": "http://web.archive.org/web/20130919044612/http://example.com/",
            "timestamp": "20130919044612",
            "status": "200"
        }
    }
} 

Please visit the API documentation page for details.
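
The request and response above could be handled from a script roughly as follows. This is a minimal sketch, not an official client: the helper names (`build_query`, `closest_snapshot`, `lookup`) are made up for illustration, the query parameter name `timestamp` and its `YYYYMMDD...` format are assumptions to check against the API documentation, and the empty `archived_snapshots` object for an unarchived URL is likewise an assumption.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the example request on this page.
API_ENDPOINT = "http://archive.org/wayback/available"


def build_query(url, timestamp=None):
    """Build the request URL for the availability API.

    `timestamp` is assumed to be a string of digits such as "20130919";
    the parameter name is an assumption, see the API documentation.
    """
    params = {"url": url}
    if timestamp is not None:
        params["timestamp"] = timestamp
    return API_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response):
    """Return the "closest" snapshot dict from a decoded response,
    or None when no available capture is reported."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest
    return None


def lookup(url, timestamp=None):
    """Query the API over the network and return the closest snapshot or None."""
    with urlopen(build_query(url, timestamp), timeout=10) as resp:
        return closest_snapshot(json.loads(resp.read().decode("utf-8")))


# The sample response shown on this page, used here instead of a live request.
SAMPLE_RESPONSE = {
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20130919044612/http://example.com/",
            "timestamp": "20130919044612",
            "status": "200",
        }
    }
}
```

Keeping URL construction and response parsing separate from the network call makes the parsing logic easy to check against the sample response without contacting the service.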

IA is crawling Wikipedia outlinks

We are running specialized crawls to make this API more useful for the Wikipedia community:

  • IA is crawling all new external links, citations and embeds made on Wikipedia pages within a few hours of their creation or update.
  • IA has been bulk-crawling external links periodically for the past 2 years.

Newly crawled URLs are generally available through the Wayback Machine within a few hours.

Implementation ideas

What useful tools/services can we develop on top of this? Please help come up with ideas and implementations. For instance:

  1. Create a visual format/style to include an archived link next to an external link. This could be a small icon similar to, or next to, the "external link" icon. This is most helpful for links that are often offline or completely dead (see 2.), but it is also useful for live external links, where it provides a snapshot in time (see 3.).
  2. Run bots to fix broken external links. When an external link is dead, query the Wayback Availability API to discover if there is a working archived version of the page. If the page is available in Wayback, either a) rewrite the link to point directly to the archived version, or b) annotate the link to indicate that there is an archived version available, per 1.
  3. Make citations more time-specific. When someone cites content on the web, they are citing that URL as it exists at that moment in time. Best practice on English Wikipedia is to include a "retrieved on <date>" field in the citation. It would be useful to update all citations to include an estimated date, guessing "retrieved on [revision-date]" when the editor failed to include one. This lets readers find the version of the page that was cited, even if it changes later on. For new citations, the Wayback Machine should have an archived version close to that date/time. For older citations, IA may or may not have one. But if an archived version does exist in the Wayback Machine, we could update the archive link for that URL to point to the older version.
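
Ideas 2 and 3 could be combined in a bot along the following lines. This is only a sketch under stated assumptions: `citation_timestamp` and `repair_link` are hypothetical names, the `lookup` argument stands for any callable that queries the Wayback Availability API and returns the "closest" snapshot dict or None, and how a link is judged dead and how the result is written back into wikitext are left out entirely.

```python
from datetime import date


def citation_timestamp(retrieved_on, revision_date):
    """Idea 3: use the editor-supplied retrieval date when present,
    otherwise fall back to the date of the revision that added the citation.
    Returns a YYYYMMDD string suitable for a timestamp lookup."""
    chosen = retrieved_on if retrieved_on is not None else revision_date
    return chosen.strftime("%Y%m%d")


def repair_link(url, is_dead, lookup, timestamp=None, rewrite=False):
    """Idea 2: decide what to do with one external link.

    `lookup` is a hypothetical callable (url, timestamp) -> snapshot dict or None.
    Returns a pair (link_target, archive_url):
      - live link, or no capture found: (url, None)
      - option (a), rewrite=True:       (archived URL, None)
      - option (b), rewrite=False:      (url, archived URL) for annotation
    """
    if not is_dead:
        return url, None
    snapshot = lookup(url, timestamp)
    if snapshot is None:
        return url, None
    if rewrite:
        return snapshot["url"], None
    return url, snapshot["url"]


# Stand-in for a real API call, so the example runs without network access.
def fake_lookup(url, timestamp):
    if url == "http://example.com/":
        return {"url": "http://web.archive.org/web/20130919044612/http://example.com/"}
    return None


ts = citation_timestamp(None, date(2013, 9, 19))
annotated = repair_link("http://example.com/", True, fake_lookup, timestamp=ts)
```

Passing the lookup in as an argument keeps the decision logic independent of the network, so a bot operator can test the rewrite-versus-annotate choice before running it against live pages.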

See also

Scripts
English Wikipedia
All projects
  • w:User:Dispenser/Checklinks - A tool to query, classify, and fix all external links in a page. Includes Wayback Machine integration.