Extension:ArchiveLinks/Project/Design


 * On page save, all external links in the article are retrieved from the parser
 * If they have already been archived, nothing is done
 * If they have not yet been archived and are not blacklisted, they are added to a queue for a web bot to come by and archive them
 * Some time later, a web bot comes by and attempts to retrieve the web page
 * If the archival is successful, the copy is saved and displayed on request
 * If the web site is down, the page is re-added to the queue to be checked later; if the page is still down after a certain number of attempts, the link is assumed to be dead and we stop trying
 * If the web site is up but the page can't be archived due to robots.txt, nocache, or noarchive tags, the site is automatically blacklisted for a certain amount of time
 * If the web site is up but the page comes back as a 404 or a redirect, treat it as a failed attempt, note it, and blacklist that link
 * Add a hook to the parser to display a link to the cached version of every external link on the wiki (possibly with configurable options). Because this happens at parse time, the cache link may point to pages that have not yet been archived or whose archival was unsuccessful
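The save-time decision in the steps above can be sketched as follows. This is a hypothetical illustration, not the extension's actual code; the class and its method names are invented for the example, with a toy in-memory queue standing in for the database table.

```python
class ArchiveQueue:
    """Toy in-memory stand-in for the extension's queue table (illustrative only)."""
    def __init__(self):
        self.archived = set()    # links that already have a saved copy
        self.blacklist = set()   # sites that disallow caching
        self.queue = []          # links waiting for the web bot

    def enqueue_links(self, links):
        """On page save, queue every link that still needs archiving."""
        for url in links:
            if url in self.archived:
                continue                 # already archived: nothing to do
            if url in self.blacklist:
                continue                 # blacklisted: skip it
            if url not in self.queue:
                self.queue.append(url)   # a web bot will archive it later
```

The actual extension would run these checks against database tables rather than in-memory sets, but the control flow is the same.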

Project Goals
This project aims to achieve the following:
 * Solve the problem of link rot by archiving all external links on page save
 * Sanitize and secure all archived content from any potential attacks
 * Add options for integration with external mirroring services

Rationale
Almost since the web’s inception, link rot has been a major problem. Web-based content comes and goes, sometimes within a matter of hours. This presents a major problem, both for users seeking to access this information and for Wikipedia's core content policy of verifiability. While Wikipedia policy does not require editors to use web citations, they are by far the most popular form of citation, because they are easy for readers and editors to access.

To help solve this and ensure adherence to verifiability, I plan to create an archival system over the summer, so users can access all external links even if the original sites go down. This preemptive archival should effectively solve the problem of link rot, as long as the source site allows caching of its content. The project aims to build something that "just works" without user input or requests, and to integrate seamlessly with existing page parsing and rendering. Such a system will allow users to focus on content creation, rather than the distracting technical aspects of archival.

Implementation
The extension will make use of the following Hooks:

Queue Implementation
The queue will be implemented as a new table in the database. On page save, all links from the page that have not already been archived recently will be added to the queue. When the spider runs, it will execute a query setting the in_progress column for that record to 1; this prevents a job from being executed twice by two different threads concurrently. After the archival attempt is complete, the job will be removed from the queue and the result logged in the logs table. Then, depending on whether the attempt succeeded, the link will either:
 * Be re-added to the queue for another attempt
 * Be added to the resources table and have its content saved in the filesystem
 * Be auto-blacklisted for a period of time (if the site has had repeated failed archival attempts)
 * Be given up on, with nothing further done (if the link has reached the maximum number of archival attempts)
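A minimal sketch of the claim step, assuming a SQL queue table with the in_progress column described above. The schema and column names here are illustrative (using SQLite so the example is self-contained), not the extension's actual schema.

```python
import sqlite3

# Illustrative stand-in for the extension's queue table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE queue ("
    " id INTEGER PRIMARY KEY,"
    " url TEXT,"
    " in_progress INTEGER NOT NULL DEFAULT 0,"
    " attempts INTEGER NOT NULL DEFAULT 0)")
conn.execute("INSERT INTO queue (url) VALUES ('http://example.com/page')")

def claim_job(conn):
    """Claim one queued job so no two spider threads archive the same link."""
    row = conn.execute(
        "SELECT id, url FROM queue WHERE in_progress = 0 LIMIT 1").fetchone()
    if row is None:
        return None                       # queue is empty
    job_id, url = row
    # The "AND in_progress = 0" guard makes the claim safe under
    # concurrency: if two threads race, only one UPDATE matches a row
    # and the loser simply moves on to the next job.
    cur = conn.execute(
        "UPDATE queue SET in_progress = 1 WHERE id = ? AND in_progress = 0",
        (job_id,))
    return (job_id, url) if cur.rowcount == 1 else None
```

Calling `claim_job` a second time returns nothing, because the only record has already been marked in progress.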

Database Tables
The project will add the following database tables (this is subject to change):

Links Per Day
A rough estimate gives around 15,000 links per day currently being added or re-added to the English Wikipedia. Since the extension queues all unarchived links on page save, this number would likely be much higher after the initial installation, until the majority of existing links had been archived. It is also important to note that this number is based on diffs: if links are removed by, say, page-blanking vandalism and that edit is then reverted, the links are counted as new, even though they existed on the page prior to the vandalism and aren’t really new. The number also includes duplicates where the same link is used on multiple pages. Because of these issues, the number of genuinely new unique external links that get added to the English Wikipedia and don’t get reverted is much lower. A more accurate figure could probably be found by taking the total number of links in two database dumps and dividing the difference by the number of days between them, giving an average growth rate. This would, however, likely be less than the rate of growth of the file store, because the extension will not immediately delete an archived copy even if the link is reverted.
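The dump-based estimate could be computed as follows. The dump link counts below are made-up placeholders for illustration, not measured values.

```python
# Hypothetical link counts for two dumps taken 30 days apart;
# these numbers are invented for the example, not measurements.
links_in_older_dump = 13_000_000
links_in_newer_dump = 13_450_000
days_between_dumps = 30

avg_new_links_per_day = (
    links_in_newer_dump - links_in_older_dump) / days_between_dumps
print(avg_new_links_per_day)   # 15000.0 new links per day on average
```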

In Database
The only table that would need to be stored permanently is the resource table. Each record should be a fairly trivial amount of data (under 1 KB).

On File System
An average web page with images should be around 376 KB uncompressed, according to metrics published by Google last year. At the current time there are around 17.5 million external links on the English Wikipedia, including duplicates. Counting only unique external links, the total would be closer to 13.5–14 million. Assuming each page is stored only once, storing every page would require around 6.2 TB of disk space without compression, or 5.1 TB with compression. It’s important to note that this number is based on archiving every page object, including JavaScript. Consequently, archiving only HTML and images, or using a more restrictive archival setting, would require significantly less space (probably under 1 TB for text-only archival).
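The arithmetic behind the uncompressed estimate, using the figures quoted above; the small gap from the 6.2 TB figure comes down to rounding and KB-vs-KiB conventions.

```python
# Figures from the text: 376 KB average page size and ~17.5 million
# external links (including duplicates) on the English Wikipedia.
avg_page_bytes = 376 * 1024          # treating KB as KiB
total_links = 17_500_000
total_tib = avg_page_bytes * total_links / 1024**4
print(f"{total_tib:.1f} TiB")        # roughly 6.1 TiB, uncompressed
```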

Number of Hosts Required
One host will most likely be more than enough for archival; for storage, a file store with enough capacity to hold all the archived content will be needed.

What is Stored?
The goal of archival should be to match the experience a user would have had in their browser had they visited the website at the time it was archived, while still remaining secure and not using excessive amounts of disk space. Since each wiki will likely have different standards for acceptable risk and available disk space, exactly what is archived and how much is stripped out will be a configurable option in LocalSettings.php. The supported settings, roughly from most to least restrictive, will be:
 * Text-only
 * Limited HTML
 * HTML only
 * HTML with images (the default)
 * Everything the user would have seen at the time of archival, except most executable file types such as DLLs and EXEs (less secure, and anything beyond HTML with images will only be partially supported)
In addition to wiki-wide archival settings, the extension will also aim to support modifying archival settings on a link-by-link basis, allowing archival of more or fewer objects than the wiki-wide setting. This would let users with an appropriate user right (such as administrators) enable more in-depth archival for a few trusted sites while keeping the wiki as a whole relatively secure, and would likely be useful if the wiki-wide setting were, say, text-only archival.
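A rough sketch of how per-level filtering of fetched resources might look. The level names mirror the options above, but the MIME-type groupings and the function itself are illustrative assumptions, not the extension's design.

```python
# Which sub-resource MIME types each archival level keeps; the
# groupings here are assumptions for illustration.
ARCHIVAL_LEVELS = {
    "text-only": set(),
    "limited-html": set(),
    "html-only": set(),
    "html-with-images": {"image/png", "image/jpeg", "image/gif"},
    "everything": None,            # None means: archive all resource types
}
BLOCKED_ALWAYS = {"application/x-msdownload"}   # e.g. .exe/.dll files

def should_archive(mime_type, level="html-with-images"):
    """Decide whether a fetched sub-resource should be stored."""
    if mime_type in BLOCKED_ALWAYS:
        return False               # never stored, even at "everything"
    allowed = ARCHIVAL_LEVELS[level]
    return True if allowed is None else mime_type in allowed
```

A per-link override would simply pass a different `level` for that link than the wiki-wide default.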

Security Concerns
Retrieving offsite content presents significant security concerns. To mitigate them, we will only archive objects that are unlikely to present major attack vectors. This will unfortunately mean excluding much of the rich content on the web, such as Flash objects, JavaScript, and document files such as PDFs (which may instead be converted to HTML) and Word documents. One feature that will be investigated is integration with blacklists such as stopbadware.org and with anti-virus software. As an additional precaution, the HTML will be sanitized, and any links in src or href attributes that point outside the cache server will be removed. This should prevent the majority of CSRF and XSS vulnerabilities. Security will remain a major priority, but no system is perfect, and it is important to realize that the cache will most likely contain malicious content at some point, especially if the more permissive archival settings are used.
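The src/href stripping step might look roughly like this sketch built on Python's html.parser; the cache hostname and all names here are hypothetical, and a production sanitizer would need to handle far more cases (inline CSS, event-handler attributes, malformed markup).

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

CACHE_HOST = "cache.example.org"   # hypothetical cache-server hostname

class OffsiteLinkStripper(HTMLParser):
    """Rebuild HTML, dropping src/href attributes that point anywhere
    other than the cache server."""
    def __init__(self):
        super().__init__()
        self.out = []

    def _clean(self, attrs):
        kept = []
        for name, value in attrs:
            if name in ("src", "href") and value:
                host = urlparse(value).netloc
                if host and host != CACHE_HOST:
                    continue       # offsite reference: strip it out
            kept.append((name, value))
        return kept

    def handle_starttag(self, tag, attrs):
        parts = "".join(
            f' {n}' if v is None else f' {n}="{v}"'
            for n, v in self._clean(attrs))
        self.out.append(f"<{tag}{parts}>")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

def strip_offsite(html):
    """Return the HTML with all offsite src/href attributes removed."""
    parser = OffsiteLinkStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Relative URLs are left alone in this sketch, since in the archive they would resolve against the cache server rather than the original site.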