Extension:ArchiveLinks/Project/Design


 * On page save, all external links in the article are retrieved from the parser
 * if a link has already been archived, nothing is done
 * if it has not yet been archived and is not blacklisted, it is added to a queue for a web bot to come by and archive it
 * Sometime later a web bot comes by and attempts to retrieve the web page
 * if the archival attempt is successful, the copy is saved and displayed on request
 * if the web site is down, the page is re-added to the queue to be checked later; if the page is still down after a certain number of attempts, the link is assumed to be dead and we stop trying
 * if the web site is up but the link can't be archived due to robots.txt, nocache, or noarchive tags, the site is automatically blacklisted for a certain amount of time
 * if the web site is up but the page comes back as a 404 or a redirect, treat it as a failed attempt, note it, and blacklist that link
 * Add a hook to the parser to display a link to the cached version of the page for every external link on the wiki (or possibly make this configurable). Because this is done at parse time, the link may point to a page that has not yet been archived or where the archival attempt was unsuccessful
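The spider's decision logic above could be sketched as follows. This is only an illustration of the described behavior, not the extension's actual code; the function name, parameters, and the `MAX_ATTEMPTS` value are assumptions.

```python
MAX_ATTEMPTS = 3  # assumed retry limit before a link is declared dead

def handle_fetch(status, attempts, blocked_by_robots):
    """Decide what to do with a link after one archival attempt.

    status: HTTP status code, or None if the site was unreachable.
    attempts: number of attempts made so far, including this one.
    blocked_by_robots: True if robots.txt or nocache/noarchive tags
    forbid archiving.
    """
    if status is None:                    # site is down
        if attempts >= MAX_ATTEMPTS:
            return "give_up"              # assume the link is dead
        return "requeue"                  # check again later
    if blocked_by_robots:
        return "blacklist_site"           # site is up but forbids caching
    if status == 404 or 300 <= status < 400:
        return "blacklist_link"           # failed attempt: note and blacklist
    return "save_archive"                 # success: store and display on request
```

Each returned action corresponds to one of the bullet points above.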

Queue Implementation
The queue will be implemented as a new table in the database. On page save, all links from a page that have not been archived within a reasonable amount of time will be added to the queue. When the spider runs, it will execute a query that sets the in_progress column for that record to 1; this prevents a job from being executed twice by two different threads concurrently. After the archival attempt is complete, the job will be removed from the queue and the result will be logged in the logs table. Then, depending on whether the attempt was successful, the link will:
 * be re-added to the queue for another attempt
 * be added to the resources table and have the content saved in the filesystem
 * be autoblacklisted for a period of time (if the site has had repeated failed archival attempts)
 * or just be given up on, with nothing further done (if the link has reached the maximum number of archival attempts)
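The claim step, where the in_progress column guards against two threads taking the same job, might look like the following sketch. SQLite stands in for MediaWiki's database layer purely for brevity, and the table and column names are assumptions based on the description above.

```python
import sqlite3

# Minimal stand-in for the queue table described above.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE queue (
    id INTEGER PRIMARY KEY,
    url TEXT NOT NULL,
    in_progress INTEGER NOT NULL DEFAULT 0
)""")
db.execute("INSERT INTO queue (url) VALUES ('http://example.com/page')")

def claim_job(db):
    """Claim one pending job for this spider thread, or return None."""
    row = db.execute(
        "SELECT id, url FROM queue WHERE in_progress = 0 LIMIT 1"
    ).fetchone()
    if row is None:
        return None                       # queue is empty
    job_id, url = row
    # The "AND in_progress = 0" guard makes the claim safe: if another
    # thread already set the flag, this UPDATE matches zero rows.
    cur = db.execute(
        "UPDATE queue SET in_progress = 1 WHERE id = ? AND in_progress = 0",
        (job_id,),
    )
    if cur.rowcount == 0:
        return None                       # lost the race; caller can retry
    return job_id, url
```

Once the archival attempt finishes, the row would be deleted from the queue and the outcome written to the logs table, as described above.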

Database Tables
The project will add the following database tables (this is subject to change):