Extension:ArchiveLinks/Project/Design

  • On page save, all external links in the article are retrieved from the parser
    • if a link has already been archived, nothing is done
    • if it has not yet been archived and is not blacklisted, it is added to a queue for a web bot to come by and archive later
  • Some time later, a web bot comes by and attempts to retrieve each queued page
    • if the archival attempt is successful, the copy is saved and displayed on request
    • if the web site is down, the page is re-added to the queue to be checked later; if the page is still down after a certain number of attempts, the link is assumed to be dead and no further attempts are made
    • if the web site is up but the page cannot be archived because of robots.txt rules or nocache/noarchive tags, the site is automatically blacklisted for a certain amount of time (see the robots/noarchive check sketched after this list)
    • if the web site is up but the page comes back as a 404 or a redirect, the attempt is treated as a failure, noted, and that link is blacklisted
  • Add a hook to the parser to display a link to the cached version of every external link on the wiki (possibly with configurable options). Because this is done at parse time, the cache link may point to content that has not yet been archived or whose archival was unsuccessful.
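
The robots/noarchive check mentioned above could look roughly like the sketch below. This is only a minimal illustration, not the extension's actual code: the function name is hypothetical and the robots.txt handling is deliberately naive (a real implementation would need a proper parser and per-user-agent rules).

<?php
// Hypothetical helper: decide whether a fetched page may be archived.
// $url is the page URL, $html is the body that was just downloaded.
function canArchivePage( $url, $html ) {
	// 1. Naive robots.txt check: refuse if the site disallows everything.
	$parts = parse_url( $url );
	$robotsUrl = $parts['scheme'] . '://' . $parts['host'] . '/robots.txt';
	$robots = @file_get_contents( $robotsUrl );
	if ( $robots !== false && preg_match( '/^Disallow:\s*\/\s*$/mi', $robots ) ) {
		return false;
	}

	// 2. Refuse if the page carries <meta name="robots" content="... noarchive/nocache ...">.
	$doc = new DOMDocument();
	@$doc->loadHTML( $html );
	foreach ( $doc->getElementsByTagName( 'meta' ) as $meta ) {
		if ( strtolower( $meta->getAttribute( 'name' ) ) === 'robots'
			&& preg_match( '/noarchive|nocache/i', $meta->getAttribute( 'content' ) )
		) {
			return false;
		}
	}
	return true;
}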

Project Goals

This project aims to achieve the following:

  • Solve the problem of link rot by archiving all external links on page save
  • Sanitize and secure all archived content from any potential attacks
  • Add options for integration with external mirroring services


Rationale

Almost since the web’s inception, link rot has been a major problem. Web-based content comes and goes, sometimes within a matter of hours. This presents a major problem, both for users seeking to access this information and for Wikipedia's core content policy of verifiability. While Wikipedia policy does not require editors to use web citations, they are by far the most popular form of citation because they are easy for readers and editors to access.

To help solve this and ensure adherence to verifiability, I plan to create an archival system over the summer, so that users can access all external links even if the original sites go down. This preemptive archival should effectively solve the problem of link rot, as long as the source site allows caching of its content. The project aims to produce something that "just works" without user input or requests and that integrates seamlessly with existing page parsing and rendering. Such a system will allow users to focus on content creation rather than the distracting technical aspects of archival.

Implementation

The extension will make use of the following hooks:

Hook Name | Used for
ArticleSaveComplete | Adding all links on the page to the queue
LinkerMakeExternalLink | Modifying the HTML of the link to display a link to the cached page
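
A rough sketch of how these hooks might be wired up is shown below. The handler bodies and the ArchiveLinksQueue/ArchiveLinksCache helpers are assumptions for illustration only, and the hook signatures are abbreviated to the parameters actually used here.

<?php
// Hook registration (e.g. in the extension's setup file). The handler
// bodies are placeholder sketches, not the extension's real implementation.
$wgHooks['ArticleSaveComplete'][] = 'ArchiveLinksHooks::onArticleSaveComplete';
$wgHooks['LinkerMakeExternalLink'][] = 'ArchiveLinksHooks::onLinkerMakeExternalLink';

class ArchiveLinksHooks {
	// Queue every external link on the saved page that is not already archived.
	public static function onArticleSaveComplete( &$article, &$user, $text ) {
		// Assumption: the link list is taken from the parser output of the
		// saved revision; how that output is obtained is glossed over here.
		$links = ArchiveLinksQueue::getExternalLinks( $article ); // hypothetical helper
		foreach ( $links as $url ) {
			ArchiveLinksQueue::enqueueIfNeeded( $url, $article->getID() ); // hypothetical helper
		}
		return true;
	}

	// Replace the normal external link HTML with one that also points at the cache.
	public static function onLinkerMakeExternalLink( &$url, &$text, &$link, &$attribs ) {
		$cacheUrl = ArchiveLinksCache::getViewUrl( $url ); // hypothetical helper
		$link = Html::rawElement( 'a', $attribs, $text ) . ' <small>['
			. Html::element( 'a', array( 'href' => $cacheUrl ), 'cache' ) . ']</small>';
		// Returning false tells the Linker to use $link as the final HTML.
		return false;
	}
}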

Diagram of Implementation

[Image: Archive External Links Diagram.png]

Queue Implementation

The queue will be implemented as a new table in the database. On page save, all links from the page that have not already been archived reasonably recently will be added to the queue. When the spider runs, it will run a query to set the in_progress column for that record to 1; this prevents a job from being executed twice by two different threads concurrently. After the archival attempt is complete, the job will be removed from the queue and the result will be logged in the log table. Then, depending on whether the attempt succeeded, the link will either (a sketch of the claim query follows this list):

  • be re-added to the queue for another attempt,
  • be added to the resources table and have the content saved to the filesystem,
  • be auto-blacklisted for a period of time (if the site has had repeated failed archival attempts), or
  • be given up on, with nothing further done (if the link has reached the maximum number of archival attempts).
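
A minimal sketch of how a spider worker might claim and complete a job, assuming MediaWiki's database abstraction layer and the el_archive_queue/el_archive_log tables described below (the ArchiveLinksSpider::archive() helper is hypothetical):

<?php
// Hypothetical spider worker step: claim one job, archive it, log the result.
$dbw = wfGetDB( DB_MASTER );

// Pick the oldest job that is not in progress and whose back-off delay has expired.
$job = $dbw->selectRow(
	'el_archive_queue',
	'*',
	array( 'in_progress' => 0, 'delay_time <= ' . $dbw->addQuotes( time() ) ),
	__METHOD__,
	array( 'ORDER BY' => 'queue_id ASC' )
);

if ( $job ) {
	// Mark the job as in progress so that no other host picks it up concurrently.
	$dbw->update(
		'el_archive_queue',
		array( 'in_progress' => 1 ),
		array( 'queue_id' => $job->queue_id, 'in_progress' => 0 ),
		__METHOD__
	);
	if ( $dbw->affectedRows() === 1 ) {
		$result = ArchiveLinksSpider::archive( $job->queue_url ); // hypothetical helper
		// Remove the job from the queue and record the outcome; re-queueing,
		// auto-blacklisting, or giving up on failure is handled elsewhere (omitted).
		$dbw->delete( 'el_archive_queue', array( 'queue_id' => $job->queue_id ), __METHOD__ );
		$dbw->insert( 'el_archive_log', array(
			'log_result' => $result,
			'log_url' => $job->queue_url,
			'log_time' => time(),
		), __METHOD__ );
	}
}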

Database Tables

The project will add the following database tables (this is subject to change):

Name of Table | Purpose of Table | How long will this data persist?
el_archive_queue | Queue for archival | Until the job runs
el_archive_log | Log of the results of previous archival attempts | This was originally intended to be temporary, but it may contain useful data worth keeping for a while
el_archive_resources | Storage locations of the archived content | As long as the archived content exists on the filesystem
el_archive_blacklist | The name is a bit of a misnomer: at the moment it contains the blacklist, but it will likely also hold the whitelist as well as per-site archival settings | As long as the settings are desired

el_archive_queue

Name of Column | Data Type | Additional Settings | Purpose of Column
queue_id | int(11) | UNSIGNED, AUTO_INCREMENT | This is the key for the table
page_id | int(11) | UNSIGNED | This is the key of the page that the job was inserted from
queue_url | blob | BINARY | This is the URL of the page (a BLOB because a URL could conceivably exceed the size of a TINYBLOB)
delay_time | int(11) | UNSIGNED | This is the Unix time (from PHP’s time() function) before which the URL is not to be archived, for use in exponential back-off (see the sketch after this table)
insertion_time | int(11) | UNSIGNED | This is the Unix timestamp (from PHP’s time() function) at which the archival request was made
in_progress | tinyint(1) | UNSIGNED | This column exists to prevent two different hosts from working on the same job at the same time (0 for not in progress, 1 for in progress)
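
The delay_time column supports exponential back-off between archival attempts. Below is a minimal sketch of how that delay might be computed when a failed link is re-added to the queue; the base delay and the way previous attempts are counted (for example from matching rows in el_archive_log) are assumptions.

<?php
// Hypothetical back-off computation used when re-queueing a failed attempt.
function computeDelayTime( $previousAttempts, $baseDelaySeconds = 3600 ) {
	// 1 h, 2 h, 4 h, 8 h, ... after successive failures.
	return time() + $baseDelaySeconds * pow( 2, $previousAttempts );
}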

el_archive_log

Name of Column | Data Type | Additional Settings | Purpose of Column
log_id | int(11) | UNSIGNED, AUTO_INCREMENT | This is the key for the table
log_result | tinyint(1) | UNSIGNED | This is the result code of the archival attempt. It is currently uncertain how many result codes there will be
log_url | blob | BINARY | This is the URL of the page (a BLOB because a URL could conceivably exceed the size of a TINYBLOB)
log_time | int(11) | UNSIGNED | This is the Unix time (from PHP’s time() function) at which the archival attempt was made

el_archive_resources

Name of Column | Data Type | Additional Settings | Purpose of Column
resource_id | int(11) | UNSIGNED, AUTO_INCREMENT | This is the key for the table
el_id | int(11) | UNSIGNED | This is the key of the corresponding row in the externallinks table
resource_url | blob | BINARY | This is the URL of the page (a BLOB because a URL could conceivably exceed the size of a TINYBLOB)
resource_location | text | | Where the resource can be found on the filesystem. This is a TEXT instead of a BLOB because the file name should not need full Unicode support; a TINYTEXT would probably work, but a larger type leaves room for long hash values (such as SHA-512) or a very long file path
archival_time | int(11) | UNSIGNED | This is the Unix timestamp (from PHP’s time() function) of when the content was archived

el_archive_blacklist

Name of Column | Data Type | Additional Settings | Purpose of Column
blacklist_id | int(11) | UNSIGNED, AUTO_INCREMENT | This is the key for the table
bl_type | tinyint(1) | UNSIGNED | Is this a blacklist, whitelist, setting, or autoblacklist entry?
bl_url | blob | BINARY | This is the URL of the page (a BLOB because a URL could conceivably exceed the size of a TINYBLOB)
bl_expiry | int(11) | UNSIGNED | This is the Unix time (from PHP’s time() function) at which this blacklist entry will expire
bl_reason | varchar(255) | | If the entry was added manually, this allows a short explanation of why
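
Before a link is queued or archived, this table would be consulted. Below is a minimal lookup sketch, assuming MediaWiki's database layer and a simplistic exact match on bl_url (the real matching rules and the bl_type values are not specified here).

<?php
// Hypothetical check: is this URL currently blacklisted?
function isUrlBlacklisted( $url ) {
	$dbr = wfGetDB( DB_SLAVE );
	$row = $dbr->selectRow(
		'el_archive_blacklist',
		'bl_expiry',
		array(
			'bl_url' => $url,                          // simplistic exact match
			'bl_expiry > ' . $dbr->addQuotes( time() ) // entry has not yet expired
			// a real query would also filter on bl_type (blacklist vs. whitelist entries)
		),
		__METHOD__
	);
	return $row !== false;
}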


Quantity of Data

Links Per Day

A rough estimate gives around 15,000 links per day currently being added or re-added to the English Wikipedia. Since the extension queues all unarchived links on page save, this number would likely be much higher right after the initial installation, until the majority of existing links had been archived. It is also important to note that this figure is based on diffs, so if links are removed (for example by page-blanking vandalism) and that edit is then reverted, the links are counted as new even though they existed on the page before the vandalism. The figure also includes duplicates where the same link is used on multiple pages. Because of these problems, the number of genuinely new unique external links added to the English Wikipedia that are not subsequently reverted is much lower. A more accurate number could probably be found by taking two database dumps, counting the total number of links in each, and dividing the difference by the number of days between the dumps to get an average growth rate (a sketch of this calculation follows). This would, however, likely understate the rate of growth of the file store, because the extension will not immediately delete an archived copy even if the link is reverted.
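
As a sketch of that calculation (the numbers here are placeholders, not real measurements):

<?php
// Placeholder values only; real figures would come from counting rows in the
// externallinks table of two database dumps.
$linksInOldDump = 0;    // total external links in the older dump
$linksInNewDump = 0;    // total external links in the newer dump
$daysBetweenDumps = 1;  // days between the two dumps
$averageNewLinksPerDay = ( $linksInNewDump - $linksInOldDump ) / $daysBetweenDumps;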

In Database

The only data that would need to be stored permanently is the resources table. This should be a fairly trivial amount of data per record (under 1 KB).

On File System

An average web page with images should be around 376 KB uncompressed, according to metrics published by Google last year. [1] At the current time there are around 17.5 million external links on the English Wikipedia, including duplicates; counting only unique external links, the total is closer to 13.5-14 million. Assuming each page is stored only once, storing every page would require around 6.2 TB of disk space without compression, or 5.1 TB with compression. It is important to note that this number is based on archiving every page object, including JavaScript. With HTML and images only, or a more restrictive archival setting, the space requirements would be significantly lower (probably under 1 TB for text-only archival).
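
As a rough back-of-the-envelope check of the figures above (treating 376 KB as 376 KiB and using the total link count of about 17.5 million), the uncompressed requirement comes out in the same range as the estimate quoted:

<?php
// Back-of-the-envelope storage estimate using the figures quoted above.
$avgPageBytes = 376 * 1024;  // ~376 KB average page weight, with images
$linkCount    = 17500000;    // ~17.5 million external links (including duplicates)
$totalBytes   = $avgPageBytes * $linkCount;
echo round( $totalBytes / pow( 1024, 4 ), 1 ) . " TiB uncompressed\n"; // ~6.1 TiB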

Number of Hosts Required

One host will most likely be more than enough for archival. For storage, a file store with enough capacity to hold all of the archived content will be needed.

What is Stored?

The goal of archival should be to match the experience a user would have had in their browser if they had visited the web site at the time it was archived, while still remaining secure and not using excessive amounts of disk space. Since each wiki will likely have different standards for the risks it is willing to take and how much disk space is available, exactly what is archived and how much is stripped out will be a configurable option in LocalSettings.php. The default for the extension will be HTML with images only, with support for HTML-only, limited-HTML, and text-only archival of pages. Going in the other direction (less secure), the extension will also include the option to download everything the user would have seen at the time of archival (except most executable file types, such as DLLs and EXEs). Anything beyond HTML with images will, however, only be partially supported.

In addition to wiki-wide archival settings, the extension will also aim to support modifying archival settings on a link-by-link basis, in order to archive more or fewer objects than the wiki-wide setting allows. This would let users with a particular user right (such as administrators) enable more in-depth archival for a few trusted sites while keeping the wiki as a whole relatively secure, and would likely be useful if the wiki-wide setting were, say, text-only archival.
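
A hypothetical LocalSettings.php sketch of what such configuration might look like; these setting names are purely illustrative and are not necessarily the extension's actual options:

<?php
// LocalSettings.php - illustrative configuration sketch only.
$wgArchiveLinksSettings = array(
	// Wiki-wide archival depth: 'text', 'limited-html', 'html',
	// 'html+images' (the default described above), or 'everything'.
	'archive_depth' => 'html+images',

	// Maximum number of archival attempts before a link is given up on.
	'max_attempts' => 5,

	// How long (in seconds) an automatically blacklisted site stays blacklisted.
	'autoblacklist_expiry' => 30 * 24 * 60 * 60,
);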

Security Concerns

Retrieving offsite content presents significant security concerns. To mitigate this, we will try to archive only objects that are unlikely to present major avenues of attack. This will unfortunately mean excluding much of the rich content on the web, such as Flash objects, JavaScript, and document files such as PDFs (which may instead be converted to HTML) and Word documents. One of the features that will be looked into is integration with blacklists such as stopbadware.org and with anti-virus software. As an additional precaution, the HTML will be sanitized, and any links in src or href attributes that point outside the cache server will be removed; this should prevent the majority of CSRF and XSS vulnerabilities. Security will remain a major priority, but no system is perfect, and it is important to realize that the cache will most likely contain malicious content at some point, especially if the more permissive archival settings are used.
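
Below is a minimal sketch of the sanitization step described above (an assumption, not the extension's actual code): strip src and href attributes that point anywhere other than the cache server, and drop script elements entirely.

<?php
// Remove off-cache src/href references and scripts from archived HTML.
function stripExternalReferences( $html, $cacheHost ) {
	$doc = new DOMDocument();
	@$doc->loadHTML( $html );
	$xpath = new DOMXPath( $doc );

	foreach ( $xpath->query( '//*[@src] | //*[@href]' ) as $node ) {
		foreach ( array( 'src', 'href' ) as $attr ) {
			if ( !$node->hasAttribute( $attr ) ) {
				continue;
			}
			$host = parse_url( $node->getAttribute( $attr ), PHP_URL_HOST );
			// Absolute URLs pointing off the cache server are removed;
			// relative URLs (no host) are left for separate handling.
			if ( $host !== null && $host !== $cacheHost ) {
				$node->removeAttribute( $attr );
			}
		}
	}
	// Scripts are excluded from archival entirely, so drop them here as well.
	foreach ( iterator_to_array( $doc->getElementsByTagName( 'script' ) ) as $script ) {
		$script->parentNode->removeChild( $script );
	}
	return $doc->saveHTML();
}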

References