Requests for comment/Caching references

This is a proposal for a web service and storage back end that will provide an archiving service to prevent link rot and will preserve references to dynamic locations whose content changes often. Currently it often becomes impossible to verify data sourced to such locations, because the content differs between the time the citation is created and the time it is reviewed. A further goal of this project is to enable the automated verification and filling in of reference material in citations across WMF wikis.

This is a rough outline/plan. As the program continues, details will be filled in, and we may run into additional complications that were not foreseen.


 * 1) Create a process for archiving URLs that abides by robots.txt.
 * This process will cover both existing URLs used throughout the WMF and URLs submitted on a user-requested basis.
 * These snapshots will be indexed and compressed on disk, using a combination of hashing/accessdate/url information.
 * 2) Create a method for looking up and displaying snapshots to users.
 * 3) Once a snapshot is created, metadata is harvested and stored in a database.
 * Metadata extraction is partly automated and partly requested from the user, through a process that loads the page in question alongside a form for filling in the needed metadata.
 * 4) Information in the metadata database is then used to supplement and fill in citation information.
 * 5) Data that is no longer needed for citations is pruned.
 * Such prunes will have a minimum delay period to ensure that the removal of the citation is in fact legitimate and not a transient situation resulting from blanking or other short-term actions.
 * The delay should be no less than 90 days, but will probably be longer for both technical and logistical reasons.
 * Where the Digital Millennium Copyright Act (DMCA) becomes a factor, tool maintainers will be able to go in and remove the material in question. In the long term there is a desire to broaden the response team and provide a simple method for addressing DMCA cases, possibly as a joint effort with the OTRS team.
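
As a rough illustration of the steps above, the following Python sketch shows one way the robots.txt check, the on-disk snapshot key, and the prune delay might fit together. It is not the proposed implementation: the user agent name, the key layout, and the function names are all assumptions made for this example; only the 90-day minimum and the hashing/accessdate/url combination come from the outline.

```python
import hashlib
from datetime import date, timedelta
from urllib.robotparser import RobotFileParser

# Hypothetical user agent name, chosen for this example only.
USER_AGENT = "ExampleArchiveBot"

# Minimum prune delay stated in the outline; the real value may be longer.
PRUNE_DELAY = timedelta(days=90)


def may_archive(url, robots_lines, user_agent=USER_AGENT):
    """Return True if the given robots.txt lines allow fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


def snapshot_key(url, access_date, content):
    """Build an on-disk key from URL hash, access date, and content hash.

    The layout (url-hash/date/content-hash) is an assumption; the outline
    only says that hashing, accessdate, and url information are combined.
    """
    url_hash = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    content_hash = hashlib.sha256(content).hexdigest()[:16]
    return f"{url_hash}/{access_date.isoformat()}/{content_hash}.html.gz"


def may_prune(last_cited, today, delay=PRUNE_DELAY):
    """Return True once a snapshot has been uncited for at least the delay."""
    return today - last_cited >= delay
```

Keying on both the URL and the content hash means two snapshots of the same URL taken on the same day are stored separately only if the content actually differs, which suits dynamic pages.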

Long term goals are to:
 * Merge the existing tools by Dispenser into this project, as they will become redundant.
 * Provide a process of comparing archived snapshots with current versions of resources to enable the easy identification of modifications.
 * Provide an automated process for identifying and tagging reference sources that cease to operate.
 * Create a process to extract basic informational statements from snapshotted material using advanced data analysis and tools like a lexical parser.
 * Implement a citation/statement verification tool using the informational statements extracted.
 * Create and maintain a tool to provide existing references to currently unreferenced material.
 * Communicate and work with local project work-groups (Wikiprojects) in identifying and addressing issues that are detected within their scope.
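
The snapshot comparison goal in the list above could be approached with a plain text diff. The sketch below uses Python's standard `difflib` module and assumes both versions have already been reduced to text; the function names are hypothetical, and a real tool would probably need to strip markup and ignore boilerplate changes first.

```python
import difflib


def snapshot_diff(archived_text, current_text):
    """Return unified-diff lines between an archived and a current version."""
    return list(
        difflib.unified_diff(
            archived_text.splitlines(),
            current_text.splitlines(),
            fromfile="archived",
            tofile="current",
            lineterm="",
        )
    )


def similarity(archived_text, current_text):
    """Return a 0.0-1.0 similarity ratio between the two versions."""
    return difflib.SequenceMatcher(None, archived_text, current_text).ratio()
```

An empty diff means the resource is unchanged since the snapshot. The similarity ratio could be used to flag only substantial modifications for review, with the cut-off left as a tuning decision.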

To those ends, the hardware requirements are roughly ballparked at this time as follows:
 * A minimum of 30TB of storage for the archive data and associated metadata.
 * This volume is not required immediately; however, given the expected scope of the project, the storage requirements will grow rather rapidly.
 * An initial few TBs should be sufficient to enable
 * I would suggest at least doubling this, so that there is sufficient storage for at least one backup copy of the data in case of hardware failure.


 * 1 webserver (as the popularity of this tool increases, odds are we will need more)
 * The working and processing threads can probably be handled by the existing tools labs infrastructure.
 * 1 database server (depending on workload, additional servers may need to be deployed in the long term, but current projected loads should be within the capacity of one server)

Questions
Why would we not rely on an independent archiving service (such as webcitation), or several archive services, and then just keep a copy of the URL and a link to the archive page? A donation from the WMF to support these services might be more effective than creating our own mirror. Fæ (talk) 17:12, 4 July 2014 (UTC)
 * Also on this, currently Internet Archive supports caching on demand: web.archive.org/save/ . Lugusto (talk) 17:22, 4 July 2014 (UTC)
 * I second Fæ's comment; they already have the infrastructure in place. Or is this adding some value beyond their service that outweighs all the obvious legal risks? --Ainali (talk) 17:28, 4 July 2014 (UTC)


 * We want to maintain a copy independent of webcitation (which has an unstable future) and archive.org (which deletes prior snapshots if the robots.txt changes, for example when a domain is sold and the new owner puts up a generic deny-all robots.txt); both of those prove problematic for long-term verification. You also run into issues when the original HTML isn't available: both IA and webcite inject custom HTML into what they serve. Also, IA can take prolonged periods to publish content once it's archived. What Dispenser is trying to do will need to cache the source pages; whether or not it's published doesn't make much difference for him. However, providing an archiving service in the process that does not rely on unstable third parties would just be a bonus for us. Betacommand (talk) 17:34, 4 July 2014 (UTC)
 * I don't think that's how the IA works. Nothing is ever deleted; just removed from public view.  We could set up a way for editors or researchers to access their private cache archive.  IA is actively working on their caching of Wikipedia pages, and has been in steady contact with Wikipedians to make this service work better specifically for our use case.  I am certain the person working on their project would be glad to collaborate with Dispenser or others who are interested in reflinks style projects.  When it comes to hundred-year archiving of copyrighted material such as a webcache, IA is currently a more reliable host than we are (it's in their mission; it's not in ours; unless we revise or clarify our mission future Wikimedians might decide to delete that material).  Sj (talk) 18:27, 4 July 2014 (UTC)