Requests for comment/Caching references

This is a proposal for a webservice/storage back end that will provide an archiving service to prevent link rot and will preserve references to dynamic locations whose content changes often. Currently it is often impossible to verify data sourced to such locations, because the content differs between the time the citation is created and the time it is reviewed. A further goal of this project is to enable the automated verification and filling in of reference material in citations across WMF wikis.

This is a rough outline/plan. As the program continues, details will be filled in, and we may run into additional complications that were not foreseen.


 * 1) Create a process for archiving URLs which will abide by robots.txt.
 * This process will cover both existing URLs used throughout the WMF wikis and URLs archived on user request.
 * These snapshots will be indexed and compressed on disk, using a combination of hashing/accessdate/url information.
 * 2) Create a method for looking up and displaying snapshots to users.
 * 3) Once a snapshot is created, metadata is harvested and dumped into a database.
 * Metadata extraction is automated to a degree and can also be requested by a user, through a process that loads the page in question alongside a form for filling in the needed metadata.
 * 4) Information in the metadata database is then used to supplement and fill in citation information.
 * 5) Data that is no longer needed for citations is pruned.
 * Such prunes will have a minimum delay period to ensure that the removal of the citation is in fact legitimate and not a transient situation resulting from blanking or other short-term actions.
 * The delay should be no less than 90 days but will probably be longer for both technical and logistical reasons.
 * Where the Digital Millennium Copyright Act (DMCA) becomes a factor, tool maintainers will be able to go in and remove the material in question. In the long term there is a desire to broaden the response team and provide a simple method for addressing DMCA cases, possibly as a joint effort with the OTRS team.
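As a rough illustration of the "hashing/accessdate/url" indexing mentioned in the outline, the following minimal sketch builds a lookup key for a snapshot and compresses it for storage. The key layout, the choice of SHA-1/SHA-256 and gzip, and the function names `snapshot_key`, `pack_snapshot` and `unpack_snapshot` are all assumptions made for the example; the proposal does not specify any of them.

```python
import gzip
import hashlib
from datetime import datetime, timezone


def snapshot_key(url: str, access_date: datetime, body: bytes) -> str:
    """Build an index key from the URL, the access date and the content.

    Hypothetical layout: <url hash>/<UTC timestamp>/<content hash>.
    """
    # Short, fixed-length handle for the URL itself.
    url_hash = hashlib.sha1(url.encode("utf-8")).hexdigest()[:16]
    # Access date normalised to UTC, sortable as a plain string.
    stamp = access_date.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    # Content hash: identical bodies fetched on different dates share it.
    body_hash = hashlib.sha256(body).hexdigest()
    return f"{url_hash}/{stamp}/{body_hash}"


def pack_snapshot(body: bytes) -> bytes:
    """Compress a snapshot body before writing it to disk."""
    return gzip.compress(body)


def unpack_snapshot(blob: bytes) -> bytes:
    """Restore the original bytes of a stored snapshot."""
    return gzip.decompress(blob)
```

Because the content hash is part of the key, two fetches of the same URL that return identical bytes can be recognised without decompressing either copy.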

Long term goals are to:
 * Combine the existing tools by Dispenser into this as they will become redundant.
 * Provide a process of comparing archived snapshots with current versions of resources to enable the easy identification of modifications.
 * Provide an automated process for identifying and tagging reference material that ceases to maintain operation.
 * Create a process to extract basic informational statements from snapshotted material using advanced data analysis and tools like a lexical parser.
 * To implement a citation/statement verification tool using the informational statements extracted.
 * To create and maintain a tool to provide existing references to currently unreferenced material.
 * To communicate and work with local project work-groups (Wikiprojects) in identifying and addressing issues that are detected within their scope.

To those ends, the hardware requirements, roughly ballparked at this time, are:
 * A minimum of 30TB of storage for the archive data and associated metadata.
 * This volume is not required immediately, however given the scope of the expected project the storage requirements will grow rather rapidly.
 * An initial few TBs should be sufficient to enable
 * Would suggest at least double this to ensure that there is sufficient storage to provide at least one backup copy of the data in case of hardware failure.


 * 1 webserver (as the popularity of this tool increases odds are we may need more)
 * The working and processing threads can probably be handled by the existing tools labs infrastructure.
 * 1 database server (depending on work load additional servers may need to be deployed in the long term but current projected loads should be within the capacity of the one server)

Questions

 * Why would we not rely on an independent archiving service (such as webcitation), or several archive services, and then just keep a copy of the URL and a link to the archive page? A donation from the WMF to support these services might be more effective than creating our own mirror. Fæ (talk) 17:12, 4 July 2014 (UTC)
 * Also on this, currently Internet Archive supports caching on demand: web.archive.org/save/ . Lugusto (talk) 17:22, 4 July 2014 (UTC)
 * I second Fæ's comment; they already have the infrastructure in place. Or is this adding some value beyond their service that outweighs all obvious legal risks? --Ainali (talk) 17:28, 4 July 2014 (UTC)
 * I agree, supporting the efforts of mission-aligned entities seems like the best general approach. makes some good points below that should be thought through carefully, and might influence the details of the path pursued -- but they don't strike me as overall deal-breakers. -Pete F (talk) 20:41, 4 July 2014 (UTC)
 * Given the behavior from an average user of IA I find the process quite a pain. Yes, they may already have the infrastructure, but they lack any kind of effective API (or at least documentation, if it exists). I cannot make a query and find all archive dates for a work. I have two options: I can either request the most current snapshot, or I can try to screen scrape the web interface, which is a nightmare of JavaScript. In addition, existing archives can be randomly removed due to changes in a site's robots.txt (yes, the actual content may not be deleted, but as an end user I cannot tell the difference, as I cannot access the needed data). Overall it introduces a significant overhead and headache to rely on a third party. It could be done in theory, but bringing that in house along with the planned additional features would require either direct server access to the IA's system or a massive amount of data being moved between the tool and the IA. I am unsure if the IA could reasonably handle the workload needed to operate this tool. Betacommand (talk) 00:41, 5 July 2014 (UTC)
 * +1 for Ainali --Prolineserver (talk) 21:22, 4 July 2014 (UTC)
 * The goals of the project are to "prevent link rot" and "enable the automated verification and filling in of reference material in citations," not duplicating archive services. I strongly object to the suggestion that robots.txt has any place in the workings of human-directed citation completion or link failure-triggered (or source renaming-triggered etc.) attempts to verify that a rotted link is correctly described on-wiki. The former is clearly human driven, and the latter is just as firmly in service to humans instead of corporate profits no matter how it is triggered. James Salsman (talk) 01:06, 5 July 2014 (UTC)
 * To a degree the script is human prompted; on the other hand, it will use a proactive approach to cache and preload data where possible. This will also enable us to work with local workgroups to identify existing articles whose citations may have limited or incorrect values, and eventually, in the later stages, to have a user write a statement and have the tool return completed citations that support that statement. Abiding by robots.txt is just good manners online when working with a website. Ignoring a robots.txt may also violate some websites' ToS. Abiding by robots.txt enables a robust process while also building in safeguards. Betacommand (talk) 01:23, 5 July 2014 (UTC)
 * Good manners, or the involuntary transfer of the legitimate rights of readers to corporate interests? James Salsman (talk) 01:37, 5 July 2014 (UTC)
 * I consider it good manners. Often things are added to robots.txt for a reason. I know, for example, that the WMF prohibits access by robots to page histories due to the dynamic nature of the material and the resources needed to generate it. Betacommand (talk) 02:07, 5 July 2014 (UTC)
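For reference, the kind of robots.txt check being debated in this thread can be sketched with Python's standard `urllib.robotparser`. The bot name `ExampleCacheBot`, the example rules and the helper `may_archive` are hypothetical; the sketch only shows what "abiding by robots.txt" would mean mechanically, and takes no position on whether the tool should do so.

```python
from urllib.robotparser import RobotFileParser


def may_archive(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules that keep robots away from a dynamic script path.
EXAMPLE_RULES = """\
User-agent: *
Disallow: /w/index.php
"""
```

With these example rules, `may_archive(EXAMPLE_RULES, "ExampleCacheBot", "http://example.org/wiki/Main_Page")` is allowed, while a URL under `/w/index.php` is refused.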


 * The point is to maintain a copy independent of webcitation (which has an unstable future) and of archive.org (which deletes prior snapshots if the robots.txt changes, e.g. when a domain is sold and the new owner throws up a generic deny-all robots.txt); both of those prove problematic for long-term verification. You also run into issues when the original HTML isn't available: both IA and webcite inject custom HTML into what they serve. Also, IA can take prolonged periods to publish content once it's archived. What Dispenser is trying to do will need to cache the source pages; whether or not it's published doesn't make much difference for him. However, providing an archiving service in the process that does not rely on unstable third parties would just be a bonus for us. Betacommand (talk) 17:34, 4 July 2014 (UTC)
 * I don't think that's how the IA works. Nothing is ever deleted; just removed from public view.  We could set up a way for editors or researchers to access their private cache archive.  IA is actively working on their caching of Wikipedia pages, and has been in steady contact with Wikipedians to make this service work better specifically for our use case.  I am certain the person working on their project would be glad to collaborate with Dispenser or others who are interested in reflinks style projects.  When it comes to hundred-year archiving of copyrighted material such as a webcache, IA is currently a more reliable host than we are (it's in their mission; it's not in ours; unless we revise or clarify our mission future Wikimedians might decide to delete that material).  Sj (talk) 18:27, 4 July 2014 (UTC)
 * I suggest we serve those who must use the "public view" to see, not just those with the connections to phone up the IA and have a deeper look. I would not ask any editor to have to wait through a series of IA look-ups as part of the reflinks process. James Salsman (talk) 00:56, 5 July 2014 (UTC)


 * Is this compatible with the Wikimedia Foundation's Licensing Policy? If so, would it require a (rather radical) Exemption Doctrine Policy? Or if not, what kinds of changes would need to be made to the Licensing Policy, and do we have reason to think the Wikimedia community would go along with those? -Pete F (talk) 20:17, 4 July 2014 (UTC)
 * The discussion below goes on many tangents, so I'd like to express clearly why I think compliance with Wikimedia's Licensing Policy is a significant hurdle for this proposal. The issue is not a legal one, but a question of whether the proposal is in line with the shared values we hold across the Wikimedia Movement, in this case most concisely and officially expressed in that policy. Specifically, the policy states:
 * "All projects are expected to host only content which is under a Free Content License, or which is otherwise free..."
 * The current proposal is about hosting content that is almost entirely non-free.
 * It's possible that one might express the output of this proposal as something fundamentally different from a Wikimedia project; but in my view it would be a project, because it would be public-facing, designed for end users; and in service to our mission to share knowledge freely.
 * So if we were to start a project with the express purpose of hosting non-free content, it would be necessary per the Licensing Policy to create an EDP. But that EDP would have to be incredibly broad, for this proposal to work, which would strain against provision #3 of the Licensing Policy -- that EDPs must be minimal.
 * Finally, it is of course possible for something like the Licensing Policy to be replaced or amended. But in order for that to happen, there would have to be broad agreement that the changes are a good idea.
 * If those proposing this project do not think these issues are significant, and do not have a clear plan for addressing them, in my view that is a sign of a fatally flawed project. Any effort to pursue this would likely end up being wasted, and causing strong backlash from the Wikimedia community upon project completion. -Pete F (talk) 17:44, 5 July 2014 (UTC)
 * Thanks for raising this, Pete. I think that it's ok according to our licensing policy to have support-tools that themselves cache / use / analyze non-free (but legal to host) works, as long as the cached material is not itself part of one of our major projects, not available as part of any dumps, &c.  However, dealing with this in a way that clearly distinguishes it from our free-content haven would be one more layer of overhead.  We can avoid that by working with partners that work with all knowledge, not just free knowledge.  For the sake of robustness, we can mirror any non-free data that we rely on in a non-public workspace without confusing end users. Sj (talk) 20:12, 6 July 2014 (UTC)


 * I am not a lawyer and thus am not going down this rabbit hole. In this case I would recommend asking WMF legal about this. Given our goals, and that this tool is designed not as a primary access point for citations but more as a fallback/verification method, there shouldn't be that much of a legal hurdle. Given that we already use a massive amount of copyrighted works on our wikis (images, video, quotes), I cannot see providing a fallback/verification process for what we use as citations, together with our goal of improving our existing citations, as straying too far from our mission. Betacommand (talk) 00:49, 5 July 2014 (UTC)
 * Why do you think the cached sources would be available for download? The proposal suggests that they might be used to prevent link rot, but that is certainly not the same as re-publishing them. The cached resources can be used to locate moved copies of the same versions as cited. James Salsman (talk) 00:56, 5 July 2014 (UTC)
 * One of the proposed uses for the tool is to provide archived copies of the data both to verify that the existing website contains the original material, and as a backup of the data in case the data is no longer available at the original source. Betacommand (talk) 01:17, 5 July 2014 (UTC)
 * Entire copies, or just diffs? James Salsman (talk) 01:36, 5 July 2014 (UTC)
 * Entire copies; working with diffs is just a headache. Let's say we have the base copy and 100 snapshots. If we wanted to compare version 3 with version 99, we would need to apply 99 diffs to the original data, and then do the comparison. Depending on how things get moved around in the webpage, it may turn into a nightmare. Getting complete snapshots is just easier to work with, and we can use hashing to verify that data hasn't been modified between dates. Using a diff-based system is just begging for nightmares and untangling a huge mess if it's not done 100% correctly the first time. It also increases the risk of data corruption tainting the entire source instead of just one copy. Betacommand (talk) 01:44, 5 July 2014 (UTC)
 * Is that your characterization of the proposal, or User:Dispenser's? Where in the proposal does it say that entire copies of the cached references will be available for download? James Salsman (talk) 01:57, 5 July 2014 (UTC)
 * Speaking with Dispenser the plan is to have complete snapshots to work with. There was nothing about using diffs for storage, only to be able to show a diff between two snapshots. I reference the ability to view/verify the contents of a reference after the source becomes unavailable. I am unsure exactly how we will be dealing with archived copies, and who will have access to them. (most current version will probably have widespread access). Before making any kind of comment on that I would need to consult legal and see what their position is. Betacommand (talk) 02:05, 5 July 2014 (UTC)
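To make the "complete snapshots plus hashing" approach from this thread concrete, here is a minimal sketch: each snapshot is stored whole, a content hash answers "has anything changed between these two dates?", and a diff is computed on demand only for display. The function names and the use of SHA-256 and `difflib` are assumptions for illustration, not something the proposal specifies.

```python
import difflib
import hashlib


def content_hash(snapshot: str) -> str:
    """Hash a complete snapshot so versions can be compared cheaply."""
    return hashlib.sha256(snapshot.encode("utf-8")).hexdigest()


def unchanged(first: str, second: str) -> bool:
    """True when two complete snapshots are byte-for-byte identical."""
    return content_hash(first) == content_hash(second)


def show_diff(old: str, new: str) -> list:
    """Compute a display-only diff directly between any two snapshots.

    Nothing is stored as a diff, so comparing version 3 with version 99
    never requires replaying the versions in between.
    """
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile="old",
            tofile="new",
            lineterm="",
        )
    )
```

In practice the hash would be computed once at archive time and kept in the index, so the comparison does not need to re-read either snapshot.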


 * How does this relate to existing work:
 * Extension:WebCache,
 * Archived Pages? --Nemo 21:33, 4 July 2014 (UTC)
 * Given that the documentation for Extension:WebCache lacks any kind of details/explanations, it's hard to figure out exactly where/how that project is functioning. As for the second, see above: the IA has quite a few issues, and using a third-party service brings significant headaches along with performance and quality issues. Betacommand (talk) 00:53, 5 July 2014 (UTC)


 * What happened to the idea to use the textual content of references to cache the information necessary for fast partial ordering of category tree intersections? That would have been from around March 2012 if I'm remembering this right. James Salsman (talk) 01:03, 5 July 2014 (UTC)
 * That was never part of this project. There may have been similar projects in the past; however, this one focuses on caching and extracting metadata from URLs. Betacommand (talk) 01:13, 5 July 2014 (UTC)
 * Perhaps I am not remembering the discussion correctly, but I would point out that data-mining text to include services such as enhancing machine translation has been ruled fair use and can not be restricted by publishers' attempts to do so any more than they can prohibit construction of indices to licensed text by readers, human or machine. A numeric list of the location of words on a page has the same Feist v. Rural copyright status as a telephone directory in the US. James Salsman (talk) 01:36, 5 July 2014 (UTC)
 * To my knowledge there are zero plans to work with categories or category intersections. Betacommand (talk) 01:45, 5 July 2014 (UTC)
 * You and I have a different familiarity with Dispenser's work and assessment of his capabilities. James Salsman (talk) 01:57, 5 July 2014 (UTC)
 * Dispenser may have other tools/ideas; however, this proposal is for the basic infrastructure of his reflinks tool. I don't doubt that there will be additional tools or projects that eventually spin off from this one, but the proposal does not cover those. Betacommand (talk) 02:00, 5 July 2014 (UTC)


 * This seems to have scary legal considerations, especially if the cached copies are made publicly available. According to Wikilegal/Copyright Status of Wikipedia Page Histories, it seems that 17 U.S.C. § 108 doesn't apply to content on the Internet, so an archive would have to depend solely on 17 U.S.C. § 107 (fair use), which is somewhat unpredictable and ambiguous. There have apparently been lawsuits against the Internet Archive (see Wayback Machine), but I'm not sure if any case has been settled in court or if all cases have been settled outside court. The outcome of a lawsuit looks unclear, as suggested at w:WP:LINKVIO. Until the legal considerations become known, I think that it would be safer not to provide cached content to people. The archive could cost the WMF a lot of money in case someone sues, and it would be safer to let other people take that risk. The legal considerations may be different if the archive only is accessible by bots or a small number of users. --Stefan2 (talk) 15:56, 6 July 2014 (UTC)

NIH vs. improving existing projects
Existing implementations with a track record have actual failure modes, whereas hypothetical future implementations will fix everything and never fail. Everyone has a friend who once rode in a vehicle using those other inferior wheels that failed somehow, inspiring an entirely new ground-up wheel design. :-)

Walking through the concerns mentioned above about the existing IA setup for caching references:
 * ''they lack any kind of effective API (or at least documentation of it) ... I cannot make a query and find all archive dates for a work.
 * The search term you want is 'wayback machine API'.  The Wayback API page links to a CDX server project which seems to do everything you could want.  If you need extra functionality you probably don't need to fork that code: you can submit patches that run other queries on the raw data.
 * ''existing archives can be randomly removed due to changes in a sites robot.txt (yes the actual content may not be deleted, but as an end user I cannot tell the difference as I cannot access the needed data).
 * Any system we set up would face similar problems. This proposal explicitly aims to abide by robots.txt files.
 * ''Overall it introduces a significant overhead and headache to rely on a third party.
 * It introduces a more significant overhead to design, test, and maintain our own bespoke system, used only by us, rather than contributing to a system already in use to cache half of the Internet.
 * ''[to] bring that in house along with the planned additional features would either require direct server access to the IA's system or a massive amount of data being moved between the tool and the IA.
 * What massive amount of data would need to move? What tools/scripts would be running over that data?
 * ''I am unsure if the IA could reasonably handle the workload needed to operate this tool.
 * Which parts? IA works with petabytes of data; what is described in this proposal basically duplicates one of their core tools, which they have run for over a decade.  The parts they don't have experience handling are ones we've never handled before either.

This idea for caching refs is a really great one. I am excited to see it move forward. Luckily, the first part of this work (caching, indexing, compressing, backing up, looking up and displaying snapshots, parsing robots.txt, handling DMCA) is already done. That means we can focus on the second half (extracting metadata, autofilling forms, identifying and updating deadlinks, showing diffs, verifying cites &c). Sj (talk) 09:39, 5 July 2014 (UTC)
 * Actually you are ignoring a few points that I made. The API that is provided does not provide the needed information (there is no way of getting all dates of a particular URL, unless you want to screen scrape the HTML version), and what is provided (snapshots) has custom HTML/CSS/JavaScript injected into it. If you read what I said, I said we would respect robots.txt, which means that some things may not be cacheable. However, unlike IA, if the robots.txt changes we will not remove pre-existing caches. In regards to the data being moved, every one of the 20-30 million links on enwiki will probably need to be retrieved, examined and have metadata extracted. Doing this ourselves enables us to distribute the queries across different sources so that no one source is getting an excessive load. If we are using just IA, that would mean 30 million requests, and growth of about 6,000 a day for new links. We also don't have methods for finding out if IA creates new snapshots, which means that periodically we would need to re-run all the links to check for newer snapshots. Then you would need to look at the bandwidth load that is being placed solely on IA's servers to provide all of this. Realistically, anyone placing that much stress on a group's servers will cause disruptions in service and performance problems. Betacommand (talk) 10:59, 5 July 2014 (UTC)
 * Hi BetaC, a few replies:
 * API: Please follow the link I provided to the CDX server - it is a deeplink to the section on that page that shows you how to get all captures of a given URL (or URL match). If you just pull out the date field, that gives you all dates for that URL.
 * robots.txt : I recognize the difference when you put it that way, but it doesn't seem like a major issue here; it would be a tiny portion of all links. I believe you could get a copy of anything cached in this way that IA later removed from the public web, if you wanted to host that small # of links elsewhere.
 * Load: Each of the ~30M links on enwiki currently has at least one snapshot cache in the wayback machine, so what's needed is the examining and metadata extraction.  Finding a way to do this effectively is an interesting question, and one we should solve together with the data mavens who currently handle IA's infrastructure; it depends mainly on what is being analyzed and how those algorithms are written.
 * Checking for new snapshots: why would you want to do this? I think a snapshot close to the time of the citation is what matters.  (However yes, one can find out if new snapshots have been taken, or at least run a "take new snapshot if none has been taken in the past 12 months" cronjob.)
 * Disruptions in service and performance: First, anything that we can accomplish with only 30TB of disk and a few fast machines is not going to significantly impact their service and performance.  Second,  we're not talking about a "third party link cacher" that we might ask to run 30 million requests.  These are fellow travellers who often collaborate with Wikipedians and have already run 30 million requests (and another 6,000 a day) on their own initiative and are likely interested in making the results work better.  Sj (talk) 10:31, 6 July 2014 (UTC)
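As a sketch of the "pull out the date field" suggestion above: the code below builds a query for the Wayback CDX server and extracts the capture timestamps from a JSON response. The endpoint and the `url`, `output` and `fl` parameters follow the CDX server documentation as I understand it and should be checked against the current documentation before use; the helper names are hypothetical, and the code deliberately makes no network request itself.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint, per the CDX server documentation; verify before relying on it.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def cdx_query_url(page_url: str) -> str:
    """Build a CDX query asking only for the timestamp of every capture."""
    params = {"url": page_url, "output": "json", "fl": "timestamp"}
    return CDX_ENDPOINT + "?" + urlencode(params)


def capture_dates(cdx_json: str) -> list:
    """Extract capture timestamps from a JSON-format CDX response.

    The JSON output is assumed to be a list of rows whose first row
    names the fields; an empty response means no captures.
    """
    rows = json.loads(cdx_json)
    if not rows:
        return []
    header, *captures = rows
    column = header.index("timestamp")
    return [row[column] for row in captures]
```

Fetching the URL returned by `cdx_query_url` (for instance with `urllib.request.urlopen`) and passing the response text to `capture_dates` would give the list of archive dates for one link; doing that for every link on enwiki is where the load concerns raised in this thread come in.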

General recommendations
James Salsman (talk) 21:11, 6 July 2014 (UTC)
 * Do not depend on IA/archive.org. It is profoundly slow (30+ second response times are typical). It is blocked in China and many other regions. Its uptime availability is not very good. It retroactively respects robots.txt. We should not ask reflinks users to have to wait for and depend on IA.
 * Do not censor with respect to robots.txt except in the cases of actions which are completely unambiguously not human-directed. This includes checking to see whether a reference added by a human is still the same reference available at the same address. This is not a general spidering tool, it is a cache based entirely on human editing behavior.
 * Do not make the verbatim cached sources available for direct download, unless link rot occurs, the source can not be found in a new location, and robots.txt would have allowed it and still does.
 * Store the sources in verbatim, text only, and inverted index format. Make the source text available to machine translation and similar researchers willing to agree to limited fair use disclosure. Make the inverted index and occurrence frequency statistics available to anyone, vigorously exercising U.S. Feist v. Rural rights.
 * Consider use of the inverted index format for fast diffs and sorting fast category tree intersections for, e.g., spreading activation use.
 * Use a genuine redundant array of inexpensive disks architecture for network attached storage projects from off-the-shelf mass market storage hardware, instead of "commercial-grade" equipment, whatever that means. Avoid premium name brands associated with tax avoidance and surveillance overhead.
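For readers unfamiliar with the inverted index format recommended above, the following minimal sketch maps each word to the documents and positions where it occurs and derives occurrence frequency statistics from that. The tokenisation (lower-cased `\w+` matches), the document identifiers and the function names are assumptions for illustration only.

```python
import re
from collections import defaultdict


def build_inverted_index(documents: dict) -> dict:
    """Map each word to {document id: [positions of the word in that document]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        # Simplistic tokeniser: lower-case runs of word characters.
        for position, word in enumerate(re.findall(r"\w+", text.lower())):
            index[word][doc_id].append(position)
    return index


def occurrence_counts(index: dict) -> dict:
    """Total number of occurrences of each word across all documents."""
    return {
        word: sum(len(positions) for positions in postings.values())
        for word, postings in index.items()
    }
```

The index stores only where words occur, not the running text, which is the property the recommendation relies on when it distinguishes the index and statistics from the verbatim sources.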