Extension:ArchiveLinks/Project/InternetArchive 2011 07 14
Kevin's project can be configured to use a local spider, but in Wikimedia's case it's desirable from many perspectives to have someone else do the crawling: fewer security issues to worry about, and fewer legal issues.
Internet Archive crawl latency is still too high (weeks or months), but they are starting a project called Archive-on-Demand to speed it up dramatically for some sites - like Wikipedia!
We've been talking with the Internet Archive people for a while over email, and met in person at the Internet Archive offices to kickstart this.
- Kevin Brown (via Skype) - Google Summer of Code student, Wikimedia Foundation
- Vinay Goel - Web Crawl Engineer, Internet Archive
- Neil Kandalgaonkar - Features Engineer/GSoC mentor, Wikimedia Foundation
- Alexis Rossi - Web Collections Manager, Internet Archive
Demoed Kevin's app to all.
Noted that the link format Kevin is using could be better: if it included the desired timestamp, the Wayback Machine would automatically pick the best archived version. It actually looks in both directions, future and past, for the nearest capture, which is exactly what we want.
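As a sketch, building such a timestamped link is just a matter of embedding a 14-digit `YYYYMMDDhhmmss` timestamp between the Wayback Machine's base path and the target URL; the function name and example URL below are illustrative, not the extension's actual code:

```python
from datetime import datetime

# Assumed Wayback Machine base path for timestamped lookups.
WAYBACK_PREFIX = "http://web.archive.org/web"

def wayback_url(target_url: str, when: datetime) -> str:
    """Build a Wayback Machine URL with an embedded 14-digit timestamp.

    Given such a timestamp, the Wayback Machine serves the capture
    nearest to it, looking both forward and backward in time.
    """
    ts = when.strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{ts}/{target_url}"

print(wayback_url("http://example.com/page",
                  datetime(2011, 7, 14, 12, 0, 0)))
# http://web.archive.org/web/20110714120000/http://example.com/page
```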
Vinay & Alexis described how they will be setting up this faster crawler:
- possibly headless simulated browsers like HTMLUnit working on VMs, controlled by some server.
- daemon which accepts new crawl requests
- daemon to make new crawl requests, from some feed of new external links provided by WMF
- other crawl requests may be made from other services in the IA
Decided it is best if WMF simply makes a new request every time someone saves the page. We won't attempt to lighten the burden by deduplicating URLs; we just ask for a new crawl every time someone touches that link's wikitext.
Kevin: what about when they aren't editing that portion of the page? We should use a diff to decide whether a link is really new. Seems like a good idea.
Kevin: Also, what about reverts? We don't want to link to a new version of the page just because vandalism was reverted.
Neil points out these are issues to be solved on the WMF side, since it's about which version to link to, not which version to crawl. We can be clever with diffs, or make sure to use the *old* timestamp of the page if we know the change is a revert. (Can we know this from the hook!?)
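The diff idea above boils down to comparing the sets of external links in the old and new revisions, and only requesting crawls for links that are genuinely new. A minimal sketch, assuming a naive regex-based extractor (real MediaWiki wikitext parsing is more involved):

```python
import re

# Illustrative extractor only -- not the extension's actual parser.
EXTLINK_RE = re.compile(r"https?://[^\s\]<>\"]+")

def extract_links(wikitext: str) -> set:
    """Pull the set of external URLs out of a chunk of wikitext."""
    return set(EXTLINK_RE.findall(wikitext))

def links_to_archive(old_wikitext: str, new_wikitext: str) -> set:
    """Only links present in the new revision but not the old one need a
    fresh crawl; a revert introduces nothing new and yields an empty set."""
    return extract_links(new_wikitext) - extract_links(old_wikitext)

old = "See [http://example.com/a source]."
new = "See [http://example.com/a source] and [http://example.com/b more]."
print(links_to_archive(old, new))  # {'http://example.com/b'}
print(links_to_archive(new, new))  # set() -- e.g. a revert back to `new`
```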
So, a new item was added to the ArchiveLinks agenda: make a feed page (RSS? Atom?) of new links to be archived. It will be public, so the IA can crawl it.
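Since the feed format is still to be agreed on, here is only a rough sketch of what an Atom version might look like, using Python's standard XML library; the feed title and field choices are assumptions, not a settled format:

```python
from datetime import datetime, timezone
from xml.etree import ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"

def links_feed(links):
    """Serialize (url, added-datetime) pairs as a minimal Atom feed.

    This is a sketch: a fully valid Atom feed also needs a feed-level
    <id> and author, omitted here for brevity.
    """
    ET.register_namespace("", ATOM_NS)
    feed = ET.Element(f"{{{ATOM_NS}}}feed")
    ET.SubElement(feed, f"{{{ATOM_NS}}}title").text = "New external links"
    ET.SubElement(feed, f"{{{ATOM_NS}}}updated").text = (
        datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"))
    for url, added in links:
        entry = ET.SubElement(feed, f"{{{ATOM_NS}}}entry")
        ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = url
        ET.SubElement(entry, f"{{{ATOM_NS}}}link", href=url)
        ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = url
        ET.SubElement(entry, f"{{{ATOM_NS}}}updated").text = (
            added.strftime("%Y-%m-%dT%H:%M:%SZ"))
    return ET.tostring(feed, encoding="unicode")
```

The IA's consumer would then poll this document and enqueue each entry's `link` for crawling.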
Several delays to minimize here:
- Delay before WMF publishes link in feed.
- Delay before crawl.
- Delay before we get the data.
- Delay until publicly available on Wayback machine.
Current delay before crawl can be as high as 6 months due to Alexa. But this new service (Archive on Demand) should have a delay-before-crawl of seconds to minutes.
Kevin: WMF anticipates 15,000 new external links per day from English Wikipedia. We can publish them to the feed instantly.
Vinay: other languages?
Neil: follows a power law. Back-of-envelope calculation: expect 1.5-2x the English volume when combining all languages.
Vinay: what's the highest burst rate?
Kevin: we don't know, but the highest edit rate (period) is 5-6 per second, and only some of those edits add external links.
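Putting the numbers above together as a quick back-of-envelope check (using the upper bound of Neil's 1.5-2x multiplier, an assumption):

```python
links_per_day = 15_000   # English Wikipedia estimate from above
multiplier = 2.0         # upper bound for all languages combined
seconds_per_day = 86_400

avg_per_sec = links_per_day * multiplier / seconds_per_day
print(f"{avg_per_sec:.2f} links/sec average")  # ~0.35 links/sec
```

So the average feed rate is well under one link per second; only bursts approach the 5-6 edits/sec ceiling, and even then only a fraction of edits carry new external links.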
What is the max latency of the new crawl system? Vinay has done tests: 6 processes + 6 URLs take about a minute to fully download all pages, and that's on one slow machine. Throughput will be higher with faster machines and multiple machines.
Retries? Vinay assumes one try is enough; if it fails, move on. Neil requests that they at least try once more, maybe in the next daily crawl, and if that fails, drop it.
As for publishing content on the Wayback Machine, current latency is 5 weeks... but there is no reason it can't be much, much faster. An engineer (Brad?) currently located in Vietnam is working on this. No schedule or timeline yet for that work. Must ping him again soon for details.
Embedded / scripted / delayed content
No good solutions for things like YouTube, where the desired movie isn't even loaded until an ad plays. Much discussion about this, but it is really a corner case and we're not going to do anything special to mitigate it.
IA does not make any attempt to deal with malware. They just show the web as it is. Somewhat surprising.
WMF reasons that it's not more dangerous than having the link on the page in the first place. And at least if the page is served by IA then we don't have any same-origin issues locally either.
Kevin: some people have asked for stats on how many links in WP are dead
Alexis, Vinay: we have done a crawl of all the external links we could find in WP. We could provide you with CDX files (summaries of URLs + HTTP status), even daily.
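CDX files are plain-text summaries with one capture per line. A small sketch of using such a file to answer Kevin's dead-link question, assuming the common field order of urlkey, timestamp, original URL, MIME type, then HTTP status (real CDX files declare their field order in a header line, which should be honored instead of hard-coding indices):

```python
def count_dead_links(cdx_lines):
    """Tally dead vs. total links from CDX summary lines.

    Treats 404, 410, and 5xx statuses as dead -- an illustrative
    definition, not the IA's.
    """
    dead = 0
    total = 0
    for line in cdx_lines:
        if line.startswith(" CDX") or not line.strip():
            continue  # skip the format-declaration header and blank lines
        fields = line.split()
        status = fields[4]  # assumes the field order described above
        total += 1
        if status in ("404", "410") or status.startswith("5"):
            dead += 1
    return dead, total

sample = [
    " CDX N b a m s k r M S V g",
    "com,example)/a 20110714000000 http://example.com/a"
    " text/html 200 ABC - - 1234 0 file.warc.gz",
    "com,example)/b 20110714000000 http://example.com/b"
    " text/html 404 DEF - - 456 0 file.warc.gz",
]
print(count_dead_links(sample))  # (1, 2)
```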
Service level guarantees
IA doesn't do that, even for paying customers.
Alexis: Mutual interest is the only guarantee you have. Wikipedia's links are of high interest to the IA: they are a carefully curated collection, good for research, etc.
Neil: And Wikipedia wants to have these links archived, because otherwise the encyclopedia isn't as verifiable.
Alexis: You do not want to get into DMCA takedowns and similar issues; they require a lot of customer service and legal staff. We have that set up already.
Who else will be using this Archive-on-Demand? Just the WMF to start with, and maybe one other IA customer. The WMF is the priority.
So, we're keeping the lawyers out of this one. :)
- Kevin embed the timestamp into links when targeting the Wayback Machine
- Kevin + Vinay agree on format of feed of external links that will be crawled at high speed. RSS? Atom? Something else?
- Neil recommends Atom only due to better escaping specification
- Kevin write software that will produce feed of external links in above format
- Kevin + Neil get that deployed to a public test wiki, so we can test it with the Internet Archive.
- Vinay handle consumption of that feed at the IA, feed it into the crawl
- Neil + Alexis ping Brad about latency in about a month
- Neil has to explain the significance of "ARRRRRR" for the Internet Archive to Kevin