User:Kevin Brown/ArchiveLinks/UserStories
From MediaWiki.org
Theme: Render external links with a "cache" link in MediaWiki articles.
- Render external links differently.
Done
- Render external links differently based on configuration in LocalSettings.php.
Done
- Create a sample config that would work with www.archive.org.
Done
- Create a sample config that would work with wikiwix.org.
Done
- Create a sample config that would work with a local spidering system.
Done
- Internationalize any UI that needs it (the word that appears in the "archive" link, anything else?)
Done, I think?
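As an illustration of the configuration-driven rendering described above, here is a minimal sketch in Python (the extension itself would do this in PHP). The URL templates, the service keys, the `Special:ViewArchive` path, and both function names are assumptions made for this example, not the extension's actual configuration; check each archive service's documentation for its real URL scheme.

```python
import html
from urllib.parse import quote

# Hypothetical URL templates, one per archive service (assumptions for
# illustration). "{url}" is the raw target, "{url_quoted}" is the same
# target percent-encoded for use inside a query string.
ARCHIVE_TEMPLATES = {
    "internet_archive": "https://web.archive.org/web/{url}",
    "wikiwix": "http://archive.wikiwix.com/cache/?url={url_quoted}",
    "local": "/wiki/Special:ViewArchive?url={url_quoted}",
}


def archive_link(service: str, url: str) -> str:
    """Return the archive URL for `url` under the configured service."""
    template = ARCHIVE_TEMPLATES[service]
    return template.format(url=url, url_quoted=quote(url, safe=""))


def render_external_link(service: str, url: str, label: str,
                         archive_text: str = "archive") -> str:
    """Render an external link followed by a small archive link.

    `archive_text` is the user-visible word, so it is the part that
    would need to be internationalized.
    """
    return '<a href="{0}">{1}</a> <sup><a href="{2}">[{3}]</a></sup>'.format(
        html.escape(url, quote=True),
        html.escape(label),
        html.escape(archive_link(service, url), quote=True),
        html.escape(archive_text),
    )
```

Switching the wiki from one archive provider to another then only means changing the service key, which is the point of driving the rendering from configuration.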
Theme: queueing links for spidering
- On article save, get external links.
Done
- On article save, get external links and place them into a queue.
Done
- Write another program that can consume links from the queue and print them to the screen.
Done
- Ensure that a second instance of that program invoked at the same time doesn't contend with the first.
- Create a permanent blacklist for domains we don't want to spider.
Sort of done (the blacklist table is checked but there is no UI to populate it)
- Ensure that any Wiki administrator can edit this blacklist.
Not done (I'm holding off on this for now, will come back to it later)
- Ensure that we don't queue such links for archival.
Done
Theme: spidering a link and storing HTML.
- Expand the program above to invoke wget to spider the link.
Done (at least partially; there are still some lingering problems with meta tags not being followed and links not being rewritten properly)
- Store these files in a permanent manner.
Done (at least in the most basic manner; Swift support has yet to be added)
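One way the wget invocation above might look, sketched in Python. The option set is an assumption about what a single-page archive needs, not the options the spider actually uses; the function names and the timeout and retry values are hypothetical. All flags shown are standard GNU wget options.

```python
import subprocess
from pathlib import Path


def build_wget_command(url: str, dest_dir: str) -> list[str]:
    """Build the wget argument list for archiving a single page."""
    # Only accept web URLs; this also stops a value such as "-O..." from
    # being interpreted by wget as a command-line option.
    if not url.startswith(("http://", "https://")):
        raise ValueError("not an http(s) URL: %r" % url)
    return [
        "wget",
        "--page-requisites",   # also fetch images, CSS, and scripts the page needs
        "--convert-links",     # rewrite links so the saved copy works locally
        "--adjust-extension",  # save HTML/CSS with matching file extensions
        "--span-hosts",        # allow requisites hosted on other domains
        "--timeout=30",
        "--tries=2",
        "--directory-prefix=" + dest_dir,
        url,
    ]


def spider(url: str, dest_dir: str) -> bool:
    """Run wget for `url`; return True if wget exited successfully."""
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    result = subprocess.run(
        build_wget_command(url, dest_dir), capture_output=True, text=True
    )
    return result.returncode == 0
```

Passing the arguments as a list rather than a shell string means the URL is never parsed by a shell, which matters because the URLs come from user-edited articles.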
Theme: linking to the stored HTML
- Create a web handler for archived links stored locally.
Doing...
- If the local file doesn't exist, show an interstitial page and then send users to the original URL.
- If the local file does exist, show a header, like Google Cache, with a placeholder for the content.
- If we think the content should be there but we still can't find it, show an error message.
- Make the archive links within articles point to the locally archived content as described above.
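The three cases the web handler has to distinguish can be sketched as a small decision function. This is an illustration in Python only; the names `Outcome`, `decide`, and `handle` are hypothetical, and `expected` stands for whatever record the spider keeps to say a link was archived successfully.

```python
from enum import Enum
from pathlib import Path


class Outcome(Enum):
    SHOW_ARCHIVE = "show_archive"   # header (like Google Cache) plus content
    INTERSTITIAL = "interstitial"   # notice page, then on to the original URL
    ERROR = "error"                 # content should exist but cannot be found


def decide(file_exists: bool, expected: bool) -> Outcome:
    """Pick the response for an archive request.

    file_exists: the archived copy is present on disk.
    expected:    our records say the link was archived successfully.
    """
    if file_exists:
        return Outcome.SHOW_ARCHIVE
    if expected:
        return Outcome.ERROR
    return Outcome.INTERSTITIAL


def handle(archive_path: str, expected: bool) -> Outcome:
    """Check the filesystem, then apply the decision above."""
    return decide(Path(archive_path).is_file(), expected)
```

Keeping the decision separate from the filesystem check makes the three user stories easy to test without creating any files.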
Theme: feed for external archive partners
- Decide on a format for the feed with partners (Archive.org, etc.?)
Done
- Push from MediaWiki (hard) or Archive.org polls? (easier)
Done. They will pull via the API.
- What format: RSS, Atom, or something directly from the API (http://www.mediawiki.org/wiki/API:Data_formats)?
Done Data format
- Figure out how to serve this format.
Done
- Make sure the job queue runs and actually inserts links into the queue.
Done
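Since partners pull the feed rather than having it pushed, a polling client needs a way to ask for "everything after what I already have". The sketch below shows one possible shape in Python, serialized as JSON. The field names (`links`, `continue`, `id`, `url`), the function name, and the paging scheme are assumptions for illustration; the actual feed uses whatever the API module defines.

```python
import json


def build_feed(entries, after_id=0, limit=2):
    """Return one page of the feed as a JSON string.

    entries:  iterable of dicts with at least "id" and "url" keys.
    after_id: only entries with a larger id are returned.
    limit:    maximum number of entries per page (at least 1).
    """
    limit = max(1, limit)
    ordered = sorted(
        (e for e in entries if e["id"] > after_id), key=lambda e: e["id"]
    )
    page = ordered[:limit]
    body = {"links": page}
    if len(ordered) > limit:
        # More entries remain: tell the client where to resume.
        body["continue"] = page[-1]["id"]
    return json.dumps(body, sort_keys=True)
```

A partner would call this repeatedly, passing the last `continue` value as `after_id`, until a response arrives without a `continue` field.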
Theme: putting it all together
- Create a way for links to be spidered automatically.
- With exponential backoff, if the URL 404s.
- Develop heuristics to decide whether to re-spider a link or just use previously cached items.
- Develop heuristics to avoid spidering and storing links that are determined to host malware.
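The backoff and re-spider stories above can be sketched as two small functions. This is a minimal Python illustration; the base delay of one hour, the 30-day cap, the 7-day freshness window, and the function names are all assumed values chosen for the example, not decisions recorded on this page.

```python
def next_retry_delay(failures: int,
                     base_seconds: int = 3600,
                     cap_seconds: int = 30 * 24 * 3600) -> int:
    """Seconds to wait before retrying a URL that has failed `failures` times.

    The delay doubles with each consecutive failure (for example, each 404)
    and is capped so a dead link is still rechecked occasionally.
    """
    if failures < 1:
        return 0
    return min(cap_seconds, base_seconds * 2 ** (failures - 1))


def should_respider(age_seconds: int,
                    max_age_seconds: int = 7 * 24 * 3600) -> bool:
    """Very simple freshness heuristic: re-spider once the copy is too old."""
    return age_seconds > max_age_seconds
```

With the defaults, a URL that keeps failing is retried after 1 hour, then 2, then 4, and so on until the 30-day cap is reached. A real heuristic for re-spidering would likely consider more than age, such as whether the page content changed between fetches.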