User:NKohli (WMF)/DeadlinkChecker

Links

 * Code available here
 * Available on Packagist
 * To report bugs, create a ticket in Phabricator and tag it with Community-tech
 * How to use?

About the project
While working on Community Wishlist survey Wish #1 - Migrating dead links to archives, we were faced with the problem of detecting dead links. It seems like a problem someone somewhere would already have solved, right? So we thought. It turns out that checking for dead links is quite a non-trivial problem. There are hundreds of ways a website can be dead, be temporarily dead, not be dead yet say it is dead, say it's not dead yet be dead... you get the idea. So we started writing our own Deadlink Checker library for PHP. It started out really basic, just checking the HTTP response code, but over time we added more complicated checks.

Here's how the code works (roughly):
 * For each incoming URL, we curl it to get the header information.
 * The curl options are set based on whether it's an HTTP or FTP request and whether we want just the header information or the complete HTML.
 * Based on the data we get back from curl, we perform the following checks:
   * Did the URL redirect to the domain root? We derive a set of probable domain roots before making this check.
   * Was the response code bad or uncertain? If so, we do a full request.
   * Does the final URL give away that it's an error/404 page?
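The checks above can be sketched roughly as follows. This is an illustrative Python sketch of the classification logic, not the library's actual PHP API; all function names and the exact rules (e.g. which status codes count as "bad") are assumptions for the example.

```python
from urllib.parse import urlparse

def probable_domain_roots(url):
    """Derive a set of probable domain-root URLs for the given URL."""
    p = urlparse(url)
    root = f"{p.scheme}://{p.netloc}"
    return {root, root + "/"}

def looks_like_error_page(final_url):
    """Heuristic: the final URL itself gives away that it's an error/404 page."""
    path = urlparse(final_url).path.lower()
    return "404" in path or "error" in path

def classify(original_url, final_url, status_code):
    """Classify a link as 'dead', 'alive', or 'uncertain' from curl-style
    header info (final URL after redirects, plus the response code)."""
    # Redirected to the domain root: the specific page is probably gone.
    if (final_url in probable_domain_roots(original_url)
            and original_url not in probable_domain_roots(original_url)):
        return "dead"
    if looks_like_error_page(final_url):
        return "dead"
    if status_code in (404, 410):
        return "dead"
    if 200 <= status_code < 300:
        return "alive"
    # Bad/uncertain response code: the real flow retries with a full
    # (non-header-only) request before deciding.
    return "uncertain"

# A page that redirects to its domain root is treated as dead even
# though the final response code is 200:
classify("https://example.org/old-page", "https://example.org/", 200)  # → "dead"
```

The redirect-to-root check runs before the status-code check because many sites answer 200 for vanished pages after bouncing the visitor to the homepage.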

The big piece missing from the code is soft-404 checks, but apart from that it works well. This library was used in conjunction with software that kept a database log of how many times it had checked a URL. If a URL claims to be dead at least 3 times over the course of a few days, we conclude that it is dead, to exclude the possibility of temporarily dead links.
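The "dead at least 3 times over a few days" rule might look something like the sketch below. It is a hypothetical Python illustration of the surrounding software's database log, not its actual schema or code; the reset-on-success behavior is an assumption consistent with excluding temporary outages.

```python
from collections import defaultdict

DEAD_THRESHOLD = 3  # from the text: "at least 3 times"

class DeadLinkLog:
    """Hypothetical tracker of per-URL failed checks across runs."""

    def __init__(self):
        # url -> list of timestamps (or run IDs) of checks that said "dead"
        self.failures = defaultdict(list)

    def record(self, url, is_dead, when):
        if is_dead:
            self.failures[url].append(when)
        else:
            # Assumption: a live response resets the count, since the
            # earlier failures were evidently temporary.
            self.failures[url].clear()

    def confirmed_dead(self, url):
        return len(self.failures[url]) >= DEAD_THRESHOLD

log = DeadLinkLog()
for day in ("mon", "wed", "fri"):
    log.record("https://example.org/gone", True, day)
log.confirmed_dead("https://example.org/gone")  # → True
```

Spacing the checks over several days (rather than retrying immediately) is what filters out links that are only briefly unreachable.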