weblinkchecker.py is a Pywikibot script that finds broken external links.
weblinkchecker.py can check the following:
- All URLs found on a single article
- All articles in a category
- All articles in one or more namespaces
- All articles on the wiki
- And much more! Check the list of command-line arguments.
It will only check HTTP and HTTPS links, and it will leave out URLs inside comments and nowiki tags. To speed itself up, it will check up to 50 links at the same time, using multithreading.
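The filtering and parallel checking described above can be sketched roughly as follows. This is only an illustration, not the script's actual code: the regular expressions, the function names `extract_links` and `check_all`, and the injected `check` callable are all assumptions made for this example.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical patterns; the real script's parsing may differ.
SKIP = re.compile(r"<!--.*?-->|<nowiki>.*?</nowiki>", re.DOTALL | re.IGNORECASE)
URL = re.compile(r"https?://[^\s\]<>\"|]+", re.IGNORECASE)


def extract_links(wikitext):
    """Return HTTP/HTTPS URLs, ignoring comments and nowiki sections."""
    visible = SKIP.sub("", wikitext)
    return URL.findall(visible)


def check_all(urls, check, max_workers=50):
    """Run check(url) for every URL, at most max_workers at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(urls, pool.map(check, urls)))


text = (
    "[http://example.org A] <!-- http://hidden.example --> "
    "<nowiki>https://raw.example</nowiki> ftp://x.example "
    "https://example.com/page"
)
links = extract_links(text)
```

Passing the checker in as a function keeps the sketch free of network access; a real checker would issue an HTTP request and return whether the server responded.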
The bot will not remove external links by itself; it will only report them, because removal would require strong artificial intelligence. It will only report dead links if they have been found unresponsive at least twice, with a default waiting period of at least one week between the first and the last check. This should help prevent users from removing links because of a temporary server failure. Please keep in mind that the bot cannot differentiate between local failures and server failures, so make sure you're on a stable Internet connection.
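The reporting rule can be expressed as a small decision function. The sketch below is an assumption about how such a rule might look, not the script's implementation; `should_report` and its parameters are hypothetical names.

```python
from datetime import datetime, timedelta


def should_report(first_dead, now, dead_days=7):
    """Return True when a link that failed again has been dead long enough.

    first_dead is the time the link was first found dead, or None if it
    has no earlier failure on record (so this is only the first failure).
    """
    if first_dead is None:
        return False
    return now - first_dead >= timedelta(days=dead_days)


first = datetime(2024, 1, 1)
```

With the default of seven days, a link that fails on 1 January and again on 4 January is not reported, while a second failure on 9 January is.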
The bot will save a history of unavailable links to a deadlinks subdirectory, e.g. ./deadlinks/deadlinks-wikipedia-de.dat. This file is not intended to be read or modified by humans. The dat file will be written when the bot terminates (because it is done or the user pressed CTRL-C). After a second run (with an appropriate wait between the two), a human-readable list of broken links will be generated as a text file.
Speculation. If someone is familiar with the technical details, please update this section.
To check for dead links for the first time for all pages on the wiki:
python weblinkchecker.py -start:!
This will add an entry to the .dat file, with a date. If you run this line again, it will add any new dead links that are not already listed, and it will remove any existing entries that are now working.
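The add-or-remove behaviour can be pictured with an in-memory mapping from URL to the date the link was first found dead. This is a simplified sketch under that assumption; the real layout of the .dat file is not documented here, and `update_history` is a hypothetical name.

```python
from datetime import datetime


def update_history(history, url, is_dead, now):
    """Update a {url: first_seen_dead} mapping after one check."""
    if is_dead:
        # Keep the original date so the wait is measured from the first failure.
        history.setdefault(url, now)
    else:
        # The link works again, so forget any earlier failure.
        history.pop(url, None)
    return history


history = {}
day1 = datetime(2024, 1, 1)
day9 = datetime(2024, 1, 9)
update_history(history, "http://example.org/gone", True, day1)
update_history(history, "http://example.org/gone", True, day9)
update_history(history, "http://example.org/flaky", True, day1)
update_history(history, "http://example.org/flaky", False, day9)
```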
After the bot has checked some pages, run it on these pages again at a later time. This can be done with this command:
python weblinkchecker.py -repeat
If the bot finds a broken link that has been broken for at least one week, it will log it in a text file, e.g.
deadlinks/results-wikipedia-de.txt. The written text has a format that is suitable for posting it on the wiki, so that others can help you to fix or remove the broken links from the wiki pages.
Additionally, it's possible to report broken links to the talk page of the article in which the URL was found (again, only once the linked page has been unavailable at least twice in at least one week). To use this feature, set report_dead_links_on_talk = True in your user-config.py.
Reports will include a link to the Internet Archive Wayback Machine if available, so that important references can be kept.
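As an illustration of how an archived copy might be looked up, the sketch below builds a query for the Internet Archive's public availability endpoint and reads the closest snapshot from a response. Whether the script uses this endpoint is an assumption; the function names and the sample response are made up for this example.

```python
import json
from urllib.parse import urlencode

# Assumption: the Internet Archive's public availability endpoint.
AVAILABILITY_API = "https://archive.org/wayback/available"


def availability_query(url):
    """Build the lookup URL for a possibly dead link."""
    return AVAILABILITY_API + "?" + urlencode({"url": url})


def closest_snapshot(response_text):
    """Return the archived URL from a JSON response, or None."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


query = availability_query("http://example.org")

# Made-up sample response, used only to exercise the parser.
sample = json.dumps({
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/2013/http://example.org",
        }
    }
})
```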
python weblinkchecker.py -start:!
- Loads all wiki pages in alphabetical order using the Special:Allpages feature.
python weblinkchecker.py -start:Example_page
- Loads all wiki pages using the Special:Allpages feature, starting at "Example page"
python weblinkchecker.py -weblink:www.example.org
- Loads all wiki pages that link to www.example.org
python weblinkchecker.py Example page
python weblinkchecker.py -page:Example page
- Only checks links found in the wiki page "Example page"
python weblinkchecker.py -repeat
- Loads all wiki pages where dead links were found during a prior run
The following list was extracted from the bot's help (using
python weblinkchecker.py -help). It is in addition to the global arguments used by most bots.
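|-cat||Work on all pages which are in a specific category. Argument can also be given as "-cat:categoryname" or as "-cat:categoryname|fromtitle" (using # instead of | is also allowed in this one and the following).|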
|-file||Read a list of pages to treat from the named text file. Page titles in the file may be either enclosed with [[brackets]], or be separated by new lines. Argument can also be given as "-file:filename".|
|-filelinks||Work on all pages that use a certain image/media file. Argument can also be given as "-filelinks:filename".|
|-google||Work on all pages that are found in a Google search. You need a Google Web API license key. Note that Google doesn't give out license keys anymore. See google_key in config.py for instructions. Argument can also be given as "-google:searchstring".|
|-imagesused||Work on all images that are contained on a certain page. Can also be given as "-imagesused:linkingpagetitle".|
|-interwiki||Work on the given page and all equivalent pages in other languages. This can, for example, be used to fight multi-site spamming. Attention: this will cause the bot to modify pages on several wiki sites. This is not well tested, so check your edits!|
|-links||Work on all pages that are linked from a certain page. Argument can also be given as "-links:linkingpagetitle".|
|-liverecentchanges||Work on pages from the live recent changes feed. If used as -liverecentchanges:x, work on x recent changes.|
|-logevents||Work on articles that were on a specified Special:Log. The value may be a comma separated list of these values: logevent,username,start,end or for backward compatibility: logevent,username,total. To use the default value, use an empty string. You have options for every type of logs given by the log event parameter, which could be one of the following: spamblacklist, titleblacklist, gblblock, renameuser, globalauth, gblrights, gblrename, abusefilter, massmessage, thanks, usermerge, block, protect, rights, delete, upload, move, import, patrol, merge, suppress, tag, managetags, contentmodel, review, stable, timedmediahandler, newusers. It uses the default number of pages, 10. In some cases it must be given as -logevents:"move,Usr,20".|
|-lonelypages||Work on all articles that are not linked from any other article. Argument can be given as "-lonelypages:n" where n is the maximum number of articles to work on.|
|-mysqlquery||Takes a MySQL query string like "SELECT page_namespace, page_title FROM page WHERE page_namespace = 0" and works on the resulting pages.|
|-newimages||Work on the most recent new images. If given as -newimages:x, will work on x newest images.|
|-newpages||Work on the most recent new pages. If given as -newpages:x, will work on x newest pages.|
|-page||Work on a single page. Argument can also be given as "-page:pagetitle", and supplied multiple times for multiple pages.|
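|-pageid||Work on a single pageid. Argument can also be given as "-pageid:pageid1,pageid2,." or "-pageid:'pageid1|pageid2|..'" and supplied multiple times for multiple pages.|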
|-prefixindex||Work on pages commencing with a common prefix.|
|-property:name||Work on all pages with a given property name from Special:PagesWithProp.|
|-random||Work on random pages returned by Special:Random. Can also be given as "-random:n" where n is the number of pages to be returned.|
|-randomredirect||Work on random redirect pages returned by Special:RandomRedirect. Can also be given as "-randomredirect:n" where n is the number of pages to be returned.|
|-recentchanges||Work on the pages with the most recent changes. If given as -recentchanges:x, will work on the x most recently changed pages. If given as -recentchanges:offset,duration it will work on pages changed from 'offset' minutes with 'duration' minutes of timespan. rctags are supported too. The rctag must be the very first parameter part.|
|-ref||Work on all pages that link to a certain page. Argument can also be given as "-ref:referredpagetitle".|
|-search||Work on all pages that are found in a MediaWiki search across all namespaces.|
|-searchitem||Takes a search string and works on Wikibase pages that contain it. Argument can be given as "-searchitem:text", where text is the string to look for, or "-searchitem:lang:text", where lang is the language to search items in.|
|-sparql||Takes a SPARQL SELECT query string including ?item and works on the resulting pages.|
|-sparqlendpoint||Specify SPARQL endpoint URL (optional). (Example : -sparqlendpoint:http://myserver.com/sparql)|
|-start||Specifies that the robot should go alphabetically through all pages on the home wiki, starting at the named page. Argument can also be given as "-start:pagetitle". You can also include a namespace. For example, "-start:Template:!" will make the bot work on all pages in the template namespace. The default value is -start:!|
|-subcats||Work on all subcategories of a specific category. Argument can also be given as "-subcats:categoryname" or as "-subcats:categoryname|fromtitle".|
|-subcatsr||Like -subcats, but also includes sub-subcategories etc. of the given category. Argument can also be given as "-subcatsr:categoryname" or as "-subcatsr:categoryname|fromtitle".|
|-transcludes||Work on all pages that use a certain template. Argument can also be given as "-transcludes:Title".|
|-uncat||Work on all pages which are not categorised.|
|-uncatcat||Work on all categories which are not categorised.|
|-uncatfiles||Work on all files which are not categorised.|
|-unconnectedpages||Work on the most recent unconnected pages to the Wikibase repository. Given as -unconnectedpages:x, will work on the x most recent unconnected pages.|
|-unusedfiles||Work on all description pages of images/media files that are not used anywhere. Argument can be given as "-unusedfiles:n" where n is the maximum number of articles to work on.|
|-unwatched||Work on all articles that are not watched by anyone. Argument can be given as "-unwatched:n" where n is the maximum number of articles to work on.|
|-usercontribs||Work on all articles that were edited by a certain user. (Example : -usercontribs:DumZiBoT)|
|-weblink||Work on all articles that contain an external link to a given URL; may be given as "-weblink:url"|
|-withoutinterwiki||Work on all pages that don't have interlanguage links. Argument can be given as "-withoutinterwiki:n" where n is the total to fetch.|
|-yahoo||Work on all pages that are found in a Yahoo search. Depends on python module pYsearch. See yahoo_appid in config.py for instructions.|
|-catfilter||Filter the page generator to only yield pages in the specified category. See -cat generator for argument format.|
|-grep||A regular expression that needs to match the article otherwise the page won't be returned. Multiple -grep:regexpr can be provided and the page will be returned if content is matched by any of the regexpr provided. Case insensitive regular expressions will be used and dot matches any character, including a newline.|
|-ignore||HTTP return codes to ignore. Can be provided several times: -ignore:401 -ignore:500|
|-intersect||Work on the intersection of all the provided generators.|
|-limit||When used with any other argument that specifies a set of pages, -limit:n makes the bot work on no more than n pages in total.|
|-namespaces, -namespace, -ns||Filter the page generator to only yield pages in the specified namespaces. Separate multiple namespace numbers or names with commas. You may use a leading "not" to exclude the namespace. If used with -newpages/-random/-randomredirect generators, -namespace/ns must be provided before -newpages/-random/-randomredirect. If used with -recentchanges generator, efficiency is improved if -namespace is provided before -recentchanges. If used with -start generator, -namespace/ns shall contain only one value.|
|-namespace||Only process templates in the namespace with the given number or name. This parameter may be used multiple times.|
|-onlyif||A claim the page needs to contain, otherwise the item won't be returned. The format is property=value,qualifier=value. Multiple (or none) qualifiers can be passed, separated by commas. Examples: P1=Q2 (property P1 must contain value Q2), P3=Q4,P5=Q6,P6=Q7 (property P3 with value Q4 and qualifiers: P5 with value Q6 and P6 with value Q7).|
|-onlyifnot||A claim the page must not contain, otherwise the item won't be returned. For usage and examples, see -onlyif above.|
|-ql||Filter pages based on page quality. This is only applicable if contentmodel equals 'proofread-page', otherwise has no effects. Valid values are in range 0-4. Multiple values can be comma-separated.|
|-repeat||Work on all pages where dead links were found before. This is useful to confirm that the links are dead after some time (at least one week), which is required before the script will report the problem.|
|-subpage||-subpage:n filters pages to only those that have depth n i.e. a depth of 0 filters out all pages that are subpages, and a depth of 1 filters out all pages that are subpages of subpages.|
|-titleregex||A regular expression that needs to match the article title, otherwise the page won't be returned. Multiple -titleregex:regexpr can be provided and the page will be returned if title is matched by any of the regexpr provided. Case insensitive regular expressions will be used and dot matches any character.|
|-titleregexnot||Like -titleregex, but return the page only if the regular expression does not match.|
|-xml||Should be used instead of a simple page fetching method from pagegenerators.py for performance and load issues|
|-xmlstart||Page to start with when using an XML dump|
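The matching rules given for -grep in the table above (any of several patterns, case insensitive, dot also matching a newline) correspond to the following Python regular expression flags. This is a sketch only: `grep_filter` is a hypothetical helper, pages are represented as plain (title, text) pairs, and the use of `search` rather than a full match is an assumption.

```python
import re


def grep_filter(pages, patterns):
    """Yield (title, text) pairs whose text matches any of the patterns."""
    compiled = [re.compile(p, re.IGNORECASE | re.DOTALL) for p in patterns]
    for title, text in pages:
        if any(rx.search(text) for rx in compiled):
            yield title, text


pages = [
    ("A", "See HTTP://Example.org\nmore text"),
    ("B", "nothing relevant here"),
]
matched = list(grep_filter(pages, ["http://example", "foo.bar"]))
```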
Furthermore, the following command line parameters are supported:
|-day||Do not report a broken link if it was first found dead only x days ago or less. If not set, the default is 7 days.|
|-notalk||Overrides the report_dead_links_on_talk config variable, disabling the feature.|
|-talk||Overrides the report_dead_links_on_talk config variable, enabling the feature.|
All other arguments will be regarded as part of the title of a single page, and the bot will only work on that single page.
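The fallback rule for leftover arguments can be sketched like this. It is a simplified illustration that ignores the page-generator options entirely; `parse_args` is a hypothetical function, and the "-day:x" spelling is an assumption modelled on the other options.

```python
def parse_args(args, default_days=7):
    """Split script-specific options from the words of a page title."""
    options = {"day": default_days, "talk": None, "repeat": False}
    title_words = []
    for arg in args:
        if arg.startswith("-day:"):
            options["day"] = int(arg[len("-day:"):])
        elif arg == "-talk":
            options["talk"] = True
        elif arg == "-notalk":
            options["talk"] = False
        elif arg == "-repeat":
            options["repeat"] = True
        else:
            # Anything unrecognised becomes part of the single page title.
            title_words.append(arg)
    options["title"] = " ".join(title_words) or None
    return options


parsed = parse_args(["Example", "page", "-day:14", "-notalk"])
```

Here "Example" and "page" are joined back into the title "Example page", which is why the unquoted form of the command works.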
The following config variables (to be declared in user-config.py) are supported by this script:
|max_external_links||The maximum number of web pages that should be loaded simultaneously. You should change this according to your Internet connection speed. Be careful: if it is set too high, the script might get socket errors because your network is congested, and will then think that the page is offline.|
|report_dead_links_on_talk||If set to true, causes the script to report dead links on the article's talk page if (and ONLY if) the linked page has been unavailable at least two times during a timespan of at least one week.|
|weblink_dead_days||Sets the timespan (default: one week) after which a dead link will be reported.|
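A user-config.py excerpt using these three variables might look like the following. The values are examples chosen for illustration (50 simultaneous links and a 7-day wait mirror the figures mentioned on this page), not recommendations.

```python
# user-config.py (excerpt); example values only.
max_external_links = 50            # web pages loaded simultaneously
report_dead_links_on_talk = True   # also report dead links on talk pages
weblink_dead_days = 7              # days a link must stay dead before it is reported
```

Lower max_external_links on a slow connection to avoid socket errors being mistaken for dead links.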