Manual talk:Pywikibot/weblinkchecker.py/LQT Archive 1

Future feature
It would be pretty neat if one day this could also check whether a page is saved in the Internet Archive. That should be pretty easy to do: append the address of the page you are checking to "http://web.archive.org/web/" and see whether it returns "$PAGE is not available in the Wayback Machine.".

Even neater would be if one could take a positive result and have the bot automatically insert in the article a link to the most recent archived copy, but that would not be as easy. --129.21.121.171 21:16, 10 May 2006 (UTC)
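A rough sketch of the check described above, in Python 3. The helper names (wayback_url, fetch_text, is_archived) are made up for illustration and are not part of weblinkchecker.py. The "not available" wording is taken from the comment above, and the live site may phrase it differently today, so treat the string test as an assumption.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

WAYBACK_PREFIX = "http://web.archive.org/web/"
# Wording quoted in the comment above; assumed, the site may have changed it.
NOT_ARCHIVED_MARKER = "is not available in the Wayback Machine"


def wayback_url(page_url):
    """Build the Wayback Machine address by appending page_url to the prefix."""
    return WAYBACK_PREFIX + page_url


def fetch_text(url, timeout=30):
    """Return the response body as text, or None on any HTTP or network error."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except (HTTPError, URLError, OSError):
        return None


def is_archived(page_url, fetch=fetch_text):
    """Guess whether page_url has an archived copy.

    `fetch` is injectable so the logic can be tried without network access.
    """
    body = fetch(wayback_url(page_url))
    return body is not None and NOT_ARCHIVED_MARKER not in body
```

Calling is_archived("http://example.org/") performs one HTTP request. Passing a stub as fetch lets the decision logic be exercised offline.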

Is it possible to feed the output from this into the replace.py or some other script? ST47 19:02, 6 December 2006 (UTC)

Environment proxy variables (http_proxy, no_proxy) support
Weblinkchecker now uses httplib, which does not honor the environment proxy variables http_proxy and no_proxy. So if you are behind a proxy and your wiki is NOT in the outer world, there is no way to check the links normally. Here's a patch which adds support for these variables, though the resulting proxy support is very buggy: it works only in Python 2.6 with your wiki address first in the no_proxy variable, like no_proxy="www.wiki.local, *.local". It would be great to merge this patch (or a fixed version of it) into your SVN...

Index: weblinkchecker.py
===================================================================
--- weblinkchecker.py	(revision 6937)
+++ weblinkchecker.py	(working copy)
@@ -95,6 +95,7 @@
 
 import wikipedia, config, pagegenerators
 import sys, re
+import os
 import codecs, pickle
 import httplib, socket, urlparse, urllib, urllib2
 import threading, time
@@ -297,6 +298,16 @@
         resolveRedirect. This is needed to detect redirect loops.
         """
         self.url = url
+        proxy = os.environ.get("http_proxy").replace('http://','',1)
+        self.noproxy = re.compile('\s*(?:,\s*)+').split(os.environ.get("no_proxy").replace('http://',''))
+        self.noproxy = map(lambda s: re.escape(s).replace('\\*','.*'), self.noproxy)
+        self.noproxy = '|'.join(self.noproxy)
+        self.noproxy = re.compile(self.noproxy)
+        if proxy and re.search(':', proxy):
+            self.proxy, self.proxyport = proxy.split(':')
+        else:
+            self.proxy = proxy
+            self.proxyport = 3128
         self.serverEncoding = serverEncoding
         self.header = {
             # 'User-agent': wikipedia.useragent,
@@ -315,30 +326,16 @@
         self.HTTPignore = HTTPignore
 
     def getConnection(self):
-        if self.scheme == 'http':
+        if self.proxy and not self.noproxy.match(self.host):
+            return httplib.HTTPConnection(self.proxy, self.proxyport)
+        elif self.scheme == 'http':
             return httplib.HTTPConnection(self.host)
         elif self.scheme == 'https':
             return httplib.HTTPSConnection(self.host)
 
     def getEncodingUsedByServer(self):
-        if not self.serverEncoding:
-            try:
-                wikipedia.output(u'Contacting server %s to find out its default encoding...' % self.host)
-                conn = self.getConnection()
-                conn.request('HEAD', '/', None, self.header)
-                response = conn.getresponse()
+        return 'utf-8'
 
-                self.readEncodingFromResponse(response)
-            except:
-                pass
-            if not self.serverEncoding:
-                # TODO: We might also load a page, then check for an encoding
-                # definition in a HTML meta tag.
-                wikipedia.output(u'Error retrieving server\'s default charset. Using ISO 8859-1.')
-                # most browsers use ISO 8859-1 (Latin-1) as the default.
-                self.serverEncoding = 'iso8859-1'
-        return self.serverEncoding
-
     def readEncodingFromResponse(self, response):
         if not self.serverEncoding:
             try:
@@ -367,6 +364,7 @@
             encoding = self.getEncodingUsedByServer()
         self.path = unicode(urllib.quote(self.path.encode(encoding)))
         self.query = unicode(urllib.quote(self.query.encode(encoding), '=&'))
+        self.url = urlparse.urlunparse([ self.scheme, self.host, self.path, '', self.query, urllib.quote(self.fragment) ])
 
     def resolveRedirect(self, useHEAD = False):
         '''
@@ -379,9 +377,9 @@
         conn = self.getConnection()
         try:
             if useHEAD:
-                conn.request('HEAD', '%s%s' % (self.path, self.query), None, self.header)
+                conn.request('HEAD', self.url, None, self.header)
             else:
-                conn.request('GET', '%s%s' % (self.path, self.query), None, self.header)
+                conn.request('GET', self.url, None, self.header)
             response = conn.getresponse()
             # read the server's encoding, in case we need it later
             self.readEncodingFromResponse(response)
@@ -446,7 +444,8 @@
             if isinstance(error, basestring):
                 msg = error
             else:
-                msg = error[1]
+                try: msg = error[1]
+                except: msg = error[0]
             # TODO: decode msg. On Linux, it's encoded in UTF-8.
             # How is it encoded in Windows? Or can we somehow just
             # get the English message?
@@ -483,7 +482,7 @@
             except httplib.error, error:
                 return False, u'HTTP Error: %s' % error.__class__.__name__
             try:
-                conn.request('GET', '%s%s' % (self.path, self.query), None, self.header)
+                conn.request('GET', self.url, None, self.header)
             except socket.error, error:
                 return False, u'Socket Error: %s' % repr(error[1])
             try:
@@ -789,6 +788,7 @@
     # that are also used by other scripts and that determine on which pages
     # to work on.
     genFactory = pagegenerators.GeneratorFactory()
+    global day
     day = 7
     for arg in wikipedia.handleArgs():
         if arg == '-talk':
@@ -805,7 +805,6 @@
         elif arg.startswith('-ignore:'):
             HTTPignore.append(int(arg[8:]))
         elif arg.startswith('-day:'):
-            global day
             day = int(arg[5:])
         else:
             if not genFactory.handleArg(arg):
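
For comparison, the proxy decision the patch above makes can be sketched on its own in Python 3. This is only an illustration under assumptions: the function names are hypothetical, the no_proxy entries are treated as glob patterns (as the patch does with "*"), and 3128 is kept as the fallback port because the patch uses it. Unlike the patch, it copes with either variable being unset.

```python
import fnmatch
import os
from urllib.parse import urlsplit

DEFAULT_PROXY_PORT = 3128  # same fallback port as in the patch above


def read_proxy(environ=os.environ):
    """Return (host, port) from http_proxy, or None when it is unset or empty."""
    value = environ.get("http_proxy", "").strip()
    if not value:
        return None
    if "://" not in value:
        value = "http://" + value  # lets urlsplit locate the host part
    parts = urlsplit(value)
    if not parts.hostname:
        return None
    return parts.hostname, parts.port or DEFAULT_PROXY_PORT


def read_no_proxy(environ=os.environ):
    """Return the comma-separated no_proxy entries as lowercase glob patterns."""
    raw = environ.get("no_proxy", "")
    return [item.strip().lower() for item in raw.split(",") if item.strip()]


def bypass_proxy(host, patterns):
    """True when host matches any no_proxy pattern (glob matching is assumed)."""
    host = host.lower()
    return any(fnmatch.fnmatchcase(host, pattern) for pattern in patterns)


def connection_target(host, environ=os.environ):
    """Return (proxy_host, proxy_port) to connect through, or (host, None) for a direct connection."""
    proxy = read_proxy(environ)
    if proxy is None or bypass_proxy(host, read_no_proxy(environ)):
        return host, None
    return proxy
```

With no_proxy="www.wiki.local, *.local", a host such as www.wiki.local is contacted directly and everything else goes through the proxy, regardless of the order of the entries. The standard library's urllib.request.getproxies() reads the same environment variables and may be a better starting point than hand-rolled parsing.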

Socket error: 'connection refused'
Is anyone else getting this error a lot? I am using MediaWiki v1.16.0beta. Thanks, Tisane 22:24, 7 April 2010 (UTC)
 * It happened to me too, and it can be solved by removing MW:Extension:LDAP Authentication (if it's been installed, naturally). JackPotte 21:29, 5 August 2010 (UTC)

Title encoding
Could we change the title encoding? For example, when reporting a URL on the page qualimétrie, the .txt file contains "qualimÃ©trie". JackPotte 17:24, 14 August 2010 (UTC)
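
The garbled form above is what UTF-8 bytes look like when they are read back as Latin-1, so one plausible explanation (an assumption, not confirmed in this thread) is that the report is written as UTF-8 and then viewed with a Latin-1 or Windows-1252 default. A small Python 3 sketch of the effect and of a best-effort repair; the repair helper is hypothetical:

```python
title = "qualimétrie"

# UTF-8 bytes decoded as Latin-1 reproduce the garbled title from the report.
garbled = title.encode("utf-8").decode("latin-1")


def repair(text):
    """Undo a UTF-8-read-as-Latin-1 mix-up; return text unchanged if it does not apply."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


print(garbled)
print(repair(garbled))
```

If that is the cause, opening the .txt file explicitly as UTF-8 in the viewer or editor would be enough, with no change needed in the script.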

Questions from BRFAs and elsewhere on English Wikipedia
The following questions from BRFAs and elsewhere on English Wikipedia are begging for better answers:

"weblinkchecker.py only checks Wayback. Could you add support for WebCite? How can i find reported dead links for a subject area using catscan (e.g. template, category inserted) ?" (expired at w:Wikipedia:Bots/Requests for approval/PhuzBot)

"How will the bot respond to a page that has been permanently moved? It would be better for us to get the updated URL for the live page (if possible) than to point to an archived copy." (archived at w:Wikipedia talk:External links/Archive 29)

Could it be "modified to crawl every page querying prop=extlinks instead of downloading the page text"? Could it be "done from a database dump or a toolserver query"? Why does it "post to article talk pages instead of applying w:template:dead link directly to the page"? (current at w:Wikipedia:Bots/Requests for approval/JeffGBot)

In addition, what bugs are meant by "the script on repository has some bugs you should care about."? (current at w:Wikipedia talk:Link rot)

— Jeff G. ツ 21:00, 23 February 2011 (UTC)

Similarly:

"Given the number of dead links, I strongly suggest the bot to place Dead link instead and attempt to fix them with Wayback/Webcite. Few people repair dead links in article, and I feel like posting them on talk page will be even more cumbersome. A couple of bots are already approved for that, though inactive." (current at w:Wikipedia:Bots/Requests for approval/JeffGBot)

Answers and patches to address the above would be helpful. — Jeff G. ツ 03:37, 4 April 2011 (UTC)
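
On the prop=extlinks question quoted above, a minimal Python 3 sketch of what such a query could look like. It only builds the request URL and reads the links out of a JSON reply in the API's default (formatversion 1) layout, where each link sits under the "*" key. The function names are hypothetical, continuation of long result sets is not handled, and nothing here is taken from weblinkchecker.py.

```python
import json
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def extlinks_query_url(title, endpoint=API_ENDPOINT):
    """Build an action=query URL asking for the external links of one page."""
    params = {
        "action": "query",
        "prop": "extlinks",
        "titles": title,
        "ellimit": "max",
        "format": "json",
    }
    return endpoint + "?" + urlencode(params)


def extract_links(response_text):
    """Collect external link URLs from a default-format JSON API reply."""
    data = json.loads(response_text)
    links = []
    for page in data.get("query", {}).get("pages", {}).values():
        for entry in page.get("extlinks", []):
            link = entry.get("*")
            if link:
                links.append(link)
    return links
```

This avoids downloading and parsing the page text, at the cost of losing the position of each link in the wikitext, which matters if the bot is meant to tag links in place.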

Re: Questions
"I'm still waiting", so I posted six bugs here. — Jeff G. ツ 20:53, 21 January 2012 (UTC)
 * I was just reading your thread on Link rot talk. Then I had a look at the six bugs you posted on the JIRA toolserver site. I don't have an account on JIRA, so am just mentioning this to you here. Everything you're doing looks good to me. And you have been working on this for over a year, based on your entries. I noticed that a query against WebCite was one of your six items. But you might want to confirm that WebCite is capable of handling additional load from Wikipedia. I mention that because of a response to an inquiry in section 3:
 * "Quite simply put WebCite cannot handle the volume that Wikipedia provides, even the small run of 10-50 PDFs a night by Checklinks seems to be contributing to the problem."
 * Date was June 2010, so things might have changed. And you might be doing something different than what that inquiry was about. Sounded like that referred to a batch load of nightly links to be archived by WebCite, whereas you are (maybe?) planning on querying WebCite to see if broken URLs are archived there already. Maybe that isn't nearly as resource intensive. Just a few thoughts I wanted to share. Thanks for what you're doing here! --FeralOink (talk) 10:40, 20 February 2012 (UTC)
 * Thanks. I've asked on that page.  — Jeff G. ツ 03:58, 17 March 2012 (UTC)

deadlinks-*.dat syntax
For a few months now, the content of my deadlinks-*.dat files has been impossible to understand, e.g.: (dp0 Vhttp://www.paremia.org/paremia/P13-20.pdf p1 (lp2 (Va beau mentir qui vient de loin p3 F1378644901.783 S'404 Not Found' p4 tp5 as. And since the script doesn't modify the pages itself, I've developed the following module, which has already fixed this on all the French wikis, on several thousand pages over six months: w:fr:Utilisateur:JackBot/hyperlynx.py.

For the other wikis we would need to change the template names and parameters, which would also allow translating them into the wiki's language, as I've done from English to French. JackPotte (talk) 13:06, 8 September 2013 (UTC)
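
The excerpt quoted above looks like a Python pickle in the text protocol (protocol 0), so the file is meant to be loaded with the pickle module and not read by eye. A minimal Python 3 sketch follows, assuming that the spaces in the excerpt stand for newlines and that each entry maps a URL to a list of (page title, Unix timestamp, error message) tuples. Both points are inferred from the excerpt, and the describe helper is hypothetical.

```python
import pickle
from datetime import datetime, timezone

# The quoted excerpt with newlines restored where the page shows spaces (assumed).
SAMPLE = (
    b"(dp0\n"
    b"Vhttp://www.paremia.org/paremia/P13-20.pdf\n"
    b"p1\n"
    b"(lp2\n"
    b"(Va beau mentir qui vient de loin\n"
    b"p3\n"
    b"F1378644901.783\n"
    b"S'404 Not Found'\n"
    b"p4\n"
    b"tp5\n"
    b"as."
)

# Only unpickle files you trust: pickle can execute code while loading.
history = pickle.loads(SAMPLE)


def describe(history):
    """Turn the loaded dictionary into one readable line per reported dead link."""
    lines = []
    for url, reports in sorted(history.items()):
        for page_title, timestamp, error in reports:
            when = datetime.fromtimestamp(timestamp, tz=timezone.utc)
            lines.append(f"{url} on [[{page_title}]]: {error} ({when:%Y-%m-%d})")
    return lines


for line in describe(history):
    print(line)
```

For a real file, replacing pickle.loads(SAMPLE) with pickle.load on the file opened in binary mode should give the same kind of dictionary. Titles containing non-ASCII characters may need the encoding argument of pickle.load if they were stored as Python 2 byte strings.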