Manual talk:Pywikibot/weblinkchecker.py

About this board

The following discussion has been transferred from Meta-Wiki.
Any user names refer to users of that site, who are not necessarily users of MediaWiki.org (even if they share the same username).

Future feature

It would be pretty neat if one day this could also check whether a page is saved in the Internet Archive (which should be pretty easy to do, since all you have to do is append the address of the page you are checking to "http://web.archive.org/web/" and see whether it returns "$PAGE is not available in the Wayback Machine.").

Even neater would be if one could take a positive result and have the bot automatically insert in the article a link to the most recent archived copy, but that would not be as easy. --129.21.121.171 21:16, 10 May 2006 (UTC)
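A rough sketch of how such a check might look, using the Wayback Machine's availability API rather than scraping the "not available" message (the function name and the way it would be wired into the script are illustrative only, not part of weblinkchecker.py):

# Sketch only: ask the Wayback Machine whether a URL has an archived copy
# and, if so, return the address of the closest snapshot.
import json
import urllib
import urllib2

def newest_archived_copy(url):
    query = 'http://archive.org/wayback/available?' + urllib.urlencode({'url': url})
    data = json.load(urllib2.urlopen(query))
    snapshot = data.get('archived_snapshots', {}).get('closest')
    if snapshot and snapshot.get('available'):
        # e.g. http://web.archive.org/web/<timestamp>/<original url>
        return snapshot['url']
    return None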

Is it possible to feed the output from this into the replace.py or some other script? ST47 19:02, 6 December 2006 (UTC)

Environment proxy variables (http_proxy, no_proxy) support

Weblinkchecker now uses httplib, which does not honor the environment proxy variables http_proxy and no_proxy. So if you are behind a proxy and your wiki is NOT reachable from the outside world, there is no way to check the links normally. Here is a patch which adds support for these variables, though the resulting proxy support is very buggy - it works only in Python 2.6 and with your wiki address first in the no_proxy variable, like no_proxy="www.wiki.local, *.local". It would be great to merge this patch (or a fixed version of it) into your SVN...

Index: weblinkchecker.py
===================================================================
--- weblinkchecker.py	(revision 6937)
+++ weblinkchecker.py	(working copy)
@@ -95,6 +95,7 @@
 
 import wikipedia, config, pagegenerators
 import sys, re
+import os
 import codecs, pickle
 import httplib, socket, urlparse, urllib, urllib2
 import threading, time
@@ -297,6 +298,16 @@
         resolveRedirect(). This is needed to detect redirect loops.
         """
         self.url = url
+        proxy = os.environ.get("http_proxy", "").replace('http://','',1)
+        self.noproxy = re.compile('\s*(?:,\s*)+').split(os.environ.get("no_proxy", "").replace('http://',''))
+        self.noproxy = map(lambda s: re.escape(s).replace('\\*','.*'), self.noproxy)
+        self.noproxy = '|'.join(self.noproxy)
+        self.noproxy = re.compile(self.noproxy)
+        if proxy and re.search(':', proxy):
+            self.proxy, self.proxyport = proxy.split(':')
+        else:
+            self.proxy = proxy
+            self.proxyport = 3128
         self.serverEncoding = serverEncoding
         self.header = {
             # 'User-agent': wikipedia.useragent,
@@ -315,30 +326,16 @@
         self.HTTPignore = HTTPignore
 
     def getConnection(self):
-        if self.scheme == 'http':
+        if self.proxy and not self.noproxy.match(self.host):
+            return httplib.HTTPConnection(self.proxy, self.proxyport)
+        elif self.scheme == 'http':
             return httplib.HTTPConnection(self.host)
         elif self.scheme == 'https':
             return httplib.HTTPSConnection(self.host)
 
     def getEncodingUsedByServer(self):
-        if not self.serverEncoding:
-            try:
-                wikipedia.output(u'Contacting server %s to find out its default encoding...' % self.host)
-                conn = self.getConnection()
-                conn.request('HEAD', '/', None, self.header)
-                response = conn.getresponse()
+        return 'utf-8'
 
-                self.readEncodingFromResponse(response)
-            except:
-                pass
-            if not self.serverEncoding:
-                # TODO: We might also load a page, then check for an encoding
-                # definition in a HTML meta tag.
-                wikipedia.output(u'Error retrieving server\'s default charset. Using ISO 8859-1.')
-                # most browsers use ISO 8859-1 (Latin-1) as the default.
-                self.serverEncoding = 'iso8859-1'
-        return self.serverEncoding
-
     def readEncodingFromResponse(self, response):
         if not self.serverEncoding:
             try:
@@ -367,6 +364,7 @@
             encoding = self.getEncodingUsedByServer()
             self.path = unicode(urllib.quote(self.path.encode(encoding)))
             self.query = unicode(urllib.quote(self.query.encode(encoding), '=&'))
+        self.url = urlparse.urlunparse([ self.scheme, self.host, self.path, '', self.query, urllib.quote(self.fragment) ])
 
     def resolveRedirect(self, useHEAD = False):
         '''
@@ -379,9 +377,9 @@
         conn = self.getConnection()
         try:
             if useHEAD:
-                conn.request('HEAD', '%s%s' % (self.path, self.query), None, self.header)
+                conn.request('HEAD', self.url, None, self.header)
             else:
-                conn.request('GET', '%s%s' % (self.path, self.query), None, self.header)
+                conn.request('GET', self.url, None, self.header)
             response = conn.getresponse()
             # read the server's encoding, in case we need it later
             self.readEncodingFromResponse(response)
@@ -446,7 +444,8 @@
             if isinstance(error, basestring):
                 msg = error
             else:
-                msg = error[1]
+                try: msg = error[1]
+                except: msg = error[0]
             # TODO: decode msg. On Linux, it's encoded in UTF-8.
             # How is it encoded in Windows? Or can we somehow just
             # get the English message?
@@ -483,7 +482,7 @@
             except httplib.error, error:
                 return False, u'HTTP Error: %s' % error.__class__.__name__
             try:
-                conn.request('GET', '%s%s' % (self.path, self.query), None, self.header)
+                conn.request('GET', self.url, None, self.header)
             except socket.error, error:
                 return False, u'Socket Error: %s' % repr(error[1])
             try:
@@ -789,6 +788,7 @@
     # that are also used by other scripts and that determine on which pages
     # to work on.
     genFactory = pagegenerators.GeneratorFactory()
+    global day
     day = 7
     for arg in wikipedia.handleArgs():
         if arg == '-talk':
@@ -805,7 +805,6 @@
         elif arg.startswith('-ignore:'):
             HTTPignore.append(int(arg[8:]))
         elif arg.startswith('-day:'):
-            global day
             day = int(arg[5:])
         else:
             if not genFactory.handleArg(arg):
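
To illustrate what the patch's no_proxy handling does with a value like the one above, here is the same wildcard logic pulled out on its own (a demonstration only, not part of the patch):

import re

no_proxy_env = "www.wiki.local, *.local"  # example value from the description above
patterns = re.split(r'\s*(?:,\s*)+', no_proxy_env)
patterns = [re.escape(p).replace(r'\*', '.*') for p in patterns]
noproxy_re = re.compile('|'.join(patterns))

print(noproxy_re.match("www.wiki.local") is not None)    # True  -> direct connection
print(noproxy_re.match("intranet.local") is not None)    # True, via the *.local wildcard
print(noproxy_re.match("en.wikipedia.org") is not None)  # False -> goes through the proxy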

Socket error: 'connection refused'

Is anyone else getting this error a lot? I am using MediaWiki v1.16.0beta. Thanks, Tisane 22:24, 7 April 2010 (UTC)

It happened to me and can be solved by removing MW:Extension:LDAP Authentication (if it has been installed, naturally). JackPotte 21:29, 5 August 2010 (UTC)

Title encoding

Could we change the title encoding? For example, when reporting a URL from the page wikt:fr:qualimétrie, the .txt contains the title in a mangled encoding instead of "qualimétrie". JackPotte 17:24, 14 August 2010 (UTC)
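If this is the usual case of UTF-8 text being written or read as ISO 8859-1, the symptom can be reproduced in a couple of lines (an illustration of the effect only, not a diagnosis of where in the script it happens):

# -*- coding: utf-8 -*-
# UTF-8 bytes reinterpreted as ISO 8859-1 turn "qualimétrie" into "qualimÃ©trie".
title = u'qualimétrie'
garbled = title.encode('utf-8').decode('iso8859-1')
print(repr(garbled))  # u'qualim\xc3\xa9trie', which displays as qualimÃ©trie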

Questions from BRFAs and elsewhere on English Wikipedia

The following questions from BRFAs and elsewhere on English Wikipedia are begging for better answers:

"weblinkchecker.py only checks Wayback. Could you add support for WebCite? How can i find reported dead links for a subject area using catscan (e.g. template, category inserted) ?" (expired at w:Wikipedia:Bots/Requests for approval/PhuzBot)

"How will the bot respond to a page that has been permanently moved? It would be better for us to get the updated URL for the live page (if possible) than to point to an archived copy." (archived at w:Wikipedia talk:External links/Archive 29#Web_Link_Checking_Bot)

Could it be "modified to crawl every page querying prop=extlinks instead of downloading the page text"? Could it be "done from a database dump or a toolserver query"? Why does it "post to article talk pages instead of applying w:template:dead link directly to the page"? (current at w:Wikipedia:Bots/Requests for approval/JeffGBot)

In addition, what bugs are meant by "the script on repository has some bugs you should care about."? (current at w:Wikipedia talk:Link rot#Web_Link_Checking_Bot)

 Jeff G. ツ 21:00, 23 February 2011 (UTC)

Similarly:

"Given the number of dead links, I strongly suggest the bot to place {{Dead link}} instead and attempt to fix them with Wayback/Webcite. Few people repair dead links in article, and I feel like posting them on talk page will be even more cumbersome. A couple of bots are already approved for that, though inactive." (current at w:Wikipedia:Bots/Requests for approval/JeffGBot)

Answers and patches to address the above would be helpful.  Jeff G. ツ 03:37, 4 April 2011 (UTC)
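On the prop=extlinks question above: a minimal sketch of what such a query against the standard MediaWiki API could look like (untested; the helper name and example call are illustrative, and continuation of long result sets is left out):

# Sketch: fetch a page's external links from the API instead of parsing its text.
import json
import urllib
import urllib2

def external_links(api_url, title):
    params = {
        'action': 'query',
        'prop': 'extlinks',
        'titles': title.encode('utf-8'),
        'ellimit': 'max',
        'format': 'json',
    }
    data = json.load(urllib2.urlopen(api_url + '?' + urllib.urlencode(params)))
    links = []
    for page in data['query']['pages'].values():
        for link in page.get('extlinks', []):
            links.append(link['*'])
    return links

# Hypothetical usage:
# external_links('https://en.wikipedia.org/w/api.php', 'Paris')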

Re: Questions

"I'm still waiting" (w:Another Brick in the Wall), so I posted six bugs here.  Jeff G. ツ 20:53, 21 January 2012 (UTC)

I was just reading your thread on Link rot talk. Then I had a look at the six bugs you posted on the JIRA toolserver site. I don't have an account on JIRA, so I am just mentioning this to you here. Everything you're doing looks good to me. And you have been working on this for over a year, based on your entries. I noticed that a query against WebCite was one of your six items. But you might want to confirm that WebCite is capable of handling additional load from Wikipedia. I mention that because of a response to an inquiry in section 3:
"Quite simply put WebCite cannot handle the volume that Wikipedia provides, even the small run of 10-50 PDFs a night by Checklinks seems to be contributing to the problem."
The date was June 2010, so things might have changed. And you might be doing something different from what that inquiry was about. It sounded like that referred to a batch load of nightly links to be archived by WebCite, whereas you are (maybe?) planning on querying WebCite to see whether broken URLs are already archived there. Maybe that isn't nearly as resource-intensive. Just a few thoughts I wanted to share. Thanks for what you're doing here! --FeralOink (talk) 10:40, 20 February 2012 (UTC)
Thanks. I've asked on that page.  Jeff G. ツ 03:58, 17 March 2012 (UTC)

For a few months the content of my deadlinks-*.dat files has been impossible to understand, e.g.:

(dp0
Vhttp://www.paremia.org/paremia/P13-20.pdf
p1
(lp2
(Va beau mentir qui vient de loin
p3
F1378644901.783
S'404 Not Found'
p4
tp5
as.
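
For what it's worth, those .dat files are ordinary Python pickles of the script's history: a dict mapping each dead URL to a list of (page title, timestamp, error) entries, as the dump above shows. A short sketch to read one back (the file name is just an example):

# Sketch: pretty-print a deadlinks-*.dat history file.
import pickle
import time

with open('deadlinks-frwiktionary.dat', 'rb') as f:  # example file name
    history = pickle.load(f)

for url, entries in history.items():
    print(url)
    for title, when, error in entries:
        print('    %s | %s | %s' % (title, time.ctime(when), error))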

And as the script didn't modify the pages itself, I've developed the following module, which has already fixed this on all the French wikis, across several thousand pages over six months: w:fr:Utilisateur:JackBot/hyperlynx.py.

For the other wikis we would need to change the template names and parameters, which would also allow translating them into the wiki's language, as I've done from English to French. JackPotte (talk) 13:06, 8 September 2013 (UTC)


Mdann52 (talkcontribs)
Legoktm (talkcontribs)

No, it doesn't.

Mdann52 (talkcontribs)

Is there an easy way to get it to ignore them? I can live with it if not, but a fix would be useful :)

Legoktm (talkcontribs)

Probably not. We do love patches though, see Manual:Pywikibot/Gerrit for instructions on how to get started, or ask in #pywikipediabot :)

Mdann52 (talkcontribs)

I have asked in the IRC channel, and it needs a complete rewrite to do this. I don't have the time ATM, so someone else will need to do this; feel free to mark this as resolved.

Xqt (talkcontribs)

You may file a new bug at bugzilla for this feature request. Otherwise it will be lost, I fear.  @xqt 05:28, 13 November 2013 (UTC)

Wesalius (talkcontribs)

Hi,

I ran python weblinkchecker.py -start:! a week ago. It went through our wiki at wikiskripta.eu fine and produced a .dat file. Today I ran python weblinkchecker.py -repeat and no .txt file appeared. Thank you for your advice.
