Extension talk:Google Search-2

From mediawiki.org
The following discussion has been transferred from Meta-Wiki.
Any user names refer to users of that site, who are not necessarily users of MediaWiki.org (even if they share the same username).

Google mini

Note that I've implemented a version of this search - the code is nearly identical - on innovation.intuit.com using a dedicated Google mini. --MHart 13:46, 25 October 2006 (UTC)

(In response to an emailed question)

Google sells several appliances for corporate intranet searching. We use one of those appliances - it is essentially the same as the Google website, but it can live behind a corporate firewall.

The Google website can return XML results just like I use in the extension I created. You have to sign up for a Google API license key and you are limited to 1000 searches per day.

Google also lets you embed unlimited search results in pages, but those results will include ads.

The search extension I created finds anything on the website that is linked from "somewhere" - the normal way that Google crawls a site to find things to index. In the Google appliance as well as in a "robots.txt" file on a website, you can prevent Google from indexing edit links and such.

You can restrict the results to any URL. For instance, if you want to restrict to a particular namespace, then you would use (URL not properly encoded...)

&site_search=yoursite.com/index.php/NameSpace:

Depending on your URL rewrite rules, it might also be:

&site_search=yoursite.com/index.php?title=NameSpace:

That tells Google to only return results that contain that URL. --MHart 15:13, 11 August 2006 (UTC)
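The "(URL not properly encoded...)" caveat above can be handled in PHP with urlencode(). The following is only a minimal sketch: the helper name buildNamespaceRestriction() is hypothetical, "Help" is a placeholder namespace, and the parameter name site_search is simply copied from the example above, so check it against what your Google setup actually expects.

```php
<?php
// Hypothetical helper (not part of the extension): builds the
// site-restriction fragment for one namespace.
// Assumption: the parameter name "site_search" is copied from the example above.
function buildNamespaceRestriction($host, $namespace, $usesTitleParam = false)
{
    // Path-style URLs look like /index.php/NameSpace:
    // Query-style URLs look like /index.php?title=NameSpace:
    $prefix = $usesTitleParam ? '/index.php?title=' : '/index.php/';
    $value = $host . $prefix . $namespace . ':';

    // urlencode() escapes "/", "?", "=" and ":" so the whole value
    // travels as a single query parameter.
    return '&site_search=' . urlencode($value);
}

echo buildNamespaceRestriction('yoursite.com', 'Help'), "\n";
echo buildNamespaceRestriction('yoursite.com', 'Help', true), "\n";
```

The returned fragment can be appended to the search URL as-is; which of the two forms applies depends on your rewrite rules, as noted above.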

error on access

hi Matt,

Thanks for the extension! I could not find an email address for you and don't have a Meta account, so I hope you don't mind me asking here. I have implemented this in my MediaWiki 1.6.6 testing installation, but an error is logged when I attempt the search:

PHP Fatal error:  Cannot instantiate non-existent class:  specialsearch in /var/www/html/extensions/GoogleSearch.php
on line 13, referer: http://example.com/index.php/Home

any ideas? tia and thanks again!


Note: I was able to get around the error by adding an include for SpecialSearch.php to GoogleSearch.php:

include("/var/www/html/includes/SpecialSearch.php");

Empty documents being returned (and other oddities)

I'm currently trying to get this implemented. I can get it to try to connect to my appliance, but I get garbage in the httpd log file. Example:

[client 10.4.253.100] PHP Fatal error:  Call to a member function on a non-object in /var/www/html/includes/SkinTemplate.php on line 306, referer: http://host1/index.php/Main_Page
[client 10.4.253.100] PHP Warning:  file_get_contents(http://10.2.231.199/search?q=test&output=xml&site=mysite&client=mysite&as_sitesearch=host1&start=): failed to open stream: Connection refused in /var/www/html/extensions/GoogleSearch.php on line 34, referer: http://host1/index.php/Main_Page
/tmp/kwRZt3:1: parser error : Document is empty

^
/tmp/kwRZt3:1: parser error : Start tag expected, '<' not found

^
unable to parse /tmp/kwRZt3

I'm wondering: do I need to do some configuration on the GA itself to match how you've written this extension?

Edit: Actually, after looking at it further, it seems as though the default search isn't getting entirely disabled. I'm still at a loss as to what to do, though.

--- Probably not a GA problem; rather, the file_get_contents() PHP call is failing, either because the GA isn't allowing the query (firewall issue) or, more likely, because PHP isn't configured to allow file_get_contents() to query a URL. Check out this article on setting the allow_url_fopen configuration in your php.ini file. --MHart 13:42, 25 October 2006 (UTC)
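To tell the two failure cases in the reply above apart, the fetch can check the PHP setting first and then test the return value instead of passing an empty string to the XML parser. This is a minimal sketch, not the extension's actual code: fetchSearchXml() is a hypothetical helper name, and returning null on failure is an arbitrary choice.

```php
<?php
// Hypothetical helper (not part of the extension): fetches the XML
// for a search URL and returns null when nothing usable came back.
function fetchSearchXml($url)
{
    // file_get_contents() can only open http:// URLs when the
    // allow_url_fopen setting is enabled in php.ini.
    if (!filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN)) {
        error_log('allow_url_fopen is disabled in php.ini; cannot query a URL');
        return null;
    }

    // Suppress the PHP warning and inspect the return value instead.
    $xmldata = @file_get_contents($url);
    if ($xmldata === false || $xmldata === '') {
        // Covers "Connection refused" (firewall, wrong host) and empty responses.
        error_log("Search request failed or returned no data: $url");
        return null;
    }

    return $xmldata;
}
```

With a check like this, a refused connection shows up as a single log line rather than the "Document is empty" parser errors. In php.ini the setting is written allow_url_fopen = On.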


Hi Matt, can you explain more about the GA variables you have in the GoogleSearch.php file, and about customizing the $url? What do I set those 3 variables to? Say my domain is "hello.org". Thanks in advance - imti

--- Those variables are:

  • site=collection
    • When the GA indexes, it can put different domains or sub-domains into "collections", so that you can run a query and prevent extra results from sites/domains you don't want. This is so that you can, as an enterprise, index your entire organization but also easily create separate search pages for various sites and domains within the company.
    • If you don't have any collections defined, don't use this parameter: you'll get the default collection (probably all pages).
  • client=clientname
    • The GA is set up to track requests by various clients. However, this parameter isn't necessary unless the GA is set up to require it. If you use the GA test center to run a query, you can view the source HTML of the test page and see what the test page set up as hidden form fields for variables such as site and client.
  • as_sitesearch=site
    • This is to further filter the results. It is possible to create a different collection for every different wiki or sub-pages in a domain, but it is much easier to use the equivalent of Google's site:www.site.com/subpages. That's what this parameter does. I host over 100 wikis on the same server, many of them with something like: wikis.intuit.com/wikiname, so the as_sitesearch= lets me filter those results to just that one wiki.

Your settings depend on how your GA is set up. Ask your GA admin what the client name is and which collection to use for your hello.org site. Or you can log in to the GA, access the GA test center, and bring up the search page for hello.org. Then view the HTML and see what values it has set for client= and site=. --MHart 13:42, 25 October 2006 (UTC)
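Putting the three variables together, the query URL could be assembled roughly as follows. This is a sketch under assumptions: buildApplianceUrl() is a hypothetical helper, ga.example.com is a placeholder host, and the parameter names (q, output, site, client, as_sitesearch, start) are taken from the logged request earlier on this page rather than from appliance documentation.

```php
<?php
// Hypothetical helper (not part of the extension): assembles the
// appliance query URL from the settings described above.
function buildApplianceUrl($host, $query, $collection = null, $client = null, $siteSearch = null, $start = 0)
{
    $params = array('q' => $query, 'output' => 'xml');

    // Leave out site= when no collection is defined, so the appliance
    // falls back to its default collection.
    if ($collection !== null) {
        $params['site'] = $collection;
    }
    // client= is only needed when the appliance is set up to require it.
    if ($client !== null) {
        $params['client'] = $client;
    }
    // as_sitesearch= narrows results to one site or sub-path.
    if ($siteSearch !== null) {
        $params['as_sitesearch'] = $siteSearch;
    }
    $params['start'] = $start;

    // http_build_query() URL-encodes every value; "&" is passed
    // explicitly so the separator does not depend on php.ini.
    return 'http://' . $host . '/search?' . http_build_query($params, '', '&');
}

echo buildApplianceUrl('ga.example.com', 'test page', 'mysite', 'mysite', 'hello.org'), "\n";
```

The values for the collection and client arguments still have to come from your GA admin or the GA test center, as described above.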

Error returned on search

Hello, this looks great, but I'm getting an error returned when trying to search:

function getID() on a non-object in /hermes/web10/b996/pow.clf23/htdocs/suwiki/includes/SkinTemplate.php on line 311

Any thoughts on what that might be from? I'd really like to get this extension implemented, thanks! --Clf23 17:42, 8 October 2006 (UTC)

--- Seems likely that you are using a different MediaWiki version than the one this is implemented for. What version are you running? You can mail me at wiki at matthart dot com --MHart 13:54, 25 October 2006 (UTC)

Results page doubling up?

For some reason I'm seeing a double page when I get my search results. It is essentially 2 of the same page stacked on top of each other, only the bottom has no header to it. Is there something in the xslt code that could be causing this? I can't seem to track it down. --N0ctrnl 20:27, 28 November 2006 (UTC)

The cause was an extra $wgOut->output(); call that had gotten into N0ctrnl's code. --MHart 19:50, 29 November 2006 (UTC)

Getting this to work on MediaWiki 1.10.x (with PHP 5)

I found that the posted code was not compatible with PHP 5: the call to file_get_contents() was returning zero-length results. I found a clue in a warning on the PHP documentation site. Here is part of the warning:

When reading from anything that is not a regular local file, such as streams returned when reading remote files or from popen() and fsockopen(), reading will stop after a packet is available. This means that you should collect the data together in chunks...

Here's the fix I had to put in place based on the advice I found there:


With PHP 5 this line of code was not working:

$xmldata = file_get_contents($url);

Instead I had to comment that out and add the following:

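# $xmldata = file_get_contents($url);

# Open the URL as a stream and read it through to the end.
$xmldata = false;
$handle = fopen($url, "r");
if ($handle !== false) {
    $xmldata = stream_get_contents($handle);
    fclose($handle);
}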

For more on this and an alternate approach, see: http://us3.php.net/manual/en/function.fread.php

--Maxelrod 20:56, 21 September 2007 (UTC)
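The quoted warning's advice to "collect the data together in chunks" could look roughly like the loop below. This is only a sketch of that general pattern, not code from the extension or from the linked manual page: readAllInChunks() is a hypothetical helper name and the 8192-byte chunk size is an arbitrary choice.

```php
<?php
// Hypothetical helper (not part of the extension): reads an open
// stream to the end in fixed-size chunks.
function readAllInChunks($handle, $chunkSize = 8192)
{
    $data = '';
    // A single fread() on a network stream may return less than the
    // full response, so keep reading until end-of-file.
    while (!feof($handle)) {
        $chunk = fread($handle, $chunkSize);
        if ($chunk === false) {
            break; // read error: stop and return what was collected
        }
        $data .= $chunk;
    }
    return $data;
}
```

It would be used in place of stream_get_contents(), with the handle still opened by fopen($url, "r") and closed by fclose($handle) afterwards.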