Extension talk:Lucene-search/archive/2010


2010[edit]

Build function doesn't submit an IP address[edit]

We are using an IP-based authentication system for our wiki. Unfortunately, the Lucene search engine doesn't submit an IP address when it creates the index.

wikidb is being deployed?[edit]

Hi, I want to ask a question about Lucene-search. I received an error message when I tried to access http://localhost:8123/search/wikidb/test. It said "wikidb is being deployed or is not searched by this host". The full error message is listed below:

7754 [pool-2-thread-2] ERROR org.wikimedia.lsearch.search.SearchEngine  - Internal error in SearchEngine trying to make WikiSearcher: wikidb is being deployed or is not searched by this host
java.lang.RuntimeException: wikidb is being deployed or is not searched by this host
	at org.wikimedia.lsearch.search.SearcherCache.getLocalSearcher(SearcherCache.java:369)
	at org.wikimedia.lsearch.search.WikiSearcher.<init>(WikiSearcher.java:96)
	at org.wikimedia.lsearch.search.SearchEngine.search(SearchEngine.java:686)
	at org.wikimedia.lsearch.search.SearchEngine.search(SearchEngine.java:115)
	at org.wikimedia.lsearch.frontend.SearchDaemon.processRequest(SearchDaemon.java:92)
	at org.wikimedia.lsearch.frontend.HttpHandler.handle(HttpHandler.java:193)
	at org.wikimedia.lsearch.frontend.HttpHandler.run(HttpHandler.java:114)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
	at java.lang.Thread.run(Thread.java:636)

Thanks. --PhiLiP 10:41, 6 January 2010 (UTC)Reply

I think you should concentrate on the "is not searched by this host" part. Check your hostnames in lsearch-global.conf --Rainman 11:32, 6 January 2010 (UTC)Reply
I checked: my hostname is "philip-ubuntu-pc", and the hostnames in lsearch-global.conf are also "philip-ubuntu-pc". The $wgDBname in my LocalSettings.php is "wikidb". And my MediaWiki functions normally except for Special:Search.
// lsearch-global.conf
...

[Search-Group]
philip-ubuntu-pc : *

[Index]
philip-ubuntu-pc : *

...
--PhiLiP 11:41, 6 January 2010 (UTC)Reply

Seems I've found my way out. I forgot to execute ./build. How stupid I am... --PhiLiP 17:44, 6 January 2010 (UTC)Reply

Well, that is not your fault; the error message should have suggested or detected it... /me makes mental note --Rainman 18:07, 7 January 2010 (UTC)Reply


Initial error when running "./configure" ... any suggestions? I definitely need help... I've been trying to install this for years ;-).[edit]

$:/lucene-search-2.1 # ./configure /srv/www/vhosts/mySubDomain.com/subdomains/sysdoc/httpdocs/
Exception in thread "main" java.io.IOException: Cannot run program "/bin/bash": java.io.IOException: error=12, Cannot allocate memory
        at java.lang.ProcessBuilder.start(Unknown Source)
        at java.lang.Runtime.exec(Unknown Source)
        at java.lang.Runtime.exec(Unknown Source)
        at org.wikimedia.lsearch.util.Command.exec(Command.java:41)
        at org.wikimedia.lsearch.util.Configure.getVariable(Configure.java:84)
        at org.wikimedia.lsearch.util.Configure.main(Configure.java:49)
Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory
        at java.lang.UNIXProcess.<init>(Unknown Source)
        at java.lang.ProcessImpl.start(Unknown Source)
        ... 6 more

I tried it from another directory too:

Exception in thread "main" java.lang.NoClassDefFoundError: org/wikimedia/lsearch/util/Configure
Caused by: java.lang.ClassNotFoundException: org.wikimedia.lsearch.util.Configure
        at java.net.URLClassLoader$1.run(Unknown Source)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        at java.lang.ClassLoader.loadClassInternal(Unknown Source)
Could not find the main class: org.wikimedia.lsearch.util.Configure.  Program will exit.

What can I do???


I've got this from the repository (binary):

configure.inc:

dbname=wikilucene
wgScriptPath=/wiki/phase3
hostname=oblak
indexes=/opt/lucene-search/indexes
mediawiki=/var/www/wiki/phase3
base=/opt/lucene-search
wgServer=http://localhost

It looks like some private data. What does it mean/do?

Why is this extension maintained so poorly?

My first guess would be to google that IOException error you are getting and make sure that you have enough memory etc on your hosting plan. The "private" data is some defaults. Please note that adopting an insulting tone towards developers will not help answer your questions. This is an open-source project developed purely in after-hours and free time and as such is not as polished for commercial use. --Rainman 11:20, 11 January 2010 (UTC)Reply


Sorry for that tone, but it is very frustrating; I have been trying this install for years on different servers. I'm not a Java guru, but I am a web developer with 10 years of experience and an open-source developer too. I already googled that exception and got no answer at all. Any idea why it says it cannot run the bash program? Do I have to set a path variable or something else? What are the RAM requirements?

From what I gathered, this error appears either when /tmp is full, when there is not enough RAM to create a process, or when there is some kind of limit on the number of processes one can have on shared hosting. In theory, for a smaller wiki 128MB of available RAM should be enough, but lucene-search hasn't been built or optimized to run on scarce resources; on the contrary, it is optimized to make use of large amounts of memory and multiple CPU cores to run most efficiently under heavy load. --Rainman 21:12, 11 January 2010 (UTC)Reply

Is there a standard command to determine a process limit on a SUSE vserver? I found /etc/security/limits.conf but there is nothing in there. My tmp dir and RAM seem to be OK.

It is best to contact your hosting support on this one. --Rainman 10:13, 12 January 2010 (UTC)Reply
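A generic way to inspect per-user limits from a shell, in case it helps (output format varies by distribution):

$ ulimit -a | grep -i process
max user processes              (-u) 1024
# On OpenVZ/Virtuozzo-style vservers the relevant counters may instead live in
# /proc/user_beancounters (e.g. numproc), readable as root.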

I am getting the same error. I have BlueHost (shared hosting). I guess I'll have to use one of the other search extensions. Tisane 06:23, 28 March 2010 (UTC)Reply

I'm having the exact same error regarding 'Could not find the main class: org.wikimedia.lsearch.util.Configure.'. Is there any suggestion? CISL 14:23, 28 June 2011 (UTC)Reply

Can't search any string/keyword with dot (".")[edit]

My Lucene-Search 2.1 is working well, except for searching keywords with a dot (.) in the middle.

For example, I am able to search for "javadoc" and get results with any combination containing it, including "javadoc.abc".

However, if I search "javadoc.abc" directly, I get nothing.

Any idea is greatly appreciated.

There seems to be a bug currently with parsing phrases with stuff like dots in them, because dots are handled specially in the index. Should work if you drop the quotes and bring up the most relevant result. --Rainman 12:51, 21 January 2010 (UTC)Reply
Thank you Rainman for your info, but what do you mean by "drop the quotes and bring up the most relevant result"? I am not using any quotes while searching. Again, if I search for javadoc, I get all results; if I search for javadoc.abc, I get "No page text matches" even though there are plenty of pages containing javadoc.abc.
Are you sure you're using the lucene-search backend? Do the searches you make come up in the lucene-search logs? --Rainman 14:43, 23 January 2010 (UTC)Reply
I am sure I am using the lucene-search 2.1 backend and am using MWSearch to fetch results. Here is the log from searching for javadoc.abc:
Fetching search data from http://192.168.1.20:8123/search/mediawiki/javadocu82eabcu800?namespaces=0&offset=0&limit=20&version=2.1&iwlimit=10&searchall=0
Http::request: GET http://192.168.1.20:8123/search/mediawiki/javadocu82eabcu800?namespaces=0&offset=0&limit=20&version=2.1&iwlimit=10&searchall=0
total [0] hits
    
--Ross Xu 18:52, 5 February 2010 (UTC)Reply



I'm seeing the same issue: a search for 5.5.3 shows results with the standard search engine, but Lucene returns nothing. The same search on Wikipedia (different content, of course) returns results, so it feels like I'm missing something. You can see the query hit Lucene in the log:

 
4051499 [pool-2-thread-9] INFO  org.wikimedia.lsearch.frontend.HttpHandler  - query:/search/wikidb/5u800u82e5u800u82e3u800?namespaces=0&offset=0&limit=20&version=2&iwlimit=10&searchall=0 what:search dbname:wikidb term:5u800u82e5u800u82e3u800
4051501 [pool-2-thread-9] INFO  org.wikimedia.lsearch.search.SearchEngine  - Using FilterWrapper wrap: {0} []
4051504 [pool-2-thread-9] INFO  org.wikimedia.lsearch.search.SearchEngine  - search wikidb: query=[5u800u82e5u800u82e3u800] parsed=[custom(+(+contents:5^0.2 +contents:u800u82e5u800u82e3u800^0.2) relevance ([((P contents:"5 u800u82e5u800u82e3u800"~100) (((P sections:"5") (P sections:"u800u82e5u800u82e3u800") (P sections:"5 u800u82e5u800u82e3u800"))^0.25))^2.0], ((P alttitle:"5 u800u82e5u800u82e3u800"~20^2.5) (P alttitle:"5"^2.5) (P alttitle:"u800u82e5u800u82e3u800"^2.5)) ((P related:"5 u800u82e5u800u82e3u800"^12.0) (P related:"u800u82e5u800u82e3u800"^12.0) (P related:"5"^12.0))) (P alttitle:"5 u800u82e5u800u82e3u800"~20))] hit=[0] in 2ms using IndexSearcherMul:1264699369439 

I'm using MediaWiki 1.15.1, lucene-2.1 r61642, and MWSearch r62451 (with a small patch to make it work on 1.15.1)

--Nivfreak 19:18, 16 February 2010 (UTC)Reply

You need to download the appropriate MWSearch version for your MediaWiki using the "Download snapshot" link on the MWSearch page. Your patch doesn't seem to resolve all the compatibility issues. --Rainman 03:55, 17 February 2010 (UTC)Reply
You are absolutely right, and I knew better. I'm not even sure why I moved to the trunk version anymore. That solved my problems. Sorry for wasting your time. Nivfreak 18:38, 17 February 2010 (UTC)Reply

Red highlight[edit]

How can the red highlight in search results be changed to match Wikipedia's way of working? --Robinson Weijman 08:51, 22 January 2010 (UTC)Reply

Set the searchmatch CSS style in your Common.css, but before that check that you have the latest MWSearch; as far as I remember, we haven't been using red for a while now. --Rainman 11:40, 22 January 2010 (UTC)Reply
Thank you for the prompt response! Our MWSearch is r36482 (we have MW13.2). I see that the current MWSearch is 37906 - I'll give that a try. --Robinson Weijman 09:27, 25 January 2010 (UTC)Reply
Well, it took a while but I tried an upgrade - no change. I could not find the searchmatch CSS style in Common.css. Any ideas? --Robinson Weijman 10:07, 11 March 2010 (UTC)Reply
Finally solved - it was in the skin's main.css file:
span.searchmatch {
  color: blue;
}

--Robinson weijman 13:50, 19 January 2011 (UTC)Reply

Getting search results[edit]

Hi - how can I see what people are searching for? And how can I work out how good the searching is, e.g. % hits (page matches) / searches? --Robinson Weijman 14:56, 25 January 2010 (UTC)Reply

How to read lsearchd results[edit]

Alright, I've figured out how to do that (put the lsearchd results in a file) and what the problem was (too many search daemons running simultaneously!). But now I'm confronted with a new problem: how to read those results. Is it documented anywhere? --Robinson Weijman 15:58, 27 January 2010 (UTC)Reply

Case insensitive?[edit]

Hi - how can lucene search be made case insensitive? --Robinson Weijman 10:07, 4 February 2010 (UTC)Reply

My mistake, it is case insensitive. What I meant to ask was:

Wildcards[edit]

Can wildcards be added by default to a search? --Robinson Weijman 10:41, 4 February 2010 (UTC)Reply

Please test first, then ask. Wildcards do work, although they are limited to cases which won't kill the servers (e.g. you cannot do *a*). --Rainman 13:21, 4 February 2010 (UTC)Reply
Your statement implies that I did not test first. Of course I did. Perhaps I was unclear - what I meant to ask was whether the DEFAULT can be a wildcard search, e.g. if I search for "Exeter" using "Exe", the default search becomes "*Exe*". So I'm sorry I was unclear. Please don't make assumptions about your customers - I don't appreciate it. --Robinson Weijman 08:27, 5 February 2010 (UTC)Reply
You are not my customer, I'm just a random person like you. In any case, yes, this does not work and will not work; using this kind of wildcard makes search unacceptably slow for all but sites with very few pages (and lucene-search is designed for big sites). If this doesn't suit your needs you can either hack it yourself or pay someone to do it. --Rainman 01:23, 8 February 2010 (UTC)Reply
Thanks for the info.--Robinson Weijman 08:24, 8 February 2010 (UTC)Reply

Fatal error when searching anything with a colon[edit]

I am using Lucene-Search 2.1 and MWSearch.

Whenever I search for any keyword with a colon (e.g. searching for "ha:"), I get the fatal error:

Fatal error: Call to undefined method Language::getNamespaceAliases() in /var/www/html/wiki/extensions/MWSearch/MWSearch_body.php on line 96

The same thing happens when searching for anything like "all:something" and "main:something".

Any idea is appreciated. --Ross Xu 20:10, 10 February 2010 (UTC)Reply

Using 2.1 and "Did You Mean" not appearing[edit]

Hi all - as per the title. When I'm running a search, the "Did you mean" functionality does not appear to be working/showing. Do I need to do any special configuration to get this to work, or should it just work?

Any idea is appreciated. --Barramya 09:06, 17 February 2010 (UTC)Reply

I am having the same issue right now and I can't fix it ... Should I double-check that the line "$wgLuceneSearchVersion = 2.1;" in LocalSettings.php is NOT uncommented? I am looking forward to a solution. Thank you. Roemer2201 20:16, 19 April 2010 (UTC)Reply
This line should be uncommented (as the instructions say), and no other special settings are needed (a minimal example is sketched after this reply). Verify that:
  1. you have matching MediaWiki and MWSearch versions
  2. the searches actually reach the lucene-search daemon - you should be able to see them in the console log you get when you start ./lsearchd
  3. the lucene-search daemon has started without an error, especially the .spell index (also in the console log)
--Rainman 15:22, 20 April 2010 (UTC)Reply
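For reference, a minimal MWSearch block in LocalSettings.php might look roughly like this (host and path are placeholders; check the MWSearch page for the exact variable names for your MediaWiki version):

require_once( "$IP/extensions/MWSearch/MWSearch.php" );
$wgSearchType          = 'LuceneSearch';
$wgLuceneHost          = '192.168.0.1';   # host running lsearchd (placeholder)
$wgLucenePort          = 8123;            # default lsearchd port
$wgLuceneSearchVersion = 2.1;             # keep uncommented for 2.1 features such as "did you mean"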

lsearchd doesn't start properly[edit]

After a server restart my lsearchd (version 2.1) stopped working:

# ./lsearchd
Trying config file at path /root/.lsearch.conf
Trying config file at path /vol/sites/jewage.org/search/ls21/lsearch.conf
log4j: Parsing for [root] with value=[INFO, A1].
log4j: Level token is [INFO].
log4j: Category root set to INFO
log4j: Parsing appender named "A1".
log4j: Parsing layout options for "A1".
log4j: Setting property [conversionPattern] to [%-4r [%t] %-5p %c %x - %m%n].
log4j: End of parsing for "A1".
log4j: Parsed "A1" options.
log4j: Finished configuring.
0    [main] INFO  org.wikimedia.lsearch.interoperability.RMIServer  - RMIMessenger bound

After this it just hangs.

Why would this happen?

Java looks normal:

# java -version
java version "1.6.0_17"
Java(TM) SE Runtime Environment (build 1.6.0_17-b04)
Java HotSpot(TM) Client VM (build 14.3-b01, mixed mode, sharing) 

Here is my config:

# cat lsearch-global.conf
################################################
# Global search cluster layout configuration
################################################

[Database]
jewage : (single) (spell,4,2)
#(language,en)

[Search-Group]
jewage.org : *

[Index]
jewage.org : *

[Index-Path]
<default> : /search

[OAI]
<default> : http://localhost/w/index.php

[Namespace-Boost]
<default> : (0,2) (1,0.5) (110,5)

[Namespace-Prefix]
all : <all>


Tried jewage.org : (single) (spell,4,2) without effect --Eugenem 20:02, 17 February 2010 (UTC)Reply

OK, I needed to change all host addresses (config.inc, lsearch-global.conf) and rebuild the index --Eugenem 16:22, 18 February 2010 (UTC)Reply

Searching plurals returns a result[edit]

For example, searching for vpn returns the correct data set with a small description, but searching for vpns returns the same data set with only the title. Is this correct? I thought the search engine does a complete match. Is there a way to correct this?

Is there a way to search the raw wikitext?[edit]

Is there an option that will search the raw wikitext? For example, suppose I want to find all pages that use the Cite extension tag <ref>. Is there a way to specify "return a list of pages using the tag <ref>"? I tried searching for <ref> on Wikipedia, but all I got were references to "ref". Dnessett 18:35, 25 February 2010 (UTC)Reply

nope. --Rainman 23:38, 25 February 2010 (UTC)Reply
...but Extension:Replace_Text will do this. Andthepharaohs 07:20, 27 April 2010 (UTC)Reply

External Searches[edit]

Is there a way to include external (to the wiki) databases in the search results? Ideally I'd like to see:

  1. a list of results within the wiki and then
  2. underneath, one or more results for each external database.

By "external database", I mean like a content management system containing office / PDF documents. --Robinson Weijman 11:28, 4 March 2010 (UTC)Reply

No, unless you write a plugin for MediaWiki to query the external database and then show it on the search page. --Rainman 15:42, 4 March 2010 (UTC)Reply
Oops, I missed this reply. Thanks. --Robinson Weijman 10:30, 19 March 2010 (UTC)Reply

Meta Tags[edit]

Can Lucene work with meta tags, e.g. Extension:MetaKeywordsTag? That is, can pages with those tags appear higher in searches for those tags? --Robinson Weijman 12:09, 17 March 2010 (UTC)Reply

No. --Rainman 13:42, 17 March 2010 (UTC)Reply
Is it a good idea to add it? --Robinson Weijman 08:13, 18 March 2010 (UTC)Reply

Ranking Suggestion[edit]

Are there any plans to bring out a new Lucene version? I'd like to see functionality to dynamically change the search results based on previous hits and clicks (like Google). Or that users can report "this was a useful / useless link", e.g. by clicking on an up or down arrow. --Robinson Weijman 12:11, 17 March 2010 (UTC)Reply

This is very unlikely. --Rainman 13:42, 17 March 2010 (UTC)Reply
Why is it unlikely? Is nobody continuing to develop this extension? Wouldn't this be an improvement? --Robinson Weijman 08:15, 18 March 2010 (UTC)Reply
I am doing some maintenance in my free time, but there is no-one working full time, and no future major changes are currently planned. Of course, there are a million things that would be good to have, but as I said, they are probably not happening unless someone else does them. --Rainman 17:32, 18 March 2010 (UTC)Reply
OK thanks for the info. Let's hope someone steps forward then. --Robinson Weijman 10:28, 19 March 2010 (UTC)Reply

Searches with commas[edit]

We're trying to use Mediawiki and Lucene search but searches for phrases that have commas don't seem to work. Wikipedia demonstrates the problem too: Tom Crean (explorer) contains the sentence “In 1901, while serving on HMS Ringarooma in New Zealand, he volunteered to join Scott's 1901–04 British National Antarctic Expedition on Discovery, thus beginning his exploring career.” Searching for fragments of this sentence that don't involve commas—for example, "serving on HMS Ringarooma" or "thus beginning his exploring career"—turns up the page easily. But if you search for a fragment with a comma—for example, "In 1901, while serving on HMS Ringarooma" or "Discovery, thus beginning his exploring career"—there are no matches. Taking the comma out of the search query doesn't help: there are still no matches.

Is this a bug? If it is intended, is there a way to disable it so that comma phrases can be found? Our intended use case, the OEIS, is all about comma-separated search strings. Thanks. --Russ Cox 05:50, 18 March 2010 (UTC)Reply

yes, unfortunately it is a (known) bug. --Rainman 17:18, 18 March 2010 (UTC)Reply
If I wanted to build a customized version without the bug, where should I be looking for it? I'd be happy to try to track it down, fix it, and send the change back, but I don't know where to start. Thanks again. --Russ Cox 18:30, 18 March 2010 (UTC)Reply
The bug is a byproduct of an undertested "feature" and is actually very easy to fix, but it needs a complete index rebuild. You need to go to FastWikiTokenizerEngine.java and change MINOR_GAP = 2; to MINOR_GAP = 1; (see the sketch below). --Rainman 21:12, 18 March 2010 (UTC)Reply
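For reference, the change amounts to a one-character edit of that constant (a sketch; the exact modifiers of the declaration in FastWikiTokenizerEngine.java may differ):

// FastWikiTokenizerEngine.java -- positional gap inserted at minor punctuation such as commas.
// With a gap of 2, exact-phrase positions no longer line up across a comma;
// a gap of 1 lets phrases spanning a comma match again.
public static final int MINOR_GAP = 1; // was: MINOR_GAP = 2
// Note: the index must be completely rebuilt (./build) after this change.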

New version? 2.1.2 -> 2.1.3[edit]

MarkAHershberger has updated the version number. Is there a new release then? --Robinson Weijman 08:10, 22 March 2010 (UTC)Reply

Exception in thread "main" java.lang.NoClassDefFoundError: org/wikimedia/lsearch/util/Configure
Caused by: java.lang.ClassNotFoundException: org.wikimedia.lsearch.util.Configure
        at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:319)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:264)
        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:332)
Could not find the main class: org.wikimedia.lsearch.util.Configure.  Program will exit.

I did release a new version, but if you've found problems with it, please file a bug at bugzilla and assign it to me. --MarkAHershberger 23:31, 30 March 2010 (UTC)Reply

Thanks. So can you provide a brief summary of the changes or a link to the release notes? --Robinson Weijman 06:47, 6 April 2010 (UTC)Reply

Sort results by last modified date?[edit]

Is it possible to make Lucene sort its results by the last-modification date of the article? Even better, can this be exposed as an option for the user? Maiden taiwan 19:48, 7 April 2010 (UTC)Reply

How to run as a cron job[edit]

Hello there, I am new here and I have problems running (build) and (lsearchd) as a cron job. I have Ubuntu 10.4 and MediaWiki (1.15.3) with SMW (1.4.3) and SMW+/SMWhalo (1.4.6) on a virtual system. There had been some problems running (configure) and (build), but after using "sudo ..." everything was OK. I also installed the mentioned init script for Ubuntu (3.14 LSearch Daemon Init Script for Ubuntu) and it runs well.

My problem is getting everything to run as a cron job. I got some help by googling "crontab -e" and "sudo crontab -e" (for admin), or "gnome-schedule" as a GUI. I tried to add the path to the search engine to PATH in the crontab. This, or "export PATH=$PATH:(your path to the program directory)" in SHELL, helped to start "lsearchd" @reboot. When I tried to run "build" to reindex my wiki, there were problems finding the file "config.inc", so I added "$(dirname $0)" in front of "config.inc" and "LuceneSearch.jar". This solves the problem of running the (build) script from another directory:

#!/bin/bash

source $(dirname $0)/config.inc

if [ -n "$1" ]; then
  dumpfile="$1"
else
  dumps="$base/dumps"
  [ -e $dumps ]  || mkdir $dumps
  dumpfile="$dumps/dump-$dbname.xml"
  timestamp=`date -u +%Y-%m-%d`
  slave=`php $mediawiki/maintenance/getSlaveServer.php \
    $dbname \
    --conf $mediawiki/LocalSettings.php \
    --aconf $mediawiki/AdminSettings.php`
  echo "Dumping $dbname..."
  cd $mediawiki && php maintenance/dumpBackup.php \
    $dbname \
    --conf $mediawiki/LocalSettings.php \
    --aconf $mediawiki/AdminSettings.php \
    --current \
    --server=$slave > $dumpfile
  [ -e $indexes/status ] || mkdir -p $indexes/status
  echo "timestamp=$timestamp" > $indexes/status/$dbname
fi

cd $base &&
java -cp $(dirname $0)/LuceneSearch.jar org.wikimedia.lsearch.importer.BuildAll $dumpfile $dbname


Now I will try to start these scripts from the system-wide crontab (/etc/crontab), because "build" requires admin rights. In the system-wide crontab it is possible to run a script as root. I hope that's it! A sketch of such a crontab entry follows below.
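For illustration only, a system-wide crontab entry along these lines could work (paths and times are placeholders, not taken from this setup):

# /etc/crontab -- sketch; adjust paths and user to your installation
# Rebuild the index every night at 03:00, running the build script as root
0 3 * * *  root  /usr/local/search/lucene-search-2.1/build >> /var/log/lucene-build.log 2>&1
# Start the search daemon at boot (alternatively use the init script mentioned above)
@reboot    root  /usr/local/search/lucene-search-2.1/lsearchd >> /var/log/lsearchd.log 2>&1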


Greetz Benor 16:55, 5 May 2010 (UTC)

Problem with build script (and solution)[edit]

I had issues running the build script. The environment setup in config.inc was completely wrong. It turns out the configure script runs maintenance/eval.php, which dumps the contents of AdminSettings.php, and this interfered with the configure script. The solution is to put the contents of AdminSettings.php into LocalSettings.php and move or delete AdminSettings.php. Then re-configure, and build should run fine. (A sketch of what to move is shown below.)
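In case it helps: AdminSettings.php typically contains only the admin database credentials, so moving its contents usually means adding lines like these to LocalSettings.php (values are placeholders):

# Formerly in AdminSettings.php -- credentials used by maintenance scripts
$wgDBadminuser     = 'wikiadmin';      # placeholder
$wgDBadminpassword = 'adminpassword';  # placeholder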

My versions:
Ubuntu 10.04 x64
MediaWiki 1.16wmf4 (r66614)
PHP 5.3.2-1ubuntu4.1 (apache2handler)
MySQL 5.1.41-3ubuntu12
--Spt5007 23:06, 19 May 2010 (UTC)Reply

What does "spell" directive do in lsearch-global.conf?[edit]

What is the meaning of the spell directive, e.g.,

wikidb : (single) (spell,4,2) (language,en)

Thank you. Maiden taiwan 18:09, 8 June 2010 (UTC)Reply

It means that a spell-check index is going to be built using words occurring in at least 2 articles, and word combinations occurring in at least 4 articles. --Rainman 20:17, 8 June 2010 (UTC)Reply
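Put another way, the example entry can be read like this (the annotation paraphrases the reply above; it is not official documentation):

# (spell, <minimum articles for a word combination>, <minimum articles for a single word>)
wikidb : (single) (spell,4,2) (language,en)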

java.io.IOException: The markup in the document following the root element must be well-formed[edit]

When running the lucene update script I get the error:

java.io.IOException: The markup in the document following the root element must be well-formed

Things were working fine until I imported a bunch of new articles into the "en" wiki (below) using maintenance/importDump.php, then this error happened on the next reindex. How do you debug a problem like this?

Full output:

Trying config file at path /root/.lsearch.conf
Trying config file at path /usr/local/lucene-search-2.1/lsearch.conf
0    [main] INFO  org.wikimedia.lsearch.util.Localization  - Reading localization for De
98   [main] INFO  org.wikimedia.lsearch.util.Localization  - Reading localization for En
150  [main] INFO  org.wikimedia.lsearch.util.Localization  - Reading localization for Es
188  [main] INFO  org.wikimedia.lsearch.util.Localization  - Reading localization for Fr
234  [main] INFO  org.wikimedia.lsearch.util.Localization  - Reading localization for Nl
275  [main] INFO  org.wikimedia.lsearch.oai.OAIHarvester  - de_wikidb using base url: http://de.mywiki.com/w/index.php?title=Special:OAIRepository
275  [main] INFO  org.wikimedia.lsearch.oai.OAIHarvester  - de_wikidb using base url: http://de.mywiki.com/w/index.php?title=Special:OAIRepository
275  [main] INFO  org.wikimedia.lsearch.oai.IncrementalUpdater  - Resuming update of de_wikidb from 2010-06-09T20:00:02Z
644  [main] INFO  org.wikimedia.lsearch.oai.OAIHarvester  - en_wikidb using base url: http://en.mywiki.com/w/index.php?title=Special:OAIRepository
644  [main] INFO  org.wikimedia.lsearch.oai.OAIHarvester  - en_wikidb using base url: http://en.mywiki.com/w/index.php?title=Special:OAIRepository
644  [main] INFO  org.wikimedia.lsearch.oai.IncrementalUpdater  - Resuming update of en_wikidb from 2010-06-14T14:15:02Z
java.io.IOException: The markup in the document following the root element must be well-formed.
        at org.wikimedia.lsearch.oai.OAIParser.parse(OAIParser.java:68)
        at org.wikimedia.lsearch.oai.OAIHarvester.read(OAIHarvester.java:64)
        at org.wikimedia.lsearch.oai.OAIHarvester.getRecords(OAIHarvester.java:44)
        at org.wikimedia.lsearch.oai.IncrementalUpdater.main(IncrementalUpdater.java:191)
919  [main] WARN  org.wikimedia.lsearch.oai.IncrementalUpdater  - Retry later: error while processing update for en_wikidb : The markup in the document following the root element must be well-formed.
java.io.IOException: The markup in the document following the root element must be well-formed.
        at org.wikimedia.lsearch.oai.OAIParser.parse(OAIParser.java:68)
        at org.wikimedia.lsearch.oai.OAIHarvester.read(OAIHarvester.java:64)
        at org.wikimedia.lsearch.oai.OAIHarvester.getRecords(OAIHarvester.java:44)
        at org.wikimedia.lsearch.oai.IncrementalUpdater.main(IncrementalUpdater.java:191)
920  [main] INFO  org.wikimedia.lsearch.oai.OAIHarvester  - es_wikidb using base url: http://es.mywiki.com/w/index.php?title=Special:OAIRepository
920  [main] INFO  org.wikimedia.lsearch.oai.OAIHarvester  - es_wikidb using base url: http://es.mywiki.com/w/index.php?title=Special:OAIRepository
920  [main] INFO  org.wikimedia.lsearch.oai.IncrementalUpdater  - Resuming update of es_wikidb from 2010-06-08T17:09:38Z
1692 [main] INFO  org.wikimedia.lsearch.oai.OAIHarvester  - fr_wikidb using base url: http://fr.mywiki.com/w/index.php?title=Special:OAIRepository
1692 [main] INFO  org.wikimedia.lsearch.oai.OAIHarvester  - fr_wikidb using base url: http://fr.mywiki.com/w/index.php?title=Special:OAIRepository
1692 [main] INFO  org.wikimedia.lsearch.oai.IncrementalUpdater  - Resuming update of fr_wikidb from 2010-06-10T16:45:04Z
2556 [main] INFO  org.wikimedia.lsearch.oai.OAIHarvester  - nl_wikidb using base url: http://nl.mywiki.com/w/index.php?title=Special:OAIRepository
2556 [main] INFO  org.wikimedia.lsearch.oai.OAIHarvester  - nl_wikidb using base url: http://nl.mywiki.com/w/index.php?title=Special:OAIRepository
2556 [main] INFO  org.wikimedia.lsearch.oai.IncrementalUpdater  - Resuming update of nl_wikidb from 2010-06-08T21:30:11Z

Here is lsearch-global.conf:

[Database]
en_wikidb : (single) (spell,4,2) (language,en)
de_wikidb : (single) (spell,4,2) (language,de)
es_wikidb : (single) (spell,4,2) (language,es)
fr_wikidb : (single) (spell,4,2) (language,fr)
nl_wikidb : (single) (spell,4,2) (language,nl)

[Search-Group]
myhost : *

[Index]
myhost : *

[Index-Path]
<default> : /search

[OAI]
<default> : http://myhost/w/index.php
en_wikidb : http://en.mywiki.com/w/index.php
de_wikidb : http://de.mywiki.com/w/index.php
es_wikidb : http://es.mywiki.com/w/index.php
fr_wikidb : http://fr.mywiki.com/w/index.php
nl_wikidb : http://nl.mywiki.com/w/index.php

[Namespace-Boost]
<default> : (0,2) (1,0.5)

[Namespace-Prefix]
all : <all>
[0] : 0
[1] : 1
[2] : 2
[3] : 3
[4] : 4
[5] : 5
[6] : 6
[7] : 7
[8] : 8
[9] : 9
[10] : 10
[11] : 11
[12] : 12
[13] : 13
[14] : 14
[15] : 15

Thanks for any help! Maiden taiwan 18:49, 14 June 2010 (UTC)Reply

Update: I noticed that Special:Statistics was showing too few pages, so I ran maintenance/updateArticleCount.php --update and the crash went away. The search results page is still not reporting all matches though, only some. Maiden taiwan 16:41, 15 June 2010 (UTC)Reply
Run ./build on your wiki to make a fresh copy of the index, and then start incremental updates from that point. --Rainman 00:54, 16 June 2010 (UTC)Reply
Thanks - I did that, and managed to run "build" for each of my five wikis above. But now the update script is throwing the above-mentioned Java error again. (java.io.IOException: The markup in the document following the root element must be well-formed.) Any ideas how I can debug this? Maiden taiwan 04:11, 16 June 2010 (UTC)Reply
My first guess would be that the OAI extension is somehow mis-configured. Update to the latest version (SVN) of lucene-search and the exact URL used should appear in the log. Put that into a browser to see what kind of output is returned by OAIRepository. Also check your PHP error logs. --Rainman 17:47, 16 June 2010 (UTC)Reply
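For example, a generic check of the repository output (assuming the OAI extension is installed; the URL pattern follows the [OAI] entries above, and verb=Identify is a standard OAI-PMH request):

curl 'http://en.mywiki.com/w/index.php?title=Special:OAIRepository&verb=Identify'
# A healthy repository returns a well-formed <OAI-PMH> XML document; PHP warnings
# or HTML mixed into the output would explain the "must be well-formed" parser error.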

Lucene-search on OpenVMS[edit]

I've been working to port this extension to OpenVMS and ran across a few snags, only to realize it's a Java configuration issue.

The following logicals need to be declared:

$ set proc/parse=extended 
$ @SYS$COMMON:[JAVA$150.COM]JAVA$150_SETUP.COM 
$ define DECC$ARGV_PARSE_STYLE ENABLE 
$ define DECC$EFS_CASE_PRESERVE ENABLE 
$ define DECC$POSIX_SEEK_STREAM_FILE ENABLE 
$ define DECC$EFS_CHARSET ENABLE 
$ define DECC$ENABLE_GETENV_CACHE ENABLE 
$ define DECC$FILE_PERMISSION_UNIX ENABLE 
$ define DECC$FIXED_LENGTH_SEEK_TO_EOF ENABLE 
$ define DECC$RENAME_NO_INHERIT ENABLE 
$ define DECC$ENABLE_TO_VMS_LOGNAME_CACHE ENABLE 
$ FILE_MASK = %x00000008 + %x00040000 
$ DEFINE JAVA$FILENAME_CONTROLS 'file_mask' 

Also, the configure and build files need to be converted into COM files.

After that, FSUtils.java needs to be edited to recognize OpenVMS and set up so that for any kind of linking (hard or soft) it calls a C program to convert the file paths to VMS-compliant ones and then copies the file.

I'm still having a few issues with .links files but am working on a fix.

                 -- Need to make the C program remove all instances of "^." from the filename after it is converted from a POSIX to a VMS filename.


Indexing works! --Sillas33 18:56, 1 July 2010 (UTC)Reply

How does lsearchd work?[edit]

I'm curious how the lsearchd script works. I am trying to port it to OpenVMS and am currently getting a ClassDefNotFound error, mostly due to me not quite understanding what the script is passing where.

#!/bin/bash
jardir=`dirname $0` # put your jar dir here!
java -Djava.rmi.server.codebase=file://$jardir/LuceneSearch.jar -Djava.rmi.server.hostname=$HOSTNAME -jar $jardir/LuceneSearch.jar $*

Specifically: why is $0 part of jardir=`dirname $0`?

            Figured out that this sets jardir to the directory containing lsearchd.
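A quick illustration (the path is made up):

$ dirname /usr/local/search/lucene-search-2.1/lsearchd
/usr/local/search/lucene-search-2.1
# so the script can locate LuceneSearch.jar next to itself,
# regardless of the directory it is started from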

Is it possible to set this up so that it would run from an exploded jar? --Sillas33 15:34, 2 July 2010 (UTC)Reply

It is possible, but you would need to manually include in the classpath all of the libraries (look at build.xml for the list) which are automatically included in the jar. --Rainman 01:59, 3 July 2010 (UTC)Reply

Ended up running it from an exploded jar like so:

$ set def root:[000000]
$ java -cp "''class_path'" -
       "-D java.rmi.server.codebase=file:root/LUCENESEARCH.JAR" -
       "-D java.rmi.server.hostname=hostname" -
        "org.wikimedia.lsearch.config.StartupManager" "/root/"

I left a copy of the original jar in the directory along with the exploded version (I couldn't seem to update the jar with the ONE file I changed to make this work on VMS).

Skip a namespace[edit]

Is it possible to tell Lucene not to index a given namespace? Or can we tell MWSearch not to search a given namespace? Thanks. Maiden taiwan 15:43, 22 July 2010 (UTC)Reply

Lucene and LiquidThreads[edit]

On my wiki I use LQT 2.0alpha. Building the Lucene indexes stops on a page that is an LQT comment. How can I fix this problem? Does anyone else have problems with Lucene and LQT?

My wiki info: here

Make sure you use the latest lucene-search version (from SVN). LQT search was designed for this extension and should work. --Rainman 23:52, 31 July 2010 (UTC)Reply
I built the .jar from source and have this problem again - indexing stops on LQT pages :( I don't know where the problem is or why Lucene doesn't work on my wiki. How, and who, can help me with this?

Lucene Suggest and Fuzzy Search Problems[edit]

I am having some trouble getting the Lucene daemon to give spelling suggestions and to work with fuzzy searches. I have tried on different distros, versions, JDKs, and data sets, but nothing seems to work. The spelling indexes get built when I run the indexer, so that part seems ok. However, I notice that only wiki and wiki.links appear under indexes/search after indexing (wiki.spell appears under snapshots though). I also played around with the Java source to see if I can track it down. As far as I can see, this is the code where the problem happens:

SearchEngine.java

// find host 
String host = cache.getRandomHost(iid.getSpell());
if(host == null)
	return; // no available 


cache.getRandomHost returns a null value, so the suggestion generation is skipped. Digging in a little more, I found that the following lines in SearcherCache.java pass back the null value:

Hashtable<String,RemoteSearcherPool> pools = remoteCache.get(iid.toString());
if(pools == null)
	return null;


Any idea what is going on here? I feel like it has something to do with how indexes are tied to hosts, but I just can't seem to get a working configuration.

Thanks!

It would be helpful if you provided your configuration files. --Rainman 23:51, 28 August 2010 (UTC)Reply
I didn't want to clutter up the talk page, so I sent over the config info in an email. Hopefully that works for you. I have tried a number of things with the config, so this just happens to be the latest that I have. I have also sent the output from the lsearchd startup logging. I noticed in an above post about a similar issue that you mention the lsearchd startup should say something about the spelling index, and mine does not. I can tell you that some kind of spelling index gets built. I have actually used a Lucene utility to look at it, and it contains terms that are specific to the wiki I indexed. --Mehle
I got your message back, and that was exactly the problem. Suggestions, fuzzy search, and, as an added bonus, related articles all work, and they are even better than I hoped. For anyone who might be having the same problem, here is what I did wrong: I thought the * in lsearch-global.conf was a placeholder for the database name, so I went with the instructions in the 2.0 docs for the Search-Group and Index sections. The result was that lsearchd was only picking up the main index and skipping the rest. So if you are having a similar problem, do not follow the 2.0 instructions and make sure to leave the * right where it is (as sketched below). Thank you once again for the help and for creating such a great search engine. --Mehle
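In other words, keep the literal wildcard on the host lines instead of listing individual indexes (this mirrors the working examples elsewhere on this page; "myhost" is a placeholder):

[Search-Group]
myhost : *

[Index]
myhost : *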

Error running build using lucene search 2.1.3[edit]

[root@server~]# PATH=/wiki/usr/local/java/bin:$PATH;export PATH 
[root@uswv1app04a ~]# echo $PATH
/wiki/usr/local/java/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin
[root@uswv1app04a ~]# java -version
java version "1.6.0_21"
Java(TM) SE Runtime Environment (build 1.6.0_21-b06)
Java HotSpot(TM) 64-Bit Server VM (build 17.0-b16, mixed mode)

[root@server lucene]# ./configure /wiki/www/htdocs/mediawiki-1.9.6
Generating configuration files for wikidb ...
Making lsearch.conf
Making lsearch-global.conf
Making lsearch.log4j
Making config.inc
[root@server lucene]# ./build
Dumping wikidb...
MediaWiki lucene-search indexer - rebuild all indexes associated with a database.
Trying config file at path /root/.lsearch.conf
Trying config file at path /wiki/usr/local/lucene-search-2.1.3/lsearch.conf
MediaWiki lucene-search indexer - index builder from xml database dumps.

0    [main] INFO  org.wikimedia.lsearch.util.Localization  - Reading localization for En
60   [main] INFO  org.wikimedia.lsearch.ranks.Links  - Making index at /wiki/usr/local/lucene-search-2.1.3/indexes/import/wikidb.links
122  [main] INFO  org.wikimedia.lsearch.ranks.LinksBuilder  - Calculating article links...
224  [main] FATAL org.wikimedia.lsearch.importer.Importer  - Cannot store link analytics: Content is not allowed in prolog.

java.io.IOException: Trying to hardlink nonexisting file /wiki/usr/local/lucene-search-2.1.3/indexes/import/wikidb
        at org.wikimedia.lsearch.util.FSUtils.createHardLinkRecursive(FSUtils.java:97)
        at org.wikimedia.lsearch.util.FSUtils.createHardLinkRecursive(FSUtils.java:81)
        at org.wikimedia.lsearch.importer.BuildAll.copy(BuildAll.java:157)
        at org.wikimedia.lsearch.importer.BuildAll.main(BuildAll.java:112)

227  [main] ERROR org.wikimedia.lsearch.importer.BuildAll  - Error during rebuild of wikidb : Trying to hardlink nonexisting file /wiki/usr/local/lucene-search-2.1.3/indexes/import/wikidb

java.io.IOException: Trying to hardlink nonexisting file /wiki/usr/local/lucene-search-2.1.3/indexes/import/wikidb
        at org.wikimedia.lsearch.util.FSUtils.createHardLinkRecursive(FSUtils.java:97)
        at org.wikimedia.lsearch.util.FSUtils.createHardLinkRecursive(FSUtils.java:81)
        at org.wikimedia.lsearch.importer.BuildAll.copy(BuildAll.java:157)
        at org.wikimedia.lsearch.importer.BuildAll.main(BuildAll.java:112)
Finished build in 0s

Not sure what's going wrong here; /wiki/usr/local/lucene-search-2.1.3/indexes/import has R/W access for all users.

Any clue?

- Vicki

It looks like the dump process has failed. Look at the dumps/wikidb.xml and verify it doesn't contain errors. --Rainman 15:55, 8 September 2010 (UTC)Reply
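For example, a quick well-formedness check of the dump (assuming xmllint is available; the file name follows the build script shown earlier on this page):

xmllint --noout dumps/dump-wikidb.xml && echo "dump is well-formed"
# An empty or truncated dump usually means dumpBackup.php failed --
# check the database credentials and any errors printed by ./build.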
I get the same errors. My dumps/dump-wikidb.xml file is completely empty. Please help, I do not know what to do. -- Nicole
*blush* I made a typo in the username in AdminSettings.php.... -- Nicole

Search database of Word documentation[edit]

Has anyone tried to use this to search documentation linked from MediaWiki? So if a page contains links to docs A & B, will Lucene search those too? Has anyone tried that with Word documentation? It would be a great additional feature. --88.159.118.8 17:17, 30 October 2010 (UTC)Reply

Get the version number?[edit]

Is there a way to obtain the Lucene version number from PHP? Maiden taiwan 18:01, 12 November 2010 (UTC)Reply

2 DBs and 4 Wikis[edit]

Configuration:

This is the configuration of my wiki farm:

wikidb1
|-- wiki1    # <= 'wiki1' is table prefix and interwiki link
|-- wiki2    # <= 'wiki2' is table prefix and interwiki link

wikidb2
|-- wiki3    # <= 'wiki3' is table prefix and interwiki link
|-- wiki4    # <= 'wiki4' is table prefix and interwiki link

All 4 wikis use English as the default language.

What is desired:

  • when searching in wiki1 there will also be interwiki results from wiki2, wiki3 and wiki4 (when the searched term is also in these wikis)
  • when searching in wiki2 there will also be interwiki results from wiki1, wiki3 and wiki4 (when the searched term is also in these wikis)
  • ...

Questions:

  1. Is it possible to configure lucene-search to match my wishes?
  2. If ( 1. ) is possible: how do I set this up? (global configuration, ...)--JBE 09:59, 21 December 2010 (UTC)Reply
There is no support for table prefixes. Wikis need to be in separate databases. --Rainman 01:00, 6 January 2011 (UTC)Reply
Thank you!--JBE 07:30, 6 January 2011 (UTC)Reply