Extension talk:Lucene-search/archive/2010

= 2010 =

Build function doesn't submit IP address
We are using an IP-based authentication system for our wiki. Unfortunately, the Lucene search engine doesn't submit an IP address when it creates the index.

wikidb is being deployed?
Hi, I want to ask a question about Lucene-search. I received an error message when I tried to access http://localhost:8123/search/wikidb/test. It said "wikidb is being deployed or is not searched by this host". The full error message is listed below:

7754 [pool-2-thread-2] ERROR org.wikimedia.lsearch.search.SearchEngine - Internal error in SearchEngine trying to make WikiSearcher: wikidb is being deployed or is not searched by this host
java.lang.RuntimeException: wikidb is being deployed or is not searched by this host
    at org.wikimedia.lsearch.search.SearcherCache.getLocalSearcher(SearcherCache.java:369)
    at org.wikimedia.lsearch.search.WikiSearcher.<init>(WikiSearcher.java:96)
    at org.wikimedia.lsearch.search.SearchEngine.search(SearchEngine.java:686)
    at org.wikimedia.lsearch.search.SearchEngine.search(SearchEngine.java:115)
    at org.wikimedia.lsearch.frontend.SearchDaemon.processRequest(SearchDaemon.java:92)
    at org.wikimedia.lsearch.frontend.HttpHandler.handle(HttpHandler.java:193)
    at org.wikimedia.lsearch.frontend.HttpHandler.run(HttpHandler.java:114)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:636)

Thanks. --PhiLiP 10:41, 6 January 2010 (UTC)


 * I think you should concentrate on "is not searched by this host" part.. check your hostnames in lsearch-global.conf --Rainman 11:32, 6 January 2010 (UTC)


 * I checked: my hostname is "philip-ubuntu-pc", and the hostnames in lsearch-global.conf are also "philip-ubuntu-pc". The $wgDBname in my LocalSettings.php is "wikidb". My MediaWiki functions normally except for Special:Search.

// lsearch-global.conf ...

[Search-Group] philip-ubuntu-pc : *

[Index] philip-ubuntu-pc : *

...
 * --PhiLiP 11:41, 6 January 2010 (UTC)

Seems I've found my way out. Forgot to execute. How stupid I am... --PhiLiP 17:44, 6 January 2010 (UTC)
 * Well that is not your fault, the error message should have suggested it or detected it... /me makes mental note --Rainman 18:07, 7 January 2010 (UTC)

Initial error when running "./configure" ... any suggestions? I definitely need help; I have been trying to install this for years ;-).
$:/lucene-search-2.1 # ./configure /srv/www/vhosts/mySubDomain.com/subdomains/sysdoc/httpdocs/
Exception in thread "main" java.io.IOException: Cannot run program "/bin/bash": java.io.IOException: error=12, Cannot allocate memory
    at java.lang.ProcessBuilder.start(Unknown Source)
    at java.lang.Runtime.exec(Unknown Source)
    at java.lang.Runtime.exec(Unknown Source)
    at org.wikimedia.lsearch.util.Command.exec(Command.java:41)
    at org.wikimedia.lsearch.util.Configure.getVariable(Configure.java:84)
    at org.wikimedia.lsearch.util.Configure.main(Configure.java:49)
Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory
    at java.lang.UNIXProcess.<init>(Unknown Source)
    at java.lang.ProcessImpl.start(Unknown Source)
    ... 6 more

I tried it from another directory too:

Exception in thread "main" java.lang.NoClassDefFoundError: org/wikimedia/lsearch/util/Configure
Caused by: java.lang.ClassNotFoundException: org.wikimedia.lsearch.util.Configure
    at java.net.URLClassLoader$1.run(Unknown Source)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at java.lang.ClassLoader.loadClassInternal(Unknown Source)
Could not find the main class: org.wikimedia.lsearch.util.Configure. Program will exit.

What can I do???

I've got this from the repository (binary):

configure.inc:
dbname=wikilucene
wgScriptPath=/wiki/phase3
hostname=oblak
indexes=/opt/lucene-search/indexes
mediawiki=/var/www/wiki/phase3
base=/opt/lucene-search
wgServer=http://localhost

This looks like some private data. What does it mean/do?

Why is this extension so poorly maintained?


 * My first guess would be to google that IOException error you are getting and make sure that you have enough memory etc on your hosting plan. The "private" data is some defaults. Please note that adopting an insulting tone towards developers will not help answer your questions. This is an open-source project developed purely in after-hours and free time and as such is not as polished for commercial use. --Rainman 11:20, 11 January 2010 (UTC)

Sorry for that tone, but it is very, very frustrating; I have been trying this install for years on different servers. I'm not a Java guru but a web developer with 10 years of experience, and an open source developer too. I already googled for that exception and got no answer at all. Any ideas why it says it cannot run the bash program? Do I have to set a path variable or something else? What are the RAM requirements?


 * From what I gathered, this error appears either when /tmp is full, there is not enough RAM to make a process, or there is some kind of limit on number of processes one can have on shared hosting. In theory, for a smaller wiki 128MB of available RAM should be enough, but lucene-search hasn't been built or optimized to run on scarce resources, on the contrary, it is optimized to make use of large amounts of memory and multiple CPU cores to run most efficiently under heavy load. --Rainman 21:12, 11 January 2010 (UTC)

Is there a standard command to determine a process limit on a SUSE vserver? I found /etc/security/limits.conf but there is nothing in there. My tmp dir and RAM seem to be OK.
 * It is best to contact your hosting support on this one. --Rainman 10:13, 12 January 2010 (UTC)
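The three possible causes mentioned above (a full /tmp, too little free RAM, a per-user process limit) can each be checked from a shell. This is a minimal sketch, assuming a Linux host with bash, `df` and `/proc/meminfo`. The function names are made up for illustration and are not part of lucene-search. A virtual server may also enforce limits that are only visible to the hosting provider.

```shell
#!/bin/bash
# Hypothetical helpers; the names are illustrative only.

# Free space in /tmp, in kilobytes (POSIX df output, "Available" column).
tmp_free_kb() {
    df -Pk /tmp | awk 'NR==2 {print $4}'
}

# Free physical memory in kilobytes, as reported by the kernel.
mem_free_kb() {
    awk '/^MemFree:/ {print $2}' /proc/meminfo
}

# Maximum number of processes the current user may start
# (prints "unlimited" when no limit is set).
max_user_procs() {
    ulimit -u
}

echo "free in /tmp : $(tmp_free_kb) kB"
echo "free memory  : $(mem_free_kb) kB"
echo "process limit: $(max_user_procs)"
```

If free memory or the process limit turns out to be very low, that may be why the JVM cannot start /bin/bash and reports error=12.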

I am getting the same error. I have BlueHost (shared hosting). I guess I'll have to use one of the other search extensions. Tisane 06:23, 28 March 2010 (UTC)

Can't search any string/keyword with dot (".")
My Lucene-Search 2.1 is working well, except when searching for a keyword with a dot (.) in the middle.

For example, I am able to search "javadoc", and get results with any combinations with it including "javadoc.abc".

However, if I search "javadoc.abc" directly, I get nothing.

Any idea is greatly appreciated.


 * There seems to be a bug currently with parsing phrases with stuff like dots in them, because dots are handled specially in the index. Should work if you drop the quotes and bring up the most relevant result. --Rainman 12:51, 21 January 2010 (UTC)


 * Thank you Rainman for your info, but what do you mean "drop the quotes and bring up the most relevant result"? I am not using any quotes while doing search. Again, if I search for javadoc, I can get all results; if I search for javadoc.abc, I get No page text matches even though there are plenty of pages containing javadoc.abc.


 * Are you sure you're using the lucene-search backend? do the searches you make come up in lucene-search logs? --Rainman 14:43, 23 January 2010 (UTC)


 * I am sure I am using the lucene-search 2.1 backend, and am using MWSearch to fetch results. Here is the log from searching javadoc.abc:

Fetching search data from http://192.168.1.20:8123/search/mediawiki/javadocu82eabcu800?namespaces=0&offset=0&limit=20&version=2.1&iwlimit=10&searchall=0
Http::request: GET http://192.168.1.20:8123/search/mediawiki/javadocu82eabcu800?namespaces=0&offset=0&limit=20&version=2.1&iwlimit=10&searchall=0
total [0] hits
 * --Ross Xu 18:52, 5 February 2010 (UTC)

I'm seeing the same issue, a search for 5.5.3 shows results with the standard search engine, and lucene returns nothing. The same search on wikipedia (different content of course) returns results, so it feels like I'm missing something. You can see the query hit lucene in the log:

4051499 [pool-2-thread-9] INFO org.wikimedia.lsearch.frontend.HttpHandler  - query:/search/wikidb/5u800u82e5u800u82e3u800?namespaces=0&offset=0&limit=20&version=2&iwlimit=10&searchall=0 what:search dbname:wikidb term:5u800u82e5u800u82e3u800
4051501 [pool-2-thread-9] INFO org.wikimedia.lsearch.search.SearchEngine  - Using FilterWrapper wrap: {0} []
4051504 [pool-2-thread-9] INFO org.wikimedia.lsearch.search.SearchEngine  - search wikidb: query=[5u800u82e5u800u82e3u800] parsed=[custom(+(+contents:5^0.2 +contents:u800u82e5u800u82e3u800^0.2) relevance ([((P contents:"5 u800u82e5u800u82e3u800"~100) (((P sections:"5") (P sections:"u800u82e5u800u82e3u800") (P sections:"5 u800u82e5u800u82e3u800"))^0.25))^2.0], ((P alttitle:"5 u800u82e5u800u82e3u800"~20^2.5) (P alttitle:"5"^2.5) (P alttitle:"u800u82e5u800u82e3u800"^2.5)) ((P related:"5 u800u82e5u800u82e3u800"^12.0) (P related:"u800u82e5u800u82e3u800"^12.0) (P related:"5"^12.0))) (P alttitle:"5 u800u82e5u800u82e3u800"~20))] hit=[0] in 2ms using IndexSearcherMul:1264699369439

I'm using MediaWiki 1.15.1, lucene-2.1 r61642, and MWSearch r62451 (with a small patch to make it work on 1.15.1)

--Nivfreak 19:18, 16 February 2010 (UTC)


 * You need to download the appropriate MWSearch version for your MediaWiki using the "Download snapshot" link on the MWSearch page. Your patch doesn't seem to resolve all the compatibility issues. --Rainman 03:55, 17 February 2010 (UTC)


 * You are absolutely right, and I knew better. I'm not even sure why I moved to the trunk version anymore. That solved my problems. Sorry for wasting your time. Nivfreak 18:38, 17 February 2010 (UTC)

Red highlight
How can the red highlight in search results be changed to match Wikipedia's way of working? --Robinson Weijman 08:51, 22 January 2010 (UTC)


 * Set the searchmatch CSS style in your Common.css, but before that check you have the latest MWSearch, as far as I remember we haven't been using red for a while now.. --Rainman 11:40, 22 January 2010 (UTC)


 * Thank you for the prompt response! Our MWSearch is r36482 (we have MW13.2).  I see that the current MWSearch is 37906 - I'll give that a try.  --Robinson Weijman 09:27, 25 January 2010 (UTC)


 * Well, it took a while but I tried an upgrade - no change. I could not find searchmatch CSS style in Common.css.  Any ideas?  --Robinson Weijman 10:07, 11 March 2010 (UTC)


 * Finally solved - it was in the skin's main.css file:

span.searchmatch { color: blue; } --Robinson weijman 13:50, 19 January 2011 (UTC)

Getting search results
Hi - how can I see what people are searching for? And how can I work out how good the searching is e.g. % hits (page matches) / searches? --Robinson Weijman 14:56, 25 January 2010 (UTC)

How to read lsearchd results

 * Alright, I've figured out how to do that (put lsearchd results in a file) and what the problem was (too many search daemons running simultaneously!). But now I'm confronted with a new problem - how to read those results.  Is it documented anywhere?  --Robinson Weijman 15:58, 27 January 2010 (UTC)

Case insensitive?
Hi - how can lucene search be made case insensitive? --Robinson Weijman 10:07, 4 February 2010 (UTC)
 * My mistake, it is case insensitive. What I meant to ask was:

Wildcards
Can wildcards be added by default to a search? --Robinson Weijman 10:41, 4 February 2010 (UTC)


 * Please test first, then ask.. wildcards do work, although they are limited to cases which won't kill the servers (e.g. you cannot do *a*). --Rainman 13:21, 4 February 2010 (UTC)


 * Your statement implies that I did not test first. Of course I did.  Perhaps I was unclear - what I meant to say was can the DEFAULT be when, e.g. if I search for "Exeter" using "Exe" that the default search is "*Exe*".  So I'm sorry I was unclear.  Please don't make assumptions about your customers - I don't appreciate it.  --Robinson Weijman 08:27, 5 February 2010 (UTC)


 * You are not my customer, I'm just a random person like you. In any case, yes, this does not work, and will not work; using this kind of wildcard makes search unacceptably slow for all but sites with very few pages (and lucene-search is designed for big sites). If this doesn't suit your needs you can either hack it yourself or pay someone to do it. --Rainman 01:23, 8 February 2010 (UTC)


 * Thanks for the info.--Robinson Weijman 08:24, 8 February 2010 (UTC)

Fatal error with searching anything with colon
I am using Lucene-Search 2.1 and MWsearch.

Whenever I search for any keyword with colon (e.g. searching for "ha:"), I get the Fatal error: Fatal error: Call to undefined method Language::getNamespaceAliases in /var/www/html/wiki/extensions/MWSearch/MWSearch_body.php on line 96

It's the same thing with searching anything like "all:something" and "main:something".

Any idea is appreciated. --Ross Xu 20:10, 10 February 2010 (UTC)

Using 2.1 and "Did You Mean" not appearing
Hi all - as per the title. When I'm running a search, the "Did you mean" functionality does not appear to be working/showing. Do I need to do anything special configuration wise to get this to work or should this just work?

Any idea is appreciated. --Barramya 09:06, 17 February 2010 (UTC)
 * I am having the same issue right now and I can't fix it ... Did you double-check that the line "$wgLuceneSearchVersion = 2.1;" in LocalSettings.php is NOT uncommented? I am looking forward to a solution. Thank you. Roemer2201 20:16, 19 April 2010 (UTC)


 * This line should be uncommented (as the instructions say), and no other special settings are needed. Verify that:
 * you have matching MediaWiki and MWSearch versions
 * the searches actually reach the lucene-search daemon - you should be able to see them in the console log you get when you start ./lsearchd
 * the lucene-search daemon has started without an error, especially the .spell index (also in the console log)
 * --Rainman 15:22, 20 April 2010 (UTC)

lsearchd doesn't start properly
After a server restart, my lsearchd (version 2.1) stopped working:

# ./lsearchd
Trying config file at path /root/.lsearch.conf
Trying config file at path /vol/sites/jewage.org/search/ls21/lsearch.conf
log4j: Parsing for [root] with value=[INFO, A1].
log4j: Level token is [INFO].
log4j: Category root set to INFO
log4j: Parsing appender named "A1".
log4j: Parsing layout options for "A1".
log4j: Setting property [conversionPattern] to [%-4r [%t] %-5p %c %x - %m%n].
log4j: End of parsing for "A1".
log4j: Parsed "A1" options.
log4j: Finished configuring.
0   [main] INFO  org.wikimedia.lsearch.interoperability.RMIServer  - RMIMessenger bound

after this it just hangs up

why would this happen?

java looks normal:
# java -version
java version "1.6.0_17"
Java(TM) SE Runtime Environment (build 1.6.0_17-b04)
Java HotSpot(TM) Client VM (build 14.3-b01, mixed mode, sharing)

Here is my config:

# cat lsearch-global.conf
# Global search cluster layout configuration
[Database]
jewage : (single) (spell,4,2)
[Search-Group]
jewage.org : *
[Index]
jewage.org : *
[Index-Path]
: /search
[OAI]
: http://localhost/w/index.php
[Namespace-Boost]
: (0,2) (1,0.5) (110,5)
[Namespace-Prefix]
all :
# (language,en)

tried jewage.org : (single) (spell,4,2) without effect --Eugenem 20:02, 17 February 2010 (UTC)

OK, I needed to change all host addresses (config.inc, lsearch-global.conf) and rebuild the index. --Eugenem 16:22, 18 February 2010 (UTC)

Searching plurals returns result
For example, searching for vpn returns the correct dataset with a small description, but searching for vpns returns the same dataset with only the title. Is this correct? I thought the search engine does a complete match. Is there a way to correct this?

Is there a way to search the raw wikitext?
Is there an option that will search the raw wikitext? For example, suppose I want to find all pages that use the cite extension tag. Is there a way to specify "return a list of pages using the tag ". I tried searching for on wikipedia, but all I got were references to "ref". Dnessett 18:35, 25 February 2010 (UTC)
 * nope. --Rainman 23:38, 25 February 2010 (UTC)
 * ...but Extension:Replace_Text will do this. Andthepharaohs 07:20, 27 April 2010 (UTC)

External Searches
Is there a way to include external (to the wiki) databases in the search results? Ideally I'd like to see:
 * 1) a list of results within the wiki and then
 * 2) underneath, one or more results for each external database.

By "external database", I mean something like a content management system containing office / PDF documents. --Robinson Weijman 11:28, 4 March 2010 (UTC)
 * No, unless you write a plugin for MediaWiki to query the external database and then show it on the search page. --Rainman 15:42, 4 March 2010 (UTC)


 * Oops, I missed this reply. Thanks.  --Robinson Weijman 10:30, 19 March 2010 (UTC)

Meta Tags
Can Lucene work with meta tags, e.g. Extension:MetaKeywordsTag. That is, pages with those tags appear higher in searches for those tags. --Robinson Weijman 12:09, 17 March 2010 (UTC)


 * No. --Rainman 13:42, 17 March 2010 (UTC)


 * Is it a good idea to add it? --Robinson Weijman 08:13, 18 March 2010 (UTC)

Ranking Suggestion
Are there any plans to bring out a new Lucene version? I'd like to see functionality to dynamically change the search results based on previous hits and clicks (like Google), or to let users report "this was a useful / useless link", e.g. by clicking on an up or down arrow. --Robinson Weijman 12:11, 17 March 2010 (UTC)


 * This is very unlikely. --Rainman 13:42, 17 March 2010 (UTC)


 * Why is it unlikely? Is nobody continuing to develop this extension?  Wouldn't this be an improvement?  --Robinson Weijman 08:15, 18 March 2010 (UTC)


 * I am doing some maintenance in my free time, but there is no one working full time, and no major changes are currently planned. Of course, there are a million things that would be good to have, but as I said, they are probably not happening unless someone else does them. --Rainman 17:32, 18 March 2010 (UTC)


 * OK thanks for the info. Let's hope someone steps forward then.  --Robinson Weijman 10:28, 19 March 2010 (UTC)

Searches with commas
We're trying to use Mediawiki and Lucene search but searches for phrases that have commas don't seem to work. Wikipedia demonstrates the problem too: Tom Crean (explorer) contains the sentence “In 1901, while serving on HMS Ringarooma in New Zealand, he volunteered to join Scott's 1901–04 British National Antarctic Expedition on Discovery, thus beginning his exploring career.” Searching for fragments of this sentence that don't involve commas—for example,  or  —turns up the page easily. But if you search for a fragment with a comma—for example,  or  —there are no matches. Taking the comma out of the search query doesn't help: there are still no matches.

Is this a bug? If it is intended, is there a way to disable it so that comma phrases can be found? Our intended use case, the OEIS, is all about comma-separated search strings. Thanks. --Russ Cox 05:50, 18 March 2010 (UTC)


 * yes, unfortunately it is a (known) bug. --Rainman 17:18, 18 March 2010 (UTC)


 * If I wanted to build a customized version without the bug, where should I be looking for it? I'd be happy to try to track it down, fix it, and send the change back, but I don't know where to start.  Thanks again.  --Russ Cox 18:30, 18 March 2010 (UTC)


 * The bug is a byproduct of an undertested "feature" and is actually very easy to fix, but the fix needs a complete index rebuild. You need to go to FastWikiTokenizerEngine.java and change MINOR_GAP = 2; to MINOR_GAP = 1; --Rainman 21:12, 18 March 2010 (UTC)

New version? 2.1.2 -> 2.1.3
MarkAHershberger has updated the version number. Is there a new release then? --Robinson Weijman 08:10, 22 March 2010 (UTC)

Exception in thread "main" java.lang.NoClassDefFoundError: org/wikimedia/lsearch/util/Configure
Caused by: java.lang.ClassNotFoundException: org.wikimedia.lsearch.util.Configure
    at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:319)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:264)
    at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:332)
Could not find the main class: org.wikimedia.lsearch.util.Configure. Program will exit.

I did release a new version, but if you've found problems with it, please file a bug at bugzilla and assign it to me. --MarkAHershberger 23:31, 30 March 2010 (UTC)


 * Thanks. So can you provide a brief summary of the changes or a link to the release notes? --Robinson Weijman 06:47, 6 April 2010 (UTC)

Sort results by last modified date?
Is it possible to make Lucene sort its results by the last-modification date of the article? Even better, can this be exposed as an option for the user? Maiden taiwan 19:48, 7 April 2010 (UTC)

How to run as CronJob
Hello there, I am new here and I have problems running build and lsearchd as cron jobs. I have Ubuntu 10.4 and MediaWiki (1.15.3) with SMW (1.4.3) and SMW+/SMWhalo (1.4.6) on a virtual system. There were some problems running configure and build, but after using "sudo ..." everything was OK. I also installed the mentioned init script for Ubuntu (3.14 LSearch Daemon Init Script for Ubuntu) and it runs well.

My problem is getting everything to run as a cron job. I got some help by googling for "crontab -e" and "sudo crontab -e" (for admin) or "gnome-schedule" as a GUI. I tried adding the path to the search engine to PATH in the crontab. This, or "export PATH=$PATH:(your path to the program folder)" in SHELL, helped to start "lsearchd" at @reboot. When I tried to start "build" to reindex my wiki, there were problems finding the file "config.inc". So I added "$(dirname $0)" in front of "config.inc" and "LuceneSearch.jar". This solves the problem of running the build script from another directory:

#!/bin/bash
source $(dirname $0)/config.inc
if [ -n "$1" ]; then
  dumpfile="$1"
else
  dumps="$base/dumps"
  [ -e $dumps ] || mkdir $dumps
  dumpfile="$dumps/dump-$dbname.xml"
  timestamp=`date -u +%Y-%m-%d`
  slave=`php $mediawiki/maintenance/getSlaveServer.php \
    $dbname \
    --conf $mediawiki/LocalSettings.php \
    --aconf $mediawiki/AdminSettings.php`
  echo "Dumping $dbname..."
  cd $mediawiki && php maintenance/dumpBackup.php \
    $dbname \
    --conf $mediawiki/LocalSettings.php \
    --aconf $mediawiki/AdminSettings.php \
    --current \
    --server=$slave > $dumpfile
  [ -e $indexes/status ] || mkdir -p $indexes/status
  echo "timestamp=$timestamp" > $indexes/status/$dbname
fi
cd $base && java -cp $(dirname $0)/LuceneSearch.jar org.wikimedia.lsearch.importer.BuildAll $dumpfile $dbname

Now I will try to start these scripts from the "system-wide crontab" (/etc/crontab), because "build" requires admin rights. In the "system-wide crontab" it is possible to run a script as root. I hope that's it!

Greetz Benor 16:55, 5 May 2010 (UTC)
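One way to sidestep the working-directory problem described above is to have cron call a small wrapper that first changes into the script's own directory, so that config.inc and LuceneSearch.jar are found without editing build. This is only a sketch. `run_in_script_dir` is a hypothetical helper, and the install path in the usage comment is an assumption.

```shell
#!/bin/bash
# Hypothetical wrapper: run a script from the directory it lives in,
# passing any remaining arguments through unchanged.
run_in_script_dir() {
    local script=$1
    shift
    # The subshell keeps the caller's working directory untouched.
    ( cd "$(dirname "$script")" && "./$(basename "$script")" "$@" )
}

# Example (the path is an assumption):
#   run_in_script_dir /opt/lucene-search/build
```

A crontab entry can then invoke the wrapper instead of calling the script directly.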

Problem with build script (and solution)
I had issues running the build script. The environment set up in config.inc was completely wrong. It turns out the configure script runs maintenance/eval.php, which dumps the contents of AdminSettings.php. This interfered with the configure script. The solution is to put the contents of AdminSettings.php into LocalSettings.php and move or delete AdminSettings.php. Then re-run configure, and build should run fine.

My versions: Ubuntu 10.04 x64 MediaWiki 1.16wmf4 (r66614) PHP 5.3.2-1ubuntu4.1 (apache2handler) MySQL 5.1.41-3ubuntu12 --Spt5007 23:06, 19 May 2010 (UTC)

What does "spell" directive do in lsearch-global.conf?
What is the meaning of the "spell" directive, e.g.,

wikidb : (single) (spell,4,2) (language,en)

Thank you. Maiden taiwan 18:09, 8 June 2010 (UTC)


 * It means that a spell-check index is going to be built using words occurring in at least 2 articles, and word combinations occurring in at least 4 articles. --Rainman 20:17, 8 June 2010 (UTC)

java.io.IOException: The markup in the document following the root element must be well-formed
When running the lucene  script I get the error:

java.io.IOException: The markup in the document following the root element must be well-formed

Things were working fine until I imported a bunch of new articles into the "en" wiki (below) using, then this error happened on the next reindex. How do you debug a problem like this?

Full output:

Trying config file at path /root/.lsearch.conf
Trying config file at path /usr/local/lucene-search-2.1/lsearch.conf
0   [main] INFO  org.wikimedia.lsearch.util.Localization  - Reading localization for De
98   [main] INFO  org.wikimedia.lsearch.util.Localization  - Reading localization for En
150  [main] INFO  org.wikimedia.lsearch.util.Localization  - Reading localization for Es
188  [main] INFO  org.wikimedia.lsearch.util.Localization  - Reading localization for Fr
234  [main] INFO  org.wikimedia.lsearch.util.Localization  - Reading localization for Nl
275  [main] INFO  org.wikimedia.lsearch.oai.OAIHarvester  - de_wikidb using base url: http://de.mywiki.com/w/index.php?title=Special:OAIRepository
275 [main] INFO  org.wikimedia.lsearch.oai.OAIHarvester  - de_wikidb using base url: http://de.mywiki.com/w/index.php?title=Special:OAIRepository
275 [main] INFO  org.wikimedia.lsearch.oai.IncrementalUpdater  - Resuming update of de_wikidb from 2010-06-09T20:00:02Z
644 [main] INFO  org.wikimedia.lsearch.oai.OAIHarvester  - en_wikidb using base url: http://en.mywiki.com/w/index.php?title=Special:OAIRepository
644 [main] INFO  org.wikimedia.lsearch.oai.OAIHarvester  - en_wikidb using base url: http://en.mywiki.com/w/index.php?title=Special:OAIRepository
644 [main] INFO  org.wikimedia.lsearch.oai.IncrementalUpdater  - Resuming update of en_wikidb from 2010-06-14T14:15:02Z
java.io.IOException: The markup in the document following the root element must be well-formed.
    at org.wikimedia.lsearch.oai.OAIParser.parse(OAIParser.java:68)
    at org.wikimedia.lsearch.oai.OAIHarvester.read(OAIHarvester.java:64)
    at org.wikimedia.lsearch.oai.OAIHarvester.getRecords(OAIHarvester.java:44)
    at org.wikimedia.lsearch.oai.IncrementalUpdater.main(IncrementalUpdater.java:191)
919 [main] WARN  org.wikimedia.lsearch.oai.IncrementalUpdater  - Retry later: error while processing update for en_wikidb : The markup in the document following the root element must be well-formed.
java.io.IOException: The markup in the document following the root element must be well-formed.
    at org.wikimedia.lsearch.oai.OAIParser.parse(OAIParser.java:68)
    at org.wikimedia.lsearch.oai.OAIHarvester.read(OAIHarvester.java:64)
    at org.wikimedia.lsearch.oai.OAIHarvester.getRecords(OAIHarvester.java:44)
    at org.wikimedia.lsearch.oai.IncrementalUpdater.main(IncrementalUpdater.java:191)
920 [main] INFO  org.wikimedia.lsearch.oai.OAIHarvester  - es_wikidb using base url: http://es.mywiki.com/w/index.php?title=Special:OAIRepository
920 [main] INFO  org.wikimedia.lsearch.oai.OAIHarvester  - es_wikidb using base url: http://es.mywiki.com/w/index.php?title=Special:OAIRepository
920 [main] INFO  org.wikimedia.lsearch.oai.IncrementalUpdater  - Resuming update of es_wikidb from 2010-06-08T17:09:38Z
1692 [main] INFO org.wikimedia.lsearch.oai.OAIHarvester  - fr_wikidb using base url: http://fr.mywiki.com/w/index.php?title=Special:OAIRepository
1692 [main] INFO org.wikimedia.lsearch.oai.OAIHarvester  - fr_wikidb using base url: http://fr.mywiki.com/w/index.php?title=Special:OAIRepository
1692 [main] INFO org.wikimedia.lsearch.oai.IncrementalUpdater  - Resuming update of fr_wikidb from 2010-06-10T16:45:04Z
2556 [main] INFO org.wikimedia.lsearch.oai.OAIHarvester  - nl_wikidb using base url: http://nl.mywiki.com/w/index.php?title=Special:OAIRepository
2556 [main] INFO org.wikimedia.lsearch.oai.OAIHarvester  - nl_wikidb using base url: http://nl.mywiki.com/w/index.php?title=Special:OAIRepository
2556 [main] INFO org.wikimedia.lsearch.oai.IncrementalUpdater  - Resuming update of nl_wikidb from 2010-06-08T21:30:11Z

Here is lsearch-global.conf:

[Database]
en_wikidb : (single) (spell,4,2) (language,en)
de_wikidb : (single) (spell,4,2) (language,de)
es_wikidb : (single) (spell,4,2) (language,es)
fr_wikidb : (single) (spell,4,2) (language,fr)
nl_wikidb : (single) (spell,4,2) (language,nl)

[Search-Group]
myhost : *

[Index]
myhost : *

[Index-Path]
: /search

[OAI]
: http://myhost/w/index.php
en_wikidb : http://en.mywiki.com/w/index.php
de_wikidb : http://de.mywiki.com/w/index.php
es_wikidb : http://es.mywiki.com/w/index.php
fr_wikidb : http://fr.mywiki.com/w/index.php
nl_wikidb : http://nl.mywiki.com/w/index.php

[Namespace-Boost]
: (0,2) (1,0.5)

[Namespace-Prefix]
all :
[0] : 0
[1] : 1
[2] : 2
[3] : 3
[4] : 4
[5] : 5
[6] : 6
[7] : 7
[8] : 8
[9] : 9
[10] : 10
[11] : 11
[12] : 12
[13] : 13
[14] : 14
[15] : 15

Thanks for any help! Maiden taiwan 18:49, 14 June 2010 (UTC)


 * Update: I noticed that Special:Statistics was showing too few pages, so I ran  and the crash went away. The search results page is still not reporting all matches though, only some. Maiden taiwan 16:41, 15 June 2010 (UTC)


 * Run ./build on your wiki to make a fresh copy of the index, and then start incremental updates from that point. --Rainman 00:54, 16 June 2010 (UTC)


 * Thanks - I did that, and managed to run "build" for each of my five wikis above. But now the update script is throwing the above-mentioned Java error again. (java.io.IOException: The markup in the document following the root element must be well-formed.) Any ideas how I can debug this? Maiden taiwan 04:11, 16 June 2010 (UTC)


 * My first guess would be that OAI extension is somehow mis-configured. Update to latest version (SVN) of lucene-search and the exact URL used should appear in the log. Put that into browser to see what kind of output is returned by OAIRepository. Also check your php error logs. --Rainman 17:47, 16 June 2010 (UTC)
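The check suggested above (fetch what OAIRepository returns and see whether it is well-formed XML) can also be done from a shell instead of a browser. This is a rough sketch, assuming `curl` and `xmllint` are installed. The `verb=Identify` request is a standard OAI-PMH verb, not necessarily the exact request lucene-search sends. The function names are hypothetical, and a repository that requires authentication would need extra curl options.

```shell
#!/bin/bash
# Hypothetical helpers for inspecting the OAI repository output.

# Build a simple OAI-PMH request URL from the wiki's index.php URL.
oai_url() {
    printf '%s?title=Special:OAIRepository&verb=Identify\n' "$1"
}

# Fetch the response, show its beginning, and test well-formedness.
# xmllint exits non-zero if the document is not well-formed XML,
# e.g. when PHP warnings are printed before or after the root element.
check_oai() {
    local out
    out=$(mktemp) || return 1
    curl -sS "$(oai_url "$1")" -o "$out" || return 1
    head -c 300 "$out"
    echo
    xmllint --noout "$out"
}

# Example (host name taken from the configuration quoted above):
#   check_oai http://en.mywiki.com/w/index.php
```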

Lucene-search on OpenVMS
I've been working to port this extension to OpenVMS and have run across a few snags, only to realize it's a Java configuration issue.

The following logicals need to be declared:

$ set proc/parse=extended
$ @SYS$COMMON:[JAVA$150.COM]JAVA$150_SETUP.COM
$ define DECC$ARGV_PARSE_STYLE ENABLE
$ define DECC$EFS_CASE_PRESERVE ENABLE
$ define DECC$POSIX_SEEK_STREAM_FILE ENABLE
$ define DECC$EFS_CHARSET ENABLE
$ define DECC$ENABLE_GETENV_CACHE ENABLE
$ define DECC$FILE_PERMISSION_UNIX ENABLE
$ define DECC$FIXED_LENGTH_SEEK_TO_EOF ENABLE
$ define DECC$RENAME_NO_INHERIT ENABLE
$ define DECC$ENABLE_TO_VMS_LOGNAME_CACHE ENABLE
$ FILE_MASK = %x00000008 + %x00040000
$ DEFINE JAVA$FILENAME_CONTROLS 'file_mask'

Also, the configure and build files need to be converted into COM files

After that, FSUtils.java needs to be edited to recognize OpenVMS and set up so that for any kind of linking(hard or soft) it calls a C program to convert the filepaths to VMS compliant ones and then copies the file.

I'm still having a few issues with .links files but am working on a fix. Note: the C program needs to remove all instances of "^." from the filename after it is converted from a POSIX to a VMS filename.

Indexing works! --Sillas33 18:56, 1 July 2010 (UTC)

How does lsearchd work?
I'm curious how the lsearchd script works. I am trying to port it to OpenVMS and am currently getting a ClassDefNotFound error, mostly due to me not quite understanding what the script is passing where.

#!/bin/bash
jardir=`dirname $0` # put your jar dir here!
java -Djava.rmi.server.codebase=file://$jardir/LuceneSearch.jar -Djava.rmi.server.hostname=$HOSTNAME -jar $jardir/LuceneSearch.jar $*

Specifically: why is $0 part of jardir=`dirname $0`? Figured out that this sets jardir to the directory containing lsearchd.
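The `dirname $0` idiom can be illustrated in isolation. In a script, `$0` holds the path the script was invoked with, and `dirname` strips the last path component. In this small sketch, `script_dir` is a hypothetical helper that takes the path as an argument so the behaviour can be tried without a real script, and the install path is only an example.

```shell
#!/bin/bash
# Hypothetical helper: the directory part of a script path, which is
# what `dirname $0` yields inside the script itself.
script_dir() {
    dirname "$1"
}

script_dir /opt/lucene-search/lsearchd   # prints /opt/lucene-search
script_dir ./lsearchd                    # prints .
```

When the script is started with a relative path, the result is relative too, so it is only valid while the working directory stays the same.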

Is it possible to set this up so that it would run from an exploded jar? --Sillas33 15:34, 2 July 2010 (UTC)

It is possible, but you would need to manually include in the classpath all of the libraries (look at build.xml for the list) which are automatically included in the jar. --Rainman 01:59, 3 July 2010 (UTC)

Ended up running it from an exploded jar like so:

$ set def root:[000000]
$ java -cp "''class_path'" -
    "-D java.rmi.server.codebase=file:root/LUCENESEARCH.JAR" -
    "-D java.rmi.server.hostname=hostname" -
    "org.wikimedia.lsearch.config.StartupManager" "/root/"

I left a copy of the original jar in the directory along with the exploded version (I couldn't seem to update the jar with the ONE file I changed to make this work on VMS).
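
As a rough illustration of the advice about listing every library on the classpath by hand, the sketch below assembles a classpath string from an exploded classes directory plus every .jar in a library directory. The class name and the directory layout are assumptions made for the example; the authoritative list of required libraries is still the one in build.xml.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: prints a -cp value for running from an exploded jar.
final class ExplodedClasspathSketch {

    // Returns classesDir followed by every *.jar in libDir, joined with the
    // platform's path separator. Jars are sorted so the result is stable.
    static String buildClasspath(File classesDir, File libDir) {
        List<String> entries = new ArrayList<>();
        entries.add(classesDir.getPath());
        File[] jars = libDir.listFiles((dir, name) -> name.toLowerCase().endsWith(".jar"));
        if (jars != null) { // listFiles returns null when libDir is not a directory
            Arrays.sort(jars);
            for (File jar : jars) {
                entries.add(jar.getPath());
            }
        }
        return String.join(File.pathSeparator, entries);
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("usage: ExplodedClasspathSketch <classes-dir> <lib-dir>");
            return;
        }
        System.out.println(buildClasspath(new File(args[0]), new File(args[1])));
    }
}
```

The printed value can then be passed to java -cp together with the org.wikimedia.lsearch.config.StartupManager main class used in the command above.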

Skip a namespace
Is it possible to tell Lucene not to index a given namespace? Or can we tell MWSearch not to search a given namespace? Thanks. Maiden taiwan 15:43, 22 July 2010 (UTC)

Lucene and LiquidThreads
On my wiki I use LQT 2.0alpha. The Lucene index build stops on a page that is an LQT comment. How can I fix this problem? Is anyone else having problems with Lucene and LQT?

My wiki info: here


 * Make sure you use the latest lucene-search version (from SVN). LQT search was designed for this extension and should work. --Rainman 23:52, 31 July 2010 (UTC)
 * I built the .jar from source and have the same problem again: indexing stops on LQT pages :( I don't know where the problem is or why Lucene doesn't work on my wiki. How can I fix this, and who can help me?

Lucene Suggest and Fuzzy Search Problems
I am having some trouble getting the Lucene daemon to give spelling suggestions and to work with fuzzy searches. I have tried on different distros, versions, JDKs, and data sets, but nothing seems to work. The spelling indexes get built when I run the indexer, so that part seems ok. However, I notice that only wiki and wiki.links appear under indexes/search after indexing (wiki.spell appears under snapshots though). I also played around with the Java source to see if I can track it down. As far as I can see, this is the code where the problem happens:

SearchEngine.java

// find host
String host = cache.getRandomHost(iid.getSpell());
if(host == null)
    return; // no available

cache.getRandomHost returns a null value, so the suggestion generation is skipped. Digging in a little more, I found that the following lines in SearcherCache.java pass back the null value:

Hashtable pools = remoteCache.get(iid.toString());
if(pools == null)
    return null;

Any idea what is going on here? I feel like it has something to do with how indexes are tied to hosts, but I just can't seem to get a working configuration.

Thanks!


 * It would be helpful if you provided your configuration files. --Rainman 23:51, 28 August 2010 (UTC)


 * I didn't want to clutter up the talk page, so I sent over the config info in an email. Hopefully that works for you.  I have tried a number of things with the config, so this just happens to be the latest that I have.  I have also sent the output from the lsearchd startup logging.  I noticed from an above post about a similar issue, you mention that the lsearchd startup should say something about the spelling index, and mine does not.  I can tell you that some kind of spelling index gets built.  I have actually used a Lucene utility to look at it, and it contains terms that are specific to the wiki I indexed. --Mehle


 * I got your message back, and that was exactly the problem. Suggestions, fuzzy search, and for an added bonus, related articles all work, and they are even better than I hoped.  For anyone who might be having the same problem, here is what I did wrong.  I thought the * in lsearch-global.conf was a placeholder for the database name, so I went with the instructions in the 2.0 docs for doing the Search-Group and Index sections.  The result was that lsearchd was only picking up the main index and skipping the rest.  So if you are having a similar problem, do not follow the 2.0 instructions and make sure to leave the * right where it is.  Thank you once again for the help and for creating such a great search engine. --Mehle
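
For readers tracing the same symptom, the null return described in this thread can be modelled roughly as below. This is a simplified stand-in, not the real SearcherCache: the class, the register method, and the index and host names are all hypothetical. It only shows that a lookup keyed by index name returns null when nothing was ever registered for the spell index, which is the effect a misconfigured lsearch-global.conf had here.

```java
import java.util.ArrayList;
import java.util.Hashtable;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Simplified model of a host lookup keyed by index name (not the real SearcherCache).
final class RemotePoolLookupSketch {
    private final Map<String, List<String>> hostsByIndex = new Hashtable<>();
    private final Random random = new Random();

    // Records that 'host' serves 'indexName'; in this model the configuration drives it.
    void register(String indexName, String host) {
        hostsByIndex.computeIfAbsent(indexName, k -> new ArrayList<>()).add(host);
    }

    // Returns a random host serving the index, or null if none was registered.
    String getRandomHost(String indexName) {
        List<String> hosts = hostsByIndex.get(indexName);
        if (hosts == null || hosts.isEmpty()) {
            return null; // the caller then silently skips suggestions
        }
        return hosts.get(random.nextInt(hosts.size()));
    }

    public static void main(String[] args) {
        RemotePoolLookupSketch cache = new RemotePoolLookupSketch();
        cache.register("wiki", "localhost"); // only the main index was picked up
        System.out.println("main index host:  " + cache.getRandomHost("wiki"));
        System.out.println("spell index host: " + cache.getRandomHost("wiki.spell"));
    }
}
```

Because the caller treats null as "no host available" and returns quietly, the missing spell index shows up only as absent suggestions rather than as an error in the log.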

Error running build using lucene search 2.1.3
[root@server~]# PATH=/wiki/usr/local/java/bin:$PATH;export PATH
[root@uswv1app04a ~]# echo $PATH
/wiki/usr/local/java/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin
[root@uswv1app04a ~]# java -version
java version "1.6.0_21"
Java(TM) SE Runtime Environment (build 1.6.0_21-b06)
Java HotSpot(TM) 64-Bit Server VM (build 17.0-b16, mixed mode)

[root@server lucene]# ./configure /wiki/www/htdocs/mediawiki-1.9.6
Generating configuration files for wikidb ...
Making lsearch.conf
Making lsearch-global.conf
Making lsearch.log4j
Making config.inc
[root@server lucene]# ./build
Dumping wikidb...
MediaWiki lucene-search indexer - rebuild all indexes associated with a database.
Trying config file at path /root/.lsearch.conf
Trying config file at path /wiki/usr/local/lucene-search-2.1.3/lsearch.conf
MediaWiki lucene-search indexer - index builder from xml database dumps.

0   [main] INFO  org.wikimedia.lsearch.util.Localization  - Reading localization for En
60  [main] INFO  org.wikimedia.lsearch.ranks.Links  - Making index at /wiki/usr/local/lucene-search-2.1.3/indexes/import/wikidb.links
122 [main] INFO  org.wikimedia.lsearch.ranks.LinksBuilder  - Calculating article links...
224 [main] FATAL org.wikimedia.lsearch.importer.Importer  - Cannot store link analytics: Content is not allowed in prolog.

java.io.IOException: Trying to hardlink nonexisting file /wiki/usr/local/lucene-search-2.1.3/indexes/import/wikidb
    at org.wikimedia.lsearch.util.FSUtils.createHardLinkRecursive(FSUtils.java:97)
    at org.wikimedia.lsearch.util.FSUtils.createHardLinkRecursive(FSUtils.java:81)
    at org.wikimedia.lsearch.importer.BuildAll.copy(BuildAll.java:157)
    at org.wikimedia.lsearch.importer.BuildAll.main(BuildAll.java:112)

227 [main] ERROR org.wikimedia.lsearch.importer.BuildAll  - Error during rebuild of wikidb : Trying to hardlink nonexisting file /wiki/usr/local/lucene-search-2.1.3/indexes/import/wikidb

java.io.IOException: Trying to hardlink nonexisting file /wiki/usr/local/lucene-search-2.1.3/indexes/import/wikidb
    at org.wikimedia.lsearch.util.FSUtils.createHardLinkRecursive(FSUtils.java:97)
    at org.wikimedia.lsearch.util.FSUtils.createHardLinkRecursive(FSUtils.java:81)
    at org.wikimedia.lsearch.importer.BuildAll.copy(BuildAll.java:157)
    at org.wikimedia.lsearch.importer.BuildAll.main(BuildAll.java:112)
Finished build in 0s

Not sure what's going wrong here; the /wiki/usr/local/lucene-search-2.1.3/indexes/import directory has read/write access for all users.

Any clue?

- Vicki


 * It looks like the dump process has failed. Look at the dumps/wikidb.xml and verify it doesn't contain errors. --Rainman 15:55, 8 September 2010 (UTC)
 * I get the same errors. My dumps/dump-wikidb.xml file is completely empty. Please help, I do not know what to do. -- Nicole
 * *blush* I made a typo in the username in AdminSettings.php.... -- Nicole
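
A quick way to run the check suggested in this thread is sketched below. It is a stand-alone helper written for this page, not part of lucene-search; the class name is hypothetical and the dump path is whatever file the build wrote under dumps/. It reports whether the file is missing, empty, or not well-formed XML. "Content is not allowed in prolog" is what the JDK's XML parser typically reports when a file starts with something other than XML, for example an error message printed by the dump script.

```java
import java.io.File;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-alone check for an XML dump file.
final class DumpCheckSketch {

    // Returns "missing", "empty", "ok", or "invalid: <parser message>".
    static String check(File dump) {
        if (!dump.isFile()) {
            return "missing";
        }
        if (dump.length() == 0) {
            return "empty";
        }
        try {
            // Parse the whole file and discard the events; we only care
            // whether the parser gets to the end without an error.
            SAXParserFactory.newInstance().newSAXParser().parse(dump, new DefaultHandler());
            return "ok";
        } catch (Exception e) {
            return "invalid: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: DumpCheckSketch <path-to-dump.xml>");
            return;
        }
        System.out.println(check(new File(args[0])));
    }
}
```

An "empty" or "invalid" result points at the dump step (database credentials, PHP errors) rather than at the indexer itself.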

Search database of Word documentation
Has anyone tried to use this to search documentation linked from MediaWiki? So if a page contains links to docs A & B, Lucene will search that? Anyone tried that with Word documentation? It would be a great additional feature. --88.159.118.8 17:17, 30 October 2010 (UTC)

Get the version number?
Is there a way to obtain the Lucene version number from PHP? Maiden taiwan 18:01, 12 November 2010 (UTC)

2 DBs and 4 Wikis
Configuration:

This is the configuration of my wiki farm:

wikidb1
 * -- wiki1   # <= 'wiki1' is table prefix and interwiki link
 * -- wiki2   # <= 'wiki2' is table prefix and interwiki link

wikidb2
 * -- wiki3   # <= 'wiki3' is table prefix and interwiki link
 * -- wiki4   # <= 'wiki4' is table prefix and interwiki link

All 4 wikis use English as the default language.

What is desired:
 * when searching in wiki1 there will also be interwiki results from wiki2, wiki3 and wiki4 (when the searched term is also in these wikis)
 * when searching in wiki2 there will also be interwiki results from wiki1, wiki3 and wiki4 (when the searched term is also in these wikis)

Questions:
 * 1) Is it possible to configure lucene-search to do this?
 * 2) If (1) is possible, how do I set this up? (global configuration, ...)--JBE 09:59, 21 December 2010 (UTC)


 * There is no support for table prefixes. Wikis need to be in separate databases. --Rainman 01:00, 6 January 2011 (UTC)
 * Thank you!--JBE 07:30, 6 January 2011 (UTC)