Extension talk:Lucene-search/archive/2009

=2009=

./configure for v. 2.1 does not seem to work
Running Ubuntu 8.04, Ant 1.7, Java 1.6.0_07, using the Binary install package: user@host: ./configure /path/to/mw/install

"0 [main] WARN org.wikimedia.lsearch.util.Command - Got exit value 1 while executing [/bin/bash, -c, cd /path/to/mw/install && (echo "return \$wgDBname" | php maintenance/eval.php)] Exception in thread "main" java.io.IOException: Error executing command: 	at org.wikimedia.lsearch.util.Command.exec(Command.java:45)	at org.wikimedia.lsearch.util.Configure.getVariable(Configure.java:77)	at org.wikimedia.lsearch.util.Configure.main(Configure.java:42)

user@host: sudo ./configure /path/to/mw/install 0 [main] WARN org.wikimedia.lsearch.util.Command - Got exit value 1 while executing [/bin/bash, -c, cd /path/to/mw/install && (echo "return \$wgDBname" | php maintenance/eval.php)] Exception in thread "main" java.io.IOException: Error executing command: at org.wikimedia.lsearch.util.Command.exec(Command.java:45) at org.wikimedia.lsearch.util.Configure.getVariable(Configure.java:77) at org.wikimedia.lsearch.util.Configure.main(Configure.java:42)

user@host: sudo su root@host: ./configure /path/to/mw/install 0 [main] WARN org.wikimedia.lsearch.util.Command - Got exit value 1 while executing [/bin/bash, -c, cd  /path/to/mw/instal && (echo "return \$wgDBname" | php maintenance/eval.php)] Exception in thread "main" java.io.IOException: Error executing command: at org.wikimedia.lsearch.util.Command.exec(Command.java:45) at org.wikimedia.lsearch.util.Configure.getVariable(Configure.java:77) at org.wikimedia.lsearch.util.Configure.main(Configure.java:42) Seems to me that this is highly unlikely to be a permissions issue. My MW installation is working just fine otherwise.

I can't even get past the first step of the instructions, which does not bode well. Will try building from source, but doubt that will make any difference.... Any ideas? --Fungiblename 20:38, 18 March 2009 (UTC)


 * You need to replace /path/to/mw/install with the actual path to your mediawiki installation (e.g. something like /var/www/mediawiki/). --Rainman 21:07, 18 March 2009 (UTC)
 * Thanks, I was using my actual path but did not want to reproduce it here in full. I was able to compile the SVN version, however, and even after changing the "hostname" variable to my actual hostname as recognized by Apache, I get the following:

./configure /var/www/mw 0 [main] WARN org.wikimedia.lsearch.util.Command - Got exit value 1 while executing [/bin/bash, -c, cd /var/www/mw && (echo "return \$wgDBname" | php maintenance/eval.php)] Exception in thread "main" java.io.IOException: Error executing command: at org.wikimedia.lsearch.util.Command.exec(Command.java:45) at org.wikimedia.lsearch.util.Configure.getVariable(Configure.java:77) at org.wikimedia.lsearch.util.Configure.main(Configure.java:42) --Fungiblename 21:14, 18 March 2009 (UTC)
 * If you go into your mw installation dir (i.e. one you supplied) and run echo "return \$wgDBname" | php maintenance/eval.php what do you get? Do you get the name of your database? --Rainman 21:28, 18 March 2009 (UTC)


 * Thanks for the troubleshooting advice! It seems like this was a major an oversight on my part. I get the same error as above because I'm running a small wiki farm with shared code (symlinks from the install directory to the shared MediaWiki code). Once I wrote "export MW_INSTALL_PATH=/var/www/mw && ./configure /var/www/mw" it wrote all the config files. You may want to add a note on the main page about configuring for installations with shared code (at least this very basic step). I'll play around on my own to try to find a way to have multiple separate indexes (my plan is to set up multiple directories with separate config files, index directories, and a symlink to the main jar). I'll try to get it working with just one first, though. Thanks again for your help and all your hard work on this! --Fungiblename 07:39, 19 March 2009 (UTC)


 * For me configure sets wrong value of dbname in config.ini and it cause . Here I see "dbname=> DatabaseName>". Note wrong ">" signs. Calling echo "return \$wgDBname" | php maintenance/eval.php returns

> DatabaseName >
 * eval.php at some servers prints prompt to stdout. I found that it happens when php function posix_isatty exists. Sometimes it does not.
 * Also configure wants php to be in PATH. It is not always true either. --Roma7
 * I had the same problem and solved it. It seems that the ./configure didn't "recognize" the PHP in the LAMP package and so I simply installed PHP CLI and it worked...--Gregra 21:55, 5 December 2009 (UTC)

Here's just a taste of my output from trying to build from source of the STABLE version
user@host:~/common/elements/lucene-SVN-stable-2009-03-18$ ant Buildfile: build.xml

build: [mkdir] Created dir: /home/username/common/elements/lucene-SVN-stable-2009-03-18/bin [javac] Compiling 101 source files to /home/username/common/elements/lucene-SVN-stable-2009-03-18/bin [javac] /home/username/common/elements/lucene-SVN-stable-2009-03-18/src/org/wikimedia/lsearch/analyzers/WikiQueryParser.java:24: package org.mediawiki.importer does not exist [javac] import org.mediawiki.importer.ExactListFilter; [javac]                             ^ [javac] /home/username/common/elements/lucene-SVN-stable-2009-03-18/src/org/wikimedia/lsearch/importer/DumpImporter.java:13: package org.mediawiki.importer does not exist... .... rTest.java uses or overrides a deprecated API. [javac] Note: Recompile with -Xlint:deprecation for details. [javac] Note: Some input files use unchecked or unsafe operations. [javac] Note: Recompile with -Xlint:unchecked for details. [javac] 70 errors

BUILD FAILED /home/username/common/elements/lucene-SVN-stable-2009-03-18/build.xml:68: Compile failed; see the compiler error output for details.

Total time: 2 seconds

"ant -Xlint:deprecation -f build.xml Unknown argument: -Xlint:deprecation"

Does anyone have any instructions about how to even get this thing running? Are there some hidden instructions/prerequisites that I'm missing? Seems to me this should be pretty easy to run on Linux.... --Fungiblename 20:53, 18 March 2009 (UTC)
 * Must place "mwdumper.jar" in "lib" of directory downloaded from SVN. --Fungiblename 21:12, 18 March 2009 (UTC)

Unable to build
When building from the binary I get this error. I am in Ubuntu:

root@testwiki:/usr/share/mediawiki/extensions/lucene-search-2.1# ./build Dumping wikidb... 2009-03-19 20:14:42: wikidb 99 pages (143.215/sec), 100 revs (144.661/sec), ETA 2009-03-19 20:14:45 [max 513] 2009-03-19 20:14:42: wikidb 199 pages (192.676/sec), 200 revs (193.645/sec), ETA 2009-03-19 20:14:44 [max 513] 2009-03-19 20:14:43: wikidb 299 pages (222.928/sec), 300 revs (223.674/sec), ETA 2009-03-19 20:14:44 [max 513] 2009-03-19 20:14:43: wikidb 399 pages (230.430/sec), 400 revs (231.008/sec), ETA 2009-03-19 20:14:44 [max 513] 2009-03-19 20:14:43: wikidb 458 pages (243.707/sec), 458 revs (243.707/sec), ETA 2009-03-19 20:14:44 [max 513] mkdir: cannot create directory `/var/lib/mediawiki/extensions/lucene-search-2.1/indexes/status': No such file or directory ./build: line 19: /var/lib/mediawiki/extensions/lucene-search-2.1/indexes/status/wikidb: No such file or directory MediaWiki lucene-search indexer - rebuild all indexes associated with a database. Trying config file at path /root/.lsearch.conf Trying config file at path /var/lib/mediawiki/extensions/lucene-search-2.1/lsearch.conf MediaWiki lucene-search indexer - index builder from xml database dumps.

1   [main] INFO  org.wikimedia.lsearch.util.Localization  - Reading localization for En 2799 [main] INFO  org.wikimedia.lsearch.ranks.Links  - Making index at /var/lib/mediawiki/extensions/lucene-search-2.1/indexes/import/wikidb.links 3208 [main] INFO org.wikimedia.lsearch.ranks.LinksBuilder  - Calculating article links... 458 pages (26.889/sec), 458 revs (26.889/sec) 21058 [main] INFO org.wikimedia.lsearch.index.IndexThread  - Making snapshot for wikidb.links 21291 [main] INFO org.wikimedia.lsearch.index.IndexThread  - Made snapshot /var/lib/mediawiki/extensions/lucene-search-2.1/indexes/snapshot/wikidb.links/20090319161516 21405 [main] INFO org.wikimedia.lsearch.search.UpdateThread  - Syncing wikidb.links 21963 [main] INFO org.wikimedia.lsearch.ranks.Links  - Opening for read /var/lib/mediawiki/extensions/lucene-search-2.1/indexes/search/wikidb.links 21973 [main] INFO org.wikimedia.lsearch.related.RelatedBuilder  - Rebuilding related mapping from links 34467 [main] INFO org.wikimedia.lsearch.index.IndexThread  - Making snapshot for wikidb.related 34649 [main] INFO org.wikimedia.lsearch.index.IndexThread  - Made snapshot /var/lib/mediawiki/extensions/lucene-search-2.1/indexes/snapshot/wikidb.related/20090319161529 34661 [main] INFO org.wikimedia.lsearch.importer.Importer  - Indexing articles (index+highlight+titles)... 34663 [main] INFO org.wikimedia.lsearch.ranks.Links  - Opening for read /var/lib/mediawiki/extensions/lucene-search-2.1/indexes/search/wikidb.links 35075 [main] INFO org.wikimedia.lsearch.analyzers.StopWords  - Successfully loaded stop words for: [nl, en, it, fr, de, sv, es, no, pt, da] in 329 ms 35077 [main] INFO  org.wikimedia.lsearch.importer.SimpleIndexWriter  - Making new index at /var/lib/mediawiki/extensions/lucene-search-2.1/indexes/import/wikidb 35087 [main] INFO org.wikimedia.lsearch.importer.SimpleIndexWriter  - Making new index at /var/lib/mediawiki/extensions/lucene-search-2.1/indexes/import/wikidb.hl Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException at java.lang.System.arraycopy(libgcj.so.81) at java.io.ByteArrayOutputStream.write(libgcj.so.81) at org.apache.lucene.index.FieldsReader.uncompress(FieldsReader.java:514) at org.apache.lucene.index.FieldsReader.addField(FieldsReader.java:317) at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:166) at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:659) at org.apache.lucene.index.IndexReader.document(IndexReader.java:525) at org.wikimedia.lsearch.storage.RelatedStorage.getRelated(RelatedStorage.java:56) at org.wikimedia.lsearch.importer.DumpImporter.writeEndPage(DumpImporter.java:109) at org.mediawiki.importer.PageFilter.writeEndPage(Unknown Source) at org.mediawiki.importer.XmlDumpReader.closePage(Unknown Source) at org.mediawiki.importer.XmlDumpReader.endElement(Unknown Source) at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source) at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanEndElement(Unknown Source) at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source) at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source) at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source) at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source) at org.apache.xerces.parsers.XMLParser.parse(Unknown Source) at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source) at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source) at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source) at javax.xml.parsers.SAXParser.parse(libgcj.so.81) at org.mediawiki.importer.XmlDumpReader.readDump(Unknown Source) at org.wikimedia.lsearch.importer.Importer.main(Importer.java:186) at org.wikimedia.lsearch.importer.BuildAll.main(BuildAll.java:109) root@testwiki:/usr/share/mediawiki/extensions/lucene-search-2.1#

root@testwiki:/usr/share/mediawiki/extensions/lucene-search-2.1# java -version java version "1.5.0" gij (GNU libgcj) version 4.2.4 (Ubuntu 4.2.4-1ubuntu3)

Copyright (C) 2007 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. root@testwiki:/usr/share/mediawiki/extensions/lucene-search-2.1# javac Eclipse Java Compiler v_774_R33x, 3.3.1 Copyright IBM Corp 2000, 2007. All rights reserved. Usage: Now, add the following to LocalSettings.php: # lsearch require_once("extensions/MWSearch/MWSearch.php"); $wgSearchType = 'LuceneSearch'; $wgLuceneHost = 'YourHostName'; # <-- change this! $wgLucenePort = 8123; # uncomment this if you use lucene-search 2.1 # (MUST be AFTER the require_once!) $wgLuceneSearchVersion = 2.1;

Where YourHostName is the results of 'hostname'. The search doesn't work on my machine if I use the default, "192.168.0.1".

How to customize synonyms and stop words?
How can I edit the synonyms and stop words in order to bring the engine more in line with our needs?
 * You need to checkout the source from svn. Then edit resources/dist/wordnet-en.txt (for synonyms) and stopwords-en.txt. If this does not work, then you could also try making your own Filter class and plugging it in into the FilterFactory class. --Rainman 18:31, 1 May 2009 (UTC)

Thanks. I have done as you suggest. However, I do not see any indication that the system is ignoring stop words (e.g. if I search with the word "me", I get results). I also do not know how to confirm that the synonyms are working. Are there some good tests I could run to verify? Marc 14:31, 6 May 2009 (MDT)

Searching Attachments
I am running MediaWiki 1.13.1 PHP 5.2.4-2ubuntu5.6(apache2handler) MySQL 5.0.51a-3ubuntu5.4

I have the FileIndexer

http://www.mediawiki.org/wiki/Extension:FileIndexer

and

http://www.mediawiki.org/wiki/Extension:MWSearch

now installed and running.

The Lucene search capability seems to work far better than the default search capability except that it no longer generates search results from attachements that were turned into text and then inserted in the image field

Is this a limitation of the present software? I had hoped the Lucene Search would index the attachments, especially given the use of the FileIndexer.

Is is significant that the FQDN is http://wiki.tesla.local/ (on a local LAN) but that the hostname is wiki

Attached are the configuration files.

lsearch.conf

MWConfig.global=file:///home/chris/lucene-search-2.1/lsearch-global.conf Indexes.path=/home/chris/lucene-search-2.1/indexes Rsync.path=/usr/bin/rsync Search.updateinterval=0.1 Search.updatedelay=0 Search.checkinterval=10 Search.warmupaggregate=true Search.ramdirectory=false Search.disablewordnet=true SearcherPool.size=1 Index.snapshotinterval=2880 Index.maxqueuecount=5000 Index.maxqueuetimeout=12 Localization.url=file:///home/chris/public_html_3/wiki/languages/messages OAI.maxqueue=5000 OAI.bufferdocs=500 Logging.logconfig=/home/chris/lucene-search-2.1/lsearch.log4j Logging.debug=false lsearch-global.conf [Database] MediaWiki : (single) (spell,4,2) (language,en) [Search-Group] wiki : * [Index] wiki : * [Index-Path] : /search [OAI] : http://localhost/index.php [Namespace-Boost] : (0,2) (1,0.5) [Namespace-Prefix] all : [0] : 0 [1] : 1 [2] : 2 [3] : 3 [4] : 4 [5] : 5 [6] : 6 [7] : 7 [8] : 8 [9] : 9 [10] : 10 [11] : 11 [12] : 12 [13] : 13 [14] : 14 [15] : 15
 * 1) By default, will check /etc/lsearch.conf
 * 1) Global configuration
 * 1) Global configuration
 * 1) URL to global configuration, this is the shared main config file, it can
 * 2) be on a NFS partition or available somewhere on the network
 * 1) Local path to root directory of indexes
 * 1) Path to rsync
 * 1) Extra params for rsync
 * 2) Rsync.params=--bwlimit=8192
 * 1) Search node related configuration
 * 1) Search node related configuration
 * 1) Port of http daemon, if different from default 8123
 * 2) Search.port=8000
 * 1) In minutes, how frequently will the index host be checked for updates
 * 1) In seconds, delay after which the update will be fetched
 * 2) used to scatter the updates around the hour
 * 1) In seconds, how frequently the dead search nodes should be checked
 * 1) In milliseconds, for how long should the query be executed
 * 2) Search.timelimit=1000
 * 1) if to wait for aggregates to warm up before deploying the searcher
 * 1) cache *whole* index in RAM
 * 1) Disable wordnet aliases
 * 1) If this host runs on multiple CPUs maintain a pool of index searchers
 * 2) It's good idea to make it number of CPUs+1, or some larger odd number
 * 1) Indexer related configuration
 * 1) Indexer related configuration
 * 1) In minutes, how frequently is a clean snapshot of index created
 * 1) Daemon type (http is started by default)
 * 2) Index.daemon=xmlrpc
 * 1) Port of daemon (default is 8321)
 * 2) Index.port=8080
 * 1) Maximal queue size after which index is being updated
 * 1) Maximal time an update can remain in queue before being processed (in seconds)
 * 1) If to delete all old snapshots always (default to false - leaves the last good snapshot)
 * 2) Index.delsnapshots=true
 * 1) Log, ganglia, localization
 * 1) Log, ganglia, localization
 * 1) URL to MediaWiki message files
 * 1) Username/password for password authenticated OAI repo
 * 2) OAI.username=user
 * 3) OAI.password=pass
 * 1) Max queue size on remote indexer after which we wait a bit
 * 1) Number of docs to buffer before sending to inc updater
 * 1) Log configuration
 * 1) Set debug to true to diagnose problems with log4j configuration
 * 1) Turn this on to broadcast status to a Ganglia reporting system.
 * 2) Requires that 'gmetric' be in the PATH and runnable. You can
 * 3) override the default UDP broadcast port and interface if required.
 * 4) Ganglia.report=true
 * 5) Ganglia.port=8649
 * 6) Ganglia.interface=eth0
 * 1) Global search cluster layout configuration
 * 1) Global search cluster layout configuration

config.inc dbname=MediaWiki wgScriptPath= hostname=wiki indexes=/home/chris/lucene-search-2.1/indexes mediawiki=/home/chris/public_html_3/wiki base=/home/chris/lucene-search-2.1 wgServer=http://localhost


 * Unfortunately lucene-search won't search attachments no matter what kind of extra extension you use. You could however try Extension:EzMwLucene which is also lucene-based but has a different set of features, doesn't have some lucene-search stuff, but has attachment search. --Rainman 09:31, 26 May 2009 (UTC)


 * Thank you so much for the prompt response. I will try the Extension:EzMwLucene search as attachment searching is key feature I would like in our company wiki.


 * Thanks a bunch Rainman. Do you know offhand what the major differences are between both Lucene extensions?  We have Lucene-search installed but would like to enable EzMwLucene but it would be good to know what the feature differences are. --Gkullberg 13:59, 3 July 2009 (UTC)

Search within files?
Is it possible to use Lucene to search within files uploaded to MediaWiki?

On the Lucene page on Wikipedia it says:

"At the core of Lucene's logical architecture is the idea of a document containing fields of text. This flexibility allows Lucene's API to be independent of the file format. Text from PDFs, HTML, Microsoft Word, and OpenDocument documents, as well as many others can all be indexed so long as their textual information can be extracted."

It would be great if I could search within PDFs and Docs and whatever else I upload to my MediaWiki instance. --Gkullberg 19:55, 2 July 2009 (UTC)
 * See answer to previous question.... --Rainman 10:08, 3 July 2009 (UTC)

How to use CJKAnalyzer
Is it possible to use CJKAnalyzer for indexing pages written in Japanese?
 * Yes, just change (language,en) to (language,ja) in your config file (and re-run the build process). --Rainman 08:27, 10 July 2009 (UTC)

Periodic fatal errors while rebuilding index - "no segments* file"
I'm running Lucene-search on our local wiki. The build script runs correctly and produces a valid index, which is picked up by the daemon, and everything works fine...for a bit. I've created a cron job that runs the build script hourly, with the output of the script being emailed to me. The cron job runs happily for a spell and then I receive this in the output:

MediaWiki lucene-search indexer - rebuild all indexes associated with a database. Trying config file at path /home/system/mymintel-svc/.lsearch.conf Trying config file at path /data/mymintel/mediawiki/lucene_search/lsearch.conf MediaWiki lucene-search indexer - index builder from xml database dumps.

0   [main] INFO  org.wikimedia.lsearch.util.Localization  - Reading localization for En 582  [main] INFO  org.wikimedia.lsearch.ranks.Links  - Making index at /data/mymintel/mediawiki/lucene_search/indexes/import/it_wiki.links 924 [main] INFO  org.wikimedia.lsearch.ranks.LinksBuilder  - Calculating article links... 3,759 pages (338.679/sec), 3,759 revs (338.679/sec) 14271 [main] INFO org.wikimedia.lsearch.index.IndexThread  - Making snapshot for it_wiki.links 14645 [main] INFO org.wikimedia.lsearch.index.IndexThread  - Made snapshot /data/mymintel/mediawiki/lucene_search/indexes/snapshot/it_wiki.links/20090731050111 14696 [main] INFO org.wikimedia.lsearch.search.UpdateThread  - Syncing it_wiki.links 15632 [main] INFO org.wikimedia.lsearch.ranks.Links  - Opening for read /data/mymintel/mediawiki/lucene_search/indexes/search/it_wiki.links 15637 [main] INFO org.wikimedia.lsearch.related.RelatedBuilder  - Rebuilding related mapping from links 15640 [main] FATAL org.wikimedia.lsearch.importer.Importer - Cannot make related mapping: no segments* file found in  org.apache.lucene.store.FSDirectory@/data/mymintel/mediawiki/lucene_search/indexes/search/it_wiki.links: files: MediaWiki lucene-search indexer - build spelling suggestion index.

16802 [main] INFO org.wikimedia.lsearch.spell.SuggestBuilder  - Building spell-check for it_wiki 16802 [main] INFO org.wikimedia.lsearch.util.Localization  - Reading localization for En 16931 [main] INFO  org.wikimedia.lsearch.spell.SuggestBuilder  - Rebuilding precursor index... 17037 [main] INFO org.wikimedia.lsearch.analyzers.StopWords  - Successfully loaded stop words for: [nl, en, it, fr, de, sv, es, no, pt, da] in 68 ms 17039 [main] INFO  org.wikimedia.lsearch.spell.CleanIndexWriter  - Using phrase stopwords: [only, theirs, some, where, being, after, doing, did, they, herself, as, so, our, than, your, for, down, the, other, of, does, no, ours, with, from, them, by, also, you, hers, until, yourself, has, she, it, up, why, have, this, those, about, between, which, under, these, i, yours, but, his, myself, yourselves, having, more, be, her, into, its, an, he, on, over, was, here, to, such, above, because, nor, had, him, below, and, whoever, during, their, itself, been, most, that, out, each, or, a, own, all, what, in, ourselves, were, themselves, both, not, same, do, am, too, once, any, when, then, who, how, whom, my, through, there, before, very, we, against, few, while, again, me, at, if, himself, are, is, off, further] 17129 [main] INFO org.wikimedia.lsearch.ranks.Links  - Opening for read /data/mymintel/mediawiki/lucene_search/indexes/search/it_wiki.links java.io.IOException: no segments* file found in org.apache.lucene.store.FSDirectory@/data/mymintel/mediawiki/lucene_search/indexes/search/it_wiki.links: files: at org.mediawiki.importer.XmlDumpReader.readDump(Unknown Source)

From this point onwards, the job will not run correctly until I have deleted the indexes directory and started from scratch.

I've dumped the directory structure of the filesystem when the index is working correctly, and when it's broken; the output is below.

Working config
indexes/ |-- import |  |-- it_wiki |  |   |-- _7.cfs |  |   |-- segments.gen |  |   `-- segments_h |  |-- it_wiki.hl |   |   |-- _7.cfs |  |   |-- segments.gen |  |   `-- segments_h |  |-- it_wiki.links |  |   |-- _8.cfs |  |   |-- segments.gen |  |   `-- segments_j |  |-- it_wiki.related |  |   |-- _d.cfs |  |   |-- segments.gen |  |   `-- segments_t |  |-- it_wiki.spell |  |   |-- _1v.cfs |  |   |-- segments.gen |  |   `-- segments_3t |  `-- it_wiki.spell.pre |      |-- _8.cfs |      |-- segments.gen |      `-- segments_j |-- index |  |-- it_wiki |  |   |-- _7.cfs |  |   |-- segments.gen |  |   `-- segments_h |  |-- it_wiki.hl |   |   |-- _7.cfs |  |   |-- segments.gen |  |   `-- segments_h |  |-- it_wiki.links |  |   |-- _8.cfs |  |   |-- segments.gen |  |   `-- segments_j |  `-- it_wiki.spell.pre |      |-- _8.cfs |      |-- segments.gen |      `-- segments_j |-- search |  |-- it_wiki -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki/20090730163156 |  |-- it_wiki.hl -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.hl/20090730163156 |  |-- it_wiki.links -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090730163123 |  |-- it_wiki.related -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.related/20090730163127 |  `-- it_wiki.spell -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.spell/20090730163230 |-- snapshot |  |-- it_wiki |  |   `-- 20090730163156 |   |       |-- _7.cfs |  |       |-- segments.gen |  |       `-- segments_h |  |-- it_wiki.hl |   |   `-- 20090730163156 |  |       |-- _7.cfs |  |       |-- segments.gen |  |       `-- segments_h |  |-- it_wiki.links |  |   `-- 20090730163123 |   |       |-- _8.cfs |  |       |-- segments.gen |  |       `-- segments_j |  |-- it_wiki.related |  |   `-- 20090730163127 |   |       |-- _d.cfs |  |       |-- segments.gen |  |       `-- segments_t |  |-- it_wiki.spell |  |   `-- 20090730163230 |   |       |-- _1v.cfs |  |       |-- segments.gen |  |       `-- segments_3t |  `-- it_wiki.spell.pre |      `-- 20090730163210 |           |-- _8.cfs |          |-- segments.gen |          `-- segments_j |-- status |  `-- it_wiki `-- update |-- it_wiki |  `-- 20090730163156     |       |-- _7.cfs |      |-- segments.gen |      `-- segments_h |-- it_wiki.hl    |   `-- 20090730163156 |      |-- _7.cfs |      |-- segments.gen |      `-- segments_h |-- it_wiki.links |  `-- 20090730163123     |       |-- _8.cfs |      |-- segments.gen |      `-- segments_j |-- it_wiki.related |  `-- 20090730163127     |       |-- _d.cfs |      |-- segments.gen |      `-- segments_t `-- it_wiki.spell `-- 20090730163230            |-- _1v.cfs |-- segments.gen `-- segments_3t

Broken Config
indexes/ |-- import |  |-- it_wiki |  |   |-- _2f.cfs |  |   |-- segments.gen |  |   `-- segments_58 |  |-- it_wiki.hl |   |   |-- _2f.cfs |  |   |-- segments.gen |  |   `-- segments_58 |  |-- it_wiki.links |  |   |-- _5h.cfs |  |   |-- segments.gen |  |   `-- segments_bm |  |-- it_wiki.related |  |   |-- _4n.cfs |  |   |-- segments.gen |  |   `-- segments_9o |  |-- it_wiki.spell |  |   |-- _oj.cfs |  |   |-- segments.gen |  |   `-- segments_1dh |  `-- it_wiki.spell.pre |      |-- _39.fdt |      |-- _39.fdx |      |-- segments.gen |      |-- segments_74 |      `-- write.lock |-- index |  |-- it_wiki |  |   |-- _2f.cfs |  |   |-- segments.gen |  |   `-- segments_58 |  |-- it_wiki.hl |   |   |-- _2f.cfs |  |   |-- segments.gen |  |   `-- segments_58 |  |-- it_wiki.links |  |   |-- _5h.cfs |  |   |-- segments.gen |  |   `-- segments_bm |  `-- it_wiki.spell.pre |      |-- _39.fdt |      |-- _39.fdx |      |-- segments.gen |      |-- segments_74 |      `-- write.lock |-- search |  |-- it_wiki -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki/20090731040228 |  |-- it_wiki.hl -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.hl/20090731040228 |  |-- it_wiki.links |  |   |-- 20090731050111 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731050111 |  |   |-- 20090731060116 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731060116 |  |   |-- 20090731070104 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731070104 |  |   |-- 20090731080121 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731080121 |  |   |-- 20090731090112 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731090112 |  |   |-- 20090731100113 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731100113 |  |   |-- 20090731110108 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731110108 |  |   |-- 20090731120051 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731120051 |  |   `-- 20090731130055 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731130055 |  |-- it_wiki.related -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.related/20090731040125 |  `-- it_wiki.spell -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.spell/20090731040320 |-- snapshot |  |-- it_wiki |  |   |-- 20090731030246 |   |   |   |-- _27.cfs |  |   |   |-- segments.gen |  |   |   `-- segments_4r |  |   `-- 20090731040228 |   |       |-- _2f.cfs |  |       |-- segments.gen |  |       `-- segments_58 |  |-- it_wiki.hl |   |   |-- 20090731030247 |  |   |   |-- _27.cfs |  |   |   |-- segments.gen |  |   |   `-- segments_4r |  |   `-- 20090731040228 |   |       |-- _2f.cfs |  |       |-- segments.gen |  |       `-- segments_58 |  |-- it_wiki.links |  |   |-- 20090731120051 |   |   |   |-- _58.cfs |  |   |   |-- segments.gen |  |   |   `-- segments_b3 |  |   `-- 20090731130055 |   |       |-- _5h.cfs |  |       |-- segments.gen |  |       `-- segments_bm |  |-- it_wiki.related |  |   |-- 20090731030132 |   |   |   |-- _49.cfs |  |   |   |-- segments.gen |  |   |   `-- segments_8v |  |   `-- 20090731040125 |   |       |-- _4n.cfs |  |       |-- segments.gen |  |       `-- segments_9o |  |-- it_wiki.spell |  |   |-- 20090731030355 |   |   |   |-- _mn.cfs |  |   |   |-- segments.gen |  |   |   `-- segments_19o |  |   `-- 20090731040320 |   |       |-- _oj.cfs |  |       |-- segments.gen |  |       `-- segments_1dh |  `-- it_wiki.spell.pre |      |-- 20090731030320 |       |   |-- _2z.cfs |      |   |-- segments.gen |      |   `-- segments_6c |      `-- 20090731040253 |           |-- _38.cfs |          |-- segments.gen |          `-- segments_6v |-- status |  `-- it_wiki `-- update |-- it_wiki |  |-- 20090731030246     |   |   |-- _27.cfs |  |   |-- segments.gen |  |   `-- segments_4r |  `-- 20090731040228     |       |-- _2f.cfs |      |-- segments.gen |      `-- segments_58 |-- it_wiki.hl    |   |-- 20090731030247 |  |   |-- _27.cfs |  |   |-- segments.gen |  |   `-- segments_4r |  `-- 20090731040228     |       |-- _2f.cfs |      |-- segments.gen |      `-- segments_58 |-- it_wiki.links |  |-- 20090731120051     |   |   |-- _58.cfs |  |   |-- segments.gen |  |   `-- segments_b3 |  `-- 20090731130055     |       |-- _5h.cfs |      |-- segments.gen |      `-- segments_bm |-- it_wiki.related |  |-- 20090731030132     |   |   |-- _49.cfs |  |   |-- segments.gen |  |   `-- segments_8v |  `-- 20090731040125     |       |-- _4n.cfs |      |-- segments.gen |      `-- segments_9o `-- it_wiki.spell |-- 20090731030355        |   |-- _mn.cfs |  |-- segments.gen |  `-- segments_19o `-- 20090731040320            |-- _oj.cfs |-- segments.gen `-- segments_1dh

As you can see, the contents of index/search/it_wiki.links is completely different. I suspect that it's this that's causing the problem, but I don't know enough about what's going on to diagnose. Java version is: java version "1.5.0_14-p8" Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_14-p8-root_04_sep_2008_18_49) Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_14-p8-root_04_sep_2008_18_49, mixed mode)

...and i'm running on FreeBSD 7.0, if that makes a difference. Any ideas what's going on? It'd be nice not to have to delete and rebuild the indexes by hand every day!

So this part

|-- search |  |-- it_wiki.links |  |   |-- 20090731050111 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731050111 |  |   |-- 20090731060116 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731060116 |  |   |-- 20090731070104 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731070104 |  |   |-- 20090731080121 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731080121 |  |   |-- 20090731090112 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731090112 |  |   |-- 20090731100113 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731100113 |  |   |-- 20090731110108 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731110108 |  |   |-- 20090731120051 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731120051 |  |   `-- 20090731130055 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731130055

Looks quite wrong.. all the files in search/ should be symlinks and should not have any subdirectories.. I'm not sure how these are created. You are sure that the whole build process takes less than an hour? If you get overlapping jobs trying to do the same thing they might lock eachother indexes. --Rainman 16:43, 3 August 2009 (UTC)

When I run the process by hand, it never takes more than 5 mins to complete, so I'd be very surprised if jobs are overlapping. Would probably be a good idea to make certain though, so I'll change the cronjob to time the process and send an update next time it fails. -- Mrgroucho 16:48, 3 August 2009 (UTC)

OK, I think we can rule out overlapping jobs. The indexer ran successfully as scheduled for over 12 hours yesterday, and then failed early this morning. The time stats for the job, and the one preceding it, are as follows:

Preceding Job
real		 3m24.606s user		 2m15.323s sys		 0m16.361s

Failed Job
real		 0m59.046s user		 0m20.773s sys		 0m2.410s

The failed job takes less time, but you'd expect that: it failed. I've noted that the output mentions various Threads - is there any way that this could be some sort of race condition/locking problem between those threads? -- Mrgroucho 13:44, 4 August 2009 (UTC)

Any idea at all on how to fix this? I've put a workaround in place that deletes all of the indexes every midnight and then runs ./build, which means that if it breaks during the day the indexes will never be massively out of date, but it's hardly a pretty fix. -- Mrgroucho 13:33, 7 August 2009 (UTC)

I've had the same problem. The index building job (cronned for every hour) started to fail every couple of weeks, and then every week, and then every few days etc., until I was manually clearing out and rebuild a couple of times a day! I'd already written a wrapper script around the 'build' script, which was just for making the process a bit more cron-friendly, checking the search daemon was running, and for aborting it altogether when my server-backup routines are in operation etc. I've now decided to have it build the indexes in a new location every hour, and then 'cut-over' if no error is encountered. So far, this seems to be effective.

The indexes folder I was using seemed to be growing exponentially, and I think this may have been related to the problem. However, not being a Java or Lucene expert, I think this is a problem I'm gonna have to continue to work-around instead of solving. --140.131.255.2 05:38, 7 September 2009 (UTC)


 * Yeah, I have this problem too. If anyone can post a solution - I'd be grateful.  Even a workaround script.  Thanks.  --Robinson Weijman 09:08, 5 March 2010 (UTC)


 * We too are experiencing this problem and I am surprised no solution has been posted although many have reported this exact problem on the net. The last solution I tried was to do a "rm -r" of the whole "indexes" directory before each call to "build" the index. The problem still arises but at least, when it fails once, it usually succeeds afterwards. I think I gave a great chance to LuceneSearch 2.1.3 and I go to 2.0, hoping this problem will disappear. Phil Reid 30 Nov 2010

I also have this problem, and do indeed believe it to be some sort of locking issue. The way I (think) I've "solved" it is a bit of a kluge -- I am killing the search daemon (lsearchd) before the update and restarting it after the update. At the moment there is an interruption of 30 seconds every hour. This is not great, but I figure this is better than the index not being updated. Looking forward to a more elegant fix [Thu Dec 16 21:10:30 GMT 2010]


 * So much for that. This kluge worked until the symlink lucene-search-2.1.3/indexes/search/wikidb.links (pointing to lucene-search-2.1.3/indexes/update/wikidb.links/&lt;datestamp&gt;) was replaced by lucene-search-2.1.3/indexes/search/wikidb.links as a directory in its own right, containing symlinks to the datestamped dirs. I've now modifed the update script to check if indexes/search/wikidb.links is a symlink and recreate it if not, making the kluge even worse. [Thu Dec 16 23:55:20 GMT 2010]

We also have this stupid problem. I'm looking forward to change the cron script and remove the index if there is a failure. It would be nice if you post yours cron script dirty workarrounds. At this time, my cron bash script does check the logfile for errors after the update script runs, e.g.: CHECK=$(grep -i "Rebuild I/O error:" $LOGFILE | head -1 | cut -c10) If there is a failure, I send a mail... --BSG2000 15:46, 20 December 2010 (UTC)

how to fix this when I run ./lsearchd
0rz  # ./lsearchd RMI registry started. Trying config file at path /root/.lsearch.conf Trying config file at path /usr/local/search/ls2-bin/lsearch.conf Exception in thread "main" java.lang.NullPointerException at org.wikimedia.lsearch.config.GlobalConfiguration.makeIndexIdPool(GlobalConfiguration.java:531) at org.wikimedia.lsearch.config.GlobalConfiguration.read(GlobalConfiguration.java:413) at org.wikimedia.lsearch.config.GlobalConfiguration.readFromURL(GlobalConfiguration.java:247) at org.wikimedia.lsearch.config.Configuration. (Configuration.java:116) at org.wikimedia.lsearch.config.Configuration.open(Configuration.java:68) at org.wikimedia.lsearch.config.StartupManager.main(StartupManager.java:39) 0rz  #

my environment is: 0rz  # java -version java version "1.6.0_13" Java(TM) SE Runtime Environment (build 1.6.0_13-b03) Java HotSpot(TM) Client VM (build 11.3-b02, mixed mode, sharing) 0rz  # ant -version Apache Ant version 1.7.1 compiled on June 27 2008 0rz </usr/local/search/ls2-bin> #

LSearch Daemon Init Script for Ubuntu

 * This is just a sample. You will need to adjust this based on where you put the lucene-search directory.
 * 1) !/bin/sh -e
 * 2) BEGIN INIT INFO
 * 3) Provides:             lsearchd
 * 4) Required-Start:       $syslog
 * 5) Required-Stop:        $syslog
 * 6) Default-Start:        2 3 4 5
 * 7) Default-Stop:         1
 * 8) Short-Description:    Start the Lucene Search daemon
 * 9) Description:          Provide a Lucene Search backend for MediaWiki
 * 10) END INIT INFO

test -x /usr/local/lucene-search-2.1/lsearchd || exit 0

OPTIONS="" if [ -f "/etc/default/lsearchd" ] ; then . /etc/default/lsearchd fi

. /lib/lsb/init-functions

case "$1" in start)    cd /usr/local/lucene-search-2.1    log_begin_msg "Starting Lucene Search Daemon..."    start-stop-daemon --start --quiet --oknodo --chdir /usr/local/lucene-search-2.1 --background --exec /usr/local/lucene-search-2.1/lsearchd -- $OPTIONS    log_end_msg $?    ;;  stop) log_begin_msg "Stopping Lucene Search Daemon..." start-stop-daemon --stop --quiet --oknodo --retry 2 --chdir /usr/local/lucene-search-2.1 --exec /usr/local/lucene-search-2.1/lsearchd log_end_msg $? ;; restart)    $0 stop    sleep 1    $0 start    ;;  reload|force-reload) log_begin_msg "Reloading Lucene Search Daemon..." stat-stop-daemon --stop -signal 1 --chdir /usr/local/lucene-search-2.1 --exec /usr/local/lucene-search-2.1/lsearchd log_end_msg $? ;; status)    status_of_proc /usr/local/lucene-search-2.1/lsearchd lsearchd && exit 0 || exit $?    ;;  *) log_success_msg "Usage: /etc/init.d/lsearchd {start|stop|restart|reload|force-reload|status}" exit 1 esac

exit 0 55,6

Error here when use configure
Hi, After I run "ant" to build the jar and generate configuration files, here comes the error. What's wrong?

Mediawiki: 1.15.1; Lucence: 2.1; OS: centOS [root@xxx lucene-search-2.1]# ./configure /var/wk/ Exception in thread "main" java.net.UnknownHostException: 00:16:3h:2d:6c:b0-hk0.localdomain: 00:16:3h:2d:6c:b0-hk0.localdomain at java.net.InetAddress.getLocalHost(InetAddress.java:1425) at org.wikimedia.lsearch.util.Configure.main(Configure.java:52) --Alpha3 11:26, 27 August 2009 (UTC)

In which context config.inc will be used?
--Ans 08:22, 4 September 2009 (UTC)
 * It is used in build process "./build" --Ans 09:40, 4 September 2009 (UTC)

Install from SVN or Binary.
I would recommend SVN any day. I've been through several installations of Lucene Search this morning, and the most rapid and problem-free methods was to use the SVN approach. It was also the *easiest* method to get Lucene to work - it just works. Also reads your MW config and produces its own *correct* configuration files.

MWSearch works just fine and dandy on top of this Lucene instance.

The 2.02 installation went badly, several times for me - and it chews a LOT more resources. I did get it working, but when it came to rebuilding indexes, it spewed up on the whole Computer - chewed 100% Chip, chewed more than 100% RAM, which caused Kernel Panic and failures. Had to reboot forcibly with hardware. Re-attempted several times and re-configured settings to test against "should be working" configuration - got the same results with the machine crawling. So I gave up, went to SVN, and instead of choking on the indexes, it rebuilt them in under 20 seconds. I understand there are Java issues around this. Forget them, it's not worth breaking Java on the System, or putting up with some strange configuration, just to get Lucene working. Gooooo SVN!! :-)

BTW: It should be made apparent on the Lucene-search Extension page that the SVN installation DOES work, and works VERY well. I had previously avoided this method as I am SVN-wary - where with a bit of prompting, that would have been my first choice.

Cheers, Mike

Brief period with zero search results (using update script)
Our wiki runs the "update" script every 15 minutes, and the update takes about 2 minutes. Updates are done locally on the single wiki server.

Unfortunately, for a brief time while this script runs, searches return zero results. This problem lasts just a few seconds, but our users do encounter it and become confused.

Any advice on eliminating this "zero results" period? We thought about running two different reindexing processes, each writing to a different directory, and switching between them with a symbolic link. Something like:


 * 1) At 6:00, update index #1 in /usr/local/lucene1.
 * 2) Point symbolic link /usr/local/lucene at /usr/local/lucene1.
 * 3) At 6:15, update index #2 in /usr/local/lucene2.
 * 4) Point symbolic link /usr/local/lucene at /usr/local/lucene2.
 * 5) At 6:30, update index #1 in /usr/local/lucene1.
 * 6) Point symbolic link /usr/local/lucene at /usr/local/lucene1.

But I don't see a way to make Lucene reindex one directory while serving out of another. Any better suggestions? Maiden taiwan 15:03, 23 September 2009 (UTC)


 * This should not happen. Are there any errors in logs during this period? --Rainman 17:20, 23 September 2009 (UTC)


 * Yes: the update script outputs:

MediaWiki lucene-search indexer - build a map of related articles. ... 413 [main] INFO  org.wikimedia.lsearch.related.RelatedBuilder  - Rebuilding related mapping from links 416 [main] FATAL org.wikimedia.lsearch.related.RelatedBuilder  - Rebuild I/O error: no segments* file found in org.apache.lucene.store.FSDirectory@/usr/local/lucene-search-2.1/indexes/search/wikidb.links: files: java.io.FileNotFoundException: no segments* file found in org.apache.lucene.store.FSDirectory@/usr/local/lucene-search-2.1/indexes/search/wikidb.links: files: at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:587) at org.apache.lucene.index.DirectoryIndexReader.open(DirectoryIndexReader.java:63) at org.apache.lucene.index.IndexReader.open(IndexReader.java:209) at org.apache.lucene.index.IndexReader.open(IndexReader.java:173) at org.wikimedia.lsearch.ranks.Links.flushForRead(Links.java:213) at org.wikimedia.lsearch.ranks.Links.ensureRead(Links.java:239) at org.wikimedia.lsearch.ranks.Links.getKeys(Links.java:773) at org.wikimedia.lsearch.related.RelatedBuilder.rebuildFromLinks(RelatedBuilder.java:91) at org.wikimedia.lsearch.related.RelatedBuilder.main(RelatedBuilder.java:72)


 * The named folder (wikidb.links) contains only symbolic links named after timestamps: 20090924121512, etc., linking to folders. Inside the folders (that exist) are files:

-rw-r--r-- 2 root root 4952161 Sep 24 12:00 _hq.cfs -rw-r--r-- 2 root root     46 Sep 24 12:00 segments_172 -rw-r--r-- 2 root root     20 Sep 24 12:00 segments.gen


 * --Maiden taiwan 16:29, 24 September 2009 (UTC)

No, this appears to be a separate issues. In any case, the extension does use symbolic links to quickly switch between the new and the old index, and it also allows for the new and the old index to co-exist for a while until all the old searches finish or timeout. So, that shouldn't be a problem. What could be a problem is that if you have the indexer and searcher on the same machine with insufficient RAM, then the indexer bogs down the machine causing high I/O which then slows down the searchers to the point of searches timing out. --Rainman 08:56, 25 September 2009 (UTC)


 * Thanks. We have plenty of RAM (4 GB I believe) on a virtual machine, and while the load average does go up to about 3.0 - 4.0 during indexing, users don't perceive any slowness. That is, the search query returns quickly with zero results.  How long is the timeout? Maiden taiwan 11:49, 25 September 2009 (UTC)

svn revision # for 2.1.2
What is the revision number for the 2.1.2 binary hosted at SourceForge?

I am having trouble getting any search results when building from the HEAD of http://svn.wikimedia.org/svnroot/mediawiki/branches/lucene-search-2.1/ The index builds fine, but when I query I get no results. Returns exception:

java.lang.IllegalArgumentException: nDocs must be > 0</tt>

Querying directly to http://localhost:8123/search/wikidb/help gives:

267
 * 1) info search=[gziebold-15624s.local], highlight=[], suggest=[gziebold-15624s.local] in 46 ms
 * 2) no suggestion
 * 3) interwiki 0 0
 * 4) results 0

However, indexing and running based on 2.1.2 binary works fine.
 * It's the svn revision of the date of release, don't know offhand, you'll have to look it up. I've built some indexes but didn't have problems with latest svn, can you provide a full stack trace? --Rainman 11:09, 10 November 2009 (UTC)

RMI registry started. Trying config file at path /Users/gziebold/.lsearch.conf Trying config file at path /Users/gziebold/Projects/mediawiki/lucene-search-2.1.built/lsearch.conf 0   [main] INFO  org.wikimedia.lsearch.util.Localization  - Reading localization for En 733  [main] INFO  org.wikimedia.lsearch.interoperability.RMIServer  - RMIMessenger bound 737 [Thread-1] INFO  org.wikimedia.lsearch.frontend.HTTPIndexServer  - Indexer started on port 8321 739 [Thread-2] INFO  org.wikimedia.lsearch.frontend.SearchServer  - Searcher started on port 8123 746 [Thread-5] INFO  org.wikimedia.lsearch.search.SearcherCache  - Starting initial deployer for [wikidb, wikidb.hl, wikidb.links, wikidb.related, wikidb.spell] 818 [Thread-5] INFO  org.wikimedia.lsearch.search.SearcherCache  - Caching meta fields for wikidb ... 2522 [Thread-5] INFO  org.wikimedia.lsearch.search.SearcherCache  - Finished caching wikidb in 1705 ms 2554 [Thread-5] INFO  org.wikimedia.lsearch.interoperability.RMIServer  - RemoteSearchable $0 bound 2562 [Thread-5] INFO org.wikimedia.lsearch.interoperability.RMIServer  - RemoteSearchable<wikidb.hl>$0 bound 2567 [Thread-5] INFO org.wikimedia.lsearch.interoperability.RMIServer  - RemoteSearchable<wikidb.links>$0 bound 2575 [Thread-5] INFO org.wikimedia.lsearch.interoperability.RMIServer  - RemoteSearchable<wikidb.related>$0 bound 2582 [Thread-5] INFO org.wikimedia.lsearch.interoperability.RMIServer  - RemoteSearchable<wikidb.spell>$0 bound 6879 [Thread-8] INFO org.wikimedia.lsearch.frontend.HttpMonitor  - HttpMonitor thread started 6881 [pool-2-thread-1] INFO org.wikimedia.lsearch.frontend.HttpHandler  - query:/search/wikidb/wind?namespaces=0%2C500&offset=0&limit=20&version=2.1&iwlimit=10 what:search dbname:wikidb term:wind 6919 [pool-2-thread-1] INFO org.wikimedia.lsearch.analyzers.StopWords  - Successfully loaded stop words for: [nl, en, it, fr, de, sv, es, no, pt, da] in 21 ms 7052 [pool-2-thread-1] INFO  org.wikimedia.lsearch.search.SearchEngine  - Using FilterWrapper wrap: {0, 500} [] java.lang.IllegalArgumentException: nDocs must be > 0 at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:110) at org.wikimedia.lsearch.search.WikiSearcher.search(WikiSearcher.java:184) at org.apache.lucene.search.Searcher.search(Searcher.java:132) at org.wikimedia.lsearch.search.SearchEngine.search(SearchEngine.java:722) at org.wikimedia.lsearch.search.SearchEngine.search(SearchEngine.java:129) at org.wikimedia.lsearch.frontend.SearchDaemon.processRequest(SearchDaemon.java:101) at org.wikimedia.lsearch.frontend.HttpHandler.handle(HttpHandler.java:193) at org.wikimedia.lsearch.frontend.HttpHandler.run(HttpHandler.java:114) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:637) 7076 [pool-2-thread-1] WARN org.wikimedia.lsearch.search.SearchEngine  - Retry, temporal error for query: [wind] on wikidb : nDocs must be > 0 java.lang.IllegalArgumentException: nDocs must be > 0 at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:110) at org.wikimedia.lsearch.search.WikiSearcher.search(WikiSearcher.java:184) at org.apache.lucene.search.Searcher.search(Searcher.java:132) at org.wikimedia.lsearch.search.SearchEngine.search(SearchEngine.java:722) at org.wikimedia.lsearch.search.SearchEngine.search(SearchEngine.java:129) at org.wikimedia.lsearch.frontend.SearchDaemon.processRequest(SearchDaemon.java:101) at org.wikimedia.lsearch.frontend.HttpHandler.handle(HttpHandler.java:193) at org.wikimedia.lsearch.frontend.HttpHandler.run(HttpHandler.java:114) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:637)

I also confirmed that the index built from the HEAD is fine. If I swap the LuceneSearch.jar from 2.1.2 binary and run it against indexes built from HEAD, it works. Processing query with LuceneSearch.jar from HEAD fails.

Other details: MediaWiki 1.15.1 (r50) MWSearch (Version r45173)

-- Looks like r48153 == 2.1.2 Or at least I was able to get that to successfully work with MediaWiki 1.15.1 --GregZ 04:06, 11 November 2009 (UTC)


 * Ah I see.. fixed in svn. Thanks for the report. --Rainman 11:23, 11 November 2009 (UTC)

Category search
Can anyone elaborate on the following from OVERVIEW.txt?

searching categories. Syntax is: query incategory:"exact category name". It is important to note that category names are themselves not tokenized. Using logical operators, intersection, union and difference of categories can be searched. Since exact category is needed (only case is not important), it is maybe best to incorporate this somewhere on category page, and have category name put into query by MediaWiki instead manually by user.

The incategory: syntax does not appear to work as described (in 2.1.2)

Also, what is meant by ...it is maybe best to incorporate this somewhere on category page, and have category name put into query by MediaWiki instead manually by user. ? Is the suggestion to put a search form on the Category page and insert the incategory: syntax there?


 * It does work, but only for categories that are not added via templates, but in main article text. E.g. . --Rainman 11:02, 10 November 2009 (UTC)

Ah ha. That explains why my incategory: query was not working. The categories were added via templates. This is caused because lucene-search indexes the wikitext for articles and does not resolve templates? Has there been discussion of a different index of the rendered html content instead of only indexing wikitext?


 * Yes, however, the current mediawiki architecture makes it difficult to do... in fact, what we would want is article not in html, but in wikitext with expanded templates... In any case, you won't find a search extension that does it. --Rainman 15:51, 10 November 2009 (UTC)


 * How about; Keep indexing wikitext, but access the MediaWiki database to determine category relationships. Maiden taiwan 18:30, 16 December 2009 (UTC)


 * On default mysql install that won't scale very well, otherwise it would be implemented a long time ago. --Rainman 22:31, 16 December 2009 (UTC)


 * Thanks. Can you explain why it wouldn't scale? Couldn't it be done while Lucene builds the index -- just read the Mediawiki database tables once and you're done?  Or maybe it could be dynamic like DPL does: it checks category membership dynamically just fine. Finally, even if it's slow, could you make this behavior an option and let the sysadmin decide whether it works on his/her site?  Right now "incategory" produces different results than Mediawiki Category pages do. That seems like a bug.... Thank you. --Maiden taiwan 01:06, 17 December 2009 (UTC)

Also, can you comment on this ...it is maybe best to incorporate this somewhere on category page, and have category name put into query by MediaWiki instead manually by user. ? Is the suggestion to put a search form on the Category page and insert the incategory: syntax there?


 * Well yes... if someone would have done it that would be nice.. You need to take into consideration that the file has probably been written at 2am on some sunday, and not take everything in it very seriously ;) --Rainman 15:51, 10 November 2009 (UTC)

I simply needed the "late-night fog" translation. :) --GregZ 04:03, 11 November 2009 (UTC)

Is incategory going to be fixed?
Is the problem of incategory and transcluded category tags planned to be fixed? If incategory is returning wrong results (missing all articles with transcluded category tags), people are going to see incomplete search results and make wrong decisions ("Well, I guess there are no articles that match my query..."). This just happened on our wiki. Can't the extension just look at the wiki database to discover category relationships? Thanks. Maiden taiwan 18:02, 16 December 2009 (UTC)


 * No, there are no plans to fix it. Lucene-search is primarily developed towards needs of WMF, and because this would never get enabled on WMF projects due to inefficiency it is not planned to be implemented. However, this is an open-source project and if you need this functionality you can make it yourself or pay someone to do it. --Rainman 12:52, 19 December 2009 (UTC)