Extension talk:Lucene-search

From MediaWiki.org
Multiple Wikis, Multiple Databases, Single Search.

My team has two wikis: a 'public' wiki, where we place information for consumption by others within the company, and a 'team' wiki, where we place information to be viewed only by our team members. The team wiki requires users to log in to view articles. Both wikis work fine on their own, but I would like to set up a search feature, used only by team members, that searches both wikis simultaneously. Can this extension be configured to work this way?

Spafbi15:51, 30 January 2012

As answered in the previous thread: no.

Rainman23:30, 1 February 2012
 

Multilanguage wiki farm that searches multiple databases?

Our multilanguage wiki farm (one database per wiki, one single code tree) has a lucene config (below) where each host searches only the content in its own database. How can we change this so one of the sites (say, "fr") searches not only its own database (fr_wikidb), but also a second database (en_wikidb)? Thanks.

[Database]
en_wikidb : (single) (spell,4,2) (language,en)
de_wikidb : (single) (spell,4,2) (language,de)
fr_wikidb : (single) (spell,4,2) (language,fr)

[Search-Group]
myhost : *

[Index]
myhost : *

[Index-Path]
<default> : /search

[OAI]
<default> : http://myhost/w/index.php
en_wikidb : http://en.myhost.net/w/index.php
de_wikidb : http://de.myhost.net/w/index.php
fr_wikidb : http://fr.myhost.net/w/index.php
Maiden taiwan17:31, 27 January 2012

Unfortunately, this is not possible at this moment. One workaround would be to dump both databases into one XML dump file and then create another "virtual" database with that, but then MediaWiki would also need to know how to properly handle these search results (in terms of providing the link to the correct site/article).

Rainman15:47, 28 January 2012
 

Init.d script for Ubuntu

The script referenced in the text (also found in the archives) does not work for Ubuntu. There are two main issues.

  1. There is a typo in the "reload" section (stat-stop-daemon is not a valid command).
  2. The stop functionality does not work - in fact, it actually loads another copy of lsearchd and java.

Has anyone been able to overcome these issues and have a properly functioning script for Ubuntu?

Reference: LSearch Daemon Init Script for Ubuntu

Jlemley22:37, 26 January 2012
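
A stop action that ends up starting a second lsearchd can happen when the script does not track the daemon's PID. Below is a minimal sketch of a pidfile-based stop routine. It is not the script from the archives. It assumes the start action wrote the PID of the Java process to a pidfile; the function name stop_by_pidfile and the pidfile path are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: stop the process whose PID is stored in a pidfile.
# Usage: stop_by_pidfile /var/run/lsearchd.pid [timeout_seconds]
stop_by_pidfile() {
    local pidfile=$1 timeout=${2:-10} pid
    [ -r "$pidfile" ] || return 0            # no pidfile: nothing to stop
    pid=$(cat "$pidfile")
    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
        kill "$pid"                          # send SIGTERM
        # Wait up to $timeout seconds for the process to exit.
        while [ "$timeout" -gt 0 ] && kill -0 "$pid" 2>/dev/null; do
            sleep 1
            timeout=$((timeout - 1))
        done
        # Escalate to SIGKILL if it is still alive.
        kill -0 "$pid" 2>/dev/null && kill -9 "$pid"
    fi
    rm -f "$pidfile"
}
```

The stop branch of an init script would then only call stop_by_pidfile with the pidfile written at start, and never invoke lsearchd itself.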

Meta tag Indexing

Lucene does not index Meta tags like the ones created by Extension:MetaKeywordsTag. This was mentioned before, but no solution was provided. Apparently, Lucene can be modified to index meta tags, but I have not been able to find documentation on how to do this for MediaWiki. Has anyone found a solution to this?

Reference: Meta Tags

Jlemley22:45, 19 January 2012

That would require modifying the Java code that parses the wikitext. Remember that this extension indexes raw wikitext. One workaround would be to just create an empty template Template:MetaTags, and then use it in articles {{MetaTags|meta1|meta2}}. Template parameters are indexed, even if they are ignored by the empty template.

Rainman09:46, 20 January 2012

This works perfectly - thanks for the workaround, Rainman!

Jlemley15:05, 20 January 2012
 
 

Accentless searching and hiding of redirects in suggestions

Hello,

I hope this is the right place to ask this — i'm not actually sure which specific extension controls this functionality, so i'm sorry if i've posted this in error.

Anyway, i'm having trouble with the search suggestions provided through the search bar (both in the top right and on the search page); i am trying to get it set up to work like Wikipedia's, and although it's almost there, there are two issues that i can't resolve:

1. Accentless searching does not work in the suggestion box. On Wikipedia for example i can type in 'lubeck' and it will display the result 'Lübeck'; on my own wiki, as soon as i hit the 'u' in 'lubeck', the Lübeck result will disappear (unless i have a redirect from 'Lubeck', which leads to my second problem...).

2. Redirects are always shown in the suggestions box. On Wikipedia if i type in 'united states of mex' (all lower-case), the only result that is shown is 'United States of Mexico' (mixed case). This is how i want mine to work, but instead i get all of the redirects that match that text — like 'United states of mexico' (sentence case) and 'United States of Mexicans' and so on.

Assuming Lucene is what controls this (and please let me know if it's not; i have MWSearch and TitleKey installed as well, so i suppose one of those could come into it), what changes might i need to make to get this working properly?

Configuration info: MW 1.18, lucene-search 2.1.3, MWSearch MW1.18-r90287, TitleKey MW1.18-r81220, Debian 6

Thank you!

75.173.170.6502:35, 21 December 2011

Yes, on WMF wikis this is controlled by lucene-search. Basically, what you need to do is to add something like this into your global settings:

 [Database]
 yourwiki: (prefix)

Then you can re-run the build script to build the prefix index as well. Next you need to tell MediaWiki to use Lucene as the backend for prefix matches. This is done by adding the following to your LocalSettings.php:

 # default host for mwsuggest backend
 $wgEnableLucenePrefixSearch = true;
 $wgLucenePrefixHost = '10.0.3.18'; # IP or hostname of your lucene box

For more info on WMF settings: http://noc.wikimedia.org/conf/

Rainman00:24, 23 December 2011
 

Searching pdf files

Hello,

I was wondering if it is possible to set up Lucene to index pdf files and make them searchable.

I have many pdf files in my wiki and would like them to be searchable. Is it possible to search for/find text within a pdf and then have a link to the pdf come up in the search results?

Thanks in advance for any help. I am not a programmer and would greatly appreciate any help or pointers for whether it's possible or how to set up such a search.

18.111.86.6819:37, 18 December 2011

No, unfortunately this is not possible.

Rainman22:48, 19 December 2011
 

Help - port 8123 refusing all connections after Linux update & reboot

After a "yum update" on our Linux server, and a reboot, Lucene is no longer listening on port 8123. We have not changed any Lucene config files.

$ telnet localhost 8123
Trying 127.0.0.1...
telnet: connect to address 127.0.0.1: Connection refused
telnet: Unable to connect to remote host: Connection refused

lsearchd is running and is listening on port 8321 for incremental reindexes. Java is running as well. When I start lsearchd manually, it says:

sudo  /usr/local/bin/lucene-run
RMI registry started.
Trying config file at path /root/.lsearch.conf
Trying config file at path /usr/local/lucene-search-2.1.3/lsearch.conf
0    [main] INFO  org.wikimedia.lsearch.util.Localization  - Reading localization for En
727  [main] INFO  org.wikimedia.lsearch.interoperability.RMIServer  - RMIMessenger bound
730  [Thread-1] INFO  org.wikimedia.lsearch.frontend.HTTPIndexServer  - Indexer started on port 8321

It definitely does NOT print the usual message about port 8123:

771  [Thread-2] INFO  org.wikimedia.lsearch.frontend.SearchServer  - Searcher started on port 8123

Any tips? Where do I start looking? This is a critical site for our business with thousands of users daily. Thanks.

--Maiden taiwan 03:54, 12 December 2011 (UTC)

Maiden taiwan03:54, 12 December 2011
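
The two checks described in this post (is anything listening on the port, and did the log print the searcher start-up line) can be scripted. The following is a rough sketch, assuming bash and a log file that captures the lsearchd console output; the function names are hypothetical.

```shell
#!/bin/bash
# Succeed if something accepts TCP connections on host:port.
# Uses bash's /dev/tcp redirection, so it needs bash rather than plain sh.
port_is_open() {
    local host=$1 port=$2
    (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null
}

# Succeed if the given log file contains the searcher start-up message
# quoted in the post ("Searcher started on port ...").
searcher_started() {
    grep -q 'Searcher started on port' "$1"
}

# Example (paths are placeholders):
#   port_is_open localhost 8123 || echo "nothing listening on 8123"
#   searcher_started /path/to/lsearchd.log || echo "searcher never started"
```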

I should mention that the "yum update" was NOT for lucene-search, nor for Java. Just for core CentOS Linux packages. Maiden taiwan 04:00, 12 December 2011 (UTC)

Maiden taiwan04:00, 12 December 2011

Changing the port number in lsearch.conf does not affect the problem. Maiden taiwan 04:07, 12 December 2011 (UTC)

Maiden taiwan04:07, 12 December 2011

Did your hostname change somehow? If there was a conflict, it would print out an error message. It seems like it doesn't even want to start a searcher because it might think this is not the right host to start it up?

Rainman09:22, 12 December 2011

The hostname is still the same. Maiden taiwan 12:43, 12 December 2011 (UTC)

Maiden taiwan12:43, 12 December 2011

Well don't know then. My hunch is that there is something wrong with how the hostname is understood. Have you tried calling java with:

-Djava.rmi.server.hostname=<your hostname, not localhost!>

And then use the same hostname in your configuration files?

Rainman15:25, 12 December 2011
 

Thanks for the tip. lsearchd currently runs this line:

java -Djava.rmi.server.codebase=file://$jardir/LuceneSearch.jar \
-Djava.rmi.server.hostname=$HOSTNAME -jar $jardir/LuceneSearch.jar $*

and $HOSTNAME = the correct value: I ran "ps uax" and saw it. Maiden taiwan 16:01, 12 December 2011 (UTC)

Maiden taiwan16:01, 12 December 2011
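
One way to test the hostname hunch is to compare the machine's hostname with the hosts listed under [Search-Group] in lsearch-global.conf. The sketch below assumes the "host : databases" layout shown in the configuration examples elsewhere on this page. How lsearchd itself matches host names is not confirmed here, and the function names are hypothetical.

```shell
#!/bin/bash
# Print the host names listed in one section of lsearch-global.conf.
# Usage: section_hosts /path/to/lsearch-global.conf Search-Group
section_hosts() {
    local file=$1 section=$2
    awk -v want="[$section]" '
        /^[[:space:]]*#/  { next }                           # skip comments
        /^[[:space:]]*\[/ { in_section = ($1 == want); next } # section header
        in_section && /:/ {
            host = $0
            sub(/[[:space:]]*:.*/, "", host)                 # drop " : value"
            sub(/^[[:space:]]+/, "", host)                   # drop leading blanks
            if (host != "") print host
        }
    ' "$file"
}

# Succeed if the given host appears in the given section.
host_in_section() {
    section_hosts "$1" "$2" | grep -Fxq -- "$3"
}

# Example (the path is a placeholder):
#   host_in_section /path/to/lsearch-global.conf Search-Group "$(hostname)" \
#       || echo "this host is not listed under [Search-Group]"
```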
 
 
 
 
 

Speed performance tuning for Lucene-search extension

Hi, we're trying to build a highly scalable Wikipedia search mirror that must handle 10,000 search requests per second. I tried the Lucene-search extension, but couldn't get the average full-text search time down; our current average search time is around 500 ms.

The variance in search time is also large, with some searches taking 4 ms and others 2,000 ms. These performance tests were carried out using JMeter with 10 user request threads and 500 loops. We've tried tuning the JVM memory, storing the whole index (built from the 20111007 Wikipedia snapshot XML dump) in RAM, and increasing or decreasing the SearcherPool.size parameter. So our question is whether this extension is designed to perform better than this. Thanks very much.

207.171.191.6000:16, 3 December 2011

Have you split your index into frequently searched namespaces and other namespaces (e.g. main vs rest)? This should make the index much smaller and heavily decrease the median time. Some searches will take longer however (if a user searched all of the content). Also note that 10k req/s is quite large. At wikimedia we get maybe 500 req/s.

Rainman11:44, 3 December 2011

Thanks for the reply! Yeh, we tried splitting the indexes (main vs rest), but still couldn't get to our anticipated performance unless we use lots of hardware :).

207.171.191.6018:20, 5 December 2011
 
 

Searching all namespaces without "all:" prefix?

lucene-search ignores some namespaces by default, such as "Template", unless you prefix your search string with "all:". Is there a way to disable this behavior so all namespaces are searched always, even without "all:"?

Maiden taiwan20:51, 7 November 2011

Internal error in SearchEngine trying to make WikiSearcher: null

I'm encountering a strange error message today. Search was working fine before; the error started just after I restarted the server.

The following is the log from the server:

[pool-2-thread-13] ERROR org.wikimedia.lsearch.search.SearchEngine - Internal error in SearchEngine trying to make WikiSearcher: null

In the browser I got:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head> <title>Error: 500 Server error</title> </head>
<body>

500 Server error

Internal error in SearchEngine: null

LSearch daemon on localhost

</body> </html>

I'd really appreciate it if you could help me understand the error. Thanks in advance,

sincerely yours Babur

Bnucist10:58, 11 October 2011

This is the whole log of the error.

3812760 [pool-2-thread-43] INFO org.wikimedia.lsearch.frontend.HttpHandler - query:/search/dev/Haferkleie?namespaces=1156&offset=0&limit=6&version=2.1&iwlimit=10 what:search dbname:dev term:Haferkleie java.lang.NullPointerException

at java.util.Hashtable.put(Hashtable.java:411)
at org.wikimedia.lsearch.search.WikiSearcher.<init>(WikiSearcher.java:97)
at org.wikimedia.lsearch.search.SearchEngine.search(SearchEngine.java:686)
at org.wikimedia.lsearch.search.SearchEngine.search(SearchEngine.java:115)
at org.wikimedia.lsearch.frontend.SearchDaemon.processRequest(SearchDaemon.java:92)
at org.wikimedia.lsearch.frontend.HttpHandler.handle(HttpHandler.java:193)
at org.wikimedia.lsearch.frontend.HttpHandler.run(HttpHandler.java:114)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:636)

3812762 [pool-2-thread-43] ERROR org.wikimedia.lsearch.search.SearchEngine - Internal error in SearchEngine trying to make WikiSearcher: null java.lang.NullPointerException

at java.util.Hashtable.put(Hashtable.java:411)
at org.wikimedia.lsearch.search.WikiSearcher.<init>(WikiSearcher.java:97)
at org.wikimedia.lsearch.search.SearchEngine.search(SearchEngine.java:686)
at org.wikimedia.lsearch.search.SearchEngine.search(SearchEngine.java:115)
at org.wikimedia.lsearch.frontend.SearchDaemon.processRequest(SearchDaemon.java:92)
at org.wikimedia.lsearch.frontend.HttpHandler.handle(HttpHandler.java:193)
at org.wikimedia.lsearch.frontend.HttpHandler.run(HttpHandler.java:114)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:636)
Bnucist11:17, 11 October 2011
 

Bug when running ./build

When I run ./build, I get an error message.

1327 [main] INFO  org.wikimedia.lsearch.ranks.Links  - Opening for read /opt/mediawiki/lucene-search-2.1.3/indexes/search/wiki.links
java.io.IOException: no segments* file found in org.apache.lucene.store.FSDirectory@/opt/mediawiki/lucene-search-2.1.3/indexes/search/wiki.links: files:
        at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:92)
        at org.wikimedia.lsearch.spell.SuggestBuilder.main(SuggestBuilder.java:98)
        at org.wikimedia.lsearch.importer.BuildAll.main(BuildAll.java:124)
Caused by: org.xml.sax.SAXException: no segments* file found in org.apache.lucene.store.FSDirectory@/opt/mediawiki/lucene-search-2.1.3/indexes/search/wiki.links: files:
        at org.mediawiki.importer.XmlDumpReader.endElement(XmlDumpReader.java:227)
        at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanEndElement(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
        at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:88)
        ... 2 more
1346 [main] FATAL org.wikimedia.lsearch.spell.SuggestBuilder  - I/O error reading dump for wiki from /opt/mediawiki/lucene-search-2.1.3/dumps/dump-wiki.xml : no segments* file found in org.apache.lucene.store.FSDirectory@/opt/mediawiki/lucene-search-2.1.3/indexes/search/wiki.links: files:
217.91.108.10206:37, 26 September 2011

Try deleting the /opt/mediawiki/lucene-search-2.1.3/indexes/search/wiki.links directory before running build.

Rainman15:25, 27 September 2011
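
The deletion can be made conditional, so that a healthy index is left alone. This is a sketch only: it treats a directory with no segments* file as the broken state named in the error message, and the function name is hypothetical.

```shell
#!/bin/bash
# Succeed if $1 is a directory that contains no segments* file,
# i.e. the state the "no segments* file found" error complains about.
index_dir_missing_segments() {
    local dir=$1
    [ -d "$dir" ] || return 1
    set -- "$dir"/segments*      # stays a literal pattern if nothing matches
    [ ! -e "$1" ]
}

# Example, using the path from the error message above:
#   links=/opt/mediawiki/lucene-search-2.1.3/indexes/search/wiki.links
#   if index_dir_missing_segments "$links"; then rm -rf "$links"; fi
#   ./build
```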
 

API search from a Java program

Following on from my earlier post on this forum and Rainman's helpful reply, I managed to make the API-based search work from a Java program. I am posting the steps (perhaps clumsy in places, but working) here as they may be of interest to others. Setup:

  • add LuceneSearch.jar to your classpath
  • modify lsearch.conf, lsearch-global.conf and lsearch.log4j and put them in your project's folder

lsearch.conf

  • set MWConfig.global to point to lsearch-global.conf
  • set Indexes.path
  • set Logging.logconfig to point to lsearch.log4j

lsearch-global.conf

  • make sure that you point to localhost

code:

    // Load the local configuration (lsearch.conf, set up as described above).
    Configuration config = Configuration.open();
    SearcherCache cache = SearcherCache.getInstance();
    // Build the query parameter map (limit=3) from a search URI.
    URI uri = new URI("http://localhost:8123/search/wiki/maradona?limit=3");
    HashMap query = new QueryStringMap(uri);
    // getVersion() is a helper that is not shown in this post.
    double version = getVersion(query);
    SearchEngine search = new SearchEngine();
    IndexId iid = IndexId.get("wiki");
    iid.forceMySearch();
    cache.waitForInitialDeployment();
    cache.getSearcherPoolStatus(iid);
    cache.waitForInitialDeployment();
    // Search the "wiki" index for "maradona" and print each result title.
    SearchResults res = search.search("wiki","search","maradona",query,version);
    for (ResultSet title : res.getResults())
    {
        String result = title.title;
        System.out.println(result);
    }
kalten18:37, 9 September 2011

build vs update script

I'm encountering some strange behavior with the update script. If I run the build script, Lucene properly indexes everything and it becomes searchable. If I run the update script, it's almost as if nothing happens. I've looked at the output of the script and it doesn't seem to indicate any obvious error. Thoughts?

Jking 122:10, 31 August 2011

The update script will only work if OAIRepository extension is installed and configured. There should be an error somewhere about not being able to contact it, although it might be buried at the beginning of the log.

Rainman00:18, 1 September 2011
 

Simple mistake, complete failure of build script

Just a word of warning: when I called the configure script, I left the trailing / on the path to my wiki. I ran

sudo configure /var/www/mediawiki/

instead of

sudo configure /var/www/mediawiki

and then got all sorts of missing-file errors.

66.130.16.10019:37, 28 August 2011
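
If the path comes from a variable or from tab completion, the trailing slash can be stripped before it reaches the configure script. This is a small sketch; strip_trailing_slash is a hypothetical helper name.

```shell
#!/bin/bash
# Remove trailing slashes from a path, leaving "/" itself untouched.
strip_trailing_slash() {
    local path=$1
    while [ "$path" != "/" ] && [ "${path%/}" != "$path" ]; do
        path=${path%/}
    done
    printf '%s\n' "$path"
}

# Example:
#   sudo configure "$(strip_trailing_slash /var/www/mediawiki/)"
```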

search 2 wikis

I have one wiki running Lucene-search and want to add another wiki to Lucene. Is there a manual on how to use multiple wikis with Lucene?

I have read the docs and added the second database to lsearch-global.conf. But where do I configure the path to the second wiki installation, or the username and password for database access?

-- 08:58, 21 January 2011 (UTC)

Pdcemulator16:38, 31 March 2011

I have the very same problem. Were you able to solve it? Thanks! Fladei 18:45, 19 July 2011 (UTC)

Fladei18:45, 19 July 2011
 

Searching (spell.Suggest) too slow

I run a very large wiki (250,000 pages) and lucene often takes so long to search that the results page will load without showing any results. A subsequent search will show matches, since the results are in lucene's cache.

The problem seems to be with org.wikimedia.lsearch.spell.Suggest which can take up to 15000 ms to complete.

org.wikimedia.lsearch.search.SearchEngine seems to be ok, returning results in ~100ms. Is there any way to speed up/remove the spelling suggestions to increase speed?

62.94.26.717:04, 18 July 2011

Do you have enough RAM for linux to cache the whole spell-check index in memory? If not, you might be hitting I/O which slows down things around 100 times. The size of your wiki shouldn't be an issue (it works fine on much bigger wikis like en.wikipedia.org).

You can also try to decrease the size of the spellcheck index. Try settings like these:

 yourwiki: (spell,40,10)

And then rebuild your spellcheck index. This will index only those words which appear in at least 40 articles, and phrases in at least 10 articles.

Rainman22:12, 18 July 2011
 

wikidb.links subdirectories - must they be kept?

We're indexing a MediaWiki instance using Lucene on a remote server. Over time, many subdirectories get created in indexes/update/wikidb.links/timestamp on the remote server. Do all these subdirectories need to be kept, or can some/all of them be deleted? Thanks.

Maiden taiwan16:28, 6 July 2011

Note: We're doing incremental updates using OAIRepository.

Maiden taiwan16:29, 6 July 2011

Only the last one is needed. It should delete the older ones. Are you using the latest (svn) version of the extension?

Rainman21:19, 10 July 2011
 

I am using lucene-search 2.1.3 on the index host.

Maiden taiwan14:03, 14 July 2011

The auto-deletion worked when lucene-search was running on the same server as Mediawiki. Since we put lucene-search on a separate server, the deletion has stopped working. Any ideas?

In the meantime, I wrote a cron job to clean up.

Maiden taiwan14:13, 14 July 2011
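
Such a cleanup job could look roughly like the sketch below. It keeps only the subdirectory whose name sorts last, on the assumption that the timestamp names sort chronologically; that assumption and the function name prune_old_snapshots are not confirmed by this thread. It is safest to run it when no update is in progress.

```shell
#!/bin/bash
# Delete every subdirectory of $1 except the one whose name sorts last.
# Assumes the subdirectory names are timestamps that sort chronologically.
prune_old_snapshots() {
    local parent=$1 newest="" d
    for d in "$parent"/*/; do
        [ -d "$d" ] || continue          # glob matched nothing
        newest=$d                        # globs expand in sorted order
    done
    [ -n "$newest" ] || return 0         # nothing to prune
    for d in "$parent"/*/; do
        [ -d "$d" ] || continue
        [ "$d" = "$newest" ] || rm -rf -- "${d%/}"
    done
}

# Example, using the directory named in the first post of this thread:
#   prune_old_snapshots indexes/update/wikidb.links
```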
 
 
 

0 search results returned rebuild of index required

I have run MWSearch/Lucene/OAIRepository for about 6 months and it has gone fairly smoothly, but within the last 2 weeks the search has started returning 0 results. Running ./update did not resolve the issue. I've set up logging for Lucene via my init script, and I can see the search being dispatched to org.wikimedia.lsearch.search.SearchEngine, but I don't have much insight into that part of the application. Deleting the indexes directory and then running a build resolves the issue. We are running Linux/Apache with a cron job to update the search index. I am mainly looking for additional troubleshooting steps I can take to resolve this, as this is a knowledge base for a front-line support team and I'd rather not script a band-aid if I can avoid it.

Luckenbach17:38, 6 July 2011

This seems to be happening more often. Does anyone have any ideas where I should start?

Luckenbach20:51, 8 July 2011

I guess the first thing is to look at the Lucene logs. If I remember correctly, by default they go to the console where lsearchd was started.

Rainman21:20, 10 July 2011
 
 

Periodic fatal errors while rebuilding index - "no segments* file"... not solved yet?

I've happened to come across an error while rebuilding my index (it is set as a cronjob every 20 minutes in a rather small wiki), which I found to be recurring since almost 2 years ago: Periodic fatal errors while rebuilding index - "no segments* file".

Apparently, the rebuilding process created the /indexes/search/wiki.links as a folder instead of a symlink, which is why it is supposed to fail.

Has anybody found a solution that does not imply stopping the lsearchd daemon or deleting the index directory?

A.H.16:17, 7 June 2011

I see this quite often too, it happens at least once a day for us.

C.C.19:52, 8 June 2011

The only thing I can think of is that multiple cronjobs are trying to modify the same directory which leads to an inconsistency. Can you be sure this is not happening?

Rainman09:46, 9 June 2011
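
To rule out overlapping runs, the cron job can be wrapped in a lock. The following is a minimal sketch that uses an atomic mkdir as the lock. The name run_exclusively and the lock path are hypothetical, and a stale lock directory left behind by a killed job would have to be removed by hand.

```shell
#!/bin/bash
# Run a command only if no other run holds the lock directory.
# mkdir is atomic, so two jobs cannot both create the same directory.
run_exclusively() {
    local lockdir=$1 status=0
    shift
    if ! mkdir "$lockdir" 2>/dev/null; then
        echo "another run holds $lockdir, skipping" >&2
        return 75                        # EX_TEMPFAIL: try again later
    fi
    "$@" || status=$?                    # run the wrapped command
    rmdir "$lockdir"                     # release the lock
    return "$status"
}

# Example cron command (the lock path is a placeholder):
#   run_exclusively /tmp/lucene-build.lock ./build
```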

We see this all the time and there is definitely only one update operation running at a time.

Maiden taiwan19:06, 20 June 2011
 

We have already checked this and, as in Maiden taiwan's case above, we have just one cronjob running at a time. Although I'd hate to do that, I guess we may have to modify the rebuilding routine or create another cronjob to make sure the rebuild works properly, as we don't yet know why it sometimes fails when it is not supposed to.

A.H.10:17, 21 June 2011

I am doing the same thing, but have found that periodically Lucene just loses its cookies and can no longer search. I then kill the index and do a build, and it's all better.

99.107.241.16214:24, 6 July 2011
 
 
 
 

Does an indexing server need to be a mediawiki server?

I am following the instructions to put the indexer on a different host. Let's say the original MediaWiki/Lucene host is called MW and the new index server is called IS. After installing Java, configuring stuff as documented, and copying my lucene-search directory tree from MW to IS, I start up lsearchd and get this warning:

0    [main] INFO  org.wikimedia.lsearch.util.Localization  - Reading localization for En
5    [main] WARN  org.wikimedia.lsearch.util.Localization  - Error processing message file at
 file:///var/www/html/w/languages/messages/MessagesEn.php
java.io.FileNotFoundException: /var/www/html/w/languages/messages/MessagesEn.php
 (No such file or directory)

This is clearly happening because MediaWiki is not installed on index server IS. However, lsearchd starts up fine, and when I run the update script on IS, the resulting index gets properly rsync'ed to host MW. Searches on MW seem to be working fine.

My question is: does host IS also need a full MediaWiki installation on it? Or just the file tree? Or if I do neither (as above), how serious is the warning message above? Thanks.

Maiden taiwan18:21, 7 June 2011

The only thing the indexer needs is the message files, from which it gets the localized namespace names (so that it can parse the links in articles correctly). If you are using English, this warning shouldn't be a problem, as canonical names are supported even without the message file.

Rainman09:44, 9 June 2011
 