Extension talk:Lucene-search
My team has two wikis: one (which we call 'public') holds information for consumption by others within the company, and the other (which we call the 'team' wiki) holds information to be viewed only by our team members. The team wiki requires the user to log in to view the articles. Both wikis are working fine by themselves, but I would like to set up a search feature, used only by team members, that searches both wikis simultaneously. Can this extension be configured to work this way?
Our multilanguage wiki farm (one database per wiki, one single code tree) has a lucene config (below) where each host searches only the content in its own database. How can we change this so one of the sites (say, "fr") searches not only its own database (fr_wikidb), but also a second database (en_wikidb)? Thanks.
[Database]
en_wikidb : (single) (spell,4,2) (language,en)
de_wikidb : (single) (spell,4,2) (language,de)
fr_wikidb : (single) (spell,4,2) (language,fr)

[Search-Group]
myhost : *

[Index]
myhost : *

[Index-Path]
<default> : /search

[OAI]
<default> : http://myhost/w/index.php
en_wikidb : http://en.myhost.net/w/index.php
de_wikidb : http://de.myhost.net/w/index.php
fr_wikidb : http://fr.myhost.net/w/index.php
Unfortunately, this is not possible at this moment. One workaround would be to dump both databases into one XML dump file and then create another "virtual" database with that, but then MediaWiki would also need to know how to properly handle these search results (in terms of providing the link to the correct site/article).
The script referenced in the text (also found in the archives) does not work for Ubuntu. There are two main issues.
- There is a typo in the "reload" section (stat-stop-daemon is not a valid command).
- The stop functionality does not work - in fact, it actually loads another copy of lsearchd and java.
Has anyone been able to overcome these issues and have a properly functioning script for Ubuntu?
Reference: LSearch Daemon Init Script for Ubuntu
Lucene does not index Meta tags like the ones created by Extension:MetaKeywordsTag. This was mentioned before, but no solution was provided. Apparently, Lucene can be modified to index meta tags, but I have not been able to find documentation on how to do this for MediaWiki. Has anyone found a solution to this?
Reference: Meta Tags
That would require modifying the Java code that parses the wikitext. Remember that this extension indexes raw wikitext. One workaround would be to just create an empty template Template:MetaTags, and then use it in articles {{MetaTags|meta1|meta2}}. Template parameters are indexed, even if they are ignored by the empty template.
Hello,
I hope this is the right place to ask this — i'm not actually sure which specific extension controls this functionality, so i'm sorry if i've posted this in error.
Anyway, i'm having trouble with the search suggestions provided through the search bar (both in the top right and on the search page); i am trying to get it set up to work like Wikipedia's, and although it's almost there, there are two issues that i can't resolve:
1. Accentless searching does not work in the suggestion box. On Wikipedia for example i can type in 'lubeck' and it will display the result 'Lübeck'; on my own wiki, as soon as i hit the 'u' in 'lubeck', the Lübeck result will disappear (unless i have a redirect from 'Lubeck', which leads to my second problem...).
2. Redirects are always shown in the suggestions box. On Wikipedia if i type in 'united states of mex' (all lower-case), the only result that is shown is 'United States of Mexico' (mixed case). This is how i want mine to work, but instead i get all of the redirects that match that text — like 'United states of mexico' (sentence case) and 'United States of Mexicans' and so on.
Assuming Lucene is what controls this (and please let me know if it's not; i have MWSearch and TitleKey installed as well, so i suppose one of those could come into it), what changes might i need to make to get this working properly?
Configuration info: MW 1.18, lucene-search 2.1.3, MWSearch MW1.18-r90287, TitleKey MW1.18-r81220, Debian 6
Thank you!
Yes, on WMF wikis this is controlled by lucene-search. Basically, what you need to do is to add something like this into your global settings:
[Database]
yourwiki: (prefix)

Then you can re-run the build script to build the prefix index as well. Next you need to tell MediaWiki to use lucene as the backend for prefix matches. This is done by adding the following to your LocalSettings.php:

# default host for mwsuggest backend
$wgEnableLucenePrefixSearch = true;
$wgLucenePrefixHost = '10.0.3.18'; # IP or hostname of your lucene box
For more info on WMF settings: http://noc.wikimedia.org/conf/
Hello,
I was wondering if it is possible to set up Lucene to index pdf files and make them searchable.
I have many pdf files in my wiki and would like them to be searchable. Is it possible to search for/find text within a pdf and then have a link to the pdf come up in the search results?
Thanks in advance for any help. I am not a programmer and would greatly appreciate any help or pointers for whether it's possible or how to set up such a search.
After a "yum update" on our Linux server, and a reboot, Lucene is no longer listening on port 8123. We have not changed any Lucene config files.
$ telnet localhost 8123
Trying 127.0.0.1...
telnet: connect to address 127.0.0.1: Connection refused
telnet: Unable to connect to remote host: Connection refused
lsearchd is running and is listening on port 8321 for incremental reindexes. Java is running as well. When I start lsearchd manually, it says:
sudo /usr/local/bin/lucene-run
RMI registry started.
Trying config file at path /root/.lsearch.conf
Trying config file at path /usr/local/lucene-search-2.1.3/lsearch.conf
0 [main] INFO org.wikimedia.lsearch.util.Localization - Reading localization for En
727 [main] INFO org.wikimedia.lsearch.interoperability.RMIServer - RMIMessenger bound
730 [Thread-1] INFO org.wikimedia.lsearch.frontend.HTTPIndexServer - Indexer started on port 8321
It definitely does NOT print the usual message about port 8123:
771 [Thread-2] INFO org.wikimedia.lsearch.frontend.SearchServer - Searcher started on port 8123
Any tips? Where do I start looking? This is a critical site for our business with thousands of users daily. Thanks.
--Maiden taiwan 03:54, 12 December 2011 (UTC)
I should mention that the "yum update" was NOT for lucene-search, nor for Java. Just for core CentOS Linux packages. Maiden taiwan 04:00, 12 December 2011 (UTC)
Changing the port number in lsearch.conf does not affect the problem. Maiden taiwan 04:07, 12 December 2011 (UTC)
Did your hostname change somehow? If there was a conflict, it would print out an error message. It seems like it doesn't even want to start a searcher because it might think this is not the right host to start it up?
The hostname is still the same. Maiden taiwan 12:43, 12 December 2011 (UTC)
Well, I don't know then. My hunch is that there is something wrong with how the hostname is understood. Have you tried calling java with:
-Djava.rmi.server.hostname=<your hostname, not localhost!>
And then use the same hostname in your configuration files?
Thanks for the tip. lsearchd currently runs this line:
java -Djava.rmi.server.codebase=file://$jardir/LuceneSearch.jar \
  -Djava.rmi.server.hostname=$HOSTNAME -jar $jardir/LuceneSearch.jar $*
and $HOSTNAME = the correct value: I ran "ps uax" and saw it. Maiden taiwan 16:01, 12 December 2011 (UTC)
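To narrow down a case like this one, it can help to confirm which of the two daemon ports actually accepts connections from the host MediaWiki runs on. The following is only a minimal sketch using the JDK: ports 8123 and 8321 are the ones mentioned in this thread, while the class name PortProbe and the 500 ms timeout are illustrative assumptions.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class PortProbe {
    /** Returns true if a TCP connection to host:port succeeds within timeoutMs. */
    public static boolean isListening(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Connection refused, timeout, unknown host, etc.
            return false;
        }
    }

    public static void main(String[] args) {
        // Host defaults to localhost; pass the real hostname to test name resolution too.
        String host = args.length > 0 ? args[0] : "localhost";
        int[] ports = {8123, 8321}; // searcher and indexer ports from this thread
        for (int port : ports) {
            String state = isListening(host, port, 500) ? "listening" : "not reachable";
            System.out.println(host + ":" + port + " -> " + state);
        }
    }
}
```

Running it once with localhost and once with the machine's real hostname may show whether the searcher is bound to a different interface than expected.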
Hi, we're trying to build a highly scalable Wikipedia search mirror, which is required to handle 10,000 search requests per second. I tried using the Lucene-search extension, but just couldn't get the average full-text search time to come down; our current average search time is around 500ms.
The variation in search time is large too, with some searches taking 4ms and others 2,000ms. These performance tests were carried out using JMeter with 10 user request threads and 500 loops. We've tried tuning the JVM memory, storing the whole index (built from the 20111007 Wikipedia snapshot XML dump) in RAM, and increasing/decreasing the SearcherPool.size parameter. So our question is whether this extension is designed to perform better speed-wise. Thanks very much
Have you split your index into frequently searched namespaces and other namespaces (e.g. main vs rest)? This should make the index much smaller and heavily decrease the median time. Some searches will take longer, however (if a user searches all of the content). Also note that 10k req/s is quite large. At Wikimedia we get maybe 500 req/s.
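Because the reported spread is so wide (4ms to 2,000ms), the mean alone may hide what is going on; comparing the median and a high percentile before and after splitting the index could be more telling. A minimal sketch of that calculation follows. The sample values are made up for illustration and the LatencyStats class is hypothetical, not part of lucene-search or JMeter.

```java
import java.util.Arrays;

public class LatencyStats {
    /** Nearest-rank percentile of the samples, for p in (0, 100]. */
    public static long percentile(long[] samplesMs, double p) {
        if (samplesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        // Nearest-rank: smallest value with at least p% of samples at or below it.
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Arithmetic mean of the samples. */
    public static double mean(long[] samplesMs) {
        return Arrays.stream(samplesMs).average().orElse(Double.NaN);
    }

    public static void main(String[] args) {
        // Illustrative latencies in milliseconds, not measured data.
        long[] samples = {4, 6, 9, 12, 15, 40, 120, 450, 900, 2000};
        System.out.println("mean   = " + mean(samples) + " ms");
        System.out.println("median = " + percentile(samples, 50) + " ms");
        System.out.println("p90    = " + percentile(samples, 90) + " ms");
    }
}
```

With a distribution like this, a few slow queries dominate the mean while the median stays low, which is the pattern the reply above suggests an index split would change.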
lucene-search ignores some namespaces by default, such as "Template", unless you prefix your search string with "all:". Is there a way to disable this behavior so all namespaces are searched always, even without "all:"?
I'm encountering a strange error message today. It was working fine before; the error started just after I restarted the server.
Following is the log from the server.
[pool-2-thread-13] ERROR org.wikimedia.lsearch.search.SearchEngine - Internal error in SearchEngine trying to make WikiSearcher: null

In the browser I got:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>Error: 500 Server error</title>
</head>
<body>
500 Server error
LSearch daemon on localhost
</body> </html>
I would really appreciate it if you could help me understand the error. Thanks in advance,
sincerely yours Babur
This is the whole log of the error.
3812760 [pool-2-thread-43] INFO org.wikimedia.lsearch.frontend.HttpHandler - query:/search/dev/Haferkleie?namespaces=1156&offset=0&limit=6&version=2.1&iwlimit=10 what:search dbname:dev term:Haferkleie
java.lang.NullPointerException
    at java.util.Hashtable.put(Hashtable.java:411)
    at org.wikimedia.lsearch.search.WikiSearcher.<init>(WikiSearcher.java:97)
    at org.wikimedia.lsearch.search.SearchEngine.search(SearchEngine.java:686)
    at org.wikimedia.lsearch.search.SearchEngine.search(SearchEngine.java:115)
    at org.wikimedia.lsearch.frontend.SearchDaemon.processRequest(SearchDaemon.java:92)
    at org.wikimedia.lsearch.frontend.HttpHandler.handle(HttpHandler.java:193)
    at org.wikimedia.lsearch.frontend.HttpHandler.run(HttpHandler.java:114)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:636)
3812762 [pool-2-thread-43] ERROR org.wikimedia.lsearch.search.SearchEngine - Internal error in SearchEngine trying to make WikiSearcher: null
java.lang.NullPointerException
    at java.util.Hashtable.put(Hashtable.java:411)
    at org.wikimedia.lsearch.search.WikiSearcher.<init>(WikiSearcher.java:97)
    at org.wikimedia.lsearch.search.SearchEngine.search(SearchEngine.java:686)
    at org.wikimedia.lsearch.search.SearchEngine.search(SearchEngine.java:115)
    at org.wikimedia.lsearch.frontend.SearchDaemon.processRequest(SearchDaemon.java:92)
    at org.wikimedia.lsearch.frontend.HttpHandler.handle(HttpHandler.java:193)
    at org.wikimedia.lsearch.frontend.HttpHandler.run(HttpHandler.java:114)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:636)
When I run ./build, I get an error message.
1327 [main] INFO org.wikimedia.lsearch.ranks.Links - Opening for read /opt/mediawiki/lucene-search-2.1.3/indexes/search/wiki.links
java.io.IOException: no segments* file found in org.apache.lucene.store.FSDirectory@/opt/mediawiki/lucene-search-2.1.3/indexes/search/wiki.links: files:
at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:92)
at org.wikimedia.lsearch.spell.SuggestBuilder.main(SuggestBuilder.java:98)
at org.wikimedia.lsearch.importer.BuildAll.main(BuildAll.java:124)
Caused by: org.xml.sax.SAXException: no segments* file found in org.apache.lucene.store.FSDirectory@/opt/mediawiki/lucene-search-2.1.3/indexes/search/wiki.links: files:
at org.mediawiki.importer.XmlDumpReader.endElement(XmlDumpReader.java:227)
at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanEndElement(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:88)
... 2 more
1346 [main] FATAL org.wikimedia.lsearch.spell.SuggestBuilder - I/O error reading dump for wiki from /opt/mediawiki/lucene-search-2.1.3/dumps/dump-wiki.xml : no segments* file found in org.apache.lucene.store.FSDirectory@/opt/mediawiki/lucene-search-2.1.3/indexes/search/wiki.links: files:
Following on from my earlier post on this forum and a helpful reply from Rainman, I managed to make the API-based search work from a Java program. I am posting the steps (perhaps clumsy in places, but working) here as they may be of interest to others. Setup:
- add LuceneSearch.jar on your classpath
- modify lsearch.conf, lsearch-global.conf and lsearch.log4j and put them to your project's folder
lsearch.conf
- set MWConfig.global to point to lsearch-global.conf
- set Indexes.path
- set Logging.logconfig to point to lsearch.log4j
lsearch-global.conf
- make sure that you point to localhost
code:
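// All classes below come from LuceneSearch.jar; URI and HashMap are java.net.URI and java.util.HashMap.
Configuration config = Configuration.open();
SearcherCache cache = SearcherCache.getInstance();
// The URL is only parsed for its query-string parameters (here: limit=3).
URI uri = new URI("http://localhost:8123/search/wiki/maradona?limit=3");
HashMap query = new QueryStringMap(uri);
// getVersion(...) is not defined in this snippet; it must be supplied by the caller.
double version = getVersion(query);
SearchEngine search = new SearchEngine();
IndexId iid = IndexId.get("wiki");
iid.forceMySearch();
// Wait until the searchers are deployed before querying (called twice, as in the original post).
cache.waitForInitialDeployment();
cache.getSearcherPoolStatus(iid);
cache.waitForInitialDeployment();
// Arguments: database name, request type, search term, parsed parameters, protocol version.
SearchResults res = search.search("wiki","search","maradona",query,version);
for (ResultSet title : res.getResults())
{
    String result = title.title;
    System.out.println(result);
}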
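The URL in the snippet above follows the pattern /search/<dbname>/<term>?<parameters>, which also appears in the daemon log earlier on this page. If the term comes from user input it probably needs escaping before being placed in the path. Below is a minimal sketch that builds such a URI with the JDK only. The class and method names are hypothetical, the host and port are the ones from the post, and how lsearchd decodes the term is not confirmed here.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class SearchUri {
    /** Builds http://host:port/search/dbname/term?limit=N with the term percent-encoded. */
    public static URI build(String host, int port, String dbname, String term, int limit) {
        // URLEncoder produces form encoding (space -> '+'); a path segment wants %20 instead.
        // A literal '+' in the term is already encoded as %2B, so the replace only hits spaces.
        String encodedTerm = URLEncoder.encode(term, StandardCharsets.UTF_8).replace("+", "%20");
        return URI.create("http://" + host + ":" + port
                + "/search/" + dbname + "/" + encodedTerm + "?limit=" + limit);
    }

    public static void main(String[] args) {
        System.out.println(build("localhost", 8123, "wiki", "maradona", 3));
        System.out.println(build("localhost", 8123, "wiki", "diego maradona", 3));
    }
}
```

The resulting URI can be passed to QueryStringMap as in the post, or fetched over HTTP from a running daemon.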
I'm encountering some strange behavior with the update script. If I run the build script, lucene properly indexes everything and it becomes searchable. If I run the update script, it's almost as if nothing happens. I've looked at the output of the script and it doesn't seem to indicate any obvious error. Thoughts?
Just a word of warning: when I called the configure script, I left the trailing / on the path to my wiki. I ran
- sudo configure /var/www/mediawiki/
instead of
- sudo configure /var/www/mediawiki
and then got all sorts of "missing file" messages...
I have one wiki running Lucene-search. I want to add another wiki to Lucene. Is there a manual on how to use several wikis with lucene?
I have read the docs and added the second database to lsearch-global.conf. But where do I configure the path to the second wiki installation, and the username and password for database access?
-- 08:58, 21 January 2011 (UTC)
I run a very large wiki (250,000 pages) and lucene often takes so long to search that the results page will load without showing any results. A subsequent search will show matches, since the results are in lucene's cache.
The problem seems to be with org.wikimedia.lsearch.spell.Suggest which can take up to 15000 ms to complete.
org.wikimedia.lsearch.search.SearchEngine seems to be ok, returning results in ~100ms. Is there any way to speed up/remove the spelling suggestions to increase speed?
Do you have enough RAM for linux to cache the whole spell-check index in memory? If not, you might be hitting I/O which slows down things around 100 times. The size of your wiki shouldn't be an issue (it works fine on much bigger wikis like en.wikipedia.org).
You can also try to decrease the size of the spellcheck index. Try settings like these:
yourwiki: (spell,40,10)
And then rebuild your spellcheck index. This will index only those words which appear in at least 40 articles, and phrases in at least 10 articles.
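To illustrate what the two numbers mean as described above (a minimum number of articles for words, and a separate minimum for phrases), here is a toy filter over document-frequency counts. It only mimics the thresholding idea and is not how lucene-search implements it; the class name, words and counts are all made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class SpellThreshold {
    /** Keeps only entries that appear in at least minArticles articles. */
    public static Set<String> keep(Map<String, Integer> articleCounts, int minArticles) {
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : articleCounts.entrySet()) {
            if (entry.getValue() >= minArticles) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical counts: term -> number of articles containing it.
        Map<String, Integer> words = new LinkedHashMap<>();
        words.put("search", 120);
        words.put("lucene", 40);
        words.put("teh", 3);
        Map<String, Integer> phrases = new LinkedHashMap<>();
        phrases.put("full text", 10);
        phrases.put("text serach", 2);

        // Thresholds corresponding to (spell,40,10).
        System.out.println("words kept:   " + keep(words, 40));
        System.out.println("phrases kept: " + keep(phrases, 10));
    }
}
```

Raising the thresholds drops rare entries (often typos), which is why the index gets smaller.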
We're indexing a MediaWiki instance using Lucene on a remote server. Over time, many subdirectories get created in indexes/update/wikidb.links/timestamp on the remote server. Do all these subdirectories need to be kept, or can some/all of them be deleted? Thanks.
Note: We're doing incremental updates using OAIRepository.
Only the last one is needed. It should delete the older ones. Are you using the latest (svn) version of the extension?
I am using lucene-search 2.1.3 on the index host.
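If the older snapshot directories are not being cleaned up automatically, a dry-run listing like the sketch below can show which ones would be candidates for removal. It assumes the subdirectory names sort chronologically when compared as strings, which should be checked against the real directory names first. The class name is hypothetical, and nothing is deleted.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class StaleSnapshots {
    /** Returns every subdirectory of parent except the one whose name sorts last. */
    public static List<Path> allButNewest(Path parent) throws IOException {
        try (Stream<Path> children = Files.list(parent)) {
            List<Path> dirs = children
                    .filter(Files::isDirectory)
                    .sorted(Comparator.comparing((Path p) -> p.getFileName().toString()))
                    .collect(Collectors.toList());
            if (dirs.isEmpty()) {
                return dirs;
            }
            // Assumption: the last name in sort order is the most recent snapshot.
            return new ArrayList<>(dirs.subList(0, dirs.size() - 1));
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass the directory that holds the timestamped subdirectories.
        Path parent = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path stale : allButNewest(parent)) {
            System.out.println("candidate for removal: " + stale);
        }
    }
}
```

Printing the candidates first, and deleting by hand only after checking them, avoids removing a snapshot that the indexer is still using.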
I have run MWSearch/Lucene/OAIrepo for about 6 months now and it has gone fairly smoothly, but within the last 2 weeks the search has started returning 0 results. I attempted an ./update and this did not resolve the issue. I've created a log for lucene via my init script, and I can see the search is dispatched to org.wikimedia.lsearch.search.SearchEngine, but I don't have much visibility into that portion of the application. If I delete the indexes dir and then do a build, this resolves the issue. We are running Linux/Apache with a cronjob to update the search index. I am mainly looking for additional troubleshooting steps I can take to possibly resolve this, as this is a knowledge base for a front-line support team and I'd rather not script a band-aid if I can avoid it.
Periodic fatal errors while rebuilding index - "no segments* file"... not solved yet?
I've happened to come across an error while rebuilding my index (it is set as a cronjob every 20 minutes in a rather small wiki), which I found to be recurring since almost 2 years ago: Periodic fatal errors while rebuilding index - "no segments* file".
Apparently, the rebuilding process created the /indexes/search/wiki.links as a folder instead of a symlink, which is why it is supposed to fail.
Has anybody found a solution that does not imply stopping the lsearchd daemon or deleting the index directory?
I see this quite often too, it happens at least once a day for us.
The only thing I can think of is that multiple cronjobs are trying to modify the same directory which leads to an inconsistency. Can you be sure this is not happening?
We have already checked it and, as in the above case of Maiden taiwan, we have just one cronjob running at a time. Although I'd hate to do that, I guess we may have to modify the rebuilding routine or create another cronjob to ensure the rebuilding process works properly, as we don't yet know why it sometimes fails when it is not supposed to.
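For the symptom described in this thread (wiki.links ending up as a real directory rather than a symlink), a cron job could at least detect the condition before or after a rebuild. Below is a minimal sketch using java.nio.file. The class name is hypothetical, and the default path is the one from the "no segments* file" error log earlier on this page, so adjust it to your installation.

```java
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.nio.file.Paths;

public class LinkCheck {
    public enum Kind { SYMLINK, DIRECTORY, MISSING, OTHER }

    /** Classifies the path itself, without following a symlink to its target. */
    public static Kind classify(Path path) {
        if (Files.isSymbolicLink(path)) {
            return Kind.SYMLINK;
        }
        if (!Files.exists(path, LinkOption.NOFOLLOW_LINKS)) {
            return Kind.MISSING;
        }
        if (Files.isDirectory(path, LinkOption.NOFOLLOW_LINKS)) {
            return Kind.DIRECTORY;
        }
        return Kind.OTHER;
    }

    public static void main(String[] args) {
        Path links = Paths.get(args.length > 0
                ? args[0]
                : "/opt/mediawiki/lucene-search-2.1.3/indexes/search/wiki.links");
        Kind kind = classify(links);
        System.out.println(links + " is " + kind);
        // A non-zero exit code lets a wrapping cron script react to the bad state.
        if (kind != Kind.SYMLINK) {
            System.exit(1);
        }
    }
}
```

This only reports the state; what to do when the check fails (for example skipping the rebuild and alerting someone) is left to the surrounding script.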
I am following the instructions to put the indexer on a different host. Let's say the original MediaWiki/Lucene host is called MW and the new index server is called IS. After installing Java, configuring stuff as documented, and copying my lucene-search directory tree from MW to IS, I start up lsearchd and get this warning:
0 [main] INFO org.wikimedia.lsearch.util.Localization - Reading localization for En
5 [main] WARN org.wikimedia.lsearch.util.Localization - Error processing message file at file:///var/www/html/w/languages/messages/MessagesEn.php
java.io.FileNotFoundException: /var/www/html/w/languages/messages/MessagesEn.php (No such file or directory)
This is clearly happening because MediaWiki is not installed on index server IS. However, lsearchd starts up fine, and when I run the update script on IS, the resulting index gets properly rsync'ed to host MW. Searches on MW seem to be working fine.
My question is: does host IS also need a full MediaWiki installation on it? Or just the file tree? Or if I do neither (as above), how serious is the warning message above? Thanks.