Extension talk:EzMwLucene

Is there an option for the database tables prefix?
My wiki uses prefixes for the tables in the database.

$wgDBprefix        = "mw";

Is there an option to set the prefix in the properties file?


 * Answer
 * 1) See the patch offer. Unfortunately, the package is currently not well maintained.
 * 2) Workaround: copy the database tables $wgDBprefix{page,revision,text,image} to page, revision, text, and image before starting loader.sh, and drop them afterwards.
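
The workaround in 2) could be scripted roughly as follows. This is only a sketch, not something shipped with the extension: the function names copy_sql and drop_sql are made up for illustration, and it assumes a MySQL database (CREATE TABLE ... LIKE is MySQL syntax) in which no unprefixed tables with these names exist yet.

```shell
#!/bin/sh
# Sketch only: copy_sql and drop_sql are hypothetical helper names.

# Print SQL that copies the prefixed MediaWiki tables to unprefixed names.
# $1 is the table prefix, i.e. the value of $wgDBprefix.
copy_sql() {
    prefix=$1
    for t in page revision text image; do
        printf 'CREATE TABLE `%s` LIKE `%s%s`;\n' "$t" "$prefix" "$t"
        printf 'INSERT INTO `%s` SELECT * FROM `%s%s`;\n' "$t" "$prefix" "$t"
    done
}

# Print SQL that removes the unprefixed copies again.
drop_sql() {
    for t in page revision text image; do
        printf 'DROP TABLE `%s`;\n' "$t"
    done
}
```

One would pipe the output of copy_sql mw into the mysql client before running loader.sh, and the output of drop_sql afterwards. Review the generated statements first, since the copies are plain snapshots and dropping a wrong table is not reversible.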

Do we not have to schedule a re-run of loader.sh?
The installation notes say that it only has to be run on startup. Do we not have to run this on a schedule to ensure that new documents and content are added to the index?

Search Inside Documents
I don't seem to be indexing the text inside MS Word, MS Excel, etc. documents. Is that possible?

Prerequisites
You should list the prerequisites needed to run this extension:

1. Java-JRE
2. PHP-Curl

Newer MediaWiki versions
Given that version 1.14 changed the way MediaWiki handles files, does anyone have any experiences getting this extension to work with it? thanks

The extension also fails in 1.15, for the same reason I believe. Making symbolic links to simulate the previous image folder structure seems to work. However, I'm getting loads of XML tag mismatch errors when using the search:

Warning: DOMDocument::loadXML [domdocument.loadxml]: Premature end of data in tag HTML line 2 in Entity, line: 50 in /export/home/www/default/wiki/extensions/EzMwLucene/EzMwLuceneSearchEngine.php on line 116

Notice: Trying to get property of non-object in /export/home/www/default/wiki/extensions/EzMwLucene/EzMwLuceneSearchEngine.php on line 147
 * What version of PHP is installed on the machine?--Gregra 05:55, 18 November 2009 (UTC)


 * How did you fix the images folder with symlinks? Can you pass along what you did to fix it? I also get the above DOMDocument error. Any more updates coming to EzMwLucene?--ftclausen 12 February 2010

Initial Errors
1. Try: ./service.sh start
Found JAVA=/usr/java/jre1.6.0_17/bin/java in JAVA_HOME=/
Starting EzMwLucene: start-stop-daemon: invalid option -- c
Try `start-stop-daemon --help' for more information.

2. Try: vs170019:/srv/search/lucene/server # ./service.sh start
dirname: missing operand
Try `dirname --help' for more information.
Found JAVA= in JAVA_HOME=
Cannot find a JRE or JDK. Please set JAVA_HOME to a >=1.2 JRE

Any solutions available?

I am getting the same error. I have shared hosting. Tisane 06:47, 28 March 2010 (UTC)

Why?
Why would I use this instead of Lucene? Is it easier to install? --Robinson Weijman 10:04, 4 March 2010 (UTC)
 * Two reasons:
 * Much easier to install, especially on a Windows box.
 * This extension can index uploaded files (pdf, doc) OOTB, while with Lucene you would really have to break a sweat to make it work.

--Gregra 10:08, 4 March 2010 (UTC)


 * Great, thanks for your prompt response. I suggest you add these two reasons to the article page.  --Robinson Weijman 10:03, 5 March 2010 (UTC)

New link to the guide
Where is the EzMwLucene Installation Guide? The link on the extension page is broken!


 * Yikes. It's still broken. Anyone got a viable link? -Whidmark 20:13, 5 April 2011 (UTC)

MW/Lucene Search - made easy?
The title of this extension suggests that, by following the guide, setting up the Lucene search engine for MediaWiki would be a relatively EASY or simple process.

It is not.

I thought it was difficult (certainly possible, and with good results) to set up Lucene Search the old way, but this new method introduces a whole new meaning to the words "BROKEN EXTENSION".

No real distribution awareness was undertaken by the author of this extension. The startup scripts outright don't work, as they're not UNIX files (can be fixed, I know). Don't bother. It's actually EASIER to follow your nose and go for the original MWSearch/Lucene Search setup guide. You will have to modify the instructions in order to get it working for your distribution, and probably your MW version, but it does work - I've got it running on 2 live production wikis right now.

This method throws nothing but Java errors and missing class warnings after I finally got it working at filesystem level, and got it talking to the database. I seriously cannot be bothered troubleshooting other people's bad or missing Java code, so I quit trying (after making sure that I didn't have any FILES missing). I *know* the other way works (original MWSearch/LuceneSearch), and I will attempt to get a definitive guide up regarding this. Until then, you're better off without EzMwLucene.

Version
Did you look at the version information and the fact that this is a beta extension? It has not been updated for more recent versions of MediaWiki. I cannot speak to the ease of the original LuceneSearch now, but two years ago when this extension was written it was not easy. I know that Brion Vibber liked this extension and was planning on integrating some of the features into LuceneSearch, but again, I don't know the status of it. --Cmreigrut 18:11, 10 June 2010 (UTC)

Running EzMWLucene under Debian
To run this extension automatically on Debian startup create the file  and paste the following code (I hope this is correct, since I'm not a Debian guru):
#!/bin/sh

JAVA=$JAVA_HOME/bin/java
EZMWLUCENE_HOME=/usr/share/lucene
EZMWLUCENE_PID=/var/run/ezmwlucene.pid

CLASSPATH="$EZMWLUCENE_HOME/ezmwlucene.jar:$EZMWLUCENE_HOME/lib/jetty-6.1.14.jar:$EZMWLUCENE_HOME/lib/jetty-util-6.1.14.jar:$EZMWLUCENE_HOME/lib/servlet-api-2.5-6.1.14.jar:$EZMWLUCENE_HOME/lib/commons-codec-1.3.jar:$EZMWLUCENE_HOME/lib/commons-httpclient-3.1.jar:$EZMWLUCENE_HOME/lib/commons-logging.jar:$EZMWLUCENE_HOME/lib/FontBox-0.1.0-dev.jar:$EZMWLUCENE_HOME/lib/lucene-core-2.4.0.jar:$EZMWLUCENE_HOME/lib/lucene-highlighter-2.4.0.jar:$EZMWLUCENE_HOME/lib/PDFBox-0.7.3.jar:$EZMWLUCENE_HOME/lib/poi-3.5-beta3-20080926.jar:$EZMWLUCENE_HOME/lib/poi-scratchpad-3.5-beta3-20080926.jar"

JAVA_OPTIONS="$JAVA_OPTIONS -Dezmwlucene.home=$EZMWLUCENE_HOME -Djava.io.tmpdir=$TMP"

RUN_ARGS="$JAVA_OPTIONS -cp $CLASSPATH net.sourceforge.ezmwlucene.service.EzMwLuceneService"

case "$1" in
    start)
        echo -n "Starting EzMwLucene: "
        start-stop-daemon --start --pidfile $EZMWLUCENE_PID -d $EZMWLUCENE_HOME -b -m -a $JAVA -- $RUN_ARGS
        exit 0
        ;;

    stop)
        echo -n "Stopping EzMwLucene: "
        start-stop-daemon --stop --pidfile $EZMWLUCENE_PID -d $EZMWLUCENE_HOME -a $JAVA -s HUP
        exit 0
        ;;

    restart)
        start-stop-daemon --stop --pidfile $EZMWLUCENE_PID -d $EZMWLUCENE_HOME -a $JAVA -s HUP
        start-stop-daemon --start --pidfile $EZMWLUCENE_PID -d $EZMWLUCENE_HOME -b -m -a $JAVA -- $RUN_ARGS
        exit 0
        ;;

    *)
        echo "Usage: $0 {start|stop|restart} "
        exit 1
        ;;
esac

exit 0

Thanks for this, that's great. I had to set JAVA=$JAVA_HOME/usr/bin/java to get it to work on a Turnkey Linux MediaWiki Virtual Machine.
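
Since the java binary sits in different places on different systems (see also the "Cannot find a JRE or JDK" report under Initial Errors), the JAVA line of the script could be replaced by a lookup along these lines. This is a sketch, not part of the extension; find_java is a made-up helper name, and it assumes that falling back to whatever java is on PATH is acceptable.

```shell
#!/bin/sh
# Sketch only: find_java is a hypothetical helper, not part of EzMwLucene.

# Print the path of a usable java binary, or return 1 if none is found.
# $JAVA_HOME/bin/java is tried first, then whatever "java" is on PATH.
find_java() {
    if [ -n "${JAVA_HOME:-}" ] && [ -x "$JAVA_HOME/bin/java" ]; then
        printf '%s\n' "$JAVA_HOME/bin/java"
        return 0
    fi
    if command -v java >/dev/null 2>&1; then
        command -v java
        return 0
    fi
    return 1
}
```

In the init script this would be used as JAVA=$(find_java) || exit 1, so that the script stops early instead of handing an empty path to start-stop-daemon.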

Alterego's bugs
These are the bugs I've found with this extension using Kubuntu Jaunty Jackal and MediaWiki 1.16alpha (r50407) --Alterego 00:31, 10 May 2009 (UTC)

Search does not return all results
If you search for "Android", you expect to see all Android.* pages, but it only returns Android.NPE.

If you search for "javadoc", you expect to see "javadoc.url", but it returns no results. Searching for the whole term "javadoc.url" does return the expected result.

It seems that partial-word search does not work well for any word containing a dot.

Is it a known bug, or a configuration I can change?

Problem with Searching Japanese Characters
If I search Japanese characters, I always get "No page text matches" even though the pages exist.

If one of the Search results (by searching English words) contains Japanese characters, they all show as question marks (?????????).

Is this a bug, or a problem of my configuration? Any idea would be greatly appreciated.

Searching for an empty string
2009-05-09 18:21:05.580::WARN: /query
net.sourceforge.ezmwlucene.helper.LuceneHelperException: org.apache.lucene.queryParser.ParseException: Cannot parse ' AND (namespaceid:6)': Encountered " ")" ") "" at line 1, column 1. Was expecting one of:  ...   "+" ...    "-" ...    "(" ...    "*" ...     ...     ...     ...     ...    "[" ...    "{" ...     ...     ...    "*" ...
    at net.sourceforge.ezmwlucene.helper.LuceneQueryHelper.query(LuceneQueryHelper.java:228)
    at net.sourceforge.ezmwlucene.web.QueryServlet.doQuery(QueryServlet.java:123)
    at net.sourceforge.ezmwlucene.web.QueryServlet.doGet(QueryServlet.java:105)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
    at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:363)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
    at org.mortbay.jetty.handler.HandlerList.handle(HandlerList.java:49)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    at org.mortbay.jetty.Server.handle(Server.java:324)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
    at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
    at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
    at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)
Caused by: org.apache.lucene.queryParser.ParseException: Cannot parse ' AND (namespaceid:6)': Encountered " ")" ") "" at line 1, column 1. Was expecting one of:  ...   "+" ...    "-" ...    "(" ...    "*" ...     ...     ...     ...     ...    "[" ...    "{" ...     ...     ...    "*" ...
    at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:175)
    at net.sourceforge.ezmwlucene.helper.LuceneQueryHelper.query(LuceneQueryHelper.java:158)
    ... 18 more

workaround

This can be overcome by adding the following code to extensions/EzMwLucene/EzMwLuceneSearchEngine.php, just inside the searchText method (before any other code). I am not good with diff; if someone can make a real patch to illustrate this, it may be helpful.

if (empty($term)) {
    wfDebugLog('EzMwLucene', "SearchEngine::searchText: term is empty, returning a blank result set");
    return new EzMwLuceneSearchResultSet(false);
}

Loading Special:Search
The following error is repeated 25 times after first loading Special:Search (searches do work, however)



This can be suppressed by setting

error_reporting = E_COMPILE_ERROR|E_ERROR|E_CORE_ERROR

in php.ini


 * Thanks! This method also fixes my warning messages below on the Special:Search page.


 * Warning: DOMDocument::loadXML [domdocument.loadxml]: Premature end of data in tag PREFIXTERM line 105 in Entity, line: 157 in $IP/extensions/EzMwLucene/EzMwLuceneSearchEngine.php on line 116


 * Warning: DOMDocument::loadXML [domdocument.loadxml]: Premature end of data in tag TERM line 104 in Entity, line: 157 in $IP/extensions/EzMwLucene/EzMwLuceneSearchEngine.php on line 116


 * Warning: DOMDocument::loadXML [domdocument.loadxml]: Premature end of data in tag QUOTED line 103 in Entity, line: 157 in $IP/extensions/EzMwLucene/EzMwLuceneSearchEngine.php on line 116

Distribution does not include README/INSTALL files
The documentation for this distribution should be included in the package itself, not just on SourceForge which is extremely inconvenient. http://sourceforge.net/docman/display_doc.php?docid=179038&group_id=254413


 * It would also be acceptable to put the documentation on this wiki page, and have the README link here. It would be really nice to have everything hosted in Wikimedia's svn. --Ryan lane 14:43, 11 May 2009 (UTC)

loader.sh cannot open all files because of mishandling unicode characters
61 grey:.../EzMwLucene/server# sh loader.sh java.io.FileNotFoundException: /var/www/mediawiki/sites/ccnlab/images/d/dd/BrouilletCondÃ©BealEtAl99.pdf (No such file or directory) at java.io.FileInputStream.open(Native Method) at java.io.FileInputStream. (FileInputStream.java:137) at java.io.FileInputStream. (FileInputStream.java:96) at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:87) at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:178) at java.net.URL.openStream(URL.java:1027) at net.sourceforge.ezmwlucene.AttachmentExtractor.extract(AttachmentExtractor.java:182) at net.sourceforge.ezmwlucene.MediawikiArticleFactory.setAttachment(MediawikiArticleFactory.java:256) at net.sourceforge.ezmwlucene.MediawikiArticleFactory.fromResultSet(MediawikiArticleFactory.java:192) at net.sourceforge.ezmwlucene.util.DatabaseLoader.load(DatabaseLoader.java:138) at net.sourceforge.ezmwlucene.util.DatabaseLoader.main(DatabaseLoader.java:57) java.io.FileNotFoundException: /var/www/mediawiki/sites/ccnlab/images/c/c7/Carrillo-ReidTecuapetlaIbÃ¡Ã±ez-SandovalEtAl09.pdf (No such file or directory) at java.io.FileInputStream.open(Native Method) at java.io.FileInputStream. (FileInputStream.java:137) at java.io.FileInputStream. 
(FileInputStream.java:96) at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:87) at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:178) at java.net.URL.openStream(URL.java:1027) at net.sourceforge.ezmwlucene.AttachmentExtractor.extract(AttachmentExtractor.java:182) at net.sourceforge.ezmwlucene.MediawikiArticleFactory.setAttachment(MediawikiArticleFactory.java:256) at net.sourceforge.ezmwlucene.MediawikiArticleFactory.fromResultSet(MediawikiArticleFactory.java:192) at net.sourceforge.ezmwlucene.util.DatabaseLoader.load(DatabaseLoader.java:138) at net.sourceforge.ezmwlucene.util.DatabaseLoader.main(DatabaseLoader.java:57) java.io.FileNotFoundException: /var/www/mediawiki/sites/ccnlab/images/d/d8/CepedaWuAndrÃ©EtAl07.pdf (No such file or directory) at java.io.FileInputStream.open(Native Method) at java.io.FileInputStream. (FileInputStream.java:137) at java.io.FileInputStream. (FileInputStream.java:96) at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:87) at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:178) at java.net.URL.openStream(URL.java:1027) at net.sourceforge.ezmwlucene.AttachmentExtractor.extract(AttachmentExtractor.java:182) at net.sourceforge.ezmwlucene.MediawikiArticleFactory.setAttachment(MediawikiArticleFactory.java:256) at net.sourceforge.ezmwlucene.MediawikiArticleFactory.fromResultSet(MediawikiArticleFactory.java:192) at net.sourceforge.ezmwlucene.util.DatabaseLoader.load(DatabaseLoader.java:138) at net.sourceforge.ezmwlucene.util.DatabaseLoader.main(DatabaseLoader.java:57) java.io.FileNotFoundException: /var/www/mediawiki/sites/ccnlab/images/4/4d/Dbf2007.pdf (No such file or directory) at java.io.FileInputStream.open(Native Method) at java.io.FileInputStream. (FileInputStream.java:137) at java.io.FileInputStream. 
(FileInputStream.java:96) at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:87) at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:178) at java.net.URL.openStream(URL.java:1027) at net.sourceforge.ezmwlucene.AttachmentExtractor.extract(AttachmentExtractor.java:182) at net.sourceforge.ezmwlucene.MediawikiArticleFactory.setAttachment(MediawikiArticleFactory.java:256) at net.sourceforge.ezmwlucene.MediawikiArticleFactory.fromResultSet(MediawikiArticleFactory.java:192) at net.sourceforge.ezmwlucene.util.DatabaseLoader.load(DatabaseLoader.java:138) at net.sourceforge.ezmwlucene.util.DatabaseLoader.main(DatabaseLoader.java:57) java.io.FileNotFoundException: /var/www/mediawiki/sites/ccnlab/images/f/fe/DÂ’AngeloDeZeeuw08.pdf (No such file or directory) at java.io.FileInputStream.open(Native Method) at java.io.FileInputStream. (FileInputStream.java:137) at java.io.FileInputStream. (FileInputStream.java:96) at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:87) at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:178) at java.net.URL.openStream(URL.java:1027) at net.sourceforge.ezmwlucene.AttachmentExtractor.extract(AttachmentExtractor.java:182) at net.sourceforge.ezmwlucene.MediawikiArticleFactory.setAttachment(MediawikiArticleFactory.java:256) at net.sourceforge.ezmwlucene.MediawikiArticleFactory.fromResultSet(MediawikiArticleFactory.java:192) at net.sourceforge.ezmwlucene.util.DatabaseLoader.load(DatabaseLoader.java:138) at net.sourceforge.ezmwlucene.util.DatabaseLoader.main(DatabaseLoader.java:57) Found a TextHeaderAtom not followed by a TextBytesAtom or TextCharsAtom: Followed by 3999 java.lang.NullPointerException at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:194) at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:182) at org.pdfbox.pdmodel.PDDocumentCatalog.getAllPages(PDDocumentCatalog.java:226) at 
org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216) at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149) at net.sourceforge.ezmwlucene.AttachmentExtractor.getPdfText(AttachmentExtractor.java:259) at net.sourceforge.ezmwlucene.AttachmentExtractor.extract(AttachmentExtractor.java:227) at net.sourceforge.ezmwlucene.AttachmentExtractor.extract(AttachmentExtractor.java:183) at net.sourceforge.ezmwlucene.MediawikiArticleFactory.setAttachment(MediawikiArticleFactory.java:256) at net.sourceforge.ezmwlucene.MediawikiArticleFactory.fromResultSet(MediawikiArticleFactory.java:192) at net.sourceforge.ezmwlucene.util.DatabaseLoader.load(DatabaseLoader.java:138) at net.sourceforge.ezmwlucene.util.DatabaseLoader.main(DatabaseLoader.java:57) java.lang.NullPointerException at org.apache.poi.hslf.model.SimpleShape.getClientRecords(SimpleShape.java:322) at org.apache.poi.hslf.model.SimpleShape.getClientDataRecord(SimpleShape.java:307) at org.apache.poi.hslf.model.TextShape.getPlaceholderAtom(TextShape.java:547) at org.apache.poi.hslf.model.Sheet.getPlaceholder(Sheet.java:408) at org.apache.poi.hslf.model.HeadersFooters.isVisible(HeadersFooters.java:244) at org.apache.poi.hslf.model.HeadersFooters.isHeaderVisible(HeadersFooters.java:148) at org.apache.poi.hslf.extractor.PowerPointExtractor.getText(PowerPointExtractor.java:173) at org.apache.poi.hslf.extractor.PowerPointExtractor.getText(PowerPointExtractor.java:144) at net.sourceforge.ezmwlucene.AttachmentExtractor.getPowerpointText(AttachmentExtractor.java:315) at net.sourceforge.ezmwlucene.AttachmentExtractor.extract(AttachmentExtractor.java:221) at net.sourceforge.ezmwlucene.AttachmentExtractor.extract(AttachmentExtractor.java:183) at net.sourceforge.ezmwlucene.MediawikiArticleFactory.setAttachment(MediawikiArticleFactory.java:256) at net.sourceforge.ezmwlucene.MediawikiArticleFactory.fromResultSet(MediawikiArticleFactory.java:192) at 
net.sourceforge.ezmwlucene.util.DatabaseLoader.load(DatabaseLoader.java:138) at net.sourceforge.ezmwlucene.util.DatabaseLoader.main(DatabaseLoader.java:57) java.io.FileNotFoundException: /var/www/mediawiki/sites/ccnlab/images/9/9f/Pauli_oreilly.pdf (No such file or directory) at java.io.FileInputStream.open(Native Method) at java.io.FileInputStream. (FileInputStream.java:137) at java.io.FileInputStream. (FileInputStream.java:96) at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:87) at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:178) at java.net.URL.openStream(URL.java:1027) at net.sourceforge.ezmwlucene.AttachmentExtractor.extract(AttachmentExtractor.java:182) at net.sourceforge.ezmwlucene.MediawikiArticleFactory.setAttachment(MediawikiArticleFactory.java:256) at net.sourceforge.ezmwlucene.MediawikiArticleFactory.fromResultSet(MediawikiArticleFactory.java:192) at net.sourceforge.ezmwlucene.util.DatabaseLoader.load(DatabaseLoader.java:138) at net.sourceforge.ezmwlucene.util.DatabaseLoader.main(DatabaseLoader.java:57) java.io.FileNotFoundException: /var/www/mediawiki/sites/ccnlab/images/9/99/PisellaGrÃ©aTiliketeEtAl00.pdf (No such file or directory) at java.io.FileInputStream.open(Native Method) at java.io.FileInputStream. (FileInputStream.java:137) at java.io.FileInputStream. 
(FileInputStream.java:96) at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:87) at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:178) at java.net.URL.openStream(URL.java:1027) at net.sourceforge.ezmwlucene.AttachmentExtractor.extract(AttachmentExtractor.java:182) at net.sourceforge.ezmwlucene.MediawikiArticleFactory.setAttachment(MediawikiArticleFactory.java:256) at net.sourceforge.ezmwlucene.MediawikiArticleFactory.fromResultSet(MediawikiArticleFactory.java:192) at net.sourceforge.ezmwlucene.util.DatabaseLoader.load(DatabaseLoader.java:138) at net.sourceforge.ezmwlucene.util.DatabaseLoader.main(DatabaseLoader.java:57) java.io.FileNotFoundException: /var/www/mediawiki/sites/ccnlab/images/3/3d/RamÃ­rez-LugoZavala-VegaBermÃºdez-Rattoni06.pdf (No such file or directory) at java.io.FileInputStream.open(Native Method) at java.io.FileInputStream. (FileInputStream.java:137) at java.io.FileInputStream. (FileInputStream.java:96) at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:87) at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:178) at java.net.URL.openStream(URL.java:1027) at net.sourceforge.ezmwlucene.AttachmentExtractor.extract(AttachmentExtractor.java:182) at net.sourceforge.ezmwlucene.MediawikiArticleFactory.setAttachment(MediawikiArticleFactory.java:256) at net.sourceforge.ezmwlucene.MediawikiArticleFactory.fromResultSet(MediawikiArticleFactory.java:192) at net.sourceforge.ezmwlucene.util.DatabaseLoader.load(DatabaseLoader.java:138) at net.sourceforge.ezmwlucene.util.DatabaseLoader.main(DatabaseLoader.java:57) Your document seemed to be mostly unicode, but the section definition was in bytes! Trying anyway, but things may well go wrong!
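
To see in advance which uploads are likely to hit this, something like the following could list the files whose names contain non-ASCII bytes. It is only a diagnostic sketch: list_non_ascii_names is a made-up name, and the assumption that every non-ASCII file name is affected is mine, not confirmed by the traces above (some of the failing files, such as Dbf2007.pdf, have plain ASCII names, so this will not explain every failure).

```shell
#!/bin/sh
# Sketch only: list_non_ascii_names is a hypothetical helper.

# List files under a directory whose path contains bytes outside
# printable ASCII. LC_ALL=C makes grep compare byte by byte.
list_non_ascii_names() {
    dir=$1
    find "$dir" -type f | LC_ALL=C grep '[^ -~]'
}
```

For the setup in the trace this would be run as list_non_ascii_names /var/www/mediawiki/sites/ccnlab/images. The function returns a non-zero status when nothing matches, because that is what grep does.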

Special:Upload broken on new install


This can be fixed by applying the following patch:

--- EzMwLuceneIndexer.php.old  2009-05-08 16:26:28.000000000 -0500
+++ EzMwLuceneIndexer.php      2009-05-08 16:26:00.000000000 -0500
@@ -38,7 +38,7 @@
              $xml .= " <![CDATA[{$revision->getRawText}]]> ";
 /* If the article is in the Image namespace, add the file url */
 if ($article->getTitle->getNamespace==6) {
-                      $file =  wfFindFile($article->getTitle);
+                      $file = wfFindFile($article->getTitle,false,0,true);
 $xml .= " <![CDATA[{$file->getFullUrl}]]> ";
 }
              $xml .= ' ';


 * Instead of applying the "patch" with some command like "$ patchphp EzMwLuceneIndexer.php < the_code_above.php", I simply edited the EzMwLuceneIndexer.php file and replaced the single line marked "-" with the line marked "+".
 * It works!

Extension is mispackaged
The correct way to pack a tarball is such that when you unpack it, it contains the following directory structure:

EzMwLucene_1.0/
    client/
    server/

It is rude to put client and server at the top level of the package.

More unicode mishandling

 * WARNING: Funny characters in title SchrauwenD???HaeneVerstraetenEtAl07
 * 14000 row(s) processed
 * 14100 row(s) processed
 * 14200 row(s) processed
 * WARNING: Funny characters in title SerinoGiovagnoliL??davas09
 * 14300 row(s) processed
 * 14400 row(s) processed
 * 14500 row(s) processed
 * 14600 row(s) processed
 * 14700 row(s) processed
 * 14800 row(s) processed
 * 14900 row(s) processed
 * 15000 row(s) processed
 * 15100 row(s) processed
 * WARNING: Funny characters in title StruffertK??hrmannEngelhornEtAl09
 * 15200 row(s) processed
 * 15300 row(s) processed
 * WARNING: Funny characters in title TamosiunaiteAsfourW??rg??tter09
 * 15400 row(s) processed
 * WARNING: Funny characters in title TanakaBalleineO???Doherty08
 * 15500 row(s) processed
 * 15600 row(s) processed
 * 15700 row(s) processed
 * 15800 row(s) processed
 * 15900 row(s) processed
 * WARNING: Funny characters in title UrbanoLeznikLlin??s07
 * WARNING: Funny characters in title ValentinDickinsonO???Doherty07
 * 16000 row(s) processed
 * 16100 row(s) processed
 * WARNING: Funny characters in title VerstraetenSchrauwenD??HaeneEtAl07
 * 16200 row(s) processed
 * 16300 row(s) processed
 * 16400 row(s) processed
 * 16500 row(s) processed
 * 16600 row(s) processed
 * 16700 row(s) processed
 * WARNING: Funny characters in title WikiPapers/log/Kov????csMehler09
 * 16800 row(s) processed
 * WARNING: Funny characters in title WikiPapers/log/SerinoGiovagnoliL??davas09
 * WARNING: Funny characters in title WikiPapers/log/SotoFunesGuzm????n-Garc????aEtAl09
 * WARNING: Funny characters in title WikiPapers/log/TanakaBalleineO???Doherty08
 * WARNING: Funny characters in title WikiPapers/log/ValentinDickinsonO???Doherty07
 * WARNING: Funny characters in title WikiPapers/log/WinklerH??denLadinigEtAl
 * WARNING: Funny characters in title WinklerH??denLadinigEtAl
 * 16900 row(s) processed
 * 17000 row(s) processed
 * 17100 row(s) processed
 * 17200 row(s) processed
 * 17300 row(s) processed
 * 17400 row(s) processed
 * 17500 row(s) processed
 * WARNING: Funny characters in title BrouilletCond??BealEtAl99.pdf
 * 17600 row(s) processed
 * WARNING: Funny characters in title Carrillo-ReidTecuapetlaIb????ez-SandovalEtAl09.pdf
 * WARNING: Funny characters in title CepedaWuAndr??EtAl07.pdf

service.sh always fails (with FAIL)
Running $RUN_CMD from the command-line directly, and then backgrounding it and disowning it, works just fine. That bash script you borrowed from another software package simply does not work on my OS, however. The shell script looks portable and expertly written; nonetheless, it depends on subtle run conditions that are not easy to debug and do not appear to be useful. Modern Linux operating systems do not need a shell script to start/stop/restart processes - that is a job for the system to handle.

I start the process in the following way. Note that after the long java command comes "& disown %". This tells bash to background the process and disown it. You should put that in a file in /etc/init.d so that it gets started with your system.


 * .../EzMwLucene/server# /usr/lib/jvm/java-6-sun-1.6.0.10/jre/bin/java -cp "/var/www/ccnlab/extensions/EzMwLucene/server/ezmwlucene.jar:/var/www/ccnlab/extensions/EzMwLucene/server/lib/jetty-6.1.14.jar:/var/www/ccnlab/extensions/EzMwLucene/server/lib/jetty-util-6.1.14.jar:/var/www/ccnlab/extensions/EzMwLucene/server/lib/servlet-api-2.5-6.1.14.jar:/var/www/ccnlab/extensions/EzMwLucene/server/lib/commons-codec-1.3.jar:/var/www/ccnlab/extensions/EzMwLucene/server/lib/commons-httpclient-3.1.jar:/var/www/ccnlab/extensions/EzMwLucene/server/lib/commons-logging.jar:/var/www/ccnlab/extensions/EzMwLucene/server/lib/FontBox-0.1.0-dev.jar:/var/www/ccnlab/extensions/EzMwLucene/server/lib/lucene-core-2.4.0.jar:/var/www/ccnlab/extensions/EzMwLucene/server/lib/lucene-highlighter-2.4.0.jar:/var/www/ccnlab/extensions/EzMwLucene/server/lib/PDFBox-0.7.3.jar:/var/www/ccnlab/extensions/EzMwLucene/server/lib/poi-3.5-beta3-20080926.jar:/var/www/ccnlab/extensions/EzMwLucene/server/lib/poi-scratchpad-3.5-beta3-20080926.jar" net.sourceforge.ezmwlucene.service.EzMwLuceneService & disown %

Starting EzMwLucene server...
2009-05-08 16:22:48.039::INFO: Logging to STDERR via org.mortbay.log.StdErrLog
2009-05-08 16:22:48.259::INFO: jetty-6.1.14
2009-05-08 16:22:48.498::INFO: Started SocketConnector@0.0.0.0:8080
EzMwLucene started!
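
The long classpath in the command above does not have to be typed by hand. A possible sketch, assuming that every jar in the server's lib/ directory belongs on the classpath and that the order of the jars does not matter (both are assumptions; the hand-written list uses a fixed order). build_classpath is a made-up helper name.

```shell
#!/bin/sh
# Sketch only: build_classpath is a hypothetical helper.

# Print a colon-separated classpath: ezmwlucene.jar plus every jar in lib/.
# $1 is the EzMwLucene server directory.
build_classpath() {
    home=$1
    classpath="$home/ezmwlucene.jar"
    for jar in "$home"/lib/*.jar; do
        # Skip the unexpanded pattern when lib/ holds no jars.
        [ -e "$jar" ] || continue
        classpath="$classpath:$jar"
    done
    printf '%s\n' "$classpath"
}

# Example start, kept as a comment so that sourcing this file starts nothing:
#   java -cp "$(build_classpath /var/www/ccnlab/extensions/EzMwLucene/server)" \
#       net.sourceforge.ezmwlucene.service.EzMwLuceneService & disown %
```

The main class name is the one from the command above. Everything else here is illustration, so compare the generated classpath with the hand-written one before relying on it.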


 * service.sh really should report success or failure in its output. It does put error messages into the error log though. I got this working by doing the following:
 * Made a lucene user/group for the system, with the lucene user having a shell of /bin/bash and a home directory of /apps/lucene/home/lucene (ownership lucene:lucene 600)
 * I put the server directory at: /apps/lucene/server (ownership of files/directories is root:root, except for ezmwlucene.properties which is root:lucene; everything is world readable except for ezmwlucene.properties, which is 640)
 * I set the following variables in service.sh:
 * EZMWLUCENE_HOME="/apps/lucene/server"
 * EZMWLUCENE_LOG="/apps/lucene/home/lucene/ezmwlucene.log"
 * EZMWLUCENE_USER="lucene"
 * I installed java 1.6 instead of 1.5
 * I created /data/lucene (ownership lucene:lucene 600)
 * In ezmwlucene.properties, I set the following (notice I didn't specify a port for the database url, this tells it to use a socket):
 * lucene.index = /data/lucene/index
 * server.port = 8080
 * mediawiki.name = 
 * mediawiki.databaseUrl = jdbc:mysql://localhost/
 * mediawiki.databaseUser =
 * mediawiki.databasePassword =
 * mediawiki.imagesUrl = https://fully.qualified.servername/wiki/images/
 * mediawiki.localImagesUrl = file:///data/wiki/images/
 * I ensured that the lucene user could read /data/wiki/images and all images below (my permissions are fairly strict)
 * setfacl -R -d -m user:lucene:rX /data/wiki/images
 * setfacl -R -m user:lucene:rX /data/wiki/images
 * I ensured that I could pull images from https://fully.qualified.servername/wiki/images/:
 * curl -k "https://fully.qualified.servername/wiki/images/5/58/Test.pdf"
 * I put the following into LocalSettings.php:
 * $wgEzMwLuceneQueryUrl = 'http://localhost:8080/query';
 * $wgEzMwLuceneIndexUrl = 'http://localhost:8080/index';
 * $wgSearchType = 'EzMwLuceneSearchEngine';
 * require_once("extensions/EzMwLucene/EzMwLucene.php");
 * $wgDebugLogGroups["EzMwLucene"] = "/tmp/ezmwlucene.log";
 * I applied the patch to EzMwLuceneIndexer.php, as listed in one of the comments above


 * After doing the above, I started service.sh (check /apps/lucene/home/lucene/ezmwlucene.log for errors to ensure it started properly, and check netstat to ensure it is actually running). Then I started uploading pdfs and docs and such to test with. Check /tmp/ezmwlucene.log, /apps/lucene/home/lucene/ezmwlucene.log, and your web server logs to ensure that updates are being sent to the daemon, the daemon can pull URLs from the web server, and that the daemon is successfully accessing the files.
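
Before starting the daemon, a quick check that ezmwlucene.properties actually has a value for each of the keys listed above might save some log reading. The following is a sketch only; check_props is a made-up name, and treating an empty value as an error is an assumption (an empty database password, for example, may be legitimate on some setups).

```shell
#!/bin/sh
# Sketch only: check_props is a hypothetical helper.

# Print every key from the list above that is missing or has an empty value
# in the given properties file. Returns 0 only if all keys have a value.
check_props() {
    file=$1
    missing=0
    for key in lucene.index server.port mediawiki.name \
               mediawiki.databaseUrl mediawiki.databaseUser \
               mediawiki.databasePassword mediawiki.imagesUrl \
               mediawiki.localImagesUrl; do
        # Escape the dots so that they match literally in the pattern.
        escaped=$(printf '%s' "$key" | sed 's/\./\\./g')
        # A key counts as set when "key = value" has a non-blank value.
        if ! grep -q "^[[:space:]]*$escaped[[:space:]]*=[[:space:]]*[^[:space:]]" "$file"; then
            printf '%s\n' "$key"
            missing=1
        fi
    done
    return "$missing"
}
```

With the paths used above this would be called as check_props /apps/lucene/server/ezmwlucene.properties, and any key it prints needs attention before running service.sh.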

Uploading a new pdf does not cause it to get indexed
When I ran loader.sh there was a pdf on the wiki with the title: "Family history of heart attack as an independent predictor of death due to cardiovascular disease". Searching for this title returned the pdf. I then deleted the pdf, which removed it from the index. I then re-uploaded it, but it did not show up in the index again.


 * This worked for me, but it wasn't exactly easy to get it working. Check your logs thoroughly. --Ryan lane 14:44, 11 May 2009 (UTC)

files are not searched by default
It appears that every single user will have to set their preferences to search the file namespace.

$wgNamespacesToBeSearchedDefault = array(
    NS_MAIN => true,
    NS_FILE => true,
);

This is from the MW FAQ, but it doesn't actually work. This did work for all users, however:

UPDATE user SET user_options = REPLACE(user_options, 'searchNs14=0', 'searchNs14=1');

Unfortunately, while this changes the preference, it does not cause the File namespace to be actually searched.

Here's an IRC log on this issue which suggests upgrading past 48811 might help:

this seems to have no effect whatsoever on my wiki! http://www.mediawiki.org/wiki/Manual:$wgNamespacesToBeSearchedDefault i even set NS_MAIN => false it still searches main! what version? 48811 search preferences don't seem to matter either long story huh, i just tested that it works on Wikipedia. wtf but the short answer is that changes to preference defaults will only apply to existing users after r48811 I fixed the bug myself lol ok, i guess i'll upgrade a revision? i try to stay in sync w/ WP you'll need to upgrade a few revisions
 * Rdsmith4 has quit (".")

Indeed - upgrading to MediaWiki trunk in addition to using $wgNamespacesToBeSearchedDefault works.

PDF logo is pointless!


--Alterego 00:22, 10 May 2009 (UTC)


 * When combined with the PDF handler, is it possible that this is less pointless? --Ryan lane 13:37, 11 May 2009 (UTC)