Extension talk:EzMwLucene

Alterego's bugs
These are the bugs I've found with this extension using Kubuntu Jaunty Jackal and MediaWiki 1.16alpha (r50407) --Alterego 00:31, 10 May 2009 (UTC)

Problem with Searching Japanese Characters
If I search for Japanese characters, I always get "No page text matches" even though the pages exist.

If a search result (found by searching for English words) contains Japanese characters, they all show up as question marks (?????????).

Is this a bug, or a problem with my configuration? Any ideas would be greatly appreciated.

Searching for an empty string
2009-05-09 18:21:05.580::WARN: /query
net.sourceforge.ezmwlucene.helper.LuceneHelperException: org.apache.lucene.queryParser.ParseException: Cannot parse ' AND (namespaceid:6)': Encountered " ")" ") "" at line 1, column 1. Was expecting one of:  ...   "+" ...    "-" ...    "(" ...    "*" ...     ...     ...     ...     ...    "[" ...    "{" ...     ...     ...    "*" ...
    at net.sourceforge.ezmwlucene.helper.LuceneQueryHelper.query(LuceneQueryHelper.java:228)
    at net.sourceforge.ezmwlucene.web.QueryServlet.doQuery(QueryServlet.java:123)
    at net.sourceforge.ezmwlucene.web.QueryServlet.doGet(QueryServlet.java:105)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
    at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:363)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
    at org.mortbay.jetty.handler.HandlerList.handle(HandlerList.java:49)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    at org.mortbay.jetty.Server.handle(Server.java:324)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
    at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
    at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
    at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)
Caused by: org.apache.lucene.queryParser.ParseException: Cannot parse ' AND (namespaceid:6)': Encountered " ")" ") "" at line 1, column 1. Was expecting one of:  ...   "+" ...    "-" ...    "(" ...    "*" ...     ...     ...     ...     ...    "[" ...    "{" ...     ...     ...    "*" ...
    at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:175)
    at net.sourceforge.ezmwlucene.helper.LuceneQueryHelper.query(LuceneQueryHelper.java:158)
    ... 18 more
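
The string in the log, ' AND (namespaceid:6)', suggests that the empty search terms were concatenated directly with the namespace filter, leaving nothing on the left of AND. A minimal sketch of a guard is shown here. The class and method names are hypothetical and this is not the extension's actual code; only the field name namespaceid is taken from the log above.

```java
// Hypothetical helper, not part of EzMwLucene: builds the query string
// so that blank user input never produces a dangling "AND".
final class NamespaceQuerySketch {

    /** Combines user terms with a namespace filter; blank terms yield the filter alone. */
    static String buildQuery(String userTerms, int namespaceId) {
        // "namespaceid" is the field name seen in the logged query.
        String filter = "namespaceid:" + namespaceId;
        String terms = (userTerms == null) ? "" : userTerms.trim();
        if (terms.isEmpty()) {
            // Nothing to combine with, so return only the filter.
            return filter;
        }
        return "(" + terms + ") AND (" + filter + ")";
    }

    public static void main(String[] args) {
        System.out.println(buildQuery("", 6));
        System.out.println(buildQuery("heart attack", 6));
    }
}
```

Returning only the filter makes an empty search list everything in the namespace. Short-circuiting and returning no results for blank input would be an equally reasonable choice.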

Loading Special:Search
The following error is repeated 25 times after first loading Special:Search (searches do work, however)



This can be suppressed by setting

error_reporting = E_COMPILE_ERROR|E_ERROR|E_CORE_ERROR

in php.ini


 * Thanks! This method also fixes my warning messages below on the Special:Search page.


 * Warning: DOMDocument::loadXML [domdocument.loadxml]: Premature end of data in tag PREFIXTERM line 105 in Entity, line: 157 in $IP/extensions/EzMwLucene/EzMwLuceneSearchEngine.php on line 116


 * Warning: DOMDocument::loadXML [domdocument.loadxml]: Premature end of data in tag TERM line 104 in Entity, line: 157 in $IP/extensions/EzMwLucene/EzMwLuceneSearchEngine.php on line 116


 * Warning: DOMDocument::loadXML [domdocument.loadxml]: Premature end of data in tag QUOTED line 103 in Entity, line: 157 in $IP/extensions/EzMwLucene/EzMwLuceneSearchEngine.php on line 116

Distribution does not include README/INSTALL files
The documentation for this distribution should be included in the package itself, not just on SourceForge, which is extremely inconvenient. http://sourceforge.net/docman/display_doc.php?docid=179038&group_id=254413


 * It would also be acceptable to put the documentation on this wiki page, and have the README link here. It would be really nice to have everything hosted in Wikimedia's SVN. --Ryan lane 14:43, 11 May 2009 (UTC)

loader.sh cannot open all files because of mishandling unicode characters
61 grey:.../EzMwLucene/server# sh loader.sh
java.io.FileNotFoundException: /var/www/mediawiki/sites/ccnlab/images/d/dd/BrouilletCondÃ©BealEtAl99.pdf (No such file or directory)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:137)
    at java.io.FileInputStream.<init>(FileInputStream.java:96)
    at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:87)
    at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:178)
    at java.net.URL.openStream(URL.java:1027)
    at net.sourceforge.ezmwlucene.AttachmentExtractor.extract(AttachmentExtractor.java:182)
    at net.sourceforge.ezmwlucene.MediawikiArticleFactory.setAttachment(MediawikiArticleFactory.java:256)
    at net.sourceforge.ezmwlucene.MediawikiArticleFactory.fromResultSet(MediawikiArticleFactory.java:192)
    at net.sourceforge.ezmwlucene.util.DatabaseLoader.load(DatabaseLoader.java:138)
    at net.sourceforge.ezmwlucene.util.DatabaseLoader.main(DatabaseLoader.java:57)
java.io.FileNotFoundException: /var/www/mediawiki/sites/ccnlab/images/c/c7/Carrillo-ReidTecuapetlaIbÃ¡Ã±ez-SandovalEtAl09.pdf (No such file or directory)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:137)
    at java.io.FileInputStream.<init>(FileInputStream.java:96)
    at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:87)
    at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:178)
    at java.net.URL.openStream(URL.java:1027)
    at net.sourceforge.ezmwlucene.AttachmentExtractor.extract(AttachmentExtractor.java:182)
    at net.sourceforge.ezmwlucene.MediawikiArticleFactory.setAttachment(MediawikiArticleFactory.java:256)
    at net.sourceforge.ezmwlucene.MediawikiArticleFactory.fromResultSet(MediawikiArticleFactory.java:192)
    at net.sourceforge.ezmwlucene.util.DatabaseLoader.load(DatabaseLoader.java:138)
    at net.sourceforge.ezmwlucene.util.DatabaseLoader.main(DatabaseLoader.java:57)
java.io.FileNotFoundException: /var/www/mediawiki/sites/ccnlab/images/d/d8/CepedaWuAndrÃ©EtAl07.pdf (No such file or directory)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:137)
    at java.io.FileInputStream.<init>(FileInputStream.java:96)
    at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:87)
    at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:178)
    at java.net.URL.openStream(URL.java:1027)
    at net.sourceforge.ezmwlucene.AttachmentExtractor.extract(AttachmentExtractor.java:182)
    at net.sourceforge.ezmwlucene.MediawikiArticleFactory.setAttachment(MediawikiArticleFactory.java:256)
    at net.sourceforge.ezmwlucene.MediawikiArticleFactory.fromResultSet(MediawikiArticleFactory.java:192)
    at net.sourceforge.ezmwlucene.util.DatabaseLoader.load(DatabaseLoader.java:138)
    at net.sourceforge.ezmwlucene.util.DatabaseLoader.main(DatabaseLoader.java:57)
java.io.FileNotFoundException: /var/www/mediawiki/sites/ccnlab/images/4/4d/Dbf2007.pdf (No such file or directory)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:137)
    at java.io.FileInputStream.<init>(FileInputStream.java:96)
    at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:87)
    at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:178)
    at java.net.URL.openStream(URL.java:1027)
    at net.sourceforge.ezmwlucene.AttachmentExtractor.extract(AttachmentExtractor.java:182)
    at net.sourceforge.ezmwlucene.MediawikiArticleFactory.setAttachment(MediawikiArticleFactory.java:256)
    at net.sourceforge.ezmwlucene.MediawikiArticleFactory.fromResultSet(MediawikiArticleFactory.java:192)
    at net.sourceforge.ezmwlucene.util.DatabaseLoader.load(DatabaseLoader.java:138)
    at net.sourceforge.ezmwlucene.util.DatabaseLoader.main(DatabaseLoader.java:57)
java.io.FileNotFoundException: /var/www/mediawiki/sites/ccnlab/images/f/fe/DÂ’AngeloDeZeeuw08.pdf (No such file or directory)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:137)
    at java.io.FileInputStream.<init>(FileInputStream.java:96)
    at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:87)
    at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:178)
    at java.net.URL.openStream(URL.java:1027)
    at net.sourceforge.ezmwlucene.AttachmentExtractor.extract(AttachmentExtractor.java:182)
    at net.sourceforge.ezmwlucene.MediawikiArticleFactory.setAttachment(MediawikiArticleFactory.java:256)
    at net.sourceforge.ezmwlucene.MediawikiArticleFactory.fromResultSet(MediawikiArticleFactory.java:192)
    at net.sourceforge.ezmwlucene.util.DatabaseLoader.load(DatabaseLoader.java:138)
    at net.sourceforge.ezmwlucene.util.DatabaseLoader.main(DatabaseLoader.java:57)
Found a TextHeaderAtom not followed by a TextBytesAtom or TextCharsAtom: Followed by 3999
java.lang.NullPointerException
    at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:194)
    at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:182)
    at org.pdfbox.pdmodel.PDDocumentCatalog.getAllPages(PDDocumentCatalog.java:226)
    at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
    at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
    at net.sourceforge.ezmwlucene.AttachmentExtractor.getPdfText(AttachmentExtractor.java:259)
    at net.sourceforge.ezmwlucene.AttachmentExtractor.extract(AttachmentExtractor.java:227)
    at net.sourceforge.ezmwlucene.AttachmentExtractor.extract(AttachmentExtractor.java:183)
    at net.sourceforge.ezmwlucene.MediawikiArticleFactory.setAttachment(MediawikiArticleFactory.java:256)
    at net.sourceforge.ezmwlucene.MediawikiArticleFactory.fromResultSet(MediawikiArticleFactory.java:192)
    at net.sourceforge.ezmwlucene.util.DatabaseLoader.load(DatabaseLoader.java:138)
    at net.sourceforge.ezmwlucene.util.DatabaseLoader.main(DatabaseLoader.java:57)
java.lang.NullPointerException
    at org.apache.poi.hslf.model.SimpleShape.getClientRecords(SimpleShape.java:322)
    at org.apache.poi.hslf.model.SimpleShape.getClientDataRecord(SimpleShape.java:307)
    at org.apache.poi.hslf.model.TextShape.getPlaceholderAtom(TextShape.java:547)
    at org.apache.poi.hslf.model.Sheet.getPlaceholder(Sheet.java:408)
    at org.apache.poi.hslf.model.HeadersFooters.isVisible(HeadersFooters.java:244)
    at org.apache.poi.hslf.model.HeadersFooters.isHeaderVisible(HeadersFooters.java:148)
    at org.apache.poi.hslf.extractor.PowerPointExtractor.getText(PowerPointExtractor.java:173)
    at org.apache.poi.hslf.extractor.PowerPointExtractor.getText(PowerPointExtractor.java:144)
    at net.sourceforge.ezmwlucene.AttachmentExtractor.getPowerpointText(AttachmentExtractor.java:315)
    at net.sourceforge.ezmwlucene.AttachmentExtractor.extract(AttachmentExtractor.java:221)
    at net.sourceforge.ezmwlucene.AttachmentExtractor.extract(AttachmentExtractor.java:183)
    at net.sourceforge.ezmwlucene.MediawikiArticleFactory.setAttachment(MediawikiArticleFactory.java:256)
    at net.sourceforge.ezmwlucene.MediawikiArticleFactory.fromResultSet(MediawikiArticleFactory.java:192)
    at net.sourceforge.ezmwlucene.util.DatabaseLoader.load(DatabaseLoader.java:138)
    at net.sourceforge.ezmwlucene.util.DatabaseLoader.main(DatabaseLoader.java:57)
java.io.FileNotFoundException: /var/www/mediawiki/sites/ccnlab/images/9/9f/Pauli_oreilly.pdf (No such file or directory)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:137)
    at java.io.FileInputStream.<init>(FileInputStream.java:96)
    at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:87)
    at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:178)
    at java.net.URL.openStream(URL.java:1027)
    at net.sourceforge.ezmwlucene.AttachmentExtractor.extract(AttachmentExtractor.java:182)
    at net.sourceforge.ezmwlucene.MediawikiArticleFactory.setAttachment(MediawikiArticleFactory.java:256)
    at net.sourceforge.ezmwlucene.MediawikiArticleFactory.fromResultSet(MediawikiArticleFactory.java:192)
    at net.sourceforge.ezmwlucene.util.DatabaseLoader.load(DatabaseLoader.java:138)
    at net.sourceforge.ezmwlucene.util.DatabaseLoader.main(DatabaseLoader.java:57)
java.io.FileNotFoundException: /var/www/mediawiki/sites/ccnlab/images/9/99/PisellaGrÃ©aTiliketeEtAl00.pdf (No such file or directory)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:137)
    at java.io.FileInputStream.<init>(FileInputStream.java:96)
    at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:87)
    at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:178)
    at java.net.URL.openStream(URL.java:1027)
    at net.sourceforge.ezmwlucene.AttachmentExtractor.extract(AttachmentExtractor.java:182)
    at net.sourceforge.ezmwlucene.MediawikiArticleFactory.setAttachment(MediawikiArticleFactory.java:256)
    at net.sourceforge.ezmwlucene.MediawikiArticleFactory.fromResultSet(MediawikiArticleFactory.java:192)
    at net.sourceforge.ezmwlucene.util.DatabaseLoader.load(DatabaseLoader.java:138)
    at net.sourceforge.ezmwlucene.util.DatabaseLoader.main(DatabaseLoader.java:57)
java.io.FileNotFoundException: /var/www/mediawiki/sites/ccnlab/images/3/3d/RamÃ­rez-LugoZavala-VegaBermÃºdez-Rattoni06.pdf (No such file or directory)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:137)
    at java.io.FileInputStream.<init>(FileInputStream.java:96)
    at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:87)
    at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:178)
    at java.net.URL.openStream(URL.java:1027)
    at net.sourceforge.ezmwlucene.AttachmentExtractor.extract(AttachmentExtractor.java:182)
    at net.sourceforge.ezmwlucene.MediawikiArticleFactory.setAttachment(MediawikiArticleFactory.java:256)
    at net.sourceforge.ezmwlucene.MediawikiArticleFactory.fromResultSet(MediawikiArticleFactory.java:192)
    at net.sourceforge.ezmwlucene.util.DatabaseLoader.load(DatabaseLoader.java:138)
    at net.sourceforge.ezmwlucene.util.DatabaseLoader.main(DatabaseLoader.java:57)
Your document seemed to be mostly unicode, but the section definition was in bytes! Trying anyway, but things may well go wrong!
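
The garbled names in these paths (for example "CondÃ©", presumably "Condé") look like UTF-8 bytes that were decoded as ISO-8859-1 somewhere between the database and the file lookup. The following is only a sketch of that suspected mechanism, not code from loader.sh or the extension. The class and method names are made up, and StandardCharsets needs Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demonstration class, not part of EzMwLucene.
final class MojibakeSketch {

    /** Simulates the suspected bug: UTF-8 bytes read back as ISO-8859-1. */
    static String misdecode(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    /** Reverses the damage; ISO-8859-1 maps every byte to one char, so no data is lost. */
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // \u00e9 is "é"; the escape keeps this source file pure ASCII.
        String original = "BrouilletCond\u00e9BealEtAl99.pdf";
        String garbled = misdecode(original);
        // The single "é" becomes two characters, U+00C3 and U+00A9.
        System.out.println(garbled.length() - original.length());
        System.out.println(repair(garbled).equals(original));
    }
}
```

If this is indeed the cause, the real fix would be to make the charset explicit wherever the titles are read, rather than repairing strings after the fact.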

Special:Upload broken on new install


This can be fixed by applying the following patch:

--- EzMwLuceneIndexer.php.old  2009-05-08 16:26:28.000000000 -0500
+++ EzMwLuceneIndexer.php      2009-05-08 16:26:00.000000000 -0500
@@ -38,7 +38,7 @@
              $xml .= " <![CDATA[{$revision->getRawText()}]]> ";
              /* If the article is in the Image namespace, add the file url */
              if ($article->getTitle()->getNamespace()==6) {
-                      $file =  wfFindFile($article->getTitle());
+                      $file = wfFindFile($article->getTitle(),false,0,true);
                       $xml .= " <![CDATA[{$file->getFullUrl()}]]> ";
              }
              $xml .= ' ';


 * Instead of applying the "patch" with a command like "$ patch EzMwLuceneIndexer.php < the_code_above.php", I simply edited the EzMwLuceneIndexer.php file and replaced the single "-" line with the "+" line.
 * It works!

Extension is mispackaged
The correct way to pack a tarball is such that unpacking it yields a single top-level directory containing everything else:

EzMwLucene_1.0/
    client/
    server/

It is rude to put client and server at the top level of the package.

More unicode mishandling

 * WARNING: Funny characters in title SchrauwenD???HaeneVerstraetenEtAl07
 * 14000 row(s) processed
 * 14100 row(s) processed
 * 14200 row(s) processed
 * WARNING: Funny characters in title SerinoGiovagnoliL??davas09
 * 14300 row(s) processed
 * 14400 row(s) processed
 * 14500 row(s) processed
 * 14600 row(s) processed
 * 14700 row(s) processed
 * 14800 row(s) processed
 * 14900 row(s) processed
 * 15000 row(s) processed
 * 15100 row(s) processed
 * WARNING: Funny characters in title StruffertK??hrmannEngelhornEtAl09
 * 15200 row(s) processed
 * 15300 row(s) processed
 * WARNING: Funny characters in title TamosiunaiteAsfourW??rg??tter09
 * 15400 row(s) processed
 * WARNING: Funny characters in title TanakaBalleineO???Doherty08
 * 15500 row(s) processed
 * 15600 row(s) processed
 * 15700 row(s) processed
 * 15800 row(s) processed
 * 15900 row(s) processed
 * WARNING: Funny characters in title UrbanoLeznikLlin??s07
 * WARNING: Funny characters in title ValentinDickinsonO???Doherty07
 * 16000 row(s) processed
 * 16100 row(s) processed
 * WARNING: Funny characters in title VerstraetenSchrauwenD??HaeneEtAl07
 * 16200 row(s) processed
 * 16300 row(s) processed
 * 16400 row(s) processed
 * 16500 row(s) processed
 * 16600 row(s) processed
 * 16700 row(s) processed
 * WARNING: Funny characters in title WikiPapers/log/Kov????csMehler09
 * 16800 row(s) processed
 * WARNING: Funny characters in title WikiPapers/log/SerinoGiovagnoliL??davas09
 * WARNING: Funny characters in title WikiPapers/log/SotoFunesGuzm????n-Garc????aEtAl09
 * WARNING: Funny characters in title WikiPapers/log/TanakaBalleineO???Doherty08
 * WARNING: Funny characters in title WikiPapers/log/ValentinDickinsonO???Doherty07
 * WARNING: Funny characters in title WikiPapers/log/WinklerH??denLadinigEtAl
 * WARNING: Funny characters in title WinklerH??denLadinigEtAl
 * 16900 row(s) processed
 * 17000 row(s) processed
 * 17100 row(s) processed
 * 17200 row(s) processed
 * 17300 row(s) processed
 * 17400 row(s) processed
 * 17500 row(s) processed
 * WARNING: Funny characters in title BrouilletCond??BealEtAl99.pdf
 * 17600 row(s) processed
 * WARNING: Funny characters in title Carrillo-ReidTecuapetlaIb????ez-SandovalEtAl09.pdf
 * WARNING: Funny characters in title CepedaWuAndr??EtAl07.pdf

service.sh always fails (with FAIL)
Running $RUN_CMD directly from the command line, and then backgrounding and disowning it, works just fine. The bash script you borrowed from another software package, however, simply does not work on my OS. The shell script looks portable and expertly written; nonetheless, it depends on subtle run conditions that are not easy to debug and do not appear to be useful. Modern Linux operating systems do not need a shell script to start/stop/restart processes; that is a job for the system to handle. I start the process in the following way. Note that the long java command ends with "& disown %", which tells bash to background the process and disown it. You should put that in a file in /etc/init.d so that it gets started with your system.


 * .../EzMwLucene/server# /usr/lib/jvm/java-6-sun-1.6.0.10/jre/bin/java -cp "/var/www/ccnlab/extensions/EzMwLucene/server/ezmwlucene.jar:/var/www/ccnlab/extensions/EzMwLucene/server/lib/jetty-6.1.14.jar:/var/www/ccnlab/extensions/EzMwLucene/server/lib/jetty-util-6.1.14.jar:/var/www/ccnlab/extensions/EzMwLucene/server/lib/servlet-api-2.5-6.1.14.jar:/var/www/ccnlab/extensions/EzMwLucene/server/lib/commons-codec-1.3.jar:/var/www/ccnlab/extensions/EzMwLucene/server/lib/commons-httpclient-3.1.jar:/var/www/ccnlab/extensions/EzMwLucene/server/lib/commons-logging.jar:/var/www/ccnlab/extensions/EzMwLucene/server/lib/FontBox-0.1.0-dev.jar:/var/www/ccnlab/extensions/EzMwLucene/server/lib/lucene-core-2.4.0.jar:/var/www/ccnlab/extensions/EzMwLucene/server/lib/lucene-highlighter-2.4.0.jar:/var/www/ccnlab/extensions/EzMwLucene/server/lib/PDFBox-0.7.3.jar:/var/www/ccnlab/extensions/EzMwLucene/server/lib/poi-3.5-beta3-20080926.jar:/var/www/ccnlab/extensions/EzMwLucene/server/lib/poi-scratchpad-3.5-beta3-20080926.jar" net.sourceforge.ezmwlucene.service.EzMwLuceneService & disown %

Starting EzMwLucene server...
2009-05-08 16:22:48.039::INFO: Logging to STDERR via org.mortbay.log.StdErrLog
2009-05-08 16:22:48.259::INFO: jetty-6.1.14
2009-05-08 16:22:48.498::INFO: Started SocketConnector@0.0.0.0:8080
EzMwLucene started!


 * service.sh really should report success or failure in its output. It does put error messages into the error log, though. I got this working by doing the following:
   * Made a lucene user/group for the system, with the lucene user having a shell of /bin/bash and a home directory of /apps/lucene/home/lucene (ownership lucene:lucene 600)
   * I put the server directory at: /apps/lucene/server (ownership of files/directories is root:root, except for ezmwlucene.properties which is root:lucene; everything is world readable except for ezmwlucene.properties, which is 640)
   * I set the following variables in service.sh:
     * EZMWLUCENE_HOME="/apps/lucene/server"
     * EZMWLUCENE_LOG="/apps/lucene/home/lucene/ezmwlucene.log"
     * EZMWLUCENE_USER="lucene"
   * I installed java 1.6 instead of 1.5
   * I created /data/lucene (ownership lucene:lucene 600)
   * In ezmwlucene.properties, I set the following (note that I didn't specify a port for the database URL; this tells it to use a socket):
     * lucene.index = /data/lucene/index
     * server.port = 8080
     * mediawiki.name = 
     * mediawiki.databaseUrl = jdbc:mysql://localhost/
     * mediawiki.databaseUser =
     * mediawiki.databasePassword =
     * mediawiki.imagesUrl = https://fully.qualified.servername/wiki/images/
     * mediawiki.localImagesUrl = file:///data/wiki/images/
   * I ensured that the lucene user could read /data/wiki/images and all images below (my permissions are fairly strict)
     * setfacl -R -d -m user:lucene:rX /data/wiki/images
     * setfacl -R -m user:lucene:rX /data/wiki/images
   * I ensured that I could pull images from https://fully.qualified.servername/wiki/images/:
     * curl -k "https://fully.qualified.servername/wiki/images/5/58/Test.pdf"
   * I put the following into LocalSettings.php:
     * $wgEzMwLuceneQueryUrl = 'http://localhost:8080/query';
     * $wgEzMwLuceneIndexUrl = 'http://localhost:8080/index';
     * $wgSearchType = 'EzMwLuceneSearchEngine';
     * require_once("extensions/EzMwLucene/EzMwLucene.php");
     * $wgDebugLogGroups["EzMwLucene"] = "/tmp/ezmwlucene.log";
   * I applied the patch to EzMwLuceneIndexer.php, as listed in one of the comments above


 * After doing the above, I started service.sh (check /apps/lucene/home/lucene/ezmwlucene.log for errors to ensure it started properly, and check netstat to ensure it is actually running). Then I started uploading pdfs and docs and such to test with. Check /tmp/ezmwlucene.log, /apps/lucene/home/lucene/ezmwlucene.log, and your web server logs to ensure that updates are being sent to the daemon, the daemon can pull URLs from the web server, and that the daemon is successfully accessing the files.

Uploading a new pdf does not cause it to get indexed
When I ran loader.sh there was a pdf on the wiki with the title: "Family history of heart attack as an independent predictor of death due to cardiovascular disease". Searching for this title returned the pdf. I then deleted the pdf, which removed it from the index. I then re-uploaded it, but it did not reappear in the index.


 * This worked for me, but it wasn't exactly easy to get it working. Check your logs thoroughly. --Ryan lane 14:44, 11 May 2009 (UTC)

files are not searched by default
It appears that every single user will have to set their preferences to search the file namespace. The MW FAQ suggests the following, but it doesn't actually work:

$wgNamespacesToBeSearchedDefault = array(
    NS_MAIN => true,
    NS_FILE => true,
);

This did work for all users, however:

UPDATE user SET user_options = REPLACE(user_options, 'searchNs14=0', 'searchNs14=1');

Unfortunately, while this changes the preference, it does not cause the File namespace to actually be searched.

Here's an IRC log on this issue, which suggests that upgrading past r48811 might help:

this seems to have no effect whatsoever on my wiki! http://www.mediawiki.org/wiki/Manual:$wgNamespacesToBeSearchedDefault i even set NS_MAIN => false it still searches main! what version? 48811 search preferences don't seem to matter either long story huh, i just tested that it works on Wikipedia. wtf but the short answer is that changes to preference defaults will only apply to existing users after r48811 I fixed the bug myself lol ok, i guess i'll upgrade a revision? i try to stay in sync w/ WP you'll need to upgrade a few revisions
 * Rdsmith4 has quit (".")

Indeed, upgrading to MediaWiki trunk, in addition to using $wgNamespacesToBeSearchedDefault, works.

PDF logo is pointless!


--Alterego 00:22, 10 May 2009 (UTC)


 * When combined with the PDF handler, is it possible that this is less pointless? --Ryan lane 13:37, 11 May 2009 (UTC)

Is there an option for the database tables prefix?
My wiki uses prefixes for the tables in the database.

$wgDBprefix        = "mw";

Is there an option to set the prefix in the properties file?


 * Answer
 * 1) See the patch offer. Unfortunately, the package is currently not well maintained.
 * 2) Workaround: copy the database tables $wgDBprefix{page,revision,text,image} to page, revision, text, and image before starting loader.sh, and drop them afterwards.
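
For anyone who wants to patch the loader instead of copying tables, prefix support could look roughly like the sketch below. This is purely illustrative: the property key mediawiki.databasePrefix is invented here and is not an existing EzMwLucene option, and the class is hypothetical. MediaWiki prepends $wgDBprefix to the table name without adding a separator, so a prefix of "mw" gives "mwpage".

```java
import java.util.Properties;

// Hypothetical sketch, not part of EzMwLucene.
final class TablePrefixSketch {

    // Invented key for illustration; ezmwlucene.properties has no such option as far as this page shows.
    static final String PREFIX_KEY = "mediawiki.databasePrefix";

    /** Returns the prefixed table name, or the bare name when no prefix is configured. */
    static String tableName(Properties props, String baseName) {
        return props.getProperty(PREFIX_KEY, "") + baseName;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(PREFIX_KEY, "mw");
        // The four tables named in the workaround above.
        for (String table : new String[] {"page", "revision", "text", "image"}) {
            System.out.println("SELECT COUNT(*) FROM " + tableName(props, table));
        }
    }
}
```

Every SQL statement in the loader would have to go through such a helper, which is probably why copying the tables is the quicker workaround.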

Search Inside Documents
The text inside MS Word, MS Excel, and similar documents doesn't seem to be indexed for me. Is that possible?

Prerequisites
You should list the prerequisites needed to run this extension:

1. Java-JRE
2. PHP-Curl

Newer MediaWiki versions
Given that version 1.14 changed the way MediaWiki handles files, has anyone had any experience getting this extension to work with it? Thanks.

The extension also fails in 1.15, for the same reason I believe. Making symbolic links to simulate the previous image folder structure seems to work. However, I'm getting loads of XML tag mismatch errors when using the search:

Warning: DOMDocument::loadXML [domdocument.loadxml]: Premature end of data in tag HTML line 2 in Entity, line: 50 in /export/home/www/default/wiki/extensions/EzMwLucene/EzMwLuceneSearchEngine.php on line 116
Notice: Trying to get property of non-object in /export/home/www/default/wiki/extensions/EzMwLucene/EzMwLuceneSearchEngine.php on line 147