Extension talk:EzMwLucene

= Alterego's bugs =
These are the bugs I've found with this extension using Kubuntu Jaunty Jackal and MediaWiki 1.16alpha (r50407). --Alterego 00:31, 10 May 2009 (UTC)

== Searching for an empty string ==
2009-05-09 18:21:05.580::WARN: /query
net.sourceforge.ezmwlucene.helper.LuceneHelperException: org.apache.lucene.queryParser.ParseException: Cannot parse ' AND (namespaceid:6)': Encountered " ")" ") "" at line 1, column 1. Was expecting one of:  ...   "+" ...    "-" ...    "(" ...    "*" ...     ...     ...     ...     ...    "[" ...    "{" ...     ...     ...    "*" ...
    at net.sourceforge.ezmwlucene.helper.LuceneQueryHelper.query(LuceneQueryHelper.java:228)
    at net.sourceforge.ezmwlucene.web.QueryServlet.doQuery(QueryServlet.java:123)
    at net.sourceforge.ezmwlucene.web.QueryServlet.doGet(QueryServlet.java:105)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
    at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:363)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
    at org.mortbay.jetty.handler.HandlerList.handle(HandlerList.java:49)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    at org.mortbay.jetty.Server.handle(Server.java:324)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
    at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
    at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
    at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)
Caused by: org.apache.lucene.queryParser.ParseException: Cannot parse ' AND (namespaceid:6)': Encountered " ")" ") "" at line 1, column 1. Was expecting one of:  ...   "+" ...    "-" ...    "(" ...    "*" ...     ...     ...     ...     ...    "[" ...    "{" ...     ...     ...    "*" ...
    at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:175)
    at net.sourceforge.ezmwlucene.helper.LuceneQueryHelper.query(LuceneQueryHelper.java:158)
    ... 18 more
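The string being parsed, ' AND (namespaceid:6)', suggests that the namespace clause is appended to whatever the user typed, so an empty search leaves a dangling AND. A minimal sketch of a guard against that follows. It assumes the query really is built by simple concatenation, which has not been checked against the EzMwLucene source; the class and method names are hypothetical, and only the namespaceid field name comes from the log.

```java
import java.util.Optional;

// Hypothetical helper; not part of EzMwLucene.
final class NamespaceQueryBuilder {

    /**
     * Appends the namespace clause seen in the log to the user's terms.
     * Returns an empty Optional for blank input so the caller can skip parsing.
     */
    static Optional<String> build(String userTerms, int namespaceId) {
        if (userTerms == null || userTerms.trim().isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(userTerms.trim() + " AND (namespaceid:" + namespaceId + ")");
    }

    public static void main(String[] args) {
        // Blank input: nothing to hand to the query parser.
        System.out.println(build("  ", 6).isPresent());
        // Non-blank input: the clause is appended as before.
        System.out.println(build("heart attack", 6).orElse(""));
    }
}
```

With a check like this the servlet could return an empty result set for a blank search instead of passing an unparseable string to the query parser.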

== Loading Special:Search ==
The following error is repeated 25 times after first loading Special:Search (searches do work, however)



This can be suppressed by setting

error_reporting = E_COMPILE_ERROR|E_ERROR|E_CORE_ERROR

in php.ini

== Distribution does not include README/INSTALL files ==
The documentation for this distribution should be included in the package itself, not just on SourceForge, which is extremely inconvenient: http://sourceforge.net/docman/display_doc.php?docid=179038&group_id=254413

== loader.sh cannot open all files because of mishandling unicode characters ==
61 grey:.../EzMwLucene/server# sh loader.sh
java.io.FileNotFoundException: /var/www/mediawiki/sites/ccnlab/images/d/dd/BrouilletCondÃ©BealEtAl99.pdf (No such file or directory)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:137)
    at java.io.FileInputStream.<init>(FileInputStream.java:96)
    at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:87)
    at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:178)
    at java.net.URL.openStream(URL.java:1027)
    at net.sourceforge.ezmwlucene.AttachmentExtractor.extract(AttachmentExtractor.java:182)
    at net.sourceforge.ezmwlucene.MediawikiArticleFactory.setAttachment(MediawikiArticleFactory.java:256)
    at net.sourceforge.ezmwlucene.MediawikiArticleFactory.fromResultSet(MediawikiArticleFactory.java:192)
    at net.sourceforge.ezmwlucene.util.DatabaseLoader.load(DatabaseLoader.java:138)
    at net.sourceforge.ezmwlucene.util.DatabaseLoader.main(DatabaseLoader.java:57)
java.io.FileNotFoundException: /var/www/mediawiki/sites/ccnlab/images/c/c7/Carrillo-ReidTecuapetlaIbÃ¡Ã±ez-SandovalEtAl09.pdf (No such file or directory)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:137)
    at java.io.FileInputStream.<init>(FileInputStream.java:96)
    at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:87)
    at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:178)
    at java.net.URL.openStream(URL.java:1027)
    at net.sourceforge.ezmwlucene.AttachmentExtractor.extract(AttachmentExtractor.java:182)
    at net.sourceforge.ezmwlucene.MediawikiArticleFactory.setAttachment(MediawikiArticleFactory.java:256)
    at net.sourceforge.ezmwlucene.MediawikiArticleFactory.fromResultSet(MediawikiArticleFactory.java:192)
    at net.sourceforge.ezmwlucene.util.DatabaseLoader.load(DatabaseLoader.java:138)
    at net.sourceforge.ezmwlucene.util.DatabaseLoader.main(DatabaseLoader.java:57)
java.io.FileNotFoundException: /var/www/mediawiki/sites/ccnlab/images/d/d8/CepedaWuAndrÃ©EtAl07.pdf (No such file or directory)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:137)
    at java.io.FileInputStream.<init>(FileInputStream.java:96)
    at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:87)
    at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:178)
    at java.net.URL.openStream(URL.java:1027)
    at net.sourceforge.ezmwlucene.AttachmentExtractor.extract(AttachmentExtractor.java:182)
    at net.sourceforge.ezmwlucene.MediawikiArticleFactory.setAttachment(MediawikiArticleFactory.java:256)
    at net.sourceforge.ezmwlucene.MediawikiArticleFactory.fromResultSet(MediawikiArticleFactory.java:192)
    at net.sourceforge.ezmwlucene.util.DatabaseLoader.load(DatabaseLoader.java:138)
    at net.sourceforge.ezmwlucene.util.DatabaseLoader.main(DatabaseLoader.java:57)
java.io.FileNotFoundException: /var/www/mediawiki/sites/ccnlab/images/4/4d/Dbf2007.pdf (No such file or directory)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:137)
    at java.io.FileInputStream.<init>(FileInputStream.java:96)
    at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:87)
    at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:178)
    at java.net.URL.openStream(URL.java:1027)
    at net.sourceforge.ezmwlucene.AttachmentExtractor.extract(AttachmentExtractor.java:182)
    at net.sourceforge.ezmwlucene.MediawikiArticleFactory.setAttachment(MediawikiArticleFactory.java:256)
    at net.sourceforge.ezmwlucene.MediawikiArticleFactory.fromResultSet(MediawikiArticleFactory.java:192)
    at net.sourceforge.ezmwlucene.util.DatabaseLoader.load(DatabaseLoader.java:138)
    at net.sourceforge.ezmwlucene.util.DatabaseLoader.main(DatabaseLoader.java:57)
java.io.FileNotFoundException: /var/www/mediawiki/sites/ccnlab/images/f/fe/DÂ’AngeloDeZeeuw08.pdf (No such file or directory)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:137)
    at java.io.FileInputStream.<init>(FileInputStream.java:96)
    at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:87)
    at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:178)
    at java.net.URL.openStream(URL.java:1027)
    at net.sourceforge.ezmwlucene.AttachmentExtractor.extract(AttachmentExtractor.java:182)
    at net.sourceforge.ezmwlucene.MediawikiArticleFactory.setAttachment(MediawikiArticleFactory.java:256)
    at net.sourceforge.ezmwlucene.MediawikiArticleFactory.fromResultSet(MediawikiArticleFactory.java:192)
    at net.sourceforge.ezmwlucene.util.DatabaseLoader.load(DatabaseLoader.java:138)
    at net.sourceforge.ezmwlucene.util.DatabaseLoader.main(DatabaseLoader.java:57)
Found a TextHeaderAtom not followed by a TextBytesAtom or TextCharsAtom: Followed by 3999
java.lang.NullPointerException
    at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:194)
    at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:182)
    at org.pdfbox.pdmodel.PDDocumentCatalog.getAllPages(PDDocumentCatalog.java:226)
    at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
    at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
    at net.sourceforge.ezmwlucene.AttachmentExtractor.getPdfText(AttachmentExtractor.java:259)
    at net.sourceforge.ezmwlucene.AttachmentExtractor.extract(AttachmentExtractor.java:227)
    at net.sourceforge.ezmwlucene.AttachmentExtractor.extract(AttachmentExtractor.java:183)
    at net.sourceforge.ezmwlucene.MediawikiArticleFactory.setAttachment(MediawikiArticleFactory.java:256)
    at net.sourceforge.ezmwlucene.MediawikiArticleFactory.fromResultSet(MediawikiArticleFactory.java:192)
    at net.sourceforge.ezmwlucene.util.DatabaseLoader.load(DatabaseLoader.java:138)
    at net.sourceforge.ezmwlucene.util.DatabaseLoader.main(DatabaseLoader.java:57)
java.lang.NullPointerException
    at org.apache.poi.hslf.model.SimpleShape.getClientRecords(SimpleShape.java:322)
    at org.apache.poi.hslf.model.SimpleShape.getClientDataRecord(SimpleShape.java:307)
    at org.apache.poi.hslf.model.TextShape.getPlaceholderAtom(TextShape.java:547)
    at org.apache.poi.hslf.model.Sheet.getPlaceholder(Sheet.java:408)
    at org.apache.poi.hslf.model.HeadersFooters.isVisible(HeadersFooters.java:244)
    at org.apache.poi.hslf.model.HeadersFooters.isHeaderVisible(HeadersFooters.java:148)
    at org.apache.poi.hslf.extractor.PowerPointExtractor.getText(PowerPointExtractor.java:173)
    at org.apache.poi.hslf.extractor.PowerPointExtractor.getText(PowerPointExtractor.java:144)
    at net.sourceforge.ezmwlucene.AttachmentExtractor.getPowerpointText(AttachmentExtractor.java:315)
    at net.sourceforge.ezmwlucene.AttachmentExtractor.extract(AttachmentExtractor.java:221)
    at net.sourceforge.ezmwlucene.AttachmentExtractor.extract(AttachmentExtractor.java:183)
    at net.sourceforge.ezmwlucene.MediawikiArticleFactory.setAttachment(MediawikiArticleFactory.java:256)
    at net.sourceforge.ezmwlucene.MediawikiArticleFactory.fromResultSet(MediawikiArticleFactory.java:192)
    at net.sourceforge.ezmwlucene.util.DatabaseLoader.load(DatabaseLoader.java:138)
    at net.sourceforge.ezmwlucene.util.DatabaseLoader.main(DatabaseLoader.java:57)
java.io.FileNotFoundException: /var/www/mediawiki/sites/ccnlab/images/9/9f/Pauli_oreilly.pdf (No such file or directory)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:137)
    at java.io.FileInputStream.<init>(FileInputStream.java:96)
    at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:87)
    at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:178)
    at java.net.URL.openStream(URL.java:1027)
    at net.sourceforge.ezmwlucene.AttachmentExtractor.extract(AttachmentExtractor.java:182)
    at net.sourceforge.ezmwlucene.MediawikiArticleFactory.setAttachment(MediawikiArticleFactory.java:256)
    at net.sourceforge.ezmwlucene.MediawikiArticleFactory.fromResultSet(MediawikiArticleFactory.java:192)
    at net.sourceforge.ezmwlucene.util.DatabaseLoader.load(DatabaseLoader.java:138)
    at net.sourceforge.ezmwlucene.util.DatabaseLoader.main(DatabaseLoader.java:57)
java.io.FileNotFoundException: /var/www/mediawiki/sites/ccnlab/images/9/99/PisellaGrÃ©aTiliketeEtAl00.pdf (No such file or directory)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:137)
    at java.io.FileInputStream.<init>(FileInputStream.java:96)
    at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:87)
    at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:178)
    at java.net.URL.openStream(URL.java:1027)
    at net.sourceforge.ezmwlucene.AttachmentExtractor.extract(AttachmentExtractor.java:182)
    at net.sourceforge.ezmwlucene.MediawikiArticleFactory.setAttachment(MediawikiArticleFactory.java:256)
    at net.sourceforge.ezmwlucene.MediawikiArticleFactory.fromResultSet(MediawikiArticleFactory.java:192)
    at net.sourceforge.ezmwlucene.util.DatabaseLoader.load(DatabaseLoader.java:138)
    at net.sourceforge.ezmwlucene.util.DatabaseLoader.main(DatabaseLoader.java:57)
java.io.FileNotFoundException: /var/www/mediawiki/sites/ccnlab/images/3/3d/RamÃ­rez-LugoZavala-VegaBermÃºdez-Rattoni06.pdf (No such file or directory)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:137)
    at java.io.FileInputStream.<init>(FileInputStream.java:96)
    at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:87)
    at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:178)
    at java.net.URL.openStream(URL.java:1027)
    at net.sourceforge.ezmwlucene.AttachmentExtractor.extract(AttachmentExtractor.java:182)
    at net.sourceforge.ezmwlucene.MediawikiArticleFactory.setAttachment(MediawikiArticleFactory.java:256)
    at net.sourceforge.ezmwlucene.MediawikiArticleFactory.fromResultSet(MediawikiArticleFactory.java:192)
    at net.sourceforge.ezmwlucene.util.DatabaseLoader.load(DatabaseLoader.java:138)
    at net.sourceforge.ezmwlucene.util.DatabaseLoader.main(DatabaseLoader.java:57)
Your document seemed to be mostly unicode, but the section definition was in bytes! Trying anyway, but things may well go wrong!
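File names such as BrouilletCondÃ©BealEtAl99.pdf look like UTF-8 bytes that were decoded with a single-byte charset somewhere between the database and the file system lookup. The sketch below reproduces that symptom in isolation. It is a guess at the cause, not a diagnosis of the loader: the assumption that the original name contained é and that ISO-8859-1 was the wrong charset is not confirmed by the log, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of EzMwLucene.
final class MojibakeDemo {

    /** Encodes text as UTF-8, then decodes those bytes with the wrong charset. */
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    /** Reverses the damage, provided no bytes were lost along the way. */
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // \u00e9 is e-acute, written as an escape so the source file encoding does not matter.
        String name = "BrouilletCond\u00e9BealEtAl99.pdf";
        String garbled = misdecode(name);
        // The garbled name is one character longer: e-acute became two characters.
        System.out.println(garbled.length() - name.length());
        System.out.println(repair(garbled).equals(name));
    }
}
```

If this is what happens, the garbled name no longer matches the file on disk, which would explain the FileNotFoundException for exactly the files with accented characters. It would not explain the failures for plain ASCII names such as Dbf2007.pdf.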

== Special:Upload broken on new install ==


This can be fixed by applying the following patch:

--- EzMwLuceneIndexer.php.old  2009-05-08 16:26:28.000000000 -0500
+++ EzMwLuceneIndexer.php      2009-05-08 16:26:00.000000000 -0500
@@ -38,7 +38,7 @@
              $xml .= " <![CDATA[{$revision->getRawText}]]> ";
 /* If the article is in the Image namespace, add the file url */
 if ($article->getTitle->getNamespace==6) {
-                      $file =  wfFindFile($article->getTitle);
+                      $file = wfFindFile($article->getTitle,false,0,true);
 $xml .= " <![CDATA[{$file->getFullUrl}]]> ";
 }
              $xml .= ' ';

== Extension is mispackaged ==
The correct way to pack a tarball is such that, when you unpack it, it contains the following directory structure:

EzMwLucene_1.0/
    client/
    server/

It is rude to put client and server at the top level of the package.

== More unicode mishandling ==

 * WARNING: Funny characters in title SchrauwenD???HaeneVerstraetenEtAl07
 * 14000 row(s) processed
 * 14100 row(s) processed
 * 14200 row(s) processed
 * WARNING: Funny characters in title SerinoGiovagnoliL??davas09
 * 14300 row(s) processed
 * 14400 row(s) processed
 * 14500 row(s) processed
 * 14600 row(s) processed
 * 14700 row(s) processed
 * 14800 row(s) processed
 * 14900 row(s) processed
 * 15000 row(s) processed
 * 15100 row(s) processed
 * WARNING: Funny characters in title StruffertK??hrmannEngelhornEtAl09
 * 15200 row(s) processed
 * 15300 row(s) processed
 * WARNING: Funny characters in title TamosiunaiteAsfourW??rg??tter09
 * 15400 row(s) processed
 * WARNING: Funny characters in title TanakaBalleineO???Doherty08
 * 15500 row(s) processed
 * 15600 row(s) processed
 * 15700 row(s) processed
 * 15800 row(s) processed
 * 15900 row(s) processed
 * WARNING: Funny characters in title UrbanoLeznikLlin??s07
 * WARNING: Funny characters in title ValentinDickinsonO???Doherty07
 * 16000 row(s) processed
 * 16100 row(s) processed
 * WARNING: Funny characters in title VerstraetenSchrauwenD??HaeneEtAl07
 * 16200 row(s) processed
 * 16300 row(s) processed
 * 16400 row(s) processed
 * 16500 row(s) processed
 * 16600 row(s) processed
 * 16700 row(s) processed
 * WARNING: Funny characters in title WikiPapers/log/Kov????csMehler09
 * 16800 row(s) processed
 * WARNING: Funny characters in title WikiPapers/log/SerinoGiovagnoliL??davas09
 * WARNING: Funny characters in title WikiPapers/log/SotoFunesGuzm????n-Garc????aEtAl09
 * WARNING: Funny characters in title WikiPapers/log/TanakaBalleineO???Doherty08
 * WARNING: Funny characters in title WikiPapers/log/ValentinDickinsonO???Doherty07
 * WARNING: Funny characters in title WikiPapers/log/WinklerH??denLadinigEtAl
 * WARNING: Funny characters in title WinklerH??denLadinigEtAl
 * 16900 row(s) processed
 * 17000 row(s) processed
 * 17100 row(s) processed
 * 17200 row(s) processed
 * 17300 row(s) processed
 * 17400 row(s) processed
 * 17500 row(s) processed
 * WARNING: Funny characters in title BrouilletCond??BealEtAl99.pdf
 * 17600 row(s) processed
 * WARNING: Funny characters in title Carrillo-ReidTecuapetlaIb????ez-SandovalEtAl09.pdf
 * WARNING: Funny characters in title CepedaWuAndr??EtAl07.pdf
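The number of question marks in these warnings is consistent with each non-ASCII character being expanded to its UTF-8 bytes and each byte then being printed through an ASCII-only encoder. The sketch below shows that mechanism under two assumptions that the log does not confirm: that the original titles contained é and a right single quotation mark, and that the console encoding was US-ASCII. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of EzMwLucene.
final class QuestionMarkDemo {

    /** Misreads UTF-8 bytes as ISO-8859-1, then squeezes the result through US-ASCII. */
    static String throughAscii(String title) {
        String misread = new String(title.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
        // getBytes replaces every character that US-ASCII cannot encode with '?'.
        byte[] ascii = misread.getBytes(StandardCharsets.US_ASCII);
        return new String(ascii, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // \u00e9 (e-acute) is two bytes in UTF-8; \u2019 (right single quote) is three.
        System.out.println(throughAscii("CepedaWuAndr\u00e9EtAl07.pdf"));
        System.out.println(throughAscii("TanakaBalleineO\u2019Doherty08"));
    }
}
```

A two-byte character comes out as two question marks and a three-byte character as three, which matches titles such as CepedaWuAndr??EtAl07.pdf and TanakaBalleineO???Doherty08. Titles showing four question marks for one accented letter would fit the same misreading applied twice.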

== service.sh always fails (with FAIL) ==
Running $RUN_CMD directly from the command line, then backgrounding and disowning it, works just fine. However, the bash script you borrowed from another software package simply does not work on my OS. The shell script looks portable and expertly written; nonetheless, it depends on subtle run conditions that are not easy to debug and do not appear to be useful. Modern Linux operating systems do not need a shell script to start, stop, or restart processes; that is a job for the system to handle. I start the process in the following way. Note that after the long java command comes & disown %. This tells bash to background the process and disown it. You should put that in a file in /etc/init.d so that it gets started with your system.


 * .../EzMwLucene/server# /usr/lib/jvm/java-6-sun-1.6.0.10/jre/bin/java -cp "/var/www/ccnlab/extensions/EzMwLucene/server/ezmwlucene.jar:/var/www/ccnlab/extensions/EzMwLucene/server/lib/jetty-6.1.14.jar:/var/www/ccnlab/extensions/EzMwLucene/server/lib/jetty-util-6.1.14.jar:/var/www/ccnlab/extensions/EzMwLucene/server/lib/servlet-api-2.5-6.1.14.jar:/var/www/ccnlab/extensions/EzMwLucene/server/lib/commons-codec-1.3.jar:/var/www/ccnlab/extensions/EzMwLucene/server/lib/commons-httpclient-3.1.jar:/var/www/ccnlab/extensions/EzMwLucene/server/lib/commons-logging.jar:/var/www/ccnlab/extensions/EzMwLucene/server/lib/FontBox-0.1.0-dev.jar:/var/www/ccnlab/extensions/EzMwLucene/server/lib/lucene-core-2.4.0.jar:/var/www/ccnlab/extensions/EzMwLucene/server/lib/lucene-highlighter-2.4.0.jar:/var/www/ccnlab/extensions/EzMwLucene/server/lib/PDFBox-0.7.3.jar:/var/www/ccnlab/extensions/EzMwLucene/server/lib/poi-3.5-beta3-20080926.jar:/var/www/ccnlab/extensions/EzMwLucene/server/lib/poi-scratchpad-3.5-beta3-20080926.jar" net.sourceforge.ezmwlucene.service.EzMwLuceneService & disown %

Starting EzMwLucene server...
2009-05-08 16:22:48.039::INFO: Logging to STDERR via org.mortbay.log.StdErrLog
2009-05-08 16:22:48.259::INFO: jetty-6.1.14
2009-05-08 16:22:48.498::INFO: Started SocketConnector@0.0.0.0:8080
EzMwLucene started!

== Uploading a new PDF does not cause it to get indexed ==
When I ran loader.sh, there was a PDF on the wiki with the title "Family history of heart attack as an independent predictor of death due to cardiovascular disease". Searching for this title returned the PDF. I then deleted the PDF, which removed it from the index. When I re-uploaded it, it did not show up in the index again.

== Files are not searched by default ==
It appears that every single user will have to set their preferences to search the file namespace.

$wgNamespacesToBeSearchedDefault = array(
    NS_MAIN => true,
    NS_FILE => true,
);

This is from the MW FAQ, but it doesn't actually work. This did work for all users, however:

UPDATE user SET user_options = REPLACE(user_options, 'searchNs14=0', 'searchNs14=1');

Unfortunately, while this changes the preference, it does not cause the File namespace to actually be searched.

Here's an IRC log on this issue, which suggests that upgrading past r48811 might help:

this seems to have no effect whatsoever on my wiki! http://www.mediawiki.org/wiki/Manual:$wgNamespacesToBeSearchedDefault
i even set NS_MAIN => false it still searches main!
what version?
48811
search preferences don't seem to matter either
long story
huh, i just tested that it works on Wikipedia. wtf
but the short answer is that changes to preference defaults will only apply to existing users after r48811
I fixed the bug myself
lol ok, i guess i'll upgrade a revision? i try to stay in sync w/ WP
you'll need to upgrade a few revisions
 * Rdsmith4 has quit (".")

Indeed - upgrading to MediaWiki trunk in addition to using $wgNamespacesToBeSearchedDefault works.

== PDF logo is pointless! ==


--Alterego 00:22, 10 May 2009 (UTC)


 * When combined with the PDF handler, is it possible that this is less pointless? --Ryan lane 13:37, 11 May 2009 (UTC)