Topic on Extension talk:CirrusSearch

Searching content of PDF files/PdfHandler-generated text files

Tommyheyser (talkcontribs)

I think I have CirrusSearch/Elastica/Elasticsearch running, and I'm using it as the main search engine. I also have PdfHandler running on MW 1.31.1, and it's able to generate thumbnails and text files from the uploaded PDFs. However, I still don't see the contents of the PDFs in the search results. Is there something I've missed? Is there a configuration setting I need to enable to include the text of the PDFs in the search index?

I hope it's okay, I've cross-posted a similar question in Extension talk:PdfHandler.
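For anyone debugging the same symptom, one way to rule out the text-extraction step is to run pdftotext (the tool PdfHandler relies on for text) directly on an uploaded file. The following is only a minimal sketch: the helper names are made up for illustration, and it assumes pdftotext is on the PATH, which may not hold on Windows.

```python
import shutil
import subprocess
from typing import List, Optional


def pdftotext_command(pdf_path: str, exe: str = "pdftotext") -> List[str]:
    """Build the command line (hypothetical helper, not part of PdfHandler)."""
    # Passing "-" as the output file asks pdftotext to write to stdout.
    return [exe, pdf_path, "-"]


def extract_text(pdf_path: str) -> Optional[str]:
    """Return the extracted text, or None if extraction is not possible."""
    exe = shutil.which("pdftotext")
    if exe is None:
        # pdftotext is not installed or not on the PATH.
        return None
    result = subprocess.run(
        pdftotext_command(pdf_path, exe),
        capture_output=True,
        encoding="utf-8",
        errors="replace",
    )
    if result.returncode != 0:
        # pdftotext reported an error (missing file, damaged PDF, ...).
        return None
    return result.stdout
```

If `extract_text("some-upload.pdf")` returns `None` or an empty string for a PDF that clearly contains text, the problem is likely upstream of the search index.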

Tommyheyser (talkcontribs)

I'm going through the settings.txt doc right now, but if anyone wants to point me to the specifics, it'd be much appreciated.

Tommyheyser (talkcontribs)

Okay, not sure what happened, but since I'm running MW on Windows Server (sorry, forgot to mention this before), the standard PdfHandler extension with my "workaround" wasn't working 100%. Thumbnail creation was okay, and I thought pdftotext was working fine, but apparently it wasn't. There's a link to a Windows Server version of the PdfHandler extension on the Extension:PdfHandler page.

I used that instead, then ran maintenance/update.php, refreshImageMetadata.php, and rebuildImages.php, as well as extensions/CirrusSearch/maintenance/forceSearchIndex.php, as per the https://phabricator.wikimedia.org/source/extension-cirrussearch/browse/master/README file. It now seems to work, and PDF contents are showing up in search results.
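The sequence of scripts above could be wrapped roughly like this. It is only a sketch: the function names and the `mw_root` parameter are illustrative, it assumes `php` is on the PATH, and it passes no flags because none are given here; the CirrusSearch README may recommend specific options for forceSearchIndex.php.

```python
import subprocess
from pathlib import Path
from typing import List

# Scripts in the order they were run, relative to the MediaWiki root.
SCRIPTS = [
    "maintenance/update.php",
    "maintenance/refreshImageMetadata.php",
    "maintenance/rebuildImages.php",
    "extensions/CirrusSearch/maintenance/forceSearchIndex.php",
]


def build_commands(mw_root: str, php: str = "php") -> List[List[str]]:
    """Return one command line per maintenance script (no flags added)."""
    root = Path(mw_root)
    return [[php, str(root / script)] for script in SCRIPTS]


def run_all(mw_root: str, php: str = "php") -> None:
    """Run each script in order, stopping at the first failure."""
    for cmd in build_commands(mw_root, php):
        print("Running:", " ".join(cmd))
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(cmd, check=True, cwd=mw_root)
```

Calling `run_all` with the path to the wiki installation runs the four scripts one after another; stopping on the first error avoids reindexing when an earlier step has failed.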
