Topic on Extension talk:PdfHandler

Searching content of PDF files

2 comments • 03:01, 9 May 2019 4 years ago

Tommyheyser (talk | contribs)

I'm sure this topic has come up many times before, but the answers I've found through searching were usually along the lines of "just use PdfHandler" without much detail. I've gotten PdfHandler to work: it shows thumbnails on the File pages and creates text files of the PDFs in the images folder. How does the MW built-in search engine, or another search engine (I have CirrusSearch/Elastica/ElasticSearch running), make use of those text files?

Is there a configuration setting I need to turn on for MW to recognise the generated text files when indexing contents?

I'm asking because I still don't see the content of the PDFs in the search results, with either the MW built-in search engine or CirrusSearch.

I hope it's alright that I'm posting this here. I've also posted a similar question on the Extension talk:CirrusSearch page.

22:34, 8 May 2019

Tommyheyser (talk | contribs)

Okay, I'm not sure what happened, but since I'm running MW on Windows Server (sorry, I forgot to mention this before), the standard PdfHandler extension with my "workaround" wasn't working 100%. Thumbnail creation was okay and I thought pdftotext was working fine, but apparently it wasn't.
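For anyone hitting the same thing, one way to check whether pdftotext really produces text for a given PDF is to call it directly, outside of MediaWiki. This is only a rough sketch: it assumes the `pdftotext` binary from Poppler or Xpdf is on the PATH (or that you pass its full path), and the helper names `extract_pdf_text` and `has_text_layer` are made up for illustration. They are not part of PdfHandler.

```python
import shutil
import subprocess


def extract_pdf_text(pdf_path, pdftotext="pdftotext"):
    """Return the text pdftotext extracts from pdf_path.

    Returns None if the binary cannot be found or exits with an error.
    """
    exe = shutil.which(pdftotext)
    if exe is None:
        return None
    # A "-" as the output file tells pdftotext to write to stdout.
    result = subprocess.run([exe, str(pdf_path), "-"], capture_output=True, check=False)
    if result.returncode != 0:
        return None
    return result.stdout.decode("utf-8", errors="replace")


def has_text_layer(text):
    """True when the extracted text has at least one non-whitespace character."""
    return bool(text and text.strip())
```

If `has_text_layer(extract_pdf_text("some.pdf"))` comes back False for a PDF that clearly contains text, the problem is most likely in the pdftotext setup rather than in the search engine.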

I tried SeongMoon's version of PdfHandler, then ran maintenance/update.php, refreshImageMetadata.php, and rebuildImages.php, as well as extensions/CirrusSearch/maintenance/forceSearchIndex.php, as per the https://phabricator.wikimedia.org/source/extension-cirrussearch/browse/master/README file. It now seems to work, and PDF contents are showing up in search results.
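In case it helps someone repeat the steps, here is a minimal sketch that runs the same four scripts in the order listed above. The wiki root path is a placeholder, `build_commands` and `run_all` are hypothetical helper names, and no extra flags are passed because I haven't listed any here; the CirrusSearch README may recommend additional options for forceSearchIndex.php.

```python
import subprocess
from pathlib import Path

# Scripts from the post above, relative to the MediaWiki root, in the order they were run.
MAINTENANCE_SCRIPTS = [
    "maintenance/update.php",
    "maintenance/refreshImageMetadata.php",
    "maintenance/rebuildImages.php",
    "extensions/CirrusSearch/maintenance/forceSearchIndex.php",
]


def build_commands(wiki_root, php="php"):
    """Return one [php, script_path] argument list per maintenance script."""
    root = Path(wiki_root)
    return [[php, str(root / script)] for script in MAINTENANCE_SCRIPTS]


def run_all(wiki_root, php="php", dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in build_commands(wiki_root, php):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops at the first script that fails.
            subprocess.run(cmd, check=True, cwd=str(wiki_root))
```

Calling `run_all("C:/path/to/mediawiki")` only prints the commands; pass `dry_run=False` to actually run them.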

03:01, 9 May 2019
