Topic on Extension talk:CirrusSearch

RacingRalf

What do I have to do to set up CirrusSearch so that it can index the content of PDF files?

I'm using MW 1.27, Elasticsearch 1.7.5, CirrusSearch 1.27 and the current version of Elastica. Searching within wiki pages seems to work, but I'm not able to search for content inside PDF files.

Thank you!

DCausse (WMF)

CirrusSearch will index PDF content if the PdfHandler extension is installed: Extension:PdfHandler.

You may have to run some maintenance scripts to refresh the data of the PDFs already on your wiki (please check the PdfHandler documentation).
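
Once the PDF data has been refreshed and reindexed, one way to check whether the text is searchable is to query the File namespace through the MediaWiki search API. This is only a minimal Python sketch: the wiki URL is a placeholder and the helper names `build_file_search_url` and `titles_from_response` are made up for illustration. It assumes the standard `action=query&list=search` API module, with namespace 6 being the File namespace.

```python
from urllib.parse import urlencode


def build_file_search_url(api_url, phrase, limit=10):
    """Build a search API URL restricted to the File namespace (id 6)."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": phrase,
        "srnamespace": 6,  # File namespace, where uploaded PDFs live
        "srlimit": limit,
        "format": "json",
    }
    return api_url + "?" + urlencode(params)


def titles_from_response(data):
    """Pull page titles out of a decoded list=search JSON response."""
    return [hit["title"] for hit in data.get("query", {}).get("search", [])]


if __name__ == "__main__":
    # Placeholder endpoint: replace with your own wiki's api.php.
    print(build_file_search_url("https://wiki.example.org/w/api.php", "pump manual"))
```

Open the printed URL in a browser, or fetch it with any HTTP client, using a phrase that only occurs inside one of your PDFs. If the matching File: page does not show up, the file text has probably not reached the index yet.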

Jonnnius

Is there any chance of including docx, doc, odt, etc. in the indexing as well?

Dgennaro

I am also interested in indexing MS Office files. Is this doable?

DCausse (WMF)

I don't think this is Cirrus-specific: Cirrus will index any content that is supported by a media handler. The question is rather "Is there a MediaHandler extension that supports Microsoft documents the way PdfHandler supports PDFs?".

I'd suggest asking this on the mediawiki-l mailing list. It has been asked there before, but without a clear answer (see https://lists.wikimedia.org/pipermail/mediawiki-l/2016-September/thread.html#45836).

I did a quick search but was not able to find an extension like that, so unless some code is hidden in another extension, I'm afraid someone would have to develop it.

BluAlien

There was Extension:FileIndexer, which worked very well for indexing pdf, doc, docx, xls, ppt, etc. Unfortunately it was removed and abandoned due to security issues. I think it was very useful, and I'm really astonished that nothing has been developed to replace it.

Reply to "Search pdf files"