Topic on Help talk:CirrusSearch

Problem indexing PDF documents that include Cyrillic characters

LveFunc (talkcontribs)

Hello guys,


Lately I've been working on deploying a local MediaWiki. Everything went smoothly until it came to indexing the contents of PDF files that contain characters other than US-ASCII. Running '?action=cirrusDump' and looking at the 'file_text' field shows that all Cyrillic characters are dropped while Latin characters are preserved. The folks at ru.wikipedia.org have somehow managed to get this working, but I couldn't find a solution online. I would be very thankful if somebody could point out why this happens and how I could solve it.

My configuration is:

MediaWiki - 1.36.1

PHP - 7.4.22 (apache2handler)

PostgreSQL - 13.3

ICU - 66.1

Elasticsearch - 6.5.4

PDF Handler - c9705a8

AdvancedSearch - c8a42b8

CirrusSearch - 6.5.4 (ab802b7)

Elastica - 6.1.3 (9f6e66a)

My Elasticsearch configuration (installed plugins):

analysis-icu

extra MediaWiki plugin

ingest-attachment

DCausse (WMF) (talkcontribs)

Hi,

CirrusSearch does not manipulate the text it receives from Extension:PdfHandler. I would check whether this extension is working properly, and in particular whether the tool it depends on (set via $wgPdftoText, most likely pdftotext) extracts the text you expect.
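
One way to run that check outside of MediaWiki is to call pdftotext directly on one of the affected files and count how many Cyrillic characters come back. The script below is only a rough sketch: the helper names are made up for illustration, it assumes pdftotext (from poppler or xpdf) is on the PATH, and it passes "-enc UTF-8" explicitly, which may differ from how PdfHandler invokes the tool on your server.

```python
import subprocess
import sys


def extract_text(pdf_path, pdftotext="pdftotext"):
    """Run pdftotext on pdf_path and return the extracted text as a str.

    Assumption: the binary named by `pdftotext` is the same one that
    $wgPdftoText points at on the wiki server.
    """
    # "-enc UTF-8" requests UTF-8 output; the trailing "-" sends it to stdout.
    result = subprocess.run(
        [pdftotext, "-enc", "UTF-8", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def count_cyrillic(text):
    """Count characters in the basic Cyrillic block (U+0400 to U+04FF)."""
    return sum(1 for ch in text if "\u0400" <= ch <= "\u04ff")


if __name__ == "__main__" and len(sys.argv) > 1:
    text = extract_text(sys.argv[1])
    print(f"total characters: {len(text)}")
    print(f"Cyrillic characters: {count_cyrillic(text)}")
```

If the Cyrillic count is zero for a PDF that visibly contains Russian text, the characters are being lost at the extraction step, before CirrusSearch sees them. If the count looks right here but 'file_text' is still missing them, the difference is more likely in how the tool is called from the web server (for example its locale or encoding settings).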