Topic on Help talk:CirrusSearch

Code used to parse the text as in cirrus dump

80.12.85.103 (talkcontribs)

Could someone provide the code, or describe the method, used to parse wikicode into the "text" attribute of the Elasticsearch pages, or point me to where I can learn more? I've been working on NLP dataset generation from Wikipedia dumps, but I can't get satisfactory results with most of the parsers I've tested (mwparserfromhell, wikitextparser, mediawiki-parser). I need the same text as in the cirrus dump, but with the internal links kept. Thank you for any information!

EBernhardson (WMF) (talkcontribs)

The text used in the CirrusSearch dumps comes from the allText value created by WikiTextStructure::extractWikitextParts.

For the most part, the processing takes the HTML output from MediaWiki's wikitext parser, removes elements matching a set of CSS selectors that identify the non-content and auxiliary parts of a page, and then strips all remaining tags from what is left.
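As a rough Python sketch of that pipeline (not the actual CirrusSearch code; the real selector list lives in the extension source, so the one below is just a placeholder), using BeautifulSoup. Since you want to keep internal links, one option is to rewrite them to [[target|label]] before the tags are stripped:

```python
# Sketch of the strip-selectors-then-strip-tags approach described above.
# EXCLUDE_SELECTORS is illustrative; the real list is in the CirrusSearch
# extension (see WikiTextStructure::extractWikitextParts).
from bs4 import BeautifulSoup

EXCLUDE_SELECTORS = [".mw-editsection", "#toc", ".navbox", ".infobox"]  # assumption

def extract_text(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Drop non-content / auxiliary elements.
    for selector in EXCLUDE_SELECTORS:
        for node in soup.select(selector):
            node.decompose()
    # Rewrite internal links to [[target|label]] so they survive tag stripping.
    for a in soup.find_all("a", href=True):
        href = a["href"]
        if href.startswith("/wiki/"):
            target = href[len("/wiki/"):].replace("_", " ")
            a.replace_with(f"[[{target}|{a.get_text()}]]")
    # Strip the remaining tags, keeping only the text content.
    return " ".join(soup.get_text(separator=" ").split())

html = '<p>See <a href="/wiki/Lua_(programming_language)">Lua</a>.</p><div id="toc">x</div>'
print(extract_text(html))  # prints: See [[Lua (programming language)|Lua]] .
```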

Unfortunately, I'm not aware of a way to get the bulk HTML content of the wiki. You may need to run the MediaWiki parser yourself, and even that may have difficulties depending on template and Lua usage.
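If per-page (rather than bulk) HTML is enough for your use case, the standard action API's action=parse returns the parser's HTML output for a single page. A minimal sketch with requests (the User-Agent string is just an example):

```python
# Minimal sketch: fetch the parsed HTML of one page via the MediaWiki action API.
import requests

def fetch_page_html(title: str, api_url: str = "https://en.wikipedia.org/w/api.php") -> str:
    resp = requests.get(api_url, params={
        "action": "parse",
        "page": title,
        "prop": "text",          # return the rendered HTML of the page body
        "format": "json",
        "formatversion": 2,      # v2 returns parse.text as a plain string
    }, headers={"User-Agent": "nlp-dataset-builder/0.1 (example)"})
    resp.raise_for_status()
    return resp.json()["parse"]["text"]

html = fetch_page_html("Lua (programming language)")
```

Fetching every article this way is slow and rate-limited, so for a whole-dump workflow you would still end up parsing wikitext locally, with the template and Lua caveats above.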
