Could someone provide code, or point me to resources, for parsing the wikicode in the "text" attribute of the Elasticsearch page dumps? I've been working on NLP dataset generation from Wikipedia dumps, but I can't get satisfactory results with most of the parsers I've tested (mwparserfromhell, wikitextparser, mediawiki-parser). I need the same text as in the Cirrus dump, but with the internal links preserved. Thank you for any information!
Topic on Help talk:CirrusSearch
The text used in the CirrusSearch dumps comes from the allText value created by WikiTextStructure::extractWikitextParts.
For the most part, the processing takes the HTML output of MediaWiki's wikitext parser, strips out elements matching a set of CSS selectors that identify non-content and auxiliary parts of the page, and then strips all remaining tags from the content.
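That strip-by-selector step can be approximated with Python's standard-library HTML parser. This is only an illustrative sketch: the class names in EXCLUDED_CLASSES below are placeholders, and the real selector list lives in CirrusSearch's WikiTextStructure code, not here.

```python
from html.parser import HTMLParser

# Hypothetical stand-ins for the real CirrusSearch selector list.
EXCLUDED_CLASSES = {"infobox", "navbox", "toc"}

class ContentExtractor(HTMLParser):
    """Drop subtrees whose class matches an excluded selector,
    then keep only the text of what remains. Assumes reasonably
    well-formed (balanced) HTML, as produced by the parser."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside an excluded subtree
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            self.skip_depth += 1
            return
        classes = dict(attrs).get("class", "").split()
        if EXCLUDED_CLASSES & set(classes):
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

    def text(self):
        return "".join(self.parts).strip()
```

Feeding it a fragment like `<p>Main text</p><table class="infobox">…</table>` keeps the paragraph text and drops the infobox entirely.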
Unfortunately I'm not aware of a way to get the bulk HTML content of the wiki. You may need to run the MediaWiki parser yourself, and even that may have difficulties depending on template and Lua usage.
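Since the original question was about keeping internal links, here is a rough sketch of how one might flatten [[Target|label]] / [[Target]] wikilinks to their visible label while recording the link targets before (or instead of) full parsing. This regex only covers simple link syntax; nested templates, files, and category links need a real parser.

```python
import re

# Matches [[Target]] and [[Target|label]]; group 1 is the target,
# optional group 2 is the display label.
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")

def flatten_links(wikitext):
    """Replace simple wikilinks with their labels and return
    (flattened_text, [(target, label), ...])."""
    links = []

    def repl(m):
        target = m.group(1)
        label = m.group(2) or target
        links.append((target, label))
        return label

    return WIKILINK.sub(repl, wikitext), links
```

For example, `flatten_links("See [[Apollo 11|the mission]] and [[NASA]].")` yields the plain sentence plus the two (target, label) pairs, which can then be re-attached to the text as annotations.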