Topic on Talk:Requests for comment/Text extraction

Dantman (talkcontribs)
If extracts will be integrated into core, custom extraction classes could go to a separate extension (e.g. WikimediaTextExtraction); otherwise they could be part of the main extraction extension.

Personally, even if this were implemented in a TextExtraction extension instead of core (though I think it should be implemented in core), I wouldn't want Wikimedia-specific stuff in the generic MediaWiki extension. I.e., in both situations I'd prefer that WMF have a WikimediaTextExtraction extension.

Such timing is less than optimal; I propose to extract text during LinksUpdate and store it in page_props.

page_props is for storage of indexed and queryable data that results from the canonical parse run, i.e. something should only ever be stored there when there is also an equivalent parser cache entry.

page_props is for data you want to be able to query, not for general storage. Since you're not going to be making SQL queries that match against extraction results, the extraction data should be stored in the parser cache instead, using either ParserOutput::setExtensionData or a new property plus accessor methods on ParserOutput.

Alternatively, if you want to do this completely separately from the parser cache, the proposed DataStore would probably be the best method of storage.

MaxSem (talkcontribs)

We don't need text extracts in parser output:

  • I want to make extract retrieval a batch operation; that would never be possible if extracts only came with ParserOutput.
  • You need to generate an extract once per revision, not on every parse.

MZMcBride (talkcontribs)

Some wikis, such as Wiktionaries, rely heavily on templates. I'm not sure you can only generate an extract once... if templates change and the resulting page output changes, you'll need to re-generate an extract, right? Plus there will be incremental improvements to the extractor itself, which people will want to benefit from without needing to make dummy edits to pages.
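
To make the invalidation concern concrete, here is a minimal Python sketch. It is not MediaWiki code: `ExtractCache`, `strip_tags`, and the key layout are all hypothetical, and the "extractor" is just a naive tag stripper. It only illustrates the idea that an extract keyed on the revision ID alone goes stale, whereas a key that also covers the rendered output and an extractor version gets regenerated when either one changes.

```python
import hashlib
import re


def strip_tags(html):
    """Naive stand-in for a real extractor: drop tags, collapse whitespace."""
    return " ".join(re.sub(r"<[^>]+>", " ", html).split())


class ExtractCache:
    """Hypothetical extract store keyed on more than the revision ID."""

    def __init__(self, extractor, extractor_version):
        self.extractor = extractor
        self.extractor_version = extractor_version
        self.store = {}
        self.misses = 0  # counts how often an extract had to be (re)generated

    def get(self, rev_id, rendered_html):
        # The rendered output can change without a new revision (template
        # edits), so its hash is part of the key. The extractor version is
        # included so that extractor improvements invalidate old extracts.
        digest = hashlib.sha1(rendered_html.encode("utf-8")).hexdigest()
        key = (rev_id, digest, self.extractor_version)
        if key not in self.store:
            self.misses += 1
            self.store[key] = self.extractor(rendered_html)
        return self.store[key]


cache = ExtractCache(strip_tags, extractor_version=1)

# Same revision, same rendering: generated once, then served from the store.
first = cache.get(10, "<p>Hello <b>world</b></p>")
cache.get(10, "<p>Hello <b>world</b></p>")
misses_after_repeat = cache.misses

# Same revision, but a template change altered the rendered output.
changed = cache.get(10, "<p>Hello <i>there</i></p>")
misses_after_template_change = cache.misses

# Same revision and rendering, but the extractor itself was upgraded.
cache.extractor_version = 2
cache.get(10, "<p>Hello <i>there</i></p>")
misses_after_upgrade = cache.misses
```

In this sketch no dummy edit is needed in either case: the key changes by itself, so the next lookup regenerates the extract. Whether hashing the rendered output is affordable at Wikimedia scale is a separate question that the sketch does not address.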
