Talk:Parsoid/Extension API

Accessing template parameters from a tag handler

Summary by SSastry (WMF)

Filed https://phabricator.wikimedia.org/T347507 for this feature request

Alex44019 (talkcontribs)

Is there a way, within a tag handler, to access parameters passed to the template the tag is in? This was possible in the legacy parser as the template's frame was passed around. However, in Parsoid it seems a legacy tag handler receives a general (or a "fake"?) frame with no access to arguments, and an extension tag handler doesn't seem to have any methods to do so.

SSastry (WMF) (talkcontribs)

No, right now there isn't. We never added this because none of the use cases we have worked on so far needed it. What extension are you working on right now where this is used? In any case, we can look into it and see what we can expose.

Cscott (talkcontribs)

The way that Scribunto modules and extensions can crawl the parse frame and use stuff from parent contexts always seemed vaguely like a misfeature to us. It makes reusing page fragments and tracking dependencies between them very difficult, and seems to have been mostly used in kludgey workarounds like T331906. My current thinking about the issue is at T122934#9196348; the tl;dr is that we should have a "data context" for every page which can be *explicitly* passed into subpages/templates where needed, but that we should *not* recreate the ability to implicitly crawl the frame.

Alex44019 (talkcontribs)

I've been considering porting PortableInfobox onto Parsoid.

The extension adds a (legacy) <infobox> tag, whose contents are custom XML nodes that describe the infobox's composition and inputs - an example template may look like this:

<infobox layout="tabular">
	<title source="name">
		<default>{{PAGENAME}}</default>
	</title>
	<image source="image">
		<caption source="caption" />
	</image>
	<data source="role">
		<label>Role</label>
	</data>
	<data source="cost">
		<label>Unlock requirements</label>
		<default>0/Unplayable[[Category:Unplayable aircraft]]</default>
		<format>{{{cost}}} credits/prestige</format>
	</data>
</infobox>

Each source attribute is the parameter from which a row draws data: the "Role" row draws its value from the {{{role|}}} parameter, the image draws its file from {{{image|}}} and caption from {{{caption|}}}, and so on. The only "direct" reference to a parameter in this snippet is in the "Unlock requirements" row (which draws from {{{cost}}} via the format, but only if the parameter is not empty when the data node is evaluated).

Naturally, given those parameters aren't delivered explicitly, this approach does not work at all when rendering with Parsoid. If I'm not mistaken, it relies on the output of getArguments on the immediate PPTemplateFrame_Hash.

I do agree that, if this had been a parser function and not a tag, drawing those arguments from seemingly nowhere is, to keep this reply short, bad. But this case is more of an alternative reference format rather than drawing data from nowhere, given all attributes are (explicitly) linked through the source attributes and default/format nodes.
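As a rough illustration of the "alternative reference format" described above, the sketch below resolves each node's source attribute against a dictionary of template arguments that is passed in explicitly. It is written in Python rather than PHP and is not PortableInfobox's or Parsoid's code: the function name resolve_rows, the trimmed-down markup, and the rule that a row without a value or default is skipped are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Trimmed-down infobox markup (hypothetical; based on the example above).
INFOBOX = """
<infobox layout="tabular">
  <title source="name"><default>Untitled</default></title>
  <data source="role"><label>Role</label></data>
  <data source="cost">
    <label>Unlock requirements</label>
    <default>0/Unplayable</default>
    <format>{{{cost}}} credits/prestige</format>
  </data>
</infobox>
"""


def resolve_rows(infobox_xml, template_args):
    """Resolve each node's `source` attribute against explicitly passed args."""
    rows = []
    for node in ET.fromstring(infobox_xml):
        source = node.get("source")
        value = template_args.get(source, "").strip()
        if value:
            # Apply <format> only when the parameter is non-empty.
            fmt = node.findtext("format")
            if fmt is not None:
                value = fmt.replace("{{{" + source + "}}}", value)
        else:
            # Fall back to <default>; skip the row when there is none (assumption).
            value = node.findtext("default")
            if value is None:
                continue
        rows.append((node.tag, node.findtext("label"), value))
    return rows
```

The point of the sketch is only that the handler needs the argument map handed to it; nothing in it crawls a parent frame.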

SSastry (WMF) (talkcontribs)

Ya, I can see how this might be useful for this use case. We may not get to this right away since we are currently focused on completing Parsoid read views for Wikimedia wikis, but I'll file a phab task and we'll get to this. This information will likely be exposed via the ParsoidExtensionAPI object -- the specific details of how this is implemented will need to factor in the considerations that Scott raises. This new API method will not let you crawl the frame but can give access to the enclosing template parameter strings where applicable.

Arlolra (talkcontribs)

Albeit cumbersome, the above would probably work with {{#tag:infobox|...}} in the template.

What about the ParserOptionsRegister hook

Summary by SSastry (WMF)

Tracked on the main page in the section where parser hooks are mapped to Parsoid functionality.

FO-nTTaX (talkcontribs)

The ParserOptionsRegister hook has not been mentioned anywhere. Will it still be able to override defaults in Parsoid like it was in the old Parser?

SSastry (WMF) (talkcontribs)

We'll investigate usage and figure out what / how we can support it. Offhand, it looks like something that should be supportable.

Question about StructuredNavigation extension with custom content model and related extension tag

Summary by SSastry (WMF)

Parsoid API has adequate support for this extension and use case

SamanthaNguyen (talkcontribs)

Hello Parsoid developers,

I was looking at the Wikitech-l email regarding Parsoid, so I would like to mention my use case. I have an extension I maintain called StructuredNavigation. Here is how it works:

  1. It implements a custom content handler that extends the default JSON content handler, in the Navigation namespace. This is where data for "navigations" go, which is intended as another way of writing navboxes with JSON instead.
  2. The user writes JSON in the Navigation namespace.
  3. The user uses the MW extension-registered parser tag, which requires one attribute, which must be a valid title name in the Navigation namespace.
  4. The JSON is retrieved from that page in the Navigation namespace, which is then transformed to HTML with CSS loaded with ResourceLoader.

So for example:

  1. A page is created at Navigation:Dontnod Entertainment.
  2. A wiki page is created, with the content: <mw-navigation title="Dontnod Entertainment" />.
  3. They press submit.
  4. The JSON is retrieved, processed, and then transformed to an HTML navigation box on final output.

I'm aware that the way this parser tag is used is probably unique among extensions. From what I could tell, this use case doesn't fall under any of the ones already listed. If the team can confirm that it does not fall under any of the 4 listed, is this something that support could also be planned for?

To note, this parser tag doesn't accept any text inside; it is a self-closing parser tag. If the title that the user passes doesn't exist, there is simply no HTML rendered instead.
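The flow described above (look up the Navigation page, read its JSON, emit HTML, and emit nothing when the title does not exist) can be sketched in a few lines. This is a Python stand-in, not the extension's PHP code: the NAVIGATION_PAGES dictionary, the JSON fields "title" and "links", and the output markup are all made up for the illustration.

```python
import html
import json

# Stand-in for pages in the Navigation namespace (hypothetical schema and content).
NAVIGATION_PAGES = {
    "Dontnod Entertainment": json.dumps({
        "title": "Dontnod Entertainment",
        "links": ["Life Is Strange", "Vampyr"],
    }),
}


def render_navigation(title):
    """Return navbox HTML for a Navigation page, or "" if the page does not exist."""
    raw = NAVIGATION_PAGES.get(title)
    if raw is None:
        # Missing title: render no HTML at all, as described above.
        return ""
    data = json.loads(raw)
    items = "".join(f"<li>{html.escape(link)}</li>" for link in data["links"])
    return f"<nav><h2>{html.escape(data['title'])}</h2><ul>{items}</ul></nav>"
```

Here render_navigation("Dontnod Entertainment") plays the role of what the tag handler would return for <mw-navigation title="Dontnod Entertainment" />.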

SSastry (WMF) (talkcontribs)

Thanks for checking in about this.

Parsoid supports ContentModelHandler extensions and they need to extend the Wikimedia\Parsoid\Ext\ContentModelHandler class. So, in your case, the code that implements that ContentModelHandler support will need to implement the toDOM and fromDOM methods. For an example, check the JSON extension.

As for the parser tag, when Parsoid calls your tag's sourceToDom method (which you will need to implement), that code will need to do whatever you are doing now to retrieve the content of the Navigation:$titleAttr page and process it - that is probably going to invoke your content model handler's toDOM method. So, that will be the return value of the sourceToDom method (you will probably also update the ParserOutput object to register modules, etc.).

That is a somewhat handwavy response for now, but yes, I don't see anything in your use case that won't be supported.

If we discover any API gaps, we can figure out and fix those gaps.

SSastry (WMF) (talkcontribs)

One question is whether the ParsoidExtensionAPI object has API methods for you to retrieve the Navigation namespace content (if that is needed). And, once you retrieve it, the other question is whether you are going to go through Parsoid to convert your non-wikitext content or short-cut it internally by calling your extension's code.

If the former, we need to make sure that ParsoidExtensionAPI::extTagToDOM method has options to specify the content model OR we need to provide an alternate method to convert non-wikitext content models to DOM.

SSastry (WMF) (talkcontribs)

But to go back to your comment, we could consider your ext-tag as either type 1 or type 3 in that categorization depending on whether we treat custom content models as "wikitext" or not.

SamanthaNguyen (talkcontribs)

Thank you SSastry, that helps. It looks like it does; I could use ParsoidExtensionAPI::makeTitle(), passing the user-supplied title as the first parameter and the Navigation namespace ID as the second.

I would most likely do the latter (calling my own code), since I am already doing that right now.

Thanks for the clarification :)

SamanthaNguyen (talkcontribs)

For clarification, currently the extension:

  1. Attempts to construct a Title object from the title string passed by the user (returning early if Title::exists() returns false)
  2. Then it constructs a WikiPage from the Title object
  3. Then retrieves the Content from the WikiPage, by calling WikiPage::getContent()

So in short/tldr, the former method to use Parsoid doesn't seem to be needed for this extension.

Summary by SSastry (WMF) (last edited 15:43, 14 August 2020)

Not yet supported, but will be supported (watch the page for updates).

Lucas Werkmeister (WMDE) (talkcontribs)

Is there a Parsoid extension API version of Parser::setFunctionHook()? I looked around a bit but couldn’t find anything.

SSastry (WMF) (talkcontribs)

Not yet. But, we'll add some form of this for certain.

Lucas Werkmeister (WMDE) (talkcontribs)

Alright, thanks.

Summary by SSastry (WMF)

Config now uses ObjectFactory spec

Anomie (talkcontribs)

getConfig() seems early to instantiate processors that may not actually be needed. I'd expect either:

  • getConfig() signals which kinds of processors the extension will provide, and some other method is called to supply those processors when they actually are needed.
  • getConfig() includes ObjectFactory specs, which are instantiated when Parsoid actually needs them.

Something else to consider is whether an extension might want to provide multiple processors for a transformation. It may be more logical to do that than to have to do multiple transformations within just one class.
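To make the second option above concrete, here is a minimal sketch of deferred instantiation from factory specs. It is Python, not MediaWiki's ObjectFactory, and the names LazyProcessors, LinkProcessor, get_config and the spec keys "factory" and "args" are hypothetical; it only shows that a config method can return specs without building anything.

```python
class LazyProcessors:
    """Holds factory specs and builds each processor only on first use."""

    def __init__(self, specs):
        self._specs = specs      # name -> {"factory": callable, "args": [...]}
        self._instances = {}

    def get(self, name):
        if name not in self._instances:
            spec = self._specs[name]
            self._instances[name] = spec["factory"](*spec.get("args", []))
        return self._instances[name]


created = []  # Records constructor calls so the laziness is observable.


class LinkProcessor:
    def __init__(self, suffix):
        created.append("LinkProcessor")
        self.suffix = suffix


def get_config():
    # Returns specs only; nothing is instantiated here.
    return {"wt2htmlPostProcessor": {"factory": LinkProcessor, "args": [".m"]}}
```

With this shape, building the registry from get_config() creates no processor; the first get("wt2htmlPostProcessor") call does, and later calls reuse the same instance.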

SSastry (WMF) (talkcontribs)

Good thought about eager instantiation of processors. I'll have to ponder which one is more appropriate.

As for multiple processors, I'm trying to understand the use case. This actually touches upon another bit that I am in the process of adding to the page which is the matter of ordering of the global dom processors across extensions. Not sure how MediaWiki core handles ordering issues among hooks, but we haven't yet figured out how to tackle that. We allude to this problem in a long code comment in DOMPostProcessors.php but haven't really thought through it.

But, assuming all processors registered by an extension are run at the same time, the extension can internally orchestrate the ordering and which processors it wants to run instead of registering multiple processors and having the API orchestrate the order. One use case I can imagine for your proposal is if extensions get a mechanism to specify priority, then processors registered by the same extension might get interleaved with those of other extensions. But, barring that, it seems simpler to provide a single entry point per global DOM transformation.

Anomie (talkcontribs)

MediaWiki mostly ignores ordering issues among hooks, unfortunately. As you observed elsewhere, it's usually the case that hooks don't actually collide. And for Parser.php hooks in particular, extensions most often just maintain internal state and produce output during the first pass rather than producing placeholder output and cleaning it all up in a later pass, which is exactly what we don't want for Parsoid. But the ordering question seems more relevant to the "Can domProcessors generate new DOM that might need processing?" topic rather than this one.

As a use case for multiple processors... Maybe MobileFrontend might serve as an example. One processor that runs through all the links to mangle them from "xx.wikipedia.org" to "xx.m.wikipedia.org", one to reorder the lead paragraph and infobox, one to hack out navboxes, and so on. It might make for cleaner code for those to actually be separate processors, rather than having one processor that does all of those things at once (or one processor that internally calls multiple processors, with every extension reinventing its own way of doing that).

From Parsoid's point of view, MobileFrontend having multiple processors would be no different from multiple different extensions having one processor each. The only difference would be that "wt2htmlPostProcessor" would hold an array of implementations (usually a 1-element array) rather than specifying only one.
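A small sketch of the "array of implementations" idea, assuming a registry that maps each extension to a list of processors run in order. It is Python over xml.etree rather than Parsoid's DOM, and the function names, the PROCESSORS registry, and the class="navbox" check are illustrative assumptions, not MobileFrontend's actual behaviour.

```python
import xml.etree.ElementTree as ET


def mobile_links(root):
    """Rewrite xx.wikipedia.org hrefs to xx.m.wikipedia.org."""
    for a in root.iter("a"):
        href = a.get("href", "")
        if ".wikipedia.org" in href and ".m.wikipedia.org" not in href:
            a.set("href", href.replace(".wikipedia.org", ".m.wikipedia.org"))


def drop_navboxes(root):
    """Remove every element whose class attribute is exactly "navbox"."""
    for parent in list(root.iter()):
        for child in list(parent):
            if child.get("class") == "navbox":
                parent.remove(child)


# One entry per extension; each holds a list, usually with a single element.
PROCESSORS = {"MobileFrontend": [mobile_links, drop_navboxes]}


def run_post_processors(root, registry):
    """Run every registered processor, in registration order, on the same tree."""
    for processors in registry.values():
        for process in processors:
            process(root)
    return root
```

From the runner's side, two processors from one extension look the same as one processor each from two extensions, which is the point made above.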

Access to ParserOutput?

Summary by SSastry (WMF)

Hook methods ( in ExtensionTagHandler and DOMProcessor ) and/or ParsoidExtensionAPI need to expose the ParserOutput instance since extensions currently rely on accessing it and updating state.

Roan Kattouw (WMF) (talkcontribs)

Will code using these new hooks have access to the ParserOutput object? This is used by a lot of tags/parser functions to add tracking categories, page_props, RL modules and other metadata.

SSastry (WMF) (talkcontribs)

Good question. The current draft doesn't. But, we could potentially expose both ParserOptions and ParserOutput objects via ParsoidExtensionAPI. @CAnanian (WMF) is working on refactoring the core classes and so the specifics of what methods and properties those classes export might change between now and then. But, that is a detail we can ignore here.

Alternatively, we could proxy the desired functionality through ParsoidExtensionAPI object.

I am not certain yet whether proxying is better OR direct exposure of ParserOptions and ParserOutput classes is better. Thoughts? For example, with the Sanitizer object, we started off with proxying and are now leaning towards direct exposure of the Sanitizer class' API.

Roan Kattouw (WMF) (talkcontribs)

For ParserOutput I personally lean towards direct exposure, because as far as I remember, it's already pretty tailored to things that make sense to access/store while parsing and that are cacheable. If there are things in there that don't make sense in a Parsoid world (and are more specific to the wikitext parser world), then maybe that's an issue, but if not then I think you're probably better off not reinventing the numerous wheels that ParserOutput has acquired over time.

SSastry (WMF) (talkcontribs)

My initial thought here is to proxy the desired functionality since direct exposure (a) risks ParserOptions and ParserOutput exposing more configuration than extensions should use directly, (b) may give backdoor access to parsing functionality, depending on what they expose, and (c) expands the compatibility interface that we will have to maintain and cannot refactor / modify freely.

But, depending on what the ParserOutput refactor yields, it might be possible to have it be narrow / abstract enough to not have these pitfalls.

SSastry (WMF) (talkcontribs)

Heh! "edit conflict" :-) Yes, depending on the specific details of what the interface exposes, direct exposure can be better. We'll review code and update. Thanks for flagging this gap.

SSastry (WMF) (talkcontribs)

I just looked at that class, and it has 100-odd methods and other public properties. So, that spells the end of any proxying desires. Narrowing interfaces any further is probably best done at a future date. But, for now, it does make sense to expose the ParserOutput class.

Extension tags' "class"

Summary by SSastry (WMF)

ObjectFactory spec implemented

Anomie (talkcontribs)

If we want to support DI, rather than just a "class" it should probably be an ObjectFactory spec.

SSastry (WMF) (talkcontribs)

Noted. I'll update this detail in the second pass. We might have to fix our code first and / or wait till we start moving Parsoid extension implementations out of the Parsoid codebase and into the extensions' repos.

SSastry (WMF) (talkcontribs)

Scott's latest patch (in review and soon to be merged) supports this spec.

"styles" (and other modules)

Summary by SSastry (WMF)

Clarified the language to avoid confusion.

Anomie (talkcontribs)

Why are styles being declared via getConfig()? It seems unlikely that there will actually be styles that need to be added to every page.

I'd expect that styles would be added as needed when the wikitext is processed, much as how DataAccess::parseWikitext() or ::preprocessWikitext() currently include 'modules', 'modulescripts', and 'modulestyles' as part of their returned data.

SSastry (WMF) (talkcontribs)

I don't fully understand your comment. But, when Parsoid processes an <ext> tag, how would it know what modules to add? ... ah, are you saying since we are using the same extension.json mechanism as core, the modules will be handled by the DataAccess interface that Parsoid uses?

Anomie (talkcontribs)

Maybe I misunderstood.

It sounds to me like the document is currently saying that all modules added have to be declared up front, and will be added whether or not the <ext> tag actually appears in the wikitext. "When Parsoid processes the <ext> tag" is exactly how I'd think modules should be added. The specific mechanism (DataAccess or something else) doesn't matter to me.

Tgr (WMF) (talkcontribs)

I'd imagine if the extension tag does not appear Parsoid never invokes the hooks for that extension and does not add any modules. But yeah, I can easily imagine cases where the modules need to be selected dynamically (TemplateStyles would have been an example, had head styles been considered more performant than body styles).

Tgr (WMF) (talkcontribs)

Method naming seems pretty confusing. We have toDOM and fromDOM, but parseHTML and toHTML/innerHTML (which return the same thing even though the naming would suggest that the latter drops the top node; also toHTML hardly warns you that the DOM object is going to be corrupted), parseWikitextToDOM / parseExtTagToDOM and serializeHTML / serializeChildren (which are supposed to be the mirror image of each other but follow different naming schemes)...

IMO it would be a lot nicer if you just standardized on somethingToSomething - ExtensionTag::domToWikitext, ExtensionTag::wikitextToDom, ParsoidExtensionAPI::htmlToDom, ParsoidExtensionAPI::domToHtml, ParsoidExtensionAPI::domToHtmlInPlace (or something similar that makes it very clear that this changes the DOM), ParsoidExtensionAPI::wikitextToDom, ParsoidExtensionAPI::extensionTagToDom, ParsoidExtensionAPI::domToWikitext, ParsoidExtensionAPI::domToExtensionTag...

SSastry (WMF) (talkcontribs)

Ya, naming is pretty bad, isn't it? I have started to look at it but your 'somethingToSomething' standardization suggestion is pretty good. One reason we ended up with the 'toDOM', 'fromDOM' naming is because if extensions implement content handlers other than wikitext (ex: JSON), wikitextToDom, domToWikitext doesn't cut it. However, Arlo was asking in a different context (Parsoid.php API for internal use) whether these should be called 'contentToDom', 'domToContent'. Maybe that is an option. Thoughts on that bit?

Tgr (WMF) (talkcontribs)

Yeah, that sounds good too. sourceToDOM might work as well.

SSastry (WMF) (talkcontribs)

Ya, that sounds better.

Can domProcessors generate new DOM that might need processing?

Anomie (talkcontribs)

For example, Cite would certainly need to be collecting all the <ref> "placeholders" from the DOM, injecting ref numbers (and setting hrefs and ids if not whole new nodes) into them, and then producing some new <ol> and <li> nodes (and nodes for the backlinks) to inject into the DOM for the <references> tag. It might even need to generate new DOM for error messages, like "reference Foo was used but never defined". What if some other extension wants to transform all the <ol>, or collect all the anchors in the page, or all error messages, or something? If that extension's processor happens to run before Cite's, it wouldn't find the ones Cite adds.

And it's possible that Cite might want to be even smarter: if there's no <references> tag, there's not much point in doing toDOM on the content of all the <ref>s. Or if multiple <ref>s collide, there's not much point in doing toDOM on both when only one will be used. So it might like to wait on doing the toDOM for each ref's contents until it knows that ref will actually be going into the page output. Is that allowed? Or does it have to do toDOM on all the refs' contents anyway even though some might be thrown away?
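The ordering concern in the first paragraph above can be reproduced with a toy model. This is a Python sketch over xml.etree, not Cite's or Parsoid's code: cite_processor, the "n" attribute, and the "collect" step standing in for another extension are all hypothetical. It only shows that a processor counting <ol> nodes sees a different document depending on whether it runs before or after the one that generates them.

```python
import xml.etree.ElementTree as ET


def cite_processor(root):
    """Number each <ref> placeholder and fill <references> with an <ol>."""
    target = root.find(".//references")
    if target is None:
        return  # No <references> tag: nothing to generate.
    ol = ET.SubElement(target, "ol")
    for number, ref in enumerate(root.iter("ref"), start=1):
        ref.set("n", str(number))
        ET.SubElement(ol, "li").text = ref.text


def run(order):
    """Run the named steps in order; return how many <ol> nodes "collect" saw."""
    root = ET.fromstring(
        "<body><p>A<ref>first</ref></p><p>B<ref>second</ref></p>"
        "<references/></body>"
    )
    seen = []
    steps = {
        "cite": cite_processor,
        # Stand-in for another extension that collects every <ol> in the page.
        "collect": lambda r: seen.append(len(list(r.iter("ol")))),
    }
    for name in order:
        steps[name](root)
    return seen[0]
```

In this model run(["collect", "cite"]) finds no list while run(["cite", "collect"]) finds the one Cite added, which is the gap being asked about.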

Anomie (talkcontribs)

I see the documentation for tags touches on this, by mentioning that Parsoid's implementation of Cite uses "sealFragment" to have the contents of the reference in a map that seems to not be part of the parent document. So if the sealed fragment contains content that some other extension's domProcessor needs to process...? Or, for that matter, the same extension's domProcessor (e.g. nested refs).

SSastry (WMF) (talkcontribs)

Not quite ... I think you are asking about DOM fragments in the map, not just in internal data-mw attributes. I think that is probably a bug / gap in Parsoid right now. Interestingly, we found all these gaps during the porting and were using hybrid testing and had to sniff out all the places HTML was hiding so that we could properly update offsets. But, we didn't get the sealed fragments bit covered in the extension API itself.

SSastry (WMF) (talkcontribs)

The first one is the hook ordering / global-transforms ordering problem that I mentioned in the other topic .. and which we need to resolve separately. Haven't thought about it but need to first understand what the current behavior is. https://github.com/wikimedia/parsoid/blob/90d0f45209175f8313540c15a5be37a658fcc0a1/src/Wt2Html/DOMPostProcessor.php#L254-L312 is a longish comment hinting at one possibility for how to solve this.

As for the second one, we could conceivably support this lazy processing scenario. And, it is possible that Cite can do it today without changes by adding additional smarts. It would, for example, potentially have to deal with `shiftDSROffsets`. But, to be conservative, I will say that we haven't considered this lazy processing scenario carefully, but I think it is doable since the model we are going for here is to be able to take the output of an extension and plop it into the top level document. It shouldn't matter how or when the output was generated as long as suitable DSR offset shifts are handled properly.

Anomie (talkcontribs)

Yes, that long comment is exactly what I was asking about! I'm satisfied to see it's already on the radar.

Tgr (WMF) (talkcontribs)

You can't really do lazy processing if you want to support partial renderings / context-free-ness, can you? If a page with a bunch of refs gets transcluded into another one which has a references tag, that should work without re-rendering the transcluded page.

Anomie (talkcontribs)

Why would that be a problem for lazy processing? You'd just have to make sure that the "map" containing the unprocessed wikitext for each ref came along with the transclusion somehow, so that when the domProcessor runs over the transcluding page's DOM and finds those refs, it can still get their wikitext to process.

And that's assuming the transclusion works by pulling in a serialized DOM rather than processing the transcluded page's wikitext to DOM afresh for the transclusion.
