User talk:Daniel Kinzler (WMDE)/MCR-PageUpdater

About this board

Using the PageUpdater to make dummy revisions

One comment • 18:16, 11 June 2018 5 years ago

1

Duesentrieb (talkcontribs)

I have hacked up a quick code experiment for consolidating the creation of null (dummy) revisions with the "red path". Seems to work nicely.

https://gerrit.wikimedia.org/r/c/mediawiki/core/+/439632

Reply 18:16, 11 June 2018 5 years ago

Reply to "Using the PageUpdater to make dummy revisions"

Breakdown

7 comments • 14:59, 8 June 2018 5 years ago

7

Anomie (talkcontribs)

saveRevision(RevisionSlotsUpdate) aka doEditContent() to create update a page by creating a new revision

The method would need more than RevisionSlotsUpdate, at least as it was last I looked. User, edit summary, etc. It might make most sense to put that data into the same object, in which case "PageEditData" might make more sense as a name than "RevisionSlotsUpdate".

(or not, in case of a null-edit).

That has a separate entry point.

we may also want a version of saveRevision that directly takes a RevisionRecord, leaving PST to the caller.

That would follow the same path as the "Null rev" dark-green arrow. It could potentially be exactly the same as that path, depending on the exposed interface.

saveDummyRevision() aka null revision: like saveRevision(), but with no (changed) content.

Not really "like saveRevision()", but as noted above might be the same as the version of saveRevision() that takes a RevisionRecord.

renderRevision() takes a (saved?) RevisionRecord and returns output for it. Similar to preview(). Used for plain views, history views, diff views, etc.

I see no reason to require it be saved. Maybe we'll discover one later, but until then there seems to be no point in restricting it.

Either way it might also be used with a RevisionArchiveRecord.

purge() re-generate any derived data. Re-parsing is optional.

If we go by the diagram, you'd need to get a ParserOutput. That might be able to be taken from ParserCache, I suppose, although the common on-wiki use case I'm aware of is usually specifically refreshing the ParserCache data.

check user permissions, tokens, limits, etc

Tokens are a very UI-level thing, often handled by HtmlForm or ApiMain. Permissions and (probably) limits are more directly related and reusable.

construct an unsaved RevisionRecord from RevisionSlotsUpdate, while apply PST and slot inheritance (and making use of cached data). This component should also provide access to rendered output for the full revision (as well as individual slots). Such output should be created on-demand, and should be cached aggressively.

No, providing the rendered output should be a separate component. In the diagram, "PST, make rev" is the first sentence, and "Render Revision" is the second sentence.

This separation is mainly because various entry points need the latter without the former.

write secondary data to the database.

That's not really a "functionality". The functionality is taking the rendering (i.e. the ParserOutputs) to collect the updates/jobs and scheduling them, while the actual writing of the data is the responsibility of each individual update/job.

Reply 17:04, 31 May 2018 5 years ago

Duesentrieb (talkcontribs)

It might make most sense to put that data into the same object, in which case "PageEditData" might make more sense as a name than "RevisionSlotsUpdate".

It's a possibility, but gluing these things together feels bad to me. While I see your point of having an object that basically represents the data sent by the user when performing an edit, I'd still want a value object that just represents the changed slots.

(or not, in case of a null-edit). That has a separate entry point.

Null revisions have their own entry point, null edits don't. And I'd like to handle null revisions in the same way, by introducing EDIT_FORCE_SAVE or something.

[saveRevision that directly takes a RevisionRecord] would follow the same path as the "Null rev" dark-green arrow. It could potentially be exactly the same as that path, depending on the exposed interface.

Yes, that's the idea. And the path for creating a revision based on a RevisionSlotsUpdate would feed into that path., as would saveDummyRevision().

No, providing the rendered output should be a separate component. In the diagram, "PST, make rev" is the first sentence, and "Render Revision" is the second sentence.

The actual rendering will be done elsewhere, via RevisionRenderer. Constructing a RenderedRevision is pretty light weight. It's needed as the same time as creating PST content (for feeding edit filter hooks), it needs the PST content as input, and it can be taken from the stashed edit, along with the PST output.

So, for practical reasons, I'd keep them together for now. Though I do agree with you that conceptually, there is no need to do that. I just see no practical way to disentangle them without looking up the stash twice, of changing how stashed data is fed to the factory thingy.

This separation is mainly because various entry points need the latter without the former.

Entry points that start with a RevisionRecord would just use a RevisionRenderer directly. That doesn't mean the revision factory can't also use a RevisionRenderer.

That's not really a "functionality". The functionality is taking the rendering (i.e. the ParserOutputs) to collect the updates/jobs and scheduling them, while the actual writing of the data is the responsibility of each individual update/job.

Well, the actual' writing of the data is done by the HDD controller on the DB server. The functionality is whatever the interface contract tells the calelr will happen. If it's implemented directly or passed through five layers of abstraction isn't really relevant. Anyway, I have changed the wording a bit.

Reply 16:21, 4 June 2018 5 years ago

Anomie (talkcontribs)

Null revisions have their own entry point, null edits don't. And I'd like to handle null revisions in the same way, by introducing EDIT_FORCE_SAVE or something.

The effect of a null edit is what's labeled in the diagram as the light green Purge path. Although actual processing is more like following Black/Grey or Red and finding out at the "Revision" oval that the new revision's content is identical to the current revision's, so instead of processing that new revision further it submits the current revision via light-green and returns.

Null revisions shouldn't go in via the Black/Grey or Red paths at all. Although once you get to the "Revision" oval on the Black or Red paths the subsequent processing is identical to the Dark Green path. As you noted just after the bit I quoted here. ;)

Constructing a RenderedRevision is pretty light weight.

I suppose that's true if you make the combined ParserOutput lazy-loaded too. In that case we'd need to be sure that gets constructed before getting to the "Save to Stash" step, because the whole point of the stash is to pre-cache the expensive parsing needed to get the combined ParserOutput for the secondary updates.

It's needed as the same time as creating PST content (for feeding edit filter hooks)

It may be needed for some edit filter hooks, sure. Those don't necessarily need to run from the same "module" that creates the PST content though.

The diagram branches, but code flow will be linear. Code flow for the Black and Red paths might be something like → PST → Make RevisionRecord → Make RenderedRevision → Call edit filters → Save revision → Update RenderedRevision → Schedule secondary updates. For the Grey path we'd have → Get RevisionRecord and RenderedRevision from stash → Call edit filters → Save revision → Update RenderedRevision → Schedule secondary updates. For Dark-Green we get the RevisionRecord as input → Make RenderedRevision → Call edit filters → Save revision → Update RenderedRevision → Schedule secondary updates.

Then, too, code flows for the Yellow and Purple (Magenta) paths probably skip the "Call edit filters" bit. Or maybe we have more than one edit filter hook, some that get called always and some that only get called before saving. Currently we have parser hooks for the former, but MCR is making "render" different from "parse".

And don't forget the null edit flow: → PST → Make RevisionRecord → Find out it's the same content as the current revision → Throw away the new revision and send the current revision via the light-green path instead.

Reply 19:00, 6 June 2018 5 years ago

Duesentrieb (talkcontribs)

Null revisions shouldn't go in via the Black/Grey or Red paths at all. Although once you get to the "Revision" oval on the Black or Red paths the subsequent processing is identical to the Dark Green path. As you noted just after the bit I quoted here. ;)

I'd like to make null (dummy) Rev a special case of the red path. What's unclear to me right now is whether the creation of a dummy revision should trigger a re-parse and other updates. In most cases, that is redundant. Unless the content depends on the revision ID.

I suppose [Constructing a RenderedRevision is pretty light weight] is true if you make the combined ParserOutput lazy-loaded too.

That is indeed the idea.

In that case we'd need to be sure that gets constructed before getting to the "Save to Stash" step,

Sure, the stash code would just call getRevisionParserOutput(), which creates it if it's not yet there.

Then, too, code flows for the Yellow and Purple (Magenta) paths probably skip the "Call edit filters" bit.

I suppose we can figure that out when factoring EditControlelr out of EditPage.

Currently we have parser hooks for the former, but MCR is making "render" different from "parse".

Well, "parse" has been misleading sine the introduction of ContentHandler six years ago...

Reply 15:28, 7 June 2018 5 years ago

Anomie (talkcontribs)

I'd like to make null (dummy) Rev a special case of the red path.

Although the Red path begins with data to be passed to EditRevisionFactory, while currently null revisions begin with a RevisionRecord from RevisionStore->newNullRevision().

What's unclear to me right now is whether the creation of a dummy revision should trigger a re-parse and other updates. In most cases, that is redundant. Unless the content depends on the revision ID.

Or the revision timestamp, or the revision user. Possibly other things depending on the reason for the dummy revision, e.g. the title if it was a move or the protection settings (e.g. via {{PROTECTIONLEVEL}}) if it was a protection or unprotection.

I suspect that if making the dummy revision doesn't trigger the re-parse and other updates then most of the calling code will have to do it anyway.

Reply 14:59, 8 June 2018 5 years ago

Tgr (WMF) (talkcontribs)

Purge doesn't actually regenerate most derived data, it just updates the HTML caches (assuming we are talking about action=purge and not e.g. null edits). The current implementation doesn't really use any of the code paths discussed here, it just clears caches and leaves it to the next view to re-render.

Reply 00:19, 1 June 2018 5 years ago

Anomie (talkcontribs)

API action=purge can reparse. Null edits are a bit of a strange case, since they go through the whole edit pathway and effectively convert to a purge in the middle.

Reply 19:34, 1 June 2018 5 years ago

Reply to "Breakdown"

Further refactoring

7 comments • 15:12, 7 June 2018 5 years ago

7

Anomie (talkcontribs)

It might be useful to divide the descriptions of the various components into low-level (with correspondence to the green boxes or blue ovals) and high-level (with correspondence to the various arrows), and/or make more direct reference to which pieces of the diagram each class is handling and which pieces of the diagram are handled by which classes.

From your description of EditController and PageUpdater, it seems like you're thinking that the three arrows currently going from the "Sections, Edit conflicts" box to the "PST, Make rev" box would instead go through the "Data input" blue-oval, with the result that the Red dataflow and the Purple, Black, and Yellow would use the same method to enter the "PST, Make rev" box? If that's what you mean, it sounds reasonable to me. I note only the Black could actually reuse the Red dataflow itself though, since Yellow and Purple don't have the same endpoints.

EditRevisionFactory (better name pending) builds an unsaved RevisionRecord from a RevisionSlotsUpdate, along with a RenderedRevision [...] It tries to take PST content and rendered content from th edit stash.

Why "along with a RenderedRevision"?

I wouldn't put the stash handling in there either. The high-level class for handling the Black/Grey dataflow should check the stash and follow Black or Grey as appropriate, and EditRevisionFactory shouldn't have to know anything about the stash.

I know that's a bit of a break from what WikiPage::doEditContent() does now. IMO that's ok.

DerivedPageDataUpdater takes a (safed) RevisionRecord and a RenderedRevision, and takes care of updating any secondary data and cached artifacts (optionally, recursively). It may re-use or re-render ParserOutput as appropriate.

It's a technicality, but it takes care of collecting the secondary data updates and scheduling them. The updates/jobs themselves actually do the updating.

"re-rendering" should be done by RevisionRenderer, probably via lazy callbacks from the RenderedRevision.

the job of discarding ParserOutput objects that depend on the revision ID may be handled by RenderedRevision. Perhaps it could have a setRevisionId() or updateRevision() method for this.

That seems like a lot of logic to be putting inside RenderedRevision. Why not have the code pass the existing RenderedRevision (for the unsaved revision) and the saved RevisionRecord back to RevisionRenderer, which would return a new RenderedRevision that has state copied from the old RenderedRevision when appropriate? Or I suppose you could have RevisionRenderer mutate the passed-in RevisionRecord if you want.

It feels odd that I'm telling you this instead of the other way around.

RenderedRevision should not know RevisionRenderer directly to avoid circular dependencies. We can make RevisionRenderer implement a ParserOutputCombiner interface to isolate them.

At the level of concrete objects, RenderedRevision::__construct() should be //given// $this when RevisionRenderer creates one. Personally I don't see much point to have a separate partial interface for RevisionRenderer to implement instead of just letting RenderedRevision not call the unneeded methods.

PageStore is a stateless service that can load PageRecords from the page table, and can update the page table when pages are created, updated, or deleted.

I note that "PageRecord" isn't defined anywhere in here.

Reply 17:52, 31 May 2018 5 years ago

Duesentrieb (talkcontribs)

I udpated the text to more directly refer to the diagram and fixed some other things you mentioned.

From your description of EditController and PageUpdater, it seems like you're thinking that the three arrows currently going from the "Sections, Edit conflicts" box to the "PST, Make rev" box would instead go through the "Data input" blue-oval, with the result that the Red dataflow and the Purple, Black, and Yellow would use the same method to enter the "PST, Make rev" box?

Indeed. More precisely, Black (edit) feeds into red (saveRevision), and lime (purge) is mostly the middle part of red (saveRevision). Green (null aka dummy rev) would just be a special case of red.

EditRevisionFactory (better name pending) builds an unsaved RevisionRecord from a RevisionSlotsUpdate, along with a RenderedRevision [...] It tries to take PST content and rendered content from th edit stash.

Why "along with a RenderedRevision"?

I wouldn't put the stash handling in there either.

I'd love to push stash handling to the black/yellow flows resp the "Sections" block. But since PST content and rendered output come from the stash, the stash needs to be checked by the code that otherwise creates those things. Or where you thinking to have a setPreComputedData() method on the EditRevisionFactory object, to be called by the EditControlelr? That's not very pretty, but would work too, I think.

As to why EditRevisionFactory should also create a RenderedRevison: because otherwise, we'd need a separate factory for that, and I don't see the point. EditRevisionFactory does PST, so this is the earliest time we can construct a RenderedRevision. EditRevisionFactory also triggers hooks that potentially need the RenderedRevision. But we could pull this "up" into the controller for the red flow, which is PageUpdater. I'll have to check the code, but I think that should work fine as well.

Why not have the code pass the existing RenderedRevision (for the unsaved revision) and the saved RevisionRecord back to RevisionRenderer, which would return a new RenderedRevision that has state copied from the old RenderedRevision when appropriate? Or I suppose you could have RevisionRenderer mutate the passed-in RevisionRecord if you want.

To me it seems odd to have RevisionRenderer know about that. RevisionRenderer's job is to compose revision output from slot output, and to provide a lazy wrapper around that process. The output-depends-on-revision-id stuff seems to be an arcane detail of the caching logic - and the caching logic is in RenderedRevision. So in the name of encapsulation, any knowledge about when and why to discard the cached data should be in there as well.

At the level of concrete objects, RenderedRevision::__construct() should be //given// $this when RevisionRenderer creates one. Personally I don't see much point to have a separate partial interface for RevisionRenderer to implement instead of just letting RenderedRevision not call the unneeded methods.

Indeed: RevisionRenderer::renderRevision should pass $this to RenderedRevision::__construct(). But since RevisionRenderer knows RenderedRevision, the opposite should not be true. In my experience, circular static binding is just evil. Always. If it doesn't do any harm immediately, it'll burn down your hose when you don't look.

"Micro interfaces" don't do any harm - indeed, "narrow interfaces" are generally considered a good thing, they are easier to mock, replace, and combine. I agree that implementations shouldn't be split into too many parts - knowledge locality is a thing. Splitting the implementations should be avoided if it means spreading knowledge abotu implementation details across multiple classes.

Reply 15:50, 4 June 2018 5 years ago

Anomie (talkcontribs)

I'd love to push stash handling to the black/yellow flows resp the "Sections" block. But since PST content and rendered output come from the stash, the stash needs to be checked by the code that otherwise creates those things. Or where you thinking to have a setPreComputedData() method on the EditRevisionFactory object, to be called by the EditControlelr?

It depends on what exactly is needed to key the stash, and what information exactly is in the stash.

We might key the stash on the token returned by ApiStashEdit, by having the JS that made the API call inject the it into the edit form. We'd validate that the stash entry isn't stale by including in it the page's current revision ID at stash time plus hashes of the data making up the "User input", which we can compare to the now-current revision ID and the data in the current "User input". Then the rest of the data in the stash entry would be whatever state is necessary to recreate the RevisionRecord and the RenderedRevision without having to do the potentially-expensive processing done by EditRevisionFactory.

As to why EditRevisionFactory should also create a RenderedRevison: because otherwise, we'd need a separate factory for that, and I don't see the point.

We already have RevisionRenderer, which is the factory for RenderedRevison.

The output-depends-on-revision-id stuff seems to be an arcane detail of the caching logic - and the caching logic is in RenderedRevision. So in the name of encapsulation, any knowledge about when and why to discard the cached data should be in there as well.

My mental model was that RenderedRevision holds the cache state and is externally seen as immutable (internally the cache gets populated on access, but external users can't see that). When the unsaved revision is saved which creates a new, saved RevisionRecord, RenderedRevision would never even know it. Thus the proposed method on RevisionRenderer to copy state from the old RenderedRevision based on the unsaved RevisionRecord to a new one based on the corresponding saved RevisionRecord.

If you want to make RenderedRevision's revision mutable, then sure, the cache invalidation logic should be in the setNewRevision() method that mutates it. I don't really mind that, but I thought you liked immutable value objects better than mutable ones ;)

Reply 18:28, 6 June 2018 5 years ago

Duesentrieb (talkcontribs)

Re having the creation of the RenderedRevision in the EditRevisionFactory: you are right, conceptually. I have changed the spec. How exactly this will work with EditStash is still a bit unclear to me, but we can figure that out later.

If you want to make RenderedRevision's revision mutable, then sure, the cache invalidation logic should be in the setNewRevision() method that mutates it. I don't really mind that, but I thought you liked immutable value objects better than mutable ones ;)

I do like immutable objects, but I'm not religious about it. Having a way to selectively invalidate the cache when the revision ID becomes known seems the best approach here.

Reply 15:12, 7 June 2018 5 years ago

Tgr (WMF) (talkcontribs)

This looks generally good to me (although I agree with Brad that ParserOutputCombiner seems superfluous; I'm not sure I understand the remark about circular references - is that about circular dependency, or some kind of GC optimization?).

The one thing that's missing that's been discussed in the past is page types (which would influence slot combining at the very least) - would there be some PageTypeRegistry that RevisionRenderer uses internally to pass control to the right PageTypeHandler? Or have we abandoned the idea of page-type-specific composition completely? (Or just out of scope for this?)

Reply 08:45, 1 June 2018 5 years ago

Duesentrieb (talkcontribs)

I'm not worried about circular references at runtime. Just about circular static dependencies.

As to PageTypeHandler: we'll come back to that concept when we thing about refactoring the Article class. And we may end up using it in RevisionRenderer. But I indeed consider that out of scope for this document.

Reply 15:52, 4 June 2018 5 years ago

Anomie (talkcontribs)

I think it's about having RenderedRevision take a RevisionRenderer, and RevisionRenderer having a method that returns a RenderedRevision even though RenderedRevision doesn't use that method. But Daniel might correct me.

I think we agreed to not worry about page types for now. If it turns out we do need different combining logic by page type then it can be added in easily enough as you suggest.

Reply 19:39, 1 June 2018 5 years ago

Reply to "Further refactoring"

There are no older topics