Edit Review Improvements/Architecture discussion 2016-10-21

From mediawiki.org

See User:Sbisson (WMF)/PublicEventStream for background.

Things to talk about. Different approaches to ERI stream:

  • Changeprop rule or set of rules: consume the message for a new revision, make API calls to augment that revision, then publish the augmented message.
    • Main concern with this approach: because you are working after the fact, there is no reliable way to get the information as it was at the time of the event, for example edit count or page protection. Most information is not versioned, e.g. length of revision.
    • Another thing to consider is this is a YAML configuration file inside of changeprop. That doesn't have any support for things that are different from one wiki to the next. When they do the ORES pre-caching, each wiki has different models. The per-wiki config is done in YAML with string matching. Per-wiki config could be cleaner.
    • Code is on GitHub (https://github.com/wikimedia/change-propagation); we would prefer Gerrit. We're not sure how deployments are done. People on the team who are deployers should be able to get access if they don't have it already, though.
  • A twist on this approach is to add to the original event the things we need that can't be retrieved afterwards (user edit count, user registration timestamp, page protection).
    • Some arguments have been made about performance, but we don't agree it's an issue. (User things are in memory, protection is a cheap DB query)
      • MS: If these two things are in-memory, we'll include those. If they're not in memory, we can get it from the API afterwards. In terms of product, we can determine reliable and not reliable, and decide how to handle. It might not be that important to have.
      • MF: We can drop certain things that could be misleading, e.g. page protection, and consumers can do their own API calls.
      • MS: There's a difference between desirable and necessary. The ideal would be to add information to the event.
  • Performance: If there is a performance problem, what is the effect of the performance problem (it doesn't delay page load, what does it delay?)
  • ORES extension: In job queue job, after fetching the score, publish our feed directly to ORES extension.
    • RK: Some of the analytics people have the attitude, Collaboration shouldn't modify shared stuff. They want us to use our infrastructure, but not add stuff to the shared topic.
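A minimal sketch of the "twist" discussed above: values that are cheap to read at save time are embedded in the event, and anything left out is reported so a consumer knows it must fetch it from the API afterwards (where it may have changed since the edit). The field names, the `UserSnapshot` type, and both functions are hypothetical and are not taken from changeprop or EventBus.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical field names for the point-in-time values discussed above.
POINT_IN_TIME_FIELDS = ("user_edit_count", "user_registration", "page_protection")


@dataclass
class UserSnapshot:
    """User values as they were when the edit was saved (assumed shape)."""
    edit_count: int
    registration: str  # ISO 8601 timestamp


def build_revision_event(rev_id: int, wiki: str,
                         user: Optional[UserSnapshot] = None,
                         protection: Optional[str] = None) -> dict:
    """Build the event at save time, embedding only the values we have."""
    event = {"rev_id": rev_id, "wiki": wiki}
    if user is not None:
        event["user_edit_count"] = user.edit_count
        event["user_registration"] = user.registration
    if protection is not None:
        event["page_protection"] = protection
    return event


def missing_fields(event: dict) -> list:
    """Fields a consumer would have to look up after the fact."""
    return [name for name in POINT_IN_TIME_FIELDS if name not in event]
```

With this shape, dropping page protection (as MF suggests) just means not passing it in; `missing_fields` then tells consumers which lookups are theirs to make.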

SB: If this project goes anywhere, we're going to want to keep adding fields.

  • SB: We also want to modify the RecentChanges page and set thresholds; it's been said that these thresholds could be configurable per wiki. We want to include the same information in S:RC and in the feed.
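One way per-wiki thresholds could be expressed so that S:RC and the feed read the same values is a default table with per-wiki overrides. This is only a sketch: the model name, the numbers, and the function names are made up for illustration and do not reflect actual ORES configuration.

```python
# Hypothetical defaults and per-wiki overrides; all values are illustrative.
DEFAULT_THRESHOLDS = {"damaging": 0.7}
PER_WIKI_THRESHOLDS = {
    "enwiki": {"damaging": 0.8},
}


def thresholds_for(wiki: str) -> dict:
    """Merge the defaults with any overrides configured for this wiki."""
    merged = dict(DEFAULT_THRESHOLDS)
    merged.update(PER_WIKI_THRESHOLDS.get(wiki, {}))
    return merged


def is_flagged(score: float, wiki: str, model: str = "damaging") -> bool:
    """True when the score meets or exceeds the wiki's threshold for the model."""
    return score >= thresholds_for(wiki)[model]
```

Both the special page and the stream producer would call `thresholds_for`, so a wiki without overrides falls back to the defaults instead of needing its own string-matched rule.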

MS: What would be the best, and what would be the easiest? SB: Coupling inside the ORES extension. Do we have to have it on the non-ORES wikis?

MS: We could say, our stream has no meaning without ORES and non-ORES new things have to go in RCStream. SB: There's a problem with this fragmentation. MF: Don't want to touch RCStream when it's on the old infrastructure.

A very basic architectural issue is the location of eventlogging-eventbus. You cannot change that schema, because it's user-facing and shared. Nothing enforces any schema on the way out.

SB: Be as relaxed as possible in what you accept, and strict in what you emit.

RK: There are two schemas (see diagram, 1 and 6). MF: The whole idea is there are multiple 6's. RK: Changing 1 runs into opposition about 6.

RK: Strategy 1: Argue that we can add arbitrary things to 1, as long as they're not added to 6.

  • MF: revisionfirehose is 1, revisioncreate is 6.
  • RK: This should be plan A.
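Strategy 1 can be sketched as an allowlist projection: the internal event (schema 1) may carry arbitrary extra fields, while the public event (schema 6) is built only from a fixed set of fields and is rejected if any of them is missing. This is also one reading of SB's "relaxed in what you accept, strict in what you emit". The field names and the `to_public` function are hypothetical.

```python
# Hypothetical public field set; the real schema 6 fields are not specified here.
PUBLIC_FIELDS = ("rev_id", "timestamp", "wiki")


def to_public(internal_event: dict) -> dict:
    """Project an internal (schema 1) event onto the public (schema 6) shape.

    Extra internal fields are accepted and silently dropped; a missing
    public field is an error, so nothing malformed is emitted.
    """
    missing = [name for name in PUBLIC_FIELDS if name not in internal_event]
    if missing:
        raise ValueError("cannot emit public event, missing: " + ", ".join(missing))
    return {name: internal_event[name] for name in PUBLIC_FIELDS}
```

Adding a new field for the ERI stream then only changes the internal event; the public output stays the same until someone deliberately extends `PUBLIC_FIELDS`.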

RK: What is plan B? SB: My preference is ORES extension, since it's shared code with other ORES special page stuff.