Wikimedia Developer Summit/2017/ReviewStream

ReviewStream: improving edit-review tools through a better data feed

Introduction

  • Edit Review Improvements is a project of the Collaboration Team, which is building ways to improve edit review in general and, in particular, to reduce the negative effects current edit-review processes can have on new editors to the wikis.
  • Most edit-review and patrolling tools were designed to safeguard content quality and fend off bad actors—both vitally important missions.
  • But a body of research suggests that these processes can have the unintended consequence of discouraging and even driving away good-faith new editors, particularly when they involve semi-automated tools (Huggle, RTRC, etc.).
  • As a first step to providing a better review process for good-faith newcomers who are making mistakes, ERI is focusing on helping reviewers find users who are a) in good faith, b) newcomers and c) making mistakes.
    • Most notably by productizing ORES, which includes a good-faith test as well as a damaging test.
      • We’ve also added a “Newcomer” test.
    • Two efforts that will launch this quarter:
      • RC Page Improvements: building a whole new filtering interface for the Recent Changes page,
        • which will likely be rolled out to other review pages, like Watchlist.
  • ReviewStream, our subject today: an effort to find vandalism fighters where they live (which is not on the RC page).

ReviewStream

To the information currently in RCStream, ReviewStream adds additional data designed to improve the edit-review process. (We’ll look at that in a minute.)

  • By directly incorporating data that currently has to be looked up in separate processes, ReviewStream is designed to make life easier for creators of downstream edit-review tools and to make their tools faster.
  • At the same time, by making these signals easier to include and understand, we hope it will encourage the inclusion of features that will help new users.
  • Specifically, tools to determine: Intent (good faith), Quality (damage), and Newcomer status (a hypothetical event illustrating these additions is sketched below this list).
  • Example: Pau’s very early-stage Huggle designs.
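
As a rough illustration of the kind of data described above, here is a hypothetical ReviewStream event sketched in Python. The field names and values are assumptions for illustration only; they are not the actual feed specification (the handout "What's in the ReviewStream Feed?" describes the real contents).

  # Hypothetical sketch of a single ReviewStream event (Python dict).
  # Field names are illustrative assumptions, not the real feed schema.
  review_event = {
      # The kind of fields already present in RCStream
      "wiki": "enwiki",
      "title": "Example article",
      "revision": {"old": 123455, "new": 123456},
      "user": "ExampleUser",
      "comment": "fixed a typo",

      # Review-oriented additions discussed in this session (hypothetical)
      "ores": {
          "goodfaith": 0.87,  # probability the edit was made in good faith
          "damaging": 0.12,   # probability the edit is damaging
      },
      "user_experience": "newcomer",  # e.g. brand-new / learner / experienced
  }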

Meeting Goals and Roles

  • In terms of the Dev Summit Session Guidelines, this meeting is for “Problem-solving: surveying many possible solutions”
    • That means we’ve come here to get feedback and ideas.
  • Assign Roles: Note-taker(s), Remote Moderator, Advocate (optional)

Attendees

  • Jmatazzoni
  • Mattflaschen
  • WikiPathways - Third-party wiki for the biomedical community.  Not a lot of vandalism (not many 14-year-old boys working in biomed).  Do have issues with early contributors and handling them well.
  • Andrew Otto - Interested in the stream part specifically.  We’re working on a service for exposing streams of data.  I’ve done a lot of work on internal data streams.
  • Dan Andreescu - Same.
  • Francisco
  • Leon - As a volunteer, counter-vandalism is one of my primary focuses.
  • Danny Horn - Product Manager at WMF, wanted to know what your team is doing.
  • James Hare - Mostly interested in seeing what kind of data is produced out of this, how is it best disseminated.
  • Pau Giner - I’m a designer and working with Collaboration team.  I’m interested in how to help reviewers meet their needs, listen to you all.
  • Matthew Flaschen - Software Engineer on the Collaboration team and a long-time Wikipedia editor.
  • Kaldari - On the CommTech team, interested in all kinds of reviewing tools and various software that can be used to support them.
  • Roan Kattouw - Software engineer, heard we need to do something on other side when on VE.

Meeting Notes

Possible Questions

  • Review the handout, “What’s in the ReviewStream Feed?” — does this list of data inspire ideas for features you’d like to have in, say, an anti-vandalism tool?
    • JM: Need a sense of what edits will probably be helpful.
    • JM: Standardizing categories.  Four levels of ORES categories, in ranges (e.g. probably good).  These will become something used by tools like Huggle, e.g. 3 levels of good faith (probably good to likely bad).  Newcomer test: brand-new, learner, after-learner. (A sketch of this kind of bucketing appears at the end of this question’s notes.)
    • JM: The distinguishing characteristic is recognizing new editors who are in good faith but struggling, or in bad faith but struggling.  Some of these other things are in RCStream, but most are not.
    • JM: Of the data that’s reasonably readily available, what could we get in?  Are there things you would love to have?
    • WikiPathways - Idea that an edit may resolve a tag, e.g. needs sources.  Those could be put together.
    • Roan - It would be useful if some software detected, e.g., an edit that adds categories to a page that still has a no-categories template, and brought this to the attention of a human.
    • Kaldari - One thing that might be useful is a list of which abuse filters, if any, the edit had triggered.
    • Roan - Since the AbuseFilter logic runs before the edit is saved, we could in theory track this info.
    • Roan - If we think of Huggle as a consumer, what bits of data should be in the stream vs. doing an API query.
    • Andrew Otto - For a revert shortly after the edit: e.g. if you have a stream of revision creates and the same tool also consumes a revert stream, it could combine those two in the interface and change the color.
    • Roan - How does that connect to the data that we stream out?
    • Andrew Otto - Is that a way to do it, in the UI later?
    • Roan - It’s in the order of milliseconds.  In the code flow of MW, by the time the tags are known, it’s already been sent out.  Refactoring this to be the other order is not that simple.
    • Matt - Can’t it also be tagged an arbitrary amount of time later?
    • Roan - Yes, in theory, but not that common.
    • Roan - Yes, we do delay for ORES.
    • MusicAnimal - Back to AbuseFilter: one thing is not just which filters are being triggered, but also which have been triggered before.  If they’ve triggered a filter for addition of bad words, knowing the user’s history, you want to take a closer look.
    • Roan - Proposing to keep statistics of which users have triggered AF and how many times.
    • MusicAnimal - Categorization of AbuseFilters.  For instance, the bad faith category.  When these are triggered, it’s almost always accurate.  When it finally gets through, you see it in RS.  You should know that they’ve triggered those in the past.
    • Roan - You’re distinguishing between two different things.  1. History of user triggers. 2. How many times did they get prevented?
    • MusicAnimal - Also, this could be their 5th attempt.  None of the software exposes this information, but it’s in the log.
    • Roan - The filter log does track this.
    • MusicAnimal - Might also take into account time.
    • JM - A little bit of edit history for that user.
    • MusicAnimal - In particular, filter log history for bad-faith users.  An indication they’re trying to get around the edit filter.
    • Kaldari - Wonder if it makes sense as part of this tool, or a higher-level tool.  People might want to customize this.  Might depend on COI vs. sockpuppets, etc.
    • MusicAnimal - If these filters were categorized, you could say that regardless of categories.
    • Pau - Follow-up question: it would be great to know if someone was hit by filters related to vandalism.  Would you prefer this to be done by the system, which just tells you this is vandalism, or do you want this specific information (AF triggers)?
    • MusicAnimal - Both, I guess, but if they’ve been hitting these bad-faith filters, they’ve already surrendered their good faith.
    • JM - ORES only sees edits one by one.
    • PG - If we have two options: 1. integrate it to make a better prediction, or ...
    • MusicAnimal - We still need categorization, bad-faith etc.
    • Roan - Separately from that, if we put this info into machine learning input, would you still be interested in the low-level data?
    • MusicAnimal - Yes
    • Matt - I think it could, particularly this session’s edits.  The history of this edit is relevant to whether this edit is good-faith.
    • JM - You’d want to see if they tripped it recently.
    • MusicAnimal - All good.
    • Alex - Talk page activity
    • James Hare - Be careful of attaching data about the user to the stream.  Seeing if the username is blue or red.  If it’s red it implies a bad edit.  Anon implies it’s bad.  People do this profiling thing.  What I’m concerned about is people or machines reaching decisions about the edit based on past history, not the content of the edit.
    • Alex - I’m thinking of things I do manually.  I’m not sure how I would automate that.
    • MusicAnimal - There’s a whole system of warnings.  It starts from level 1, goes to 4.  That would be good to know.  Huggle parses the talk page to see what level they’re currently at.
    • Andrew - They parse it, presumably.
    • MusicAnimal - They also have their own database.
    • Andrew - This might be too much but this all sounds like a reputation score, which might be useful but also problematic.
    • James - Yes, there are ethics concerns.  History can help, but without precautions, it leads to profiling for no good reason. I’m sympathetic to giving humans access, but not to robots making calls without knowledge.
    • Danny - Robots are just presenting information, humans make decisions.
    • JM - ORES doesn’t look at history.
    • Roan - To clarify, even if ORES did receive these inputs (history, etc.), the way it obtains its ground truth from humans is showing thousands of edits devoid of context.  There is a layer separating human bias.  I think the initial model is whether the user is anonymous or not.
    • Dan - Going to back up James’s point.  I think blue/red implies something psychologically.
    • Andrew - It could be more precise if you have a reputation score with a confidence attached to it.
    • Dan - My point is that if blue/red is sufficient to bias people, some might not take a score properly.
    • Danny - If people make judgement calls based on what they have available, and blue/red does indicate something, this might help to surface more useful info (rather than the dumbest information).
    • James - It’s useful to have more granular data that is a predictor, rather than whether they have a user page.
    • MusicAnimal - I personally don’t think we need to worry, just like it’s on you to source it and prove it’s factually accurate.  If you’re just being trigger-happy, we’re going to revoke whatever rights we gave you to do that.  A bot hopefully wouldn’t get that far.
    • Andrew - If this hypothetical reputation score was smart enough, it could take into account history.  Take into account it’s not some random new person.
    • Dan - There are video games with secret reputation scores.  They don’t bias against each other.  They use the reputation score to match people against each other.
    • James - It would be very interesting to have this secret data on Wikipedia.
    • Dan - It’s game performance in these cases.
    • Pau - With Huggle, trusted users appear in blue.  Not sure how; it would be interesting to see how that works.  Another perspective is to try to present this as evaluating not the users, but the contributions they make.  Being aware users can evolve.
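
A minimal sketch, in Python, of the kind of bucketing JM describes earlier in this question’s notes (ORES scores grouped into coarse levels, plus a newcomer test). The thresholds and labels are assumptions for illustration only; the actual ranges are not specified in these notes.

  def goodfaith_level(p_goodfaith):
      """Map an ORES good-faith probability to a coarse level.

      Thresholds are illustrative only.
      """
      if p_goodfaith >= 0.75:
          return "probably good"
      if p_goodfaith >= 0.40:
          return "uncertain"
      return "likely bad"

  def newcomer_level(edit_count, account_age_days):
      """Rough newcomer test: brand-new, learner, or experienced.

      Cut-offs are illustrative only.
      """
      if edit_count < 10 and account_age_days < 4:
          return "brand-new"
      if edit_count < 500:
          return "learner"
      return "experienced"
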
  • Which edit review and anti-vandalism tools should we prioritize as candidates for switching to ReviewStream and incorporating the new tools?
      • Joe - Let me try a different question.  I’ve been trying to get data on which tools are widely used/high impact.  Not that simple.
      • ORES only supports a few languages; we should focus on wikis with AI support.
    • Huggle
      • MusicAnimal - Huggle is the most important and has a good model.  It’s based on English, which constantly has vandalism.  Spoke to Amir; they can see the last few minutes on Hebrew, less than one minute on English.  Not sure if this was an adaptation of the current RC feed.  You’re going to need a live stream to keep up.
      • Roan: More or less a live stream, behind by a second or two.
      • MusicAnimal: Huggle is behind too.
      • Andrew: That was a question I was going to ask.  The more latency you can tolerate, the cooler the things you can add.  A second, a minute.
      • Roan: First stage, no reason to make it slower than ORES (it’s parallel).  We could also pull again after a few seconds.
      • Dan: If we have acceptable latency, there can still be issues like reverts.
      • MusicAnimal - Oddly enough, Huggle 2 used to take into account conflicts, but no longer does.  It does have page history.  When you look at page history and see a problem there, you need to protect it.  What often happens is students will mass-vandalize the page on their school.  You would rather protect the page.  If you have that right there in the UI, you can act.
      • Roan - Protection status at the time of the edit is one thing we’re considering including.  A revert stream could be a thing we could do.  This would also be useful in the revision stream.  Don’t know the state of the art on this.  Don’t know how expensive revert detection is.  Not clear on undo-then-modify.  For rollback it’s a clear-cut case.  If you filter revision creates you would have a revert stream.  We do notify for undo currently.  For multiple-edit undos we don’t do anything. (A sketch of consuming a revision stream alongside a revert stream appears at the end of this question’s notes.)
      • Matt: You don’t really want protection status at the time of the edit, but at the time of review.
      • Roan: There is a use case for protection status at time of edit.  You can check current status with API queries as well.
      • Joe: What is most important after Huggle?
        • A: MusicAnimal - STiki
      • RTRC
    • STiki
      • MusicAnimal - STiki.  This goes off ClueBot’s scoring, so it can look at borderline cases.  ClueBot won’t revert the same user twice on the same page.  It will show up in STiki.  Edits that happened hours ago will show up in STiki.  Let’s assume you’re at the keyboard in Huggle; you didn’t revert the one that happened an hour ago.  It’s quite good.
      • Roan - Does STiki get these benefits from ClueBot?
      • MusicAnimal - Yes
      • Roan - But close to reverting.
      • MusicAnimal - When ClueBot goes down, STiki does too.  I’m guessing it doesn’t drink from the firehose.
      • MusicAnimal - STiki has its own database (?). Andrew West would be a wonderful person to talk to about this.  If you go to the STiki homepage, his info is there.  He’s studied the data.  People are marking some edits as good.  You can do analysis and theoretically build something like ORES from that data.  That’s why he’s retaining that data.
    • LiveRC (popular on non-English wikis)
    • What else?
      • Alex - We’re focusing almost exclusively on bad actors.  I’m interested in good users too.  What about showing things to people before they click save?
      • JM: A lot of people bring that up.  To clarify, we’re talking about vandalism here because we’re focused on the feed.  Yes, we’ve thought about that.  Were you in the earlier session?
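
As referenced above, here is a minimal Python sketch of the idea of consuming a revision-create stream alongside a revert stream and joining them client-side by revision ID (the approach Andrew Otto and Roan discuss). The event sources are plain iterables standing in for a real stream client, and all field names are assumptions, not an actual feed schema.

  def build_review_queue(revision_events, revert_events):
      """Join a revision stream with a revert stream, flagging reverted edits."""
      queue = {}
      for event in revision_events:
          queue[event["rev_id"]] = {"event": event, "reverted": False}
      for revert in revert_events:
          for rev_id in revert.get("reverted_rev_ids", []):
              if rev_id in queue:
                  # e.g. a tool like Huggle could change the row's color here
                  queue[rev_id]["reverted"] = True
      return queue

  # Example with made-up events:
  revisions = [{"rev_id": 101, "title": "Example"}, {"rev_id": 102, "title": "Example"}]
  reverts = [{"rev_id": 103, "reverted_rev_ids": [102]}]
  print(build_review_queue(revisions, reverts))
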
  • What is the best way to approach and then support/work with the communities that support these tools? (E.g., let them handle it all, provide design assistance, do the work for them...)
  • How might edit reviewers—and anti-vandalism fighters in particular—use the ability to detect newcomers who are in good faith but struggling? Will they care at all?
      • JM: One last question: these are not, by and large, WMF tools; they are written by the community.  I’m not sure what this engagement should look like.  1. Will the community care about this (anti-vandalism fighters particularly)?  I think they will care; they are aware of the research.  They don’t think “not my problem,” and given they also get other benefits, in addition to protecting users, do people have ideas about the best way to work with communities?  Is it okay for WMF to just step in?
      • Danny Horn - This happens to us several times a year.  How do you approach these people if you want to work on a tool?
      • JM: This is a project based on research, but not necessarily community demand.
    • Matt: There is community demand to protect new users, but not necessarily from Huggle folks.
    • MusicAnimal - This would be really unique in combining the power of ORES and the functionality of Huggle.  Especially if we also surface things more suggestive about good edits or the likelihood of a good edit.  People will be enthusiastic.  At least initially it’s going to be really difficult to pull people away from Huggle.
    • JM: We’re not going to pull them away, we’re going to add something.
    • MA: I do think it’s something new and valuable.
    • Dan Andreescu (DA): In terms of stepping in, you end up owning stuff you touched.  One of our prime directives is to do what the community can’t do.  Tools like this, and focusing on them, increase what the community can do.  If that’s the focus, people going off to get real jobs becomes less of a problem.
    • JM: That absolutely was the idea of ReviewStream.
    • DA: Talking to these people to see their ideal scenario.
    • Ryan Kaldari (RK): A lot of times tools that aren’t actively maintained, the maintainers are glad to have people work on them.
    • DA: We launched the pageview API with a buggy client on purpose, saying we’re not going to fix the bugs.  Then someone fixed the bugs.  That seemed like a good strategy.
    • RK: If you can put out a prototype for how to use this stream, and let people steal the code, I think that’s the most useful thing we can do.
    • MusicAnimal - That’s how my work started.  I saw Marcel earlier and again thanked him.
    • AO: I’ve never used Huggle.  There’s not a lot of community demand to protect new users?  Isn’t that what Huggle’s about?
    • [That’s Snuggle].
    • JM: Snuggle is more of an editor-reviewing tool.  It asks you to classify editors.  It sounds like a friendlier Huggle but it’s not.
    • AO: Maybe show “this is an edit, maybe they need a hug”.
    • JM: Something we may use is the summary of an editor’s history.
    • AO: I’ve never heard or thought of a rep system for users on Wikipedia, but is this something people have talked about?
    • RK: From my perspective, the only times I’ve heard this come up is when people were asking us not to implement their worst fears of what that could be.  People were concerned about the harassment angle.  I’ve only heard people stigmatize any attempt at doing things related to reputation; haven’t heard people speak up in favor of it.  I’m sure there are ways to reduce that impact.
    • DA: Aaron designed WikiCredit, it’s based on content survival.
    • MF: That could also be subtle vandalism.
    • DA: It’s not public by default.
    • JM: Aaron said ORES could rank users, but we’ve chosen not to do that.  I think that’s probably the right decision, people are very wary of robots making these decisions.
    • JM: Really appreciate you coming by, please stay in touch.