Riffing off of "Technology is lagging behind social progress" is the idea of a "progress catalyst" (better name ideas?) -- that by reducing some barriers to innovation in the space of quality control tools, ORES opens the doors for new standpoints, new objectivities (operationalizations), and therefore the expression of values that have until now been silenced by the difficulty and complexity of managing a realtime prediction model.
Progress catalyst: Standpoints that haven't been operationalized now can be
One way of thinking about this (and I think there are relationships to the two points above as well) is "what affordances does ORES provide?" As "progress catalyst" ORES affords the leveraging of prediction models to the community.
I'd like to turn this discussion towards the term "conversation" because I have found that it helped explain what I'd hoped to happen when building ORES. I'd like to put forth the idea of a "technological conversation". I see this process as better described by "access" than "affordance". When I say "technological conversation", I imagine the expression of ideas through designed "tools", and that new "tools" will innovate in response to past "tools". (anyone know of any lit comparing innovation markets to a conversation and tracking design/affordance memes between, say, phone apps or something like that?)
Back before ORES, there were affordances that allowed the community to use prediction models, but one needed to engage with a complex Computer Science discipline to do so effectively. The obvious result of this is that only computer scientists built tools that used prediction models to do useful stuff. Their professional vision was enacted, and the visions/values/standpoints of others were excluded because they were not able to participate.
OK. Now looking at this like a conversation... Essentially, the only people who were able to participate at first were the computer scientists who valued efficiency and accuracy -- so they built prediction models that optimized these characteristics of Wikipedia quality control work (cite the massive literature on vandalism detection). We've seen that this has been largely successful -- their values were manifested by the technologies they built. E.g. when ClueBot NG goes down it takes twice as long to revert vandalism (cite Geiger & Halfaker 2013, "When the Levee Breaks"). These technologies have somewhat crystallized and stagnated design-wise -- we have a couple of auto-revert bots and a couple of human-computation systems to clean up what the auto-revert bots can't pick up. (We can see the stagnation in the complete rewrite of Huggle that implemented the same exact interaction design.) Snuggle is a good example of another Computer Scientist trying his hand at moving the technological conversation forward. While full of merits, this was more of a paternal approach of "I'll give you the right tool to fix your problems." While I believe that Snuggle helped push the conversation forward, it didn't open the conversation to non-CS participants.
OK onto the progress catalyst. To me, ORES represents a sort of stepping-back from the problem I want to solve (efficient newcomer socialization and support) and embracing the idea that progress is the goal and that I can't be personally responsible for progress itself. Us CS folk couldn't possibly be expected to bring all of Wikipedians' standpoints to a conversation about what technologies around quality control/newcomer socialization and other social wiki-work should look like. So how do we open up the conversation so that we can expand participation beyond this small set of CS-folk? How about we take out the primary barrier that only the CS-folk had crossed? If we're right, non-machine-learning-CS-folks will start participating in the technological conversation and with them, we'll see values/standpoints that us CS folk never considered.
One thing that makes this really exciting is the risk it entails. You lose control when you open a conversation to new participants. Up until now, I've been a relatively dominant voice re. technologies at the boundaries of Wikipedia. I have a set of things I think are important -- that I'd like to see us do ("Speaking to be heard"). But by opening things up, I enable others to gain power and possibly speak much more loudly than I can. Maybe they'll make newcomer socialization worse! Maybe they'll find newcomer socialization to be boring and they'll innovate in ways that don't help newcomers. That's the risk we take when we "Hear to speech". I'm admitting that I don't know what newcomer socialization & quality control ought to look like and I'm betting that we can figure it out together.
Been away for a while.
In the meantime, I proposed the m:Wikimedia Foundation Scoring Platform team to get some resources for ORES in the next fiscal year (starting July 1st). The process is going well, so I'm hopeful. I also worked with Ladsgroup to do a bunch of stuff for ORES. Staeiou and I also ran a couple of workshops (Building an AI Wishlist [list] and Algorithmic dangers and transparency) at the dev summit. I got a few great conversations in, but I didn't have much time to focus on this paper.
I'm hoping to start a few new threads by CSCW so maybe we could catch up there. Here's a really loose train of thought I'm going to pick up later:
nudges, A/B tests, and "the ineffectiveness of Growth experiments"
As currently written, it sounds like we are setting up a straw man argument here, and I don't yet see what this review of previous interventions motivates. Where are we going with this?
OK so my thoughts are here that "we've tried the obvious solutions" that are dominant in the literature. This is why I point to "nudge". I want to use that as a placeholder for a set of strategies that are dominant in the HCI literature:
- Identify a desired behavioral change
- Redesign interface to "nudge" behavior
- Run an A/B test to see if nudge is effective
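Step three of that pipeline usually boils down to comparing a conversion/retention rate between a control and a nudged group. A minimal sketch of that evaluation, using only the standard library; the retention counts here are made up for illustration:

```python
from math import sqrt, erfc

def two_proportion_ztest(success_a, n_a, success_b, n_b):
    """Two-sided, pooled two-proportion z-test.

    Returns (z, p_value) for H0: the two underlying rates are equal.
    """
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided tail probability
    return z, p_value

# Hypothetical newcomer-retention counts: control vs. nudged cohort
z, p = two_proportion_ztest(120, 2000, 131, 2000)
```

With numbers like these (6.0% vs. 6.55% retention), the test comes back far from significant -- which is exactly the pattern described below for most UI nudges.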
In nearly all cases, we have shown a lack of substantial (and maybe even significant) effect from these UI nudges (see Growth Experiments). Jtmorgan and I know that the Teahouse is a notable exception, but in a lot of ways, this is the point. The Teahouse isn't a nudge. It's a self-sustaining sub-system of Wikipedia built to empower Wikipedians to solve their own problems. It's far less removed than an interface change intended to direct behavior.
Agreed, and I think the Teahouse is a great example of a nudge vs. something else I don't have a word for yet. I don't want to make it a strawman, but I do think that there is something pretty different.
This might be a side-rant, but I have really come to find the "nudge" approach problematic and would be happy to challenge that in this paper. And I certainly got caught up in that mindset years ago, wanting to make a few small changes that would fix everything forever. It is a really appealing approach for a lot of reasons. There are a few nudges that do genuinely work, and those get really high-profile publications and TED talks. But we rarely hear about all the nudges people try that don't work (publication bias, etc.). So there is this powerful belief that we can achieve social progress primarily through small, non-controversial technological changes. It's great when you find a nudge that works, but if your goal is changing long-established patterns of behavior, then a designer/developer should probably expect that 95% of their nudges won't work, rather than the opposite.
I see where you're coming from, but if we're going to critique nudges, we should engage Thaler and Sunstein more directly, and be clearer about how what we're proposing is different from a nudge. Because while Snuggle/Teahouse may not be nudges, both systems employ choice architecture in their design. And I expect that tools built on top of ORES will as well, at least based on some of the (probably outdated by now) ReviewStream concept mocks I've seen.
I like the point that @Jtmorgan is making, while a (proposed language in quotes) 'systems-level intervention' might be the only approach shown to be effective, there are lower/'nudge-level' dimensions that matter _in that systems-level design_
An analogy that's difficult to delineate, but feels intuitive, is ecology. It's common to talk about systems-level change (e.g. the role of a swamp in cleaning nearby water), but have that outcome fail if one of the 'nudge-level' details is left out (e.g. the correct form of bacteria may not be able to live in certain climates, and thus won't clean the water properly).
That styling is bizarre, and I can't figure out how to fix it.
I don't think it's fair to refer to the Teahouse as a designed thing in the way that "choice architecture" imagines designed things. I see the Teahouse as 5% designed things that formed a bedrock and conveyed a specific set of ideas, and 95% what people chose to do with those ideas. The 5% of designed things effect very little direct change, while the 95% -- what the Teahouse hosts have made the Teahouse into -- is critical. The behaviors of Teahouse hosts may have been intended, but they were not really designed. Instead they emerged. By relying on emergence, the founders of the Teahouse were taking a risk and betting on "hearing to speech" -- if we design a space that is explicitly for a certain type of behavior (with the right nudges), then from that we might see a sustainable community turn our designed things into something that fits their view of newcomer socialization and support.
Also, the Teahouse nudges seem to be once-removed. Teahouse designers aren't nudging the newcomer (except maybe with what questions to ask -- not really nudging newcomers to stick around Wikipedia though). They were lightly nudging the Teahouse hosts, maybe. But it seems to me that a more apt metaphor for the Teahouse designers is that of founders. They made fertile ground for the Teahouse to grow, but what the Teahouse became was largely up to the Hosts who would take over running it.
Jtmorgan, does this jibe? I'm not sure my knowledge of the Teahouse's history is complete enough.
@EpochFail I'm going to stop using the word "nudge" for a moment in order to draw attention to a variety of small design choices that we made when we created the Teahouse with the goal of setting particular expectations, communicating particular messages, and suggesting particular courses of action for hosts and guests:
- the Teahouse welcome message - a "nicer message" intended to contrast with other, less nice messages new editors might receive on their talkpages.
- the 5 host expectations - a short list of !rules that communicate the way hosts should interact with guests at the Teahouse
- Ask a question OR create a profile - two equally-weighted calls to action on the Teahouse landing page, communicating that guests are still welcome to participate (by creating a profile) even if they don't currently have a question to ask
- Host profiles - an auto-sorted list of recently active Teahouse hosts who are willing to share a little about themselves, and will help out if contacted directly
- "sign your post with five tildes" prompt - a prompt in the Teahouse Q&A gadget that teaches new editors how to sign their posts on talkpages
- Host badges - a series of badges (basically Teahouse-specific barnstars) related to desirable behaviors that hosts can give to one another, and place on their profiles (pretty popular, for a while)
There are a lot more examples, but this makes my point I think. The point is that the Teahouse is in some ways a very designed thing, and that like most designed things it's full of nudges. Thaler and Sunstein didn't invent nudges, they just developed a theoretical framework to describe the phenomenon. The framework may make it seem like behavioral change is easier than it actually is, but that doesn't mean that small-scale interventions can't work.
I think what you're trying to get at in the criticism of nudges is that you can't expect any old small design tweak to work the way you want it to. You have to take a system/ecological perspective and understand the potential impact of that change in context, and the way people other than you will understand the change. You can't just add a "like" button to any old forum and expect that people will use it like they do on Facebook. So if we want to critique past WMF new editor engagement initiatives or any other unsuccessful design interventions in an honest way, we need to talk about how these specific changes were and were not contextually appropriate.
I agree with you that ultimately in the case of the Teahouse and pretty much any other successful, self-sustaining designed system the end users need to be able to appropriate/reshape/reinterpret the system according to their own needs and desires. But initial conditions often have a big impact on that process, and small design decisions make a big difference. If, when we created the Teahouse, we hadn't made "Welcome everyone" !rule #1, it would probably not be the friendly place it is today.
I think we're talking about two things here. The Teahouse itself is not a nudge, but design decisions (even those that are part of the Teahouse!) can cause nudges. Sure. It seems that maybe this hit a design-matters nerve? I'm certainly not trying to make the argument that design doesn't matter. Instead, I'm trying to make the argument that nudges/minor-design-changes alone are the wrong strategy for addressing a problematic cultural state like the dominant quality control culture in Wikipedia. We've tried many simple "nudges" directed at newcomers with little effect on retention (see the history of the Growth team). I think that addressing a problem like reduced/biased retention requires more than nudges to encourage newcomers to create profiles for themselves or to make copy edits rather than big contributions. It requires a culture shift. I think your "5 host expectations" is a good example of something that is totally not a nudge, but more of a purposeful cultural statement. By making "Welcome everyone" !rule #1, you weren't implementing a nudge at all. You were implementing a cultural norm.
Accountability of algorithms
I want to talk about past work on this and how it works for ORES.
Right now, ORES' primary mechanisms for accountability look a lot like the rest of software around Wikipedia. We have public work boards, public processes, public (machine readable) test statistics, and we publish datasets for reuse and analysis. We encourage and engage in public discussions about where the machine learning models succeed and fail to serve their intended use-cases. Users do not have direct power over the algorithms running in ORES, but they can affect them through the same processes by which other infrastructure in Wikipedia is affected.
This may not sound as desirable as a fully automated accountability dream that allows users more direct control over how ORES operates, but in a way, it may be more desirable. I like to think of the space around ORES in which our users build false positive reports and conversations take place as a massive boundary object through which we're slowly coming to realize what types of control and accountability should be formalized through digital technologies and/or rules & policies.
At the moment, it seems clear that the next major project for ORES will be a means to effectively refute ORES' predictions/scorings. Through the collection of false positive reports and observations about the way that people use them, we see a key opportunity to enable users to challenge ORES' predictions and provide alternative assessments that can be included along with ORES' predictions. That means tool developers who use ORES will find ORES' predictions and any manual assessments in the same query results. This is still future work, but it seems like something we need, and we have already begun investing resources in bringing this together.
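For concreteness, here is roughly what consuming a score from the ORES scoring API (v3) looks like from a tool developer's perspective. The response below is hand-written to match the v3 response shape rather than fetched live, and the revision ID is made up:

```python
import json

# URL template for the ORES v3 scoring API (context = wiki, e.g. "enwiki")
ORES_URL = "https://ores.wikimedia.org/v3/scores/{context}/?models={models}&revids={revids}"

# Illustrative response in the v3 shape: context -> "scores" -> rev id -> model
sample_response = json.loads("""
{
  "enwiki": {
    "scores": {
      "123456": {
        "damaging": {
          "score": {
            "prediction": false,
            "probability": {"false": 0.94, "true": 0.06}
          }
        }
      }
    }
  }
}
""")

def extract_prediction(response, context, rev_id, model):
    """Pull one model's score object out of an ORES v3-style response."""
    return response[context]["scores"][str(rev_id)][model]["score"]

score = extract_prediction(sample_response, "enwiki", 123456, "damaging")
```

The refutation idea above would amount to extending this same score object with a field for manual assessments, so that a single query returns both the model's prediction and any human override.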
Here's some notes that JtsMN posted in the outline.
- Jake's Notes
- accountability thread for future discussion, with an example
- models that stop discriminatory practice against anons may have other effects
- perhaps switching to GradientBoosting from LinearSVC helps anons, and harms e.g. the gender gap
My thinking on this line is mostly as a discussion point. I think the point you make above is reasonable, and for a given Discussion section, I think subsections of "fully automated accountability dream" and "sociotechnical oversight" are both super interesting.
Also, to be clear, I very much agree that "effective refutation" is a super interesting direction for accountability.
Agreed on the sub-sections. JtsMN, how would you define the "fully automated accountability dream"? Here's what I'd do about "sociotechnical oversight".
- Sociotechnical oversight
- Thinking about boundary objects. We don't yet know what types of oversight will be necessary and how people will want to engage in it.
- So we designed open channels and employed wiki pages to let others design their means of reporting false positives and other concerns.
- Public wiki pages, web forum, mailing list, work logs, open workboard for project management and prioritization of work.
- We also worked with local confederates from different communities to help socialize ORES & ORES-related tools as well as to design a process that would work for their community. These confederates helped us with translations and to iterate on solutions with communities who we could otherwise not effectively work with.
- We learned that humans are pretty good at "seeing" into the black box.
- We saw effective oversight occur in some interesting cases (anon bias, Italian "ha", etc.)
- We saw themes emerge in how people want to engage in oversight activities and this has driven the motivation for encoding some of this process in technology -- e.g. a means to review predictions and "effectively refute" them.
- We learned certain strategies to avoid -- e.g. sending everyone to a "central wiki" to report problems and concerns didn't really work for many communities.
I'm not 100% sure what this sort of thing looks like, but I'm gonna brain-dump, and we can go from there. I think this section has to be more speculative, and less anchored in ORES experiences thus far, but I think there are points at which to tie it back.
- Fully Automated Accountability At Scale?
Accountability seems to have three major factors
- Verification that the system isn't biased along certain dimensions (e.g. protected groups)
- Effect sizes
- Ability to raise new dimensions along which bias is 'claimed'/hypothesized to be occurring
As such, the question of automating accountability hinges on these factors
- There are techniques that would allow achieving the first one
- it's apparently common in ML circles to treat models as a 'black box', and seek to predict the output along different (hypothesized to be biased) dimensions
- This is broadly automating the community review process that occurred for anons and false-positives.
- For point 2, at what point is 'a bias that has been shown' /meaningful/?
- There are clearly meaningful examples (anons)
- A rule of thumb used elsewhere is a definition of 'disparate impact'
- could this be operationalized automatically?
- This could also be one dimension in which ORES could also support standpoints
- How does addressing one dimension of bias affect the others (the intersectionality question)?
- e.g. the LinearSVC model is better for anons than GradientBoosting, but may harm gender bias efforts (if anons are almost always male, does enabling better anon participation harm the efficacy of WikiProject Women in Science?)
- The third is more of an open question:
- Should it be community driven?
- Should there be effort to automate recognition of dimensions of bias? How do we distinguish between 'bias against swear words' (statistical signal), and 'bias against anons' (harm from statistical signal), if there is no community involvement?
- While there's clearly a tension between full automation and community participation, the question of legibility and scale is really important - as algorithmic infrastructure is formalized, what will failing the community actually look like?
- A 'fully automated accountability system', like ORES, risks operationalizing the ideologies of the builders.
- It's not clear that full automation can ever be achieved while supporting standpoints and meaningful community oversight
- Lowering the barriers to accountability (e.g. proposing new dimensions of hypothesized bias, etc.) at scale may be a fundamentally sociotechnical problem
- However, automating "due diligence" may be 'good enough automation'. This could mean:
- dimensions of accountability (e.g. protected classes)
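The "disparate impact" rule of thumb mentioned above (the four-fifths rule from US employment practice) is simple enough to operationalize automatically. A sketch with hypothetical audit data, where the "positive" outcome is *not* being flagged by the model:

```python
def disparate_impact_ratio(preds, group):
    """Ratio of positive-outcome rates between groups (four-fifths rule).

    preds: parallel iterable of booleans (True = flagged, e.g. "damaging")
    group: parallel iterable of group labels (e.g. "anon" / "registered")
    The "positive" outcome here is NOT being flagged by the model.
    A return value below 0.8 suggests disparate impact under the rule of thumb.
    """
    rates = {}
    for g in sorted(set(group)):
        members = [p for p, gg in zip(preds, group) if gg == g]
        rates[g] = sum(1 for p in members if not p) / len(members)
    return min(rates.values()) / max(rates.values())

# Hypothetical audit: did the model flag anons disproportionately?
flagged = [True, True, False, True, False, False, False, False]
groups  = ["anon", "anon", "anon", "anon", "reg", "reg", "reg", "reg"]
ratio = disparate_impact_ratio(flagged, groups)
```

Running a check like this across every hypothesized dimension of bias is the automatable part; deciding which dimensions to check (point 3 above) still seems to require the community.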
"Technology is lagging behind social progress"
Good Q. So here's where I usually show screenshots of Huggle's UI changes as a demonstration in the talks I give about this. I argue that Wikipedia's quality control system looks almost identical to how it looked in "The banning of a vandal" paper. In that way, this set of technologies hasn't adapted to take on the new, extended standpoint we have that includes the value of newcomer socialization.
In contrast, consider the Teahouse, inspire campaigns, and studies of newcomer socialization. Consider the WikiProjects that have organized around newcomer socialization and specifically about supporting our most at-risk editors/content. When considering all this, Snuggle was just one interesting (yet mostly insubstantial) note in what we'd expect to be a substantial innovation period.
This observation is the basis for the idea of an innovation catalyst like ORES. "Why isn't anyone experimenting with better ways to do quality control?" My answer is that "it's hard!" The only people who build advanced quality control tools are computer scientists. ORES is positioned to change that -- to reduce barriers and to spur technological progress to match our social progress in thinking about newcomers.
Our role as technologists: Are we just encoding our own ideologies?
I asked this question out of due diligence, but it probably warrants a big discussion. There are probably degrees to which we do and do not encode our own ideologies.
E.g. I think that machine learning is important to quality control. I have somewhat of a techno-centric view of things. I also see value in "efficiency" when it comes to Wikipedia quality control. It's from this standpoint that I saw the technical conversation and the barrier of developing machine learning systems as critical. So, lots of ideology getting encoded there.
On the other hand, by not specifically building user interfaces, we make space -- we "hear to speech" (see http://actsofhope.blogspot.com/2007/08/hearing-to-speech.html). So, maybe we encode our ideologies to an extent, but we do not continue past that extent and instead make space to hear what others want to "say" through their own technological innovation.
I think it is interesting to draw a contrast between this approach and what we see coming out of Facebook/Google/Twitter/etc. and their shrink-wrapped "intelligent" technologies that fully encode a set of values and provide the user with little space to "speak" to their own values.
It's an important question to ask and discuss. A lot of the foundational scholarship in the software studies, politics of algorithms, and values in design literatures involves pointing to systems and saying, "Look! Values! Embedded in design!" Most of those canonical cases are also examples of very problematic values embedded in design. So the literature often comes across as saying that it is a bad thing to encode values in design.
I take the position that it is impossible to not encode values into systems. To say that you aren't encoding values into systems is the biggest ideological dupe of them all (and pretty dangerous). Instead, the more responsible move (IMO) is to explicitly say what your values are, give explanations about why you think they are important and valuable, and discuss how you have encoded them into a system. Then others can evaluate your stated values (which they may or may not agree with) and your implementation of your values (which may or may not be properly implemented).
Even though no traditional GUI user interfaces are built as part of the ORES core project, an API is definitely an interface that has its own affordances and constraints. But I do think it is interesting to draw a parallel to Facebook and maybe Twitter in particular -- Twitter used to be a lot more open about third party clients using their API, and lots of the innovation in Twitter came from users (retweets, hashtags). But they have tightened down the API heavily in recent years, particularly when someone provides a third party tool that they feel goes against what they think the user experience of Twitter should be.
So to wrap this up, I guess there are two levels of values in ORES: 1) the values in an open, auditable API built to let anyone create their own interfaces, and 2) the values encoded in this specific implementation of a classifier for article/edit quality. For example, you could have an open API for quality that uses a single classifier trained only on revert data and doesn't treat anons as a kind of protected class.
I think that there's another angle that I want to concern myself with -- ethically. I think it's far more ethical for me (the powerful Staff(TM) technologist) to try to enable others rather than just use my loud voice to enact my own visions. Staeiou, I wonder what your thoughts are there? It looks like this fits with value (1) and I agree. But I'd go farther than saying I simply value it: there might be some wrongness/rightness involved in choosing how to use power in this case.
Discussion about disabling Flow board
Hey folks, there's a discussion @ Meta:Babel#Flow that, if it passes, would result in this discussion board getting disabled. It's a big mess. Your input would be valuable. I'll answer any questions you have.
A lot better!