Wikimedia Developer Summit/2017/developing-community-norms

SESSION OVERVIEW

   Title: Developing community norms for vital bots and tools


   Day & Time: Tuesday 2017-01-10, 14:30


   Room: Hawthorne


   Phabricator Task Link: https://phabricator.wikimedia.org/T149312


   Facilitator(s): Chase, bd808


   Note-Taker(s):  Andrew Bogott


   Remote Moderator: Madhu


   Advocate: 




SESSION SUMMARY

Purpose: "Don't let your brother-in-law be the only person who keeps your community running"

Background:

Tool Labs is a low-barrier-to-entry environment for running volunteer-maintained tools, bots, research code, and so on. The generic term is 'tool', which refers to anything that runs there.

Tool Labs is bad at enforcing its open source values -- code is hard to discover, and it is not easily shared or reused.

Tools are a vital resource for on-wiki content creation and curation activities. 24.67% of all edits came from Tools and Wikimedia Labs during September-November 2016.

   53% of Wikidata edits and 8.9% of English Wikipedia edits came from Labs IPs.
   3.8 billion API requests originated from Labs and Tools during the same period.


A brief outage of ClueBot drastically increased the time it took to revert bad edits.

There are around 1,500 tool accounts, maintained by developers of widely varying experience levels and written in a wide variety of languages. Most tools have only a single maintainer. Many tools do not use version control. Tools are very different from one another, with few universal rules.

Problems:

A tool for overlaying geolocation points on Google Maps started hitting memory errors; it took a week to debug and was moved to Kubernetes, but still had problems. A Phabricator task was filed to rewrite it, but the original author didn't have time to maintain it, and the code had no license.

Another bot, which was very active on German Wikipedia (it archived two-thirds of talk pages there), was still making edits over HTTP after HTTP was deprecated in favor of HTTPS. The author was reachable but wasn't able to fix the bot without a Java version upgrade on Tool Labs. The Java upgrade turned out to be difficult, due to other dependencies outside of this bot. The issue was revisited several times, but the bot remained stuck on HTTP. As the deadline approached, the author became unreachable (through no fault of their own). Bryan tried to take on fixing the tool himself, but discovered that there was no license file and no source code -- only compiled .jar files were installed. Bryan took more steps trying to salvage the tool but was ultimately thwarted. Shutdown came 283 days after first contact. The bot was replaced: dewiki people created a new one.
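
The fix the bot needed was conceptually small: point its API calls at the HTTPS endpoint instead of HTTP. As a minimal, hypothetical sketch (the bot itself was Java and its source was never published), reading a page over HTTPS via the MediaWiki Action API might look like this in Python:

   import requests

   API_URL = "https://de.wikipedia.org/w/api.php"  # https://, not http://

   def fetch_wikitext(title):
       """Fetch the current wikitext of a page over HTTPS."""
       resp = requests.get(
           API_URL,
           params={
               "action": "query",
               "prop": "revisions",
               "rvprop": "content",
               "rvslots": "main",
               "titles": title,
               "format": "json",
               "formatversion": "2",
           },
           # A descriptive User-Agent with contact info is expected by API policy.
           headers={"User-Agent": "example-archive-bot/0.1 (maintainer contact)"},
           timeout=30,
       )
       resp.raise_for_status()
       page = resp.json()["query"]["pages"][0]
       return page["revisions"][0]["slots"]["main"]["content"]

   if __name__ == "__main__":
       print(fetch_wikitext("Wikipedia Diskussion:Hauptseite")[:200])

An archiving bot would additionally need to log in and obtain an edit token before writing, but the transport change itself is just the URL scheme.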

Salvaging, decompiling, or rewriting abandoned bots and tools is a lot of work and doesn't scale well -- Labs staff doesn't have the resources to do this on a regular basis.

Solutions/Best practices:

  • Pick a license -- make this required in the form for a tools account
  • Publish the code -- at least a tarball, but ideally use version control (see the sketch after this list)
  • Have multiple maintainers
  • Write some documentation
  • Participate in the Community
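
As a hypothetical illustration of the first two points, a minimal Python packaging file can record both the license and the public location of the source. The name, URL, and license below are placeholders, not a required layout:

   # Hypothetical setup.py for a Tool Labs tool; name, URL, and license
   # are placeholders chosen for illustration only.
   from setuptools import setup, find_packages

   setup(
       name="example-tool",
       version="0.1.0",
       description="Example Tool Labs tool with explicit license metadata",
       url="https://example.org/example-tool.git",  # published source (placeholder)
       license="MIT",                               # pick a license
       author="Example Maintainer",
       packages=find_packages(),
       install_requires=["requests"],
   )

Metadata like this only helps if the code is actually pushed somewhere public and a LICENSE file ships with it.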

New policies:


Both involve a volunteer committee, currently in the process of forming, with authority to make tough decisions.

Ideas to encourage other best practices? (Opened to the room)

Brandon suggests: Offer different SLAs to tools that comply with best practices.
Chase responds: We need to define support commitments across Labs generally, but we may not have the staff needed to follow through with anything beyond the base-level, Labs-wide SLA.
(Some discussion about how to expose incentives around tool quality to users of the tools as well as to tool authors and maintainers.)

Stuart suggests: Private code repos (code escrow?) for users who don't want their code public.
Bryan responds: No code really needs to be private; people often don't want to share it because they're ashamed. But we're ALL ashamed.
(Discussion about the difference between choosing an OSI license and actually publishing the code.)

Fako85 asks: I had a tool that went down, but I don't get emails when it's down. I would love some automatic monitoring.
Bryan responds: In theory the 'bigbrother' system automatically restarts jobs. In practice, it sometimes works.
Chase responds: We're terrible about monitoring and need to do something about it. Some volunteers are working on it. We need a defined release process, log aggregation, and ready-made monitoring tools for users. We want it but we don't have it yet.
Brandon suggests: Maybe the tools best practices should include a standard for monitoring.
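
As a rough sketch of what self-serve monitoring could look like today (hypothetical: the health-check URL, the addresses, and the local mail relay are all assumptions, not an existing Tool Labs feature), a tool could poll its own web endpoint from a scheduled job and email its maintainers on failure:

   # Hypothetical watchdog script; URL, addresses, and the local SMTP relay
   # are assumptions for illustration only.
   import smtplib
   from email.message import EmailMessage

   import requests

   TOOL_URL = "https://tools.wmflabs.org/example-tool/healthz"
   MAINTAINERS = ["example-tool.maintainers@example.org"]

   def tool_is_up():
       """Return True if the tool's health endpoint answers with HTTP 200."""
       try:
           return requests.get(TOOL_URL, timeout=10).status_code == 200
       except requests.RequestException:
           return False

   def alert(message):
       """Email the maintainers via a local mail relay."""
       msg = EmailMessage()
       msg["Subject"] = "example-tool appears to be down"
       msg["From"] = "example-tool@example.org"
       msg["To"] = ", ".join(MAINTAINERS)
       msg.set_content(message)
       with smtplib.SMTP("localhost") as smtp:
           smtp.send_message(msg)

   if __name__ == "__main__":
       if not tool_is_up():
           alert("Health check against %s failed." % TOOL_URL)

Run from a cron or grid job every few minutes, this is crude, but it would have caught the silent outage Fako85 described.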


Future plans:

  • Replace grid engine with Kubernetes (see the sketch below)
  • Push-to-deploy PaaS model (where source control is integrated with the platform)
    • OpenShift, Deis
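
As a minimal sketch of what running a tool on Kubernetes could look like (assumptions: the tool is already packaged as a container image, and the upstream Kubernetes Python client is used directly; the actual Tool Labs integration may wrap this differently), creating a one-replica Deployment might be:

   # Hypothetical example using the upstream `kubernetes` Python client;
   # the image name, namespace, and port are placeholders.
   from kubernetes import client, config

   config.load_kube_config()  # or load_incluster_config() inside the cluster

   labels = {"app": "example-tool"}
   deployment = client.V1Deployment(
       metadata=client.V1ObjectMeta(name="example-tool"),
       spec=client.V1DeploymentSpec(
           replicas=1,
           selector=client.V1LabelSelector(match_labels=labels),
           template=client.V1PodTemplateSpec(
               metadata=client.V1ObjectMeta(labels=labels),
               spec=client.V1PodSpec(containers=[
                   client.V1Container(
                       name="example-tool",
                       image="registry.example.org/example-tool:latest",
                       ports=[client.V1ContainerPort(container_port=8000)],
                   ),
               ]),
           ),
       ),
   )

   client.AppsV1Api().create_namespaced_deployment(
       namespace="tool-example", body=deployment
   )

A push-to-deploy PaaS layer (OpenShift, Deis) would hide this boilerplate: pushing to a git remote would build the image and roll it out.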