User:Tgr (WMF)/external login

Draft of a proposal for adding external login to Wikimedia sites.

Wikimedia, uniquely among the top websites, is operated by a charity and funded by donations. This means it can uphold values, such as user privacy, that other sites set aside for business reasons. It also means it has far fewer resources for improving user experience, to the point where poor usability sometimes becomes an obstacle to new editors joining. Often we can improve on that by relying on the good work done by others; using external logins would be such an opportunity.

External login means allowing users to log in or sign up through a simplified procedure using an account they have at some external account provider (e.g. Google). See, for example, Stack Overflow's login page, which offers various login options.

What problems would this solve?

 * Wikimedia's captchas are fundamentally broken: they keep users away but allow robots in. While they can filter out the most stupid spambots, they are easily breakable with off-the-shelf tools (T141490). At the same time, they take significant effort and often multiple tries for a human to solve, and are especially bad for people with visual impairments. Most captcha implementations (including all the open-source ones) do little better, so this is not a problem that could be solved just by spending a small amount of engineering effort. Top sites have user-friendly captchas (such as Google's reCAPTCHA), but those cannot be directly used on Wikimedia sites for various reasons. Since large sites such as Google do a far better job than we do of ensuring that their users are real humans and not spambot farms, we could just rely on the user having an account with them as proof of authenticity, and skip the captcha check.
 * The same goes for IP-based throttling, which regularly causes problems at editathons. If we can confirm that the user is not a spammer or sockpuppet farmer, we could apply far less restrictive limits.
 * Filling in an email address (and then confirming it) slows registration down, which is one of the reasons we don't require it. On the other hand, users without an email address regularly lock themselves out of their accounts. Also, users who try out Wikipedia, leave for a while, and then return often forget their username. Email addresses help somewhat (cf. T30085), but people still tend to have several of those (and sometimes use special-purpose addresses for sorting). Using an external account provider such as Google, with which almost everyone has an account (and which can provide us with a verified email address), would solve this problem.
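
To make the three points above concrete, here is a minimal sketch of how a signup flow might decide which checks to apply. It is an illustration under stated assumptions, not a description of MediaWiki code: `SignupPolicy` and `signup_policy` are hypothetical names, and the `sub`, `email` and `email_verified` keys assume the provider returns OpenID Connect-style claims that have already been validated elsewhere.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class SignupPolicy:
    """Hypothetical summary of the checks applied to one signup attempt."""
    require_captcha: bool
    relaxed_throttle: bool
    verified_email: Optional[str]


def signup_policy(claims: Optional[Mapping[str, object]]) -> SignupPolicy:
    """Decide which anti-abuse checks apply to a signup attempt.

    `claims` is the already-validated identity data from the external
    provider (assumed to use OpenID Connect claim names), or None for a
    traditional local signup.
    """
    # No external identity: keep today's behaviour (captcha, strict throttle).
    if not claims or not claims.get("sub"):
        return SignupPolicy(
            require_captcha=True, relaxed_throttle=False, verified_email=None
        )

    # Only trust the email address if the provider says it verified it.
    email = claims.get("email")
    verified = claims.get("email_verified") is True and isinstance(email, str)

    return SignupPolicy(
        require_captcha=False,
        relaxed_throttle=True,
        verified_email=email if verified else None,
    )
```

For example, `signup_policy(None)` keeps the captcha, while a call with a `sub` and a verified `email` skips it and returns the address so that no separate confirmation mail is needed.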

What problems would it raise?

 * Privacy: the user might not want to divulge that the Wikimedia account and the external account are owned by the same person. Most external accounts that would be appropriate for us are tied to IRL identities, so connecting the two would deanonymize the Wikimedia user account. There are three possible threat models:
    * Wikimedia (or someone with access to internal Wikimedia data) trying to find out who owns a Wikimedia account. This cannot really be avoided: the external account provider will send the user's identity every time they use it to log in. We could use some sort of hashing to avoid storing it, which would make the threat surface minimal (although that would mean not recording email addresses). Given that Wikimedia is a non-profit and highly respectful of user privacy, I think most people would trust it not to track them even if the technical possibility is present.
    * The external identity provider (or someone with access to their internal data) trying to find out what their users' Wikimedia accounts are. The login process does not directly divulge any information to the identity provider; there are two indirect channels which need to be considered: referrers (which can tell what article the user was reading before clicking login) and correlating the timing of the login request with the account creation log. The first is easy to prevent by stashing the URL in the session and then redirecting through some gateway page. The second is trickier (but is problematic whether or not we have external login, see T21161). We would probably have to hide account creation logs (and the username in Special:ListUsers, since a determined attacker could always fall back to polling that) for some random amount of time, or until the user's first edit. Assuming that most users do edit relatively soon after registration or login (within minutes, maybe), on smaller wikis that could still be correlated with the request to the identity provider, so we would have to make sure the domain name is not divulged.
    * An attacker monitoring the user's web traffic, trying to identify their Wikimedia account. (As far as I can see there is no possibility of an attacker determining the external provider account.) While all communication is encrypted, they still learn the domains the user is connecting to, so they can just look for a traffic pattern like Wikimedia -> external account provider -> Wikimedia, and use the log correlation described above. This is the most concerning scenario; however, the same prevention methods could be used here as well. In the end, some level of identifiability is probably unavoidable, with or without external login; an attacker can just use traffic analysis to detect when content is posted and correlate that with MediaWiki activity logs.
 * Favoring some identity providers over others. As long as we only use very large ones, this does not seem problematic (no one is going to create a Google account just so they can log in to Wikipedia, so it's not like we provide them any advantage).
 * Reliance on external services to be able to log in. If an account was created using an external provider, and that service goes down (temporarily or permanently), users will no longer be able to log in to their account. This can be mitigated by only using providers which give us a valid email address, which can be used for password reset.
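
The "some sort of hashing" idea from the first privacy threat model could look roughly like the sketch below. It assumes the provider identifies a user by an issuer URL plus a stable subject identifier, and that a secret key is held on the server; the function names and the choice of HMAC-SHA256 are illustrative assumptions, not a decided design. Only the digest would be stored, so the raw external identity (and, as noted above, the email address) never needs to be recorded.

```python
import hashlib
import hmac


def external_id_digest(secret_key: bytes, issuer: str, subject: str) -> str:
    """Return a keyed digest of an external identity (hypothetical helper).

    The digest, not the identity itself, is what would be stored next to
    the local account.
    """
    # Length-prefix each field so that ("ab", "c") and ("a", "bc")
    # produce different messages.
    parts = (issuer.encode("utf-8"), subject.encode("utf-8"))
    message = b"".join(len(p).to_bytes(4, "big") + p for p in parts)
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()


def digest_matches(
    secret_key: bytes, issuer: str, subject: str, stored_digest: str
) -> bool:
    """Check a login attempt against the stored digest in constant time."""
    candidate = external_id_digest(secret_key, issuer, subject)
    return hmac.compare_digest(candidate, stored_digest)
```

On login, the identity sent by the provider is hashed again and compared with the stored value. Note that this only reduces the threat surface: anyone holding both the key and a candidate external identity can still test whether it matches a given account.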