Requests for comment/URL shortener

This is a request for comment about implementing a URL shortener service for use by Wikimedia projects.

Background
A URL shortener is a service that takes a long URL (such as https://en.wikipedia.org/wiki/Article) and returns an equivalent URL that needs fewer characters.

There are generally two types of URL shorteners:


 * 1) http://enwp.org/foo and http://youtu.be/foo kind that do direct expansion of the URL; and
 * 2) http://ur1.ca/foo kind that convert a hash or shortened version into a longer form of the URL.

Both of these implementations generally use HTTP 301 server-side redirects.
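
As a rough illustration of the difference between the two kinds, here is a minimal Python sketch. The lookup table, the function names, and the assumption that a direct-expansion path maps onto https://en.wikipedia.org/wiki/ are all hypothetical, chosen only to mirror the examples above.

```python
from typing import Optional, Tuple

# Hypothetical lookup table for the second (hash-based) kind of shortener.
LOOKUP = {"foo": "https://en.wikipedia.org/wiki/Article"}


def expand_direct(path: str) -> str:
    """Type 1: the short path is the page title; only the prefix is rewritten."""
    return "https://en.wikipedia.org/wiki/" + path.lstrip("/")


def expand_lookup(path: str) -> Optional[str]:
    """Type 2: the short path is an opaque key resolved through a table."""
    return LOOKUP.get(path.lstrip("/"))


def redirect(target: Optional[str]) -> Tuple[int, dict]:
    """Return an HTTP status and headers: 301 on a hit, 404 otherwise."""
    if target is None:
        return 404, {}
    return 301, {"Location": target}
```

The first kind needs no stored state, while the second depends on the table being available whenever a link is followed.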

Traditionally these types of links have only been used on (external) social media services that have arbitrary character limits (such as Twitter). However, the need for their use in other contexts is allegedly growing (more on that below).

It's also important to note that many wikis block URL shorteners as they're a spam vector (this very page can't have links to youtu.be, for example).

Use-cases

 * Links in Echo notifications that are e-mailed or broadcast via XMPP
 * Neither e-mail nor XMPP have arbitrary character limitations, do they? I don't see the use-case for a shortened URL. --MZMcBride (talk) 20:59, 17 November 2012 (UTC)
 * Links to fundraiser landing pages that are posted to social media or sent via email
 * So Twitter and identi.ca? --MZMcBride (talk) 20:59, 17 November 2012 (UTC)
 * No, not just Twitter and identi.ca. All social media (and regular media) targeted by fundraising. Kaldari (talk) 19:33, 26 November 2012 (UTC)
 * URL sharing via the Mobile App
 * For use by the Wikimedia Foundation Communications Department in tweets (linking to a third-party site can be problematic... though there's apparently ur1.ca?)
 * Not new
 * File sharing from Commons
 * What does this mean, exactly? --MZMcBride (talk) 19:29, 18 November 2012 (UTC)
 * Via 'Email a link' and 'Use This File'. Right now it gives you URLs like https://commons.wikimedia.org/wiki/File%3ACircle_of_the_Limbourg_Brothers_-_Medallion_with_the_Emperor_Augustus's_Vision_of_the_Virgin_and_Child_-_Walters_44462_-_Back.jpg. It would be nice to have a Short URL option.
 * I'd expect the Commons community not to use shortened URLs (even if available) in such a case, unless 1) the legals confirm that "attribution by URL" is also ok with any variation of the URL, 2) the WMF ensures that the service will be maintained forever (which brings us again to the costs considerations). --Nemo 10:50, 20 November 2012 (UTC)
 * Long Gerrit and Bugzilla URLs

Domain
The main thing needed is a short domain name. This would most likely have to be donated to us since short domain names aren't cheap.
 * The Wikimedia Foundation has a pretty big budget these days. If it really wanted a short domain, it could buy one. --MZMcBride (talk) 21:16, 17 November 2012 (UTC)

List of possible domain names:
 * w.org (exists, for sale)
 * w.co (available)
 * w.ly (exists, unavailable)
 * wmf.org (exists, unavailable)
 * wmf.co (exists, for sale)
 * wmf.ly (available)
 * wi.ki (exists, maybe for sale?)
 * w.mf (available)

And then of course there's the question of protocol: HTTP v. HTTPS.
 * How is that a question? The protocol and the domain are not related to each other, and our servers support HTTPS. They will both work. Krinkle (talk) 23:12, 17 November 2012 (UTC)
 * Question was poor wording on my part. I guess I was thinking about how a separate domain requires an additional SSL certificate. And I wasn't sure it was always a given that every Wikimedia service will support both protocols.
 * In this scheme, I guess http and https would redirect to their corresponding expanded forms. Perhaps I should have written "HTTPS support" and left it at that. --MZMcBride (talk) 07:00, 18 November 2012 (UTC)

Protocol
Most URL shorteners look like domain hacks, but domain hacks are arguably just a fad. An alternate approach to domain hacks and hashing would be pushing for the implementation of a new protocol such as wiki://. So you'd have something like:


 * wiki://w/en/Barack_Obama

The part following the protocol could follow our current interwiki syntax.
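
To make the idea concrete, the following Python sketch expands such a URL into an ordinary HTTPS URL. The prefix table and the function name are hypothetical; only the "w" prefix is taken from the example above, and a real implementation would use the full interwiki map.

```python
from urllib.parse import urlsplit

# Hypothetical mapping from interwiki-style family prefixes to domains.
FAMILIES = {"w": "wikipedia.org"}


def expand_wiki_url(url: str) -> str:
    """Expand wiki://<family>/<site>/<title> into a regular HTTPS URL."""
    parts = urlsplit(url)
    if parts.scheme != "wiki":
        raise ValueError("not a wiki:// URL")
    if parts.netloc not in FAMILIES:
        raise ValueError("unknown family prefix: " + parts.netloc)
    # The first path segment is the site; the rest is the page title.
    site, _, title = parts.path.lstrip("/").partition("/")
    if not site or not title:
        raise ValueError("expected wiki://family/site/title")
    return f"https://{site}.{FAMILIES[parts.netloc]}/wiki/{title}"
```

Under these assumptions, wiki://w/en/Barack_Obama expands to https://en.wikipedia.org/wiki/Barack_Obama.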

However, this would be a much longer process (convincing Web browsers and the world to adopt the protocol) and would still run into the issues discussed above with regard to youtu.be and enwp.org-type URL shorteners: namely that page titles can be quite long (up to 255 bytes), so you might not ultimately save many characters.
 * Interesting idea, although it seems like the most work to actually implement. Kaldari (talk) 19:32, 19 November 2012 (UTC)

Identifier
This section describes how we'd identify the wiki and page from our short domain.

Page and wiki identifier
A short hash should be created from the titles directly. This way the link is most likely to retain its link with the subject of the article (protecting against page deletion, re-creation, redirection, merging and splitting, switching to disambiguation constructs etc.), unlike the pageid.

Unlike the permalinks MediaWiki provides in the sidebar, these wouldn't use the revision id; they would always link to the latest revision of the target page.

The backend identifier would be an integer from our table that maps the numbers to namespace/title pairs. We'd surface this in the URL as a base36-encoded string for additional shortening.

This has been implemented in Extension:ShortUrl.
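
The base36 step can be sketched in Python as follows. This is not the extension's own code; the function names are illustrative, and lowercase digits are assumed.

```python
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def to_base36(n: int) -> str:
    """Encode a non-negative integer ID as a lowercase base36 string."""
    if n < 0:
        raise ValueError("ID must be non-negative")
    if n == 0:
        return "0"
    out = []
    while n:
        n, remainder = divmod(n, 36)
        out.append(DIGITS[remainder])
    return "".join(reversed(out))


def from_base36(s: str) -> int:
    """Decode a base36 string back to the integer ID."""
    return int(s, 36)
```

For example, the integer 391 encodes to "av", and decoding "av" gives 391 again.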

Wiki identifier (1): Abbreviated wikifamily
We'd use a subdomain or path to identify the individual wiki from all Wikimedia Foundation projects:


 * site.family.shortdomain/page-hash
 * family.shortdomain/site/page-hash
 * shortdomain/family/site/page-hash

For family we'd use an abbreviation (not the full name "wikipedia") as otherwise it'd be no shorter (e.g., not  , pointing to https://en.wikipedia.org/s/av resolving to https://en.wikipedia.org/wiki/Main_Page).

The site would be the subdomain of the family (if any) such as "en", "nl", "commons", "meta" etc.

The page hash is the base36-encoded numerical id mapping to the page namespace/title recorded in the ShortUrl extension's mapping table.
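
A small Python sketch that builds all three layouts for comparison. The family abbreviation "wp" and the short domain "w.co" used in the note below are placeholders, not decisions.

```python
from typing import List


def short_urls(shortdomain: str, family: str, site: str, page_hash: str) -> List[str]:
    """Build the three candidate layouts for one page."""
    return [
        f"http://{site}.{family}.{shortdomain}/{page_hash}",
        f"http://{family}.{shortdomain}/{site}/{page_hash}",
        f"http://{shortdomain}/{family}/{site}/{page_hash}",
    ]
```

Calling short_urls("w.co", "wp", "en", "av") gives three URLs of identical length, since each layout only reorders the same components and separators. The choice between them is therefore about DNS and routing rather than about saving characters.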

Wiki identifier (2): Map wiki-id
In the interest of it fitting in a QR code we should abstract the wikifamily and wiki site in a hash that should be up to 3 characters. We should aim for 25 characters for the entire url as our maximum.

This would allow us to use:
 * QR version 1 with 7% error correction (25 characters); or
 * QR version 2 with 25% error correction (29 characters)

The URL scheme would look like:
 * shortdomain/PPPCCCCCC e.g. http://wmf.co/001av

Where PPP is a 3-character base36-encoded integer referring to the mapped project ID (padded if needed), followed by CCCCCC, a base36-encoded integer (1 or more characters) that maps to the ShortUrl extension's table.

Here PPPCCCCCC (9 characters) would be enough to determine the wiki family and site (supporting up to 46,656 wikis) and page hash for Page ID (36^6=2,176,782,336 possible IDs).
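
The combined identifier could be encoded and decoded roughly as below. This is a sketch under the assumptions stated above (a 3-character zero-padded project part followed by the page part); the function names and the project numbering are hypothetical.

```python
from typing import Tuple

DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def to_base36(n: int) -> str:
    """Encode a non-negative integer as a lowercase base36 string."""
    if n == 0:
        return "0"
    out = []
    while n:
        n, remainder = divmod(n, 36)
        out.append(DIGITS[remainder])
    return "".join(reversed(out))


def encode(project_id: int, page_id: int) -> str:
    """Join a zero-padded 3-character project part and the page part."""
    if not 0 <= project_id < 36 ** 3:
        raise ValueError("project ID must fit in 3 base36 characters")
    if page_id < 0:
        raise ValueError("page ID must be non-negative")
    return to_base36(project_id).zfill(3) + to_base36(page_id)


def decode(code: str) -> Tuple[int, int]:
    """Split a combined identifier back into (project ID, page ID)."""
    if len(code) < 4:
        raise ValueError("need 3 project characters plus a page part")
    return int(code[:3], 36), int(code[3:], 36)
```

With project 1 and page ID 391 this yields "001av", matching the example URL. Even with all nine characters in use, a URL on a six-character domain such as http://wmf.co/ comes to 23 characters, inside the 25-character target.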

Obfuscation and mis-use
Using a hash has a cost: it introduces a middle-man dependency. By including a shortened (hashed) URL, you obfuscate where the underlying content is. If the service is unreachable (offline, broken, down) and there's no dictionary to resolve the URL, the content can be lost or irretrievable.

URL shorteners can also be mis-used, such as being included in contexts where there is no legitimate reason to use a shortened URL (such as blog posts or in HTML). Nearly all URLs are clicked or copied and pasted.

Extension
The extension currently has some bugs. Most importantly, it has not yet found a sensible place in the user interface to promote the short URL.

The current implementation is deployable only on wikis desperate to escape their long percent-encoded URLs: it clutters the interface in an extremely annoying way, and it gives users no clue about where to find short URLs or which interface elements it adds.

Full url mapping
Alternatively, instead of using the ShortUrl extension and a deterministic path to that page id, we could set up something like lilurl (or write our own software) that would map an identifier directly to a url.

Pros:
 * Even shorter URL (one combined identifier for both wiki and page).
 * Allows shortening of URLs with query parameters and fragments (e.g. a section in an article, diff URLs, special page query parameters).

Cons:
 * The ids would have to be created on demand.
 * No predictable path (predictable for software, that is; it isn't going to be predictable for humans either way).
 * Each URL has to be submitted to the shortener service before a short form exists.
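
A minimal in-memory sketch of such an on-demand mapping, written in Python. It is not lilurl's implementation; the class name, the sequential IDs, and the in-memory storage are assumptions made for illustration, and a real service would need persistent storage.

```python
from typing import Dict, List, Optional

DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def to_base36(n: int) -> str:
    """Encode a non-negative integer as a lowercase base36 string."""
    if n == 0:
        return "0"
    out = []
    while n:
        n, remainder = divmod(n, 36)
        out.append(DIGITS[remainder])
    return "".join(reversed(out))


class Shortener:
    """Maps sequential base36 keys to full URLs, created on demand."""

    def __init__(self) -> None:
        self._urls: List[str] = []        # index is the numeric ID
        self._keys: Dict[str, str] = {}   # full URL -> existing key

    def shorten(self, url: str) -> str:
        """Return the key for a URL, allocating a new ID if needed."""
        if url in self._keys:
            return self._keys[url]
        key = to_base36(len(self._urls))
        self._urls.append(url)
        self._keys[url] = key
        return key

    def expand(self, key: str) -> Optional[str]:
        """Return the full URL for a key, or None if it is unknown."""
        try:
            index = int(key, 36)
        except ValueError:
            return None
        if 0 <= index < len(self._urls):
            return self._urls[index]
        return None
```

Because the whole URL is stored, query parameters and fragments survive the round trip, but nothing can be resolved until someone has called shorten for that URL.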


 * More would be needed at least to replace some of the use cases. When WMF has used shorteners before (for example, in testing Twitter, Facebook and other venues as means of getting donations or getting people to contribute to the projects), the analytics features of the shorteners have been a key feature (which is why WMF hasn't stuck to ur1.ca and similar). So having some of those features in a WMF-run service that also comes with the same privacy protection users expect from us would be much better than a barebones shortener. --Sage Ross (WMF) (talk) 00:30, 20 November 2012 (UTC)
 * Then I guess someone should start a section above about such analytics features. The most sensible/viable option would be to use none, but if they really have to be tied with the shortener then they have to be carefully planned. Otherwise, associating this proposal to privacy drama/bikeshedding seems likely to kill it. --Nemo 10:50, 20 November 2012 (UTC)

Analytics and privacy
We'll want to have some analytics capabilities, and we'll need to make sure that they adhere to our privacy protection policies and expectations.

Maintenance
Who's going to maintain this service for the indefinite future? Is the Wikimedia Foundation willing to maintain this service forever? If so, who within the Foundation will be in charge of maintenance?

Note that some of the use cases above would make maintaining the service forever a legal obligation: «[...] an alternative, stable online copy that is freely accessible [...]» (Terms of use).

The Wikimedia Foundation currently has a number of services (such as OTRS) that it has difficulty maintaining. Any additional service has real costs (adding features, fixing bugs, etc.). What are the actual costs here?

Plan

 * Improve Extension:ShortUrl
 * Get a domain name
 * Develop, review and deploy forwarding script
 * Analytics?