Requests for comment/URL shortener

Yuvi Panda to fork Extension:ShortURL (moving the GitHub repo to Gerrit) and implement.

Extension:UrlShortener

This is a request for comment about implementing a URL shortener service for use by Wikimedia projects.

Background
A URL shortener is a service that takes a long URL (such as https://en.wikipedia.org/wiki/Article) and maps it to an equivalent URL that needs fewer characters.

There are generally two types of URL shorteners:


 * 1) http://enwp.org/foo and http://youtu.be/foo kind that do direct expansion of the URL; and
 * 2) http://ur1.ca/foo kind that convert a hash or shortened version into a longer form of the URL.

Both of these implementations generally use HTTP 301 server-side redirects.
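
As a rough illustration of the two types, here is a minimal sketch in Python. The lookup table, its single entry, and the function names are made up for this example; they are not taken from any of the services mentioned. Both styles end in the same HTTP 301 response with a Location header.

```python
from typing import Dict, Optional, Tuple
from urllib.parse import quote

# Type 2 (ur1.ca style): a stored table from an opaque key to a full URL.
# The entry below is invented purely for illustration.
LOOKUP: Dict[str, str] = {"abc1": "https://en.wikipedia.org/wiki/Article"}


def expand_direct(path: str) -> str:
    """Type 1 (enwp.org style): the path is the page title itself."""
    return "https://en.wikipedia.org/wiki/" + quote(path, safe="/:")


def expand_lookup(key: str) -> Optional[str]:
    """Type 2: the path is a key that must be looked up; None if unknown."""
    return LOOKUP.get(key)


def redirect_response(target: Optional[str]) -> Tuple[int, Dict[str, str]]:
    """Build (status, headers) for a server-side redirect, or a 404."""
    if target is None:
        return 404, {}
    return 301, {"Location": target}
```

Note that only the second type depends on stored state; the first can be expanded by anyone who knows the rule.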

Traditionally these types of links have only been used on (external) social media services that have arbitrary character limits (such as Twitter). However, the need for their use in other contexts is allegedly growing (more on that below).

It's also important to note that many wikis block URL shorteners as they're a spam vector (this very page can't have links to youtu.be, for example).

Use-cases

 * Links in Echo notifications that are e-mailed or broadcast via XMPP
 * Neither e-mail nor XMPP have arbitrary character limitations, do they? I don't see the use-case for a shortened URL. --MZMcBride (talk) 20:59, 17 November 2012 (UTC)
 * Links to fundraiser landing pages that are posted to social media or sent via email
 * So Twitter and identi.ca? --MZMcBride (talk) 20:59, 17 November 2012 (UTC)
 * No, not just Twitter and identi.ca. All social media (and regular media) targeted by fundraising. Kaldari (talk) 19:33, 26 November 2012 (UTC)
 * URL sharing via the Mobile App
 * For use by the Wikimedia Foundation Communications Department in tweets (linking to a third-party site can be problematic... though there's apparently ur1.ca?)
 * File sharing from Commons
 * What does this mean, exactly? --MZMcBride (talk) 19:29, 18 November 2012 (UTC)
 * Via 'Email a link' and 'Use This File'. Right now it gives you URLs like https://commons.wikimedia.org/wiki/File%3ACircle_of_the_Limbourg_Brothers_-_Medallion_with_the_Emperor_Augustus's_Vision_of_the_Virgin_and_Child_-_Walters_44462_-_Back.jpg. It would be nice to have a Short URL option.
 * I'd expect the Commons community not to use shortened URLs (even if available) in such a case, unless 1) the legals confirm that "attribution by URL" is also ok with any variation of the URL, 2) the WMF ensures that the service will be maintained forever (which brings us again to the costs considerations). --Nemo 10:50, 20 November 2012 (UTC)
 * I don't think short URLs for file usage makes too much sense but for linking it may be helpful, particularly for physical printouts where perhaps QR code (that relies on short URL) would work better than a traditional URL. I cannot imagine re-typing that long URL above on a computer let alone a phone. That 171 character URL would require a QR code version above 4. Looking at the ISO documentation, version 6 with just 7% error correction is the earliest possibility and version 9 with 25% error correction is the first reasonably reliable version. The latter is very complicated and large making it impractical for actual use (such as complying with a license attribution requirement). A url shortening would reinforce mobile wiki/qr code efforts. -- とある白い猫 chi? 13:32, 28 September 2013 (UTC)
 * Gerrit and Bugzilla URLs

Domain
The main thing needed is a short domain name. This would most likely have to be donated to us since short domain names aren't cheap.
 * The Wikimedia Foundation has a pretty big budget these days. If it really wanted a short domain, it could buy one. --MZMcBride (talk) 21:16, 17 November 2012 (UTC)

List of possible domain names:
 * w.org (exists, for sale, but expensive)
 * w.co (available)
 * w.ly (exists, unavailable)
 * wmf.org (exists, unavailable)
 * wmf.co (exists, for sale)
 * wmf.ly (available)
 * wi.ki (exists, maybe for sale?)
 * w.mf (potentially available)
 * *.wiki (available)

Apparently, the owner of the .wiki TLD has offered to let us use the language-code domains at that TLD (e.g. ar.wiki, en.wiki, fr.wiki); however, this requires approval from ICANN(?) since these may be confused with official country-code domains.

HTTPS support/certificates
And then of course there's the question of protocol: HTTP v. HTTPS.
 * How is that a question? The protocol and the domain are not related to each other, and our servers support HTTPS. They will both work. Krinkle (talk) 23:12, 17 November 2012 (UTC)
 * Question was poor wording on my part. I guess I was thinking about how a separate domain requires an additional SSL certificate. And I wasn't sure it was always a given that every Wikimedia service will support both protocols.
 * In this scheme, I guess http and https would redirect to their corresponding expanded forms. Perhaps I should have written "HTTPS support" and left it at that. --MZMcBride (talk) 07:00, 18 November 2012 (UTC)

Protocol
Most URL shorteners look like domain hacks, but domain hacks are arguably just a fad. An alternate approach to domain hacks and hashing would be pushing for the implementation of a new protocol such as wiki://. So you'd have something like:


 * wiki://w/en/Barack_Obama

The part following the protocol could follow our current interwiki syntax.

However, this would be a much longer process (convincing Web browsers and the world to adopt the protocol) and would still run into the issues discussed above with regard to youtu.be and enwp.org-type URL shorteners: namely that page titles can be quite long (up to 255 bytes), so you might not ultimately save many characters.
 * Interesting idea, although it seems like the most work to actually implement. Kaldari (talk) 19:32, 19 November 2012 (UTC)
 * Compatibility with older browsers may make it impractical for the near future but this could be a good long term goal. -- とある白い猫 chi? 13:53, 28 September 2013 (UTC)

Identifier
This section describes how we'd identify the wiki and page from our short domain.

Page and wiki identifier
A short hash should be created from the titles directly. This way the link is most likely to retain its link with the subject of the article (protecting against page deletion, re-creation, redirection, merging and splitting, switching to disambiguation constructs etc.), unlike the pageid.

Unlike the permalinks MediaWiki provides in the sidebar, these wouldn't use the revision id; they would always link to the latest revision of the target page.

The backend identifier would be an integer from our table that maps the numbers to namespace/title pairs. We'd surface this in the URL as a base36-encoded string for additional shortening.

This has been implemented in Extension:ShortUrl.
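
A minimal sketch of the base36 step, in Python. The alphabet (digits followed by lowercase letters) and the function names are assumptions for illustration; Extension:ShortUrl may differ in detail.

```python
import string

# Assumed alphabet: 0-9 followed by a-z, 36 symbols in total.
ALPHABET = string.digits + string.ascii_lowercase


def base36_encode(n: int) -> str:
    """Encode a non-negative integer table ID as a base36 string."""
    if n < 0:
        raise ValueError("IDs must be non-negative")
    if n == 0:
        return "0"
    digits = []
    while n:
        n, remainder = divmod(n, 36)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def base36_decode(s: str) -> int:
    """Decode a base36 string back to the integer table ID."""
    return int(s, 36)
```

Under this alphabet the string "av" decodes to 391, so a two-character path segment already covers 1,296 rows and six characters cover more than two billion.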

Wiki identifier (1): Abbreviated wikifamily
We'd use a subdomain or path to identify the individual wiki from all Wikimedia Foundation projects:


 * site.family.shortdomain/page-hash
 * family.shortdomain/site/page-hash
 * shortdomain/family/site/page-hash

For family we'd use an abbreviation (not the full name "wikipedia") as otherwise it'd be no shorter (e.g., not  , pointing to https://en.wikipedia.org/s/av resolving to https://en.wikipedia.org/wiki/Main_Page).

The site would be the subdomain of the family (if any) such as "en", "nl", "commons", "meta" etc.

The page hash is the base36-encoded numerical ID mapping to the page namespace/title recorded in the ShortUrl extension's mapping table.
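
The three layouts above could be expanded along these lines. This is only a sketch: the short domain short.example, the family abbreviations "w" and "wm", and the function names are placeholders invented for the example, and the input is given without a scheme.

```python
from typing import Tuple

SHORT_DOMAIN = "short.example"  # placeholder, no real domain has been chosen
FAMILIES = {"w": "wikipedia", "wm": "wikimedia"}  # hypothetical abbreviations


def parse_short_url(url: str) -> Tuple[str, str, str]:
    """Return (site, family_abbreviation, page_hash) for any of the 3 layouts."""
    host, _, path = url.partition("/")
    parts = [p for p in path.split("/") if p]
    if host == SHORT_DOMAIN:
        labels = []
    elif host.endswith("." + SHORT_DOMAIN):
        labels = host[: -len(SHORT_DOMAIN) - 1].split(".")
    else:
        raise ValueError("not a short URL: " + url)

    if len(labels) == 2 and len(parts) == 1:
        # site.family.shortdomain/page-hash
        site, family = labels
        page_hash = parts[0]
    elif len(labels) == 1 and len(parts) == 2:
        # family.shortdomain/site/page-hash
        family = labels[0]
        site, page_hash = parts
    elif not labels and len(parts) == 3:
        # shortdomain/family/site/page-hash
        family, site, page_hash = parts
    else:
        raise ValueError("unrecognised layout: " + url)
    return site, family, page_hash


def expand(url: str) -> str:
    """Map a short URL to the per-wiki /s/ URL; unknown families raise KeyError."""
    site, family, page_hash = parse_short_url(url)
    return "https://{}.{}.org/s/{}".format(site, FAMILIES[family], page_hash)
```

For example, all of `en.w.short.example/av`, `w.short.example/en/av` and `short.example/w/en/av` would expand to https://en.wikipedia.org/s/av under these assumptions.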

Wiki identifier (2): Map wiki-id


Sample version 1, 2, 3 QR codes to demonstrate complexity increase.

In the interest of the URL fitting in a QR code, we should abstract the wiki family and wiki site into a hash of up to 3 characters. We should aim for a maximum of 25 characters for the entire URL.

This would allow us to use:
 * QR Version 1 with 7% error correction (25 characters)
 * or; QR version 2 with 25% error correction (29 characters)

The url scheme would look like:


 * shortdomain/PPPCCCCCC e.g. http://wmf.co/001av
 * 7 chars (http://) + X chars (short domain length) + 1 char (/) + 9 chars (PPPCCCCCC) = 17 + X chars in total; with a 25-character limit, X = 25 - 17 = 8 chars left for the short domain.

Where PPPCCCCCC is: PPP, a 3-character base36-encoded integer referring to the mapped project ID (padded if needed), followed by CCCCCC, a base36-encoded integer (1 or more characters) that maps to the ShortUrl extension's table.

Here PPPCCCCCC (9 characters) would be enough to determine the wiki family and site (supporting up to 36^3=46,656 wikis) and the page hash for the page ID (36^6=2,176,782,336 possible IDs).
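
The identifier and the length budget can be checked with a short Python sketch. The function names and the project ID 1 are assumptions for illustration; no project mapping has been agreed on.

```python
import string
from typing import Tuple

ALPHABET = string.digits + string.ascii_lowercase  # assumed base36 alphabet
MAX_URL_LENGTH = 25  # target from the QR version 1 discussion above


def to_base36(n: int) -> str:
    """Encode a non-negative integer in base36."""
    if n < 0:
        raise ValueError("negative ID")
    out = ""
    while True:
        n, remainder = divmod(n, 36)
        out = ALPHABET[remainder] + out
        if n == 0:
            return out


def make_identifier(project_id: int, page_id: int) -> str:
    """Build PPP (zero-padded to 3 chars) followed by C... (1 to 6 chars)."""
    ppp = to_base36(project_id).rjust(3, "0")
    ccc = to_base36(page_id)
    if len(ppp) > 3 or len(ccc) > 6:
        raise ValueError("ID does not fit in PPPCCCCCC")
    return ppp + ccc


def split_identifier(identifier: str) -> Tuple[int, int]:
    """Recover (project_id, page_id) from a PPPC... identifier."""
    return int(identifier[:3], 36), int(identifier[3:], 36)


def fits_budget(domain: str, identifier: str) -> bool:
    """Check http:// + domain + / + identifier against the 25-char target."""
    return len("http://" + domain + "/" + identifier) <= MAX_URL_LENGTH
```

With these definitions `make_identifier(1, 391)` gives "001av", matching the http://wmf.co/001av example, and a full 9-character identifier fits the budget only when the domain is 8 characters or shorter.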


 * Possible alternate mapping suggestion (PPP part): Requests for comment/URL shortener/WikiMap

Obfuscation and mis-use
Using a hash has a cost: they introduce a middle-man dependency. By including a shortened (hashed) URL, you obfuscate where the underlying content is. If the service is unreachable (offline, broken, down) and there's no dictionary to resolve the URL, the content can be lost or irretrievable.

URL shorteners can also be mis-used, such as being included in contexts where there is no legitimate reason to use a shortened URL (such as blog posts or in HTML). Nearly all URLs are clicked or copied and pasted.

Extension
This RFC doesn't require using the ShortUrl extension. Though, depending on the outcome of the RFC, if we end up taking the page hash approach we can use it as a base (saves a bit of development time).

Either way, the following issues have to be dealt with (either in that extension, or in something new):
 * Support special pages (should work fine in Extension:ShortUrl's Title-based system, it just fails to support them right now)
 * Need to find a sensible place in the user interface to promote the short URL.

The current implementation is not deployable anywhere except on wikis desperate for relief from their long percent-encoded URLs, because it clutters the interface in an extremely annoying way and gives users no clue about where to find short URLs or which interface elements it adds.

Full url mapping
Alternatively, instead of using the ShortUrl extension and a deterministic path to that page id, we could set up something like lilurl (or write our own software) that would map an identifier directly to a url.

Pros:
 * Even shorter url (one combined identifier for both wiki and page)
 * Allows shortening for urls with query parameters and fragments (e.g. section in an article, diff urls, special pages query parameters).

Cons:
 * The ids would have to be created on-demand.
 * No predictable path (predictable for software that is, it isn't going to be predictable for humans either way).
 * It has to be requested into the shortener service.


 * More would be needed at least to replace some of the use cases. When WMF has used shorteners before (for example, in testing Twitter, Facebook and other venues as means of getting donations or getting people to contribute to the projects), the analytics features of the shorteners have been a key feature (which is why WMF hasn't stuck to ur1.ca and similar). So having some of those features in a WMF-run service that also comes with the same privacy protection users expect from us would be much better than a barebones shortener.--Sage Ross (WMF) (talk) 00:30, 20 November 2012 (UTC)
 * Then I guess someone should start a section above about such analytics features. The most sensible/viable option would be to use none, but if they really have to be tied with the shortener then they have to be carefully planned. Otherwise, associating this proposal to privacy drama/bikeshedding seems likely to kill it. --Nemo 10:50, 20 November 2012 (UTC)

Analytics and privacy
We'll want to have some analytics capabilities, and we'll need to make sure that they adhere to our privacy protection policies and expectations.

Maintenance
Who's going to maintain this service for the indefinite future? Is the Wikimedia Foundation willing to maintain this service forever? If so, who within the Foundation will be in charge of maintenance?

Note that some of the use cases above would make maintaining the service forever a legal obligation: «[...] an alternative, stable online copy that is freely accessible [...]» (Terms of use).

The Wikimedia Foundation currently has a number of services (such as OTRS) that it has difficulty maintaining. Any additional service has real costs (adding features, fixing bugs, etc.). What are the actual costs here?

Plan

 * Improve Extension:ShortUrl
 * Get a domain name
 * Develop, review and deploy forwarding script
 * Analytics?

Tim's implementation suggestion

 * A MediaWiki extension.
 * Have a special page UI similar to lilurl etc.: ask the user to submit a long URL, get a small URL back
 * Accept only valid input URLs under WMF-controlled domains, to avoid the maintenance overhead which would come from widespread non-WMF use.
 * Also provide an API module, so that JS can fetch and display a small URL for the current page.
 * Host the redirects at a short domain name, to be purchased.
 * Use a rewrite rule to map short URLs to special page requests, for redirection.
 * Implement using a MySQL table with an autoincrement ID. The ID is converted to a larger base for use in the short URL, similar to Extension:ShortURL.
 * Use base 62 (uppercase, lowercase, digits) or higher.
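
The core of this suggestion can be sketched in Python as follows. This is not the proposed extension: an in-memory SQLite database stands in for the MySQL table, the list of allowed domains is a small illustrative subset, and the class, table and function names are invented for the example.

```python
import sqlite3
import string
from typing import Optional
from urllib.parse import urlsplit

BASE62 = string.digits + string.ascii_uppercase + string.ascii_lowercase
# Illustrative subset only; a real deployment would need the full domain list.
ALLOWED_DOMAINS = ("wikipedia.org", "wikimedia.org", "mediawiki.org")


def encode62(n: int) -> str:
    """Convert an autoincrement ID to base 62."""
    out = ""
    while True:
        n, remainder = divmod(n, 62)
        out = BASE62[remainder] + out
        if n == 0:
            return out


def decode62(code: str) -> int:
    """Convert a base 62 code back to an ID (ValueError on bad characters)."""
    n = 0
    for ch in code:
        n = n * 62 + BASE62.index(ch)
    return n


def is_allowed(url: str) -> bool:
    """Accept only http(s) URLs whose host is under an allowed domain."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    return parts.scheme in ("http", "https") and any(
        host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS
    )


class Shortener:
    def __init__(self) -> None:
        self.db = sqlite3.connect(":memory:")  # stand-in for the MySQL table
        self.db.execute(
            "CREATE TABLE short_urls ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, url TEXT UNIQUE NOT NULL)"
        )

    def shorten(self, url: str) -> str:
        """Store the URL if it is new and return its base 62 code."""
        if not is_allowed(url):
            raise ValueError("URL is not under an allowed domain")
        self.db.execute("INSERT OR IGNORE INTO short_urls (url) VALUES (?)", (url,))
        row = self.db.execute(
            "SELECT id FROM short_urls WHERE url = ?", (url,)
        ).fetchone()
        return encode62(row[0])

    def expand(self, code: str) -> Optional[str]:
        """Return the stored URL for a code, or None if it is unknown."""
        row = self.db.execute(
            "SELECT url FROM short_urls WHERE id = ?", (decode62(code),)
        ).fetchone()
        return row[0] if row else None
```

Storing the full URL means query parameters and fragments survive the round trip, and the UNIQUE constraint makes repeated requests for the same URL return the same code.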

The idea is to avoid any conceivable use case for external URL shorteners. External URL shorteners are a privacy and reliability concern, and so we should replace them with something in-house. It may be true that many uses of URL shorteners are inappropriate; it may even be that they are entirely redundant and should be discouraged in the strongest terms. However, discouraging them on this RFC is not going to stop them from being used. It's a small project, there are clear benefits, so we should just do it.