Requests for comment/URL shortener

This is a request for comment about implementing a URL shortener service for use by Wikimedia projects. The current plan is to do what Tim said to do.

HTTPS support/certificates
And then of course there's the question of protocol: HTTP v. HTTPS.
 * How is that a question? The protocol and the domain are not related to each other, and our servers support HTTPS. They will both work. Krinkle (talk) 23:12, 17 November 2012 (UTC)
 * Question was poor wording on my part. I guess I was thinking about how a separate domain requires an additional SSL certificate. And I wasn't sure it was always a given that every Wikimedia service will support both protocols.
 * In this scheme, I guess http and https would redirect to their corresponding expanded forms. Perhaps I should have written "HTTPS support" and left it at that. --MZMcBride (talk) 07:00, 18 November 2012 (UTC)

Identifier
This section describes how we'd identify the wiki and page from our short domain.

Page and wiki identifier
A short hash should be created from the titles directly. This way the short URL is most likely to stay associated with the subject of the article (protecting against page deletion, re-creation, redirection, merging and splitting, switching to disambiguation constructs etc.), unlike one based on the pageid.

Unlike the permalinks MediaWiki provides in the sidebar, these links wouldn't use the revision id; they would always point to the latest revision of the target page.

The backend identifier would be an integer from our table that maps the numbers to namespace/title pairs. We'd surface this in the url as a base36-encoded string for additional shortening.

This has been implemented in Extension:ShortUrl.
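
As a minimal sketch of the base36 step described above (not the extension's actual code; the function names and the lowercase alphabet are assumptions made for illustration), the integer row ID can be converted to and from its short form like this:

```python
import string

# Assumed alphabet: digits followed by lowercase letters (36 symbols).
ALPHABET = string.digits + string.ascii_lowercase


def to_base36(n: int) -> str:
    """Encode a non-negative integer ID as a base36 string."""
    if n < 0:
        raise ValueError("ID must be non-negative")
    if n == 0:
        return "0"
    digits = []
    while n:
        n, rem = divmod(n, 36)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


def from_base36(s: str) -> int:
    """Decode a base36 string back to the integer ID."""
    return int(s, 36)
```

Under this alphabet, the string "av" corresponds to the integer 391 (10 * 36 + 31).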

Wiki identifier (1): Abbreviated wikifamily
We'd use a subdomain or path to identify the individual wiki from all Wikimedia Foundation projects:


 * site.family.shortdomain/page-hash
 * family.shortdomain/site/page-hash
 * shortdomain/family/site/page-hash

For family we'd use an abbreviation (not the full name "wikipedia") as otherwise it'd be no shorter (e.g., not  , pointing to https://en.wikipedia.org/s/av resolving to https://en.wikipedia.org/wiki/Main_Page).

The site would be the subdomain of the family (if any) such as "en", "nl", "commons", "meta" etc.

The page hash is the base36-encoded numerical id mapping to the page namespace/title recorded in the ShortUrl extension's mapping table.
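
To make the three layouts above concrete, here is a small sketch. The family abbreviation "wp", the `FAMILIES` table and both function names are hypothetical; "wmf.co" is only borrowed from the example elsewhere in this RFC and is not a decided domain.

```python
# Hypothetical abbreviation table: family abbreviation -> real domain.
FAMILIES = {"wp": "wikipedia.org"}


def short_forms(family: str, site: str, page_hash: str,
                domain: str = "wmf.co") -> list[str]:
    """Return the three candidate short URL layouts for one page."""
    return [
        f"{site}.{family}.{domain}/{page_hash}",   # site.family.shortdomain/page-hash
        f"{family}.{domain}/{site}/{page_hash}",   # family.shortdomain/site/page-hash
        f"{domain}/{family}/{site}/{page_hash}",   # shortdomain/family/site/page-hash
    ]


def expand(family: str, site: str, page_hash: str) -> str:
    """Build the per-wiki ShortUrl address that the short form points to."""
    return f"https://{site}.{FAMILIES[family]}/s/{page_hash}"
```

All three layouts carry the same three pieces of information and have the same length; they differ only in how much DNS and certificate setup each one needs. Families without a site subdomain would need separate handling, which this sketch leaves out.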

Wiki identifier (2): Map wiki-id


Figure: sample version 1, 2 and 3 QR codes, demonstrating the increase in complexity.

So that the url fits in a QR code, we should abstract the wikifamily and wiki site into a hash of up to 3 characters. We should aim for a maximum of 25 characters for the entire url.

This would allow us to use:
 * QR Version 1 with 7% error correction (25 characters)
 * or QR Version 2 with 25% error correction (29 characters)

The url scheme would look like:


 * shortdomain/PPPCCCCCC e.g. http://wmf.co/001av
 * Length budget: 7 chars (http://) + X chars (short domain length) + 1 char (/) + 9 chars (PPPCCCCCC) = 17 + X. With a maximum of 25 characters, X = 25 - 17 = 8 chars left for the short domain.

Where PPPCCCCCC is: a 3-character base36-encoded integer referring to the mapped project ID (PPP, padded if needed), followed by a base36-encoded integer (CCCCCC, 1 or more characters) that maps to the ShortUrl extension's table.

Here PPPCCCCCC (9 characters) would be enough to determine the wiki family and site (supporting up to 36^3 = 46,656 wikis) and the page hash for the Page ID (36^6 = 2,176,782,336 possible IDs).
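
The scheme above can be sketched as follows. This is only an illustration: the function names are made up, and the project ID 1 used in the note below is an assumed value, since the wiki-to-ID mapping is not defined in this RFC.

```python
import string

ALPHABET = string.digits + string.ascii_lowercase  # assumed base36 alphabet


def to_base36(n: int) -> str:
    """Encode a non-negative integer as base36."""
    if n == 0:
        return "0"
    out = []
    while n:
        n, rem = divmod(n, 36)
        out.append(ALPHABET[rem])
    return "".join(reversed(out))


def build_path(project_id: int, page_id: int) -> str:
    """Join a zero-padded 3-character project part and the page part."""
    if not 0 <= project_id < 36 ** 3:
        raise ValueError("project ID does not fit in 3 base36 characters")
    if page_id < 0:
        raise ValueError("page ID must be non-negative")
    return to_base36(project_id).rjust(3, "0") + to_base36(page_id)


def parse_path(path: str) -> tuple[int, int]:
    """Split a path back into (project_id, page_id)."""
    return int(path[:3], 36), int(path[3:], 36)


# Characters left for the short domain under the 25-character target:
# 25 - len("http://") - len("/") - 9 characters for PPPCCCCCC.
DOMAIN_BUDGET = 25 - len("http://") - 1 - 9
```

With an assumed project ID of 1 and page ID 391, `build_path` gives "001av", which matches the sample url above; `DOMAIN_BUDGET` works out to the 8 characters computed in the list.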


 * Possible alternate mapping suggestion (PPP part): Requests for comment/URL shortener/WikiMap

Obfuscation and mis-use
Using a hash has a cost: it introduces a middle-man dependency. A shortened (hashed) URL obfuscates where the underlying content is. If the service is unreachable (offline, broken, down) and there's no dictionary to resolve the URL, the content can be lost or irretrievable.

URL shorteners can also be mis-used, such as being included in contexts where there is no legitimate reason to use a shortened URL (such as blog posts or in HTML). Nearly all URLs are clicked or copied and pasted.

Extension
This RFC doesn't require using the ShortUrl extension. However, if the outcome of the RFC is the page hash approach, we can use the extension as a base, which saves a bit of development time.

Either way, the following issues have to be dealt with (either in that extension, or in something new):
 * Support special pages (should work fine in Extension:ShortUrl's Title-based system, it just fails to support them right now)
 * Need to find a sensible place in the user interface to promote the short URL.

The current implementation is not deployable anywhere except on wikis that are desperate because of their long percent-encoded URLs: it clutters the interface in an extremely annoying way, and it gives users no clue about where to find short URLs or which interface elements it adds.

Full url mapping
Alternatively, instead of using the ShortUrl extension and a deterministic path to that page id, we could set up something like lilurl (or write our own software) that would map an identifier directly to a url.

Pros:
 * Even shorter url (one combined identifier for both wiki and page)
 * Allows shortening for urls with query parameters and fragments (e.g. section in an article, diff urls, special pages query parameters).

Cons:
 * The ids would have to be created on-demand.
 * No predictable path (predictable for software that is, it isn't going to be predictable for humans either way).
 * Each short url has to be requested from the shortener service.
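
A minimal in-memory sketch of this alternative, assuming a hypothetical `UrlMap` class (this is not lilurl's design, just an illustration of the trade-offs listed above): the full url, including query string and fragment, is stored as-is, and an identifier exists only once someone has asked for it.

```python
import itertools


class UrlMap:
    """Toy full-url mapping: identifiers are created on demand."""

    def __init__(self) -> None:
        self._next_id = itertools.count(1)
        self._by_id: dict[int, str] = {}
        self._by_url: dict[str, int] = {}

    def shorten(self, url: str) -> int:
        """Return the existing ID for url, or allocate a new one."""
        if url not in self._by_url:
            new_id = next(self._next_id)
            self._by_url[url] = new_id
            self._by_id[new_id] = url
        return self._by_url[url]

    def expand(self, ident: int) -> str:
        """Look up the full url; raises KeyError for unknown IDs."""
        return self._by_id[ident]
```

Because the ID depends on the order of requests, software cannot compute a short url for a page without asking the service, which is the "no predictable path" drawback noted above.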


 * More would be needed at least to replace some of the use cases. When WMF has used shorteners before (for example, in testing Twitter, Facebook and other venues as means of getting donations or getting people to contribute to the projects), the analytics features of the shorteners have been a key feature (which is why WMF hasn't stuck to ur1.ca and similar). So having some of those features in a WMF-run service that also comes with the same privacy protection users expect from us would be much better than a barebones shortener.--Sage Ross (WMF) (talk) 00:30, 20 November 2012 (UTC)
 * Then I guess someone should start a section above about such analytics features. The most sensible/viable option would be to use none, but if they really have to be tied with the shortener then they have to be carefully planned. Otherwise, associating this proposal to privacy drama/bikeshedding seems likely to kill it. --Nemo 10:50, 20 November 2012 (UTC)

Analytics and privacy
We'll want to have some analytics capabilities, and we'll need to make sure that they adhere to our privacy protection policies and expectations.

Maintenance
Who's going to maintain this service for the indefinite future? Is the Wikimedia Foundation willing to maintain this service forever? If so, who within the Foundation will be in charge of maintenance?

Note that some of the use cases above would make maintaining the service forever a legal obligation: «[...] an alternative, stable online copy that is freely accessible [...]» (Terms of use).

The Wikimedia Foundation currently has a number of services (such as OTRS) that it has difficulty maintaining. Any additional service has real costs (adding features, fixing bugs, etc.). What are the actual costs here?

Plan

 * Improve Extension:ShortUrl
 * Get a domain name
 * Develop, review and deploy forwarding script
 * Analytics?

Tim's implementation suggestion

 * A MediaWiki extension.
 * Have a special page UI similar to lilurl etc.: ask the user to submit a long URL, get a small URL back
 * Accept only valid input URLs under WMF-controlled domains, to avoid the maintenance overhead which would come from widespread non-WMF use.
 * Also provide an API module, so that JS can fetch and display a small URL for the current page.
 * Host the redirects at a short domain name, to be purchased.
 * Use a rewrite rule to map short URLs to special page requests, for redirection.
 * Implement using a MySQL table with an autoincrement ID. The ID is converted to a larger base for use in the short URL, similar to Extension:ShortUrl.
 * Use base 62 (uppercase, lowercase, digits) or higher.
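
The storage and validation steps in the list above could look roughly like the sketch below. It is written in Python rather than as a MediaWiki extension, uses an in-memory SQLite table as a stand-in for the MySQL table, and every name in it (`short_url`, `ALLOWED_SUFFIXES`, `open_db`, `shorten`) is hypothetical. The domain list is illustrative only, not a complete list of WMF-controlled domains.

```python
import sqlite3
import string
from urllib.parse import urlsplit

# Base 62: digits, uppercase, lowercase (the ordering is an assumption).
BASE62 = string.digits + string.ascii_uppercase + string.ascii_lowercase

# Illustrative, incomplete allowlist of WMF-controlled domains.
ALLOWED_SUFFIXES = ("wikipedia.org", "wikimedia.org", "mediawiki.org")


def to_base62(n: int) -> str:
    """Encode a non-negative integer ID in base 62."""
    if n == 0:
        return BASE62[0]
    out = []
    while n:
        n, rem = divmod(n, 62)
        out.append(BASE62[rem])
    return "".join(reversed(out))


def is_allowed(url: str) -> bool:
    """Accept only http(s) URLs whose host is an allowed domain or subdomain."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    return parts.scheme in ("http", "https") and any(
        host == suffix or host.endswith("." + suffix)
        for suffix in ALLOWED_SUFFIXES
    )


def open_db() -> sqlite3.Connection:
    """Create the stand-in table with an autoincrement ID."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE short_url ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, target TEXT NOT NULL)"
    )
    return db


def shorten(db: sqlite3.Connection, url: str) -> str:
    """Store url and return its base 62 identifier."""
    if not is_allowed(url):
        raise ValueError("only WMF-controlled domains are accepted")
    cur = db.execute("INSERT INTO short_url (target) VALUES (?)", (url,))
    db.commit()
    return to_base62(cur.lastrowid)
```

Matching on the host suffix with a leading dot, rather than on a plain substring, keeps look-alike hosts such as "evilwikipedia.org" out. A real implementation would also want to return the existing ID when the same URL is submitted twice, which this sketch does not do.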

The idea is to avoid any conceivable use case for external URL shorteners. External URL shorteners are a privacy and reliability concern, and so we should replace them with something in-house. It may be true that many uses of URL shorteners are inappropriate; it may even be that they are entirely redundant and should be discouraged in the strongest terms. However, discouraging them on this RFC is not going to stop them from being used. It's a small project, there are clear benefits, so we should just do it.