Talk:Requests for comment/Clean up URLs

Wary; some notes
I'm naturally wary of such changes. :) A few notes:


 * There's only "no conflict" between the "robots.txt" file and the "Robots.txt" article due to the historical accident that we by default force the first letter of a page name to be capitalized. Please do not rely on this being true in the future; we may well "fix" that one day, and we'd possibly want to rename those files.
 * If we do some sort of massive URL rearrangement, it could break third-party users of our HTML output (including parsed-by-the-API HTML output). For instance I know this would break handling of article-to-article links in the current Wikipedia mobile apps (they would no longer recognize the URLs as article pages, and would probably load them in an external browser instead). This would at the least require some careful planning and coordination.
 * If we're making a rearrangement of URLs, we'll probably have a fun ..... shift... in search engine rankings etc. It might be disruptive.
 * Regarding the index.php .... the primary problem with simply changing everything over to /Article_URL?with=a&query=string is that our robots.txt would no longer be able to exclude those links from spidering. Using a separate prefix means we can very easily chop off all our standard-generated links-with-querystrings that need to be dynamically generated, and make sure that spiders don't squash our servers into dust.
 * Using action paths (e.g. /edit/Article_name) would provide a nice readable URL without damaging that. However, the existing support doesn't cover things like old revisions ('?oldid=123'), which default to $wgScript.

-- brion (talk) 18:24, 16 September 2013 (UTC)
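
To illustrate the prefix-exclusion point above, here is a minimal sketch using Python's standard-library robots.txt parser. The single `Disallow: /w/` rule and the bot name are assumptions made for illustration, not a copy of the live robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical, simplified robots.txt: one prefix rule for the script path.
ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /w/",
]

def allowed(url: str) -> bool:
    """Return True if a generic crawler may fetch the URL under ROBOTS_LINES."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_LINES)
    return parser.can_fetch("ExampleBot", url)

# Pretty article URL: not under /w/, so it may be crawled.
print(allowed("https://en.wikipedia.org/wiki/Main_Page"))
# Dynamically generated URL: under /w/, so it is excluded by prefix alone.
print(allowed("https://en.wikipedia.org/w/index.php?title=Main_Page&action=history"))
```

The standard-library parser only does prefix matching, which is why a separate prefix is enough here; it would not understand a glob rule.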


 * I personally think that forcing the two articles about robots.txt and favicon.ico to be capitalized is an acceptable trade-off. This does not prevent us from using lower-case titles in general (which we already support).
 * I agree that we'd have to coordinate with third-party users. Some of the users of the PHP parser's HTML are also preparing to use Parsoid output, which uses relative URLs everywhere. One of these users is Google. Since we are in contact with the Google folks and can contact other search engines too we can probably avoid issues with ranking changes.
 * Re robots.txt: At least Google, MSN, Slurp (Yahoo) and Yandex support globbing. I have been using this with success for many years, and sites like Quora do the same. -- Gabriel Wicke (GWicke) (talk) 19:08, 16 September 2013 (UTC)


 * Keep in mind that there may be other conflicts in the future besides just robots.txt, favicon.ico, and internal stuff. For example, RFC 5785 defines a namespace for new "Well Known" URIs, and it's already been picked up by some standards. We never know what kind of new standard we might want to implement in the future. If we ever decide to implement one that has also become notable enough for Wikipedia to write an article about it and add a redirect for it, then there would be a conflict. And unlike robots.txt, that would be a real conflict, because the well-known standard prefix happens to be /.well-known/ and "." is the same in upper and lower case. Not to mention that even if we didn't implement any of those standards, there would still be an undesirable conflict where implementations of that standard try to parse Wikipedia articles because they happen to sit at the exact URL they are expecting and happen to return a 200 OK code. Daniel Friesen (Dantman) (talk) 08:00, 17 September 2013 (UTC)


 * If RFC 5785 gets widely adopted then that would basically avoid future name clashes. Only /.well-known/ including the trailing slash would conflict, and an article about the standard can well be called /.well-known. Let's hope it gets adopted for future sitemaps etc. -- Gabriel Wicke (GWicke) (talk) 17:15, 17 September 2013 (UTC)
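
The capitalization argument in this thread can be sketched as follows. This is a simplified model that assumes title normalization does nothing except upper-case the first character; real MediaWiki title handling does more, and the function names are hypothetical. It also compares bare names only, so it ignores the trailing-slash distinction raised in the last reply.

```python
def normalize_first_letter(title: str) -> str:
    """Upper-case only the first character, as with forced first-letter capitalization."""
    return title[:1].upper() + title[1:]

def root_conflicts(reserved_names):
    """Reserved root names that an article title could still occupy after normalization."""
    return [name for name in reserved_names if normalize_first_letter(name) == name]

RESERVED = ["robots.txt", "favicon.ico", ".well-known"]

# "robots.txt" becomes "Robots.txt" and "favicon.ico" becomes "Favicon.ico",
# so neither collides; ".well-known" starts with a caseless character and is unchanged.
print(root_conflicts(RESERVED))
```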

YES
Cleaning up our URLs this way would make me and so many others who don't understand our uniform resource locators happy. The two current changes outlined within the scope of this RFC are a classic example of an "implementation mental model" problem: when an interface presented to users follows the way it is implemented technically, rather than in a way users naturally expect. Steven Walling (WMF) • talk 00:14, 17 September 2013 (UTC)

Separating page URLs from resources/actions
Another approach to the w/ problem might be to put page names on the root, for example change en.wikipedia.org/wiki/Main_Page to en.wikipedia.org/Main_Page, and all other URLs (.php entry points, images/, actions) on another domain, for example changing en.wikipedia.org/w/api.php to say en.wp-resources.org/api.php or wmf-resources.org/enwp/api.php... - Wonder (talk) 03:02, 17 September 2013 (UTC)


 * Putting pretty urls and other resources on other domains is practically impossible and doesn't fix the issues:
 * Resources such as robots.txt and favicon.ico are universal. Even if you add another domain, root URLs still conflict with these resources.
 * You cannot simply change live URLs like these. Even if you make the new API location wmf-resources.org/enwp/api.php, the URL en.wikipedia.org/w/api.php must still point to the API because there are piles of things still pointing at this URL. So the original issues haven't gone away.
 * You cannot place the API and site on different domains. We use the API within the live site for things like watchlist updates. Moving the API to a different domain will break site features that use the API because of cross-origin restrictions. Even if we implement CORS, 1.83% of WP's traffic comes from browsers that don't implement CORS in any way at all, and 9.65% of it comes from browsers that don't implement CORS in a way we can use.
 * Using two domains will break sessions. Login sessions will be tied to the en.wikipedia.org domain. As a result any resource served to the user from the other domain won't have the session. While this won't be an issue for things like images – unless of course the wiki is private and using img_auth – this is critical for things like the API. Not only will there be CORS issues with the API but the API won't even have the user session so even in a CORS supporting browser things like watchlist toggling will break. And logins done with API action=login will have cookies tied to the wrong domain if they are intended for any sort of AJAX login or intended to present desktop views to the user (I wonder if the mobile site would fit under this category).
 * Daniel Friesen (Dantman) (talk) 04:04, 17 September 2013 (UTC)
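
The session point above can be sketched with a simplified version of cookie domain matching. This is an approximation of the usual rule (exact host or a subdomain of the cookie's domain) and skips details such as IP-address hosts; the function name is hypothetical.

```python
def cookie_domain_matches(request_host: str, cookie_domain: str) -> bool:
    """Simplified check: would a cookie scoped to cookie_domain be sent to request_host?"""
    host = request_host.lower()
    domain = cookie_domain.lower().lstrip(".")
    return host == domain or host.endswith("." + domain)

# A session cookie tied to en.wikipedia.org is sent back to that host...
print(cookie_domain_matches("en.wikipedia.org", "en.wikipedia.org"))
# ...but not to an API moved to one of the separate domains suggested above.
print(cookie_domain_matches("wmf-resources.org", "en.wikipedia.org"))
print(cookie_domain_matches("en.wp-resources.org", "en.wikipedia.org"))
```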

Statement of problem
Hi. I still think this request for comments is missing a clear statement of what problem it is trying to address. As I understand it, the vast majority of requests currently are to example.com/wiki/Foo (using the implicit "view" action). What is the purpose of rewriting URLs of other actions? They get far lower traffic and we generally don't want to (for example) cache history pages (action=history) or edit screens (action=edit). Why would we rewrite URLs to have a uniform prefix? What problems are we seeing right now that would be addressed by this change? --MZMcBride (talk) 00:13, 22 September 2013 (UTC)
 * There are two applicable bugs linked in the RFC, and probably more not mentioned. I support this, as long as we commit to making all the old URLs still work through rewrites.  We do need to think through the possible issues.


 * As far as caching goes, I see that as secondary. Some pages with query strings could probably be cached (e.g. maybe printable=yes), but that should be a separate conversation. Also, I wonder if Varnish or Squid can be made to ignore parameter order (see "multiple query parameter" point in the lede). I don't know enough to answer that question. Superm401 - Talk 11:44, 27 September 2013 (UTC)
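
On the parameter-order question, this does not answer whether Varnish or Squid can do it, but the idea itself can be sketched: sort the query parameters before building the cache key, so that two URLs differing only in parameter order map to the same key. The function name and example URLs are assumptions for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def canonical_cache_key(url: str) -> str:
    """Return the URL with its query parameters sorted, for use as a cache key."""
    parts = urlsplit(url)
    # keep_blank_values so that "?foo=" and "?foo=&bar=1" are not silently altered.
    params = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(params), ""))

a = canonical_cache_key("https://example.org/wiki/Foo?printable=yes&action=view")
b = canonical_cache_key("https://example.org/wiki/Foo?action=view&printable=yes")
print(a == b)
```

Sorting is only safe if the application never depends on parameter order, which would need checking before doing this in a real cache layer.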

robots.txt
robots.txt seems like a part of this that needs careful research. We need to verify that the well-behaved bots (can't do anything about ones that just ignore robots.txt) responsible for most of our traffic obey the glob Disallow mentioned. Also we need to make sure that pages like https://en.wikipedia.org/wiki/Who%3F_%28novel%29 are interpreted correctly. In other words, that robots.txt engines don't decide that's equivalent to https://en.wikipedia.org/wiki/Who?_(novel) and thus blocked. Superm401 - Talk 11:48, 27 September 2013 (UTC)
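
The percent-encoding concern can be made concrete with a small sketch. The matcher below is an assumption about how a glob-aware crawler might behave ('*' matches any run of characters, a trailing '$' anchors the end, otherwise prefix matching); it is not taken from any particular crawler's code. It compares matching against the raw URL with matching against a decoded URL.

```python
import re
from urllib.parse import unquote, urlsplit

def glob_rule_matches(pattern: str, target: str) -> bool:
    """Prefix-match target against a robots.txt-style glob pattern."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and let each '*' match any run of characters.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, target) is not None

def path_and_query(url: str) -> str:
    """Return the part of the URL that robots.txt rules are compared against."""
    parts = urlsplit(url)
    return parts.path + ("?" + parts.query if parts.query else "")

RULE = "/wiki/*?"  # the glob Disallow discussed above
encoded = path_and_query("https://en.wikipedia.org/wiki/Who%3F_%28novel%29")

# Matching on the raw, still-encoded path: the article is not blocked.
print(glob_rule_matches(RULE, encoded))
# A crawler that decodes %3F before matching would treat it as blocked.
print(glob_rule_matches(RULE, unquote(encoded)))
# A real query string is blocked either way.
print(glob_rule_matches(RULE, path_and_query("https://en.wikipedia.org/wiki/Foo?action=edit")))
```

Which of the two behaviours a given crawler follows is exactly what would need to be verified.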

I also have a separate issue with the idea of blindly blacklisting /wiki/*?, which I sent an email about.


 * Daniel Friesen (Dantman) (talk) 12:43, 27 September 2013 (UTC)