MediaWiki r81074 - Code Review

Jump to: navigation, search
Repository:MediaWiki
Revision:r81073‎ | r81074 (on ViewVC)‎ | r81075 >
Date:00:13, 27 January 2011
Author:happy-melon
Status:deferred (Comments)
Tags:
Comment:
(bug 235) parser function for conversion of units of measurement.

[[Template:Convert]] on enwiki is a behemoth of a construction that just about manages to do this sort of conversion, taking {{convert|5|mi|km}} and outputting "5 miles (8 km)", etc. To port this to another wiki requires copying over three and a half thousand subtemplates. The additional load produced by including numerous copies of this template is measurable on large pages on enwiki, and it eats voraciously into the template limits.

This revision introduces {{#convert: 5 mi | km }}, outputting "8 km" or thereabouts. See http://www.mediawiki.org/wiki/User:Happy-melon/Convert for more details, or look at the examples in the parser tests.

In a very rough profile, comparing 50 calls to {{convert}} verses the same 50 calls to the wrapper template shown at the link above, the parser function implementation reduces page load time by 72%, preprocessor node count by 83%, post-expand include size by 86% and template argument size by 97%. More detailed profiling would probably reveal places where extra caching could improve performance further.

The primary reason for putting it in ParserFunctions instead of its own extension is availability: PFs are already available across the cluster, and it's accepted as an essential extension for any wiki wishing to emulate or mirror WMF content. One less separate extension installed on the cluster is one less extension which has to be matched by reusers.

It's still missing a lot of units, which I ran out of patience to copy from {{convert}}; I thought I'd get some feedback on the infrastructure first.
Modified paths:

Diff [purge]

Loading diff…

Sign-offs

UserFlagDate
Aaron Schulzinspected20:44, 15 August 2011

Follow-up revisions

Rev.Commit summaryAuthorDate
r81085Temporarly disabled due to r81074. This revision adds 217 (!) new messages an...raymond08:41, 27 January 2011
r81095Follow-up r81074 CR:...happy-melon17:34, 27 January 2011
r81189Follow-up r81074:...happy-melon13:54, 29 January 2011
r81190Follow-up r81074: succumb, though it breaks my heart to do so (:P), to Americ...happy-melon15:45, 29 January 2011
r81191Follow-up r81074: better handling of whitespace, or lack thereof, between the...happy-melon15:59, 29 January 2011
r81517Follow-up r81074: turns out {{convert}} is (mostly) case-sensitive after all,...happy-melon14:28, 4 February 2011
r82230Follow-up r81074, r81517: implement SI prefixes as a separate structure. You...happy-melon10:27, 16 February 2011
r95525Follow-up r81074: fix typos in some unit names and messages ("mili" instead o...happy-melon21:22, 25 August 2011
r96499Follow-up r81074:...happy-melon21:06, 7 September 2011
r96502Follow up r81074: implement a language map per Tim's suggestion (no more Amer...happy-melon21:28, 7 September 2011
r96505FU 96499, r81074: fix parser tests to reflect the change in default dialect, ...happy-melon21:39, 7 September 2011

Comments

#Comment by Nikerabbit (talk | contribs)   08:43, 27 January 2011

Few comments:

  • The file should be in utf-8 encoding
  • Why doesn't getConvertParser use autoloader?
  • Extra tabs in ParserFunctions.i18n.magic.php
  • getLocalisedName() is never used, but if it will, returning the object returned by wfMessage would allow more flexible use.

This only works in PHP 5.2 and newer:

 +	if( $title instanceof Title ){
 +		$msgText = "[[$title|$msgText]]";

Maybe that and the few lines around it could be refactored into method?

#Comment by ^demon (talk | contribs)   14:13, 27 January 2011

I thought we discussed trunk can feel free to break 5.1 compat?

#Comment by Nikerabbit (talk | contribs)   14:23, 27 January 2011

I thought there was supposed to be a version that required 5.1 or newer before that.

#Comment by ^demon (talk | contribs)   14:26, 27 January 2011

Last I remember discussing, we said 1.17 should still retain 5.1 compat, but trunk was fine for breaking.

This doesn't have to be merged to the 1.17 branch.

#Comment by Happy-melon (talk | contribs)   17:10, 27 January 2011

I'm firmly in the camp that we can break back-compat in this fashion, but there's no particular reason why it needs to; especially in an extension which might well be upgraded independently of the core software. I wouldn't bat an eyelid about doing this in core trunk, but since it's so easy to avoid here, I'll do so.

#Comment by Happy-melon (talk | contribs)   17:34, 27 January 2011
The file should be in utf-8 encoding

/me hits PhpStorm with a hammer until it coughs up UTF8 encoding...

Why doesn't getConvertParser use autoloader?

That function was cloned directly from getExprParser() above; I've implemented autoloader for both of them in r81095.

Extra tabs in ParserFunctions.i18n.magic.php

They were deliberate to separate the 'subwords' out, but meh... :D

getLocalisedName() is never used, but if it will, returning the object returned by wfMessage would allow more flexible use.

It was supposed to be used in the "dimension mismatch" error; now implemented.

#Comment by Siebrand (talk | contribs)   12:27, 27 January 2011

Unit names are in enGB and should be in enUS, because our default localisation is in American English. Example: metre -> meter.

#Comment by Happy-melon (talk | contribs)   15:46, 29 January 2011

Done, with great bitterness (:P) in r81190.

#Comment by Siebrand (talk | contribs)   21:29, 29 January 2011

We could of course dump enGB and create enUS...

#Comment by Happy-melon (talk | contribs)   23:08, 29 January 2011

I'd call "en" 'English (international)', which would tend to mean that the spelling which is used in every English-speaking country apart from one, would be the one considered "international"...? :P

#Comment by Raymond (talk | contribs)   16:23, 27 January 2011

I have set the new messages to "ignore" on translatewiki to avoid wasted time of our translators in case of a revert. Any chance to get an OK from a lead developer in the near future?

#Comment by Happy-melon (talk | contribs)   17:08, 27 January 2011

Eventually... :P ~~~~

#Comment by Bryan (talk | contribs)   10:48, 28 January 2011

You really shouldn't hardcode the SI prefixes. The SI prefixes are consistent, so you can easily deduce that kPa means 10^3 Pa. Similarly, a square inch is 0.0254^2 meters, so you only need to hardcode the inch to meter conversion and all other can be derived from that.

Furthermore all exceptions should derive from MwException.

#Comment by Happy-melon (talk | contribs)   16:42, 28 January 2011

But a kn is not a kilonewton, it's a knot (actually this is itself a redefinition; the official abbreviation for the knot is 'kt', which conflicts with kiloton); a pc is not a picocoulomb, it's a parsec; and yd is neither a yoctoday nor a yoctodebye. All of those could be handled by forcing case-sensitivity, but that's a significant deviation from the current behaviour of {{convert}}. Then there's the question of what base units should take prefixes, and which ones: a kilomile makes sense, but a nanomile should actually be a nautical mile; should we accept nanotons of TNT or nanolightyears?

If this only handled SI units, there would be no problem, as the SI is carefully thought through to avoid such conflicts; the huge number of non-SI units in play, for which no such thought has been given, makes it much more dangerous to employ automated extraction of scales and powers like this. An acre, for instance, is not the square of one imperial unit, but 1 chain * 1 furlong. In general, in the process of making it as object-oriented as possible I tried to make it a bit more intelligent than just pure number-crunching, but I'm not sure if this would be too intelligent for its own good.

The ConvertError code was cloned directly from ExprError, so if one needs fixing, so does the other.

#Comment by Bryan (talk | contribs)   17:54, 28 January 2011

Fair enough, I would say that case sensitivity is not an unreasonable requirement for a unit, but I can see the argument of {{convert}} compatibility.

(I have nevery heard of kn as abbreviation for knot btw, as far as I'm aware it is always kt, but I'm aviation biased.

#Comment by Happy-melon (talk | contribs)   22:11, 28 January 2011

Indeed; kt is the approved abbreviation; the convert template instead used kn precisely because it chose to ignore case sensitivity...

#Comment by The ed17 (talk | contribs)   05:12, 1 February 2011

As a ships person on enwiki, I'll be ecstatic when this is implemented, because we tend to convert a lot of units. Even if you have to input "kn" to get knots, can it output "kt" on pages so we have the correct abbreviation?

#Comment by Happy-melon (talk | contribs)   13:55, 3 February 2011

Actually, it seems {{convert}} is (mostly) case-sensitive after all, so this might be possible, with some care.

#Comment by Tim Starling (talk | contribs)   04:00, 6 September 2011

The reason I made ExprError derive from Exception instead of MWException is because ExprErrors are never allowed to propagate to the calling code, they are always caught and converted to an error message. So they are private to the extension and do not need the MediaWiki-specific formatting code in MWException.

#Comment by Duplicatebug (talk | contribs)   14:22, 29 January 2011

return "$string $unit";: hardcoded whitespace is bad in some languages. Use MediaWiki:Word-separator. Or move the number into the message, and allow a second message (suffix -link) with wikilink inside the message (Lego is not nice...), but than you need also a message for link and abbr...

#Comment by Happy-melon (talk | contribs)   15:59, 29 January 2011

Actually, there shouldn't be whitespace added there at all, it should use whatever's in the source string. Done in r81191.

#Comment by SimonTrew (talk | contribs)   16:54, 2 February 2011

There are two similar characters for Angstrom: U+OOC5 LATIN CAPITAL LETTER A WITH RING ABOVE and U+212B ANGSTROM SIGN. Can this be handled similarly to how "mu" is handled? (Some fonts do render them a little differently.) ~~~~

#Comment by Bawolff (talk | contribs)   19:42, 3 February 2011

Mediawiki normalizes U+212B to U+00C5 before this extension would even see it, so I don't think that's an issue.


However, in the regexes, things like \x00C5 need to be \x{00C5} i believe.

#Comment by SimonTrew (talk | contribs)   03:05, 4 February 2011

Thank you for that pertinent reply. I am not familiar with WikiMedia, being mosly on Wikipedia. All is good then, and I hope my question was not out of order, at least we made a precedent together! My best wishes.

#Comment by SimonTrew (talk | contribs)   03:15, 4 February 2011

And apparently I can't even tell the difference between MediaWiki and WikiMedia, my bad. I have had quite a few odds and sods with the Template:Convert template though, which I use quite regularly, indeed have added a few little units to it myself, and am a little worried ok this will deal with one-to-one units but how does it deal with for example mm into in cm and vice versa, or km into mi ch (miles and chains as used on British railways and pretty much everywhere else the British Empire set foot, even Belgium bizarelly enough). It is that kind of quagmire you are going to get bogged down into with an all-fits-one solution. I don't think you are proposing that really but Template:Convert will not go away, just the jumping through hoops for simple arithmetic will go but the parsing will probably be half and half between that template and this extension. I note, for example, that the deduction of precision is nothing like as good as Template:Convert. User:Jimp is master there and you should appreciate his views. He would be quite happy to ditch it all I am sure bu after years of struggling he knows well enough one size does not fit all.

#Comment by Happy-melon (talk | contribs)   14:50, 4 February 2011

Bipartite units are indeed a challenge, but can still be dealt with much more efficiently using this function. For instance,

{{#expr: {{#convert: {{{1}}} | R }} + {{#convert: {{{2}}} | R }} }}

Combines two values and expresses them in the base unit of the relevant dimension, so it would convert {{foo|2 ft|3 in}} into "0.68" (metres), which can then be input to another layer of {{#convert:...}} functions to get the desired output. The design of this function is precisely to avoid trying to build a complete solution; on-wiki templates still have a part to play in presentation and layout. But that should be achievable through one template, not three thousand.

The algorithm used to establish precision here is different, but that doesn't necessarily make it inferior; and the algorithm used in {{convert}} has numerous holes and inconsistencies (an inevitable consequence of the limited parsing support). I've started a discussion on en:Template talk:Convert to compare the algorithms used, to see if there's anything which could be improved.

#Comment by Happy-melon (talk | contribs)   20:13, 16 February 2011

I think all the existing comments on this have been addressed, so setting back to new.

#Comment by Siebrand (talk | contribs)   08:31, 17 August 2011

There are typos in messages and key names with regards to "milli" that is somtimes written as "mili". I fear this may also cause functional issues.

#Comment by Happy-melon (talk | contribs)   21:23, 25 August 2011

Fixed in r95525.

#Comment by SimonTrew (talk | contribs)   06:38, 18 August 2011

I'm not sure that dimensional analysis is helpful — actually it is sometimes positively harmful. People for example measure their tyre pressure in pounds per square inch, although dimensionally it is foot-pounds per square inch, nobody actually says that in real life. (Sweeping statement there I know.) Another interesting one is miles per gallon versus litres per hundred kilometres, where it is a reciprocal conversion not a multiplication. In the UK, everyone including the auto companies quotes in miles per gallon, even though in the small print it will be in l/100km, and we buy our fuel in litres. I note also the difference in US and UK spelling for liter/litre, meter/metre etc, and wonder how this is coped with. It doesn't matter so much for EN:WP but for WIkimedia which has to support all languages perhaps is relevant — as I said above, as a wikignome on EN:WP I am not very well up on the scope of Wikimedia, so I can imagine wiser people than I have already thought of this kind of thing. ~~~~

#Comment by Tim Starling (talk | contribs)   03:50, 6 September 2011

Actually the measure of tyre pressure is pounds-force per square inch, not foot-pounds per square inch. See w:Pounds per square inch. Dimensional analysis is a useful technique, it would have identified your error here. Dimensional analysis can cope with reciprocals.

Spelling differences are best dealt with by using abbreviations, which are consistent and international. A style guide I once read recommended that abbreviations should always be used in preference to the unit names, since this convention avoids an unfortunate ambiguity between rad as an abbreviation for radians, and rad the unit name which is abbreviated to rd. Also they are shorter. However {{convert}} on Wikipedia uses full names by default so I suppose we should follow suit here.

It seems kind of strange to adopt the spelling conventions for metric units from the most obstinate anti-metric country in the world as our default. The Wikipedia templates do not take this approach, they use the British spelling by default, and allow an "sp=us" parameter to access the American spelling.

I suggest adding a method for configuring the language variant to be used on a wiki for unit name display. This would allow the new parser function to match the style used on Wikipedia and avoid the need for #language=en-gb to be added to every invocation. Setting "fixme" status for this feature request.

Ideally it should provide a map of base languages to sensible defaults, to provide Wikipedia-like functionality to small wikis without configuration, e.g.:

$wgPFUnitLanguageVariants = array(
   'en' => 'en-gb'
);

If $wgContLang is present as a key in the array, the language in the value of that element would be used by default for ConvertParser::$language.

#Comment by Tim Starling (talk | contribs)   04:57, 6 September 2011

Please add doc comments to the functions that lack them.

On line 421:

					self::$legalDimensions[$var],
					self::$legalDimensions[$var2],

Presumably this is meant to be $dim and $dim2. I get an "illegal offset type" warning because $var is an object, with test case {{#convert:1 kn|km/h}}. And with the same test case, on line 728:

	return $this->prefix
		? $this->conversion * self::$prefixes[$this->prefix][0]
		: $this->conversion;

This is broken with compound units, where $this->prefix is an array.

{{#convert:1 km| mi}} gives "1 mile". The correct answer to one significant figure would be 0.6 miles, so something is going wrong with the precision calculations here. Note that when you express a value between 1 and 2 with one significant figure, it's conventional to add another decimal place, e.g. 1.2 instead of 1.

I'm sorry to say, but these kinds of errors suggest that this module has not been sufficiently well-tested to justify immediate deployment to Wikimedia. I think registration of the #convert parser function should be made optional, and disabled by default for now, unless a lot of extra work is done.

Status & tagging log

  • 01:28, 13 June 2012 Krinkle (talk | contribs) changed the tags for r81074 [removed: nodeploy]
  • 13:36, 22 December 2011 Petrb (talk | contribs) changed the status of r81074 [removed: new added: deferred]
  • 00:07, 13 September 2011 Aaron Schulz (talk | contribs) changed the tags for r81074 [added: nodeploy]
  • 21:41, 7 September 2011 Happy-melon (talk | contribs) changed the status of r81074 [removed: fixme added: new]
  • 03:50, 6 September 2011 Tim Starling (talk | contribs) changed the status of r81074 [removed: new added: fixme]
  • 21:23, 25 August 2011 Happy-melon (talk | contribs) changed the status of r81074 [removed: fixme added: new]
  • 08:31, 17 August 2011 Siebrand (talk | contribs) changed the status of r81074 [removed: new added: fixme]
  • 20:13, 16 February 2011 Happy-melon (talk | contribs) changed the status of r81074 [removed: fixme added: new]
  • 22:57, 4 February 2011 Bawolff (talk | contribs) changed the status of r81074 [removed: new added: fixme]
  • 09:55, 1 February 2011 Reedy (talk | contribs) changed the status of r81074 [removed: fixme added: new]
  • 17:10, 27 January 2011 Raymond (talk | contribs) changed the status of r81074 [removed: new added: fixme]
  • 16:23, 27 January 2011 Raymond (talk | contribs) changed the status of r81074 [removed: fixme added: new]
  • 12:27, 27 January 2011 Siebrand (talk | contribs) changed the status of r81074 [removed: new added: fixme]