Extension:Transliterator

From MediaWiki.org

Transliterator

Release status: beta

Implementation: Parser function
Description: Allows transliteration of words in entries based on rules in MediaWiki space.
Author(s): Conrad.Irwin
Last Version: 1.1.0 (02:05, 30 July 2009 (UTC))
MediaWiki: only tested with 1.16+
License: GPL 3
Download: SVN

An extension to allow transliteration based on ad-hoc schemes stored in the wiki's MediaWiki space (by default under the [[MediaWiki:Transliterator:]] pseudo-namespace).

Usage on pages

Most people will want to use the basic form:

{{#transliterate:<mapname>|<word>}}

This generates the transliteration of <word> based on the map found in [[MediaWiki:Transliterator:<mapname>]].

Usage in templates

Formatting output

The extension supports several extra parameters to help template-authors integrate this easily into their code. At a basic level, it allows customisation of the output using the third parameter.

{{#transliterate:<mapname>|<word>| ($1)}}

This outputs nothing if there is no page at [[MediaWiki:Transliterator:<mapname>]], or " (<transliteration>)" if the map exists. This lets template authors avoid wrapping the call in an {{#if:}} statement to check whether a transliteration can be created.

User-supplied transliterations

The fourth parameter lets you set a user override for the output of a transliteration. This has two uses: when the generated transliteration is incorrect, and when no map exists yet for a language.

{{#transliterate:<mapname>|<word>| ($1)|{{{tr|}}}}}

This outputs " ({{{tr}}})" if {{{tr}}} is not blank. If {{{tr}}} is blank and [[MediaWiki:Transliterator:<mapname>]] exists, it outputs " (<transliteration>)" as before. If {{{tr}}} is blank and [[MediaWiki:Transliterator:<mapname>]] does not exist, it outputs nothing.

Failure to transliterate

The final parameter allows for an "error" message to be displayed instead of a blank output in the two cases above. This can be useful where the transliteration is mission-critical, but should be used sparingly.

{{#transliterate:<mapname>|<word>||{{{tr|}}}|Please specify a transliteration!}}
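
The fallback order described in this section can be summarised as a minimal Python sketch. It is not the extension's code: the function name render_transliteration and its argument names are hypothetical, and treating an empty format parameter as a plain "$1" is an assumption made only so that the last example above has something to show.

```python
def render_transliteration(generated, fmt="$1", override="", error=""):
    """Mimic the documented fallback order of {{#transliterate:}}.

    generated: the map's output, or None when the map page does not exist.
    fmt:       third parameter, with $1 standing for the transliteration.
    override:  fourth parameter, the user-supplied transliteration.
    error:     fifth parameter, shown when nothing else is available.
    """
    fmt = fmt or "$1"  # assumption: an empty format behaves like "$1"
    override = override.strip()
    if override:
        # A non-blank user override always wins.
        return fmt.replace("$1", override)
    if generated is not None:
        # The map exists, so use its output.
        return fmt.replace("$1", generated)
    # No override and no map: error message, or nothing if none was given.
    return error
```

For example, render_transliteration(None, " ($1)") returns an empty string, while render_transliteration(None, " ($1)", "alfa") returns " (alfa)".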

Creating maps

Simple Overview

Every line of the map should contain a rule such as α => a or ae => æ. The longer the rule, the higher its priority. The special characters ^ and $ can be used to match only at the start and end of a word respectively (for example ^α => a or α$ => a).

Most of the time the rules are case-insensitive, so if you include π => p you don't need to include Π => P. For multiple-character rules you may need to duplicate them: ij => kl only provides the automatic rule Ij => Kl, so IJ => KL has to be added separately if you need it.

Lines that start with a # are ignored. The first line can be <sensitive> to make the rules case-sensitive, or <decompose> to use NFD instead of matching letters.

CAVEAT: This assumes a word is one or more Unicode characters, and uses the Unicode case-mappings, which may not be perfect for all languages.

Syntax

Blank lines and lines that start with # are ignored. Every other line should be a rule in the form left-hand-side=>right-hand-side, except the first line, which may contain flags instead of a rule. Whitespace is removed from the beginning and end of lines, and before and after the => symbol, so a => b is exactly equivalent to a=>b.
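
As an illustration of these syntax rules, here is a minimal Python sketch of a map reader. It is not the extension's parser: the name parse_map, the returned data shapes and the use of ValueError are all hypothetical choices for this example.

```python
FLAGS = ("<decompose>", "<sensitive>")


def parse_map(text):
    """Return (flags, rules) for the text of a map page.

    flags is a set of flag strings found on the first line.
    rules is a list of (left, right) pairs in page order.
    """
    flags, rules = set(), []
    for index, raw in enumerate(text.splitlines()):
        line = raw.strip()  # whitespace at either end of a line is ignored
        if not line or line.startswith("#"):
            continue  # blank lines and comments
        if index == 0:
            found = {flag for flag in FLAGS if flag in line}
            if found:
                flags = found  # the first line may hold flags, not a rule
                continue
        if "=>" not in line:
            raise ValueError(f"Invalid syntax {line}")
        left, right = line.split("=>", 1)
        # Whitespace around the => symbol is ignored as well.
        rules.append((left.strip(), right.strip()))
    return flags, rules
```

With this sketch, the lines "a => b" and "a=>b" both produce the pair ("a", "b").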

Transliteration process

Rules are applied one at a time, and each rule starts matching at the character after the end of the previous match. The rules are tried in length order, so the longest possible rule is used.

  • If the match starts on the start of a word (the previous character wasn't a letter, but this one is) then the rule ^x will match, and take priority over x.
  • If the match ends at the end of a word (this character is a letter, the next one isn't) then the rule x$ will match and take priority over x.
  • If no match is possible, and the first character is upper-case, then it is converted to lower-case, and the longest applicable rule is used.
  • If no rule with a left-hand side matches, then the default rule (a rule with nothing on the left-hand side, such as => ?) is used, and any occurrences of $1 in its right-hand side are replaced by the current first character.
  • If no rules have matched, and there is no default rule, then one character is passed through unchanged.
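
The steps above can be sketched in Python as follows. This is a simplified illustration under several assumptions, not the extension's algorithm: the function names are hypothetical, the input is processed one code point at a time rather than in the letter-based or NFD sections the extension uses, str.isalpha() stands in for "is a letter", an anchored rule only beats an unanchored rule of the same length, and capitalising the output after the lower-case fallback is inferred from the Ij => Kl example in the overview.

```python
def find_match(text, i, rules, max_len):
    """Return (length, replacement) for the longest rule matching at i."""
    at_start = text[i].isalpha() and (i == 0 or not text[i - 1].isalpha())
    for length in range(min(max_len, len(text) - i), 0, -1):
        end = i + length
        chunk = text[i:end]
        at_end = text[end - 1].isalpha() and (
            end == len(text) or not text[end].isalpha()
        )
        # Anchored forms are tried before the plain form.
        candidates = []
        if at_start and at_end:
            candidates.append("^" + chunk + "$")
        if at_start:
            candidates.append("^" + chunk)
        if at_end:
            candidates.append(chunk + "$")
        candidates.append(chunk)
        for key in candidates:
            if key in rules:
                return length, rules[key]
    return None


def transliterate(text, rules):
    """Apply rules (a dict of left-hand side to right-hand side) to text."""
    default = rules.get("")  # the rule with nothing on the left, if any
    max_len = max((len(key.strip("^$")) for key in rules), default=0)
    out, i = [], 0
    while i < len(text):
        match = find_match(text, i, rules, max_len)
        if match is None and text[i].isupper():
            # Retry with the first character lower-cased.
            lowered = text[:i] + text[i].lower() + text[i + 1:]
            match = find_match(lowered, i, rules, max_len)
            if match is not None:
                length, repl = match
                match = (length, repl[:1].upper() + repl[1:])  # assumption
        if match is not None:
            length, repl = match
            out.append(repl)
            i += length
        elif default is not None:
            out.append(default.replace("$1", text[i]))
            i += 1
        else:
            out.append(text[i])  # no rule applies: pass through unchanged
            i += 1
    return "".join(out)
```

With only the rule ij => kl, this sketch turns "Ij ij" into "Kl kl" but leaves "IJ" unchanged, which is why the overview suggests adding IJ => KL separately. Because anchors are stored as plain ^ and $ characters in the keys, the sketch cannot represent rules that match a literal ^ or $.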

Flags

The first line of a map can contain either or both of:

<decompose>
Divides strings into sections using the Unicode form NFD instead of the default letter-based form. This is particularly useful for languages such as Korean, or for situations in which the diacritics in the transliterated form match exactly the diacritics in the original. NOTE: the letter form is not quite NFC as combining diacritics are never split from their base even when there is no pre-combined character for them.
<sensitive>
Turns off the automatic lower-casing for the first letter of each rule with no upper-case match. This can be useful when you have very specific requirements and would rather nothing matched "by accident".

HTML entities

HTML entities are decoded on both the left-hand side and the right-hand side of rules. This allows diacritics to be entered in a form that is easy to edit, and allows the characters that make up the syntax of the maps to be escaped in the rare cases where you want to use them. As HTML entities are also decoded by your web browser, you will not see a difference unless you edit the page. For example, the HTML entities for "^", "$", ">" and " " are &#x5E;, &#x24;, &gt; and &#x20; (note that the common &nbsp; is not a normal space).
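
To see what the decoding does to a rule, the following Python sketch uses the standard library's html.unescape. The function name decode_rule is hypothetical, and decoding after the line has been split and trimmed is an assumption about ordering; it is what lets an escaped space or an escaped > survive.

```python
import html


def decode_rule(line):
    """Split a rule on the first => and decode entities on both sides."""
    left, right = line.split("=>", 1)
    # Trim first, then decode, so that &#x20; is kept as a real space.
    return html.unescape(left.strip()), html.unescape(right.strip())
```

Here decode_rule("&#x5E; => &#x24;") gives the pair ("^", "$"), and html.unescape("&nbsp;") gives the no-break space U+00A0 rather than an ordinary space.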

Possible errors

All of these error messages appear at the place where {{#transliterate}} is invoked. The maps are not parsed when they are saved.

Ambiguous rule <rule> in [[MediaWiki:Transliterator:<mapname>]]
This is caused when a map contains two rules with the same content on the left of the =>. This can never be correct, as it would leave the Transliterator to make an impossible decision as to which right-hand-side to replace the left-hand-side with.
Invalid syntax <rule> in [[MediaWiki:Transliterator:<mapname>]]
This is caused by a line that contains no "=>" and that does not start with a "#". The parser cannot decide whether you meant it to be a comment but forgot to say so, or whether you meant it to be a rule and got it wrong, so it asks for confirmation.
More than X rules in [[MediaWiki:Transliterator:<mapname>]]
In order that this extension doesn't create massive maps that could potentially consume the server's memory, it limits itself in size. The limit in number of rules is configurable as below. There is no real solution to this problem, unless you work out a better set of rules (with some multi-character sequences there are ways of using the longest-first property to leave out some repetitious rules).
Rule <rule> has more than X characters on the left in [[MediaWiki:Transliterator:<mapname>]]
Due to the algorithm used to transliterate, having long rules on the left both increases the size of the map, and increases the maximum time that may be taken in transliteration. If you find yourself wanting to break this limit, the chances are that your language cannot be transliterated automatically.
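
The rule-level checks described above can be sketched in Python as below. This is only an illustration: the function name check_rules is hypothetical, the message texts are abbreviated (the real messages also name the map page), and collecting every error rather than stopping at the first is an assumption. The default limits of 255 and 10 are the documented defaults of $wgTransliteratorRuleCount and $wgTransliteratorRuleSize.

```python
def check_rules(rules, max_rules=255, max_left=10):
    """Return a list of error strings for a list of (left, right) pairs."""
    errors = []
    if len(rules) > max_rules:
        errors.append(f"More than {max_rules} rules")
    seen = set()
    for left, right in rules:
        if left in seen:
            # Two rules share a left-hand side, so the choice is ambiguous.
            errors.append(f"Ambiguous rule {left} => {right}")
        seen.add(left)
        if len(left) > max_left:
            errors.append(
                f"Rule {left} => {right} has more than {max_left} "
                "characters on the left"
            )
    return errors
```

For instance, check_rules([("a", "b"), ("a", "c")]) reports the second rule as ambiguous.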

Advanced customisation

A synonym for the call {{#transliterate:}} can be added by editing [[MediaWiki:transliterator-invoke]]. If you customize this message, the original {{#transliterate}} will still work.

The namespace in which maps are put can be customized. By default it is "Transliterator:", but if you'd prefer a different place, edit [[MediaWiki:transliterator-prefix]]. It is not possible to move the maps outside of MediaWiki (and the chances are that doing so would be a bad idea anyway). NOTE: if you edit this message, all of your maps will need to be moved - so it is likely that once you have started using the extension you don't want to change it.

The global variable $wgTransliteratorRuleCount, by default 255, specifies the maximum number of entries in a mapping; while $wgTransliteratorRuleSize, by default 10, specifies the maximum length of the left hand side of the rule. These are totally arbitrary limits, and it may be the case that different bounds work better for you. You should set the configuration variables after requiring extensions/Transliterator/Transliterator.php as otherwise they will be overwritten by the defaults.

For the interested: The absolute maximum size of the lookup table created for each map is bounded by O($wgTransliteratorRuleCount^2 * $wgTransliteratorRuleSize + the size of the map page). The absolute maximum number of operations to transliterate something is O($wgTransliteratorRuleSize * input length). These are worst case and unlikely to appear in practice, particularly as most transliteration schemes deal with individual letters or digraphs.