Talk:Unicode normalization considerations

PHP 6 Native Unicode Support
Hey guys,

Last week, I was a volunteer at the PHP Québec 2007 conference held in Montreal. I managed to attend Andrei Zmievski's presentation of the upcoming Unicode support in PHP 6 and it simply blew my mind! Not only will there be fully native Unicode support, but once the encoding has been declared, PHP will be able to recognize all languages simultaneously, directly in your class, function or whatever. But the truly amazing part of the demonstration was that PHP recognized a function named in, say, Greek (ltr) with an argument passed in Hebrew (rtl) without the text direction ever having to be declared...

OK, I'm still a newbie in the PHP world but that seemed pretty powerful to me! I'm not sure if this could be useful to solve the current issue but I'm sure it is definitely worth looking into before planning too far in the future.

Stéphane Thibault 06:55, 19 March 2007 (UTC)

Firefox 3
Hebrew vowelization seems much improved in Firefox 3. It is important to document exactly what changed and how.

Firefox 3 seems to correctly represent the vowel order for webpages in general and Wikimedia pages in particular.

The only anomaly I found is that pasting vowelized text into the edit page only shows partial vowelization. On the "saved" wiki page it appears correctly. Dovi 05:49, 18 June 2008 (UTC)

Examples when normalization should be performed and when it should not
→ 022031 "deactivate Unicode normalization via bla "

While investigating some authors in various book catalogues I ran into a problem because I had used homographic tags at http://www.librarything.com/ :

http://www.librarything.com/work/9352937/book/54517382 tag Kálmán Kalocsay >>> http://www.librarything.com/catalog/gangleri&tag=K%C3%A1lm%C3%A1n%20Kalocsay
http://www.librarything.com/work/9393183/book/54949365 tag Kálmán Kalocsay >>> http://www.librarything.com/catalog/gangleri&tag=Ka%CC%81lma%CC%81n%20Kalocsay

At http://pastebin.org/ I could see that the first tag was « Kálmán Kalocsay » and the second « Ka&#769;lma&#769;n Kalocsay ». The second example uses the Unicode character 'COMBINING ACUTE ACCENT' (U+0301).
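For anyone who wants to check such homographs without relying on fonts or on pastebin, here is a minimal Python sketch. It only uses the standard `unicodedata` module; the variable names and the `describe` helper are made up for illustration, and the two strings are typed with explicit escapes so that no editor or wiki can normalize them on the way.

```python
import unicodedata

# The two tags, written with explicit escapes so nothing normalizes them silently.
precomposed = "K\u00e1lm\u00e1n Kalocsay"   # á as one code point, U+00E1
decomposed = "Ka\u0301lma\u0301n Kalocsay"  # a followed by U+0301

def describe(text):
    """Return (code point, Unicode name) pairs for every non-ASCII character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text if ord(ch) > 127]

print(precomposed == decomposed)  # False: the strings look alike but differ
print(describe(precomposed))      # two entries for U+00E1
print(describe(decomposed))       # two entries for U+0301

# After NFC normalization (which MediaWiki applies), the difference disappears.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

The last line is exactly why the two tags cannot be documented side by side on a wiki page: once both are normalized to NFC they are the same string.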

I am using various computers in various places having different operating systems, browsers with different versions and different fonts installed. Sometimes it is not possible to distinguish the homographs. The chance to detect them is higher using older computers, older versions etc.

Many sites, such as loc.org, worldcat.com and librarything.com, use data records that are not normalized.

http://www.worldcat.org/oclc/63378583 >>> La kontrubuo de Kálmán Kalocsay al la Esperanta kulturo
http://opc4.kb.nl/DB=1/PPN?PPN=801854571 >>> La kontrubuo de Kálmán Kalocsay al la Esperanta kulturo / Reinhard Haupenthal

http://pastebin.org/71649 shows that the first example also uses &#769; : « Ka&#769;lma&#769;n Kalocsay »

a) I wonder how such texts can be documented at all. MediaWiki performs the normalization as soon as a page is previewed or saved.

b) I tried to generate some search links for loc.org and worldcat.org because I saw many different spellings in the transliteration of Yiddish authors, book titles etc. The work is pointless when the search items are passed as UTF-8 in parameters together with template talk:Bswc. Useful wiki or HTML code can be generated only if properly urlencoded substrings are passed.
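To illustrate why urlencoded substrings survive where raw UTF-8 does not, here is a small Python sketch using the standard `urllib.parse` module. The `encode_tag` helper is hypothetical; the point is only that percent-encoding is pure ASCII, so a later normalization step has nothing to change.

```python
from urllib.parse import quote, unquote

precomposed = "K\u00e1lm\u00e1n Kalocsay"   # á as U+00E1
decomposed = "Ka\u0301lma\u0301n Kalocsay"  # a + U+0301

def encode_tag(tag):
    """Percent-encode a tag as UTF-8 without normalizing it first."""
    return quote(tag, safe="")

print(encode_tag(precomposed))  # K%C3%A1lm%C3%A1n%20Kalocsay
print(encode_tag(decomposed))   # Ka%CC%81lma%CC%81n%20Kalocsay

# Decoding gives back the original, still non-normalized string.
print(unquote(encode_tag(decomposed)) == decomposed)  # True
```

The two outputs match the tag parameters in the LibraryThing links quoted above, so the encoded form keeps the distinction that the displayed text loses.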

Conclusion: "copy and paste" is a wonderful feature when the context is known. But sometimes the content should be preserved and sometimes normalization makes sense. Documenting historical data processing systems, historical digital data collections, catalogues etc. would require partially deactivating the Unicode normalization.

Probably the best way would be to implement such a deactivation via <foobar>bla</foobar>. This would be a fair solution for citations. I am not sure how this should be handled for template parameters.

1) would require a lot of additional work in combination with copy and paste. 2) <foobar></foobar> would be easier when generating lists.

Best regards user:Gangleri

לערי ריינהארט 11:41, 6 January 2010 (UTC)

Longer term: Three different titles
For the longer term solution, IMHO there should be three titles (in order of increasing normalisation):
 * 1) the displayed title
 * 2) the URI title
 * 3) the title that forms the DB key
For example, the displayed title could be iMonkëy 123 (extending the example on the main page), the URI title could be iMonkey_123, and the DB key would be IMONKEY123. To find an article, MediaWiki would normalise the URI given to the DB key. If the URI title for that article does not match the given URI, the user would be HTTP-redirected to the nice URI. The normalisation would be configurable on a per-wiki basis, for example, the English Wikipedia could use:
 * 1) NFC
 * 2) NFKC, remove diacritics, replace spaces with "_"
 * 3) upper case, remove non-alphanumeric chars
whereas the German Wikipedia might choose to map umlauts to their ASCII ersatz rendering (ä=>ae, ö=>oe, etc.) for the URI. -- Cfaerber 08:58, 23 September 2010 (UTC)
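The three-title scheme can be sketched roughly as follows in Python. This is only an illustration of the proposal, not MediaWiki code: the function names, the `GERMAN_MAP` table (including its upper-case entries) and the choice to keep only ASCII letters and digits in the DB key are all assumptions made for the example.

```python
import re
import unicodedata

# Hypothetical per-wiki table: umlauts to their ASCII ersatz rendering.
GERMAN_MAP = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue",
                            "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"})

def display_title(title):
    """1) Displayed title: NFC only."""
    return unicodedata.normalize("NFC", title)

def uri_title(title, mapping=None):
    """2) URI title: NFKC, optional per-wiki mapping, no diacritics, '_' for spaces."""
    text = unicodedata.normalize("NFKC", title)
    if mapping is not None:
        text = text.translate(mapping)
    # Decompose, then drop combining marks to remove the remaining diacritics.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped).replace(" ", "_")

def db_key(title, mapping=None):
    """3) DB key: upper-cased URI title with non-alphanumeric characters removed."""
    # Assumption: "alphanumeric" means ASCII A-Z and 0-9 here.
    return re.sub(r"[^A-Z0-9]", "", uri_title(title, mapping).upper())

print(display_title("iMonk\u00eby 123"))   # iMonkëy 123
print(uri_title("iMonk\u00eby 123"))       # iMonkey_123
print(db_key("iMonk\u00eby 123"))          # IMONKEY123
print(uri_title("K\u00e4se", GERMAN_MAP))  # Kaese
```

A lookup would then compute `db_key` for the requested URI, fetch the article, and redirect if the stored `uri_title` differs from what was requested.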