Talk:Unicode normalization considerations: Difference between revisions

__TOC__
== PHP 6 Native Unicode Support ==
Hey guys,


The only anomaly I found is that pasting vowelized text into the edit page only shows partial vowelization. On the "saved" wiki page it appears correctly. [[User:Dovi|Dovi]] 05:49, 18 June 2008 (UTC)

== Examples when normalization should be performed and when it should not ==

→ [[bugzilla:022031]] "''deactivate Unicode normalization via <nowiki><foobar>bla</foobar></nowiki>''"

While investigating some authors in various book catalogues, I ran into a problem because I had used homographic tags at http://www.librarything.com/ :<br />

http://www.librarything.com/work/9352937/book/54517382 tag Kálmán Kalocsay
>>> http://www.librarything.com/catalog/gangleri&tag=K%C3%A1lm%C3%A1n%20Kalocsay
http://www.librarything.com/work/9393183/book/54949365 tag Kálmán Kalocsay
>>> http://www.librarything.com/catalog/gangleri&tag=Ka%CC%81lma%CC%81n%20Kalocsay

At http://pastebin.org/ I could see that the first tag was « Kálmán Kalocsay » and the second « Ka&amp;#769;lma&amp;#769;n Kalocsay » . The second example uses [http://www.fileformat.info/info/unicode/char/0301/index.htm Unicode Character 'COMBINING ACUTE ACCENT' (U+0301)].
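
The difference between the two tags can be reproduced outside the wiki. The following is a minimal Python sketch, assuming the first tag uses the precomposed letter U+00E1 and the second the sequence "a" + U+0301, as the two URLs above suggest. The variable names are only illustrative, and the strings are written with escapes so that no editor or wiki can silently normalize them.

```python
import unicodedata
from urllib.parse import quote

# Written with escapes so the two forms cannot be normalized by accident.
precomposed = "K\u00e1lm\u00e1n Kalocsay"    # a-acute as the single code point U+00E1
decomposed = "Ka\u0301lma\u0301n Kalocsay"   # "a" followed by COMBINING ACUTE ACCENT U+0301

# The strings look identical on screen but are different code point sequences.
print(precomposed == decomposed)

# Percent-encoding the UTF-8 bytes makes the difference visible.
print(quote(precomposed))
print(quote(decomposed))

# NFC composes the second form into the first; NFD does the reverse.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
print(unicodedata.normalize("NFD", precomposed) == decomposed)
```

The two percent-encoded strings correspond to the <code>tag=</code> values of the two catalogue URLs above.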

I use various computers in various places, with different operating systems, different browser versions and different fonts installed. Sometimes it is not possible to distinguish the homographs. The chance of detecting them is higher on older computers, with older browser versions etc.

Many sites, such as loc.org, worldcat.com and librarything.com, use data records which are not normalized.

http://www.worldcat.org/oclc/63378583
>>> La kontrubuo de Kálmán Kalocsay al la Esperanta kulturo
http://opc4.kb.nl/DB=1/PPN?PPN=801854571
>>> La kontrubuo de Kálmán Kalocsay al la Esperanta kulturo / Reinhard Haupenthal

http://pastebin.org/71649 shows that the first example also uses &amp;#769; : « Ka&amp;#769;lma&amp;#769;n Kalocsay »

a) I wonder how such texts can be documented. MediaWiki normalizes the text as soon as a page is previewed or saved.
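
One possible workaround for a), sketched here in Python under stated assumptions, is to make the combining characters visible before the text is pasted, using the same numeric character reference notation as in the examples above. The function names <code>show_combining_marks</code> and <code>is_nfc</code> are hypothetical. The check relies on <code>unicodedata.combining</code>, which only flags characters with a non-zero canonical combining class, so this is not a complete treatment of every mark.

```python
import unicodedata

def show_combining_marks(text):
    """Replace each character with a non-zero combining class by a
    visible numeric character reference such as &#769;."""
    return "".join(
        f"&#{ord(ch)};" if unicodedata.combining(ch) else ch
        for ch in text
    )

def is_nfc(text):
    """Return True if NFC normalization would leave the text unchanged."""
    return unicodedata.normalize("NFC", text) == text

record = "Ka\u0301lma\u0301n Kalocsay"   # decomposed form, written with escapes
print(is_nfc(record))                    # would normalization alter this record?
print(show_combining_marks(record))      # pure ASCII here, so nothing is left to normalize
```

The escaped output still has to be displayed literally on the wiki page, for example by writing <code>&amp;amp;#769;</code> as in the examples above.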

b) I tried to generate some search links for loc.org and worldcat.org because I saw many different spellings in the transliteration of Yiddish authors, book titles etc. The work is pointless when the search terms are passed as UTF-8 in parameters together with <nowiki>{{URLENCODE:foo}} </nowiki>&nbsp;<sup>[http://test.wikipedia.org/w/index.php?curid=36353 template talk:Bswc]</sup>. Useful wiki or HTML code can be generated only if properly urlencoded substrings are passed.
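
For b), the urlencoded substrings can be prepared outside the wiki, so that only ASCII reaches the page and there is nothing left to normalize. The following is a minimal Python sketch, not a definitive implementation. The helper <code>tag_link</code> is hypothetical, and the base URL is simply the LibraryThing catalogue pattern quoted above, used for illustration.

```python
from urllib.parse import quote, unquote

def tag_link(base_url, term):
    """Append the percent-encoded term to base_url without normalizing it."""
    # safe="" encodes every reserved character, including "/" and "&".
    return base_url + quote(term, safe="")

# URL pattern taken from the catalogue links above (illustration only).
BASE = "http://www.librarything.com/catalog/gangleri&tag="

term = "Ka\u0301lma\u0301n Kalocsay"   # decomposed form, written with escapes
link = tag_link(BASE, term)
print(link)

# Decoding the encoded part gives back exactly the original code points.
print(unquote(link[len(BASE):]) == term)
```

Because the result is pure ASCII, it can be pasted into wiki or HTML code without the original spelling of the record being changed.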

Conclusion: "''copy and paste''" is a wonderful feature when the context is known. But sometimes the content should be preserved and sometimes normalization makes sense. The documentation of historical data processing systems, historical digital data collections, catalogues etc. would require the partial deactivation of Unicode normalization.

Probably the best way would be to implement such a deactivation via &lt;foobar&gt;bla&lt;/foobar&gt;. This would be a fair solution for citations. I am not sure how this should be handled for template parameters.<br />
1) <nowiki>{{foo|&lt;foobar&gt;bla&lt;/foobar&gt;}}</nowiki> would require a lot of additional work when combined with copy and paste.<br/>
2) <nowiki>&lt;foobar&gt;{{foo|bla}}{{foo|bla bla}}{{foo|bla bla bla}}&lt;/foobar&gt;</nowiki> would be easier when generating lists.

Best regards [[user:Gangleri]]<br />
[[User:לערי ריינהארט|לערי ריינהארט]] 11:41, 6 January 2010 (UTC)