Thread:Talk:Search/Zero Width Joiner and Zero Width Non Joiner/reply (2)

TL;DR version:

I would file a bug, except that I'm not sure what the behaviour should be. I think the issue needs some discussion before a bug is filed: as I see it, the issue is complicated and there are several possible ways to resolve it.

Full comment:

My interest in these characters is in Indic languages, specifically Hindi.

Per the Unicode Indic joining behaviour model, there are four different ways in which ZWJ/ZWNJ can be used, and the resulting renderings can differ.

An example case is the following four pages (the page content has the Unicode sequence used):
 * w:hi:सदस्य:Siddhartha Ghai/वाङ्मय (search result)
 * w:hi:सदस्य:Siddhartha Ghai/वाङ्‍मय (search result)
 * w:hi:सदस्य:Siddhartha Ghai/वाङ‍्मय (search result)
 * w:hi:सदस्य:Siddhartha Ghai/वाङ्‌मय (search result)

(Note: The last two were created today and may not show up in search till tomorrow)
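To make the difference between the four titles visible, here is a minimal Python sketch that lists the code points of each variant. The exact sequences are my reading of the four subpage names above (no joiner, viram + ZWJ, ZWJ + viram, viram + ZWNJ) and should be treated as an assumption; the labels and the `describe` helper are just names made up for illustration.

```python
import unicodedata

ZWNJ = "\u200c"    # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"     # ZERO WIDTH JOINER
VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA

PREFIX = "\u0935\u093e\u0919"  # va, vowel sign aa, nga
SUFFIX = "\u092e\u092f"        # ma, ya

# Assumed sequences for the four subpage names (labels are illustrative).
variants = {
    "no joiner": PREFIX + VIRAMA + SUFFIX,
    "viram + ZWJ": PREFIX + VIRAMA + ZWJ + SUFFIX,
    "ZWJ + viram": PREFIX + ZWJ + VIRAMA + SUFFIX,
    "viram + ZWNJ": PREFIX + VIRAMA + ZWNJ + SUFFIX,
}


def describe(text):
    """Return one 'U+XXXX NAME' string per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


for label, text in variants.items():
    print(f"{label} ({len(text)} code points)")
    for line in describe(text):
        print("   ", line)
```

All four strings are distinct as far as the software is concerned, even where a font draws them identically.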

It should be noted that the rendering depends on which glyphs the font actually has. A font designed for, say, Sanskrit may have a full conjunct glyph, whereas one designed for Hindi may not (since Sanskrit used many more conjunct forms than Hindi, IIRC). As things stand, the proprietary Mangal font that ships with Windows by default shows all four of the above in the same way, in the fully expanded form with an explicit viram, since it doesn't contain a conjunct glyph for this cluster. With the font family changed to Lohit (the font used for Hindi in ULS), however, the rendering of the first page differs from the other three: the first shows a conjunct glyph, while the others still show the fully expanded form. There may be cases where all four renderings differ, but I don't know whether any font implements the full behaviour model yet.

Now, as far as language is concerned, the subpagename in all four is essentially the same word. The fact that the glyph may be rendered differently doesn't change how it's read (pronounced), or what it means.

So what we effectively have is four different ways to write the same word, which may show up as four different renderings or as a single one, depending on the font the user has.

This means that, as of now, depending on the IME a user is using, they may not find what they were looking for in search and may end up creating a duplicate page on the same topic. And the two titles may be rendered exactly the same for another user. Needless to say, this will leave the average user perplexed.

(Note: IIRC, I have come across one such case, where a newbie created a dupe because he couldn't find the article that he had created.)

I find this to be complicated, similar to the Unicode normalization issue, with various possible solutions.
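As a side note on the comparison with normalization: as far as I know, the standard Unicode normalization forms leave ZWJ and ZWNJ in place, so normalizing titles alone would not merge these variants. A small Python sketch to check this, where the two sample strings are assumed sequences for the word above and `differing_forms` is a hypothetical helper name:

```python
import unicodedata

# Assumed sequences: the word without a joiner, and with viram + ZWJ.
plain = "\u0935\u093e\u0919\u094d\u092e\u092f"
with_zwj = "\u0935\u093e\u0919\u094d\u200d\u092e\u092f"


def differing_forms(a, b):
    """Return the normalization forms under which a and b still differ."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return [
        form
        for form in forms
        if unicodedata.normalize(form, a) != unicodedata.normalize(form, b)
    ]


print(differing_forms(plain, with_zwj))
```

If the list printed contains all four form names, the two spellings stay distinct under every standard normalization form, which is why a separate rule for the joiners would be needed.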

Solution 1: Strip all ZWJ/ZWNJ from all text, pagenames and search queries
 * Pros:
   * No chance of page duplication
   * No search issues
 * Cons:
   * No ability to force particular glyphs
   * Probably problematic for Sanskrit Wikisource (where ZWJ/ZWNJ may really be needed)

Solution 2: Strip all ZWJ/ZWNJ from pagenames and search queries
 * Pros:
   * No chances of page duplication
   * No search issues
 * Cons:
   * No ability to force particular glyphs
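The stripping step that Solutions 1 and 2 rely on could look roughly like the Python sketch below. This only illustrates the idea and is not how MediaWiki handles titles; `strip_joiners` is a hypothetical helper, and the four titles are assumed sequences for the example word.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER

# Translation table that deletes both joiners.
_DELETE_JOINERS = {ord(ZWNJ): None, ord(ZWJ): None}


def strip_joiners(text: str) -> str:
    """Return text with every ZWJ and ZWNJ removed."""
    return text.translate(_DELETE_JOINERS)


# Assumed sequences for the four example titles.
titles = [
    "\u0935\u093e\u0919\u094d\u092e\u092f",        # no joiner
    "\u0935\u093e\u0919\u094d\u200d\u092e\u092f",  # viram + ZWJ
    "\u0935\u093e\u0919\u200d\u094d\u092e\u092f",  # ZWJ + viram
    "\u0935\u093e\u0919\u094d\u200c\u092e\u092f",  # viram + ZWNJ
]

canonical = {strip_joiners(title) for title in titles}
print(len(titles), "titles ->", len(canonical), "canonical form(s)")
```

Applied to both pagenames and search queries, all four spellings end up as the same string, which is where the "no duplication" and "no search issues" pros come from. The con is equally visible: the information needed to force a particular glyph is gone after stripping.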

Solution 3: Treat all four cases as one for search
 * Pros:
   * Probably easiest to implement
 * Cons:
   * Duplicate page creation remains possible
   * Even if the search functionality works, the text find and replace in the editbar, and the inbuilt find/replace feature of browsers may not work correctly.
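For Solution 3, one possible approach is to leave titles untouched and only fold the joiners away when building and querying the search index. The Python sketch below is a toy model under that assumption; `TitleIndex` and `fold` are hypothetical names and have nothing to do with the actual search backend.

```python
from collections import defaultdict

_DELETE_JOINERS = {0x200C: None, 0x200D: None}  # ZWNJ, ZWJ


def fold(text: str) -> str:
    """Search key: the text with ZWJ/ZWNJ removed."""
    return text.translate(_DELETE_JOINERS)


class TitleIndex:
    """Toy index mapping a folded key to the original titles."""

    def __init__(self):
        self._by_key = defaultdict(list)

    def add(self, title: str) -> None:
        # The stored title keeps its joiners; only the key is folded.
        self._by_key[fold(title)].append(title)

    def search(self, query: str) -> list:
        return list(self._by_key.get(fold(query), []))


# Assumed sequences for three of the example spellings.
plain = "\u0935\u093e\u0919\u094d\u092e\u092f"
with_zwj = "\u0935\u093e\u0919\u094d\u200d\u092e\u092f"
with_zwnj = "\u0935\u093e\u0919\u094d\u200c\u092e\u092f"

index = TitleIndex()
index.add(plain)
index.add(with_zwnj)

# A query typed with ZWJ still finds both existing pages.
for title in index.search(with_zwj):
    print([f"U+{ord(ch):04X}" for ch in title])
```

The query finds both pages regardless of which spelling the IME produced, but both pages still exist as separate titles, which matches the first con above: search is fixed, duplicate creation is not.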