Topic on Talk:Search/status

Zero Width Joiner and Zero Width Non Joiner

Siddhartha Ghai

Hi.

I'm curious what search does when an input string contains a ZWJ or ZWNJ Unicode character. Does it also look for results without the ZWJ/ZWNJ? And what happens if a search doesn't contain ZWJ/ZWNJ, but a page exists with the exact same spelling plus one of these characters in between?

As far as I know, search on the WMF cluster currently doesn't treat words that include ZWJ/ZWNJ the same as those that don't. I don't think this behaviour is correct, and the matter probably needs to be investigated: I think some Indic language IMEs provide options for inputting these characters (to force the rendering of a particular glyph), so pages with titles containing them may be created.
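To make the mismatch concrete, here is a minimal Python sketch of why an exact-match lookup sees these as different words. The Devanagari sequence is only an illustration chosen for this example (KA + VIRAMA + SSA); it is not one of the pages discussed in this thread.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER

# Illustrative sequence only (KA + VIRAMA + SSA), not taken from the wiki.
plain = "\u0915\u094d\u0937"
with_zwj = "\u0915\u094d" + ZWJ + "\u0937"
with_zwnj = "\u0915\u094d" + ZWNJ + "\u0937"

variants = [plain, with_zwj, with_zwnj]

# A naive exact comparison treats every variant as a different string,
# even though the invisible characters only influence rendering.
distinct = len(set(variants))
lengths = [len(v) for v in variants]

print(distinct)
print(lengths)
```

Removing the two zero-width characters from each variant gives back the same string, which is the property the solutions discussed further down rely on.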

NEverett (WMF)

So we've been holding off on these kinds of issues until we're able to get the Unicode plugin for Elasticsearch deployed on the cluster. The plan is for CirrusSearch to use it (if it is installed on Elasticsearch) to take a first crack at the problem and then go from there. We're willing and able to go beyond that, but we'd like to start there. The holdup is just that Elasticsearch plugins are deployed differently than most other things at WMF, so we have to work up a special mechanism for them. We're moving along on that project, so we should be able to start really improving things "soon".

Still, I'd love some good test cases to make sure that we're going in the right direction. I'd be thrilled if you filed a bug with some examples of things that don't work but should.

Siddhartha Ghai

TL;DR version:

I would file a bug except that I'm not sure what the behaviour should be. I think the issue needs some discussion before an actual bug is filed, since as I see the issue, it is complicated, and there are several potential methods to resolve it.

Full comment:

My interest in these chars is in Indic languages, specifically Hindi.

Per the Unicode Indic joining behaviour model, there are 4 different ways in which ZWJ/ZWNJ can be used, with the resulting renderings differing.

An example case is the following four pages (the page content shows the Unicode sequence used):

(Note: The last two were created today and may not show up in search till tomorrow)

It should be noted that the rendering would differ depending on what glyphs the actual font has. So, a font designed for, say, Sanskrit may have a full conjunct glyph, whereas one for Hindi may not (since Sanskrit used many more conjunct forms than Hindi, IIRC). As for the current situation, the proprietary Mangal font that ships with Windows by default shows the above four in the same way, in the fully expanded form with explicit virama, since it doesn't contain a conjunct glyph for this sequence. However, on changing the font family to Lohit (the font used for Hindi in ULS), the rendering for the first page differs from the other three: the first shows a conjunct glyph while the others still show the fully expanded form. There may be cases where all four renderings differ, but I don't know whether any fonts implement the behaviour model yet.

Now, as far as language is concerned, the subpagename in all four is essentially the same word. The fact that the glyph may be rendered differently doesn't change how it's read (pronounced), or what it means.

So what we have effectively is four different ways to write the same word, possibly with four different renderings or one rendering depending on the font the user has.

This means that as of now, depending on the IME a particular user is using, they may not find in search what they were looking for and end up creating duplicate pages on the same topic. The two titles may even be rendered exactly the same for another user. Needless to say, this will leave the average user perplexed.

(Note: IIRC, I have come across one such case where a dupe was created by a newbie when he couldn't find the article that he created)

I find this to be complicated, similar to the Unicode normalization issue, with various possible solutions.

Solution 1

Strip all ZWJ/ZWNJ from all text and pagenames and search queries

Pros:
  • No chances of page duplication
  • No search issues
Cons:
  • No ability to force particular glyphs
  • Probably problematic for Sanskrit Wikisource (where ZWJ/ZWNJ may be really needed)
Solution 2

Strip all ZWJ/ZWNJ from pagenames and search queries

Pros:
  • No chances of page duplication
  • No search issues
Cons:
  • No ability to force particular glyphs
Solution 3

Treat all four cases as one for search

Pros:
  • Probably easiest to implement
Cons:
  • Duplicate page creation remains possible
  • Even if the search functionality works, the text find and replace in the editbar, and the inbuilt find/replace feature of browsers may not work correctly.
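
As a rough illustration of Solution 3, the sketch below strips ZWJ/ZWNJ only when building a search key, and leaves the stored titles untouched. The function names (`search_key`, `build_index`, `search`) and the sample titles are hypothetical and exist only for this example; this is not how CirrusSearch is implemented.

```python
# Translation table that deletes ZWNJ (U+200C) and ZWJ (U+200D).
_STRIP_ZERO_WIDTH = {0x200C: None, 0x200D: None}


def search_key(text):
    """Return the text with ZWJ/ZWNJ removed, for matching only."""
    return text.translate(_STRIP_ZERO_WIDTH)


def build_index(titles):
    """Group stored titles by search key; the titles themselves are unchanged."""
    index = {}
    for title in titles:
        index.setdefault(search_key(title), []).append(title)
    return index


def search(index, query):
    """Look up a query after applying the same key function to it."""
    return index.get(search_key(query), [])


# Illustrative titles (KA + VIRAMA + SSA with and without joiners).
titles = [
    "\u0915\u094d\u0937",
    "\u0915\u094d\u200d\u0937",
    "\u0915\u094d\u200c\u0937",
]
index = build_index(titles)
results = search(index, "\u0915\u094d\u200d\u0937")
print(len(results))
```

Any key that maps to more than one title is also a candidate duplicate, which is the remaining con of this solution: search finds all of the pages, but nothing prevents them from being created. Applying `search_key` to page names at creation time instead would correspond to Solution 2.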
NEverett (WMF)

Sorry for the super duper late reply, but, here goes:

I can use case folding to flatten all four of these examples into "the same" word from search's perspective. That is, NFKC with case folding tacked on the end.
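For readers who want to see the effect, here is a small Python approximation of that kind of folding. Python's standard library has no single "NFKC plus case folding" form, so the sketch applies NFKC and `str.casefold()` separately and removes ZWJ/ZWNJ explicitly. That explicit removal is an assumption about what the search-side normalizer does with these characters; the helper name `fold` is made up for this example.

```python
import unicodedata

# Translation table that deletes ZWNJ (U+200C) and ZWJ (U+200D).
_STRIP_ZERO_WIDTH = {0x200C: None, 0x200D: None}


def fold(text):
    """Approximate 'NFKC with case folding', also dropping ZWJ/ZWNJ."""
    stripped = text.translate(_STRIP_ZERO_WIDTH)
    folded = unicodedata.normalize("NFKC", stripped).casefold()
    # Case folding can produce non-normalized output, so normalize again.
    return unicodedata.normalize("NFKC", folded)


# Illustrative inputs, not taken from the wiki.
plain = "\u0915\u094d\u0937"
with_zwj = "\u0915\u094d\u200d\u0937"
with_zwnj = "\u0915\u094d\u200c\u0937"

print(fold(plain) == fold(with_zwj) == fold(with_zwnj))
print(fold("\uff37iki"))  # fullwidth "W" followed by "iki"
```

Note that plain NFKC on its own leaves ZWJ and ZWNJ in place, which is why the sketch strips them as a separate step.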

Now some choices:

  1. Do this on both the analyzers that we use for text or just the less exact one. If I just do the less exact one then the words that match without normalization will bubble above those that match with normalization. And, by default, "quoting" a word will not find it normalized. I'm leaning towards adding the normalization to both analyzers for this reason.
  2. Should I add this to all languages, most languages, just languages for which I don't have a good default, or just languages that ask for it? Note that I'm actually waiting on a change upstream to enable me to add things to "all" or "most" languages.
  3. Other stuff?
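
The "bubble above" effect in choice 1 can be sketched with a toy ranking in Python. Everything here is an assumption made for illustration: the two key functions stand in for the exact and less exact analyzers, and the weights 2 and 1 are arbitrary, not values CirrusSearch uses.

```python
import unicodedata

# Translation table that deletes ZWNJ (U+200C) and ZWJ (U+200D).
_STRIP_ZERO_WIDTH = {0x200C: None, 0x200D: None}


def exact_key(text):
    """Stand-in for the exact analyzer: no normalization at all."""
    return text


def folded_key(text):
    """Stand-in for the less exact analyzer: normalization applied."""
    stripped = text.translate(_STRIP_ZERO_WIDTH)
    return unicodedata.normalize("NFKC", stripped).casefold()


def score(query, title):
    """Hypothetical scoring: an exact match adds more than a folded match."""
    total = 0
    if exact_key(query) == exact_key(title):
        total += 2
    if folded_key(query) == folded_key(title):
        total += 1
    return total


def rank(query, titles):
    """Return matching titles, best score first."""
    scored = [(score(query, title), title) for title in titles]
    scored.sort(key=lambda pair: -pair[0])
    return [title for points, title in scored if points > 0]


plain = "\u0915\u094d\u0937"
with_zwj = "\u0915\u094d\u200d\u0937"
ranked = rank(with_zwj, [plain, with_zwj, "unrelated"])
print(ranked == [with_zwj, plain])
```

With normalization only in the less exact key, the title that matches the query character for character ranks first and the normalized match follows it. If both keys normalized, the two titles would tie.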

Siddhartha Ghai

Sorry for the super duper late reply (went on a wikibreak):

I don't think applying case folding to search queries will have a major effect on projects in languages that don't have case. AFAIK, none of the Indic family scripts have case. Do note, though, that just because a project is in an Indic language doesn't necessarily mean that there won't be any content in other, case-sensitive languages. There can always be discussions, Help pages and MediaWiki: namespace stuff in English, so searches related to such content will be affected.

The decision about whether or not to apply case folding by default could be made on the basis of how much content on a particular project seems to be in a case-sensitive language. Finding this out will, of course, require some database queries to analyze how much content is in which script on the project.
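As a rough idea of what such an analysis could compute, the Python sketch below estimates the share of letters in a text sample that have distinct upper and lower case forms. It works on plain strings rather than on the database, and the function name `cased_letter_share` and the sample strings are made up for this example.

```python
def cased_letter_share(text):
    """Return the fraction of letters that have distinct upper/lower forms."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    cased = sum(1 for ch in letters if ch.lower() != ch.upper())
    return cased / len(letters)


# Illustrative samples: Latin only, Devanagari only, and a mix.
latin_only = cased_letter_share("Help")
devanagari_only = cased_letter_share("\u0915\u092e\u0932")
mixed = cased_letter_share("ab \u0915\u092e")

print(latin_only, devanagari_only, mixed)
```

Run over a sample of page text per project, a number like this would give a first hint of how much content case folding could affect. Combining marks are not counted as letters here, so it is only an approximation.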

So:

  1. I also think applying it to both analyzers would be better
  2. The change should be applied on a case-by-case basis to language projects that ask for it (although if the change is found useful on a few language projects of the Indic script family, I think it can be extended to all Indic scripts).
  3. Other stuff: This resolves the search part, but not the title part. Ideally, it shouldn't be possible to create four different pages for the same title, and, if needed, the glyph to be used in the title should be controlled by a magic word or something. Not sure where to raise this point for a proper discussion. Ideas?