Topic on Extension talk:CirrusSearch

Suggestion: Provide a plain search (no analysis)

197.218.80.183 (talkcontribs)

Issue:

It is currently impossible to search for an exact string that contains certain symbols.

Steps to reproduce

  1. Search for content that is added by a template or contains symbols, e.g. "〃": https://en.wikipedia.org/w/index.php?search=%22%E3%80%83%22&title=Special%3ASearch&profile=advanced&fulltext=1&advancedSearch-current=%7B%22namespaces%22%3A%5B0%5D%7D&ns0=1
  2. Go to the page and use the browser search to find it.

Expected

It should be possible to find basic symbols

Actual

Certain symbols are impossible to find

Proposed solution

  • Add a plaintext keyword, e.g. "plaintext:〃".

This would do no analysis, no stemming, no normalization.

Notes:

Insource doesn't always work for this because it can only match content saved in the page's wikitext; it can't see transcluded content. The default search can't address it either, because it is optimized for readers and tries its best to normalize searches. While looking for a way to escape Elasticsearch strings, I came across this possible solution:

https://discuss.elastic.co/t/how-to-index-special-characters-and-search-those-special-characters-in-elasticsearch/42506. This could also improve the current limited "exact text" field in Extension:AdvancedSearch.
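To sketch what that thread suggests: index a copy of the text with a whitespace-only analyzer, so nothing is stripped or normalized beyond splitting on spaces. A rough illustration of such index settings, written as a Python dict to be passed to the index-creation API (the analyzer and field names here are made up, not anything from CirrusSearch):

# Sketch only: a whitespace-only analyzer, so tokens are split on spaces
# and no other stripping or normalization happens.
plain_index = {
    "settings": {
        "analysis": {
            "analyzer": {
                # "whitespace" is a built-in Elasticsearch tokenizer
                "plain_whitespace": {"type": "custom", "tokenizer": "whitespace"}
            }
        }
    },
    "mappings": {
        "properties": {
            # hypothetical field holding the rendered page text
            "text_plain": {"type": "text", "analyzer": "plain_whitespace"}
        }
    },
}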

CKoerner (WMF) (talkcontribs)

Hey IP, thanks for the suggestion. I've added a note to the phab task to mention this request.

197.218.84.150 (talkcontribs)

That task was only one use case. It will not solve the general problem; see https://www.mediawiki.org/w/index.php?title=Topic:Updc1v7soubhg6bq for a real-world example. The problem is that while all these transformations do help in the general case, they don't always work properly for a multilingual platform like MediaWiki. So in that instance exact search will never be exact, because it will always be case-folded and have many tokens stripped.


For instance, I randomly found a symbol (〆) while reading an article, and searched for it. Google finds many cases (Google: 〆 site:en.wikipedia.org), while English Wikipedia currently only finds a single one. The reason it even finds that character at all is that there is a redirect to it.


The generic problem can probably only be solved by a different search keyword.

TJones (WMF) (talkcontribs)

Yeah, the general case is different from the German daß/dass problem in that "non-word" symbols, like punctuation, are not going to be indexed even if we deal with ß/ss correctly.

> This would do no analysis, no stemming, no normalization.

I can see not doing stemming or normalization, but "analysis" includes tokenization, which is more or less breaking text up into words in English (and much more complex in Chinese and Japanese, for example). Would you want to skip tokenization, too?

Without tokenization, would a search for bot return matches for bot, robot, botulism, and phlebotomy? Would you want to be able to search on "ing te" and match "breaking text", but not "breaking  text" (with two spaces between words)? Would you want searches for text, text,, text., and text" to all give different results? It sounds like the answer is yes, so I'll assume that's the case.
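To make the difference concrete, here is a rough Python sketch (not how CirrusSearch actually works, just the concept) contrasting raw substring matching, which is what skipping tokenization entirely implies, with matching against whitespace-separated tokens:

# Raw substring search: "bot" matches inside unrelated words.
docs = ["a robot appeared", "botulism outbreak", "the bot edited the page"]
print([d for d in docs if "bot" in d])          # all three documents match

# Token-based search: only whole "words" match.
print([d for d in docs if "bot" in d.split()])  # only the last document matches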

The problem is that this kind of search is extremely expensive. For the current insource regex search, we index the text as trigrams (3-character sequences—so some text is indexed as som, ome, "me " (with a final space), "e t" (with a space in the middle), " te" (with an initial space), tex, and ext). We try to find trigrams in the regex being searched to limit the number of documents we have to scan with the exact regex. That's why insource regex queries with only one character, or with really complex patterns and no plain text, almost always time out on English Wikipedia—they have to scan the entire document collection looking for the one character or the complex pattern. But insource queries for /ing text/ or /text\"/ have a chance—though apparently matching the trigram ing gives too many results in English and the query still times out!
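As a rough illustration of the trigram idea (this is the concept only, not the actual CirrusSearch or Elasticsearch code), extracting 3-character sequences from the text and from the literal parts of a regex lets you pre-filter candidate documents before running the expensive exact regex over them:

import re

def trigrams(s):
    # every overlapping 3-character sequence, spaces included
    return {s[i:i + 3] for i in range(len(s) - 2)}

print(sorted(trigrams("some text")))
# [' te', 'e t', 'ext', 'me ', 'ome', 'som', 'tex']

# Pre-filter: only documents containing all trigrams of the pattern's
# literal text need to be scanned with the real regex.
doc = "some breaking text here"
if trigrams("ing text") <= trigrams(doc):
    print(bool(re.search(r"ing text", doc)))  # True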

Indexing every letter (or even every bigram) would lead to incredibly large indexes, with many index entries having millions of documents (most individual letters, all common short words like in, on, an, to, of, and common grammatical inflections like ed). Right now you can search for "the" on English Wikipedia and get almost 5.7M hits. It works and doesn't time out because no post-processing of those documents is necessary to verify the hits—unlike a regex search, which still has to grep through the trigram results to make sure the pattern matches.

An alternative might be to do tokenization such that no characters are lost, but the text is still divided into "words" and other tokens. In such a scenario, text." would probably be indexed as text, ., and ", and a search for text." would not match, say, context.". There are still complications with whitespace, and a more efficient implementation that works on tokens (which is what the underlying search engine, Elasticsearch, is built to do) might still match text . " and text." because both have the three tokens text, ., and " in a row. A more exact implementation would find all documents with text, ., and " in them, and then scan for the exact string text." like the regex matching does, but that would have the same limitations and time outs that the regex matching does.
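A rough sketch of that "lose nothing" style of tokenization in Python (purely illustrative, not the actual analyzer): keep word tokens and keep every other non-space symbol as its own token, so text." becomes three tokens instead of being reduced to text:

import re

def lossless_tokens(s):
    # \w+ grabs word characters; [^\w\s] keeps each remaining
    # non-space symbol as a separate token, so nothing is dropped.
    return re.findall(r"\w+|[^\w\s]", s)

print(lossless_tokens('He said: "some text."'))
# ['He', 'said', ':', '"', 'some', 'text', '.', '"']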

Unfortunately, your use cases are just not well supported by a full-text search engine, and that's what we have to work with. I don't think there's any way to justify the expense of supporting such an index. And even if we did build the required indexes, getting rid of time outs and incomplete results would require significantly more servers dedicated to search.

Even Google doesn't handle the 〃 case (Google: 〃 site:en.wikipedia.org). It drops the 〃 and gives roughly the same results as site:en.wikipedia.org (it actually gives a slightly lower results count—61.3M vs 61.5M—but the top 10 are identical and the top 1 doesn't contain 〃).

Also, note that Google doesn't find every instance of 〆. The first result I get with an insource search on-wiki is Takeminakata, which has 〆 in the references. The Google results seem to be primarily instances of 〆 all by itself, though there are some others. (I'm not sure what the appropriate tokenization of 〆捕 is, for example, so it may get split up into 〆 and 捕; I just don't know.)

I'm having some technical difficulties with my dev environment at the moment, so I can't check, but indexing 〆 by itself might be possible. It depends on whether it is eliminated by the tokenizer or by the normalization step. I think we could possibly prevent the normalization from normalizing tokens to nothing—which would probably apply to some other characters such as diacritics like ¨—but preventing the tokenizer from ignoring punctuation characters would be a different level of complexity. There are also questions of what such a hack would do to indexing speed and index sizes, so even if it is technically feasible, it might not be practically feasible. I'll try to look at it when my dev environment is back online.

197.218.84.247 (talkcontribs)

>It sounds like the answer is yes, so I'll assume that's the case.

In a perfect world, yes.

>An alternative might be to do tokenization such that no characters are lost, but the text is still divided into "words" and other tokens. In such a scenario, text." would probably be indexed as text, ., and ", and a search for text."

Indeed, perfect is the enemy of good. It is acceptable to have a search that will always match full tokens separated by spaces. That's the suggested approach in the thread (https://discuss.elastic.co/t/how-to-index-special-characters-and-search-those-special-characters-in-elasticsearch/42506). It seems quite sensible to do so even for the general search. I mean, it is quite silly that the search engine is unable to search for something as simple as "c++". In such a case, one would expect it to match "c" AND "c++", and prioritize "c++".

There are even more cases. For instance, many people (myself included) like to learn about Egyptian glyphs, and many of these convey meaning by themselves, yet searching for "☥" finds only one page, which is odd for something that can mean life. There are even weirder Egyptian symbols that I have no idea what they are called, and they tend to be hard to describe. Google finds millions across the web; for en.wikipedia it currently finds about 500 (google: "☥" site:en.wikipedia.org). It is a bit unfair to compare it to Google, because Google likely has sophisticated artificial intelligence algorithms that simply translate "☥" to Ankh and also search using that. Interestingly, even Wikidata just drops the "☥".


Anyway, there's no need to call it exact search; maybe it should just be called "tokensearch:" or something along those lines, as long as it skips all the other unnecessary normalization. An alternative would be to enhance regex search to work on the transcluded text (after the HTML is stripped). Unfortunately, the regex alternative is likely going to be even more costly.

197.218.84.247 (talkcontribs)

Sidenote:

A pretty nifty side effect of CirrusSearch's token stripping is that it even beats Google and Bing by showing some sensible results when someone searches for "〆okes". Google and Bing currently find nothing.

Still, it would be more sensible to add a general note informing the user whenever a special character that may be silently dropped is searched for.

TJones (WMF) (talkcontribs)

I'm hoping to think more about this and get to this tomorrow afternoon. I've got a few deadlines that need my attention, plus an opportunity to discuss it with others early tomorrow. Hope to be back in less than 24 hours!

Edit: If you are free in about 18 hours, join us to discuss this. More info on this etherpad: https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours

TJones (WMF) (talkcontribs)

Sorry for the delay getting back to you. This didn't come up in our discussion today, but I was able to get my dev environment working again (lesson learned: never install major OS updates if you want to be able to get any work done).

I was able to test all three of ☥, 〃, and 〆 with the current English-language analysis chain. It's actually the tokenizer that removes them. Long ago this would have surprised me, but I've recently seen problems with other tokenizers, and I think a common tokenizer design pattern is to handle the characters you care about, ignore or break on everything else, and not look too closely at the behavior on "foreign" characters—which causes problems in Wikipedias and Wiktionaries especially, since they are always full of "foreign" characters. Anyway, the standard Elasticsearch tokenizer doesn't seem to care about ☥, 〃, and 〆—it doesn't just drop them, it breaks on them (so x☥y is tokenized as x and y).
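For anyone who wants to reproduce that kind of check, Elasticsearch's _analyze API shows exactly what a tokenizer does with a given string. A minimal sketch, assuming a local Elasticsearch instance at localhost:9200 (the host and the example string are my assumptions, not anything specific to CirrusSearch):

import requests

# Ask the built-in standard tokenizer what it does with "x☥y".
resp = requests.post(
    "http://localhost:9200/_analyze",
    json={"tokenizer": "standard", "text": "x☥y"},
)
print([t["token"] for t in resp.json()["tokens"]])
# Expect something like ['x', 'y']: the tokenizer breaks on ☥ and drops it.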

I set up a whitespace tokenizer–only analyzer, and it lets ☥, 〃, and 〆 pass through fine. However, it would not satisfy your C / C++ case. C++ would be tokenized as C++ and would not match C. And of course, our earlier examples of text, text,, text., and text" would all be indexed separately, as would C++., "C++", "C++, C++", and weird one-off tokens like "第31屆東京國際影展- (which does occur in English Wikipedia).
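Plain whitespace splitting in Python behaves essentially the same way, which is a quick way to see why the results would feel counterintuitive: C++ survives intact, but punctuation stays glued to words (the sample strings here are mine, for illustration only):

samples = ['I like C++ a lot.', 'She wrote "C++", then left.', 'some text." here']
for s in samples:
    print(s.split())
# ['I', 'like', 'C++', 'a', 'lot.']
# ['She', 'wrote', '"C++",', 'then', 'left.']
# ['some', 'text."', 'here']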

So, while it is possible to use a whitespace tokenizer–only analyzer, I think the results would be counterintuitive to a lot of users, and I worry the required index for English Wikipedia would be huge. I'm not familiar with the super low-level implementation details of Elasticsearch, but adding extra occurrences of a token into an index generally uses less space than creating a new token, and there would be a lot of new tokens. We're already pushing the limits of our hardware (and in the middle of re-architecting our search clusters to handle it better).

To summarize: my best guess right now is that the results would disappoint lots of users (who wouldn't expect punctuation on a word to matter, or would want to find punctuation even when attached to a word)—though this is hard to test. I also think the index would be prohibitively large (especially for the number of users who would use such a feature)—the index size part is testable, but non-trivial, so I haven't done it; the number of users is unclear, but most special syntax and keywords are used quite infrequently overall, even if particular users use them very heavily.

I'm sorry to disappoint—I'm always happy when on-wiki search does something better than Google!—but I don't think this is feasible given the likely cost/benefit ratio. Though if you want to open a Phabricator ticket, you can—and pointing back to this talk page would be helpful. I can't promise we'll be able to look at it in any more depth than I already have any time soon, though.

197.218.80.248 (talkcontribs)

Hmm, too late. Hope it was a fruitful discussion...

I do appreciate that it is a complicated problem that will likely not be addressed in the next 6 months, or might simply be deemed unfeasible. One could partially address it by doing what book authors do: create a glossary of 'important' tokens, and whenever search fails, inform the user that "hey, the token you're searching for definitely exists, but search limits mean that it can't be displayed".

197.218.84.1 (talkcontribs)

I replied shortly before your previous reply, so I missed the latest one. Anyway, your assessment seems pretty accurate, so there is probably little benefit to filing a task. Of course, other developers might have different ideas on how it could be implemented, or the Elasticsearch developers might have some tricks up their sleeves to make it feasible. It is still something that would probably only benefit third parties who aren't bogged down by millions of documents.


Personally, I'm a fan of simplicity, so if I were to code it, the emphasis would be on the differences rather than the similarities. While there are millions of documents with similar symbols, some tokens are just rare enough to make it useful. For instance, this discussion is currently probably one of the few places (if not the only one) in Wikimedia projects that actually contains an "x☥y" string. It would also be enough to notify the user that X exists, rather than simply say "nothing was found", and that would in fact be quite trivial, even without Elasticsearch.


To put it into perspective, English Wikipedia users (or bots) spend an extreme amount of time creating redirects for typos, symbols, and many other tokens. They probably learned to do this early on to address the limitations of the search engine. Other wikis aren't so lucky, so search there is probably considerably worse. My guess is that only places like Wiktionary, which by default contains so many synonyms, would fare better. Considering that Wikidata sitelinks also contain a lot of aliases, they might eventually be used to bridge the gap, if the issues of vandalism and potentially completely wrong information could be properly addressed.


Anyway, thank you for your assessment; I certainly don't want to give you unnecessary work for something that is very likely unfeasible. The current regex search certainly addresses most use cases (except for transcluded content).


TJones (WMF) (talkcontribs)

Thanks for the discussion. It's an interesting problem, and some of the stuff we talked about here will definitely go into my future thoughts about evaluating and testing analyzers.

TJones (WMF) (talkcontribs)

I thought about this some more, and came up with the idea of a "rare character" index, which, in English, would ignore at least A-Z, a-z, 0-9, spaces, and most regular punctuation, but would index every instance of other characters. I talked it over with @DCausse (WMF), and he pointed out that it could be not only possible, but would also probably be much more manageable if the indexing was at the document level. (So you could search for documents containing both ☥ and 〆, but you could not specify a phrase like "☥ 〆" or "〆 ☥", or a single "word" like ☥☥ or our old friend x☥y.)
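A back-of-the-envelope sketch of what a document-level "rare character" entry could contain (the set of "boring" characters below is only a guess at what a real implementation would ignore):

import string

# Characters we would NOT index: ASCII letters, digits, whitespace,
# and ordinary punctuation. This set is an assumption for illustration.
BORING = set(string.ascii_letters + string.digits + string.whitespace + string.punctuation)

def rare_chars(text):
    # Document-level entry: just the set of rare characters present,
    # with no positions, so phrases like "☥ 〆" could not be searched.
    return {c for c in text if c not in BORING}

print(rare_chars("Some plain text with x☥y and 〆, but no positions kept."))
# {'☥', '〆'}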

I also think we could test this without a lot of development work by running offline simulations to calculate how big the index would be, and we could even build a test index on our search test servers without writing any real code, by doing a poorly implemented version with existing Elasticsearch features. More details on the phab ticket I've opened to document all those ideas: T211824.

If you have any ideas about specific use cases and how this would or would not help with them, reply here or on Phab!

I can't promise we'll get to this any time soon, but at least it will be on our work board, mocking me, so I feel bad about not getting to it! 😁

197.218.86.137 (talkcontribs)

This seems like a reasonable outcome, and the idea is solid. For the unresolved questions:

> Do we index the raw source of the document, or the version readers see?

The raw source is already available using insource, so my suggestion is that this would only consider the reader's version.

> Do we index just the text of the document, or also the auxiliary text and other transcluded text?

Transcluded content seems like something that is definitely worthwhile. So perhaps all of the above if it is feasible.

> It is possible (even desirable) that some documents would not be in this index because they have nothing but “boring” characters in them.

Certainly desirable.

>I can't promise we'll get to this any time soon, but at least it will be on our work board, mocking me, so I feel bad about not getting to it

That's understandable. This would probably not be something used by the average user, but it would definitely make the search more complete, because it highlights the important difference between a generic search engine like Google and a specialized one used to find encyclopedic / wiki content.


197.218.95.117 (talkcontribs)

Another use case might be counter-vandalism or small fixes. I seem to remember that when using VisualEditor on Linux, pasting something would often produce a "☁", e.g. in this article (https://en.wikipedia.org/w/index.php?title=Rabbit,_Run&oldid=863620024). Of course, emojis in articles are enough of a problem that there is an abuse filter blocking some of these (see Special:Tags, https://meta.wikimedia.org/wiki/Special:AbuseFilter/110).


So it might be a good thing if it can act as a filter, e.g. "char:" would match all pages with special characters, and "-char:" would exclude them. This seems like a general feature that would help with a lot of things; for instance, "-hascategory:" would be the equivalent of Special:UncategorizedPages, "-linksto:" would be Special:DeadendPages, and so forth.

Alternatively a separate keyword could be used if such a syntax seems odd, maybe "-matchkey:char", "matchkey:char", "-matchkey:category".

197.218.95.117 (talkcontribs)

This might make the case for a generic emoji flag, maybe "char:emoji" that would match a smaller set of these things. A couple of funny related tasks:

TJones (WMF) (talkcontribs)

I think -char: would work. -insource: and the like already work, so that shouldn't be a problem. I'm not sure about category searches.

I could see char:emoji being useful, but also really hard to implement. Here's an attempt at a general-purpose emoji regex—it's pretty complicated! I can't find any widely defined Unicode regexes for emoji that are already built into Java or other programming languages. We could possibly look into it, though, if the time comes. I'll add it to the phab ticket. Thanks!
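To give a flavor of why it's complicated, here is a deliberately rough Python sketch that covers only a few of the main emoji blocks; real emoji detection also has to handle ZWJ sequences, skin-tone modifiers, flags, and so on, which this ignores entirely (the block ranges are my assumption, not the regex referred to above):

import re

# A rough, incomplete approximation of "does this string contain an emoji?"
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001F5FF"   # miscellaneous symbols and pictographs
    "\U0001F600-\U0001F64F"   # emoticons
    "\U0001F680-\U0001F6FF"   # transport and map symbols
    "\U00002600-\U000026FF"   # miscellaneous symbols (includes ☁)
    "]"
)

for s in ["plain text", "vandalism 💀💀💀", "pasted artifact ☁"]:
    print(s, bool(EMOJI_RE.search(s)))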

197.218.92.53 (talkcontribs)

> I think -char: would work. -insource: and the like already work, so that shouldn't be a problem. I'm not sure about category searches.

You probably misunderstood. The negative operator does already work, but it doesn't cover the case where someone just wants to use the bare keyword, with no value, to exclude (or find) everything that keyword could match.

For instance, say I want to find all articles that don't contain a link (e.g. like this, where Monkey (slang) is found as a false positive) or any category (e.g. https://en.wikipedia.org/w/index.php?search=monkey+-category%3A). It is downright impossible; regex might get you close, but template transclusions can add extra links, categories, or whatever. In fact, a completely empty page might still have links, as interwiki links can be added by Wikidata.

Similarly, if one wants to search all articles that contain any "rare" character it will be impossible, just as it is right now.

>I could see char:emoji being useful, but also really hard to implement. Here's an attempt at a general purpose emoji regex—that's pretty complicated! I can't find any widely defined Unicode regexes for emoji that are already built into Java or other programming languages.

Perhaps if that isn't feasible, then the syntax of char could be defined with a separator like "|", e.g. char:😁|💃|💀, which would make it possible for users to define longer sequences without the awkward "char:x char:y" syntax.


For Java there seem to be some ideas on how to deal with them: https://stackoverflow.com/a/32872406.

TJones (WMF) (talkcontribs)

That's an interesting negative use of -insource. I'm not familiar with any syntax that allows you to search for a bare keyword or its negation, so I'm not really sure what you want it to mean. (In the search you linked to, it actually just omits articles with forms of the word insource (insourced, insourcing, etc.).)

I have foolishly started 4 more threads on this topic (on 3 village pumps and on Phabricator), but the idea of searching for multiple characters, character ranges, or Unicode blocks has come up elsewhere. There are issues of making the syntax consistent (a separate ongoing project is trying to revamp the search parser), determining whether a multi-character search is an implicit AND or OR, and being careful about search syntax that explodes into many individual searches on the back end, which is something we have to worry about. If we get far enough to actually implement a rare character index, we'll have to come back to the questions of specific syntax and the initial feature set supported.


197.218.92.53 (talkcontribs)

Oh, it was copied incorrectly; the insource was meant to include a string: https://en.wikipedia.org/w/index.php?search=monkey+-insource%3A%2F%5C%5B%5C%5B%2F.

Anyway, the point of the regex above was to find pages containing the word monkey but without links or categories. In practice there are none; in theory that one match occurs because regex doesn't search transcluded content, and there are different ways to create a link. I'm not exactly sure of the correct terminology for those, but to put it into concrete words, or rather pseudocode:

var excludedfilter = "-char:";
var search_results = {"pageswithout_char": [1,2], "pageswith_char": [3,4]};

if excludedfilter == "-char:" then
   var pagesToSearch = search_results["pageswith_char"];
   return search("foo", pagesToSearch);
end

So in essence it discards all pages that contain any rare characters. Based on existing search keywords, the only way to find all pages with rare characters would be to spell them all out. Anyway, apparently there is one search keyword that works like that, "prefer-recent"; compare https://en.wikipedia.org/w/index.php?search=monkey+-prefer-recent%3A&title=Special%3ASearch&profile=advanced&fulltext=1&advancedSearch-current=%7B%22namespaces%22%3A%5B0%5D%7D&ns0=1 vs https://en.wikipedia.org/w/index.php?search=monkey&title=Special%3ASearch&profile=advanced&fulltext=1&advancedSearch-current=%7B%22namespaces%22%3A%5B0%5D%7D&ns0=1 .


While they look quite similar, the order is different, and the help page itself claims that prefer-recent can work without any specific parameters. Nonetheless, it is a strange and error-prone syntax, so it seems more sensible to assign another keyword, or perhaps add a new URL parameter, maybe something like ?excludefilter=char|hascategory&includefilter=hastemplate.


Generally, getting feedback from various places at least (in)validates the idea, and people are more comfy in their own wikis, so a single discussion here would probably not get much feedback even if links were posted.

197.218.92.53 (talkcontribs)

Oops, the pseudocode should be more like:

var excludedfilter = "-char:";
var search_results = {"pageswithout_char": [1,2], "pageswith_char": [3,4]};

if excludedfilter == "-char:" then
   var pagesToSearch = search_results["pageswithout_char"];
   return search("foo", pagesToSearch);
end
TJones (WMF) (talkcontribs)

I get what you are saying now. Is this a theoretical exercise, or do you have a specific use case where finding all pages without any rare characters would be useful? I can't think of any. In the case of a page with no links, you could argue that almost every page should have some links, so those are pages that need improving. Same for categories. But what's the value of finding pages with no rare characters—other than maybe as a conjunct with a more expensive search to limit its scope? (Though, I'm not sure how limiting that would be, so it makes sense to check that out in initial investigation—I'll add it to the phab ticket.)

197.218.92.53 (talkcontribs)

> Is this a theoretical exercise, or do you have a specific use case where finding all pages without any rare characters would be useful?

Well, excluding them is a theoretical exercise. However, including all pages with any rare character ("+char:") is a more useful query, especially if filtered by category. For the original use case of this thread, if one wants to evaluate pages mentioning historical symbols, one way to find a subset of them would be to use something like that.

One could also imagine that regular wiki editors would use such an index to add new symbols to their emoji abuse filter, or even track down (and clean up) vandalism that randomly uses multiple emojis. Cloudy (☁) or other unknown emojis could be identified this way.

Right now the only way to find any of them is to deliberately search for them using regex or analyse the wiki dumps.

TJones (WMF) (talkcontribs)

I've added the editing error and vandalism use cases for emoji search to the Phab ticket.
