Help:CirrusSearch/CompletionSuggester

Please let us know what is and is not working well with the new completion suggester. Bugs can be filed directly in Phabricator, raised on our [mailto:discovery@lists.wikimedia.org mailing list], or reported on IRC (Freenode, #wikimedia-discovery).

__NEWSECTIONLINK__

Additional word terminators needed?
When typing in CirrusSearch the completion search only looks for pages in the main namespace and seems blind to other namespaces. Many communities utilise namespaces for a logical categorisation or behavioural difference, and this seems to be an unnecessary limitation. Similarly the suggester seems to be inhibited by a forward slash in a page name, so searching for CompletionSearcher shows nothing in the type ahead unless you have started your search term with Extension:...

Both cases almost seem not to identify the colon or the forward slash as logical word terminators in a nomenclature construct. — billinghurst  sDrewth  12:58, 21 December 2015 (UTC)
 * You're correct that the completion suggester is limited to the main namespace; this was explicitly noted in the announcement and is for purely technical reasons during the initial rollout of the beta feature. Any fuller rollout of the beta feature would not have this limitation. The reason the completion suggester can't find this page is this very limitation: the page is not in its search index. Otherwise, colons and slashes in page titles seem to work just fine for me, provided that the page is in the main namespace. Can you give me another example of this problem so that I can verify it? Thanks! --Dan Garry, Wikimedia Foundation (talk) 04:19, 22 December 2015 (UTC)


 * I was (typically) searching at enWS for a biographical work set at a subpage, so for this example search there for the word "Bickerton". It shows results for the DNB work, though not The Dictionary of Australasian Biography/Bickerton, Alexander William.  That said, I, subsequently, did a  search for "Australasian" and neither the root page of the work, nor its subpages, show. So how far into pagename does CompletionSuggester look for a match? — billinghurst  sDrewth  04:08, 23 December 2015 (UTC)
 * To note that I did look to find a work with a shorter root pagename, and with something shorter it is able to detect a forward slash. Interesting it is something with partial split searches, eg. "Biography/c" gives some results, though "Biography/Ch" only gives one. Searching for "Legendary" looking for Australian Legendary Tales gives no success, nor the word "Mayamah" looking for Australian Legendary Tales/The Mayamah. Finding it hard to find a short enough book title that may enable looking up subpage names. — billinghurst  sDrewth  06:14, 23 December 2015 (UTC)
 * This new beta feature is still, fundamentally, a completion suggester. It's not intended that it's able to find those pages with the search query you're entering, as the page titles don't begin with the query you entered. Building out something to do that is significantly more complex, which is why we've started with what we've got here. --Dan Garry, Wikimedia Foundation (talk) 00:37, 24 December 2015 (UTC)
 * As Deskana stated, we can't really do any kind of word termination with the completion suggester; at its base it is still a prefix search, the same as it is without the beta feature enabled. This new algorithm has the added benefit of allowing fuzzy search results (typos) along with more programmatic control over the result sorting. Results that show up that are not fuzzy prefixes come from user-generated redirects that do match the prefix. We can do some analysis of memory usage when indexing both Australian Legendary Tales/The Mayamah and The Mayamah, but I'm nervous: we are already using over 100G of Java heap (>13% of the total memory available to Elasticsearch in our codfw cluster) to power the existing completion beta feature, without yet indexing the other namespaces and without splitting on subpages. We can look into it, but I'm doubtful we have the hardware necessary to support this use case.
 * The limitations are necessary due to the sheer number of search as you type queries we serve. More advanced usage is supported with the full text search (press enter after typing query) because we only have to run the query once, instead of once for each character the user types (in the worst case). Search as you type has to run incredibly quickly to support several thousand queries per second against a couple dozen servers. Note that another required performance limitation restricts the searches to 50 characters. Our analysis of existing query patterns shows prefix search above that length makes up a fraction of a fraction of search traffic. EBernhardson (WMF) (talk) 19:54, 28 December 2015 (UTC)
 * Thanks for the comments. Something that you might be able to consider is a simpler new/distinct typeahead, based on the forward slash as the terminator, even without all the fuzzy searching. As I expressed in a phabricator ticket (and in a post to the mailing list), the Wikisources make high use of subpages. If you consider compilation works (biographies, poetry, etc.), the title of the parent work is less important than the sub-component for this type of work. Re your commentary about the namespaces, please be aware of the very Wikipedia-centric focus of such a comment. While the wikipedias have their content in the main namespace, the sister wikis are quite different in their utilisation of content namespaces, e.g. a number of the Wikisources utilise an Author: ns. So maybe that concern about broadening namespace inclusions can be focused on content namespaces; that would not broaden the WPs, though it would suit the sister wikis. Actually having scope around how the sister wikis are different and what their needs are would be useful to explore. — billinghurst  sDrewth  11:46, 29 December 2015 (UTC)
 * With the completion suggester we tried to keep the same behaviors regarding namespaces; that's why we excluded everything that involves writing a namespace prefix. On wikisource with the default algorithm you have to type Author: in order to switch to this content namespace. I'd like to find a solution to address your comments: all content namespaces (no need to type Author:), subpages, but this would be a breaking change. Leonardo da Vinci will suggest Author:Leonardo da Vinci on wikisource. Another problem will be to make sure that we correctly sort the suggestions in case of collisions/ambiguities between namespaces and/or subpages, and as EBernhardson said the solution will have to be very performant. DCausse (WMF) (talk) 10:41, 31 December 2015 (UTC)
 * Perfectly understood, and I am hoping that I am relaying intimate experience of the Wikisource community. We believe that many people don't understand namespaces, or at least not in depth, so they come to our site and type a name into the search box, desiring a result. Often they will want both what is in our main namespace (printed biographical works) and what is in the Author: ns (compiled bibliographies and linkages); we know that there can be multiple hits for the same person, and we do not know which they desire. So when someone is typing Smith, ..., presenting a biography from the Dictionary of National Biography, the Encyclopaedia Britannica (9th or 11th ed.), ..., or a component from the Alumni Oxonienses, based on the subpage, is one part of what is desired, rather than presenting a short form of the book title that takes up all the visible characters without giving them anything of purpose. We know that they can still hit the search button and come back with results, so it is not about presenting perfection; it is about the usefulness of the typeahead. I understand that there are limitations, though I don't fully grasp the complexities you face. I believe that I do understand the usefulness of a functioning typeahead for the WS communities and where we would like it to be. The community's reflection is that developments often halt once they work for the WPs; sometimes that is due to the initial focus, and sometimes due to the sister communities not being suitably descriptive or persistent. I am trying to ensure that we are doing enough from our side. — billinghurst  sDrewth  13:50, 31 December 2015 (UTC)
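
To make the prefix limitation discussed in this thread concrete, here is a minimal Python sketch. It is not the CirrusSearch implementation: the function names and the tiny in-memory title list are hypothetical, and only the 50-character cap and the "Australian Legendary Tales" titles come from the discussion above. It shows why a title-prefix match cannot find a subpage by its last component, and how indexing every slash-separated suffix would enlarge the index.

```python
MAX_QUERY_LENGTH = 50  # the 50-character cap mentioned in this thread


def complete(query, entries, limit=10):
    """Return entries that start with the query (case-insensitive prefix match)."""
    q = query[:MAX_QUERY_LENGTH].casefold()
    return [e for e in entries if e.casefold().startswith(q)][:limit]


def subpage_suffixes(title):
    """Yield the full title, then every suffix that starts after a forward slash."""
    parts = title.split("/")
    for i in range(len(parts)):
        yield "/".join(parts[i:])


def build_suffix_index(titles):
    """Map every indexed entry (full title or subpage suffix) to its page title."""
    index = {}
    for title in titles:
        for entry in subpage_suffixes(title):
            index.setdefault(entry, title)
    return index


titles = [
    "Australian Legendary Tales",
    "Australian Legendary Tales/The Mayamah",
]

# Plain prefix completion: the subpage name alone finds nothing.
print(complete("The May", titles))

# With suffix entries, "The May" resolves to the subpage, at the cost of a larger index.
index = build_suffix_index(titles)
print([index[e] for e in complete("The May", index)])
print(len(titles), "titles ->", len(index), "index entries")
```

Even with suffix entries, a query such as "Mayamah" would still miss "The Mayamah", because the match remains a prefix match on each entry.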

Different order
I think it would be a good idea to recognize page titles typed in a different order; for example, if I'm searching for a name, I could write the parts of the name in another order. PS: maybe searching the aliases in Wikidata could be a good idea. --Martinligabue (talk) 15:18, 21 December 2015 (UTC) PPS: can you reply to me on it.wiki?

@Whoever yes, I think that there should be an algorithm that can still determine search results even if the order of the words is mixed up. Let's say that I am searching for Attack on Titan: it would be really useful for those days when a person cannot remember the full title and can only remember something like "anime, titans", and the algorithm would show Attack on Titan! WHOKNEWABOUTTHAT? (talk) 20:22, 27 December 2015 (UTC)

Fundamentally, the search as you type serves too many queries to perform very advanced searches, such as answering `anime, titans` or answering with the words out of order. We serve over a hundred million search as you type queries in a day, and our servers would melt if we asked them to perform much more than a search against the title itself. We do have an algorithm that does a pretty decent job at answering these queries, though: the full text search is designed specifically to answer them. Searching for anime, titans on enwiki brings up Attack on Titan as the eighth result, which seems fairly reasonable. Note that even Google only does autocompletion from a prefix (in their case to other searches, rather than titles) for search as you type, and not a full-on search query. EBernhardson (WMF) (talk) 19:29, 28 December 2015 (UTC)

language selection
It would be useful to show search suggestions from other languages (if available) in a parallel drop-down menu. This would allow the user to access articles from other languages that happen to be missing in the current language, and would certainly improve the overall language consistency of Wikipedia.

Spanish Vidente Eyda Peña (talk) 05:53, 18 January 2016 (UTC)

currently broken on hewiki
currently this gadget is broken (i.e., it does not offer anything) on hewiki.

may be a good idea to allow "fallback" - if suggestion list is empty, try to get one from the "other" (strict?) completion source, instead of showing no suggestions. either way, please find out what's the problem with hewiki search. peace - קיפודנחש (talk) 16:41, 17 February 2016 (UTC)
 * That's not good קיפודנחש. I was able to reproduce the bug as you describe, created a task and notified the developers. CKoerner (WMF) (talk) 18:56, 17 February 2016 (UTC)
 * Hello again. I wanted to let you know this has been fixed. You may have already seen the notice on hewiki, but I just wanted to follow up here. Thank you again for bringing it to our attention. CKoerner (WMF) (talk) 16:13, 18 February 2016 (UTC)
 * thanks. note that i also created a task ( T127201 ), and even managed to beat you to it, but it was closed as "duplicate"... thanks again, peace - קיפודנחש (talk) 19:47, 19 February 2016 (UTC)

Unexpected behaviour in Greek
In the Greek language, we use a stress mark (tonos) on vowels to show which syllable is stressed. The normal search of Wikipedia returns articles that differ from the entered string only on stress, which is the expected behaviour. For example, if you enter ανεμος the first search result is άνεμος. However, the CompletionSuggester does not suggest άνεμος or any other word that starts with ά. It does, however, return ανεμοδαρμένος, ανεμούριο, ανεμοβλογιά, ανεμοστρόβιλος and several other words that start with an alpha without a stress mark. Rentzepopoulos (talk) 11:22, 8 March 2016 (UTC)
 * Thank you for the report Rentzepopoulos. I've created a task to figure out what is going on. T129502 CKoerner (WMF) (talk) 16:44, 10 March 2016 (UTC)
 * Thanks for the report, I'm looking into possible solutions to fix this problem. Could you confirm that the problem existed before the completion suggester was introduced? If it's not the case I must have missed something and I'll have to check more carefully what's happening for this kind of suggestions. DCausse (WMF) (talk) 11:02, 14 March 2016 (UTC)
 * We think we've fixed this issue now. The change should go live before Friday. Can you try again on Friday, and tell us if it works? I'll try to check myself too, but since you're the one who reported the issue, you'll know better than me whether our fix worked. Thanks. :-) --Dan Garry, Wikimedia Foundation (talk) 02:56, 15 March 2016 (UTC)
 * I was away for a few days, so I am responding just now. I checked the CompletionSuggester after a message in Greek Agora page inviting for beta testing the feature. This was on 8 March, presumably before CompletionSuggester was rolled out as a non-beta feature. Regarding the fix, I assume that you mean Friday 18 March. I keep a note to myself to check this then. By the way, is there a way to identify the current version of CompletionSuggester so that I can report together with the corresponding version numbers? Rentzepopoulos (talk) 08:59, 15 March 2016 (UTC)
 * The best thing to do for something like a version number for the completion suggester is go to Special:Version on your wiki and look at the release date for the CirrusSearch extension. Thanks for checking the fix on Friday, I appreciate it! --Dan Garry, Wikimedia Foundation (talk) 17:02, 15 March 2016 (UTC)
 * The problem is partially fixed, for example searching for ανθρακας will properly suggest άνθρακας meaning that the diacritics and stress marks are now properly handled, unfortunately I encounter another issue with the scoring method, ανεμο appears to be a very common prefix resulting in the problem that άνεμος is not properly ranked and does not appear in the suggestions. I will adjust the scoring method accordingly. Sorry for the inconvenience, I'll come back to you once the problem is fixed. DCausse (WMF) (talk) 02:21, 18 March 2016 (UTC)
 * This is exactly my finding as well. Ideally, stressed/unstressed and small/capital should be treated as the same letter (α, ά, Α, Ά) so that they do not receive different score. Rentzepopoulos (talk) 07:39, 18 March 2016 (UTC)
 * The problem seems to be resolved now. Note that elwiki is the first wiki where we use this functionality, and it's possible that we introduced some other undesirable behaviors; please let us know if you find any. Note that it should also support other (ancient?) stress marks like ἄγος, which could be more useful on el.wiktionary.org. If you think this feature is valuable for Greek, we may want to gradually enable it on other Greek wikis. Thank you for your help! DCausse (WMF) (talk) 07:23, 1 April 2016 (UTC)
 * It is certainly better than before; however, I would like to note the following: I tried searching for "αγγλος" (simplified unstressed version for "Englishman") and although there exists an article "Άγγλος", this does not come up first; instead, the word "αγγλοσάξονες" (anglo-saxons) appears first.
 * It appears that "σ" is not considered equal to "ς", which is exactly the same letter spelt differently only when it appears at the end of the word. Although I don't know the internals of the search algorithm used, I think that the rule should be as follows:
 * Convert all accented forms of letters into a "base" representative letter (so that ά, Ά, α, Α etc. become "α"; similarly, "ς" is mapped to "σ");
 * Perform the search using the "base" representation;
 * Rank normally, not taking into account the original form of the letter ("ά" and "α" are completely equal);
 * In case you identify two results that have the same "base" representation (and obviously the same ranking) put first the one that closer matches the original form, primarily with respect to accents, and secondarily with respect to case (capital/small).
 * I hope the above is clear enough. Take care, Rentzepopoulos (talk) 14:01, 1 April 2016 (UTC)
 * Thank you, it's perfectly clear. You're right, the system performs badly with "short words" that share a popular common prefix. Usually we rely on a specific lookup on the database to "re-rank" exact matches at the top, but unfortunately the database (unlike the index) does not support diacritics removal. In this case the suggestions are ranked solely on the page metadata (size, incoming links, pageviews...). The problem is that I don't have much room to include extra logic (for performance reasons); I'll have to think about it more thoroughly because I don't see any obvious solutions... Thank you! DCausse (WMF) (talk) 14:57, 1 April 2016 (UTC)
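
The folding rule proposed in this thread (map accented and capital forms to one base letter, map "ς" to "σ", match on the base form, then break ties by closeness to what was typed) can be sketched in Python as follows. This is only an illustration under stated assumptions, not how CirrusSearch or Elasticsearch does it: the function names are hypothetical, and the numeric scores stand in for the page metadata ranking mentioned above.

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks (tonos, breathings, etc.) but keep letter case."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fold(text):
    """Reduce text to its 'base' form: no accents, no case, final sigma as sigma."""
    return strip_accents(text).casefold().replace("ς", "σ")


def suggest(query, scored_titles, limit=10):
    """Prefix-match on the base form; rank by score, then accents, then case."""
    base = fold(query)
    matches = [(t, s) for t, s in scored_titles if fold(t).startswith(base)]
    typed = unicodedata.normalize("NFC", query)

    def sort_key(item):
        title, score = item
        nfc_title = unicodedata.normalize("NFC", title)
        accent_mismatch = nfc_title.casefold() != typed.casefold()
        case_mismatch = strip_accents(title) != strip_accents(query)
        # Higher score first; among equal scores, accents count before case.
        return (-score, accent_mismatch, case_mismatch)

    return [title for title, _ in sorted(matches, key=sort_key)][:limit]


# Hypothetical scores, all equal so that only the tie-break decides the order.
candidates = [("Άνεμος", 1.0), ("ανεμος", 1.0), ("άνεμος", 1.0), ("ανεμοδαρμένος", 1.0)]
print(suggest("άνεμος", candidates))
print(fold("Άγγλος"), fold("αγγλοσάξονες").startswith(fold("αγγλος")))
```

Because the tie-break compares the whole title with the whole query, it only separates candidates that share the same base form, which is the case described in the last step of the proposal.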

Does not find internamespace redirects
On huwiki, we have a shortcut “WP” for Wikipedia namespace (e.g. hu:WP:WP for the list of the shortcuts). However, it’s not registered in MediaWiki, so these pages are technically articles. When first I didn’t see such a page in the suggestion list, I thought it was a temporary error, but now I think it’s the problem of CompletionSuggester. Any chance to fix it? --Tacsipacsi (talk) 17:35, 8 March 2016 (UTC)
 * Tacsipacsi, are you seeing this behavior with the completion suggester beta feature or the current, non-beta search? To make sure I understand you correctly, the "WP:" shortcuts are in the main article namespace, but each are redirects to pages in the "Wikipédia:" namespace. Is that right? CKoerner (WMF) (talk) 16:50, 10 March 2016 (UTC)
 * I use the beta feature, but I have now temporarily turned it off and it wasn’t any better. The previous behaviour was that it displayed nothing for WP:W, but it suggested the redirect page when I typed the last “P”. And yes, almost all of the 631 pages redirect to the Wikipedia namespace; the rest are redirects to other namespaces (e.g. hu:WP:VÉDETT redirects to a category). Some of them redirect to a section of a project page. (If you want to add it as a namespace alias, you should get local consensus. I think it won’t be hard, but there have been too many changes lately without enough notice, so some users are really annoyed.) --Tacsipacsi (talk) 18:06, 10 March 2016 (UTC)
 * Since the problem persists when the completion suggester beta feature is turned off, this is probably not directly related to the completion suggester, but it may be related to some other changes that we've made recently. I've filed T129545 to investigate this. Thanks for the report. --Dan Garry, Wikimedia Foundation (talk) 19:54, 10 March 2016 (UTC)
 * We think we've fixed this issue now. The change should go live before Friday. Can you try again on Friday, and tell us if it works? I'll try to check myself too, but since you're the one who reported the issue, you'll know better than me whether our fix worked. Thanks. :-) --Dan Garry, Wikimedia Foundation (talk) 02:56, 15 March 2016 (UTC)
 * It works; moreover, it shows the results for me after typing the first letter. --Tacsipacsi (talk) 12:26, 18 March 2016 (UTC)

Exact search useful
Hello, I hope it will be possible to deactivate the function, as an exact search is very useful for maintenance (spelling, for instance) and to find information on very specific subjects. --La femme de menage (talk) 10:27, 15 March 2016 (UTC)
 * Hi La femme de menage, I'm not very familiar with this area (orthography) and would appreciate some examples to help me better understand how the completion suggester impacts your work. I can help relay your concerns to the team with better accuracy if I have examples and, for lack of a better word, evidence. :) CKoerner (WMF) (talk) 14:46, 17 March 2016 (UTC)


 * Why was this new feature introduced with no possibility to turn it off? Exact search was useful indeed. How can we get it back? This should have been called 'CircusSearch' rather than 'CirrusSearch'. --Obsuser (talk) 06:46, 17 March 2016 (UTC)
 * Obsuser, comparing the work of the team to a 'circus' is not helpful. Where I come from it comes across as rude. I'll ask you the same question as I did La femme de menage, can you please help us better understand the use of search in your work and how this impacts what you do? CKoerner (WMF) (talk) 14:46, 17 March 2016 (UTC)
 * Sorry, I do not want to belittle anyone’s work. It was more like a joke, obviously a bad one...
 * But really, was it hard to keep it beta or add the possibility to turn it off? Many users need accurate search, and the reasons are simple: you see whether a page exists or not (especially when you want to transcribe a name, such as in Serbian; you get some other name as a search result, thinking it's what you wanted, but it's not [e.g. names that differ by a two-letter swap]); you get "false hope" that what you searched for exists when it does not; search results are mixed (e.g. if you search "Abcde, " you get an unexpected mix of results: you should first get all instances followed by comma and space but you don’t; I don’t know if this was the case before CirrusSearch) etc.
 * Even if you wanted to introduce something new, a nice idea would be to add a slider in Preferences to control search sensitivity (left would be the lowest sensitivity), and programmers would have the task of tying some factors to the slider percentage, i.e. inclusion of three-letter mismatch, two-letter mismatch, capital/non-capital, ’ or ' etc.
 * The worst thing is that there are really many other still-unsolved problems, proposed on the .en Village pump, phab etc. This did not make any crucial improvement; it just improved readers' experience a bit, but editors (at least I) are negatively affected when creating new pages or trying to search for existing ones.
 * Besides the new search, the new talk page is also a failure. Wikipedia has maintained its really beautiful and unique style for over 15 years, and there’s no need to introduce new, unneeded "revolutions"; rather, try improving the existing software not by changing or replacing it but simply by resolving bugs etc.


 * My suggestion is to mark all those results shown thanks to CirrusSearch (in one color, of course; maybe red but that would be pretty bad looking; or to make them very light grey...) so the user knows if a displayed search result is not a full match.


 * Maybe this is not a right place but why are numbered lists twice as much outdented when compared to...
 * ... bulleted lists? --Obsuser (talk) 16:05, 17 March 2016 (UTC)
 * thanks for your comments. It's not true to say that the previous system provided exact search. The previous system removed all diacritics and some punctuation characters when searching, so using the search suggestions as a tool to detect and fix typos or to find exact page titles is/was maybe not appropriate? Would Special:PrefixIndex be a good candidate for this kind of usage? On the other hand I like the idea you suggested to provide a slider in the user Preferences. The completion suggester supports multiple profiles with many knobs to tune, unfortunately these knobs are not configurable by the user. At a glance we could provide the following profiles:
 * Strict (some punctuation removed but all diacritics kept intact)
 * Normal (same as strict but with stopwords and diacritics removed)
 * Fuzzy-1 (1 typo allowed)
 * Fuzzy-2 (2 typos)
 * Fallback to the previous system (which is between Strict and normal)
 * Note that this is what is supported by the backend, and I have no idea if it's easy/worthwhile to add a new Preference tab just for the Completion suggester. DCausse (WMF) (talk) 00:32, 18 March 2016 (UTC)
 * Hi CKoerner (WMF). Here is an example. Let's consider that in French, "Armand Colin" is the correct spelling, and that "Armand Collin" is a mistake. When I wish to correct it, if I search for "de Armand Colin" ("de" being there just to avoid the redirect fr:Armand Collin; it could be "x" or anything else), here are the results: 6690 results. Please note the high number of false positives. So the exact spelling is more convenient: 178 results. I also use this "exact" search before creating an article or adding content, to find out whether a topic has already been covered in some articles. Hope this helps. --La femme de menage (talk) 08:58, 23 March 2016 (UTC)
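
As a rough illustration of the difference between the strict and fuzzy profiles listed in this thread, here is a minimal Python sketch. It models only the typo budget (0, 1 or 2 edits); the punctuation, stopword and diacritic handling of the real profiles is left out, the profile keys and function names are hypothetical, and plain Levenshtein distance is an assumption rather than a statement of what the backend uses.

```python
# Hypothetical profile names mapped to the number of typos they tolerate.
PROFILES = {"strict": 0, "fuzzy-1": 1, "fuzzy-2": 2}


def prefix_edit_distance(query, title):
    """Smallest Levenshtein distance between the query and any prefix of the title."""
    # previous[j] holds the distance between the query so far and title[:j].
    previous = list(range(len(title) + 1))
    for i, query_char in enumerate(query, start=1):
        current = [i]
        for j, title_char in enumerate(title, start=1):
            cost = 0 if query_char == title_char else 1
            current.append(min(
                previous[j] + 1,         # drop a character from the query
                current[j - 1] + 1,      # insert a character into the query
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return min(previous)


def suggest_with_profile(query, titles, profile):
    """Keep the titles whose best prefix is within the profile's typo budget."""
    budget = PROFILES[profile]
    return [t for t in titles if prefix_edit_distance(query, t) <= budget]


titles = ["Armand Colin", "Armand Collin"]
print(suggest_with_profile("Armand Collin", titles, "strict"))
print(suggest_with_profile("Armand Collin", titles, "fuzzy-1"))
```

With a budget of zero, only the title spelled exactly as typed survives, which is the behaviour wanted for the spelling-maintenance use case above. Under plain Levenshtein distance a swap of two adjacent letters counts as two edits, so it would only match under the two-typo profile in this sketch.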

Disable
Where can I disable this feature? – Be..anyone 💩  05:31, 14 April 2016 (UTC)
 * You can now disable the feature in the "Search" tab of your user preferences; doing so will significantly degrade your search experience. --Dan Garry, Wikimedia Foundation (talk) 21:48, 3 November 2016 (UTC)
 * Whether a strict search will "degrade the search experience" or not, is a highly subjective judgement. For me, a good "search experience" is when I get information on exactly what I am looking for, and not on anything that might just be written in a similar way. If there is none, then so be it. There are plenty of legitimate search terms that get shifted into second place by suggestions for more common, similar words. --Schlosser67 (talk) 10:12, 16 December 2016 (UTC)

IndexCreationException failed to create index
On our MediaWiki 1.27.1 system, I am getting errors running updateSuggesterIndex.php for the first time. Any suggestions?

$ php updateSuggesterIndex.php
Scanning available plugins... kopf
Picking analyzer...english
Fetching Elasticsearch version...1.7.4...ok
Inferring index identifier...wikidb-vpw__titlesuggest_first
Index does not exist yet cannot recycle.
Inferring index identifier...wikidb-vpw__titlesuggest_first
Setting index identifier...wikidb-vpw__titlesuggest_1479240582
2016-11-15 15:09:42 Unexpected Elasticsearch failure.
Elasticsearch failed in an unexpected way. This is always a bug in CirrusSearch.
Error type: Elastica\Exception\ResponseException
Message: IndexCreationException[[wikidb-vpw__titlesuggest_1479240582] failed to create index]; nested: ElasticsearchIllegalArgumentException[failed to find token filter type [icu_normalizer] for [icu_normalizer]]; nested: NoClassSettingsException[Failed to load class setting [type] with value [icu_normalizer]]; nested: ClassNotFoundException[org.elasticsearch.index.analysis.icunormalizer.IcuNormalizerTokenFilterFactory];
Trace:
#0 /home/wiki/wiki/wiki/vendor/ruflin/elastica/lib/Elastica/Request.php(171): Elastica\Transport\Http->exec(Object(Elastica\Request), Array)
#1 /home/wiki/wiki/wiki/vendor/ruflin/elastica/lib/Elastica/Client.php(621): Elastica\Request->send
#2 /home/wiki/wiki/wiki/vendor/ruflin/elastica/lib/Elastica/Index.php(496): Elastica\Client->request('wikidb-vpw__tit...', 'PUT', Array, Array)
#3 /home/wiki/wiki/wiki/extensions/CirrusSearch/maintenance/updateSuggesterIndex.php(706): Elastica\Index->request('wikidb-vpw__tit...', 'PUT', Array, Array)
#4 /home/wiki/wiki/wiki/extensions/CirrusSearch/maintenance/updateSuggesterIndex.php(301): CirrusSearch\Maintenance\UpdateSuggesterIndex->createIndex
#5 /home/wiki/wiki/wiki/extensions/CirrusSearch/maintenance/updateSuggesterIndex.php(231): CirrusSearch\Maintenance\UpdateSuggesterIndex->rebuild
#6 /home/wiki/wiki/wiki/maintenance/doMaintenance.php(103): CirrusSearch\Maintenance\UpdateSuggesterIndex->execute
#7 /home/wiki/wiki/wiki/extensions/CirrusSearch/maintenance/updateSuggesterIndex.php(809): require_once('/home/wiki/wiki...')
#8 {main}

I already tried "composer update" but it didn't help. Thank you for any advice. --Maiden taiwan (talk) 20:18, 15 November 2016 (UTC)


 * Filed as a bug: https://phabricator.wikimedia.org/T150799. --Maiden taiwan (talk) 20:56, 15 November 2016 (UTC)
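
For anyone triaging a similar failure, the nested exception names the token filter that Elasticsearch could not load. The short Python sketch below pulls that name out of the message; the helper name and the regular expression are illustrative only. As far as I know, `icu_normalizer` is provided by the Elasticsearch ICU analysis plugin, and the plugin scan in the log above lists only kopf, so a missing plugin is a plausible cause. That diagnosis is an assumption and is not confirmed in this thread.

```python
import re

# Matches the wording of the nested ElasticsearchIllegalArgumentException above.
MISSING_FILTER = re.compile(r"failed to find token filter type \[([^\]]+)\]")


def missing_token_filters(error_message):
    """Return the sorted, de-duplicated token filter names reported as missing."""
    return sorted(set(MISSING_FILTER.findall(error_message)))


message = (
    "IndexCreationException[[wikidb-vpw__titlesuggest_1479240582] failed to create index]; "
    "nested: ElasticsearchIllegalArgumentException[failed to find token filter type "
    "[icu_normalizer] for [icu_normalizer]]; nested: NoClassSettingsException[Failed to "
    "load class setting [type] with value [icu_normalizer]];"
)
print(missing_token_filters(message))
```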