Search issues in MediaWiki 1.11

Avi Rappoport wrote up some scathing complaints about MediaWiki's default search system. Most of the complaints don't take into account that Wikipedia doesn't use the MySQL search backend, which explains some of the "mysterious" differences in behavior. :)

Most of the suckage on a default install is MySQL's fault, but not all of it. Either way, it doesn't matter to users whose fault it is -- they just want it fixed!

MySQL's length limit
Resolved


 * Words under 4 characters don't return results.
 * Old solution: force people to customize MySQL's configuration.
 * Current solution as of 1.14: pad short words in our index so they're longer. :P
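
A minimal Python sketch of the padding idea, not MediaWiki's actual code. The suffix u800 and the helper name pad_short_words are placeholders picked for illustration. The one thing that matters is that the same normalization runs both when text is indexed and when the query is built, so padded words still match each other.

```python
import re

MIN_WORD_LEN = 4      # MySQL's default ft_min_word_len for MyISAM full-text indexes
PAD_SUFFIX = "u800"   # arbitrary placeholder suffix (assumption for this sketch)

# Matches whole words that are shorter than the minimum indexed length.
_SHORT_WORD = re.compile(r"\b(\w{1,%d})\b" % (MIN_WORD_LEN - 1))


def pad_short_words(text):
    """Lowercase the text and lengthen every too-short word with PAD_SUFFIX."""
    return _SHORT_WORD.sub(lambda m: m.group(1) + PAD_SUFFIX, text.lower())


if __name__ == "__main__":
    # Apply the same function to page text at index time and to user input at query time.
    print(pad_short_words("The cat sat on a mat today"))
```

Words that already meet the minimum length pass through untouched, so only the short ones get bigger in the index.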

MySQL's stopwords

 * Words on the stopword list don't return results.
 * Current solution: force people to buy a VPS in order to customize MySQL's configuration.
 * Cleaner solution: pad words we know are in MySQL's default stopword list so they get indexed. :P
   * Kinda hacky, but should work.
   * A working patch was submitted to 352, but it hasn't been committed because of performance degradation.
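
Padding stopwords could look roughly like this in Python. STOPWORDS below is a tiny illustrative subset, not MySQL's full default list, and the xstop marker and the pad_stopwords name are made up for this example.

```python
import re

# Tiny illustrative subset of MySQL's default MyISAM stopword list (assumption:
# a real implementation would carry the complete list).
STOPWORDS = frozenset({"the", "and", "about", "after"})
MARKER = "xstop"  # hypothetical suffix that turns a stopword into a non-stopword

_WORD = re.compile(r"\w+")


def pad_stopwords(text, stopwords=STOPWORDS, marker=MARKER):
    """Lowercase the text and append the marker to every known stopword."""
    def fix(match):
        word = match.group(0)
        return word + marker if word in stopwords else word
    return _WORD.sub(fix, text.lower())


if __name__ == "__main__":
    # Must be applied to both the indexed text and the search query.
    print(pad_stopwords("The quick fox and the hound"))
```

Every word goes through a set lookup, which hints at where the extra cost of this approach comes from on large pages.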

"extremely limited search syntax"

 * "can't exclude terms from search results"
   * Fixed in 1.12: use -foo
 * "No way to search for one or more among several terms"
   * (boolean OR) This indeed probably doesn't currently work (but is supported in Lucene).
   * Needs fancier syntax conversion to build an appropriate query for the MySQL backend.
 * "Does not search for plural or other versions of words (stemming)"
   * True with the current MySQL backend.
   * We would have to do this manually, which is non-trivial.
 * "No truncation or wildcards"
   * Fixed in 1.12: use foo*
 * "No search variable for automatically repeating the search with wildcards in case of too little or no results"
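
As a rough idea of what the "fancier syntax conversion" could look like, here is a Python sketch that turns user input with -foo, foo* and OR into a MySQL boolean-mode expression. In boolean mode a leading + makes a term required, a leading - excludes it, and a term with no operator is optional, so an OR group becomes a required parenthesized group of optional terms. The name to_boolean_mode and the simplifications (no phrases, no nesting) are assumptions for illustration, not how MediaWiki does it.

```python
import re

# An optional leading "-", a word, and an optional trailing "*".
_TERM = re.compile(r"^(-?)(\w+)(\*?)$")


def to_boolean_mode(query):
    """Convert 'a OR b -c d*' style input into a MySQL boolean-mode string."""
    groups = []        # each entry is a list of alternatives joined by OR
    excluded = []      # terms prefixed with "-"
    join_next = False  # True right after an OR keyword

    for token in query.split():
        if token == "OR":
            join_next = bool(groups)  # an OR with nothing before it is ignored
            continue
        match = _TERM.match(token)
        if not match:
            join_next = False         # drop tokens this sketch cannot handle
            continue
        negated, word, star = match.groups()
        term = word.lower() + star
        if negated:
            excluded.append(term)
        elif join_next:
            groups[-1].append(term)
        else:
            groups.append([term])
        join_next = False

    parts = []
    for group in groups:
        if len(group) == 1:
            parts.append("+" + group[0])
        else:
            parts.append("+(" + " ".join(group) + ")")
    parts.extend("-" + term for term in excluded)
    return " ".join(parts)


if __name__ == "__main__":
    print(to_boolean_mode("cat OR dog -bird fish*"))
```

The result would be passed as the argument of AGAINST (... IN BOOLEAN MODE). Anything the regular expression doesn't recognize is silently dropped, which a real implementation would want to handle more carefully.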

"Misleading Results Information"

 * "How many actual matches does your search find? It's a mystery..."
   * Full count of matches not currently shown.
   * Possible solution: can we do a SELECT COUNT(*) or something? Is that even vaguely efficient?
   * Resolved in 1.16 with $wgCountTotalSearchHits on the MySQL and SQLite backends. Still disabled by default, though.
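
The COUNT(*) idea amounts to running the same match condition twice: once to count, once to fetch a page of results. Here is a small sketch using Python's sqlite3 module. The pages table, its columns, and the LIKE condition are stand-ins chosen so the example runs anywhere; they are not MediaWiki's schema or its full-text query.

```python
import sqlite3


def search_with_total(conn, term, limit=20, offset=0):
    """Return (total number of matches, titles for the requested page)."""
    # Note: % and _ inside `term` act as LIKE wildcards in this simple sketch.
    pattern = "%" + term + "%"
    total = conn.execute(
        "SELECT COUNT(*) FROM pages WHERE body LIKE ?", (pattern,)
    ).fetchone()[0]
    rows = conn.execute(
        "SELECT title FROM pages WHERE body LIKE ? "
        "ORDER BY title LIMIT ? OFFSET ?",
        (pattern, limit, offset),
    ).fetchall()
    return total, [row[0] for row in rows]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pages (title TEXT, body TEXT)")
    conn.executemany(
        "INSERT INTO pages VALUES (?, ?)",
        [("A", "wiki search"), ("B", "nothing here"), ("C", "search again")],
    )
    print(search_with_total(conn, "search", limit=1))
```

The count query has to visit every matching row, which is the efficiency worry raised above and a plausible reason to keep such a feature optional.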

"Search Results Show Markup Junk"

 * Guilty as charged.
 * Solution: run markup stripping? How do we best handle this efficiently? What about matches on link targets that don't appear in the literal text?
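
To make the markup-stripping idea concrete, here is a deliberately naive Python sketch using a few regular expressions. It only handles simple templates, plain internal links, bold/italic quotes and HTML-style tags; the function name strip_wiki_markup is invented for this example, and real wikitext (nested templates, tables, image links with several pipes) needs a lot more than this.

```python
import re


def strip_wiki_markup(text):
    """Reduce a wikitext snippet to roughly the text a reader would see."""
    # Drop simple, non-nested templates such as {{stub}}.
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # [[target|label]] -> label, [[target]] -> target.
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]", r"\1", text)
    # Remove bold/italic quote runs ('' to ''''').
    text = re.sub(r"'{2,5}", "", text)
    # Remove HTML-style tags such as <ref> or <br />, keeping their contents.
    text = re.sub(r"<[^>]+>", "", text)
    # Collapse the whitespace left behind.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    sample = "'''MediaWiki''' is a [[wiki software|wiki engine]] {{stub}} written in [[PHP]]."
    print(strip_wiki_markup(sample))
```

Note that the piped link keeps only its label, so a search hit on the link target ("wiki software" in the sample) would have nothing left to highlight in the snippet. That is exactly the open question raised above.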

"Match Words Highlighting Is Wrong"

 * This was mostly fixed a couple months ago... but remains imperfect.
 * "Stopwords Are Not Match Words"
   * Not relevant where stopwords don't exist in the system (as with Lucene search, or if we fix the above-noted MySQL issue to avoid stopwords).


 * Ranking...
   * One recommendation is to use a non-boolean (natural-language) match to get the ranking, and sort by that.
   * May seem a little odd, but... we can see :D
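
One way to read that recommendation: keep the boolean-mode match in the WHERE clause to decide what matches, and add a second, natural-language MATCH ... AGAINST in the SELECT list purely to get a relevance score to sort by. The Python sketch below only builds the SQL and its parameters, using %s placeholders as common MySQL drivers do. The searchindex table and the si_page / si_text column names are assumed here, and ranked_search_sql is a made-up helper.

```python
def ranked_search_sql(boolean_query, plain_query, limit=20):
    """Build a query that filters in boolean mode but ranks by natural-language score."""
    sql = (
        "SELECT si_page, MATCH(si_text) AGAINST (%s) AS score "
        "FROM searchindex "
        "WHERE MATCH(si_text) AGAINST (%s IN BOOLEAN MODE) "
        "ORDER BY score DESC "
        "LIMIT %s"
    )
    # Parameter order follows placeholder order: score term, filter term, limit.
    return sql, (plain_query, boolean_query, int(limit))


if __name__ == "__main__":
    sql, params = ranked_search_sql("+cat -dog", "cat")
    print(sql)
    print(params)
```

The boolean expression and the plain query are passed separately because operators like + and - only mean something in boolean mode.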