Search issues in MediaWiki 1.11

Avi Rappoport wrote up some scathing complaints about MediaWiki's default search system. Most of them overlook the fact that Wikipedia doesn't use the MySQL search backend, which explains some of the "mysterious" differences in behavior. :)

Most of the suckage on a default install is in MySQL, but not all of it. Still, the user doesn't care whose fault it is -- they just want it fixed!

MySQL's length limit

 * Words under 4 characters don't return results.
 * Current solution: force people to customize MySQL's configuration.
 * Cleaner solution: pad short words in our index so they're longer. :P
 * Done in 1.14.
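
A minimal sketch of the padding idea, not the code that actually went into 1.14. It assumes MySQL's default minimum word length of 4 and a hypothetical filler character `_`; the function name is made up for illustration. The same transformation would have to be applied both to the text going into the index and to the user's search terms.

```python
import re

MIN_WORD_LEN = 4  # MySQL's default ft_min_word_len (assumed unchanged)
PAD = "_"         # hypothetical filler; must be a character MySQL indexes as part of a word


def pad_short_words(text: str) -> str:
    """Pad every word shorter than MIN_WORD_LEN so MySQL will index it."""
    def pad(match: re.Match) -> str:
        # ljust leaves words that are already long enough untouched
        return match.group(0).ljust(MIN_WORD_LEN, PAD)

    return re.sub(r"\w+", pad, text)


print(pad_short_words("tv guide"))  # tv__ guide
```

Since the padding is applied on both sides, a search for "tv" becomes a search for "tv__", which matches the padded form stored in the index.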

MySQL's stopwords

 * Words on the stopword list don't return results.
 * Current solution: force people to customize MySQL's configuration.
 * Cleaner solution: pad words we know are in MySQL's default stopword list so they get indexed. :P
 * Kinda hacky, but should work.
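
Roughly what the stopword hack could look like, as a sketch only. The `STOPWORDS` set here is a tiny sample standing in for MySQL's full default list, and the `_` marker and function name are assumptions for illustration. As with short words, index text and search terms would both need to go through it.

```python
import re

# Small sample only; the real default list has several hundred entries.
STOPWORDS = {"about", "and", "for", "the", "with"}
MARKER = "_"  # hypothetical suffix that turns a stopword into a non-stopword


def escape_stopwords(text: str) -> str:
    """Append MARKER to any word on the stopword list so it gets indexed."""
    def escape(match: re.Match) -> str:
        word = match.group(0)
        return word + MARKER if word.lower() in STOPWORDS else word

    return re.sub(r"\w+", escape, text)


print(escape_stopwords("The War and Peace"))  # The_ War and_ Peace
```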

"extremely limited search syntax"

 * "can't exclude terms from search results"
   * Fixed in 1.12: use -foo
 * "No way to search for one or more among several terms"
   * (boolean OR) This indeed probably doesn't currently work (but is supported in Lucene).
   * Needs fancier syntax conversion to build an appropriate query for the MySQL backend.
 * "Does not search for plural or other versions of words (stemming)"
   * True with the current MySQL backend.
   * Would have to do this manually, which is non-trivial.
 * "No truncation or wildcards"
   * Fixed in 1.12: use foo*
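
For the OR case, the "fancier syntax conversion" could look something like the sketch below. It is an illustration, not MediaWiki's parser: the function name is hypothetical, it assumes a literal upper-case `OR` keyword between terms, and it does no escaping of special characters. It leans on MySQL's boolean-mode operators: `+` for required, `-` for excluded, a trailing `*` for prefix matching, and parentheses for grouping alternatives.

```python
def to_boolean_query(user_query: str) -> str:
    """Translate 'a OR b c* -d' into a MySQL BOOLEAN MODE expression."""
    groups = []      # each entry is a list of alternative terms
    excluded = []    # terms prefixed with '-'
    join_next = False

    for token in user_query.split():
        if token == "OR":
            # OR joins the next term to the previous group, if there is one
            join_next = bool(groups)
            continue
        if token.startswith("-") and len(token) > 1:
            excluded.append(token[1:])
        elif join_next:
            groups[-1].append(token)
        else:
            groups.append([token])
        join_next = False

    parts = []
    for alternatives in groups:
        if len(alternatives) == 1:
            parts.append("+" + alternatives[0])
        else:
            # A required group matches when any one alternative is present
            parts.append("+(" + " ".join(alternatives) + ")")
    parts.extend("-" + term for term in excluded)
    return " ".join(parts)


print(to_boolean_query("cat OR dog food* -fish"))  # +(cat dog) +food* -fish
```

Wildcards pass straight through because MySQL's boolean mode uses the same trailing `*` convention.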

"Misleading Results Information"

 * "How many actual matches does your search find? It's a mystery..."
 * Full count of matches not currently shown.
 * Solution: ... can we do a SELECT COUNT(*) or something? Is that even vaguely efficient?
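
One way the count could be fetched is with a second query sharing the same WHERE clause as the paged one. A sketch, assuming the `searchindex` table with `si_page` and `si_text` columns and a Python MySQL driver that uses `%s` placeholders; the function name is made up. Whether it is efficient enough is still the open question, since the count has to evaluate the full-text match over every matching row rather than just one page of them.

```python
def build_search_queries(limit: int = 20, offset: int = 0) -> tuple:
    """Return (page_sql, count_sql); both take the search expression as their one parameter."""
    where = "MATCH(si_text) AGAINST(%s IN BOOLEAN MODE)"
    page_sql = (
        f"SELECT si_page FROM searchindex WHERE {where} "
        f"LIMIT {int(limit)} OFFSET {int(offset)}"
    )
    count_sql = f"SELECT COUNT(*) FROM searchindex WHERE {where}"
    return page_sql, count_sql


page_sql, count_sql = build_search_queries(limit=10, offset=20)
print(page_sql)
print(count_sql)
```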

"Search Results Show Markup Junk"

 * Guilty as charged.
 * Solution: run markup stripping? How best to handle this efficiently? What about matches on link targets that don't appear in the literal text?
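
A rough idea of what cheap markup stripping for snippets might look like. This is a regex-based sketch with a hypothetical function name, not a real wikitext parser: it only handles simple non-nested templates, internal links, HTML-style tags, and bold/italic quotes.

```python
import re


def strip_wiki_markup(text: str) -> str:
    """Reduce common wiki markup to plain text for a search snippet."""
    # Drop simple, non-nested templates: {{citation needed}}
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)
    # Drop HTML-style tags such as <ref> or <br />
    text = re.sub(r"<[^>]+>", "", text)
    # Drop runs of apostrophes used for bold and italics
    text = re.sub(r"'{2,}", "", text)
    # Collapse whitespace left behind by the removals
    return re.sub(r"\s+", " ", text).strip()


sample = "'''Paris''' is the capital of [[France|the French Republic]].{{citation needed}}"
print(strip_wiki_markup(sample))  # Paris is the capital of the French Republic.
```

Note that this throws away the link target whenever a label is present, so a hit on "France" in the sample would have nothing to highlight in the snippet. It does not answer the link-target question; it just makes the problem visible.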

"Match Words Highlighting Is Wrong"

 * This was mostly fixed a couple months ago... but remains imperfect.
 * "Stopwords Are Not Match Words"
   * Not relevant where stopwords don't exist in the system (as with Lucene search, or if we fix the above-noted MySQL issue to avoid stopwords).


 * Ranking...
   * There's some recommendation to use a non-boolean match to get the ranking, and sort by that.
   * May seem a little odd, but... we can see :D
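
The ranking trick would amount to something like this: filter with the boolean-mode match, but compute a score with a plain (natural-language) match and sort on it. A sketch only, again assuming the `searchindex` table with `si_page` and `si_text` columns and `%s` driver placeholders; the function name is hypothetical.

```python
def build_ranked_query(limit: int = 20) -> str:
    """Build a query that filters in boolean mode and ranks by natural-language score.

    Parameters, in order: the plain search words (for scoring),
    then the boolean-mode expression (for filtering).
    """
    return (
        "SELECT si_page, MATCH(si_text) AGAINST(%s) AS score "
        "FROM searchindex "
        "WHERE MATCH(si_text) AGAINST(%s IN BOOLEAN MODE) "
        "ORDER BY score DESC "
        f"LIMIT {int(limit)}"
    )


print(build_ranked_query(limit=10))
```

The scoring parameter should be the bare words without `+`, `-`, or `*` operators, since those only mean something in boolean mode.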