User:TJones (WMF)/Notes/Project Wishlist

See TJones_(WMF)/Notes for other projects.

If I had infinite time, these are some of the other projects I'd like to work on. I do work on some of them as my 10%-time projects, and hope to get to these and others in the future. If you'd like to help in any way—comments, questions, and suggestions are welcome!—please contact me or leave a note on the talk page.

Future Hackathon Potential Project List

This is a list of mostly language-focused, not-necessarily-great ideas, in order of my current desire to work on them at the 2020 Hackathon.

Work with Albanian speakers to implement a basic Albanian language analyzer, with appropriate folding for non-Albanian diacritics and a stop word list; test an Albanian stemmer or two ( 1a & 1b, 2) and begin porting one to Java if warranted.
plugin to do transliteration for languages where it is relatively easy (Serbian was on the list, but it’s already done!—and for very simple mappings this is just a character map). LanguageConverter docs have a list of what's implemented, but there are others.
- “Bollywood detector”—identify and map Bollywood movie names into multiple scripts (these show up in zero-results searches)
work out the use cases and infrastructure for supporting a community-built thesaurus
- "synonym tester": a user script to test the effects of making two words synonyms
expand the plugin to do automatic homoglyph corrections (T222669) to include Greek/Latin and Cyrillic/Greek (and handle those rare tri-script tokens)
look into ways of automatically generating a stemmer from Wiktionary conjugation/declension data (maybe start with Estonian?)
find a way to automatically determine low-information title/redirect prefixes like “List of …” and investigate indexing the Completion Suggester without them
extract “related results” from an article’s infoboxes, opening text, or elsewhere and display them on the search results page with the article
project WordNet or other thesaurus/ontology onto short strings (e.g., Commons descriptions, Wikipedia titles, etc.) to determine useful thesaurus terms and prune the rest
implement a phonetic search keyword for matching query to titles
develop a different statistical approach to detect wrong-keyboard typing and build a search-only filter to generate alternative tokens, for Russian/English (T138958), Hebrew/English (T155104), or one hand on the wrong home row keys
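
To make the wrong-keyboard idea in the last item a bit more concrete, here is a minimal sketch of the token-generation half only, assuming a plain one-to-one mapping between the standard US QWERTY and Russian ЙЦУКЕН layouts. It is not the statistical detector, and it is not taken from any existing plugin; the names `EN_KEYS`, `RU_KEYS`, and `alternative_tokens` are made up for illustration.

```python
# Hypothetical sketch: generate "other keyboard" variants of a token.
# Assumes standard US QWERTY and Russian ЙЦУКЕН layouts, lowercase only.
EN_KEYS = "qwertyuiop[]asdfghjkl;'zxcvbnm,."
RU_KEYS = "йцукенгшщзхъфывапролджэячсмитьбю"

# Each key position maps to the character the other layout would produce.
EN_TO_RU = str.maketrans(EN_KEYS, RU_KEYS)
RU_TO_EN = str.maketrans(RU_KEYS, EN_KEYS)


def alternative_tokens(token):
    """Return the token as it would read if typed on the other layout.

    Only variants that differ from the (lowercased) input are returned,
    so tokens with no mappable characters yield an empty list.
    """
    lowered = token.lower()
    alternatives = set()
    for table in (EN_TO_RU, RU_TO_EN):
        candidate = lowered.translate(table)
        if candidate != lowered:
            alternatives.add(candidate)
    return sorted(alternatives)


if __name__ == "__main__":
    print(alternative_tokens("ghbdtn"))  # Latin keys typed for a Russian word
    print(alternative_tokens("руддщ"))   # Cyrillic keys typed for an English word
```

A real filter would presumably only emit these alternatives when some statistical check says the original token looks unlikely in its own script, which is the part still to be worked out.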

Potential Non-Hackathon 10% Projects

If anyone at a hackathon wanted to work on these, I'd be more than happy to join in, but these are more search-internals, tech-debt-type projects.

recheck differences in unpacked vs monolithic analyzers (eliminating our automatic upgrades, which are 98% likely to have caused the diffs)
compare the analyzers for the top 5-10 wiki languages by volume, and look for ways to increase consistency among them

Completed Projects!!

Mirandese (mwl) analysis plugin built from Portuguese and French parts, plus a stop list provided by an mwl editor (T194941) Done!
plugin to merge high surrogates and low surrogates that get split up by the Chinese analyzer (T168427) Done!
plugin to do automatic homoglyph corrections (T222669) Done!
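
For readers unfamiliar with the homoglyph problem, here is a rough sketch of the general idea for Cyrillic/Latin mixed-script tokens. It is only an illustration under simplifying assumptions (lowercase input, a tiny hand-picked mapping table) and does not describe how the actual plugin works; `CYR_TO_LAT` and `fix_homoglyphs` are hypothetical names.

```python
# Hypothetical sketch: repair tokens that mix Cyrillic and Latin look-alikes.
# The mapping is a small illustrative subset, written with escapes so the
# two scripts cannot be confused when reading the source.
CYR_TO_LAT = {
    "\u0430": "a",  # Cyrillic a
    "\u0435": "e",  # Cyrillic ie
    "\u043e": "o",  # Cyrillic o
    "\u0440": "p",  # Cyrillic er
    "\u0441": "c",  # Cyrillic es
    "\u0445": "x",  # Cyrillic ha
    "\u0443": "y",  # Cyrillic u
}
LAT_TO_CYR = {lat: cyr for cyr, lat in CYR_TO_LAT.items()}


def is_latin(ch):
    return "a" <= ch <= "z"


def is_cyrillic(ch):
    return "\u0400" <= ch <= "\u04ff"


def fix_homoglyphs(token):
    """Return a single-script version of a mixed-script token, if one exists."""
    has_latin = any(is_latin(c) for c in token)
    has_cyrillic = any(is_cyrillic(c) for c in token)
    if not (has_latin and has_cyrillic):
        return token  # single-script tokens are left alone

    # Try to make the whole token Latin.
    as_latin = "".join(CYR_TO_LAT.get(c, c) for c in token)
    if not any(is_cyrillic(c) for c in as_latin):
        return as_latin

    # Otherwise try to make the whole token Cyrillic.
    as_cyrillic = "".join(LAT_TO_CYR.get(c, c) for c in token)
    if not any(is_latin(c) for c in as_cyrillic):
        return as_cyrillic

    return token  # not fixable with this table


if __name__ == "__main__":
    # "chocolate" typed with two Cyrillic o characters
    print(fix_homoglyphs("ch\u043ec\u043elate"))
```

Extending this to Greek/Latin and Cyrillic/Greek would mostly mean more mapping tables plus a decision about which target script to prefer when more than one conversion succeeds.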