User:TJones (WMF)/Notes/Potential Applications of Natural Language Processing to On-Wiki Search

May 2018 — See TJones_(WMF)/Notes for other projects. See also T193070

“[W]e need to design the right tasks. Too easy and NLP is only a burden; too hard and the necessary inferences are beyond current NLP techniques.”

—R. Baeza-Yates et al., “Towards Semantic Search” (doi:10.1007/978-3-540-69858-6_2)

Introduction and Overview

What is natural language processing? It’s hard to pin down exactly, but almost any automated processing of text as language might qualify; in a corporate environment, it’s whatever text processing you can do that your competitors can’t. The English Wikipedia article on NLP does a good job of laying out a lot of the common, high-level tasks NLP addresses, most of which we’ll at least touch on below.

The goal of this report is to survey aspects of computational linguistics/NLP that might be useful for search; to list 50-100 ideas and variants (many without much detail at first); to find, through discussion, 10-20 that seem really promising; and then to identify 1-2 that we can pursue, ourselves and/or with the help of an outside consultant, over the next 2-4 quarters.

Some topics are driven more by use case (phonetic search), and some more by technique (Word2vec must be good for something), so the level of obvious applicability varies. Some concepts are defined under one item and referenced from another. Items are roughly grouped by similarity, but only loosely.

Focusing more deeply on any of these topics could lead to a recursive amount of similar detail. For most topics we still need to investigate the cost—in terms of development, integration, complexity, computational resources, etc—vs the value to be had—improvements to search, benefit to readers, editors, etc. I don’t have any strong opinions on “build vs buy” for most of these, especially given the fact that we would only pursue open-source options, which affords much more control. For some, whether build or buy, it may make sense to wait for other resources to mature, like Wikidata’s lexeme search (Stas’s Notes, T189739), or more structured data in Wiktionary.

I think it is often worth considering some not-so-cutting-edge techniques that, by today’s standards, are not only simple, but relatively lightweight. These have the advantage of being well-documented, easier to implement and understand, and lower risk should they fail to pan out. Many also follow an 80/20 reward/cost ratio compared to cutting-edge techniques—and implementing five ideas that each get 80% of the value of the best version of each might be better than implementing one idea in the best way possible, especially when the payoff for each implementation is unclear. Experience can then show which is the best to pursue to the max for the most benefit. Of course, cutting-edge techniques are good, too, if they are practical to implement!

As the dividing line between Natural Language Processing, Machine Learning, and Information Retrieval is sometimes blurry, I haven’t been too careful to discard ideas that are clearly more ML or IR than NLP.

Every item probably deserves an explicit “needs literature review for more ideas and options” bullet point, but that would get to be very repetitive. Many items that refer to words or terms might need to be modified to refer to character n-grams for languages without spaces (esp. those without good segmentation algorithms).

N.B.: Items are loosely grouped, but completely unsorted. Top-level items aren't more important than sub-items, and we might work on a sub-sub-item and never take on a higher level item. The hierarchy and grouping is just a way to organize all the information and reduce repetition in the discussion.

Current Recommendations

David, Erik, and Trey reviewed a selection of the most promising-seeming and/or most interesting projects and gave them a very rough cost estimate based on how big a relative impact each would have (weighted double), how hard it would be technically, and how difficult the UI aspect would be (weighted half). See the Scoring Matrix below. The scores are not definitive, but helped guide the discussion.

For the possibility of working with an outside consultant, we also considered how easily separated each project would be from our overall system (making it easier for someone new to get up to speed), how projects feed into each other, how easily we could work on projects ourselves (like, we know pretty much what to do, we just have to do it), etc.

Our current recommendation for an outside consultant would be to start with (1) spelling correction/did you mean improvements, with an option to extend the project to include either (2) "more like" suggestion improvements, or (3) query reformulation mining, specifically for typo corrections. These are marked with an asterisk in the scoring matrix below.

For spelling correction (#1), we are envisioning an approach that integrates generic intra-word and inter-word statistical models, optional language-specific features, and explicit weighted corrections. We believe we could mine redirects flagged as typo correction for explicit corrections, and the query reformulation mining (#3) would also provide frequency-weighted explicit corrections. Our hope is that a system built initially for English would be readily applicable to other alphabetic languages, most probably other Indo-European languages, based on statistics available from Elastic; and that some elements of the system could be applied to other non-alphabetic languages and languages that are typologically dissimilar to Indo-European languages.[1]

Looking at the rest of the list, (a) wrong keyboard detection seems like something we should work on internally, since we already have a few good ideas on how to approach it. (b) Acronym support is a pet peeve for several members of the team, and seems to be straightforward to improve. (c) Automatic stemmer building and (d) automatic stop word generation aren't so much projects we should work on as things we should research to see if there are already tools or lists out there we could use to make the projects much easier.

Scoring Matrix

Project | Tech | UI | Impact | Cost | Notes
Spelling / DYM Improvements * | hard | N/A | large | 2 | good scope, distinct from other parts of the system, good for a consultant
improve "more like" suggestions * | hard | N/A | large | 2 | good scope, distinct from other parts of the system, eval is hard, good for a consultant
wrong keyboard detection | easy/medium | easy | medium | 2.5 | easy for us to work on—small impact for most, but large impact for some
ignore completion prefixes | easy/medium | easy | small/medium | 3.5 | determine low-information title/redirect prefixes (e.g., "List of") and also index titles without them
query expansion | medium/hard | N/A | medium | 3.5 | needs better scope
proper acronym support | easy/medium | N/A | small/medium | 3.5 | easy for us to work on—and at the same time fix the word_break_helper and properly support N.A.S.A. == NASA (get rid of poor hacks such as this)
query reformulation mining * | hard | easy | medium | 4 | for eventual use for spelling correction or synonyms—needs separate work to put to good use
automatic stemmer building | hard | N/A | medium | 4 | needs research—may be able to use existing tools
entity recognition | medium | N/A | small/medium | 4 | needs better scope
diversity reranking | medium | N/A | small/medium | 4 |
community-built thesaurus | hard | hard | medium/large | 4 | lots of non-technical issues and needs buy-in from the communities
related results | medium | easy | small/medium | 4 |
noun phrase indexing | medium/hard | N/A | small/medium | 4.5 |
link analysis | medium/hard | N/A | small/medium | 4.5 |
automatic stop words | medium | N/A | small | 5 | needs research—there may be decent lists out there; should definitely use those first
phonetic search | medium | easy | small | 5 |

(* = recommended starting projects for an outside consultant)

The Possibilities are Endless!

  • Phonetic search: either as a second-try search (i.e., when there are few or zero results) or as a keyword. Probably limited to titles only. Language-dependent. T182708 (See the sketch below.)
    • Santhosh has worked on an Indic phonetic algorithm for cross-script searching. Blog post. Github.
    • See also T168880
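As a rough illustration of the second-try idea above, here is a minimal sketch using the classic (English-oriented) Soundex algorithm; real phonetic support would need per-language algorithms, like Santhosh's Indic work. The titles and query are invented.

    # Simplified American Soundex: first letter plus first three consonant codes.
    _CODES = {}
    for digit, letters in enumerate(("BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"), 1):
        for letter in letters:
            _CODES[letter] = str(digit)

    def soundex(word):
        word = "".join(c for c in word.upper() if c.isalpha())
        if not word:
            return ""
        out, prev = word[0], _CODES.get(word[0])
        for c in word[1:]:
            code = _CODES.get(c)
            if code and code != prev:
                out += code
            if c not in "HW":  # H and W don't separate duplicate codes
                prev = code
        return (out + "000")[:4]

    # Second-try lookup: index titles by phonetic key, consult it on few/zero results.
    phonetic_index = {}
    for title in ["Mississippi", "Missouri", "Michigan"]:
        phonetic_index.setdefault(soundex(title), []).append(title)

    print(phonetic_index.get(soundex("Missisipy")))  # ['Mississippi']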


  • Parsing / part-of-speech tagging (POS tagging usually involves less structural information). Language-dependent.
    • can help with word-sense disambiguation, detection of noun phrases for noun-phrase indexing and entity recognition.
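A small example of how POS tags feed noun-phrase detection, assuming NLTK with its default English models (punkt and averaged_perceptron_tagger) is installed; a sketch, not an endorsement of a particular toolkit.

    import nltk  # assumes the punkt and averaged_perceptron_tagger data are downloaded

    tagged = nltk.pos_tag(nltk.word_tokenize(
        "The John F. Kennedy School of Government is in Cambridge."))
    # Crude NP pattern: optional determiner, adjectives, then one or more nouns.
    chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")
    for subtree in chunker.parse(tagged).subtrees(filter=lambda t: t.label() == "NP"):
        print(" ".join(word for word, tag in subtree.leaves()))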


  • Automatic stemmer building: explore a general framework for automatic stemmer building, esp. using Wiktionary data (see the sketch below)
  • Semi-automated morphological tools: I was thinking that some tools to do conjugations and declensions would be handy for Wiktionary. Then I remembered that templates are arguably Turing complete (or close enough), so of course such things already exist for English and lots of other languages on English Wiktionary.
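A toy sketch of the stemmer-building idea: learn suffix-rewrite rules from (inflected form, lemma) pairs such as might be extracted from Wiktionary. The pairs here are invented.

    from collections import Counter
    from os.path import commonprefix

    pairs = [("walking", "walk"), ("talking", "talk"), ("running", "run"),
             ("cities", "city"), ("ponies", "pony")]

    rules = Counter()
    for inflected, lemma in pairs:
        stem = commonprefix([inflected, lemma])
        rules[(inflected[len(stem):], lemma[len(stem):])] += 1

    # The most frequent suffix rewrites become candidate stemming rules.
    print(rules.most_common(2))  # [(('ing', ''), 2), (('ies', 'y'), 2)]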


  • Noun-phrase indexing: index more complex noun phrases in addition to or instead of the parts of the noun phrase. Can disambiguate some nouns by making them more specific, can provide better matches to people, places, or things. Detecting noun-phrases in queries could be much harder, so looking for n-grams that are indexed as phrases instead would be one approach.
    • Various techniques could be used to limit what actually gets indexed, like TF and IDF ranges, or score all candidates and only index the top n.
    • Could be generalized to phrasal indexing based on something other than purely syntactic considerations.
      • Could use page titles and redirects as phrase candidates (perhaps w/ minimum IDF score).
  • Entity recognition, classification, and resolution: Recognize entities (possibly restricted to named entities), classify them (people, places, companies, other organizations, etc.), and determine whether different identified entities are the same. (language-dependent)
    • Improve recall by recognizing that “Jack Kennedy” and “John F. Kennedy” are the same person.
    • Improve precision by recognizing that the fifteen instances of “Kennedy” in an article are all about Bobby Kennedy, and so do not represent good term density for a search on “Jack Kennedy”.
    • Distinguish between “John F. Kennedy”, “John F. Kennedy International Airport”, “John F. Kennedy School of Government”, and “John F. Kennedy Center for the Performing Arts” as different kinds of entities.
    • A useful input to some recognizers and resolvers is a gazetteer. Extracting such a list of known entities from Wikidata or from Wikipedia by category could be useful. (See the sketch after this list.)
  • Coreference resolution / Anaphora resolution: This is similar to entity resolution, but applies to pronouns. Probably not terribly useful directly, but could be useful as part of topic segmentation, and possibly as a way to increase the relative weight/term density of entities mentioned in a text.
  • Relationship extraction: once your entity recognition has found some entities, you can also derive relationships between them. Might be good for document summarization or identifying candidates for related results.
  • LTR Features
    • Binary: noting that a given phrase/entity has been identified in both the query and a given article.
    • Count: how many times does the phrase/entity from the query occur in the article.
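A minimal sketch of the gazetteer idea above: greedy longest-match lookup of known entities in tokenized text. The gazetteer entries are illustrative; a real one would be extracted from Wikidata or from Wikipedia categories.

    GAZETTEER = {
        ("john", "f.", "kennedy"): "PERSON",
        ("john", "f.", "kennedy", "international", "airport"): "PLACE",
    }
    MAX_LEN = max(len(key) for key in GAZETTEER)

    def find_entities(tokens):
        """Scan left to right, always taking the longest gazetteer match."""
        lowered = [t.lower() for t in tokens]
        i, found = 0, []
        while i < len(tokens):
            for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
                key = tuple(lowered[i:i + n])
                if key in GAZETTEER:
                    found.append((" ".join(tokens[i:i + n]), GAZETTEER[key]))
                    i += n
                    break
            else:
                i += 1
        return found

    print(find_entities("Flights to John F. Kennedy International Airport".split()))
    # [('John F. Kennedy International Airport', 'PLACE')]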


  • Word-sense disambiguation
    • assign specific sense to a given instance of a word. For example, bank could be a financial institution or the edge of a river, but in the phrase “money in the bank” it’s clearly the financial institution meaning.
    • can map to senses defined in a lexicon, or to ad hoc senses derived via un- or semi-supervised algorithms like Word2vec or similar. Ad hoc algorithmically defined senses are not transparent, but can be more specific than lexical senses. For example, the “bank” in “West Bank” refers to a river bank, but its use in a proper noun is less about geomorphology and more about politics.
    • can improve precision by searching for the relevant sense of a word, rather than the exact string.
    • related to semantic search and semantic similarity
  • Use a Thesaurus
    • Automatically building a thesaurus:
      • mine query logs for synonyms—diff queries with same click is evidence of synonymy
      • mine redirects for synonyms—diff words in title is evidence of synonymy
      • probably language independent; though may need to consider n-grams or other elements for languages without analyzers
      • could expand beyond pure synonymy to “expanded search” with Word2vec or similar
    • Community-built thesaurus: Allow the communities to define synonym groups on a wiki page some where, and regularly harvest and enable them. (Needs a lot of community discussion and some infrastructure and process to deal with testing, finding consensus, edit warring, etc. But I could imagine “relforge lite” would allow people to see approximate results by ORing together synonyms.) Language-dependent, possibly wiki-dependent—though same-language projects could borrow from each other.
      • Various techniques for automated thesaurus building could be used to suggest candidates for community review, including smarter hyphen processing options.
    • Some additional considerations:
      • Should a thesaurus always be on, or should it have to be invoked? Do we have multiple levels of thesaurus, some on by default, some only used when “expanded search” is invoked, and some way (“term” or verbatim:term) of disabling all thesaurus terms?
    • Thesaurus for Unicode characters: Automatically add Unicode character names to a thesaurus, so ☥ == Ankh, € == euro, etc, based on Unicode character name descriptions. Translate to other languages via Wikidata, or other sources—are there official Unicode names in other major world languages? Candidates may need some sort of review, and we'd have to decide what level of phrasal matching is required (e.g., "goofy face" for 🤪: is it a single token or two words? Do textual matches have to be an exact phrase match, or does goofy smiley face count? Etc.) (From a discussion on T211824.) (See the sketch below.)
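The Unicode-name idea is easy to prototype with Python's standard unicodedata module; this sketch leaves the hard parts (other languages, phrase matching, review) aside.

    import unicodedata

    def unicode_synonyms(char):
        """Derive thesaurus terms for a character from its Unicode name."""
        name = unicodedata.name(char, "")
        terms = [name.lower()]
        for generic in (" sign", " symbol"):  # "EURO SIGN" -> "euro" as well
            if terms[0].endswith(generic):
                terms.append(terms[0][: -len(generic)])
        return terms

    print(unicode_synonyms("€"))  # ['euro sign', 'euro']
    print(unicode_synonyms("☥"))  # ['ankh']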


  • (Semi-)Automatically finding stop words: get a ranked list of words by max DF or by IDF and pick a cutoff (hard) or get speaker review (less hard). Language-dependent. (See the sketch after this list.)
    • Or finding existing lists, with a usable license
    • Or decide that stop words are overrated.
    • T56875
    • See also these stop word lists with a BSD-style license.
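A sketch of the ranked-list approach: compute document frequency over a corpus and surface high-DF words for speaker review. The corpus and the 0.6 cutoff are invented placeholders.

    from collections import Counter

    docs = ["the cat sat on the mat", "the dog ate the bone", "a cat and a dog"]

    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))  # count each word at most once per document

    # Words appearing in most documents are stop word *candidates*; on a corpus
    # this tiny, content words sneak in too, which is why review is needed.
    print([w for w, c in df.most_common() if c / len(docs) >= 0.6])
    # e.g., ['the', 'cat', 'dog']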


  • Query rewriting: (a super-set of query expansion) the automatic or suggested modification of a user’s query (this is also sometimes called “query reformulation”, but here I’m going to use that to refer to users modifying their own queries. See below.) This includes:
    • Spelling correction: not only fixing obvious typos, but also correcting typos that are still valid words. For example, while fro is a perfectly fine word, in “toys fro tots” it is probably supposed to be for. Statistical methods based on query or document word bigrams can try to detect and correct such typos. Many techniques are language-independent. (See the sketch after this list.)
      • Parsing might be able to detect unexpected parts of speech, or evaluate the syntactic quality of a suggested repair.
      • A reversed index would also allow us to make repairs in the first couple of letters of a word (which now we can’t do—so neither Did You Mean (DYM) nor the completion suggester can correct Nississippi to Mississippi).
    • Using word sense disambiguation (in longer queries) to search only for a particular sense of a word. For example, in river bank, we can exclude or reduce the score of results for banks as financial institutions.
    • Suggesting additional search terms (in shorter queries) to disambiguate ambiguous terms or refine a query. Could be based on query log mining (term X often co-occurs with term Y), document neighborhoods (add the most common co-occurring term from each of n neighborhoods in which term X most frequently occurs), or other techniques. Language-independent.
    • This also includes some kinds of information that we handle at indexing time, rather than at query time, like stemming and using synonyms from a thesaurus.
    • See also Completion suggester improvements for related ideas for improving queries while the user is typing.
  • Query reformulation mining: detecting when users modify their own queries to try to get better results and mining that information.
    • Mine logs for sequential queries that are very similar (at the character level or at the word level). At the character level, might imply spelling correction. At the word level, might imply synonyms. Probably language-independent.
  • Breaking down compounds to index. Less applicable in English, more applicable in German and other languages. Language-dependent.
  • Smarter hyphen processing, e.g., equating merry-go-round and merrygoround. Could be done through a token filter that converts hyphenated forms to non-hyphenated forms (language independent). Could be done via thesaurus for specific words, curated by mining candidates (language-dependent curation, but language independent creation on candidate list) or automatically determined based on some threshold, e.g., both forms occur in the corpus n times, where n ≥ 1 (language-independent).
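A minimal sketch of the word-bigram idea for real-word typos like “toys fro tots”: among edit-distance-1 alternatives for a word, prefer the one with the most bigram support from its context. The counts and the confusion set are invented; in practice they would be mined from queries or article text.

    BIGRAMS = {("toys", "for"): 120, ("for", "tots"): 90}  # invented counts
    CONFUSABLES = {"fro": ["fro", "for"]}  # edit-distance-1 alternatives

    def best_in_context(prev, word, nxt):
        """Pick the variant of `word` with the highest bigram support."""
        def support(w):
            return BIGRAMS.get((prev, w), 0) + BIGRAMS.get((w, nxt), 0)
        return max(CONFUSABLES.get(word, [word]), key=support)

    print(best_in_context("toys", "fro", "tots"))  # 'for'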


  • Document summarization: build an automatic summary of a document. The simplest approach chooses sentences or phrases from the existing document based on TF/IDF-like weighting and structural information; much cleverer approaches try to synthesize more compact summaries by parsing the text and trying to “understand” it. (See the sketch after this list.)
    • Could weight summary based on query or other keywords.
    • Simple approach could use additional information from noun-phrase indexing, entity recognition, word-sense disambiguation, topic segmentation, or other NLP-derived info to improve weighting of sentences/phrases.
    • Semi-clever approach could try to parse info boxes and other commonly used templates to construct summary info.
    • Could be an API that allows users to request a summary of a document at a specified percentage (25%) or a specified length (1000 characters).
      • A UI supporting a slider that grows/shrinks the summary is possible.
    • Multi-document summaries are difficult, but could in theory provide an overview of what is known on a topic from across multiple documents.
      • for example, summarize a topic based on the top n search results
      • multi-doc summaries, entity recognition, and topic segmentation could allow pulling together all the information Wikipedia has on a topic about person/place X, even though it is scattered across multiple articles and there is no article on X.
    • Simple approach is roughly language-independent, adding weighting by query/keywords is possibly quasi-language independent, in that it may only involve one parameter—use tokens (e.g., English) or use n-grams (e.g., Chinese). Clever approaches are probably language-dependent (the more clever, the more likely to be language-dependent). Template parsing is wiki-dependent.
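A sketch of the simple extractive approach: score sentences by average word weight (raw frequency here, standing in for TF/IDF) and keep the top ones in document order. Sentence splitting is deliberately naive.

    import re
    from collections import Counter

    def summarize(text, n_sentences=2):
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        weights = Counter(re.findall(r"\w+", text.lower()))  # stand-in for TF/IDF

        def score(sentence):
            tokens = re.findall(r"\w+", sentence.lower())
            return sum(weights[t] for t in tokens) / (len(tokens) or 1)

        top = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
        return " ".join(s for s in sentences if s in top)  # keep original order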


  • Semantic search / Semantic similarity: these are broad topics, and many of the sub-components are touched on throughout.
  • Document neighborhoods / ad-hoc facets: several approaches could be used to define document neighborhoods, and the neighborhoods could be used for several things. The basic idea is to find either n clusters of documents or clusters of m documents that are similar in some way. I’m calling these “neighborhoods” because “clusters” gets used for many, many things.
    • defining neighborhoods: any similarity metric and clustering algorithm can be used to cluster documents. Some similarity metrics: TF/IDF keyword vectors, Word2vec/Doc2vec, Latent semantic analysis, cluster pruning, etc.
      • sqrt(N) seems like a good heuristic for number of clusters if you have no other basis for choosing
      • could assign docs to singular nearest neighborhood, or to all neighborhoods within some distance.
      • could define multiple levels of neighborhood
    • implementing neighborhoods: most of the candidate metrics are vector-based, and storing vectors in Elasticsearch is probably impractical; creating a new field called, say, “nbhd” and storing an arbitrary token in it is at least plausible (though based on neighborhood size could still cause problems with skewed indexes).
    • using neighborhoods: some use cases
      • increasing recall: assign a query to one or more neighborhoods and return all documents in the neighborhood(s) as potential matches. Probably requires new ways of scoring. Might want to limit neighborhood size (rather than number of neighborhoods) in this use case.
      • could be used for diversity reranking
      • LTR features: learning-to-rank could take neighborhood info into account for ranking. Possible features include:
        • Binary value for “neighborhood match” between query and doc; each could have one neighborhood, or short list of neighborhoods (multiple matches or hierarchical neighborhoods)
        • Weighted value for “neighborhood match” between query and doc, given multiple neighborhoods each: could be # overlap in top-5 neighborhoods, hierarchical neighborhood, rank of best match, etc.
        • Nominal values of query and document—e.g., LTR could learn that documents in nbhd321 are slightly better results for queries in nbhd017. Sparsity of data and stability of neighborhood labels are issues.
  • Search by similarity
    • use built-in morelike or our own formulation for similarity (including document neighborhoods)
    • add a “more like this” / “fewer like this” query-refinement option on the search results page
    • have an interface that allows you to input a text snippet and find similar documents
    • match documents and editors
      • find collaborators to work on a given page by finding people who have made edits to similar pages
      • less creepily, find pages to edit based on edits you’ve already made (plus, say, the quality score of article)
      • possibly weighted by number of edits or number of characters contributed to edited pages
    • could be useful for category tools
  • Category tools
    • use the similarity measures from search by similarity
    • find docs like other docs in this category for category suggestion
      • probably exclude docs already in sub-categories
      • maybe add emphasis on docs in super-categories
    • find categories with similar content to suggest mergers
    • cluster large category contents into groups to suggest splits/sub-categories
  • Diversity reranking—promoting results that increase the diversity of topics in the top n results. For example, the top ten results for the search bank might all be about financial institutions; promoting one or more results about the edges of rivers would improve result diversity. Needs some document similarity measure—see search by similarity and document neighborhoods and the eBay blog post linked to above. (See the MMR sketch after this list.)
    • Could apply to full-text search or the completion suggester.
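One concrete way to do diversity reranking (a sketch) is maximal marginal relevance (MMR): repeatedly pick the result that best balances relevance against similarity to results already picked. The relevance and similarity functions are left abstract; document neighborhoods or TF/IDF vectors could supply the similarity.

    def mmr_rerank(results, relevance, similarity, top_n=10, lam=0.7):
        """results: doc ids; relevance: doc -> score; similarity: (doc, doc) -> [0, 1].
        Higher lam favors relevance; lower lam favors diversity."""
        selected, remaining = [], list(results)
        while remaining and len(selected) < top_n:
            def mmr(doc):
                redundancy = max((similarity(doc, s) for s in selected), default=0.0)
                return lam * relevance(doc) - (1 - lam) * redundancy
            selected.append(max(remaining, key=mmr))
            remaining.remove(selected[-1])
        return selected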


  • Zone indexes: index additional fields or zones (which I’ll refer to generically as zones). Possible pre-defined zones include title, opening text, see also, section titles, subsection titles, section opening text, and frequently used sections like citations, further reading, references, external links, or quotations (in Wiktionary), as well as captions, general aux text, other sections, and category names. Depending on which zones, some are language-dependent (references) and some are not (section titles).
    • could use topic segmentation to automatically find additional topic zones.
    • zone indexes could be exposed as keywords (like intitle, e.g., search in section titles)
    • LTR features:
      • zone relevance score (see below) for particular zones
      • what zone hits are from (e.g., section title > references)
      • term proximity with respect to zones; e.g., all hits are within one topic zone or one subsection is better than if hits for different query terms are all in different zones.
    • as keyword or LTR feature, it’s possible to calculate zone-specific relevance score, such as TF/IDF/BM25-type scores, etc. For example, “wars involving” is not useful category text on the WWII article.
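A sketch of combining per-zone relevance into one score, assuming some underlying per-zone scorer (BM25 or similar) already exists; the zone weights are invented placeholders that would need tuning (or learning, via LTR).

    # Invented weights: hits in titles matter more than hits in references.
    ZONE_WEIGHTS = {"title": 3.0, "opening_text": 2.0, "body": 1.0, "references": 0.2}

    def zone_score(per_zone_scores):
        """per_zone_scores: zone name -> relevance score from the underlying engine."""
        return sum(ZONE_WEIGHTS.get(zone, 1.0) * score
                   for zone, score in per_zone_scores.items())

    print(zone_score({"title": 1.2, "body": 0.8, "references": 2.0}))  # 4.8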


  • Completion suggester improvements
    • determine low-information title/redirect prefixes (like “List of”) and index pages with and without such prefixes
    • find other entities or noun phrases (see noun-phrase indexing and entity recognition) in titles/redirects and also index the “best” of those
    • n-gram–based word-level completion: predict/suggest the next few words in a query based on the last few words the user has typed (when the whole query isn't matching anything useful); see the sketch after this list
    • make spelling correction suggestions per-word (which might be different from matching a title with one or two errors)
    • (We'll have to think carefully about the UI if we want to show title matches, spelling corrections, and next-word suggestions.)
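A sketch of the next-word idea: count which word follows each word in a query log (a bigram model; real suggestions would use longer n-grams) and offer the most common continuations. The log is invented.

    from collections import Counter, defaultdict

    query_log = ["list of tallest buildings", "list of presidents",
                 "list of tallest mountains", "tallest buildings in chicago"]

    following = defaultdict(Counter)
    for q in query_log:
        words = q.split()
        for prev, nxt in zip(words, words[1:]):
            following[prev][nxt] += 1

    def suggest_next(last_word, k=3):
        return [w for w, _ in following[last_word].most_common(k)]

    print(suggest_next("tallest"))  # ['buildings', 'mountains']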


  • Question answering (language-dependent)
    • Shallow version: dropping question words or question phrasing (see T174621) and hoping the improved results do the job (see also Trey’s Notes, and the sketch after this list)
    • Deep question answering involves converting the question to a query and trying to return specific answers by parsing page results OR converting to SPARQL and getting a Wikidata answer.
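A sketch of the shallow version: strip common English question phrasing and run what is left as an ordinary query. The pattern list is illustrative and obviously language-dependent.

    import re

    # Illustrative English patterns; a real list would be curated per language.
    QUESTION_PREFIX = re.compile(
        r"^(who|what|when|where|why|how)( (is|are|was|were|do|does|did))?\s+",
        re.IGNORECASE)

    def strip_question(query):
        stripped = QUESTION_PREFIX.sub("", query).rstrip("?")
        return stripped or query  # fall back if everything was stripped

    print(strip_question("Who was the first president of Namibia?"))
    # 'the first president of Namibia'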


  • 20-questions style UI for Wikidata—helping people find something they know something about, but can’t quite put their finger on
    • Basic/static version: split the universe into the n (≤10) most “distinctive” categories; the user selects one to continue, and we iterate on the selected subset. You should be able to get to anything in ~10 steps. At any point when there are fewer than m (50 ≤ m ≤ 100) results, show them all. Splits could be re-computed monthly. (Computation is language-independent; presentation is language-dependent. See the sketch after this list.)
    • Advanced/dynamic version: allow one-or-more selections at any level; for certain categories (e.g., “person”) allow specification of likely-known information (birthdate range, date of death range, country of origin, gender, etc.)—these could be manually defined for the most obvious categories, or they could be inferred based on the number of items in a category that have those values, and what kinds of values they are. Dynamically determine “distinctive” subcategories on the fly based on current constraints (could be very compute intensive, depending on algorithms available). (Topic-dependent)
    • Feedback option: allow people to mark categories as unhelpful, and then either don’t show them for that session, for that user, for that main subcategory, or generally mark them as dispreferred for all users (much testing needed!). (computation is language-independent, presentation is language-dependent)
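One way to make “distinctive” concrete, as a sketch under the assumption that a distinctive category is one that splits the remaining candidates most evenly: rank categories by how close their overlap with the candidate set comes to half. The categories and items are invented.

    def most_distinctive(categories, candidates, n=10):
        """categories: name -> set of item ids; candidates: set of item ids."""
        target = len(candidates) / 2  # an even split narrows things fastest

        def balance(name):
            return abs(len(categories[name] & candidates) - target)

        return sorted(categories, key=balance)[:n]

    cats = {"person": {1, 2, 3, 4, 5}, "place": {5, 6}, "fictional": {1, 2, 3}}
    print(most_distinctive(cats, {1, 2, 3, 4, 5, 6}, n=2))  # ['fictional', 'place']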


  • Link analysis—using incoming and outgoing links to improve search results
    • Could be on-wiki link text, or using something like Common Crawl
    • Incoming link text could provide additional terms to index the document by
    • Targets of outgoing on-wiki links could also provide additional terms to index
    • Outgoing links could highlight important terms for the document that should be weighted more heavily
    • Outgoing links could also help identify entities (see entity recognition)


  • Topic segmentation: Identifying topic shifts within a document can help break it up into “sub-documents” that are about different things. This is useful for document summarization, including giving better snippets. Other uses might include indexing sub-document “topics” rather than whole long documents, and scoring particular sections of a document for relevance rather than entire documents—see zone indexes.
    • The same information used to detect topic shifts can be used across documents for similarity detection, which can be used for clustering or diversity reranking
  • Sentiment analysis: detect whether a bit of text indicates a positive, neutral, or negative opinion. Could be useful for topic segmentation, and also for reversing the emotional polarity on queries. An example I saw for non-encyclopedic searches was “are reptiles good pets?” which should have the same results as “are reptiles bad pets?” (modulo the intent of the searcher to find one preferred answer over the other). Something like Word2Vec could possibly turn positive sentiment terms into negative sentiment terms to search for dissenting opinions.
    • Doesn’t seem great for any obvious encyclopedia queries, but maybe.
    • Might be good for finding dictionary quotation examples with a given polarity.
    • Might be good for finding relevant sections on Wikivoyage.


  • Learn-to-Rank improvements
    • Additional features for LTR: see LTR features in (or following) document neighborhoods, zone indexes, noun-phrase indexing, entity recognition.
    • Better query grouping for LTR training: several techniques could be used to improve our ability to group queries for LTR training beyond the current language analysis. Query reformulation mining (or a related thesaurus) could help identify additional queries that are “essentially the same”.


  • Speech recognition and Optical character recognition: The most obvious use for these is as input techniques. Users could speak a search, or upload a photo (or live camera feed) of text in order to search by similarity or try to find a particular document (say, in Wikisource). (Language-dependent)
  • Text to speech: Reading results or result titles aloud. (Language-dependent)
  • Speech recognition and text-to-speech can improve accessibility, but users who need those technologies may already have them available on their devices/browsers.
  • Language generation: This most likely applies to document summarization from a non-textual source, like an info box. (Language-dependent)


  • Language Identification: We already use language identification for poorly performing queries on some Wikipedias, and will show results, if any, from the wiki of the identified language.
    • extend language identification for poorly performing queries to more languages and/or more projects, possibly in a more general, less fine-tuned way (Trey’s notes)
    • wrong keyboard detection: notice when someone has switched their language but not their keyboard and typed what looks like gibberish (T138958, T155104, Trey’s notes; see the sketch after this list)
      • using existing language ID on poorly performing queries, similar to current language ID
      • using some other statistical methods for adding additional tokens or giving Did You Mean (DYM) suggestions
    • in-document language identification: detect sections of texts that are in a language other than the main language of the wiki, and treat that text differently
      • block unneeded/nonsensical language analysis
      • annotate it as being in the detected language and allow searching by annotations
      • potentially much easier and more accurate than for queries because of larger sample sizes
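A sketch of wrong-keyboard detection for the common Russian/English case: remap characters between the QWERTY and ЙЦУКЕН layouts and check whether the result looks plausible (here a crude Cyrillic-vowel test stands in for real language identification).

    QWERTY = "qwertyuiop[]asdfghjkl;'zxcvbnm,."
    JCUKEN = "йцукенгшщзхъфывапролджэячсмитьбю"
    TO_RU = str.maketrans(QWERTY, JCUKEN)

    def maybe_wrong_keyboard(query):
        remapped = query.lower().translate(TO_RU)
        # Crude plausibility check; real code would use proper language ID.
        return remapped if any(v in remapped for v in "аеиоуыэюя") else None

    print(maybe_wrong_keyboard("ghbdtn"))  # 'привет' ("hello" typed on the wrong layout)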


  • Statistical alignment for transliteration: deduce multilingual transliteration schemes from Wikidata names data:
    • use data for moderately well-known people and places (to avoid the Germany/Deutschland/Allemagne/Niemcy problem), see examples: Q198369, Q940330, Q1136189
    • could be useful for maps when no manual transliteration/translation fallback is available
    • could be used to create index-only redirects for names in a known language
    • could provide suggestions, for verification, for missing Wikidata labels


  • Index-only redirects: Rather than doing something like creating a bot to automatically add useful redirects, we could create a mechanism for generating index-only redirects that are indexed like redirects, but which don’t actually exist outside the index. They would be generated at index time, and could be expanded or removed to accommodate weaknesses in full-text search or the completion suggester.
    • Find commonish typos that neither Did You Mean (DYM) nor the completion suggester can correct, and automatically add index-only redirects for those variants. Improvements to the completion suggester, say, might make some index-only redirect generation unnecessary. Existing redirects could be mined for candidates.
    • Provide translated/transliterated redirects for named people and places, either from Wikidata or automatically (see statistical alignment for transliteration). Automatic transliteration would not happen when Wikidata labels were available. Over time, specific index-only redirects might disappear because an incorrect automatic redirect was replaced with a manual Wikidata one.
  • Better redirect display: Currently on English Wikipedia, “Sophie Simmons” redirects to a section of the article about her father, “Gene Simmons”, which seems to indicate that this is probably the best article about her. However, because there is some overlap in the redirect and title (“Simmons”) the redirect text isn’t shown in the full-text results, which makes it look like the “Gene Simmons” article is only a mediocre match, instead of an exact title match to a redirect. On the other hand, “Goerge Clooney” redirects to “George Clooney”, and maybe it isn’t necessary to show that redirect. Perhaps some other metric for similarity could serve as a better gate on whether or not to show the redirect text along with “redirect from”. Probably script-dependent (similarity in spaceless languages like Japanese and Chinese might behave differently) and possibly language-dependent (highly inflected languages might skew similarity). (See the sketch below.)
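The similarity gate could start as simply as Python's difflib ratio; this is a sketch, and the 0.85 threshold is a guess that would need tuning against real redirect/title pairs.

    from difflib import SequenceMatcher

    def show_redirect_text(redirect, title, threshold=0.85):
        """Show 'redirect from' text only when redirect and title differ enough."""
        return SequenceMatcher(None, redirect.lower(), title.lower()).ratio() < threshold

    print(show_redirect_text("Goerge Clooney", "George Clooney"))  # False: near-duplicate
    print(show_redirect_text("Sophie Simmons", "Gene Simmons"))    # True: worth showing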


  • Cross-language information retrieval: How to search in one language using another? Especially taking into account that some wikis are much more developed than others?
    • Machine translation of the search and/or the results would allow people to get info in a language they don’t understand when no info is available in the wikis they can read.
    • Cross-language query lookup: As mentioned in statistical alignment for transliteration, Wikidata is a good source of cross-language mappings for named entities, either for translated index-only redirects, or for cross-language query lookup. For example, if someone on English Wikipedia searches for Moscou (Moscow in French), either look up Moscow on English Wikipedia, or redirect them to Moscou on French Wikipedia. (See the API sketch after this list.)
      • This is actually an example of why this could be very hard: Moscou actually matches more than 10 languages! In the other direction, Джон Смит (Russian transliteration of John Smith) matches a lot of people—in this case they are all named “John Smith” in English, but there must be more ambiguous examples out there.
    • Other second-chance search schemes:
      • For poorly performing queries, search the user's second-choice (and third, etc.?) language wiki—specified in user prefs, or using accept-language headers, etc.
      • Search across all of the user's specified languages (up to some reasonable limit) at once.
    • Highlight links to the same article in other languages, i.e., resurrect part of “explore similar” (see T149809, etc.; see also related results)
  • Related results: We can automatically generate result links on related topics.
    • Pull information from within results pages (especially from info boxes) to provide links to interesting related topics. Options include:
      • The “best” links from the result page—could be generically popular, most clicked-on by people from that page, or links from the highest query-term–density part of the page (see also zone indexes). Page-specific info likely would only be tracked for popular pages.
      • Pull links from info boxes. Could be template-specific; could be based on general popularity, page-specific popularity, or even template-specific popularity (e.g., most-clicked element of this template). Could be manually created by template, or automatically learned (e.g., template-specific popularity above), or a mix. For example: manually curated extraction from top n templates from the top m wikis plus automated template-specific popularity for everything else; or manually assigned weights for top n templates from the top m wikis to bootstrap the process, but then new weights are learned.
      • For disambiguation pages, use general popularity or page-specific popularity to suggest links to top n items on the page.
      • Provide links to the same article in other languages
      • Provide links to related pages (using “more like” or other methods to search by similarity)
      • Provide links to related categories, or “best” pages from “top” categories (using various metrics)
      • (Can you tell I want to resurrect part or all of “explore similar”? See T149809)
    • Consider longest-prefix-of-query vs. article-title matching (or maybe longest-substring within a query)—possibly requiring at least two words. If this gives a good article title match, pop it to the top!
    • Consider searching for really, really good matches in some namespaces—especially the Category and File namespaces—even if not requested by the user.
  • Related queries: We can automatically generate additional possible queries and offer them to searchers, making them more specific, more general, or on related topics.
    • More specific searches: add additional keywords to the query; these can be based on various sources: similar queries from other users, redirect mining, query reformulation mining, or keywords extracted from the top n results (as a form of “more like this one”).
    • More general searches: using on-wiki categories, a lexical database like WordNet or WikibaseLexeme data, or other ontology, offer queries with higher-level terms. For example, a query with cat, dog, and hamster might get reformulated with animal, mammal, or pet in addition or instead of the others.
    • Related searches: using an ontology as above, offer query reformulations with different, related terms. So cat, dog, or hamster might generate options with gerbil, turtle, or parrot.
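The Moscou example could be prototyped against Wikidata's wbsearchentities API; a sketch using requests, with error handling and language detection omitted.

    import requests

    def wikidata_candidates(term, language):
        """Find Wikidata entities whose labels/aliases match `term` in `language`."""
        resp = requests.get("https://www.wikidata.org/w/api.php", params={
            "action": "wbsearchentities", "search": term,
            "language": language, "format": "json"})
        return [hit["id"] for hit in resp.json()["search"]]

    print(wikidata_candidates("Moscou", "fr")[:1])  # e.g., ['Q649'] (Moscow)
    # A second call (action=wbgetentities) could then fetch the English label
    # to run the query on English Wikipedia, or we could link to frwiki directly.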


  • Analysis chain improvements—general and specific:
    • Sentence breaking: Not sure if we need to improve this, but could improve working with zones, and general proximity search.
    • Word segmentation:
      • Spaceless languages could use word segmentation algorithms, but even languages with spaces could benefit from better treatment of acronyms and initialisms (NASA vs N.A.S.A. vs N A S A—exploded acronyms are rare in English, but do occur; see the sketch after this list)
      • Normalization of compounds and hyphenated words (sportsball vs sports-ball vs sports ball)—see also breaking down compounds, and smarter hyphen processing
      • Abbreviation handling (abbreviation vs abbrev); could be via thesaurus or other mechanism, word-sense disambiguation could also help; abbreviation detection could improve sentence breaking
    • Pay attention to capitalization. If a user goes to the trouble to capitalize some letters and not others, maybe it means something. IT isn't necessarily it, and MgH isn't MGH.
    • Lots of specific language-specific improvements.
  • Support for non-standard input methods: some people don’t have ready access to a keyboard that supports their language fully. It can be a problem for specific characters—like using one (1), lowercase L (l), or uppercase i (I) for the palochka (Ӏ) in Kabardian (see T186401); a community can strongly prefer a non-Unicode encoding (e.g., T191535); or there may be no suitable keyboard at all, meaning all input is in transliteration.
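A sketch of acronym normalization as a token-level rewrite, per the N.A.S.A. example above: collapse dotted (or dotted-and-spaced) single-letter sequences into one token so N.A.S.A., N. A. S. A., and NASA can all match. The patterns are simplistic.

    import re

    DOTTED = re.compile(r"\b(?:[A-Za-z]\.\s?){2,}")  # N.A.S.A. or N. A. S. A.

    def normalize_acronyms(text):
        """Rewrite dotted acronyms as solid tokens: 'N.A.S.A.' -> 'NASA'."""
        return DOTTED.sub(
            lambda m: re.sub(r"[.\s]", "", m.group()).upper() + " ", text)

    print(normalize_acronyms("The N.A.S.A. and I.B.M. archives"))
    # 'The NASA and IBM archives'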


  • General Machine Learning: We should keep in mind that ORES is available to provide "machine learning as a service." It might be useful in general for us to become familiar with using ORES to do the right kind of number crunching for us.


A Sigh and a Caveat

Whew! Suggestions, questions, comments, etc are welcome on the Talk page!

I know I have a bias for English and for Wikipedia. I’m better at thinking about other languages, but not so great at thinking about use cases beyond Wikipedia and Wiktionary (my favorite wiki). Lots of these ideas are general, but some are more useful for an encyclopedia than a dictionary, and might not be super useful at all for other projects. If you have any ideas for NLP applications on other wikis, please share!

Some general references available online

  1. For the burgeoning linguistics nerds, most Indo-European languages don't have extremely complicated morphology that gloms together a lot of elements into one word. See polysynthetic languages for the opposite extreme. Russian, Greek, Hindi, and Persian are all Indo-European languages, and the Greek and Cyrillic alphabets might do fine, while Hindi's Devanagari abugida and Persian's Arabic script abjad may make them less amenable to the same statistical methods. Spaceless languages, like Chinese (which also has logographic writing), Japanese (mixed logographic and syllabaries), and Thai (abugida) may also have a much harder time using intraword statistical methods.