Wikimedia Technology/Annual Plans/FY2019/TEC8: Search Platform

From mediawiki.org

Search Platform provides the infrastructure and back-end tooling for content discovery across the Wikimedia/Wikipedia landscape. This includes not only surfacing relevant content when readers are looking for it, but also guiding people to content when they may not have expressed themselves clearly, or might not know exactly what they're looking for. We do this across languages, for both MediaWiki and Wikidata. Our main focus is on using machine learning and NLP to improve the ranking and relevance of search results, and on providing front-end teams with interfaces to those results that can be used to improve the search experience for readers and editors.

During the 2018-19 fiscal year, Search Platform will also be significantly contributing to the Structured Data on Commons cross-department program.

Program outline

Teams contributing to the program

Search Platform, WMDE, Audiences (75% of one analyst)

Annual Plan priorities

Primary Goal: 3. Knowledge as a Service - evolve our systems and structures

How does your program affect annual plan priority?

To deliver Knowledge as a Service effectively, it is critical that we provide excellent search and discovery tooling. By utilizing machine learning, the Search Platform team has already laid the foundations of a highly tunable search result ranking engine. This gives us the opportunity to expand the array of features that influence the ranking of search results, but there are still many improvements we can make that will help us surface more relevant results, both within and across languages. These include incorporating natural language processing (NLP) and phonetic matching, and adding more specific language analyzer plugins to Elasticsearch that can deal with the nuances between closely related yet distinct languages. With these improvements, we will be even better positioned to lead readers to the content they are seeking, and to expose them to more accurate related content that will keep them exploring ever deeper into the knowledge space.
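As an illustration of what phonetic matching means in practice, the sketch below implements classic American Soundex, which maps similar-sounding names to the same short code. This is an assumption chosen for illustration only: the plan does not say which phonetic algorithm would be used, Soundex is English-centric, and the function name `soundex` is hypothetical rather than part of CirrusSearch or Elasticsearch.

```python
# Illustrative only: classic American Soundex, not the algorithm named by the plan.
# Letters that share a digit are treated as sounding alike.
CODES = {}
for letters, digit in (
    ("bfpv", "1"),
    ("cgjkqsxz", "2"),
    ("dt", "3"),
    ("l", "4"),
    ("mn", "5"),
    ("r", "6"),
):
    for ch in letters:
        CODES[ch] = digit


def soundex(word: str) -> str:
    """Return the 4-character Soundex code for a word, or "" if it has no letters."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    encoded = first.upper()
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            # h and w are skipped and do not separate same-coded consonants
            continue
        code = CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        # vowels (and any uncoded letter) reset prev, so a repeated code
        # after a vowel is emitted again
        prev = code
    # pad with zeros and truncate to one letter plus three digits
    return (encoded + "000")[:4]


if __name__ == "__main__":
    for name in ("Robert", "Rupert", "Ashcraft", "Tymczak"):
        print(name, soundex(name))
```

With this sketch, "Robert" and "Rupert" both encode to R163, so a query for one could be matched against documents containing the other. A production search engine would typically apply such an encoding inside a language-specific analysis chain rather than in application code.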

In addition to increasing the relevance of search results, the Search Platform team will be working closely with the Structured Data on Commons team to implement search requirements for this next-generation implementation of Commons.

Program Goal

Through incremental Search Platform component improvements, teams and developers can deliver more and better ways for readers and editors to discover and search for content.

Outcome 1
The advanced machine learning techniques we implement will improve search result relevance across language Wikipedias.
Output 1
Continue to identify new features for machine learning and incorporate them into the Machine-Learning-to-Rank (MLR) pipeline
Output 2
Experiment with Natural Language Processing (NLP) to improve the machine learning results
Output 3
Maintain CirrusSearch and the Search API
Outcome 2
Users across languages experience better search results.
Output 4
New language analyzers deployed to improve support for multiple languages (as they make sense to individual language wikis).
Outcome 3
Wikidata Query Service expanded with deeper, cross-wiki search features
Output 5
Deep category and full-text search for Wikidata via WDQS (in preparation for Structured Data on Commons).
Outcome 4
Search Platform gains a much deeper understanding of search performance and improvement impact metrics
Output 6
Dashboard of new and relevant metrics that encompass the performance and impact of machine learning in content discovery

Resources

People (OpEx)

FY2017–18
Current team, contributing to all outcomes:
  • 3 x Senior Software Engineers
  • 1 Software Engineer
  • 1 Ops Engineer
  • 1 Engineering Manager
  • .75 of a Data Analyst (shared from Audiences)
Short-term contract resources:
  • $50K contracting budget for MLR plugin engineering

FY2018–19
  • Senior Software Engineers (no change)
  • Senior Software Engineers (no change)
  • Senior Software Engineers (no change)
  • Software Engineer (no change)
  • Ops Engineer (no change)
  • Engineering Manager (no change)
  • 0.75 ✕ Data Analyst (no change, shared from Audiences) (Outcomes 1, 2, 3, & 4)
  • 0.5 ✕ Researcher (new, shared with Knowledge Gap program) (Outcomes 1, 2, and 4)
Short-term contract resources:
  • contracting budget for NLP engineering assistance (new, arguably part of baseline from Audiences/Search) (Outcome 1)
  • contracting budget for Blazegraph engineering assistance (new) (Outcome 3)

Stuff (CapEx)
  • New ElasticSearch cluster servers

Travel & Other

FY2017–18
  • 4 x Wikimania
  • 2 x Dev Summit
  • 6 x Wikimedia Hackathon
  • 6 x professional conference

FY2018–19
  • n/a x Wikimania (centralized)
  • 4 x Dev Summit (+2)
  • 6 x Wikimedia Hackathon (no change)
  • 6 x professional conference (no change)

Targets

Outcome 1

  • Search Platform has a set of clear running metrics in a dashboard indicating the performance (and improvement trajectory) of the machine learning mechanisms, encompassing coding efficiency (i.e., how fast new features can be implemented and trained), processing performance (i.e., how long it takes to train models), and result relevance.
Target
  • Increase processing performance and efficiency, along with result relevance, based on baseline metrics to be identified and implemented in a dashboard during FY2017-18.
Measurement method

Still being determined
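While the measurement method is still being determined, it may help to see what a result relevance metric can look like. The sketch below computes normalized discounted cumulative gain (nDCG), one widely used ranking metric. It is an assumption for illustration, not the metric this program has committed to, and the function names `dcg` and `ndcg` and the graded relevance labels are hypothetical.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: graded relevance discounted by log2 of rank + 1."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )


def ndcg(relevances):
    """DCG of the observed ordering divided by DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    # If no result has any relevance, define the score as 0.0
    return dcg(relevances) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical relevance labels for the top results of one query,
    # in the order the search engine returned them (higher is better).
    observed = [3, 0, 2, 1]
    print(round(ndcg(observed), 3))
```

A score of 1.0 means the results are already in the best possible order for the given labels. Tracking an average of such a score over a fixed query set is one way a dashboard could show a relevance trajectory over time.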

Dependencies

Audiences has been providing 75% of one analyst to help with data analysis for Search Platform A/B tests, and we still need that resource.