Wikimedia Technology/Annual Plans/FY2019/TEC8: Search Platform/Goals

= Program Goals and Status for FY18/19 =

TEC8: Search Platform
 * Goal Owner: Erika Bjune
 * Program Goals for FY18/19: Through incremental Search Platform component improvements, teams and developers can deliver more and better ways for readers and editors to discover and search for content.
 * Annual Plan: TEC8: Search Platform
 * Primary Goal is Knowledge as a Service: Evolve our systems and structures
 * Tech Goal: Supporting our Community of contributors



= Q1 Goals =

Outcome 1 / Output 1
Through incremental Search Platform component improvements, teams and developers can deliver more and better ways for readers and editors to search for content across languages.
 * Incorporate Natural Language Processing (NLP) in the machine learning analysis pipeline for search

Dependency: Will need some short-term consulting help during implementation

Goal(s)

 * Select 1 or 2 NLP applications and prototype the features

Status
July 2018

August 21, 2018
 * Contract contents written up and will start recruiting soon.

September 20, 2018
 * This work continues, but the prototype won't be completed this quarter.

Outcome 1 / Output 2
Through incremental Search Platform component improvements, teams and developers can deliver more and better ways for readers and editors to search for content across languages.
 * Evaluation of image features for search ranking

Goal(s)
Investigate and evaluate image-level features for image search ranking (i.e., image quality score in ML indexing) (Stretch goal)

Status
July 2018

August 21, 2018
 * We're using an older test that Miriam Redi created, and meetings / questions and answers are ongoing

September 25, 2018

Outcome 1 / Output 3
Through incremental Search Platform component improvements, teams and developers can deliver more and better ways for readers and editors to search for content across languages.
 * Better understanding of the effectiveness of our improvements to search and the performance of our tooling on the back end

Dependency: Analytics (Audiences)

Goal(s)
Revise search metrics and dashboard

Status
July 2018

August 21, 2018
 * We'll be reaching out to the Research team for assistance and bringing them together with our help from Audiences for this goal.

September 20, 2018
 * This is currently ❌ as the Research team is a bit busy with other priorities.

September 25, 2018
 * Analytics is working on the dashboard and documentation of search metrics

Outcome 1 / Output 4
Through incremental Search Platform component improvements, teams and developers can deliver more and better ways for readers and editors to search for content across languages.
 * Improved support for multiple languages by researching and deploying new language analyzers where feasible on individual language wikis.

Goal(s)
Morphological library investigations and implementations (specific languages TBD)

Status
July 2018

August 21, 2018
 * Esperanto plugin is, Malay is ✅, both will need to be deployed into production with other small language bugs in the next couple of weeks.

September 25, 2018
 * ✅ : Esperanto is ✅ and has been deployed and re-indexed. Korean is still in progress.

Outcome 1 / Output 5
Through incremental Search Platform component improvements, teams and developers can deliver more and better ways for readers and editors to search for content across languages.
 * Specific media search filters for Wikidata/Wikibase and the relationships to the topics they represent will be better supported using structured data and other techniques.

Dependency: WMDE

Goal(s)

 * Lexeme search implementation: complete search implementation for all modes for Lemmas and Forms ✅
 * Investigate applying machine-learning enabled ranking to Wikidata searches, start collecting click data for Wikidata completion searches and start developing machine-learning models for Wikidata search relevancy.

Status
July 2018

August 21, 2018
 * Lexeme search work is ongoing, but we are running into a few small issues with the queries, with how results are presented, and with how we want to handle this in the future. We are also working on extracting the CirrusSearch code out of Wikibase and getting the Lexemes into the Wikidata Query Service.

September 11, 2018
 * Lexeme search work is ✅; the models are still in progress but will most likely be part of next quarter's work.

Outcome 2 / Output 1
Technical debt addressed and required maintenance completed for Search Platform components
 * Elasticsearch upgrades and server replacements

Dependency: SRE

Goal(s)

 * Continue to prepare for a major upgrade to Elasticsearch 6
 * Replace Elasticsearch servers that are at the end of their lease (stalled)
 * Migrate Elasticsearch servers to RAID 0 ✅

Status
July 2018

August 21, 2018
 * Gehel has re-striped everything and migrated the Elasticsearch servers to Stretch. Lease expires on the servers next month; will also be working on migrating other (maps) servers to Stretch.

September 20, 2018
 * Prep work is still ongoing, but the data center switch is taking some time away from it. The actual full upgrade will be part of Q2's work and will require a few weeks of stress testing; we are also working on the full migration sequence path, documentation, and shard checks.

September 25, 2018
 * There is still prep work to be done in Q2 for the ES6 upgrade, so we will not get to the actual upgrade until Q3. In the meantime, we are working with local ES6 instances while the prep work is being done. Procurement of the replacement Elasticsearch servers is stalled on the quotes and will be done in Q2.

Outcome 2 / Output 2
Technical debt addressed and required maintenance completed for Search Platform components
 * Higher capacity for WDQS to improve its ability to power features on-wiki for readers and the growing set of features for supporting structured data

Dependencies: SRE, WMDE

Goal(s)

 * Add storage to WDQS servers ✅
 * Enable Kafka event consumption
 * Separate the Wikidata Elasticsearch implementation into a separate extension ❌
 * Investigate Blazegraph support options and alternatives (Stretch goal)

Status
July 2018
 * Waiting on disks to arrive.

August 21, 2018
 * Most of the disks have now arrived; next we need to take the servers offline and re-image them, as growing the cluster is difficult.

September 20, 2018
 * Kafka event consumption work is continuing, but separating the Wikidata implementation into a different extension will be moved to Q2 for completion.

= Q2 Goals =

Outcome 1 / Output 1
Through incremental Search Platform component improvements, teams and developers can deliver more and better ways for readers and editors to search for content across languages.
 * Incorporate Natural Language Processing (NLP) in the machine learning analysis pipeline for search

Dependency: Will need some short-term consulting help during implementation

Goal(s)

 * Find and hire a contractor to help with NLP work
 * Begin working on one internal NLP project (wrong keyboard detection, starting with Russian/English) (T138958)

Status
October 2018

November 2018

December 2018

Outcome 1 / Output 2
Through incremental Search Platform component improvements, teams and developers can deliver more and better ways for readers and editors to search for content across languages.
 * Improve autocomplete of Wikidata items

Goal(s)

 * Expand our machine learning to Wikidata and Commons (autocompletes for relevance, considering multilingual)

Status
October 2018

November 2018

December 2018

Outcome 1 / Output 3
Through incremental Search Platform component improvements, teams and developers can deliver more and better ways for readers and editors to search for content across languages.
 * Better understanding of the effectiveness of our improvements to search and the performance of our tooling on the back end

Goal(s)

 * Prototype a feature that is based on collected data
 * Continued work from Q1 with the collection of click logs for the autocomplete feature

Status
October 2018

November 2018

December 2018

Outcome 1 / Output 4
Through incremental Search Platform component improvements, teams and developers can deliver more and better ways for readers and editors to search for content across languages.
 * Improved support for multiple languages by researching and deploying new language analyzers where feasible on individual language wikis.

Goal(s)

 * Finish up the Korean morphological library analysis and deploy into production (carry-over work from Q1); this work depends on the upgrade to Elasticsearch 6 being finished
 * General language support (i.e., miscellaneous language-specific bugs)

Status
October 2018

November 2018

December 2018

Outcome 1 / Output 5
Through incremental Search Platform component improvements, teams and developers can deliver more and better ways for readers and editors to search for content across languages.
 * Specific media search filters for Wikidata/Wikibase and the relationships to the topics they represent will be better supported using structured data and other techniques.

Dependency: WMDE

Goal(s)

 * Develop machine-learning models for Wikidata search relevancy with Lexeme models

Status
October 2018

November 2018

December 2018

Outcome 2 / Output 1
Technical debt addressed and required maintenance completed for Search Platform components
 * Elasticsearch upgrades and server replacements

Dependencies: SRE, WMDE

Goal(s)

 * Split the search clusters to increase stability
 * Continue replacing Elasticsearch servers (end-of-life maintenance)
 * Separate the Wikidata Elasticsearch implementation into a separate extension
 * Migrate Elasticsearch cluster restart scripts to cookbooks using Spicerack
 * Stretch goal: Start working on the CloudElastic replicas (and perform a proof of concept with a few select wikis)

Status
October 2018

November 2018

December 2018

Outcome 2 / Output 2
Technical debt addressed and required maintenance completed for Search Platform components
 * Higher capacity for WDQS to improve its ability to power features on-wiki for readers and the growing set of features for supporting structured data

Dependencies: SRE, WMDE

Goal(s)

 * Performance and bug fixes for WDQS
 * Service Level Objective (SLO) work for WDQS (T199228)
 * Carryover from Q1: continue to investigate Blazegraph support options and alternatives

Status
October 2018

November 2018

December 2018