Wikimedia Technology/Annual Plans/FY2019/TEC8: Search Platform/Goals

From MediaWiki.org

Program Goals and Status for FY18/19

  • Goal Owner: Guillaume Lederrey
  • Program Goals for FY18/19: Through incremental Search Platform component improvements, teams and developers can deliver more and better ways for readers and editors to discover and search for content.
  • Annual Plan: TEC8: Search Platform
    • Primary Goal is Knowledge as a Service: Evolve our systems and structures
    • Tech Goal: Supporting our Community of contributors


Q1

Outcome 1 / Output 1

Through incremental Search Platform component improvements, teams and developers can deliver more and better ways for readers and editors to search for content across languages.

Incorporate Natural Language Processing (NLP) in the machine learning analysis pipeline for search

Dependency: Will need some short-term consulting help during implementation

Goal(s)

  • Select 1 or 2 NLP applications and prototype the features

Status

Note: July 2018

In progress

Note: August 21, 2018

In progress. The contract contents have been written up and recruiting will start soon.

Note: September 20, 2018

Partially done. This work continues, but the prototype won't be completed this quarter.


Outcome 1 / Output 2

Through incremental Search Platform component improvements, teams and developers can deliver more and better ways for readers and editors to search for content across languages.

Evaluation of image features for search ranking

Goal(s)

Investigate and evaluate image-level features for image search ranking (i.e., image quality score in ML indexing) (Stretch goal)

Status

Note: July 2018

In progress

Note: August 21, 2018

In progress. We're using an older test that Miriam Redi created; meetings and Q&A are ongoing.

Note: September 25, 2018

Done.

Outcome 1 / Output 3

Through incremental Search Platform component improvements, teams and developers can deliver more and better ways for readers and editors to search for content across languages.

Better understanding of the effectiveness of our improvements to search and the performance of our tooling on the back end

Dependency: Analytics (Audiences)

Goal(s)

Revise search metrics and dashboard

Status

Note: July 2018

In progress

Note: August 21, 2018

In progress. We'll be reaching out to the Research team for assistance and bringing them together with our help from Audiences for this goal.

Note: September 20, 2018

Stalled, as the Research team is busy with other priorities.

Note: September 25, 2018

In progress. Analytics is working on the dashboard and documentation of search metrics.

Outcome 1 / Output 4

Through incremental Search Platform component improvements, teams and developers can deliver more and better ways for readers and editors to search for content across languages.

Improved support for multiple languages by researching and deploying new language analyzers where feasible on individual language wikis.

Goal(s)

Morphological library investigations and implementations (specific languages TBD)

Status

Note: July 2018

In progress

Note: August 21, 2018

In progress. The Esperanto plugin is partially done and Malay is done; both will need to be deployed into production, along with fixes for other small language bugs, in the next couple of weeks.

Note: September 25, 2018

Done. Esperanto is done and has been deployed and re-indexed. Korean is still in progress.

Outcome 1 / Output 5

Through incremental Search Platform component improvements, teams and developers can deliver more and better ways for readers and editors to search for content across languages.

Specific media search filters for Wikidata/Wikibase and the relationships to the topics they represent will be better supported using structured data and other techniques.

Dependency: WMDE

Goal(s)

  • Lexeme search implementation: complete search implementation for all modes for Lemmas and Forms (Done)
  • Investigate applying machine-learning-enabled ranking to Wikidata searches, start collecting click data for Wikidata completion searches, and start developing machine-learning models for Wikidata search relevancy (In progress)

Status

Note: July 2018

In progress

Note: August 21, 2018

In progress. Lexeme search work is ongoing, but we are running into a few small issues with the queries, how results are presented, and how we want to do this in the future. We are also working on extracting the Cirrus-specific code out of Wikibase and getting the Lexemes into the Wikidata Query Service.

Note: September 11, 2018

In progress. Lexeme search work is done; the models are still in progress and will most likely be part of next quarter's work.


Outcome 2 / Output 1

Technical debt addressed and required maintenance completed for Search Platform components

Elasticsearch upgrades and server replacements

Dependency: SRE

Goal(s)

  • Continue to prepare for a major upgrade to Elasticsearch 6 (In progress)
  • Replace Elasticsearch servers which are at the end of their lease (Stalled)
  • Migrate Elasticsearch servers to RAID 0 (Done)

Status

Note: July 2018

In progress

Note: August 21, 2018

In progress. Gehel has re-striped everything and migrated the Elasticsearch servers to Stretch. The lease on the servers expires next month; work on migrating other (maps) servers to Stretch will follow.

Note: September 20, 2018

Partially done. Prep work is still ongoing, but the data center switch is taking some time away from it. The actual full upgrade will be part of Q2's work and will require a few weeks of stress testing; we are also working on the full migration sequence path, documentation, and shard checks.

Note: September 25, 2018

Partially done. There is still prep work to be done in Q2 for the ES6 upgrade, and we will not get to the actual upgrade until Q3; in the meantime we are working with local ES6 instances. Procurement of replacement Elasticsearch servers is stalled on the quotes and will be done in Q2.

Outcome 2 / Output 2

Technical debt addressed and required maintenance completed for Search Platform components

Higher capacity for WDQS to improve its ability to power features on-wiki for readers and the growing set of features for supporting structured data

Dependencies: SRE, WMDE

Goal(s)

  • Add storage to WDQS servers (Done)
  • Enable Kafka event consumption (In progress)
  • Separate the Wikidata Elasticsearch implementation into a separate extension (Stalled)
  • Investigate Blazegraph support options and alternatives (Stretch goal) (To do)

Status

Note: July 2018

To do. Waiting on disks to arrive.

Note: August 21, 2018

In progress. Most of the disks have now arrived; next we need to take the servers offline and re-image them, as growing the cluster is difficult.

Note: September 20, 2018

Partially done. Kafka event consumption work is continuing, but separating the Wikidata implementation into a different extension will be moved to Q2 for completion.

Q2

Outcome 1 / Output 1

Through incremental Search Platform component improvements, teams and developers can deliver more and better ways for readers and editors to search for content across languages.

Incorporate Natural Language Processing (NLP) in the machine learning analysis pipeline for search

Dependency: Will need some short-term consulting help during implementation

Goal(s)

  • Find and hire a contractor to help with NLP work
  • Begin working on one internal NLP project (wrong keyboard detection, starting with Russian/English) (T138958)

Status

Note: October 29, 2018

In progress. Interviews are ongoing, and we'll probably start the contractor work in early Q3. Work is now ongoing on wrong keyboard detection.

Note: November 29, 2018

Wrong keyboard work is in progress, and the NLP contractor work will still happen in Q3.

Note: December 13, 2018

A contractor has been identified and work on the contract is starting; we consider this done.

Outcome 1 / Output 2

Through incremental Search Platform component improvements, teams and developers can deliver more and better ways for readers and editors to search for content across languages.

Better understanding of the effectiveness of our improvements to search and the performance of our tooling on the backend

Goal(s)

  • Improve autocomplete of Wikidata items
    • Expand our machine learning to Wikidata and Commons (autocomplete relevance, considering multilingual use)
  • Prototype a feature that is based on collected data
    • Continued work from Q1 with the collection of click logs for the autocomplete feature

Status

Note: October 29, 2018

In progress. This is actively being worked on.

Note: November 29, 2018

In progress. This is actively being worked on.

Note: December 13, 2018

Trey has this in progress, but it will probably flow into early Q3 with the NLP contractor.


Outcome 1 / Output 3

Through incremental Search Platform component improvements, teams and developers can deliver more and better ways for readers and editors to search for content across languages.

Improved support for multiple languages by researching and deploying new language analyzers where feasible on individual language wikis.

Goal(s)

  • Finish up the Korean morphological library analysis and deploy into production (carry-over work from Q1)
    • This work is dependent on the upgrade to Elasticsearch 6 finishing
  • General language support (i.e., miscellaneous language-specific bugs)

Status

Note: October 29, 2018

The Korean goal is now done, but it won't be updated in production until the Elasticsearch 6 upgrade.


Outcome 1 / Output 4

Through incremental Search Platform component improvements, teams and developers can deliver more and better ways for readers and editors to search for content across languages.

Specific media search filters for Wikidata/Wikibase and the relationships to the topics they represent will be better supported using structured data and other techniques.

Dependency: WMDE

Goal(s)

  • [figuring out what SDC wants]
  • Search for licenses in Commons

Status

Note: October 29, 2018

Stalled. We are still waiting on a few things from the SDoC program: hard specs on what is expected from the Search Platform in order to develop features based on the presentations.

Note: November 29, 2018

Stalled. We are still waiting on a few things from the SDoC program: hard specs on what is expected from the Search Platform in order to develop features based on the presentations.

Note: December 13, 2018

This goal is still stalled as we await further instructions from the SDoC program.


Outcome 2 / Output 1

Technical debt addressed and required maintenance completed for Search Platform components

Elasticsearch upgrades and server replacements

Dependency: SRE, WMDE

Goal(s)

  • Split the search clusters to increase stability (Partially done)
  • Continue replacing Elasticsearch servers (end-of-life maintenance) (Done)
  • Separate the Wikidata Elasticsearch implementation into a separate extension (Partially done)
  • Migrate Elasticsearch cluster restart scripts to cookbooks using Spicerack (In progress)
  • Stretch goal: Start working on the CloudElastic replicas (and perform a proof of concept with a few select wikis) (To do)

Status

Note: October 29, 2018

In progress. Quotes from vendors were much higher than expected; Gehel, Erik, and SRE are going over requests and refining configurations to bring them within our budget.

Note: November 29, 2018

Still in progress. Purchases for the new servers have been made, and the servers are being delivered and installed.

Note: December 13, 2018

Splitting the search clusters is done and testing is still in progress (should be done in the next week); replacing the older servers can be considered done; separating the Wikidata ES implementation is done, but the new extension is still in progress and will be completed in Q3. Migrating the ES cluster restart scripts to Spicerack is still in progress and will need testing.
The stretch goal has been pushed to Q3.


Outcome 2 / Output 2

Technical debt addressed and required maintenance completed for Search Platform components

Higher capacity for WDQS to improve its ability to power features on-wiki for readers and the growing set of features for supporting structured data

Dependencies: SRE, WMDE

Goal(s)

  • Performance and bug fixes for WDQS (In progress)
  • Service Level Objective (SLO) work for WDQS (T199228) (In progress)
  • Carryover from Q1: continue to investigate Blazegraph support options and alternatives (Done)

Status

Note: October 29, 2018

The SLO work is somewhat stalled due to ongoing conversations, though still partly in progress; meanwhile, a lot of work has been going into fixes for WDQS. The team is still investigating long-term usage of Blazegraph and what that means for the Foundation.

Note: November 29, 2018

This is now in progress.

Note: December 13, 2018

These goals are still in progress, but the SLO work will flow into Q3. The investigation of Blazegraph support options is now done and we are determining next steps.

Q3

Outcome 1 / Output 1

Through incremental Search Platform component improvements, teams and developers can deliver more and better ways for readers and editors to search for content across languages.

Incorporate Natural Language Processing (NLP) in the machine learning analysis pipeline for search

Dependency: Will need some short-term consulting help during implementation

Tech Goal(s)

C: Improve our own feature set

Goal(s)

  • Work with NLP contractor on spelling and did-you-mean result tuning (T212884 and T212888)
  • Complete Russian/English wrong keyboard detection (T138958)

Status

Note: January 22, 2019

  • The wrong keyboard detection work is in progress; the NLP contract work is stalled for now due to certain legal issues.

Note: February 25, 2019

  • The wrong keyboard work has been postponed for now, as the NLP work has kicked off and is fully in progress.

Note: March 19, 2019

  • Work with Julia (contractor) is in progress, but getting the work into production will most likely slip into Q4.
  • Deployed wrong-keyboard language-ID models needed for future use. (T213931 / T216083)


Outcome 1 / Output 2

Through incremental Search Platform component improvements, teams and developers can deliver more and better ways for readers and editors to search for content across languages.

Better understanding of the effectiveness of our improvements to search and the performance of our tooling on the backend

Tech Goal(s)

C: Improve our own feature set

Goal(s)

  • Deploy improved autocomplete for English Entities
  • Use machine learning to improve autocomplete of Wikidata items for three more languages

Status

Note: January 22, 2019

  • The patch for this is done and going out with this week's train; the A/B tests will start soon.

Note: February 25, 2019

  • Tested Spanish, English, and French; all tests went well and the feature is now deployed in production. Done.


Outcome 1 / Output 3

Through incremental Search Platform component improvements, teams and developers can deliver more and better ways for readers and editors to search for content across languages.

Improved support for multiple languages by researching and deploying new language analyzers where feasible on individual language wikis.

Tech Goal(s)

A: Foundation Goal: Reach
B: Serving our audiences
C: Improve our own feature set
D: Technical Debt

Goal(s)

  • Deploy Korean morphological library into production (after Elasticsearch is upgraded)
  • General language support (i.e., miscellaneous language-specific bugs)

Status

Note: January 22, 2019

  • We're still working on upgrading Elasticsearch, so the deployment of the Korean library is still stalled at this point. General language support is in progress and ongoing.

Note: February 25, 2019

  • The Korean update is partially done and is planned to be rolled out next week as part of the ES6 upgrade. General language support is in progress and ongoing.

Note: March 19, 2019

  • The Korean update work is partially done and will be completed when ES6 is done.
  • Improved analysis for Greek, Turkish, and Irish with language-specific lowercasing (T203117/T217602)


Outcome 2 / Output 1

Technical debt addressed and required maintenance completed for Search Platform components

Elasticsearch 6 upgrades and server replacements

Dependency: SRE, WMDE

Tech Goal(s)

C: Improve our own feature set
D: Technical Debt

Goal(s)

  • Upgrade Elasticsearch to v6
  • Complete separation of Wikidata Elasticsearch implementation into a separate extension
  • JVM and Elasticsearch upgrade Spicerack cookbooks
  • Stretch goal: Start working on the CloudElastic replicas (and perform a proof of concept with a few select wikis)

Status

Note: January 22, 2019

  • The upgrade is still in progress; we are troubleshooting issues with cluster states and building a large new set of shards to handle the increased load.
  • The separation of the Wikidata Elasticsearch implementation is still in progress, but other work is also ongoing at the moment.
  • The JVM and ES upgrade cookbook work is in progress; we are doing first passes at the integration testing and will have a first set of cookbooks soon.
  • The stretch goal might be a bit closer than we initially thought; once ES6 is in production we can finish this up, maybe in mid-February.

Note: February 25, 2019

  • ES6 is in progress and planned to be deployed the week of March 4th.
  • The Wikidata Elasticsearch implementation is still in progress; we hope to get it mostly done by the end of this quarter, but the deployment might go over into next quarter.
  • The JVM and Elasticsearch upgrade is still in progress: the ES6 cookbooks are done and only a little remains to do for the rest of the ES6 upgrade. (T202885)
  • The stretch goal is in progress (T214921).

Note: March 19, 2019

  • The Relforge update will be done soon (this week).
  • The stretch goal (CloudElastic replicas) is still in progress but will extend into Q4's work.


Outcome 2 / Output 2

Technical debt addressed and required maintenance completed for Search Platform components

Higher capacity for WDQS to improve its ability to power features on-wiki for readers and the growing set of features for supporting structured data

Dependencies: SRE, WMDE

Tech Goal(s)

B: Serving our audiences
C: Improve our own feature set
D: Technical Debt

Goal(s)

  • Performance and bug fixes for WDQS
  • Service Level Objective (SLO) work for WDQS (T199228) (In progress)
  • Release official Blazegraph code package and start using Blazegraph CI
  • Roadmap planning for WDQS

Status

Note: January 22, 2019

  • Perf / bug fixing is ongoing and in progress, along with a new roadmap for deciding which fixes come first.
  • SLO work is in progress; we'll talk about it during All Hands week.
  • We are having issues with Blazegraph right now and are working on them, so the issues are in progress, but the official code package is still to do.

Note: February 25, 2019

  • Perf / bug fixing for WDQS is ongoing and in progress.
  • SLO work is in a waiting status for us, but we're keeping it in progress.
  • Blazegraph's CI is still getting attention and is in progress (T216855).

Note: March 19, 2019

  • Perf / bug fixing for WDQS is ongoing and in progress.
  • SLO work for Blazegraph is a bit stalled right now, as we're continuing a few conversations about public- and internal-facing WDQS uptime and expectations of the service.
  • The official Blazegraph code package release is done; we're working on a few config changes for using Blazegraph in CI, but this can be considered done at this point.
  • Roadmap planning for Blazegraph is done as well.

Q4

Outcome 1 / Output 1

Through incremental Search Platform component improvements, teams and developers can deliver more and better ways for readers and editors to search for content across languages.

Incorporate Natural Language Processing (NLP) in the machine learning analysis pipeline for search

Dependency: Using short-term consulting help

Tech Goal(s)

C: Improve our own feature set

Goal(s)

  • Complete 'did you mean' (DYM) project with NLP contractor (T212884)
    • Assess performance and investigate beginning integration work for deployment
  • Enable new Korean analysis, reindex clusters, and retrain the 'learning to rank' (LTR) model (T216738)

Status

Note: May 28, 2019

  • The 'did you mean' work is in progress and the team is reviewing the code patches (method 1 is done, method 2 is in progress).
  • The work on the Korean analysis is done and we are currently gathering data after the reindex. The retraining is still to do.

Note: June 25, 2019

  • DYM is essentially done as far as the work with our contractor is concerned. Method 2 works, but we need to find another way to evaluate it; evaluation is still in progress and will continue into next FY/quarter.
  • Assessing performance and integration work is in progress.
  • Korean retraining is still to do and will move into next FY/quarter. We want to set up an automated pipeline.


Outcome 1 / Output 3

Through incremental Search Platform component improvements, teams and developers can deliver more and better ways for readers and editors to search for content across languages.

Improved support for multiple languages by researching and deploying new language analyzers where feasible on individual language wikis.

Tech Goal(s)

A: Foundation Goal: Reach
B: Serving our audiences
C: Improve our own feature set
D: Technical Debt

Goal(s)

  • Search "Local Impact": making bigger improvements for smaller (or underrepresented) communities (T218613)
  • Investigate and continue to revise search metrics and dashboards (T216055)

Status

Note: May 28, 2019

  • Completion Suggester improvements under T117217 are done.
  • Other improvements under epic T218613 are still in progress.

Note: June 25, 2019

  • This work will continue into next FY/quarter.


Outcome 2 / Output 1

Technical debt addressed and required maintenance completed for Search Platform components

Elasticsearch 6 upgrades and server replacements

Dependency: SRE

Tech Goal(s)

C: Improve our own feature set
D: Technical Debt

Goal(s)

  • Add the ability for the cloud environment to use Elasticsearch and perform a proof of concept with a few select wikis (T214921)

This work is often referred to as "Cloud Elastic Replicas".

Status

To do: April 2019

There might be some work needed this month after the ES6 upgrade, which should be finished in Q3.

Note: May 28, 2019

  • The upgrade to ES6 is done and the CloudElastic work is in progress. We might need to do some more work with the deprecation loggers (minor cleanup is still in progress on https://phabricator.wikimedia.org/T218994).

Note: June 25, 2019

  • CloudElastic work is still in progress; we're waiting on the load balancer to be activated, after which we can automate the updates.
  • The deprecation loggers work is partially done; it will wrap up early next quarter and continue until the next Elasticsearch upgrade.


Outcome 2 / Output 2

Technical debt addressed and required maintenance completed for Search Platform components

Higher capacity for WDQS to improve its ability to power features on-wiki for readers and the growing set of features for supporting structured data

Dependencies: SRE, WMDE

Tech Goal(s)

B: Serving our audiences
C: Improve our own feature set
D: Technical Debt

Goal(s)

  • Improve the performance and functionality of Blazegraph (the graph storage engine behind Wikidata Query Service).
    • Increase Blazegraph support for SPARQL operations. Example: T200612
    • Improve query performance of Blazegraph in areas of particular interest for WMF. Example: T152773

Status

Note: May 28, 2019

  • Improving Blazegraph is still in progress and we are working on the existing bugs with SPARQL operations. The query performance work is going well and is ready for deployment with a full DB reload; it is partially done and should be completely done in June.

Note: June 25, 2019

  • Blazegraph performance and functionality improvements are ongoing; lots of bug fixes are in progress, and we need to investigate how to do rebuilds while having different configurations. This work will continue into next FY.