Wikimedia Search Platform/Decision Records/Recommendation Flags in Search

From mediawiki.org

Status:    In progress

Responsible What work/what role is this team/group playing? Point of Contact for this team/group (a person)
Search Platform Search Zbyszko Papierski
Accountable Why is this person accountable?
Consulted Context Point of contact
Informed Context Point of contact

What?

What is the problem or opportunity?

For the Structured Tasks project, we need to be able to filter articles by whether they have recommendations assigned to them. We currently support two types of recommendations, link and image, both of which are calculated beforehand and assigned to articles. In both cases, the results are recalculated periodically.

What does the future look like if this is achieved?

The ability to filter results will allow a flexible display of articles, where the recommendation filter can be combined with other criteria such as full-text search or category.

What happens if we do nothing?

A static list of articles can still be displayed without help from search, but that would greatly reduce the usability of the solution.

Why?

Value Objective or Value it Supports and how
Streamlined Search Platform
Dynamic recommendation lists

Current background

Assumptions and requirements
The actual recommendations are not necessary for the search process
Articles with recommendations should be marked as such and there should be a way to clear this mark
It should be possible to run a full-text or filtered search that includes a "recommendations available" filter
Search has a preexisting feature for handling articletopics and drafttopics
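The marking, clearing, and filtering requirements above can be pictured with a small in-memory sketch. It assumes the "recommendation.<kind>/exists|1" tag format described under "Desired solution" below; the helper names, the article dictionaries, and the "category" key are hypothetical and are not part of CirrusSearch.

```python
# Hypothetical helpers; a sketch of the requirements, not the real implementation.

def flag_for(kind: str) -> str:
    """Build the recommendation flag for a kind such as 'link' or 'image'."""
    return f"recommendation.{kind}/exists|1"


def mark(tags: list[str], kind: str) -> list[str]:
    """Return a copy of tags with the recommendation flag present exactly once."""
    flag = flag_for(kind)
    return list(tags) if flag in tags else list(tags) + [flag]


def clear(tags: list[str], kind: str) -> list[str]:
    """Return a copy of tags without any entry from the given recommendation source."""
    prefix = f"recommendation.{kind}/"
    return [t for t in tags if not t.startswith(prefix)]


def has_recommendation(tags: list[str], kind: str) -> bool:
    """True if any tag comes from the given recommendation source."""
    prefix = f"recommendation.{kind}/"
    return any(t.startswith(prefix) for t in tags)


# Made-up articles, used only to show the flag combined with another criterion.
articles = [
    {"title": "A", "category": "Radio", "weighted_tags": mark([], "link")},
    {"title": "B", "category": "Radio", "weighted_tags": []},
    {"title": "C", "category": "Physics", "weighted_tags": mark([], "link")},
]

hits = [
    a["title"]
    for a in articles
    if a["category"] == "Radio" and has_recommendation(a["weighted_tags"], "link")
]
```

Here `hits` contains only the article that satisfies both the category criterion and the recommendation flag, and `clear` removes the mark without touching tags from other sources.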

Options

Only one option is written down - the decision was made during a meeting

Weighted flag for external input
Description There will be a field named weighted_tags that will contain the externally provided tags for articles
Pros
  • Simplified handling of external data
  • Tag syntax proposed allows for easy identification of the source
Cons
  • It only represents a boolean value with a weight, calculated as term frequency
Risks
  • Need an additional option to delete the values - they aren't a part of the original document
  • Adding a new tag is still fairly manual - ideally, we should reach the point where adding a new one is self-service and requires only the Search team's approval (as in the case of EventGate)
Effort Implementing tags in search is a simple task, but we also need to migrate the existing field to a new name and format. There is no estimate yet, but it will take some time.
Costs No additional costs
Reference Any links to additional materials or more detailed plans?
Decision Type Decision is reversible, but since it's a matter of internal implementation (externally, search features are not affected by the change), there is a low chance of that.

Important Questions

Question Who can answer? Resolution, answer or action
Delete API - what should it look like?
How should we model the process of adding new tags in the future?

Decision

Option Weighted flag for external input
Rationale Leaving the situation unattended generates additional technical debt, which we didn't want. The chosen solution isn't complete, but it will allow the next steps.
Data No data yet
Who Search Platform team (internal design decision)
Date 2021-01-12
Informing This decision is an internal design decision

Details

Current solution

Currently, we support article topics and draft topics, both coming from ORES. Current structure example:

"ores_articletopic": [
    "Culture.Media.Radio|475",
    "STEM.STEM*|741"
]

In the first tag, "Culture.Media.Radio" is the value of the tag and 475 is a term frequency value used as a weight to sort the values in search. We currently also put draft data there, which isn't ideal. This is technical debt that we need to resolve before streamlining the platform solution.
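To make the "value|weight" convention concrete, here is a minimal Python sketch that splits an entry into its two parts. The names `WeightedValue` and `parse_ores_entry` are hypothetical, and sorting by weight is only an illustration of how the number can rank values; the actual ranking happens inside the search engine through term frequency.

```python
from typing import NamedTuple


class WeightedValue(NamedTuple):
    value: str   # e.g. "Culture.Media.Radio"
    weight: int  # e.g. 475, stored in the index as a term frequency


def parse_ores_entry(entry: str) -> WeightedValue:
    """Split 'value|weight' on the last '|' and convert the weight to an int."""
    value, sep, weight = entry.rpartition("|")
    if not sep or not value:
        raise ValueError(f"expected 'value|weight', got {entry!r}")
    return WeightedValue(value, int(weight))


entries = ["Culture.Media.Radio|475", "STEM.STEM*|741"]

# Rank the parsed entries from highest to lowest weight.
ranked = sorted(
    (parse_ores_entry(e) for e in entries),
    key=lambda item: item.weight,
    reverse=True,
)
```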

Desired solution

We want to have a field that behaves the same way, but is designed for more general data. Proposed format is "<tag_source>/<tag_value>|<tag_weight>". Here's a more detailed example:

"weighted_tags": [
    "classification.ores.articletopic/Culture.Media.Radio|475",
    "recommendation.image/exists|1",
    "recommendation.link/exists|1"
]
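The proposed "<tag_source>/<tag_value>|<tag_weight>" format can be sketched as a small round-trip in Python. The `WeightedTag` class and `parse_weighted_tag` function are hypothetical names, and the sketch assumes that the tag source never contains "/" and that the weight is an integer; neither assumption is stated in this record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WeightedTag:
    source: str  # e.g. "recommendation.link"
    value: str   # e.g. "exists"
    weight: int  # e.g. 1

    def serialize(self) -> str:
        """Render the tag as '<tag_source>/<tag_value>|<tag_weight>'."""
        return f"{self.source}/{self.value}|{self.weight}"


def parse_weighted_tag(raw: str) -> WeightedTag:
    """Parse '<tag_source>/<tag_value>|<tag_weight>' into its three parts."""
    body, sep, weight = raw.rpartition("|")
    if not sep:
        raise ValueError(f"missing '|<tag_weight>' in {raw!r}")
    # Assumption: the first '/' separates the source from the value.
    source, slash, value = body.partition("/")
    if not slash or not source or not value:
        raise ValueError(f"missing '<tag_source>/<tag_value>' in {raw!r}")
    return WeightedTag(source, value, int(weight))


tag = parse_weighted_tag("classification.ores.articletopic/Culture.Media.Radio|475")
```

Because the source is the prefix before the first "/", all tags from one producer can be found or removed by matching on that prefix.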

This structure allows us to reuse preexisting features.

Existing tags from ORES will be migrated to this field.
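As an illustration of what migrating an ORES entry involves, the following Python sketch prefixes each old "value|weight" entry with the tag source shown in the example above. The function names are hypothetical, and the sketch only covers article topics, because this record does not give the source prefix for draft topics. The real repopulation is done by the bulk ingest job, not by code like this.

```python
def migrate_ores_entries(
    old_entries: list[str],
    source: str = "classification.ores.articletopic",
) -> list[str]:
    """Prefix each old 'value|weight' entry with '<tag_source>/'."""
    return [f"{source}/{entry}" for entry in old_entries]


def migrate_document(doc: dict) -> dict:
    """Return a copy of doc whose weighted_tags include the migrated ORES entries.

    The old field is left in place, mirroring a transition period in which
    both the old and the new field exist.
    """
    migrated = dict(doc)
    tags = list(migrated.get("weighted_tags", []))
    for tag in migrate_ores_entries(migrated.get("ores_articletopic", [])):
        if tag not in tags:  # keep the operation idempotent
            tags.append(tag)
    migrated["weighted_tags"] = tags
    return migrated


old_doc = {
    "ores_articletopic": ["Culture.Media.Radio|475", "STEM.STEM*|741"],
    "weighted_tags": ["recommendation.link/exists|1"],
}
new_doc = migrate_document(old_doc)
```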

Migration

Recommendation features are currently under development and can leverage the new structure immediately, but ORES classifications need to be migrated. Steps required:

  1. Implement recommendation features with the new structure in mind
  2. Develop handling of the new structure alongside the old one, with backward-compatibility (BC) code in CirrusSearch (search both old and new fields)
  3. Reindex articles to add the new field in the Elasticsearch mapping
  4. Repopulate ORES articletopics and drafttopics for all the articles (see wikimedia/discovery/analytics:spark/ores_bulk_ingest.py)
  5. Remove BC code from CirrusSearch
  6. Reindex the Elasticsearch indices to remove the old fields (using the --fieldsToDelete option)
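The "search both old and new fields" idea in step 2 can be sketched as an Elasticsearch query body built in Python. This is not the CirrusSearch implementation, which is written in PHP and builds its queries differently. The function name `tag_filter` is hypothetical, and the sketch assumes that index-time analysis stores the part before "|" as the term, so that a plain term query matches it.

```python
from typing import Optional


def tag_filter(source: str, value: str, legacy_field: Optional[str] = None) -> dict:
    """Build a bool query matching a tag in weighted_tags or in a legacy field.

    Assumption: the '|weight' suffix is consumed at index time, so the
    indexed term is '<tag_source>/<tag_value>' in the new field and the
    bare value in the old field.
    """
    clauses = [{"term": {"weighted_tags": f"{source}/{value}"}}]
    if legacy_field is not None:
        # Backward-compatibility clause, to be dropped once the old field is gone.
        clauses.append({"term": {legacy_field: value}})
    return {"bool": {"should": clauses, "minimum_should_match": 1}}


# During the transition: look in both the new and the old field.
bc_query = tag_filter(
    "classification.ores.articletopic",
    "Culture.Media.Radio",
    legacy_field="ores_articletopic",
)

# Recommendation flags only ever exist in the new field.
link_query = tag_filter("recommendation.link", "exists")
```

Once the BC code is removed in step 5, only the single `weighted_tags` clause would remain.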