Trust and Safety Product/Decision records/2025-02-05-IPoid-OpenSearch

From mediawiki.org

Authors

Status

  • Proposed

Reviewers

Context

The IPoid service imports Spur data to a relational database. This presents several operational challenges:

  • Daily high-volume inserts, deletions, and updates are difficult to course-correct when errors occur
  • Complex reconstruction of data into relational database format
  • Reduced data recency, due to the time each import takes
  • Database drift during daily updates

Options considered

1. Maintain status quo.

  1. Attempt to optimize imports
  2. Invest in maintenance and observability

2. Use an OpenSearch instance for data storage and querying

  1. Simplified data pipeline to direct JSONL ingestion
  2. Native handling of IP address data types and queries
  3. Initial experiments for imports and queries are promising:
    1. Initial import of 37.28M records completed in 50 minutes
    2. Query response times of 4-20 ms for IP lookups
    3. Query response times under 50 ms for attribute queries
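
As an illustration of what direct JSONL ingestion in option 2 could look like, the sketch below turns JSONL records into a request body for the OpenSearch `_bulk` API. It is a minimal sketch under assumptions, not the IPoid implementation: the index name `ipoid`, the `ip` field name, and the date format of `last_seen` are hypothetical, and the mapping declares the `ip` field type explicitly rather than relying on any automatic inference.

```python
import json

# Hypothetical index name; not taken from this record.
INDEX = "ipoid"

# Assumed mapping: `ip` is a hypothetical field name, `last_seen` is assumed
# to be in a format the default `date` type accepts.
MAPPING = {
    "mappings": {
        "properties": {
            "ip": {"type": "ip"},
            "last_seen": {"type": "date"},
        }
    }
}


def jsonl_to_bulk(lines, index=INDEX, id_field="ip"):
    """Convert JSONL records into an NDJSON body for the _bulk API."""
    out = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the feed
        doc = json.loads(line)
        # Using the address as the document _id means a later import
        # replaces the earlier record for the same address.
        action = {"index": {"_index": index, "_id": doc[id_field]}}
        out.append(json.dumps(action))
        out.append(json.dumps(doc))
    # Every line of a bulk body, including the last, ends with a newline.
    return "".join(row + "\n" for row in out)
```

The returned string would be sent as the body of `POST /_bulk` with `Content-Type: application/x-ndjson`, after creating the index with `MAPPING`. Batching the feed into several bulk requests is left out of the sketch.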

Decision

We propose migrating the IPoid Node.js import and serving app to use an OpenSearch instance as both the data store and the serving layer for web requests.

Consequences

1. Simplified data pipeline

  1. Direct ingestion of JSONL format without complex transformations
  2. Ability to run multiple imports per day, reducing data staleness
  3. Automatic type inference for fields including IP addresses
  4. Built-in handling of updates and merges
  5. Built-in support for historical record querying

2. Querying

  1. Ability to track historical data through the `last_seen` field
  2. Flexible querying capabilities using OpenSearch query DSL
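
The querying consequences above can be sketched as two OpenSearch query DSL bodies: one for an IP lookup and one that also filters on `last_seen`. This is a hedged sketch only; the `ip` field name is an assumption, and both fields are assumed to be mapped as `ip` and `date` respectively.

```python
def ip_lookup_query(address):
    """Build a lookup body; a `term` query on an `ip` field accepts a
    single address or a CIDR range such as "192.0.2.0/24"."""
    return {"query": {"term": {"ip": address}}}


def seen_since_query(address, since):
    """Build a lookup restricted to records whose `last_seen` is on or
    after `since` (a date string or date math such as "now-7d")."""
    return {
        "query": {
            "bool": {
                # `filter` clauses match without contributing to scoring.
                "filter": [
                    {"term": {"ip": address}},
                    {"range": {"last_seen": {"gte": since}}},
                ]
            }
        }
    }
```

Either body would be sent to the `_search` endpoint of the index, for example `seen_since_query("192.0.2.1", "now-7d")` for an address seen within the last seven days.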

Risks

  1. There is no clear path to hosting an OpenSearch instance at WMF. T362105 is deprioritized at the moment.