Platform Engineering Team/Data Value Stream/Data Gateway

Data Gateway Service

Background

The Analytics Query Service (AQS) uses a pattern in which periodic batch analytics jobs maintain materialized datasets that are durably persisted for low-latency access via HTTP APIs. Other use cases for this pattern have been identified, and there is interest in building a platform that lowers the barrier to entry, making it easier and faster to experiment with new datasets. It has been suggested that, on top of the six existing datasets, such a platform could quickly grow by tens of datasets in the near term. At this scale, managing bespoke database access for an arbitrary number of internal teams would likely become problematic. Even something as mundane as a fleet-wide upgrade of native drivers could be prohibitively expensive when code owners are many and vary in resources and priorities, or worse, when code becomes orphaned entirely.

One way of overcoming these concerns is to decouple client applications from the underlying database. This has many potential benefits, for example preserving the ability to transparently migrate or redistribute datasets among clusters, or to better manage resource utilization and caching. The Data Gateway service is an experimental approach to this decoupling: it provides a generic HTTP interface to published datasets.

The premise is simple: candidate datasets are purpose-built, and clients expect results that are verbatim (or nearly so) to what is stored, including attribute naming. The Data Gateway is nothing more than a thin layer that wires HTTP semantics to a database table and returns JSON-serialized results (an array of rows containing one or more JSON objects).
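
To make the premise concrete, the following is a minimal sketch of such a thin layer, not the actual service. It uses an in-memory SQLite table as a stand-in for the real database; the function name `fetch_rows`, the table layout, and the sample data are all assumptions made for illustration.

```python
import json
import sqlite3


def fetch_rows(conn, table, wiki, page_id):
    """Return the rows for (wiki, page_id) as a JSON document, verbatim.

    Hypothetical helper: SQLite stands in for the real database.
    The table name must come from a trusted allow-list, because it is
    interpolated into the SQL text.
    """
    conn.row_factory = sqlite3.Row
    cursor = conn.execute(
        f"SELECT * FROM {table} WHERE wiki = ? AND page_id = ?",
        (wiki, page_id),
    )
    # Attribute names are passed through unchanged, as stored.
    return json.dumps({"rows": [dict(row) for row in cursor]})


# Illustrative data only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE suggestions (wiki TEXT, page_id INTEGER, image TEXT)")
conn.execute("INSERT INTO suggestions VALUES ('anwiki', 3326, 'Example.jpg')")
body = fetch_rows(conn, "suggestions", "anwiki", 3326)
print(body)
```

The point of the sketch is that the layer adds no dataset-specific logic: the path parameters become the lookup key, and the stored columns become the JSON attributes.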

Published Datasets

The Data Gateway is conceptually a single entity where an arbitrary number of datasets are published. However, at the time of writing, each dataset is implemented as its own discrete (k8s-deployed) service. In the future, these may either be aggregated via HTTP routing or replaced by a single multi-tenant gateway service.

Image Suggestions

Source: https://gerrit.wikimedia.org/r/admin/repos/generated-data-platform/datasets/image-suggestions

API

Suggestions
URL /public/image_suggestions/suggestions/{wiki}/{page_id}
Method GET
Params None
Data None
Success Example:
HTTP/1.0 200 OK
Content-Type: application/json
Content-Length: 5000
Date: Mon, 11 Apr 2022 22:07:59 GMT

{
  "rows": [
    {
      "wiki": "anwiki",
      "page_id": 3326,
      "id": "644c90bc-ba40-11ec-ba4c-f0d4e2e69820",
      "image": "14_Agosto_2016_(1).jpg",
      "confidence": 80.0,
      "found_on": null,
      "kind": ["istype-commons-category"],
      "origin_wiki": "commonswiki", 
      "page_qid": "Q123",
      "page_rev": 1797958,
      "section_heading": "section title (null if this is an article-level suggestion)",
      "section_index": 1
    },
    ...
  ]
}
Error Errors are JSON objects conforming to RFC 7807 (Problem Details for HTTP APIs) with a content type of application/problem+json.
Code Reason Example
500 Internal server error
{
    "status": 500,
    "type": "about:blank",
    "title": "Cassandra query error",
    "detail": "An unknown error occurred, contact the administrator(s) ..."
}
Example
$ curl http://api.example.org/public/image_suggestions/suggestions/anwiki/3326
Notes
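
As a rough illustration of consuming this endpoint, the sketch below builds the request URL and filters the returned rows by `confidence`. The helper names `suggestions_url` and `images_above` are hypothetical, the host is the placeholder from the curl example, and a trimmed copy of the success example is parsed instead of making a live request.

```python
import json
from urllib.parse import quote

BASE_URL = "http://api.example.org"  # placeholder host from the curl example


def suggestions_url(wiki, page_id, base_url=BASE_URL):
    """Build the request URL for a wiki and page ID (hypothetical helper)."""
    path = f"/public/image_suggestions/suggestions/{quote(wiki, safe='')}/{int(page_id)}"
    return base_url + path


def images_above(body, threshold):
    """Return the image names whose confidence is at least `threshold`."""
    rows = json.loads(body)["rows"]
    return [row["image"] for row in rows if row["confidence"] >= threshold]


# A trimmed copy of the success example, used instead of a live request.
sample = (
    '{"rows": [{"wiki": "anwiki", "page_id": 3326, '
    '"image": "14_Agosto_2016_(1).jpg", "confidence": 80.0}]}'
)
print(suggestions_url("anwiki", 3326))
print(images_above(sample, 50.0))
```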
Feedback
URL /private/image_suggestions/feedback/{wiki}/{page_id}
Method GET
Params None
Data None
Success Example:
HTTP/1.0 200 OK
Content-Type: application/json
Content-Length: 100
Date: Mon, 11 Apr 2022 22:07:59 GMT

{"rows": [...]}
Error Errors are JSON objects conforming to RFC 7807 (Problem Details for HTTP APIs) with a content type of application/problem+json.
Code Reason Example
500 Internal server error
{
    "status": 500,
    "type": "about:blank",
    "title": "Cassandra query error",
    "detail": "An unknown error occurred, contact the administrator(s) ..."
}
Example
$ curl http://api.example.org/private/image_suggestions/feedback/enwiki/53848
Notes
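
Since errors arrive as problem documents rather than as a `rows` object, a client has to distinguish the two by content type. The following is one possible sketch of that check; `GatewayError` and `parse_response` are hypothetical names, and the error body is the example shown above.

```python
import json


class GatewayError(Exception):
    """Hypothetical exception carrying RFC 7807 problem details."""

    def __init__(self, problem):
        self.problem = problem
        super().__init__(f"{problem.get('status')}: {problem.get('title')}")


def parse_response(content_type, body):
    """Return the rows on success, or raise GatewayError for a problem document."""
    # Compare only the media type, ignoring parameters such as charset.
    media_type = content_type.split(";")[0].strip().lower()
    document = json.loads(body)
    if media_type == "application/problem+json":
        raise GatewayError(document)
    return document["rows"]


error_body = json.dumps({
    "status": 500,
    "type": "about:blank",
    "title": "Cassandra query error",
    "detail": "An unknown error occurred, contact the administrator(s) ...",
})
try:
    parse_response("application/problem+json", error_body)
except GatewayError as exc:
    message = str(exc)
    print(message)
```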
Instanceof (cache)
This table duplicates a relationship for which MediaWiki is the canonical source. It is maintained here for convenience and should not be considered a source of truth.
URL /private/image_suggestions/instanceof_cache/{wiki}/{page_id}
Method GET
Params None
Data None
Success Example:
HTTP/1.0 200 OK
Content-Type: application/json
Content-Length: 100
Date: Mon, 11 Apr 2022 22:07:59 GMT

{
  "rows": [
    {
      "wiki": "anwiki",
      "page_id": 3326,
      "instance_of": ["Q112099", "Q3624078", "Q6256"],
      "page_rev": 1797958
    }
  ]
}
Error Errors are JSON objects conforming to RFC 7807 (Problem Details for HTTP APIs) with a content type of application/problem+json.
Code Reason Example
500 Internal server error
{
    "status": 500,
    "type": "about:blank",
    "title": "Cassandra query error",
    "detail": "An unknown error occurred, contact the administrator(s) ..."
}
Example
$ curl http://api.example.org/private/image_suggestions/instanceof_cache/anwiki/3326
Notes
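
Because the cache is not a source of truth, a consumer may want to detect rows that lag behind MediaWiki. The sketch below assumes that `page_rev` records the revision the row was derived from (the description above does not state this explicitly) and that the caller obtains the current revision ID from MediaWiki by other means. `is_stale` is a hypothetical helper, and the second revision ID is made up for illustration.

```python
def is_stale(row, current_rev):
    """Return True if the cached row was built from an older revision.

    Assumption: `page_rev` is the revision ID the row was derived from,
    and revision IDs increase over time.
    """
    return row["page_rev"] < current_rev


# The row from the success example above.
cached = {
    "wiki": "anwiki",
    "page_id": 3326,
    "instance_of": ["Q112099", "Q3624078", "Q6256"],
    "page_rev": 1797958,
}
print(is_stale(cached, 1797958))  # same revision: not stale
print(is_stale(cached, 1800000))  # illustrative newer revision: stale
```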
Title (cache)
This table duplicates a relationship for which MediaWiki is the canonical source. It is maintained here for convenience and should not be considered a source of truth.
URL /private/image_suggestions/title_cache/{wiki}/{title}
Method GET
Params None
Data None
Success Example:
HTTP/1.0 200 OK
Content-Type: application/json
Content-Length: 100
Date: Mon, 11 Apr 2022 22:07:59 GMT

{"rows": [...]}
Error Errors are JSON objects conforming to RFC 7807 (Problem Details for HTTP APIs) with a content type of application/problem+json.
Code Reason Example
500 Internal server error
{
    "status": 500,
    "type": "about:blank",
    "title": "Cassandra query error",
    "detail": "An unknown error occurred, contact the administrator(s) ..."
}
Example
$ curl http://api.example.org/private/image_suggestions/title_cache/enwiki/Banana
Notes
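
Unlike the other endpoints, this one takes a page title in the path, and titles can contain characters that are not safe in a URL path segment. The sketch below percent-encodes the title before building the URL. It assumes the gateway accepts percent-encoded titles, which is not confirmed here; `title_cache_url` is a hypothetical helper and the host is the placeholder from the curl example.

```python
from urllib.parse import quote

BASE_URL = "http://api.example.org"  # placeholder host from the curl example


def title_cache_url(wiki, title, base_url=BASE_URL):
    """Build a title_cache URL, percent-encoding the title as one path segment."""
    # safe="" also encodes "/", which would otherwise split the segment.
    path = (
        "/private/image_suggestions/title_cache/"
        f"{quote(wiki, safe='')}/{quote(title, safe='')}"
    )
    return base_url + path


print(title_cache_url("enwiki", "Banana"))
print(title_cache_url("enwiki", "AC/DC"))
```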