Wikimedia Enterprise

The Wikimedia Enterprise API is a new service focused on high-volume commercial reusers of Wikimedia content. It will provide a new funding stream for the Wikimedia movement; greater reliability for commercial reusers; and greater reach for Wikimedia content.

For general information, the relationship to the Wikimedia strategy, operating principles, and FAQ, see Wikimedia Enterprise on Meta. The project was formerly known as "Okapi".

See also the dedicated subpage for API documentation. Current development work is tracked on our Phabricator board. For information about Wikimedia community access to this service, please see Access on the project's Meta homepage.

Contact the team if you would like to arrange a conversation about this project with your community.
Monthly "Office hours": Third Friday of each month @ 15:00 UTC. Details on our Meta page

Updates

These are the most recent months of technical updates. All previous updates can be found in the archive.


2021-10: Website Launch and Wikimedia Dumps release!

  • Website Launch:
    • Our website is live! Check it out
    • The launch includes our initial product offering details, along with pricing and sign-up information.
  • Wikimedia Dumps release!
    • Wikimedia Dumps now hosts Wikimedia Enterprise dumps! Give them a download and please provide feedback to our team as you see fit.
    • Reminder: The Daily and Hourly Diffs are available on WMCS currently

2021-09: Launch! Building towards the next version and public access

  • V1 launched on 9/15/2021: This month we stepped out of beta and fully launched v1 of Wikimedia Enterprise APIs. V1 APIs include:
    • Real Time:
      • Firehose API: Three real-time streams of all current events happening across our projects. You can hold the connection indefinitely, and it returns the same data model as the other endpoints, so you can get all of the information in a single event object (a minimal routing sketch follows this update). The three streams are:
        • page-update: all revisions and changes to a page across the projects
        • page-delete: all page deletions to remove from records
        • page-visibility: highly urgent community driven events within the projects to reset
      • Hourly Diffs: An API that returns a zip file containing all changes within the past hour across all "text-based" Wikimedia projects
    • Bulk:
      • Daily Exports: An API that returns a zip file containing all changes within a day across all "text-based" Wikimedia projects
    • Pull:
      • Structured Content API: An API that allows you to look up a single page in the same JSON structure as the Firehose, Hourly, and Daily endpoints.
  • Implementing new architecture:
    • We are starting to implement the architecture that we've been working on in past months to move towards a more flexible system that is built around streaming data. More information to be shared on our mediawiki page soon.
    • We are also rewriting some of our existing launch work to fit the new process. Much of this is repurposing existing code, but it makes for a stronger and more scalable system.
    • After this, we will begin the implementation of Wikidata, more credibility signals, and flexible filtering into the suite of APIs.
  • Public Access:
    • The Daily and Hourly Diffs are available on WMCS currently
    • We are planning to launch on Wikimedia Dumps as soon as we add hashing capabilities to the v1 APIs! Stay tuned.
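
For illustration, here is a minimal Python sketch of how a consumer might route the three Firehose streams described above into separate handlers. The endpoint URL, authentication header, and event field names are assumptions made for this sketch, not values taken from the official API documentation.

    import json
    import requests  # third-party HTTP client (pip install requests)

    # Hypothetical endpoint and token; consult the API documentation for real values.
    FIREHOSE_URL = "https://example.enterprise.wikimedia.com/v1/firehose/{stream}"
    TOKEN = "YOUR_ACCESS_TOKEN"

    def handle_update(event):
        # page-update: upsert the new revision into the local cache or knowledge graph.
        print("update:", event.get("name"))

    def handle_delete(event):
        # page-delete: remove the page from local records.
        print("delete:", event.get("name"))

    def handle_visibility(event):
        # page-visibility: urgent community-driven change; re-check or suppress the cached copy.
        print("visibility:", event.get("name"))

    HANDLERS = {
        "page-update": handle_update,
        "page-delete": handle_delete,
        "page-visibility": handle_visibility,
    }

    def consume(stream_name):
        """Hold a long-lived connection to one stream and dispatch each JSON event."""
        url = FIREHOSE_URL.format(stream=stream_name)
        headers = {"Authorization": f"Bearer {TOKEN}"}
        with requests.get(url, headers=headers, stream=True, timeout=(10, None)) as resp:
            resp.raise_for_status()
            for line in resp.iter_lines():
                if line:  # skip keep-alive blank lines
                    HANDLERS[stream_name](json.loads(line))

    if __name__ == "__main__":
        consume("page-update")

Because each stream carries the full event object, a consumer can act on an event without making follow-up API calls.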

2021-08: Roadmap Design and Building towards our September Launch!

  • Roadmapping the next six months:
    • Wikidata:
      • Wikidata is heavily used by Wikimedia Enterprise's target persona of commercial content reusers. Looking to the future, it is important for us to include Wikidata alongside the "text-based" projects in the feeds that we create.
      • Our goal is to add Wikidata to the Firehose streams, Hourly Diffs, and Daily Exports giving Enterprise users the ability to access all of the projects (except Commons) in one API suite.
    • Credibility Signals
      • As we work to solve the challenges of reliably ingesting Wikimedia data in real time and at scale, two big problems still come with our data: Content Integrity and Machine Readability.
      • Wikimedia data reusers are not necessarily savvy about the nuances of the communities' efforts to keep the projects as credible as possible, and they miss much of the context that accompanies a revision and could help inform whether or not it is worth publishing to an external system. This is exacerbated as reusers move towards real-time data on projects that are always in flux.
      • We plan to draw out the landscape of what signals can be included alongside real time and bulk feeds of new revisions to help end users add more context to their systems. Stay tuned here.
    • Flexible APIs:
      • Customizable Payload: With the ever-expanding data added to our schemas, we need more flexibility in the payloads that end users receive. This is not feasible for Hourly Diffs or Daily Exports, since those files are pre-generated and static, but we aim to build this capability into the Firehose and Structured Content APIs.
      • Enhanced Filtering: Since so many different data points come through the feeds, end users will tend to build their ingestion around a few of them. It is imperative that we provide filtering beyond the client side so that we can limit the direct traffic on end users' systems. This also makes for a much easier experience for users of the APIs.
  • September Launch:
    • We are all hands on deck building towards the launch of our initial product.

2021-07: Onboarding, Architecture, and Launch Schema

  • Added some new folks to our engineering team:
    • Welcome Prabhat Tiwary, Daniel Memije, and Tim Abdullin! They each join us with different perspectives and experiences, adding substantial expertise and capacity to our team.
    • With this came a lot of work stepping back and building onboarding documentation to make sure our team can grow and folks can join and contribute to our work.
  • New Architecture
    • As the Wikimedia Enterprise APIs become more defined and complex, we have started to draw out what a target architecture would look like. We are doing a lot of planning and taking time to think through what a streaming pipeline should look like.
    • Our original architecture was centered around the solution of "Exports" and less around the real-time component, which in the long run will create flexibility issues with how we store and move data around our architecture.
  • Data Model / API Schema:
    • We have decided on a target schema, dataset, and set of APIs for our move out of beta in September. See more on our documentation page here


Past updates

For previous months' updates, see the archive.


Overview

Background

Due to the myriad of sources of information on the internet, compiling public and private data sets together has become a major proprietary asset (seen in customer knowledge graphs) for large tech companies when building their products. It is through this work that a company’s voice assistants and search engines can be more effective than those of their competitors. Wikimedia data is the largest public data source on the internet and is used as the "common knowledge" backbone of knowledge graphs. Not having Wikimedia data in a knowledge graph is detrimental to a product’s value, as we've proven through customer research.

In order for Wikimedia Enterprise API's customers to create effective user experiences, they require two core features from the Wikimedia dataset: completeness and timeliness.

Wikimedia content provides the largest corpus of information freely available on the web. It maps broad topics across hundreds of languages and endows consumer products with a feeling of “all-knowingness” and “completeness” that drives positive user experiences.

Wikimedia content originates from a community that authors content in real time, as history unfolds. Leveraging that community’s work provides customer products with the feeling of being “in-the-know” (i.e., “timeliness”) as events occur, thus generating positive user experiences.

There is currently no way for a data-consuming customer to make one or two API requests to retrieve a complete and recent document that contains all relevant and related information for the topic requested. This has resulted in customers building complex ad-hoc solutions that are difficult to maintain; expensive, due to a large internal investment; error prone, due to inconsistencies in Wikimedia data; and fragile, due to changes in Wikimedia responses.

Research Study

From June 2020 – October 2020, the Wikimedia Enterprise team conducted a series of interviews with third-party reusers [Users] of Wikimedia data to gain a better understanding of what companies are using our data, how they are using our data, in what products they are using it, and what challenges they face when working with our APIs. Our research showed that:

  1. Users cache our data externally rather than query our APIs for live data
  2. Each user approaches our current stack differently, with unique challenges and requests
  3. The Wikimedia APIs are not viewed as a reliable ingestion mechanism for gathering data: they are prone to rate limits and uptime issues, and require excessive use to achieve reusers' goals
  4. All users have the same general problems when working with our content, and we have received similar asks from users of all sizes

The Enterprise API team has identified four pain points that cause large third-party reusers to struggle when using our public suite of APIs for commercial purposes. Note: Many of these concepts overlap with other initiatives currently underway within the Wikimedia movement, for example the API Gateway initiative.

  • High Frequency: Commercial reusers want to be able to ingest our content "off-the-press" so that they can have the most current worldview of common knowledge when presenting information to their users.
  • System Reliability: Commercial reusers want reliable uptime on critical APIs and file downloads so that they can build using our tools without maintenance or increased risk on their products.
  • Content Integrity: Commercial reusers inherit the same challenges that Wikimedia projects have in relation to vandalism and evolving stories. Commercial reusers desire more metadata with each revision update in order to inform their judgement calls on whether or not to publish a revision to their products.
  • Machine Readability: Commercial reusers want a clean and consistent schema for working with data across all of our projects. This is due to the challenges that come from parsing and making sense of the data they get from our current APIs.


For Content Integrity and Machine Readability, the Wikimedia Enterprise team created the following list of notably interesting areas on which to focus our work for third-party reusers. The list was created in March 2021 and has since been refined and prioritized into the roadmap features laid out below; it nonetheless serves as an artifact of this research and a reference back to some of the problems that reusers are facing.

Theme: Machine Readability
  • Parsed Wikipedia Content: Break out the HTML and Wikitext content into clear sections that customers can use when processing our content into their external data structures
  • Optimized Wikidata Ontology: Wikidata entries mapped into a commercially consistent ontology
  • Wikimedia-Wide Schema: Combine Wikimedia project data together to create a "single view" for multiple projects around topics
  • Topic Specific Exports: Segment the corpus into distinct groupings for more targeted consumption
Theme: Content Integrity
  • Anomaly Signals: Update the schema with information guiding customers to understand the context of an edit (examples: page view / edit data)
  • Credibility Signals: Packaged data from the community useful to detect larger industry trends in disinfo, misinfo, or bad actors
  • Improved Wikimedia Commons license access: More machine readable licensing on Commons media
  • Content Quality Scoring (vandalism detection, "best last revision"): Packaged data used to understand the editorial decision-making of how communities catch vandalism

Product roadmap

In response to the research study, the Enterprise API team is focused on building tools for commercial reusers that offer the advantages of a data service relationship while expanding the usability of the content that we provide.

The roadmap is split into two ordered phases focused on helping large third-party reusers with:

  1. Building a "commercial ingestion pipe"
  2. Creating more useful data to feed into the "commercial ingestion pipe"

Phase 1: Building a "Commercial Ingestion Pipe" (Launched September 2021)

The goal of the first phase is to build infrastructure that ensures the Wikimedia Foundation can reasonably guarantee Service Level Agreements (SLAs) for third-party reusers, as well as create a "single product" where commercial reusers can confidently ingest our content in a clear and consistent manner. While the main goal is not explicitly to remove the load of large reusers from Wikimedia Foundation infrastructure, that is a significant benefit, since we do not currently know the total load these large reusers place on donor-funded infrastructure. For more information on the APIs that are currently available, please reference our public API documentation.


The September 2021 release (v1.0) of the Enterprise APIs:

Type: Realtime
  • Enterprise Activity "Firehose" API (compare to: EventStream HTTP API): A stable, push HTTP stream of real time activity across "text-based" Wikimedia projects. What's new:
    • Push changes to the client over a stable connection
    • Filter by Project and Page-Type
    • Be notified of suspected vandalism in real time
    • Machine readable and consistent JSON schema
    • Guaranteed uptime, no rate-limiting
  • Enterprise Structured Content API (compare to: RESTBase APIs): Recent, machine readable content from all "text-based" Wikimedia projects. What's new:
    • Machine readable and consistent JSON schema
    • Guaranteed uptime, no rate-limiting
Type: Bulk
  • Enterprise Bulk Content API (compare to: Wikimedia Dumps): Recent, compressed Wikimedia data exports for bulk content ingestion. What's new:
    • Machine readable and consistent JSON schema
    • Daily "Entire Corpus" exports
    • Hourly "Activity" exports
    • Guaranteed delivery
    • Historical downloads

Phase 2: Enhance Wikimedia Data for Reuse (Current)

The goal of the second phase of this project is to enhance the data that comes through the infrastructure provided by the Enterprise API. By doing this, we will create more opportunities for reusers ingesting our data feeds to efficiently use our content in their products. In general, the four areas where reusers have issues working with our content for outside reuse are Content Integrity, Machine Readability, High Frequency, and System Reliability. With Phase 1 of this project, we greatly enhanced external reusers' ability to rely on System Reliability and High Frequency; you can read more about the general value offerings on our commercial website.

Wikimedia Enterprise Future Roadmap from March 2021 (annotated with current focus points in bold/italic)
Theme: Machine Readability
  • Parsed Wikipedia Content: Break out the HTML and Wikitext content into clear sections that customers can use when processing our content into their external data structures
  • Optimized Wikidata Ontology: Wikidata entries mapped into a commercially consistent ontology
  • Wikimedia-Wide Schema: Combine Wikimedia project data together to create a "single view" for multiple projects around topics
  • Topic Specific Exports: Segment the corpus into distinct groupings for more targeted consumption
Theme: Content Integrity
  • Anomaly Signals: Update the schema with information guiding customers to understand the context of an edit (examples: page view / edit data)
  • Credibility Signals: Packaged data from the community useful to detect larger industry trends in disinfo, misinfo, or bad actors
  • Improved Wikimedia Commons license access: More machine readable licensing on Commons media
  • Content Quality Scoring (vandalism detection, "best last revision"): Packaged data used to understand the editorial decision-making of how communities catch vandalism

Roadmap: Q1 - Q3 2021/22

Content Integrity

To kick off this new phase, we have decided to focus on Content Integrity, to help Wikimedia Enterprise users who ingest Wikimedia data at bulk or real-time scale understand what they are receiving as updates arrive. External reusers that choose to work with Wikimedia data in real time, or even with a slight delay, increase their exposure to the most fluid components of the projects and increase the risk of propagating vandalism, disinformation and misinformation, unstable article content, etc. Our goal is not to prescribe a decision about a revision's reliability, but rather to increase the contextual information around a revision so that Wikimedia Enterprise reusers have a "better picture" of what the revision is doing and how they might want to handle it on their end. This will manifest as new fields in our responses in the Firehose, Bulk APIs (hourly diffs only), and Structured Content API. We are focused on two main categories (a hypothetical handling sketch follows below):

  • Credibility Signals: the "context" of a revision. This means diving into "what changed", editor reputation, and general article-level flagging. The initial goal is to lean on the information that editors already use publicly and translate those concepts for reusers who are otherwise unfamiliar with them.
  • Anomaly Signals: the "noise" around a revision. This means temporal edit activity, page views, or talk page activity. The initial goal is to compile quantitative signals that unpack popularity, which reusers can use to prioritize updates and to calibrate around trends and what they might mean for the reliability of the content.

More information to come.
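
As a purely illustrative Python sketch, here is how a reuser might consult such signal fields when deciding whether to publish a revision downstream. The field names ("credibility_signals", "anomaly_signals", "editor_is_anonymous", "recent_edit_count", and so on) are hypothetical placeholders, not the final schema.

    # Illustrative only: these signal field names are hypothetical placeholders,
    # not the final Wikimedia Enterprise schema.

    def should_publish(event: dict, edit_spike_threshold: int = 25) -> bool:
        """Decide whether to push a revision straight to a downstream product
        or hold it back for review, based on the extra context in the event."""
        credibility = event.get("credibility_signals", {})
        anomaly = event.get("anomaly_signals", {})

        # Lean on context editors already use: who edited, and was the article flagged?
        if credibility.get("editor_is_anonymous") and credibility.get("article_flagged"):
            return False

        # "Noise" around the revision: a sudden burst of edits suggests an
        # evolving story or an edit war, so hold back and re-check later.
        if anomaly.get("recent_edit_count", 0) > edit_spike_threshold:
            return False

        return True

The point of the sketch is the decision structure, not the specific thresholds: the signals supply context, and the reuser keeps the final judgement call.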

Adding Wikidata to Wikimedia Enterprise

Phase 1 (and v1) of the Wikimedia Enterprise APIs focused exclusively on "text-based" projects and languages. This decision was based on a mix of initial research and a desire to maximize the value of our first offering for the largest set of external reusers without overburdening the scope of the initial release. As our product grows and we become more knowledgeable about the user base, we intend to expand our service to include Wikidata in all of our feeds. We are still deciding how to structure Wikidata within the schema we have defined for "text-based" projects, but here is where we will be including Wikidata (a hypothetical lookup sketch follows this list):

  • Bulk API:
    • Daily Exports of Wikidata
    • Hourly exports of changed entities
  • Firehose API:
    • A new "entity" stream containing all new updates
  • Structured Content API:
    • Lookup QIDs
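
Since Wikidata support is still being planned, the following Python sketch only illustrates what a QID lookup through the Structured Content API might look like; the host, path, and response shape are assumptions for illustration, not a released interface.

    import requests

    # Hypothetical: Wikidata support is planned, not released. The host, path,
    # and response shape are assumptions used only to illustrate "Lookup QIDs".
    BASE = "https://example.enterprise.wikimedia.com/v1"

    def lookup_entity(qid: str, token: str) -> dict:
        """Fetch a single Wikidata entity by its QID (assumed endpoint)."""
        resp = requests.get(
            f"{BASE}/entities/{qid}",               # assumed path
            headers={"Authorization": f"Bearer {token}"},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()

    # Example usage: lookup_entity("Q2", token="YOUR_ACCESS_TOKEN")  # Q2 is Earth on Wikidata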

Wikimedia Enterprise (Version 1.0)

See also: The API documentation subpage.

Note: We are still defining the exact nomenclature for API endpoints and documentation, but these are the main products that our team is currently building.

Structured Content API

High-volume reusers whose infrastructure relies on the EventStream platform depend on services like RESTBase to pull HTML for page titles and current revisions to update their products. High-volume reusers have requested a reliable means of gathering this data, as well as structures other than HTML, when incorporating our content into their knowledge graphs and products (a lookup sketch follows the list below).

Wikimedia Enterprise Structured Content API, at release, will contain:

  • A commercial schema
  • SLA
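
A minimal Python sketch of a single-page lookup is shown below. The host, path layout, and field names are assumptions used for illustration; the real endpoint and schema are described on the API documentation subpage.

    import requests

    # Assumed host and path layout; see the API documentation subpage for the
    # real endpoint, authentication details, and response schema.
    BASE = "https://example.enterprise.wikimedia.com/v1"

    def fetch_page(project: str, title: str, token: str) -> dict:
        """Fetch one page as structured JSON instead of scraping and parsing HTML."""
        resp = requests.get(
            f"{BASE}/pages/{project}/{title}",       # assumed path
            headers={"Authorization": f"Bearer {token}"},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()

    # Example usage (hypothetical field names):
    # article = fetch_page("enwiki", "Earth", token="YOUR_ACCESS_TOKEN")
    # print(article.get("name"), article.get("version"))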

Firehose API

High-volume reusers currently rely heavily on the changes pushed by our community to update their products in real time, using the EventStream APIs to access those changes. High-volume reusers are interested in a service that will allow them to filter the changes they receive to limit their processing, guarantee stable HTTP connections to ensure no data loss, and supply a more useful schema to limit the number of API calls they need to make per event. A filtering sketch follows the list below.

Enterprise Firehose API, at release, will contain:

  • Filtering of events by Project or Revision Namespace
  • Guaranteed connections
  • Commercially useful schema similar* to those that we are building in our Structured Content API and Bulk API
  • SLA

*We are still in the process of mapping out the technical specifications to determine the limitations of schema in event platforms and will post here when we have finalized our design.
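
The Python sketch below illustrates the kind of server-side filtering described above. The endpoint and the filter parameter names ("projects", "namespaces") are assumptions; as noted, the final schema and filter syntax are still being designed.

    import json
    import requests

    # Assumed endpoint and filter parameters; the finalized filter syntax will
    # be published with the API documentation.
    FIREHOSE_URL = "https://example.enterprise.wikimedia.com/v1/firehose/page-update"

    def stream_filtered(token: str, projects=("enwiki",), namespaces=(0,)):
        """Yield page-update events restricted to the requested projects/namespaces."""
        params = {
            "projects": ",".join(projects),               # assumed filter parameter
            "namespaces": ",".join(map(str, namespaces)),  # assumed filter parameter
        }
        headers = {"Authorization": f"Bearer {token}"}
        with requests.get(FIREHOSE_URL, headers=headers, params=params,
                          stream=True, timeout=(10, None)) as resp:
            resp.raise_for_status()
            for line in resp.iter_lines():
                if line:
                    yield json.loads(line)

    # for event in stream_filtered("YOUR_ACCESS_TOKEN"):
    #     process(event)  # your own ingestion step

Filtering on the server side keeps unwanted events off the wire entirely, rather than forcing clients to discard them after download.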

Bulk API

For high-volume reusers that currently rely on Wikimedia Dumps to access our information, we have created a solution for ingesting Wikimedia content in near real time without excessive API calls (Structured Content API) or maintaining hooks into our infrastructure (Firehose).

Enterprise Bulk API, at release, will contain:

  • 24-hour JSON*, Wikitext, or HTML compressed dumps of "text-based" Wikimedia projects
  • An hourly update file with revision changes of "text-based" Wikimedia projects
  • SLA

*JSON dumps will contain the same schema per page as the Structured Content API.

These dumps will be available for public use bi-weekly on Wikimedia Dumps, and for WMCS users starting in June 2021.
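
The Python sketch below shows how a reuser might download one of these compressed dumps and iterate over it. The download URL, file naming, and the assumption of gzip-compressed newline-delimited JSON are placeholders for illustration; check the Bulk API documentation for the actual formats.

    import gzip
    import json
    import shutil
    import requests

    # Assumed URL and file layout; the real file naming and compression format
    # are described in the Bulk API documentation.
    DUMP_URL = "https://example.enterprise.wikimedia.com/v1/exports/enwiki_daily.ndjson.gz"

    def download(url: str, token: str, dest: str) -> None:
        """Stream the compressed export to disk without loading it into memory."""
        with requests.get(url, headers={"Authorization": f"Bearer {token}"},
                          stream=True, timeout=(10, None)) as resp:
            resp.raise_for_status()
            with open(dest, "wb") as fh:
                shutil.copyfileobj(resp.raw, fh)

    def iter_pages(path: str):
        """Yield one page object per line of the (assumed) NDJSON dump."""
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            for line in fh:
                yield json.loads(line)

    # download(DUMP_URL, "YOUR_ACCESS_TOKEN", "enwiki_daily.ndjson.gz")
    # for page in iter_pages("enwiki_daily.ndjson.gz"):
    #     index(page)  # feed into your own ingestion pipeline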

Past development

Daily HTML Dumps

The Enterprise team's first product was daily dump files of HTML for every "text-based" Wikimedia project. These dumps will help content reusers work with a more familiar data type as they use Wikimedia content.

Reusers have four immediate needs from a service that supports large-scale content reuse: system reliability, high frequency or real-time access, content integrity, and machine readability.

Web interface

This is a screenshot from the alpha dashboard (when the project was codenamed "Okapi") where users can download and save daily exports of HTML from "text-based" Wikimedia projects

A downloader interface, now in the design stages, allows users to download a daily dump for each "text-based" project, search for and download individual pages, and save their preferences for return visits. The software is currently in alpha and still undergoing usage and quality testing. The dashboard is built in React, with internal-facing client endpoints built on top of our infrastructure. The downloads are hosted and served through S3.

Rationale behind choosing this as the Enterprise API's first product

  • Already validated: Before the Enterprise team ran research to discover the needs of high-volume data reusers, this was the most historically requested feature. Large technology partners, researchers, and internal stakeholders within the Wikimedia Foundation have long sought a comprehensive way to access all of the Wikimedia "text-based" wikis in a form outside of Wikitext.
  • Take pressure off internal Wikimedia infrastructure: While not proven, we can anecdotally conclude that there is a significant amount of traffic to our APIs from high-volume reusers aiming to keep the most up-to-date content cached on their systems for reuse. Building a tool where they can achieve this has been the first step to pulling high-volume reusers away from WMF infrastructure and onto a new service.
  • Standalone in nature: Of the projects already laid out for consideration by the Enterprise team, this is the most standalone. We can easily understand the specs without working with a specific partner. We were not forced to make technical decisions that would affect a later product or offering. In fact, in many ways, this flexibility forced us to build a data platform that produced many of the APIs that we are offering in the near future.
  • Strong business development case: This project gave the Enterprise team a lot of room to talk through solutions with reusers and open up business development conversations.
  • Strong introductory project for contractors: The Enterprise team started with a team of outside contractors. This forced the team to become reusers of Wikimedia in order to build this product. In the process, the team was able to identify and relate to the problems with the APIs that our customer base faces, giving them a broader understanding of the issues at hand.

Design documents

[Figure: Okapi architecture diagram (Okapi architecture.png)]

Application Hosting

The engineering goal of this project is to rapidly prototype and build solutions that can scale to the needs of the Enterprise API's intended customers: high-volume, high-speed, commercial reusers. To do this, the product has been optimized for quick iteration, for infrastructural separation from critical Wikimedia projects, and for the use of downstream Service Level Agreements (SLAs). To achieve these goals in the short term, we have built the Enterprise API on a third-party cloud provider (specifically Amazon Web Services [AWS]). While there are many advantages to using an external cloud for our use case, we acknowledge there are also fundamental tensions, given the culture and principles of how applications are built at the Foundation.

Consequently, the goal with the Enterprise API is to create an application that is "cloud-agnostic" and can be spun up on any provider's platform. We have taken reasonable steps to architect abstraction layers within our application to remove any overt dependencies on our current host, Amazon Web Services. This was also a pragmatic decision, due to the unclear nature of where this project will live long-term.

The following steps were taken to ensure that principle. We have:

  • Designed and built service interfaces to create abstractions over provider-specific tools. For instance, we have layers that tie to general file storage capabilities, decoupling us from using "AWS S3" exclusively or creating undue dependency on other potential cloud options (see the sketch after this list)
  • Built the application using Terraform as Infrastructure as Code to manage our cloud services. [The Terraform code will be published in the near future and this documentation will be updated when it is]
  • Used Docker for containerization throughout the application
  • Implemented hard drive encryption to ensure that the data is protected (we are working to expand our data encryption and will continue to do so as this project develops)
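
As an illustration of the abstraction-layer idea (the Enterprise codebase itself is not shown here), the Python sketch below codes against a generic FileStore interface, with the AWS S3 binding as just one interchangeable implementation. The class and method names are hypothetical.

    import os
    from abc import ABC, abstractmethod

    class FileStore(ABC):
        """Generic file storage interface the application codes against."""
        @abstractmethod
        def put(self, key: str, data: bytes) -> None: ...
        @abstractmethod
        def get(self, key: str) -> bytes: ...

    class S3FileStore(FileStore):
        """Binding for the current host (AWS S3 via boto3)."""
        def __init__(self, bucket: str):
            import boto3  # third-party AWS SDK, only needed for this backend
            self._bucket = bucket
            self._client = boto3.client("s3")

        def put(self, key: str, data: bytes) -> None:
            self._client.put_object(Bucket=self._bucket, Key=key, Body=data)

        def get(self, key: str) -> bytes:
            return self._client.get_object(Bucket=self._bucket, Key=key)["Body"].read()

    class LocalFileStore(FileStore):
        """Drop-in replacement for local development or another provider."""
        def __init__(self, root: str):
            self._root = root

        def put(self, key: str, data: bytes) -> None:
            path = os.path.join(self._root, key)
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "wb") as fh:
                fh.write(data)

        def get(self, key: str) -> bytes:
            with open(os.path.join(self._root, key), "rb") as fh:
                return fh.read()

Swapping providers then means adding another FileStore implementation rather than rewriting application code, which is the "cloud-agnostic" property described above.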

We have intentionally kept our technical stack as general, libre and open-source, and lightweight as possible. There is a temptation to use a number of proprietary services that may provide easy solutions to hard problems (including EMR, DynamoDB, etc.). However, we have restricted our reliance on Amazon to services that can be found in most other cloud providers. Below is a list of the Amazon services used by the Enterprise API and their purpose in our infrastructure:

  • Amazon Elasticsearch Service - Search Engine
  • Amazon MSK - Apache Kafka Cluster
  • Amazon ELB - Load Balancer
  • Amazon VPC - Virtual Private Cloud
  • Amazon Cognito - Authentication

We are looking to provide Service Level Agreements (SLAs) to customers similar to those guaranteed by Amazon's EC2. We don't have equivalent uptime information for the Wikimedia Foundation's existing infrastructure; however, this is something we are exploring with Wikimedia Site Reliability Engineering. Any alternative hosting in the future would require equivalent services, or time to add more staff to our team, to give us confidence in handling the SLA we are promising.

In the meantime, we are researching alternatives to AWS (and remain open to ideas that might fit our use case) for when this project is more established and we are confident that we know what the infrastructure needs are in practice.

Team

We are staffing our engineering team currently with Speed & Function. At this early stage in the project, we are not yet sure of the long-term engineering needs and wish to thoroughly assess the project’s ability to become self-sustaining. In this way, we hope not to disrupt other WMF projects or divert excessive resources.

See also

  • Wikitech: Data Services portal – A list of community-facing services that allow for direct access to databases and dumps, as well as web interfaces for querying and programmatic access to data stores.
  • Enterprise hub – a page for those interested in using the MediaWiki software in corporate contexts.
  • Wikimedia update feed service – A defunct paid data service that enabled third parties to maintain and update local databases of Wikimedia content.
  • MediaWiki Action API
    • Availability: Included with MediaWiki; enabled on Wikimedia projects
    • URL base: /api.php
    • Example: https://en.wikipedia.org/w/api.php?action=query&prop=info&titles=Earth
  • MediaWiki REST API
    • Availability: Included with MediaWiki 1.35+; enabled on Wikimedia projects
    • URL base: /rest.php
    • Example: https://en.wikipedia.org/w/rest.php/v1/page/Earth
  • Wikimedia REST API
    • Availability: Not included with MediaWiki; available for Wikimedia projects only
    • URL base: /api/rest_v1
    • Example: https://en.wikipedia.org/api/rest_v1/page/title/Earth

For commercial-scale APIs for Wikimedia projects, see Wikimedia Enterprise.