Wikimedia Enterprise

The Okapi Team is a new cross-departmental team at the Wikimedia Foundation consisting of folks in the Technology, Product, and Advancement departments. Our core responsibilities are to uncover, design, and build products that will remove load from our primary servers and to enable more sustainable and diverse funding of the Wikimedia movement.

Current Technical Roadmap
This is the current roadmap, based on previous customer discovery initiatives related to developing third-party products.

Personae
This list will evolve and grow as more types of users start to engage with the product.


 * High Volume Data Downloader - Any person who intends to access Wikimedia Foundation data through bulk data downloading.
 * System Administrator - Non-technical person who will monitor the system.
 * Engineer - Technical person working on the Okapi project who will be deploying code.

Alpha - "Okapi HTML Dumps"
The Okapi team's Alpha product will be HTML Dumps. For every text-based WMF project, we will provide a compressed file containing the HTML of all of that project's articles, each at its "best last revision", available for download via Okapi's visual interface. These dumps will let content re-users work with WMF content in a more familiar data format.
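As a rough illustration of what "a compressed file of a project's HTML articles" could look like, the sketch below writes one JSON record per article into a gzip stream and reads it back. The dump format is not specified here, so the newline-delimited JSON layout, the `title` and `html` field names, and the `write_dump` and `read_dump` helpers are all assumptions made for the example, not a description of the actual Okapi output.

```python
import gzip
import io
import json


def write_dump(articles, fileobj):
    """Write (title, html) pairs to fileobj as gzip-compressed, newline-delimited JSON.

    Hypothetical layout: one JSON object per line with "title" and "html" keys.
    Returns the number of articles written.
    """
    count = 0
    with gzip.GzipFile(fileobj=fileobj, mode="wb") as gz:
        for title, html in articles:
            record = {"title": title, "html": html}
            gz.write(json.dumps(record, ensure_ascii=False).encode("utf-8") + b"\n")
            count += 1
    return count


def read_dump(fileobj):
    """Yield the article records stored by write_dump, one dict per article."""
    with gzip.GzipFile(fileobj=fileobj, mode="rb") as gz:
        for line in gz:
            yield json.loads(line.decode("utf-8"))


# Round trip through an in-memory buffer instead of a real file.
buffer = io.BytesIO()
written = write_dump([("Test", "<p>Hello</p>"), ("Okapi", "<p>Animal</p>")], buffer)
buffer.seek(0)
titles = [record["title"] for record in read_dump(buffer)]
```

Streaming one record per line means a re-user could process a large dump without loading the whole file into memory, which is one reason this layout was picked for the sketch.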

Rationale behind HTML Dumps:

 * Already validated as valuable: Historically the feature most requested by high volume data re-users. Large technology partners, researchers, and internal stakeholders within the Wikimedia Foundation have long sought a comprehensive way to access all of the WMF "text-based" wikis in a form other than Wikitext.
 * Take some pressure off internal Wikimedia infrastructure: When re-users pull our current dumps, some hit our systems for every file in order to parse the Wikitext to HTML. Re-users who do not use our current Wikitext dumps instead use our APIs to compile all of the WMF-hosted articles into their systems. In both cases, releasing HTML Dumps would immediately consolidate these calls to our team and unburden other parts of the WMF.
 * Standalone in nature: Of the projects already laid out to consider, this is the most standalone. We can easily understand the specs without working with a specific partner. We also will not be forced to make design decisions that would affect a later product.
 * Get BD in good shape: As we discover the business model we are going to move forward with here, building towards an initial valuable offering allows us to have more structured conversations.
 * Strong introductory project for contractors: Limited in scope while touching many different projects internally. As a project for getting familiar with the Wikimedia infrastructure, this offers a good look into the WMF tech stack and will yield lessons that our engineers can apply in future initiatives.

Goals:

 * Accessibility:
    * Downloader Interface: Clean interface for Downloaders to access and download files. (Interface Wireframe)
    * Admin Interface: Clean interface for Administrators to monitor and tweak the dump's creation.
    * Endpoints: API endpoints for easy integration into current WMF dumps page.
 * Data Output:
    * Frequency: Ability to download each project's up-to-date compressed dump file daily.
    * Reliability: Dependable, accessible infrastructure so we can guarantee downloading capabilities.
    * Article Quality: Limit vandalized content by focusing on the "Best Last Revision" for each article using time delays as well as ORES scores.
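The "Best Last Revision" idea in the Article Quality goal can be sketched as a filter over an article's revisions: skip revisions that are too recent (the time delay) or that score as likely damaging, then take the newest of what remains. This is only a minimal sketch under stated assumptions. The `Revision` type, the `damaging_probability` field (standing in for an ORES-style score between 0 and 1), and the 24-hour and 0.5 thresholds are hypothetical and are not taken from the Okapi design.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional, Sequence


@dataclass(frozen=True)
class Revision:
    rev_id: int
    timestamp: datetime
    # Hypothetical field: an ORES-style probability (0..1) that the edit is damaging.
    damaging_probability: float


def best_last_revision(
    revisions: Sequence[Revision],
    now: datetime,
    min_age: timedelta = timedelta(hours=24),  # assumed time delay
    max_damaging: float = 0.5,                 # assumed score threshold
) -> Optional[Revision]:
    """Return the newest revision that is old enough and not likely damaging."""
    candidates = [
        rev
        for rev in revisions
        if now - rev.timestamp >= min_age
        and rev.damaging_probability <= max_damaging
    ]
    # None when no revision passes both checks.
    return max(candidates, key=lambda rev: rev.timestamp, default=None)


now = datetime(2021, 1, 10, tzinfo=timezone.utc)
history = [
    Revision(1, now - timedelta(days=5), 0.02),   # old and clean
    Revision(2, now - timedelta(days=2), 0.90),   # old enough, but likely vandalism
    Revision(3, now - timedelta(hours=1), 0.01),  # clean, but too recent
]
chosen = best_last_revision(history, now)
```

In this example `chosen` is revision 1: revision 2 fails the score check and revision 3 fails the time delay. The delay gives the community time to revert vandalism that a score alone might miss.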

Hosting locations
This project is intended to serve already-public data but at massive scale, thereby reducing the burden on the existing infrastructure and teams. We’re prototyping on AWS because it’s faster: while we’re at the prototype stage and still figuring out what this product should become, the ability to make rapid changes in response to user and engineer feedback takes precedence.

We expect to run the tools we build in containers on Kubernetes so that they are highly portable and platform agnostic, including being able to run on Wikimedia's own cloud services for the benefit of the Wikimedia movement. The code we produce will be published in a publicly accessible repo and will be licensed under a free software license.

Closely Related Endeavours:

 * Dumps and Data dumps - We are working with the Dumps team to learn from their challenges and eventually combine our work around HTML Dumps and Wikitext Dumps.
 * Core Platform Team/Initiatives/API Gateway - This is a similar API strategy effort. We are focusing on the "users of large scale", whereas the API Gateway is much more focused on the rest of the Wikimedia engineering community.

Public/Private Technologies as a part of the current solution:

 * RESTBase Parsoid Cache to pull the most recent HTML cache into our dumps
    * https://en.wikipedia.org/api/rest_v1/page/title/Test
    * https://en.wikipedia.org/api/rest_v1/page/html/Test
 * Event Streams API to monitor changes
 * ORES to monitor vandalism and help clean our dumps
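To show how the first two pieces might fit together, the sketch below turns a change event into the RESTBase URL of the page whose HTML would need refreshing. The URL pattern follows the `page/html` example listed above. Everything else is an assumption for illustration: the helper names `html_url` and `url_for_sse_line` are hypothetical, the event is assumed to arrive as a server-sent-event `data:` line carrying JSON, and the `type`, `namespace`, `server_name`, and `title` fields are assumed to match what the stream provides. No network calls are made.

```python
import json
from typing import Optional
from urllib.parse import quote


def html_url(domain: str, title: str) -> str:
    """Build the RESTBase page/html URL for a title on the given wiki domain."""
    # Titles use underscores for spaces; "/" is escaped so it stays part of the title.
    encoded = quote(title.replace(" ", "_"), safe="")
    return f"https://{domain}/api/rest_v1/page/html/{encoded}"


def url_for_sse_line(line: str) -> Optional[str]:
    """Map one server-sent-event line to the HTML URL to re-fetch, if any.

    Assumed event shape: JSON with "type", "namespace", "server_name", "title".
    Returns None for non-data lines and for events that are not article
    edits or page creations.
    """
    if not line.startswith("data:"):
        return None  # comments, event names, ids, blank keep-alive lines
    event = json.loads(line[len("data:"):])
    if event.get("type") not in ("edit", "new"):
        return None
    if event.get("namespace") != 0:  # assumed: only main-namespace articles are dumped
        return None
    return html_url(event["server_name"], event["title"])


sample = (
    'data: {"type": "edit", "namespace": 0, '
    '"server_name": "en.wikipedia.org", "title": "Test"}'
)
refresh_url = url_for_sse_line(sample)
```

Here `refresh_url` is `https://en.wikipedia.org/api/rest_v1/page/html/Test`, the same URL as in the list above. A real consumer would also need reconnect handling and rate limiting, which this sketch leaves out.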

This project shares challenges and an overlapping problem space with the following:

 * Wikimedia update feed service - A previous paid data service that enabled third parties to maintain and update local databases of Wikimedia content.
 * Data request limitations - We are working with this team to see how this idea could play into Okapi products.
 * Kiwix - Our HTML Dumps epic overlaps with Kiwix's mwoffliner project in technology, but not in use case. We are exploring leveraging their technology and potentially providing tools for collaboration as we exit the prototyping phase.