Wikimedia Enterprise/Updates

From mediawiki.org
This is an archive of all technical updates for the Wikimedia Enterprise project.


2021-09: Launch! Building towards the next version and public access

  • V1 launched on 9/15/2021: This month we stepped out of beta and fully launched v1 of Wikimedia Enterprise APIs. V1 APIs include:
    • Real Time:
      • Firehose API: Three real-time streams of all of the current events happening across our projects. You can hold this connection open indefinitely, and it returns the same data model as the other APIs, so you get all of the information in a single event object. The three streams are:
        • page-update: all revisions and changes to a page across the projects
        • page-delete: all page deletions to remove from records
        • page-visibility: highly urgent, community-driven events within the projects to reset
      • Hourly Diffs: An API that returns a zip file containing all of the changes within an hour across all "text-based" Wikimedia projects
    • Bulk:
      • Daily Exports: An API that returns a zip file containing all of the changes within a day across all "text-based" Wikimedia projects
    • Pull:
      • Structured Content API: An API that allows you to look up a single page in the same JSON structure as the Firehose, Hourly, and Daily endpoints.
  • Implementing new architecture:
    • We are starting to implement the architecture that we've been working on in past months to move towards a more flexible system that is built around streaming data. More information to be shared on our mediawiki page soon.
    • We are also working on rewriting some of our existing launch work into the new process - this involves a lot of repurposing code but makes for a stronger and more scalable system.
    • After this, we will begin the implementation of Wikidata, more credibility signals, and flexible filtering into the suite of APIs.
  • Public Access:
    • The Daily and Hourly Diffs are currently available on WMCS.
    • We are planning to launch with Wikimedia Dumps as soon as we launch hashing capabilities in the v1 APIs! Stay tuned.
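
As a rough illustration of how a reuser might consume the three Firehose streams listed above, the sketch below applies events to a local page store. The event fields ("project", "name"), the "needs_reset" flag, and the `apply_event` helper are assumptions made for illustration, not the published Wikimedia Enterprise schema or client.

```python
from typing import Any, Dict, Tuple

# Hypothetical local mirror keyed by (project, page name).
Store = Dict[Tuple[str, str], Dict[str, Any]]


def apply_event(store: Store, stream: str, event: Dict[str, Any]) -> None:
    """Apply one event from a Firehose-style stream to a local store.

    Field names ("project", "name") are illustrative assumptions,
    not the real event schema.
    """
    key = (event["project"], event["name"])
    if stream == "page-update":
        # A new revision or change: replace whatever we hold for the page.
        store[key] = event
    elif stream == "page-delete":
        # The page was deleted upstream: remove it from our records.
        store.pop(key, None)
    elif stream == "page-visibility":
        # Urgent community-driven event: flag the stored copy for a reset.
        # How a reuser actually resets a page is up to them (assumption).
        if key in store:
            store[key] = {**store[key], "needs_reset": True}
    else:
        raise ValueError(f"unknown stream: {stream}")
```

Because all three streams share one data model, a single handler like this can serve every stream; only the action taken differs.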

2021-08: Roadmap Design and Building towards our September Launch!

  • Roadmapping the next six months:
    • Wikidata:
      • Wikidata is heavily used by Wikimedia Enterprise's target persona of commercial content reusers. Looking into the future, it is important for us to include "text-based" projects as well as Wikidata in the feeds that we create.
      • Our goal is to add Wikidata to the Firehose streams, Hourly Diffs, and Daily Exports giving Enterprise users the ability to access all of the projects (except Commons) in one API suite.
    • Credibility Signals
      • As we work to solve the challenges of reliably ingesting Wikimedia data in real time and at scale, there are two big problems that still come with our data: Content Integrity and Machine Readability.
      • Wikimedia data reusers are not necessarily savvy in the nuances of the communities' efforts to keep the projects as credible as possible, and they miss much of the context that comes with revisions. That context might help inform whether or not a new revision should replace the existing copy in an external system. This is exacerbated as reusers aim to move towards real-time data on projects that are always in flux.
      • We plan to draw out the landscape of what signals can be included alongside real time and bulk feeds of new revisions to help end users add more context to their systems. Stay tuned here.
    • Flexible APIs:
      • Customizable Payload: With the ever-expanding data added to our schemas, we need more flexibility in the payloads that end users receive. This is not easy or possible for Hourly Diffs or Daily Exports since those files are pre-generated and static, but we aim to work on this capability across the Firehose and Structured Content APIs.
      • Enhanced Filtering: Since there are so many different data points coming through the feeds, end users will start to build their ingestion around the few feeds they are comfortable with. It is imperative that we provide the ability to filter beyond the client side so that we can limit the direct traffic on end users' systems. This also provides a much easier experience for users of the APIs.
  • September Launch:
    • We are all hands on deck building towards the launch of our initial product.
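
To make the "Customizable Payload" idea above concrete, here is a minimal sketch of trimming an event down to the fields a reuser asks for. The `select_fields` helper and the sample field names are hypothetical; how the real APIs would expose this had not been decided at the time of writing.

```python
from typing import Any, Dict, Iterable


def select_fields(event: Dict[str, Any], fields: Iterable[str]) -> Dict[str, Any]:
    """Return a copy of `event` containing only the requested top-level fields.

    Requested fields that are absent from the event are silently skipped.
    """
    wanted = set(fields)
    return {key: value for key, value in event.items() if key in wanted}


# Illustrative event; the field names are assumptions, not the real schema.
event = {"name": "Earth", "project": "enwiki", "html": "<p>...</p>", "version": 1}
slim = select_fields(event, ["name", "project"])
```

Doing this on the server rather than the client is what would reduce traffic to end users' systems; the same reasoning explains why it is hard for pre-generated files such as Hourly Diffs and Daily Exports.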

2021-07: Onboarding, Architecture, and Launch Schema

  • Added some new folks to our engineering team:
    • Welcome Prabhat Tiwary, Daniel Memije, and Tim Abdullin! They each join us with different perspectives and backgrounds, adding substantial experience and capacity to our team.
    • With this came a lot of work stepping back and building onboarding documentation to make sure our team can grow and folks can join and contribute to our work.
  • New Architecture
    • As Wikimedia Enterprise APIs become more defined and complicated, we have started to draw out what a target architecture would look like. We are doing a lot of planning and taking time to think through what a streaming pipe should look like.
    • Our original architecture was centered around the solution of "Exports" and less around the real-time component, which in the long run will create flexibility issues with how we store and move data around our architecture.
  • Data Model / API Schema:
    • We have decided on a target schema, dataset, and set of APIs for our move out of beta in September. See more on our documentation page here


2021-06: Parsing HTML, Schema, API Organization, and Public Access

  • Parsing HTML
    • We are entering the world of "what we can do to make the data easier to use" as we near having reliable pipes as the core of the Enterprise product.
    • First stop, parsing HTML. We are working with the Parsing team to find ways that Enterprise can support the open-source project to make parsing Parsoid HTML easier at scale for our end users.
  • Data Model / API Schema:
    • We are sending our schema work into the technical decision-making process at the Wikimedia Foundation; follow along on this ticket from the architecture team.
    • We have decided to adopt snake_case in our APIs, as it has more flexibility with non-English languages and we are looking down the line towards more accessible APIs.
  • Launch API Organization
    • Next week we will add our final API namespacing and structure for launch to our docs page. We are including endpoints to quickly discern if anything has changed from project to project. Stay tuned here, I'm just typing them up in draft.
  • Public Access
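
For readers unfamiliar with the snake_case convention mentioned above, the sketch below renames camelCase keys in a nested payload to snake_case. It only illustrates the naming convention; the helper names and sample keys are assumptions and are not taken from the Wikimedia Enterprise schema.

```python
import re
from typing import Any

# Zero-width boundary between a lowercase letter or digit and an uppercase letter.
_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])")


def to_snake_case(name: str) -> str:
    """Convert a camelCase identifier to snake_case."""
    return _BOUNDARY.sub("_", name).lower()


def snake_case_keys(value: Any) -> Any:
    """Recursively rename dictionary keys to snake_case, leaving values as-is."""
    if isinstance(value, dict):
        return {to_snake_case(k): snake_case_keys(v) for k, v in value.items()}
    if isinstance(value, list):
        return [snake_case_keys(item) for item in value]
    return value
```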

2021-05: Schema, Public Access, Documentation, and Firehose

  • Data Model / API Schema:
  • Public Access:
  • Documentation:
    • For now, we are hosting our documentation on-wiki here until we build out our larger sitemap for the Wikimedia Enterprise product. This work is in progress but feel free to watch that page for updates.
    • We are live on Phabricator and all Wikimedia Enterprise related technical work is documented on our board!
  • Firehose API:
    • We have scoped the v1 release of the Firehose API and it will include filtering of Project and Page-Types (namespaces) for easier ingestion. Track progress here.
    • The Firehose will include the data from the above schema in a real time feed.
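
The project and namespace filtering scoped for the Firehose above could behave roughly like the following sketch. It runs on an in-memory iterable rather than a live connection, and the `filter_events` helper and the "project" and "namespace" field names are assumptions for illustration only.

```python
from typing import Any, Dict, Iterable, Iterator, Optional, Set


def filter_events(
    events: Iterable[Dict[str, Any]],
    projects: Optional[Set[str]] = None,
    namespaces: Optional[Set[int]] = None,
) -> Iterator[Dict[str, Any]]:
    """Yield only events matching the given projects and namespaces.

    A filter set to None means "do not filter on this dimension".
    """
    for event in events:
        if projects is not None and event.get("project") not in projects:
            continue
        if namespaces is not None and event.get("namespace") not in namespaces:
            continue
        yield event


# Illustrative events; field names and values are assumptions.
sample = [
    {"project": "enwiki", "namespace": 0, "name": "Earth"},
    {"project": "enwiki", "namespace": 1, "name": "Talk:Earth"},
    {"project": "dewiki", "namespace": 0, "name": "Erde"},
]
articles = list(filter_events(sample, projects={"enwiki"}, namespaces={0}))
```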

2021-04: Beta, Transparency, and Roadmap

  • Beta Launch!:
    • The team launched a "closed beta" for our bulk and structured-content API endpoints! So far we have received great feedback, but we are still working through the kinks that come with a beta offering.
    • Follow this ticket for more information on when public access will be available via Wikimedia Database Dumps. Note that these will be experimental. If you are interested in providing feedback, feel free to post on our Phabricator board - we appreciate it!
    • We are finalizing a timeline with the Technical Engagement team to find how we can provide access to folks with access to their tools. Stay tuned.
  • Project transparency improvements:
    • We are moving all of Wikimedia Enterprise's project management to our Phabricator board over the next week or two.
    • We are reflecting/iterating on our open-source workflow to provide a better window into our GitHub push schedule for those who are interested in following along. More to come here.
  • Roadmap:
    • The next big roadmap item is refining the "data schema" work we have already done and publishing updates here. We are looking to include more contextual data to revisions as part of our ingestion feeds.

2021-03: Community conversations

  • Refreshed documentation
    • Publication of completely refreshed documentation on MediaWiki.org and Meta. See the Meta talk page for a significant amount of community feedback and comments.
  • Landing-page website
    • Launched! Incremental improvements in temporary code.
    • The website content itself is temporary and a placeholder until a fully featured page is launched alongside the product in a few months