
Structured content proof of value

A hands-on working prototype of an experimental modern platform

Last updated: 2022-12-16 by APaskulin (WMF)
Status: v1 published March 25, 2021

The purpose of this artifact

The purpose of the artifact is threefold:

  1. Share the context in which we undertook this exercise and why it mattered.
  2. Give a brief overview of the implementation with links to the demo instance and source code.
  3. Outline the challenges to consider, the Big Questions, and the next steps we are exploring.

This artifact is written primarily for people within the Wikimedia Foundation who are interested in, or benefit from, working towards the Foundation’s long-term goals.

We hope this document is a first step towards structured knowledge that enables information to flow between independent components. Our goal is to inch the system towards loose coupling, event-driven interactions and other patterns that enable emergence (new, positive behaviors arising from parts of knowledge interacting as a wider whole).

You should use this artifact for informational purposes. If you’d like to dive deeper or help us mature the architectural thinking towards modernization (a system of loosely-coupled capabilities with logical boundaries that rely on emerging industry patterns including canonical data modeling and event-driven interactions), contact us at architecture@wikimedia.org.

Why this artifact is valuable

Architecture interconnects the Foundation's strategically-imperative goals with the system-level decisions needed to reach them. This artifact, and the prototype exercise it describes, interconnects the Foundation's strategically-imperative goal, Knowledge as a Service, with the technology decisions needed to reach it. The work described in this artifact aligns with the Foundation’s 2020-2021 Platform Evolution objective to evolve Wikimedia systems architecture to use more modern patterns.

This artifact also demonstrates the value of modern systems patterns, uncovers leverage points (places to make impactful changes) and describes potentially-disruptive challenges blocking mission-critical decisions.

Our mission

To meet the Foundation's goal

"to serve our users, we will become a platform that serves open knowledge to the world across interfaces and communities. We will build tools for allies and partners to organize and exchange free knowledge beyond Wikimedia. Our infrastructure will enable us and others to collect and use different forms of free, trusted knowledge." -- Knowledge as a service

... we will

"make contributor and reader experiences useful and joyful; moving from viewing Wikipedia as solely a website, to developing, supporting, and maintaining the Wikimedia ecosystem as a collection of knowledge, information, and insights with infinite possible product experiences and applications." -- Modernizing the Wikimedia product experience

... by architecting towards

  • serving collections of knowledge and information,
  • created from multiple trusted sources,
  • to nearly-infinite product experiences
  • and other platforms.

Beginning with this proof of value (PoV)

We created a prototypical step towards a modern platform: a system of loosely-coupled capabilities with logical boundaries that rely on emerging industry patterns including canonical data modeling and event-driven interactions.

The proof of value

The impact of this exercise -- what has improved at the organization because of it -- is coalescing the platform evolution conversation around a set of essential themes. The goal is to identify patterns of thinking and implementation that can move us towards our stated mission. This exercise is a small sample of interrelated efforts towards articulating a path forward. And that effort is gaining momentum.

Including:

  • EventStorming and other modeling with Product and the emerging product strategy work.
  • Cross-foundation exploration of event-driven interactions and data modeling, including a reading group, architectural exploration groups and shared work with Okapi (among others).
  • Collaboration between Architecture and Abstract Wikipedia to develop these shared patterns as that platform comes to life.
  • Workshops and ongoing collaboration between the Foundation’s architecture team and WMDE’s architecture team.
  • Bringing ideas and approaches from the external world of knowledge systems and discerning which will benefit us.

Walkthrough

Watch the walkthrough on YouTube

Overview

Here is a hands-on working prototype and the implementation behind it. The majority of our work, here and as a team, aims to understand and define “modern platform”. In this case, the prototype needed to:

  • Create a canonical structure for the knowledge / content
  • Break down a page into collections of sections
  • Implement by using the patterns described here (outside MW)
  • Include “topics” (in this case, using Rosette as the source)
  • Make the structured knowledge graph publicly queryable

The demo has been shared across the organization. This small experiment has evolved into deeper, cross-functional explorations, including the Core Platform Team, Product Strategy Working Group, Okapi and SDAW.

If you'd like to skip right to the details, read the implementation overview or how to view the demo.

Though we created something tangible, our primary focus is systems analysis. We are laying the foundation for systems architecture -- a practice that will support the work ahead. This work includes designing system patterns, discovering leverage points and identifying the Big Questions.

Implementation

The prototype is built in AWS using SNS messages, Lambdas written in Go, S3, GraphQL, DynamoDB and Elasticsearch. See the component list. It interacts with Rosette to analyze the sections and return topics. The repository is on GitHub.

Though this list is written sequentially, these activities are asynchronous.

Event-driven workflow

Respond to a change event (a sketch of this flow follows the list):

  • Respond to an event stream message sent by Simple Wikipedia when an article has changed
  • Retrieve the article via the Parsoid API and save the raw result
  • Break the raw result down into sections associated by hypermedia links: a page has parts (sections), and a section is part of a page (hasPart/isPartOf)
  • Save them as individual JSON objects with a predictable structure using schema.org
  • Save the page title associated with the resource ID
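As an illustration of what these stored objects might look like, here is a sketch in Go (the language of the prototype's Lambdas). The field names mirror the schema.org-style properties the demo's GraphQL API exposes (name, dateModified, hasPart, isPartOf, unsafe); the exact shape is an assumption for the sketch, not the PoV's actual data model.

package main

// A sketch of the canonical, schema.org-flavored objects the workflow
// might persist. A page links to its parts instead of embedding them,
// and each section links back to its page (hasPart/isPartOf).

import (
    "encoding/json"
    "fmt"
    "time"
)

// Page is the container object: it points at its sections by ID.
type Page struct {
    ID           string    `json:"id"`
    Authority    string    `json:"authority"` // e.g. "simple.wikipedia.org"
    Name         string    `json:"name"`
    DateModified time.Time `json:"dateModified"`
    HasPart      []string  `json:"hasPart"` // hypermedia links to section IDs
}

// Section is an individually addressable part of a page.
type Section struct {
    ID           string    `json:"id"`
    Name         string    `json:"name"`
    DateModified time.Time `json:"dateModified"`
    IsPartOf     string    `json:"isPartOf"` // link back to the page ID
    Unsafe       string    `json:"unsafe"`   // raw HTML, served only on request
}

func main() {
    sec := Section{
        ID:           "simple.wikipedia.org/Banana#Description",
        Name:         "Description",
        DateModified: time.Now(),
        IsPartOf:     "simple.wikipedia.org/Banana",
        Unsafe:       "<p>...</p>",
    }
    out, _ := json.MarshalIndent(sec, "", "  ")
    fmt.Println(string(out)) // one predictable JSON object per section
}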

When a new file is saved (sketched below):

  • Send the section to Rosette via its API
  • Save the list of resulting (most-salient) Wikidata items (topics) associated with that section
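A sketch of this step, assuming a hypothetical entity-linking client; the names here (linkEntities, onSectionSaved, Topic) are illustrative stand-ins for the prototype's Rosette calls and DynamoDB writes.

package main

import (
    "context"
    "fmt"
)

// Topic pairs a Wikidata item with the salience score returned for it.
type Topic struct {
    QID      string  // e.g. "Q503"
    Salience float64 // how central the topic is to the section
}

// linkEntities is a placeholder for the call to an entity-linking API.
// In the prototype this would send the section text to Rosette and map
// the returned entities to Wikidata items with salience scores.
func linkEntities(ctx context.Context, text string) ([]Topic, error) {
    return []Topic{{QID: "Q503", Salience: 0.92}}, nil
}

// onSectionSaved runs when a new section object lands in the store.
func onSectionSaved(ctx context.Context, sectionID, text string) error {
    topics, err := linkEntities(ctx, text)
    if err != nil {
        return err
    }
    // Persist the most-salient topics keyed by section ID.
    fmt.Printf("save topics for %s: %v\n", sectionID, topics)
    return nil
}

func main() {
    _ = onSectionSaved(context.Background(),
        "simple.wikipedia.org/Banana#Description", "Bananas are ...")
}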

Request-driven workflow

  • Serve requests for pages and/or sections, returning only the data requested (a sketch follows this list)
  • Serve requests for sections associated with a topic
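As a sketch of how a resolver can serve prebuilt objects while the query trims the payload, here is a minimal example using the open-source github.com/graphql-go/graphql library. The library choice and the in-memory section store are assumptions; the PoV's actual GraphQL layer in AWS may be built differently.

package main

import (
    "encoding/json"
    "fmt"

    "github.com/graphql-go/graphql"
)

func main() {
    // Pretend content store: prebuilt section objects keyed by name.
    sections := map[string]map[string]interface{}{
        "Description": {"name": "Description", "unsafe": "<p>...</p>"},
    }

    sectionType := graphql.NewObject(graphql.ObjectConfig{
        Name: "Section",
        Fields: graphql.Fields{
            "name":   &graphql.Field{Type: graphql.String},
            "unsafe": &graphql.Field{Type: graphql.String},
        },
    })

    queryType := graphql.NewObject(graphql.ObjectConfig{
        Name: "Query",
        Fields: graphql.Fields{
            "node": &graphql.Field{
                Type: sectionType,
                Args: graphql.FieldConfigArgument{
                    "name": &graphql.ArgumentConfig{Type: graphql.String},
                },
                Resolve: func(p graphql.ResolveParams) (interface{}, error) {
                    // Serve the prebuilt object; GraphQL trims the payload
                    // to exactly the fields the consumer asked for.
                    return sections[p.Args["name"].(string)], nil
                },
            },
        },
    })

    schema, _ := graphql.NewSchema(graphql.SchemaConfig{Query: queryType})
    result := graphql.Do(graphql.Params{
        Schema:        schema,
        RequestString: `{ node(name: "Description") { name } }`, // no unsafe
    })
    out, _ := json.Marshal(result.Data)
    fmt.Println(string(out)) // {"node":{"name":"Description"}}
}

Because the query names only name, the response omits unsafe entirely; the consumer never pays for fields it didn't request.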

Architecture diagram

Architectural model for the PoC

Initial diagrams

Component diagram

Component diagram for Wikimedia Phoenix

Caveats

There are many challenges to consider before this prototype is "production ready". Production ready was not our goal. We are engaging with some of those challenges next.

We did not design, as part of this exercise, a pattern for updating the topics when there is a big-enough change. First, we need to define "big enough" and also understand where the logic of "change" fits into this pattern. When an article changes, the topics associated with the article might also change, which means we need to update the Rosette topics. We could resend the article every time there is an edit, but this would be highly inefficient.
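To make the inefficiency concrete, here is one possible, purely illustrative gate: re-run topic analysis only when an edit changes "enough" of the text. The length-based heuristic and the 20% threshold are assumptions for the sketch; as noted above, the PoV deliberately left this logic undesigned.

package main

import (
    "fmt"
    "math"
)

// bigEnoughChange reports whether the relative change in section length
// crosses the threshold. A real design would diff the content (or the
// extracted entities), not just compare lengths.
func bigEnoughChange(oldText, newText string, threshold float64) bool {
    oldLen, newLen := float64(len(oldText)), float64(len(newText))
    if oldLen == 0 {
        return newLen > 0
    }
    return math.Abs(newLen-oldLen)/oldLen >= threshold
}

func main() {
    fmt.Println(bigEnoughChange("short section", "a much longer rewritten section", 0.2)) // true: resend to Rosette
    fmt.Println(bigEnoughChange("minor tweak here", "minor tweak there", 0.2))            // false: skip
}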

Demo

You can access a demo instance here. The data supporting this demo is Simple Wikipedia with topics from Rosette for each section (approximately 500,000 nodes).

The demo is a front end that interacts with GraphQL and the structured content store. The structured content store contains content from Simple Wikipedia, updated when edits are made there. It also includes the topics associated with each object (page, section) from Rosette.

The demo provides several examples of potential behavior of the PoV. You can:

  • fetch a section of an article by its name
  • fetch the sections associated with a specific topic
  • input a GraphQL query

Demo: Fetch a part of a page

You can fetch a section by its name.

This example showcases how dividing an article into sections allows for flexibly fetching the section names, and individual sections, on their own. When a page is chosen from the drop-down list, a GraphQL query is sent to the content store, requesting the list of section titles that are available for the article. The query requests the article name, modification date, and the names of all its parts (section titles).

After the request completes, the second drop-down is populated, allowing the demo user to request a specific section. The second query requests the specific part inside the article itself. This means that the payload that is sent includes only the requested part, without receiving the complete article, and without expecting the consumer to process or manipulate the received content to present what is needed.

GraphQL queries

The queries used in this part of the demo show how easy it is to request and receive only the specific pieces of information that the consumer requires, reducing the load of processing or manipulating the page on the consumer.

Requesting a list of sections within the article:

{
  page(name: { authority: "simple.wikipedia.org", name: "PAGE NAME" }) {
    name
    dateModified
    hasPart(offset: 0) {
      name
    }
  }
}

Requesting a specific section by name:

{
  node(name: { authority: "simple.wikipedia.org", pageName: "PAGE NAME", name: "SECTION NAME" }) {
    dateModified
    name
    unsafe
  }
}

Demo: Fetch sections by keyword

You can request sections that are associated with a specific keyword from Wikidata.

This example shows the connection between parts (article sections) and semantic keywords (Wikidata items) produced by Rosette. The demo collects Rosette keywords that are associated with sections and provides them in a drop-down list. Choosing a keyword produces a GraphQL query that requests the sections associated with that keyword and presents them to the user. Each section then shows its most salient Wikidata items, allowing the user to explore collections of knowledge by keyword.

GraphQL queries

Requesting sections for a given keyword:

To request the HTML for a section, add the unsafe property to the query.

{
  nodes(keyword: "WIKIDATA Q-ID") {
    name
    id
    isPartOf {
      id
      name
    }
    keywords {
      id
      salience
    }
  }
}

Requesting the top 5 most relevant keywords for a section:

{
  node(
    name: {
      authority: "simple.wikipedia.org"
      pageName: "PAGE NAME"
      name: "SECTION NAME"
    }
  ) {
    dateModified
    name
    keywords(limit: 5) {
      id
      salience
    }
  }
}

Demo: GraphQL sandbox

You can input a GraphQL query of your own.

Finally, the demo includes a GraphQL sandbox for testing and exploring the way queries are built and the payload that they produce. The sandbox is based on GraphiQL.

On the left side of the screen, the GraphQL editor allows the user to insert a custom query of their choosing, based on the available types. The user can learn what types they can request using the “Docs” popup in the top-right corner of the screen.

The “Docs” button opens the “Documentation explorer” popup, allowing the user to view the available GraphQL hierarchical entity definitions and use those to request information.

Clicking the “Execute query” button at the top left of the screen runs the query, and the resulting payload is presented on the right side of the interface.

The user can also look at the history of previous queries and adjust those to experiment with the different payload results.

Architecting our mission

Modern platform

We've made choices about what "modern platform" means. These choices were informed by the wider world of content and knowledge systems, where others face similar challenges. We explored emerging patterns and challenges. How do we "create once, publish everywhere"? How do we design for the distribution of knowledge to wherever people engage with it?

We also relied on 18 months of architectural explorations conducted prior to this exercise. These explorations enabled us to identify what we need from a "modern" platform. Some needs are in sync with the world at large and a few (essential challenges) are unique.

We define modern platform as interrelated capabilities relying on emerging industry patterns (see below). At a system level, these patterns are the implementation details. They lay the foundation for low-level interactions between knowledge sources and products that scale as the system scales, which is essential to our mission.

Patterns

Patterns enable us to design for emergence: create interrelated capabilities that can become greater than the sum of their parts. We focused on patterns that enable stable, predictable, changeable and encapsulated parts. Patterns that let us design a system by focusing on:

  • the data model (the shape of "knowledge")
  • the parts that deliver the necessary capabilities (things the system does)
  • the relationship between those parts
  • and the structure of their interaction

The patterns we've explored include:

Canonical data modeling: A canonical data model is a predictably-structured, technology-agnostic data structure that represents the system as a whole instead of each component having its own representation of the data. Discrete bits of information are interconnected based on relationships between them and contextualized with metadata. This allows users and machines to consume content easily without specifically caring about the underlying technologies driving the system.

Working draft of our CDM

Loose coupling: Loose coupling is the practice of organizing a system into independent, distinct subsystems that communicate with one another to support the complete operation of the system. The implementation of how to split the operation of the system into subsystems depends on the needs of the system, the capabilities it requires, the infrastructure, and the way product and technology teams work together.

Event-based interactions and event sourcing: The event-based interactions pattern defines the way that subsystems interact with each other in a loosely coupled system. Instead of querying a central database, separate subsystems exchange information by publishing and consuming “events”. Each event contains information about a change that has occurred, regardless of where the change originated. These events can be consumed by the rest of the system, allowing each subsystem to remain distinct and encapsulated but still share, subscribe, and respond to operations done by other subsystems.
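A minimal sketch of the pattern in Go, using an in-memory channel in place of the prototype's SNS messages; the ChangeEvent shape is an assumption for illustration.

package main

import (
    "fmt"
    "time"
)

// ChangeEvent describes a change that occurred, wherever it originated.
type ChangeEvent struct {
    Authority string // e.g. "simple.wikipedia.org"
    Page      string // the page that changed
    Occurred  time.Time
}

func main() {
    bus := make(chan ChangeEvent, 1) // stand-in for SNS
    done := make(chan struct{})

    // Consumer: a subsystem reacts to changes without querying a shared
    // database, staying distinct and encapsulated.
    go func() {
        for ev := range bus {
            fmt.Printf("rebuild canonical model for %s/%s\n", ev.Authority, ev.Page)
        }
        close(done)
    }()

    // Producer: the source announces a change; it doesn't know or care
    // who is listening.
    bus <- ChangeEvent{Authority: "simple.wikipedia.org", Page: "Banana", Occurred: time.Now()}
    close(bus)
    <-done
}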

CQRS: Differentiating between reading and editing. In the PoV, the current structure inside of MediaWiki is left alone; it is the "trusted source". When changes happen in MW, the new system reacts by getting the necessary information and translating it into the canonical data model. This means the design works for reading but not for editing. If more than 90% of the requests are for reads, can editing be a separate part of the system? We're looking at the editing workflow next.
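A sketch of the split, with stand-ins for MediaWiki (the write side) and the structured content store (the read side); the types and helpers here are illustrative, not the PoV's code.

package main

import "fmt"

// ContentStore is the read model: prebuilt section content by ID.
type ContentStore map[string]string

// handleEdit forwards the command to the trusted source. The read model
// is updated later, asynchronously, by the event-driven workflow
// (eventual consistency).
func handleEdit(page, newText string) {
    fmt.Printf("forward edit of %q to MediaWiki\n", page)
}

// handleRead never touches MediaWiki; it serves the prebuilt read model.
func handleRead(store ContentStore, sectionID string) string {
    return store[sectionID]
}

func main() {
    store := ContentStore{"Banana#Description": "Bananas are ..."}
    handleEdit("Banana", "Bananas are yellow ...")
    fmt.Println(handleRead(store, "Banana#Description"))
}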

Leverage points

The scope of modernization -- transforming the world's largest reference website into the world's largest knowledge system -- is monumental. To understand where to focus our time and attention, we've identified three leverage points.

"Folks who do systems analysis have a great belief in “leverage points.” These are places within a complex system where a small shift in one thing can produce big changes in everything." -- Donella Meadows

However we approach it, the first step is a doozy. There is no iterative path towards transformation. Neither is there a lift-and-shift migration option. We need to find capabilities in the system that we can decouple from the current day-to-day operations. As challenging as leverage points may be to find and to change, they unlock highly-valuable opportunities while simultaneously laying a strong and cohesive foundation for the future system.

The leverage points explored in this PoV are:

  1. Giving shape and structure to Knowledge: Honestly, we don't know if it's humanly possible to "structure" Wikipedia content sufficiently. The knowledge we want to share with the world isn't made for modern distribution. We must try. Also, knowledge is currently shaped by the context of a "web page", and that doesn't fit emerging contexts.
  2. Designing inherent relationships between knowledge parts to create collections: Collections are relationships developed, programmatically or by editors, between pieces of knowledge. The way humans envision and plan these relationships shapes the way the knowledge is developed. The PoV pre-builds the knowledge payload (an answer to the queries) based on the relationships we know are the most valued. How would we expand this over time?
  3. Building decoupled relationships between parts of the system rather than building capabilities into the software: This includes changing the choreography of essential activities ... in many ways, the paradigm itself is changing.

Exploring patterns and identifying leverage points helped us prioritize questions to explore next.

Big questions

The scope of questions we need to answer -- some of which we have not yet discovered -- is equally monumental. The PoV leaves many questions unanswered -- on purpose. We are triggering cross-functional discussions and decisions needed to discern a path forward. While we have more questions than answers, we are significantly more confident in the questions. The top four include:

  1. What is "just enough" structure needed for the knowledge? We've structured sections of a page. Is that the lowest-level data object or do we need to go deeper?
  2. What infrastructure can support these patterns at scale in an open-source world?
  3. From a system point of view, can reading be decoupled from editing? With subquestions:
    1. What is the tolerance for eventual consistency?
    2. What is the tolerance for moving away from a desktop page as the system's source of truth?
  4. How will modernization impact the current editing workflow?

The highest-value next step is continuing to gather and apply learnings from teams across the foundation that help answer these questions.

Collections of knowledge and information

How can we design knowledge to be consumed by "infinite product experiences"? How do we enable these "experiences" to control how the knowledge is displayed and how users interact with it? When we say collections, what do we mean?

A page is one, predominant, type of collection of knowledge. What are the others?

The shape of knowledge

During our architectural explorations, a single blocker arose again and again. At the heart of our ecosystem, the knowledge we want to share with the world isn't made for modern distributions. It exists as a "web page" made from a gigantic, tangled, monolithically-orchestrated bundle of proprietary text.

This bundle of text has enabled terrific benefit to the world. The challenge is, without detangling, it won't meet the system's long-term goals in the emerging digital world.

A predictable data model is needed (to some extent) to feed multitudes of new and varied product experiences. Products and platforms that consume knowledge outside the context of MediaWiki need the knowledge structured as distributable, consumable information.

How much "structure" is enough? For example, should a page about a person have sections based on schema.org recommendations? This would make it more consumable by products and platforms. Should "sections" or "references" exist as a structure of knowledge, inside and outside MediaWiki?

At the moment, some of the knowledge is in pages within pages (within pages), related loosely by unique software logic. The page, like a body without a skeleton, has no predictable shape until MediaWiki pieces the bones together to form a Wikipedia web page. Instructions for displaying knowledge on a web page are inextricably woven into the knowledge we hope to distribute beyond a Wikipedia page. How do we enable knowledge to shift context - from a web page to Alexa, for example?

Exchanging free knowledge beyond Wikimedia requires loose coupling and a predictable language of exchange, beyond HTML. Loose coupling enables parts of the system to be built and operate independently. Editing, for example, doesn't need to be enmeshed with reading. Machine learning can provide necessary information without being ensconced inside editing software. Decoupling depends on the knowledge itself being shared in a software-and-context-agnostic way.

The shape of collections

A page is a container for a collection of parts. A page was the initial shape; a website was a framework that related containers to other containers, thus building a collection of knowledge. Now, we are building a new framework that includes pages but is not defined by them.

Categories are collections. Pages and parts of pages associated with a Wikidata item are a collection. Collections are relationships developed, programmatically or by editors, between pieces of knowledge. The way humans envision and plan these relationships shapes the way the knowledge is developed.

To form scalable collections, the knowledge needs cataloging. Consistency of relationships between knowledge parts makes collections consumable by nearly-infinite products and platforms without overtaxing the system with queries. Predictable, prebuilt relationships that don't rely on extensive fuzzy-logic queries are ideal.

Multiple trusted sources

The PoV uses Simple Wikipedia as the primary knowledge source, but the same pattern will apply when adding any subsequent source. There can be multiple Wikipedias, for example. The platform would add a service that responds to an event sent from the source and gets the change from the source's API. As long as both are possible (emitting change events and exposing an API), a source is likely a valid participant.
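A sketch of what such a source adapter could look like; the Source interface and wikiSource type are assumptions for illustration, not the prototype's actual code.

package main

import (
    "context"
    "fmt"
)

// Source is anything that can announce changes and serve their content.
type Source interface {
    Authority() string
    Fetch(ctx context.Context, page string) (string, error) // the source's API
}

type wikiSource struct{ authority string }

func (w wikiSource) Authority() string { return w.authority }

// Fetch stands in for a call like the Parsoid API request the PoV makes
// for Simple Wikipedia.
func (w wikiSource) Fetch(ctx context.Context, page string) (string, error) {
    return "<html>...</html>", nil
}

// onChange is the shared flow: any source that can emit events and answer
// API requests plugs into the same translation step.
func onChange(ctx context.Context, src Source, page string) error {
    raw, err := src.Fetch(ctx, page)
    if err != nil {
        return err
    }
    fmt.Printf("translate %s/%s into the canonical model (%d bytes)\n",
        src.Authority(), page, len(raw))
    return nil
}

func main() {
    simple := wikiSource{authority: "simple.wikipedia.org"}
    _ = onChange(context.Background(), simple, "Banana")
}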

Rosette is our source for topics, creating collections based on related Wikidata items. Other context-creating sources can be added similarly. Wikidata can also be a source to enhance information about the topic.

Many product experiences and other platforms

When we imagine "nearly-infinite product experiences", what comes to mind? Answering that question is cross-functional work happening now. For the PoV, we imagined things like:

  • product experiences requesting knowledge so they can build their own "page", collection or context for display.
  • these experiences drawing from multiple sources and needing relationships between them that give the knowledge meaning (everything about Barack Obama, for example)
  • a website or app about Cricket that draws people towards Wikipedia in places that aren't part of the community yet
  • any decoupled frontend experience

For platforms, we imagined:

  • big platforms that use the free knowledge getting exactly what they need (and perhaps monetizing that request)
  • interrelationships with WhatsApp and Facebook that draw people into learning and perhaps editing
  • pushing knowledge to platforms

We also imagined the Internet of Things, news and information sources, and products designed to increase engagement.

Challenges to consider

The primary challenge is embracing uncertainty. We can't know how this emergent system will emerge. We can make sound, well-reasoned decisions while exploring the path to modernization. We can design for emergence with architectural best practices. But uncertainty is our companion and we will be making sound decisions in the midst of it.

Modeling and planning for change triggers confusion and anxiety, two things that will most certainly push the system in the wrong direction (to regain the status quo). This is a challenge that must not be underestimated.

We are -- above all else -- changing the timing of events in the system away from transactional, towards asynchronous. This will challenge some fundamental precepts of the current software architecture. For example, a Page isn't the primary object (it's a collection of sections). A deep discussion is beyond the scope of this artifact but is inherent in all subsequent work.

Other major challenges include:

  • Designing an infrastructure that will scale, including developing tools that provide the necessary capabilities.
  • Agreeing on "just enough structure" of the current page content.
  • Breaking down a page into a data model that differs from the crowdsourced version.
  • Versioning (across the system).
  • Understanding how creating knowledge in the "sources" interrelates with serving that knowledge everywhere.

Next steps

The next steps are delivering four further artifacts:

  1. Infrastructure capabilities analysis: designing the ability to build these types of systems on our infrastructure.
  2. Data model exploration: strategizing around source of truth and how information flows between independent services.
  3. Sharing and iterating on the evolving model of the target architecture.
  4. Roadmap of organizational and sociotechnical process changes needed to do any of this.

Teams like Okapi, Inuka, WMDE and Structured Data are exploring adopting this work as part of their future plans.

Many branches of discussion have already begun. Their success depends on:

  • understanding the tradeoffs, especially in areas that have been philosophically off limits
  • enabling a continuous flow of informed decisions
  • cross-functional discovery and iterative step taking
  • defining aspirational terms like "modern"
  • understanding the cost

By "cost", we mean estimating the financial investment, though it's too soon for that analysis. We also mean the time, energy and expertise required. We mean the social and cultural changes that may be necessary to remove roadblocks. And we mean discerning the balance between our values, goals and investments.

Glossary

  • Modern / Modernize: a system of loosely-coupled capabilities with logical boundaries that rely on emerging industry patterns including canonical data modeling and event-driven interactions