Architecture Repository/Artifacts/Knowledge store

From mediawiki.org



Knowledge store

Free knowledge data model based on schema.org

Last updated: 2022-12-16 by APaskulin (WMF)
Status: v1 published May 2021 to inform the creation of the schema for Wikimedia Enterprise. For the current Wikimedia Enterprise schema, see the data dictionary on enterprise.wikimedia.com.

The purpose of this document is to define a predictable structure for distributing Wikimedia content. To do this, we’ve chosen to use standard types and properties from schema.org. This model is not meant to replace existing data structures within MediaWiki; instead, these structures can act as part of a distribution layer that consumes, structures, and serves knowledge beyond Wikimedia.

Note on schema.org: schema.org is designed to provide structured metadata for web content. We’ve taken this idea a step further by using schema.org’s shared vocabulary to structure the content itself. This allows us to use the same patterns as schema.org even though we’re not using traditional Microdata, RDFa, or JSON-LD formats.
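The examples in this document spell property names in snake_case (in_language, is_part_of), whereas schema.org spells them in camelCase (inLanguage, isPartOf). As a minimal sketch, assuming the two spellings always correspond mechanically (this document does not state that they do), a consumer could map a model property back to its schema.org name as follows. The function name to_schema_org_name is hypothetical.

```python
def to_schema_org_name(prop: str) -> str:
    """Convert a snake_case property name to schema.org-style camelCase.

    Assumption: the correspondence is purely mechanical. Properties marked
    "not on schema.org" (such as direction) have no schema.org counterpart,
    so the result for them is only a spelling, not a real schema.org term.
    """
    head, *rest = prop.split("_")
    return head + "".join(part.capitalize() for part in rest)


for prop in ("name", "in_language", "is_part_of", "date_modified"):
    print(prop, "->", to_schema_org_name(prop))
```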

Using this model

We encourage Wikimedia projects to make use of this model, either as a whole or as a base to build on. Services currently using this model include Phoenix (structured content proof of value) and Wikimedia Enterprise.

Adding a property

As defined here, the model is restricted to properties that are meaningful outside the context of MediaWiki. To suggest a new property, leave a comment on the talk page. New properties should conform to the applicable schema.org type whenever possible.

Feedback and questions

To share feedback or ask questions, leave a comment on the talk page. Note that there are often several unknowns associated with each type; these unknowns are tracked in the notes and questions subsections.

Patterns

Canonical data modeling

Allows content to be understood by people, programs, and machines outside the boundaries of the system

Capabilities

Serve and distribute

Distribute predictably-structured knowledge to products and platforms

Language

a human language
Based on schema.org Language

Example
{
  "name": "English",
  "identifier": "en",
  "direction": "ltr"
}
| Property | Type | Description |
|---|---|---|
| name | Text | Language name in that language |
| identifier | Text | Language code as used by Wikimedia (ISO 639 with exceptions[1]) |
| direction (not on schema.org) | Text | right-to-left (rtl) or left-to-right (ltr) |
| variant (not on schema.org) | Text | Language variant[2] (if applicable) |
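As an illustration of how a consumer might represent this type, here is a minimal sketch in Python. Treating every property except identifier as optional is an assumption, made because the nested in_language objects in the Project and Page examples carry only an identifier. The class itself is hypothetical and not part of the model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Language:
    identifier: str                  # Wikimedia language code, e.g. "en"
    name: Optional[str] = None       # language name in that language
    direction: Optional[str] = None  # "ltr" or "rtl"
    variant: Optional[str] = None    # language variant, if applicable

    def __post_init__(self) -> None:
        # The table allows only two values for direction.
        if self.direction not in (None, "ltr", "rtl"):
            raise ValueError(
                f"direction must be 'ltr' or 'rtl', got {self.direction!r}"
            )


# Full object, as in the example above.
english = Language(**{"name": "English", "identifier": "en", "direction": "ltr"})

# Reference-only form, as nested inside other objects.
reference = Language(identifier="en")
```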

Notes and questions

Project

a wiki in a single language
Based on schema.org CreativeWork (not on schema.org Project)

Example
{
  "name": "Wikipedia",
  "identifier": "en.wikipedia.org",
  "in_language": {
    "identifier": "en"
  },
  "url": "https://en.wikipedia.org",
  "size": {
    "value": 70934,
    "unit_text": "MB"
  }
}
| Property | Type | Description |
|---|---|---|
| name | Text | Unabbreviated project name in the language specified by in_language (Example: Wikipedia, Wikisłownik, etc.) |
| identifier | Text | Project domain (Example: en.wikipedia.org) |
| in_language | Language | Human language the project is written in |
| url | Text | URL for the project entry point (not directly to the main page) |
| size | QuantitativeValue | Project size when downloaded as a whole (compressed) |
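In the example above, identifier is the same as the host part of url. The following sketch checks that relationship and formats the size value. Whether the relationship holds for every project is an assumption drawn from the single example, and both helper names are hypothetical.

```python
from urllib.parse import urlparse

project = {
    "name": "Wikipedia",
    "identifier": "en.wikipedia.org",
    "in_language": {"identifier": "en"},
    "url": "https://en.wikipedia.org",
    "size": {"value": 70934, "unit_text": "MB"},
}


def url_matches_identifier(project: dict) -> bool:
    """Return True when the host of `url` equals `identifier` (assumed rule)."""
    return urlparse(project["url"]).netloc == project["identifier"]


def describe_size(project: dict) -> str:
    """Render the QuantitativeValue as "<value> <unit_text>"."""
    size = project["size"]
    return f'{size["value"]} {size["unit_text"]}'


print(url_matches_identifier(project))  # True
print(describe_size(project))           # 70934 MB
```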

Notes and questions

  • How should we handle in_language for multilingual projects? (Commons, Wikispecies, Wikidata, etc.)

Page

a wiki page
Based on schema.org Article

Example
{
  "name": "Pinnation",
  "identifier": 339742,
  "url": "https://en.wikipedia.org/wiki/Pinnation",
  "in_language": {
    "identifier": "en"
  },
  "is_part_of": [
    {
      "identifier": "en.wikipedia.org"
    }
  ],
  "version": 975098740,
  "date_modified": "2020-08-26T18:48:58Z",
  "license": [
    {
      "identifier": "CC-BY-SA-3.0",
      "name": "Creative Commons Attribution Share Alike 3.0 Unported",
      "url": "https://creativecommons.org/licenses/by-sa/3.0/"
    }
  ],
  "main_entity": {
    "identifier": "Q3756157"
  },
  "keywords": "Plant morphology, Leaves",
  "has_part": [
    {
      "identifier": "/node/ff569ed4759dbfc"
    }
  ]
}
| Property | Type | Description |
|---|---|---|
| name | Text | Page title in reading-friendly format (spaces instead of underscores) |
| identifier | Integer | Page ID (MediaWiki page ID) |
| url | Text | Complete URL for the page |
| in_language | Language | Human language the page is written in |
| is_part_of | array of Project | Wiki the page belongs to |
| version | Integer | Revision ID (MediaWiki revision ID) |
| date_modified | Text | Timestamp of latest revision in ISO 8601 format (DateTime) |
| license | array of License | Content license |
| main_entity | Entity | Primary subject of the page (Wikidata ID) |
| keywords | Text | Comma separated list of categories the page belongs to |
| has_part | array of Section | Page sections |
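Two of these properties are plain text that a consumer will usually want to parse: date_modified and keywords. The sketch below shows one way to do that, assuming timestamps always use the UTC "Z" form shown in the example and that category names never contain a comma (neither is guaranteed by this document). Both function names are hypothetical.

```python
from datetime import datetime, timezone
from typing import List


def parse_date_modified(value: str) -> datetime:
    """Parse a UTC timestamp such as "2020-08-26T18:48:58Z".

    Assumption: the value always ends in "Z" and has no fractional seconds.
    """
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def split_keywords(value: str) -> List[str]:
    """Split the comma separated category list into individual names.

    Assumption: category names do not themselves contain commas.
    """
    return [part.strip() for part in value.split(",") if part.strip()]


print(parse_date_modified("2020-08-26T18:48:58Z").isoformat())
print(split_keywords("Plant morphology, Leaves"))
```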

Notes and questions

  • Consider using display title for name instead of reading-friendly title
  • How should we handle media files associated with a page? Schema.org has audio, video, thumbnailUrl, and primaryImageOfPage (MediaObject). Note that primaryImageOfPage belongs to the WebPage type, not Article.
    • How to handle licenses for images embedded in a page? (Check with legal)
  • Should we include other URLs (mobile, edit, talk, etc.)? Schema.org has discussionUrl but no others.
  • We’ve intentionally not included content at the page level in favor of providing content at the section level.
  • Is it a problem that is_part_of refers to different types in different objects (Project for a Page, Page for a Section)?
  • Properties to consider:
    • about - Rosette or other set of page subjects (Wikidata items)
    • interactionStatistic seems like the most logical place for pageviews, number of edits, etc. What types of stats should we include? (array of InteractionCounter)
    • mentions - array of Thing, links included within the page
    • abstract: Is there a way we could get the first two sentences of the article?
    • citation (References used on the page)
    • schemaVersion (https://schema.org/docs/releases.html#v12.0) seems like a good idea, but I’m struggling to see the value. These releases seem to come out every few months.
    • page quality score (aggregateRating?)
    • copyrightHolder - “The text of Wikipedia is copyrighted (automatically, under the Berne Convention) by Wikipedia editors and contributors and is formally licensed to the public under one or several liberal licenses.”[1] (Is this already covered by license?)
    • dateCreated (page’s initial publication date)
    • creativeWorkStatus
    • creditText (attribution text)

Section

content grouped under a heading or as an introduction before the first heading on a page
Based on schema.org CreativeWork

Example
{
  "name": "Orbit and turning",
  "identifier": "/node/ff569ed4759dbfc",
  "version": 975098740,
  "is_part_of": [
    {
      "identifier": 339742
    }
  ],
  "text": "...html...",
  "encoding_format": "text/html",
  "license": [
    {
      "identifier": "CC-BY-SA-3.0",
      "name": "Creative Commons Attribution Share Alike 3.0 Unported",
      "url": "https://creativecommons.org/licenses/by-sa/3.0/"
    }
  ]
}
| Property | Type | Description |
|---|---|---|
| name | Text | Section heading |
| identifier | Text | Knowledge store ID |
| version | Integer | MediaWiki revision ID |
| is_part_of | array of Page | Page the section belongs to |
| text | Text | Section content in HTML |
| encoding_format | MIME type | "text/html" |
| license | array of License | Content license |
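A Page points to its sections through has_part, and each Section points back through is_part_of. The sketch below resolves those references for a page, using trimmed copies of the examples in this document. It assumes that the order of has_part reflects the order of sections on the page, which this document does not state. The function name sections_for_page is hypothetical.

```python
from typing import Dict, List

page = {
    "identifier": 339742,
    "has_part": [{"identifier": "/node/ff569ed4759dbfc"}],
}

sections = [
    {
        "name": "Orbit and turning",
        "identifier": "/node/ff569ed4759dbfc",
        "is_part_of": [{"identifier": 339742}],
        "text": "...html...",
        "encoding_format": "text/html",
    },
]


def sections_for_page(page: dict, sections: List[dict]) -> List[dict]:
    """Return the page's sections in has_part order, skipping unresolved ones."""
    by_id: Dict[str, dict] = {s["identifier"]: s for s in sections}
    ordered = []
    for ref in page.get("has_part", []):
        section = by_id.get(ref["identifier"])
        if section is None:
            continue  # reference to a section we were not given
        # Keep the section only if it also points back at this page.
        parents = [p["identifier"] for p in section.get("is_part_of", [])]
        if page["identifier"] in parents:
            ordered.append(section)
    return ordered


print([s["name"] for s in sections_for_page(page, sections)])
```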

Notes and questions

  • Properties to consider:
    • dateModified
    • about - Rosette or other set of page subjects (Wikidata items)

License

content license
Based on schema.org CreativeWork

Example
{
  "identifier": "CC-BY-SA-3.0",
  "name": "Creative Commons Attribution Share Alike 3.0 Unported",
  "url": "https://creativecommons.org/licenses/by-sa/3.0/"
}
| Property | Type | Description |
|---|---|---|
| name | Text | License name |
| identifier | Text | License ID from spdx.org |
| url | Text | URL for the license text |

Notes and questions

Entity

a subject of a page
Based on schema.org Thing

Example
{
  "identifier": "Q3756157"
}
| Property | Type | Description |
|---|---|---|
| identifier | Text | Wikidata ID |
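Because an Entity carries only a Wikidata ID, a consumer has to look up everything else on Wikidata. As a minimal sketch, the function below turns an Entity into the URL of the corresponding item page. It assumes the identifier is always an item ID (a "Q" followed by digits), as in the example; this document does not say whether other kinds of Wikidata IDs can appear. The function name wikidata_item_url is hypothetical.

```python
import re

# Assumption: Entity identifiers are Wikidata item IDs such as "Q3756157".
ITEM_ID = re.compile(r"Q[1-9][0-9]*")


def wikidata_item_url(entity: dict) -> str:
    """Build the wikidata.org page URL for an Entity object."""
    identifier = entity["identifier"]
    if not ITEM_ID.fullmatch(identifier):
        raise ValueError(f"not a Wikidata item ID: {identifier!r}")
    return f"https://www.wikidata.org/wiki/{identifier}"


print(wikidata_item_url({"identifier": "Q3756157"}))
```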

Notes and questions

  • Connection with Wikidata