User:Tpt/RFC

From mediawiki.org

The goal of this draft proposal, which reflects User:Tpt's own thoughts, is to find:

  1. a good way to distribute book metadata between Wikidata, Commons and Wikisource so as to avoid duplication as much as possible.
  2. how to implement a powerful book metadata storage on Wikisource.

Current state

Wikisource

Currently, book metadata is stored in Wikisource in an unstructured and untyped way. Metadata of books without scans is stored in templates such as the Header template of the English Wikisource. Metadata of books with scans is often duplicated between the Index: page and a header template in the main namespace, despite the existence of the "header" parameter of the <pages> tag. As a result, the data is very difficult for machines to read. Some hacky partial solutions, like this set of HTML classes, have been created, but they don't allow exact and easy reading of the data.

Wikidata

Wikidata stores bibliographic data more or less following the FRBR pattern. There are items for the "work" (i.e., the book as an abstract entity) and for each "manifestation" (mostly the different editions of the book). For more information, see the Wikidata books task force.

Commons

Currently, scan metadata is stored in the Book template in an untyped way, but Commons plans to switch to a Wikidata-like metadata namespace that will allow machines to read file data easily.

Summary

The current system leads to:

  1. no machine readability of metadata, and therefore very limited interoperability with everyone else.
  2. major duplication of data. For a book with a scan, work and manifestation metadata is currently stored in two to four places: on the File: page on Commons, on the Index: page of Wikisource, on the main pages of Wikisource and on Wikidata.

Assumptions

In the following proposal we assume that:

  • metadata duplication is bad and should be avoided as much as possible.
  • we should distinguish between the work and manifestation/expression entities defined by FRBR. This distinction is needed because the Wikisources often host more than one edition of the same work.
  • we should distinguish between a book scan with its transcription and the works as presented in the Wikisource main namespace, because one scan may contain more than one work (as in collections) and one work may span more than one scan (as in books split into several volumes).
  • we should keep backward compatibility in order to allow easy adoption of the new system.
  • we should have something as simple and user-friendly as possible.

Proposal

Here is the distribution of data we may adopt:

                        Work item (Wikidata)
                                 |
                   Manifestation item (Wikidata)
                       |                  |
             Scan data (Commons)          |
                       |                  |  
Transcription data (Index:, Wikisource)   |
           |           |                  |
           |    "Content" data (Index: Wikisource)
           |                 |
         Presentation of texts (Main: Wikisource)

Note: "|" means that the lower place has access to the data of the upper place. For example, "Scan data (Commons)" can access "Manifestation item (Wikidata)" data.
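One possible reading of the diagram can be sketched in Python. The layer names and the `ACCESS` mapping are a hypothetical transcription of the diagram, not an existing API, and the sketch assumes that access is transitive (a layer can also read whatever the layers above it can read), which the note does not state explicitly.

```python
# Hypothetical encoding of the diagram: each layer maps to the layers
# directly above it whose data it can read.
ACCESS = {
    "work": [],
    "manifestation": ["work"],
    "scan": ["manifestation"],
    "transcription": ["scan"],
    "content": ["transcription", "manifestation"],
    "presentation": ["transcription", "content"],
}


def readable_layers(layer):
    """Return every layer whose data `layer` can read, nearest first.

    Assumes access is transitive, which is an interpretation of the diagram.
    """
    seen = []
    queue = list(ACCESS[layer])
    while queue:
        current = queue.pop(0)
        if current not in seen:
            seen.append(current)
            queue.extend(ACCESS[current])
    return seen
```

Under these assumptions, `readable_layers("scan")` yields the manifestation and then the work, while the presentation layer in the main namespace can reach every other layer.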

We would rebuild the way Index: pages work. They would still contain the transcription metadata.

FAQ

What about using the BookManager extension?

During the Google Summer of Code 2013, Molly White, helped by User:Raylton P. Sousa and User:Mwalker (WMF), began building a new MediaWiki extension that adds the notion of a book to MediaWiki. See meta:Book management. It's an amazing project, but I have two strong concerns:

  1. It creates yet another namespace and so introduces extra complexity.
  2. It doesn't look very extensible or configurable, and so goes against the adaptability that is at the core of how Wikimedia projects are built.
    1. The stored data doesn't allow more than one value per entry, which prevents easy machine readability in cases such as books with more than one author. It also makes it more difficult to share data with Wikidata, which supports this feature.

First solution: Wikibase

A first solution might be to use the Wikibase extension that powers Wikidata, with some modifications to support BookManager- or ProofreadPage-specific features such as the book structure, and to output data from Wikidata in a read-only format. It would give us a very powerful metadata system with very easy Wikidata integration. But this is only possible with strong help from the Wikidata development team. It would also make the migration of Index: page content difficult.

Second solution: Custom metadata system

The second solution is to implement a data model simpler than Wikidata's but using the same format for data types (string, datetime...).

An "Entry", i.e. an Index: or a Book: page, would have a metadata section that is a table of (property, array of values) tuples, where "array of values" contains all the values for a specific metadata property. Each property has a specific datatype that its values should match. To keep backward compatibility with existing Index: pages, a value's type may fall back to the simple "string" type.
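As a rough illustration of the typed table with its "string" fallback, here is a minimal Python sketch. The validator functions, the `PROPERTY_TYPES` declarations and the hyphenated type names are assumptions made for the example; the proposal only lists the datatypes and does not specify how they would be checked.

```python
import re

# Hypothetical validators, one per datatype name; the checks are illustrative.
VALIDATORS = {
    "string": lambda v: isinstance(v, str),
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "wikibase-item-id": lambda v: isinstance(v, str)
    and re.fullmatch(r"Q[1-9]\d*", v) is not None,
}

# Hypothetical property declarations: property name -> expected datatype.
PROPERTY_TYPES = {"author": "wikibase-item-id", "year": "number"}


def typed_value(prop, raw):
    """Wrap `raw` with the declared type of `prop`, or fall back to string."""
    expected = PROPERTY_TYPES.get(prop, "string")
    if VALIDATORS[expected](raw):
        return {"value": raw, "type": expected}
    # Fallback for legacy Index: page content that does not match the type.
    return {"value": str(raw), "type": "string"}


def build_metadata(raw_fields):
    """Turn {property: [raw values]} into a (property, array of values) table."""
    return {
        prop: [typed_value(prop, raw) for raw in values]
        for prop, values in raw_fields.items()
    }
```

With these declarations, `build_metadata({"author": ["Q1234", "Jean Racine"]})` keeps the first value as an item id and stores the second as a plain string, so untyped legacy content can coexist with typed values.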

Unlike Wikibase, there won't be a single view and edit system implemented in JavaScript. Instead there would be a PHP-based editing system, an improved version of the current BookManager one, and a view managed by a template (or maybe by a Scribunto module), as is done for Index: pages. With that system we won't break the current site structure or users' workflow. In the future it would be nice to be able to see (and maybe even edit) the content of the Wikidata items related to the current Index: and Book: page.

The Book: pages should be able to parse the content of the "toc" field of Index: pages, as currently done by the Proofread Page extension, in order to generate their book structure.

Examples of structure

Base
{
  "metadata": {
    "author": [
      {
        "value": "Auteur:Jean de la Fontaine",
        "type": "mediawiki-title"
      },
      {
        "value": "Auteur:Jean Racine",
        "type": "mediawiki-title"
      }
    ]
  }
}
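To show how a machine could read a multi-valued property from the Base structure above, here is a small Python sketch. The `values_of` helper is hypothetical and only illustrates the idea; it is not part of any existing extension.

```python
import json

# The Base example, written as a JSON document.
BASE = """
{
  "metadata": {
    "author": [
      {"value": "Auteur:Jean de la Fontaine", "type": "mediawiki-title"},
      {"value": "Auteur:Jean Racine", "type": "mediawiki-title"}
    ]
  }
}
"""


def values_of(entry, prop):
    """Return all values stored for `prop`, or an empty list if absent."""
    return [item["value"] for item in entry["metadata"].get(prop, [])]


entry = json.loads(BASE)
print(values_of(entry, "author"))
# prints ['Auteur:Jean de la Fontaine', 'Auteur:Jean Racine']
```

Because every property maps to an array, a book with several authors needs no special handling.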
Book: page
{
  "index": "Test.djvu",
  "metadata": {...},
  "structure": [...] //Book structure
}

The "index" field contains the name of the main related index.

Index: page
{
  "metadata": {...},
  "pagination": "",
  "toc": ""
}

Data types

  • string (without wikitext, as Wikibase)
  • wikitext
  • mediawiki title: MediaWiki page title (may be fully replaced by wikibase item ids)
  • wikibase item id
  • date (as Wikibase)
  • number (the unitless number of Wikibase)

Properties

Three approaches look possible for managing metadata properties.

  1. Use a global configuration for all Wikisources, with internationalized labels for each property. This will give us easy interoperability between Wikisources and let us develop tools that work everywhere. Custom fields will still be possible but will have to go through a change in the site configuration (no more system messages).
  2. Use system messages, as currently done for Proofread Page, with a mapping to a standard set of properties.
  3. Use a hybrid approach: a common set of basic properties is hard-coded, and additions are allowed through system messages.
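The hybrid approach could look roughly like the following Python sketch. The property names in `COMMON_PROPERTIES`, the `site_properties` function and the rule that local additions may not redefine a common property are all assumptions made for illustration; the proposal does not settle these details.

```python
# Hypothetical hard-coded common set: property name -> datatype.
COMMON_PROPERTIES = {
    "title": "string",
    "author": "mediawiki-title",
    "publication-date": "date",
}


def site_properties(local_additions):
    """Merge site-specific properties into the common set.

    Assumption: local additions may add new properties but may not redefine
    a common one, so tools written against the common set work everywhere.
    """
    merged = dict(COMMON_PROPERTIES)
    for name, datatype in local_additions.items():
        if name in COMMON_PROPERTIES:
            raise ValueError(f"'{name}' is a common property and cannot be redefined")
        merged[name] = datatype
    return merged
```

For example, a Wikisource could call `site_properties({"illustrator": "mediawiki-title"})` to add a local field, while an attempt to change the type of "title" would be rejected.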

Implementation details

The basic metadata system will be implemented in an extension named something like MetadataLib, which will act as a library for ProofreadPage and BookManager. These two will remain independent of each other so that BookManager can be used by Wikibooks. MetadataLib code will heavily reuse BookManager code and will be strongly inspired by Wikibase. A fourth extension, Extension:Wikisource, will act as a bridge between ProofreadPage, BookManager and Wikidata.