User:Tpt/RFC

The goal of this proposal, which reflects the personal views of User:Tpt, is to find:
 * 1) a good way to distribute book metadata between Wikidata, Commons and Wikisource so as to avoid duplication as much as possible.
 * 2) a way to implement powerful book metadata storage on Wikisource.

Wikisource
Currently, book metadata is stored on Wikisource in an unstructured and untyped way: metadata of books without scans is stored in templates such as the Header template of the English Wikisource, and metadata of books with scans is often duplicated between the Index: page and a header template in the main namespace, despite the existence of the "header" parameter of the tag. This makes the data very difficult for machines to read. Some very hacky partial solutions, such as this set of HTML classes, have been created, but they do not allow exact and easy reading of the data.

Wikidata
Wikidata stores bibliographic data using the FRBR pattern. There are items for the "work" (i.e. the book as an abstract entity) and for each "manifestation" (mostly the different editions of the book). For more information, see the Wikidata books task force.

Commons
Currently, scan metadata is stored in the Book template in an untyped way but, in the future, Commons will probably adopt a Wikidata-like metadata namespace that will allow machines to read file data easily.

Summary
The current system leads to:
 * 1) a lack of machine readability of metadata, and therefore very limited interoperability with other systems.
 * 2) a major duplication of data. For a book with a scan, work and manifestation metadata are currently stored between two and four times: on the File: page on Commons, on the Index: page of Wikisource, on the main pages of Wikisource and on Wikidata.

Assumptions
In the following proposal we assume that:
 * metadata duplication is bad and should be avoided as much as possible.
 * we should distinguish between the work and manifestation/expression entities defined by FRBR. This distinction is needed because the Wikisources often host more than one edition of the same work.
 * we should distinguish between a book scan with its transcription and the works as presented in the Wikisource main namespace, because one scan may contain more than one work (as in collections) and one work may span more than one scan (as in books split into several volumes).
 * we should keep backward compatibility so that the new system can be adopted easily.

Proposal
Here is the distribution of data we may adopt:
 Work item (Wikidata)
  |
 Manifestation item (Wikidata)
  |
 Scan data (Commons)
  |
 Transcription data (Index:, Wikisource)
  |
 "Presentational" data (Book: Wikisource)
  |
 Presentation of texts (Main: Wikisource)
Notes:
 * "|" means that the lower place can access the data of the upper place. For example, "Scan data (Commons)" can access "Manifestation item (Wikidata)" data.
 * Book: page are...
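The access rule in the notes above can be sketched as a lookup that walks upward from one place to the next. This is only a minimal illustration, not part of the proposal: the `Place` class, the `lookup` method, the metadata keys and the sample values are all hypothetical, and the sketch assumes that each place has a single upper place, which may be simpler than what the diagram intends.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Place:
    """One storage place, e.g. "Scan data (Commons)" (hypothetical class)."""
    name: str
    data: dict = field(default_factory=dict)  # metadata key -> list of values
    upper: Optional["Place"] = None           # the place reachable through "|"

    def lookup(self, key):
        """Return (place name, values) from the nearest place that defines key."""
        place = self
        while place is not None:
            if key in place.data:
                return place.name, place.data[key]
            place = place.upper
        # No place in the chain stores this key.
        return None, []


# Hypothetical sample data; keys and values are illustrative only.
work = Place("Work item (Wikidata)", {"author": ["Example Author"]})
manifestation = Place(
    "Manifestation item (Wikidata)",
    {"publication date": ["1862"]},
    upper=work,
)
scan = Place("Scan data (Commons)", {"scan source": ["Example Library"]}, upper=manifestation)

# The scan does not store the author itself; the value is found on the work item.
print(scan.lookup("author"))
```

With such a chain, a value only needs to be stored once, at the highest place where it applies, which is the point of the first assumption above.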

Wikisource metadata storage implementation
After a review of the BookManager extension, I present two possible implementations of a metadata management system for Wikisource, both more or less based on existing elements.

Note on BookManager
During the Google Summer of Code 2013, Molly White, helped by User:Raylton P. Sousa and User:Mwalker (WMF), began building a new MediaWiki extension in order to add the notion of a book to MediaWiki. See Book management. It is an amazing project, but I disagree with it on two points:
 * 1) I believe that the new Book: pages should be used alongside Index: pages because there is no 1-1 relationship between the two kinds of pages: the Index: pages are about a printed book and its transcription, and the Book: pages are about a version of a book as an abstract entity. So there may be more than one Book: page for a single Index: page, as in collections (TODO), or more than one Index: page for a single Book: page, as in books split into several volumes.
 * 2) I believe that the metadata format chosen has some strong disadvantages:
   * it is not really extensible or configurable, and so goes against the adaptability that is at the core of how Wikimedia projects are built.
   * the stored data does not allow more than one value per entry, which prevents easy machine readability in cases such as books that have more than one author. It also makes it more difficult to share data with Wikidata, which supports this feature.

First solution: Wikibase
A first solution may be to use the Wikibase extension that powers Wikidata, with some modifications in order to support BookManager or ProofreadPage specific features such as the book structure, and to output data from Wikidata in a read-only format. It would give us a very powerful metadata system with very easy integration with Wikidata. But this is only feasible with strong help from the Wikidata development team employed by Wikimedia. It would also lead to a difficult migration of the content of Index: pages.

Second solution: Custom metadata system
The second solution is to implement a data model simpler than that of Wikidata, but using the same format for data types (string, datetime...).

An "Entry", i.e. an Index: or a Book: page, would have a metadata section that is a table of (key, array of values) pairs, where the array contains all the values for a specific metadata key. Each key has a specific datatype that its values should match but, in order to keep compatibility with existing Index: pages, the type may fall back to the "invalid" type for specific (property, values) tuples. The values are stored, like Wikibase values, in the format provided by the DataValue library (see the data types section for a preliminary list of datatypes).
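The entry model described above can be sketched as follows. This is a rough illustration under stated assumptions, not the proposed implementation: the `Entry` class, the `KEY_TYPES` configuration, the validators and the sample keys are all hypothetical, the real values would use the DataValue format rather than plain Python objects, and the fallback to "invalid" is applied here per value, which is only one possible reading of the rule.

```python
from dataclasses import dataclass, field
from datetime import date


# Hypothetical validators, one per datatype name.
def _is_string(value):
    return isinstance(value, str)


def _is_date(value):
    return isinstance(value, date)


VALIDATORS = {"string": _is_string, "date": _is_date}

# Hypothetical configuration: the datatype expected for each metadata key.
KEY_TYPES = {"author": "string", "publication date": "date"}


@dataclass
class Entry:
    """An Index: or Book: page with its metadata table (hypothetical class)."""
    title: str
    # metadata key -> list of (datatype, value) pairs, so a key can hold several values
    metadata: dict = field(default_factory=dict)

    def add(self, key, value):
        """Store a value and return the datatype it was stored under."""
        expected = KEY_TYPES.get(key, "invalid")
        check = VALIDATORS.get(expected)
        # Values that do not match the key's datatype are kept, but typed "invalid".
        datatype = expected if check is not None and check(value) else "invalid"
        self.metadata.setdefault(key, []).append((datatype, value))
        return datatype


entry = Entry("Index:Example.djvu")
entry.add("author", "First Author")
entry.add("author", "Second Author")            # a second value for the same key
entry.add("publication date", date(1862, 1, 1))
entry.add("publication date", "circa 1862")     # legacy free text, stored as "invalid"
print(entry.metadata)
```

Keeping non-matching values instead of rejecting them is what would let existing Index: pages be migrated without losing their free-text fields.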

Unlike Wikibase, there won't be a single view and edit system implemented in JavaScript, but a PHP-based editing system that would be an improved version of the current ProofreadPage and BookManager editing systems, and a view managed by a template (or maybe by a Scribunto module), as is done for Index: pages. With that system we won't break the current site structure or the users' workflow.

The Book: pages will be able to parse the content of the "toc" field of Index: pages, as the Proofread Page extension currently does, in order to generate their book structure.
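As a rough idea of what such parsing could look like, the sketch below extracts the wikilinks of a "toc" field in order and treats each one as a chapter. It does not reproduce the actual Proofread Page parser: the `parse_toc` function, the regular expression and the sample page titles are assumptions made for illustration, and templates or nested markup are ignored.

```python
import re

# Matches [[Target]] and [[Target|Label]] wikilinks (simplified, no nested markup).
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def parse_toc(wikitext):
    """Return the (target page, label) pairs of all wikilinks, in document order."""
    chapters = []
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        # Without an explicit label, the link target doubles as the label.
        label = (match.group(2) or target).strip()
        chapters.append((target, label))
    return chapters


# Hypothetical content of the "toc" field of an Index: page.
toc = """
* [[Example Book/Chapter 1|Chapter 1]]
* [[Example Book/Chapter 2]]
"""

for target, label in parse_toc(toc):
    print(target, "->", label)
```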

Book: page
The index field contains the name of the main related index.

Index: page
The index field contains the name of the main related index.

Data types

 * string (without wikitext, as in Wikibase)
 * monolingual string (without wikitext, as in Wikibase)
 * mediawiki-title: links within the Wikisource (really needed?)
 * wikibase item
 * date (as in Wikibase)
 * number (the Wikibase number without unit)
 * wikitext