User:Tpt/RFC

The goal of this proposal, which reflects User:Tpt's own thoughts, is to find:
 * 1) a good way to distribute book metadata between Wikidata, Commons and Wikisource in order to avoid duplication as much as possible.
 * 2) how we should implement a powerful book metadata storage on Wikisource.

Wikisource
Currently, book metadata is stored on Wikisource in a really unstructured and untyped way: metadata of books without scans is stored in templates like the Header template of the English Wikisource, and metadata of books with scans is often duplicated between the Index: page and a header template in the main namespace, despite the existence of the "header" parameter of the tag. So, it is very difficult for machines to read this data. Some very hacky partial solutions, like this set of HTML classes, have been created, but they do not allow the data to be read exactly and easily.

Wikidata
Wikidata stores bibliographic data using the FRBR pattern. There are items for the "work" (i.e. the book as an abstract entity) and for each "manifestation" (mostly the different editions of the book). For more information, see the Wikidata books task force.

Commons
Currently, scan metadata is stored in the Book template in an untyped way but, in the future, Commons will probably adopt a Wikidata-like metadata namespace that will allow machines to easily read file data.

Summary
The current system leads to:
 * 1) a system that does not allow machine readability of metadata, and so offers very limited interoperability with everyone else.
 * 2) a major duplication of data. For a book with a scan, work and manifestation metadata are currently duplicated between 2 and 4 times: on the File: page on Commons, on the Index: page of Wikisource, on the main pages of Wikisource and on Wikidata.

Assumptions
In the following proposal we assume that:
 * metadata duplication is bad and should be avoided as much as possible.
 * we should distinguish between the work and manifestation/expression entities defined by FRBR. The distinction between work and manifestation/expression is needed because we often have more than one edition of the same work in the Wikisource(s).
 * we should distinguish between a book scan and its transcription on the one hand, and the works as presented in the Wikisource main namespace on the other, because there may be more than one work for a single scan, as in collections, or more than one scan for a single work, as in books split into many volumes.
 * we should keep backward compatibility in order to allow an easy adoption of the new systems.

Proposal
Here is the distribution of data we may adopt:

 Work item (Wikidata)
 |
 Manifestation item (Wikidata)
 |                 |
 |           Scan data (Commons)
 |                 |
 |   Transcription data (Index:, Wikisource)
 |                 |
 "Presentational" data (Book: Wikisource)

Notes:
 * "|" means that the lower place can access the data of the upper place. For example, "Scan data (Commons)" can access "Manifestation item (Wikidata)" data.
 * Book: page are...
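The access rule described in the notes above can be sketched as a chain of lookups. This is only an illustration, not part of any existing extension: the `Layer` class, its `uppers` attribute, the property names and all example values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Layer:
    """One metadata store in the hierarchy (hypothetical model)."""
    name: str
    data: Dict[str, List[str]] = field(default_factory=dict)
    uppers: List["Layer"] = field(default_factory=list)  # upper places this one can read

    def lookup(self, key: str) -> Optional[List[str]]:
        """Return the local values for key, else the first match found in an upper place."""
        if key in self.data:
            return self.data[key]
        for upper in self.uppers:
            found = upper.lookup(key)
            if found is not None:
                return found
        return None


# Placeholder values, for illustration only.
work = Layer("Work item (Wikidata)", {"author": ["Example Author"]})
manifestation = Layer("Manifestation item (Wikidata)", {"publication year": ["1900"]}, [work])
scan = Layer("Scan data (Commons)", {"scan file": ["Example.djvu"]}, [manifestation])
transcription = Layer("Transcription data (Index:)", {"progress": ["proofread"]}, [scan])
# A Book: page reads from the transcription when there is one, and from Wikidata directly.
book = Layer('"Presentational" data (Book:)', {}, [transcription, manifestation])

print(book.lookup("author"))
```

With this layout the author is stored once, on the work item, and every lower place reads it from there instead of keeping its own copy.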

Wikisource metadata storage implementation
After a review of the BookManager extension, I present two possible implementations of a metadata management system for Wikisource, more or less based on existing elements.

Note on BookManager
During the Google Summer of Code 2013, Molly White, helped by User:Raylton P. Sousa and User:Mwalker (WMF), began to build a new MediaWiki extension in order to add the notion of a book to MediaWiki. See Book management. It's an amazing project, but I disagree with it on two points:
 * 1) I believe that the new Book: pages should be used alongside Index: pages because there is not a 1-1 relationship between the two kinds of pages: the Index: pages are about a printed book and its transcription, and the Book: pages are about a version of a book as an abstract entity. So, there may be more than one Book: page for a single Index: page, as in collections (TODO), or more than one Index: page for a single Book: page, as in books split into many volumes.
 * 2) I believe that the chosen metadata format has some strong disadvantages:
   * it isn't really extensible and configurable, and so goes against the ability to change that is at the core of how Wikimedia projects are built.
   * the stored data doesn't allow more than one value per entry, which prevents easy machine readability in cases like books that have more than one author. It also makes it more difficult to share data with Wikidata, which supports this feature.
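The multi-value problem can be illustrated with a small sketch. Both layouts below are hypothetical and are not taken from BookManager; they only contrast a single string per entry with a list of values per entry.

```python
# Single-valued field: both authors have to share one string (hypothetical layout).
single_valued = {"author": "Doe, Jane, Roe, Richard"}

# Multi-valued field: one list element per author (hypothetical layout).
multi_valued = {"author": ["Doe, Jane", "Roe, Richard"]}


def authors_from_single(entry):
    """Best-effort split on commas; the consumer has to guess the separator."""
    return [part.strip() for part in entry["author"].split(",")]


def authors_from_multi(entry):
    """No parsing needed: each list element is one author."""
    return list(entry["author"])


print(len(authors_from_single(single_valued)))  # the guessed separator over-splits the names
print(len(authors_from_multi(multi_valued)))
```

A machine reading the single-valued form cannot tell whether a comma separates two authors or the two parts of one name, whereas the list form needs no guessing.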

First solution: Wikibase
A first solution may be to use the Wikibase extension that powers Wikidata, with some small modifications in order to support BookManager or ProofreadPage specific features like the book structure. It would allow us to have a very powerful metadata system, with a very easy integration with Wikidata. The drawbacks are that it wouldn't allow us to have...

Second solution: Custom metadata system
The second solution is to implement a subset of the Wikibase data model. The idea is that an "entry" (better name?), i.e. an Index: or a Book: page, is composed of a list of claims (as Wikibase entities are) that each contain only a main snak, i.e. a (property, value) tuple. The values are stored, like Wikibase values, in the format provided by the DataValue library (see the data types section for the beginning of a list of datatypes).

The storage and the API output formats will be compatible with the Wikibase ones in order to allow a possible future migration to the first solution and as much code sharing as possible.
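A minimal sketch of such an entry is given below, assuming a serialization that imitates the general shape of the Wikibase JSON output ("claims" grouped by property, each with a "mainsnak" holding a "datavalue"). The class names, the property id "P1" and the author values are placeholders, and the exact field names should be checked against the Wikibase documentation before relying on them.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class Snak:
    """A (property, value) tuple; the property ids used below are placeholders."""
    property_id: str
    value: Any
    value_type: str = "string"

    def to_json(self) -> Dict[str, Any]:
        return {
            "snaktype": "value",
            "property": self.property_id,
            "datavalue": {"value": self.value, "type": self.value_type},
        }


@dataclass(frozen=True)
class Claim:
    """A claim reduced to its main snak (no qualifiers, no references)."""
    main_snak: Snak

    def to_json(self) -> Dict[str, Any]:
        return {"mainsnak": self.main_snak.to_json(), "type": "statement"}


@dataclass
class Entry:
    """An Index: or Book: page seen as a list of claims."""
    entry_type: str  # "index" or "book"
    claims: List[Claim]

    def to_json(self) -> Dict[str, Any]:
        grouped: Dict[str, List[Dict[str, Any]]] = {}
        for claim in self.claims:
            grouped.setdefault(claim.main_snak.property_id, []).append(claim.to_json())
        return {"type": self.entry_type, "claims": grouped}


# Two claims for the same placeholder property: more than one value per property.
entry = Entry("book", [Claim(Snak("P1", "Author A")), Claim(Snak("P1", "Author B"))])
serialized = entry.to_json()
print(json.dumps(serialized, indent=2))
```

Grouping claims by property keeps several values for one property side by side, which is the feature the custom format is meant to share with Wikidata.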

 File data (Commons)    bibliographical data (Wikidata)
        |                   |               |
        Index page (Wikisource)             |
                   |                        |
                   Book: page (Wikisource)

Unlike Wikibase, there won't be a single view and edit system implemented in JavaScript, but a PHP-based editing system that would be an improved version of the current ProofreadPage and BookManager editing systems, and a view managed by a template (or, maybe, by a Scribunto module), as is done for Index: pages. With that system we don't break the current site structure and the users' workflow.

Here is a formalisation in Backus-Naur form: := * := "index" | "book" :=   :=  :=

Book: page
The index field contains the name of the main related index.

Data types

 * string (without wikitext, as Wikibase)
 * monolingual strings (without wikitext, as Wikibase)
 * wiki links inside of the Wikisource (really needed?)
 * wikibase item
 * date (as Wikibase)
 * number (number without unit of Wikibase)
 * wikitext
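
As an illustration of what typed values could mean in practice, here is a rough validation sketch for the datatypes listed above. Every rule in it is a simplifying assumption: the wikitext check is a crude heuristic, dates are reduced to ISO calendar dates (Wikibase dates carry more information), and the `validate` function and the shape of a monolingual string are hypothetical.

```python
import re
from datetime import date
from numbers import Real

# Crude heuristic for "contains wikitext"; a real check would need a parser.
WIKITEXT_MARKERS = ("[[", "{{", "''")


def _is_plain(value):
    return isinstance(value, str) and not any(m in value for m in WIKITEXT_MARKERS)


def _is_iso_date(value):
    # Simplification: only plain ISO calendar dates are accepted here.
    if not isinstance(value, str):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


VALIDATORS = {
    "string": _is_plain,
    "monolingual string": lambda v: (
        isinstance(v, dict)
        and set(v) == {"text", "language"}
        and _is_plain(v["text"])
        and isinstance(v["language"], str)
    ),
    "wiki link": lambda v: isinstance(v, str) and v.strip() != "",
    "wikibase item": lambda v: isinstance(v, str) and re.fullmatch(r"Q[1-9][0-9]*", v) is not None,
    "date": _is_iso_date,
    "number": lambda v: isinstance(v, Real) and not isinstance(v, bool),
    "wikitext": lambda v: isinstance(v, str),
}


def validate(datatype, value):
    """Return True if value is acceptable for datatype; reject unknown datatypes."""
    if datatype not in VALIDATORS:
        raise ValueError(f"unknown datatype: {datatype}")
    return VALIDATORS[datatype](value)


print(validate("string", "Example title"), validate("string", "[[Example title]]"))
```

Keeping "string" and "wikitext" as separate types lets a consumer know in advance which values are safe to use without parsing.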