User:Tpt/RFC

I plan to implement in Extension:ProofreadPage a metadata storage system based on the SQL database, in order to provide an API and metadata in the header of each page of the Index and Main namespaces. But there are a lot of technical choices to make before beginning an implementation.

Abstract
Wikisource needs a powerful metadata system... This new system must be fully compatible with the old one in order to make migration easy.

Why not use Semantic MediaWiki or the upcoming Wikidata storage system?
Semantic MediaWiki provides a lot of advanced features, like a powerful search system, that need a lot of computing time, so it is impossible to use it on high-traffic websites like Wikisource. The Wikidata project wants to customize MediaWiki's text-based storage system in order to provide a powerful metadata system, but I think it won't be compatible with content-based wikis like Wikisource.

Storage
There will be a table (pf_data) in the database that stores the metadata of the Index pages and the metadata changed in the Main namespace using a parser function (see Ways of setting metadata). There will be another table that links pages of the Main namespace to the index that they use. If there is multi-index transclusion, the index chosen is the first with header=1.

Draft of SQL structure
The problem of this system is that the values aren't typed. The indexes aren't chosen yet.
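A possible sketch of the two tables described above (all table and column names besides pf_data are hypothetical, not from this document):

```sql
-- Hypothetical sketch: pf_data stores one metadata key/value pair per row.
-- Values are untyped strings, as noted above.
CREATE TABLE /*_*/pf_data (
  pfd_page_id INT UNSIGNED NOT NULL,  -- page the metadata belongs to
  pfd_key VARBINARY(255) NOT NULL,    -- metadata key, e.g. 'title'
  pfd_value BLOB NOT NULL             -- untyped value
) /*$wgDBTableOptions*/;

-- Hypothetical link table from Main-namespace pages to the index they use.
CREATE TABLE /*_*/pf_index_link (
  pfl_page_id INT UNSIGNED NOT NULL,  -- Main-namespace page
  pfl_index_id INT UNSIGNED NOT NULL  -- Index-namespace page it transcludes
) /*$wgDBTableOptions*/;
```

The /*_*/ and /*$wgDBTableOptions*/ markers follow the usual MediaWiki convention for table prefixes and engine options.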

Ways of setting metadata
The metadata of the Index namespace are taken from the form on publication. We don't change the current system of meta-templates.

For the Main namespace, we can create a new parser function in order to change metadata that are included from the Index namespace. This can be added directly in MediaWiki:Proofreadpage_header_template. New data are stored by the parser, as Extension:GeoData does. We can also imagine that this function shows the metadata.
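As a hypothetical illustration (the function name and syntax are not settled anywhere in this document), such a parser function could be used in a Main-namespace page like this:

```wikitext
{{#pfmetadata:
 | translator = John Doe
 | pubdate    = 1865
}}
```

Values set this way would override the ones inherited from the Index page for this page only.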

A base set of data
The extension will provide a base set of metadata (lang, title, publisher, pubdate...) in order to expose them through the API using standardised sets like Dublin Core. They won't be used directly in the wiki, but the mapping between them and the data in the wiki will be defined in the configuration page.
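A hypothetical example of such a mapping between Dublin Core properties and wiki fields, as it could appear in the configuration page (the field names on the right are illustrative):

```json
{
  "dc:title": "Title",
  "dc:language": "lang",
  "dc:publisher": "Publisher",
  "dc:date": "pubdate"
}
```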

Configuration
We will rewrite the configuration system and the form generator of the Index namespace in order to improve them and type the data. This must of course be fully compatible with the current system. Here is the structure of the configuration; all the parameters are optional, and the default values are given:
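A hypothetical sketch of what a typed configuration entry could look like (the field names, types and defaults are illustrative, not taken from this document):

```json
{
  "Title": {
    "type": "string",
    "label": "Title",
    "help": "Title of the work",
    "default": ""
  },
  "pubdate": {
    "type": "number",
    "label": "Year of publication",
    "default": ""
  }
}
```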

The transition from the old configuration page may be done when the update is performed (I don't know if that is possible), or the extension can use the old configuration page when the new one doesn't exist, in order to give small wikis time to migrate.

Interaction with other metadata formats
We can add Dublin Core tags in the header of HTML pages; it's a widely used, flexible and XHTML-compatible format. We will provide an OAI-PMH repository through the API. The system can be interconnected with Wikidata in order, for example, to link the author property to the author resource in Wikidata.
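For example, following the usual DCMI convention for embedding Dublin Core in HTML, the page header could contain something like this (the values are illustrative):

```html
<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" />
<meta name="DC.title" content="War and Peace" />
<meta name="DC.language" content="en" />
<meta name="DC.publisher" content="The Russian Messenger" />
```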

Impact on servers
If we just provide a metadata storage system without a search system and statistics, as described here, the impact will be very small: the only operations will be getting/setting a few rows in the database corresponding to one or a few pages. There will not be complex actions like the search of Semantic MediaWiki.

Draft of roadmap

 * April-July : Discussions with Wikisource contributors and developers in order to validate the project.
 * July-August : Rewriting of the form creator of the Index namespace. This new code must be fully compatible with both the new and the old configuration systems.
 * August-* : Writing of the metadata storage system itself and of an initial API.
 * Later : Better API, maybe a search engine...

Another possible storage system
The metadata will be stored in a JSON array for each page in the Main or Index namespace. In the Main namespace, the array contains only data that are set in this page; the others are linked via the pf_index_id field, so that all the pages don't have to be refreshed when the index is updated. If there is multi-index transclusion, the index chosen is the first with header=1.

This array of data looks like:
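A hypothetical example of such an array for a Main-namespace page (the keys besides pf_index_id are illustrative; pf_index_id points to the Index page providing the inherited values):

```json
{
  "pf_index_id": 1234,
  "translator": "John Doe",
  "pubdate": "1865"
}
```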