User:Tpt/RFC

I plan to implement a metadata storage system in Extension:ProofreadPage, based on the SQL database, in order to provide an API and to put metadata in the header of each page of the Index and Main namespaces. However, many technical choices have to be made before an implementation can begin.

Abstract
Wikisource needs a powerful metadata system... The new system must be fully compatible with the old one so that migration is easy.

Why not use Semantic MediaWiki or the upcoming Wikidata storage system?
Semantic MediaWiki provides many advanced features, such as a powerful search system, that need a lot of computing time, so it is impossible to use it on high-traffic websites like Wikisource. The Wikidata project wants to customize MediaWiki's text-based storage system in order to provide a powerful metadata system, but I think it won't be compatible with content-based wikis like Wikisource.

Storage
There will be a table (pf_data) in the database that stores the metadata of the index pages and the metadata changed in the main namespace using a parser function (see Ways to set metadata). Another table will link pages of the main namespace to the index that they use. If there is multi-index transclusion, the index chosen is the first one with header=1.
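The selection rule for multi-index transclusion can be sketched as follows. This is a minimal illustration in Python, not extension code: the Transclusion type and its field names are hypothetical, and only the header=1 rule comes from the text above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Transclusion:
    """One index transcluded on a main-namespace page (hypothetical shape)."""
    index_title: str
    header: bool  # True when the transclusion is written with header=1


def choose_index(transclusions: Sequence[Transclusion]) -> Optional[str]:
    """Return the title of the first index transcluded with header=1, or None."""
    for t in transclusions:
        if t.header:
            return t.index_title
    return None


page = [
    Transclusion("Index:A.djvu", header=False),
    Transclusion("Index:B.djvu", header=True),
    Transclusion("Index:C.djvu", header=True),
]
print(choose_index(page))  # Index:B.djvu
```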

Draft of sql structure
The problem with this system is that the values aren't typed. The indexes have not been chosen yet.
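A rough idea of such a structure, written with Python's built-in sqlite3 module so that it can be run as is. Only the table name pf_data comes from this page; the link table name pf_index_link and every column name are assumptions made for illustration. Values are stored as plain text, which is the untyped-value limitation noted above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- One row per (page, key). Every name except pf_data is assumed.
    CREATE TABLE pf_data (
        pfd_page  INTEGER NOT NULL,  -- id of the Index or main-namespace page
        pfd_key   TEXT    NOT NULL,  -- metadata name, e.g. 'title'
        pfd_value TEXT    NOT NULL,  -- untyped value
        PRIMARY KEY (pfd_page, pfd_key)
    );
    -- Link from a main-namespace page to the index it uses (hypothetical).
    CREATE TABLE pf_index_link (
        pfl_page  INTEGER PRIMARY KEY,
        pfl_index INTEGER NOT NULL
    );
    """
)
conn.executemany(
    "INSERT INTO pf_data VALUES (?, ?, ?)",
    [
        (10, "title", "Friends"),              # set on the index page
        (10, "publisher", "WMF"),
        (20, "title", "Friends, chapter 1"),   # overridden on the main page
    ],
)
conn.execute("INSERT INTO pf_index_link VALUES (20, 10)")


def get_metadata(conn, page_id):
    """Return the index metadata, overridden by values set on the page itself."""
    link = conn.execute(
        "SELECT pfl_index FROM pf_index_link WHERE pfl_page = ?", (page_id,)
    ).fetchone()
    data = {}
    # Read the linked index first (if any), then the page, so the page wins.
    for pid in ([link[0]] if link else []) + [page_id]:
        rows = conn.execute(
            "SELECT pfd_key, pfd_value FROM pf_data WHERE pfd_page = ?", (pid,)
        ).fetchall()
        data.update(rows)
    return data


print(get_metadata(conn, 20))
```

Reading a page touches only a few rows of two small tables, and updating an index does not require rewriting the rows of the pages that use it.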

Ways to set metadata
The metadata of the Index namespace is taken from the form when the page is published. We don't change the current metatemplate system.

For the main namespace, we can create a new parser function to change metadata that is included from the Index namespace. It can be added directly in MediaWiki:Proofreadpage_header_template. New data is stored by the parser, as Extension:Geodata does. We can also imagine a parser function that shows the metadata.

A base set of data
The extension will provide a base set of metadata (lang, title, publisher, pubdate...) in order to expose them through the API using standardised sets like Dublin Core. They won't be used directly in the wiki, but the relation between them and the data in the wiki will be defined on the configuration page.
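To make this relation concrete, here is a small Python sketch of a mapping between local field names and the base set. The local names are taken from the Polish example at the end of this page; the mapping itself and the function name are assumptions, since the configuration format is not settled.

```python
# Hypothetical configuration: local Index field name -> base metadata name.
FIELD_TO_BASE = {
    "Tytuł": "title",
    "Wydawca": "publisher",
    "Rok": "pubdate",
}


def to_base_set(index_fields):
    """Rename fields that have a base-set equivalent; drop the rest and empty values."""
    return {
        FIELD_TO_BASE[name]: value
        for name, value in index_fields.items()
        if name in FIELD_TO_BASE and value
    }


fields = {"Tytuł": "Friends", "Wydawca": "WMF", "Rok": "1998", "Uwagi": ""}
print(to_base_set(fields))
# {'title': 'Friends', 'publisher': 'WMF', 'pubdate': '1998'}
```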

Configuration
We will rewrite the configuration system and the form generator of the Index namespace in order to improve them and to type the data. This must of course be fully compatible with the current system. Here is the structure of the configuration; all parameters are optional and have default values:

The transition from the old configuration page could happen when the extension is updated (I don't know if that is possible), or the extension could use the old configuration page when the new one doesn't exist, in order to give small wikis time.

Interaction with other metadata format
We can add Dublin Core tags in the header of HTML pages; it is a widely used, flexible and XHTML-compatible format. We will provide an OAI-PMH repository through the API. The system can be interconnected with Wikidata in order, for example, to link the author property to the author resource in Wikidata.
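The Dublin Core tags in the page header could be produced along these lines. This is a sketch in Python, not the extension's code: the pairing of base names with DC elements and the function name are assumptions, while the "DC." prefix with a schema.DC link follows the usual convention for embedding Dublin Core in HTML.

```python
from html import escape

# Base metadata name -> Dublin Core element (this pairing is an assumption).
BASE_TO_DC = {
    "title": "DC.title",
    "publisher": "DC.publisher",
    "pubdate": "DC.date",
    "lang": "DC.language",
}


def dublin_core_tags(metadata):
    """Return the <link>/<meta> lines to place inside <head>."""
    lines = ['<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" />']
    for name, dc_name in BASE_TO_DC.items():
        value = metadata.get(name)
        if value:
            # Escape the value so quotes and ampersands cannot break the attribute.
            lines.append(f'<meta name="{dc_name}" content="{escape(value)}" />')
    return lines


for line in dublin_core_tags({"title": "Friends", "publisher": "WMF", "pubdate": "1998"}):
    print(line)
```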

Impact on servers
If we just provide a metadata storage system without a search system or statistics, as described here, the impact will be very small: the only operations will be getting or setting a few rows in the database, corresponding to one or a few pages. There will be no complex actions like the searches of Semantic MediaWiki.

Draft of roadmap

 * April-July : Discussions with Wikisource contributors and developers in order to validate the project.
 * July-August : Rewriting of the form creator of the index namespace. This new code must be fully compatible with the new and the old configuration system.
 * August-* : Writing of the metadata storage system itself and of a first version of the API.
 * Later : Better API, maybe a search engine...

Another possible storage system
The metadata will be stored in a JSON array for each page in the main or index namespace. In the main namespace, the array contains only the data set on that page; the rest is linked through the pf_index_id field, so that not all pages have to be refreshed when the index is updated. If there is multi-index transclusion, the index is the first one with header=1.

This array of data looks like:
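As an illustration only, resolving such data could be sketched in Python as below. The keys and values are hypothetical (only pf_index_id is named in this proposal), and a JSON object is used because the data is a set of key/value pairs.

```python
import json

# Hypothetical stored blobs; only the pf_index_id key is named in the proposal.
index_json = '{"title": "Friends", "publisher": "WMF", "pubdate": "1998"}'
page_json = '{"pf_index_id": 1, "title": "Friends, chapter 1"}'

indexes = {1: index_json}  # stand-in for a lookup of index blobs by page id


def resolve(page_blob):
    """Merge the page's own values over those of the index it points to."""
    page = json.loads(page_blob)
    index_id = page.pop("pf_index_id", None)
    merged = json.loads(indexes[index_id]) if index_id in indexes else {}
    merged.update(page)  # values set on the page take precedence
    return merged


print(resolve(page_json))
# {'title': 'Friends, chapter 1', 'publisher': 'WMF', 'pubdate': '1998'}
```

Because the page blob keeps only its own values, editing the index changes what resolve() returns without touching the stored page data.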

Using page_props table
MediaWiki can store page properties in the page_props table; maybe this table can be used to store index metadata? A single magic word taking an index title and a property name could be provided to fetch the metadata.

Pros:
 * No need to create a new storage mechanism.
 * Properties are available through the API, no need to create a separate interface (action=query&titles=Index:Title&prop=pageprops).
 * No need to change current template system. We can use the MediaWiki parser to extract template arguments and put them into the database.

Issues to solve:
 * Those properties are inserted into the database by LinksUpdate, which takes a ParserOutput as input. We need to find a hook that can be used to set the properties; the current proof-of-concept solution uses ParserBeforeStrip.
 * Field name consistency. It would be nice to have consistent naming of fields for different language versions.
 * Property value is not indexed, so it cannot be used for searching.

Other:
 * A maintenance script needs to be prepared to rebuild all indexes, so that their metadata is stored in the database.
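The API access listed in the pros can be sketched in Python as follows. The query parameters are the ones quoted above; the endpoint host is a placeholder, the helper names are hypothetical, and the sample response is hand-written from the values of the example on this page. It assumes the default JSON layout, in which "pages" is keyed by page id.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://wikisource.example/w/api.php"  # placeholder host


def pageprops_url(index_title):
    """Build the action=query URL that returns the page properties of an index."""
    params = {
        "action": "query",
        "titles": index_title,
        "prop": "pageprops",
        "format": "json",
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


def extract_props(response, prefix="proofread-"):
    """Collect the prefixed page properties from a decoded query response."""
    props = {}
    for page in response.get("query", {}).get("pages", {}).values():
        for name, value in page.get("pageprops", {}).items():
            if name.startswith(prefix):
                props[name[len(prefix):]] = value
    return props


# Hand-written sample, for illustration only.
sample = {
    "query": {
        "pages": {
            "1": {
                "pageid": 1,
                "title": "Indeks:1",
                "pageprops": {"proofread-Autor": "Jan Kowalski", "proofread-Rok": "1998"},
            }
        }
    }
}
print(pageprops_url("Index:Title"))
print(extract_props(sample))
```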

Example page: Indeks:1

Example api output:

Output from mysql:

 mysql> SELECT * FROM page_props where pp_page = 1;
 +---------+----------------------+--------------+
 | pp_page | pp_propname          | pp_value     |
 +---------+----------------------+--------------+
 |       1 | proofread-Autor      | Jan Kowalski |
 |       1 | proofread-Ilustracja |              |
 |       1 | proofread-Rok        | 1998         |
 |       1 | proofread-Strony     | 1 2 3        |
 |       1 | proofread-Tytuł      | Friends      |
 |       1 | proofread-Uwagi      |              |
 |       1 | proofread-Wydawca    | WMF          |
 |       1 | proofread-Źródło     | Brak         |
 +---------+----------------------+--------------+
 8 rows in set (0.00 sec)

Example usage (1 is the title of the index):
 * -> Jan Kowalski
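The lookup that such a magic word would perform can be sketched in Python against a page_props-like table. The column names and the proofread- prefix come from the MySQL output above; the function name, the title-to-page-id dictionary and the use of SQLite are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page_props (pp_page INTEGER, pp_propname TEXT, pp_value TEXT)"
)
conn.executemany(
    "INSERT INTO page_props VALUES (?, ?, ?)",
    [
        (1, "proofread-Autor", "Jan Kowalski"),
        (1, "proofread-Rok", "1998"),
        (1, "proofread-Tytuł", "Friends"),
    ],
)

# Stand-in for resolving an index title to its page id (hypothetical).
PAGE_IDS = {"1": 1}


def index_property(index_title, prop):
    """Return the stored value, or an empty string when nothing is found."""
    page_id = PAGE_IDS.get(index_title)
    row = conn.execute(
        "SELECT pp_value FROM page_props WHERE pp_page = ? AND pp_propname = ?",
        (page_id, "proofread-" + prop),
    ).fetchone()
    return row[0] if row else ""


print(index_property("1", "Autor"))  # Jan Kowalski
```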