Files and licenses concept

From mediawiki.org

This page is a collection of technical information regarding storing certain file properties separately from wikitext. Currently the aim is to store only basic copyright information (author and license), but this could later be extended.

Current situation / Introduction

Every page has an entry in the mw_page table, primarily uniquely identified by page_id. Every file has an entry in the mw_image table, identified by img_name.

Every revision of every page has an entry in the mw_revision table. A revision is made when the page is created, edited, renamed, or protected.

Every text version of every page has an entry in the mw_text table. If a revision didn't change the text, it keeps referring to the same mw_text row.

Licenses allowed to be chosen during upload are defined at MediaWiki:Licenses. That page is kept as a list: the last piped part of each entry describes the license, and everything before it is wrapped in {{ and }} and placed on the File-page under a == {{subst:MediaWiki:License-header}} == heading.

Information about the file is stored in the {{Information}}-template.

Viewing a file:

  • Page namespace/title is looked up in mw_page and mw_image.
  • Information from other tables is retrieved by the page ID (mw_page.page_id)

Purpose and use cases

It should be easy to obtain the copyright information for a file (example: bug 25624). Use cases include:

  • A data consumer (like the pdf creation tool, or a mobile app) wants to transform a page into a new format, and get all the information (esp. author(s) and license(s)) that is necessary to do proper attribution in the new format.
  • Someone importing CC BY text from another source wants to ensure that the authors, source, and license of the original work are stored in a consistent way, so that the license can be complied with later in a standardized way.

Why this is such a good idea is summarized by the following:

  • Getting information from the API for re-use (e.g. a WordPress plugin to search images and attribute names automatically in the article)
  • Standardizing file-pages centrally (either by core or by mediawiki-message but not per-page with templates)
  • Perhaps for search engines to index files properly by using xmlns:cc-attributes or <meta copyright> tags that currently are based on the general license for the wiki text instead of the file.
  • Special File-search with ability to filter by licenses (just like ns0=&ns1&, lic1&lic2; Individual wikis could edit messages (1/2) and put links to search through certain sets of licenses)
  • Automatically attributing authors (in case of CC-BY-*) in articles and perhaps mention the license (in case of CC-SA-*)
  • Name author in search (See Flickr)

The following (sub)requirements can be identified:

  1. Author and licensing information should be stored outside the wikitext, since parsing wikitext is often difficult for third-party data consumers (like the pdf creator)
  2. A list of files by copyright holder should be obtainable
  3. Copyright information should be provided by the user at upload
  4. Copyright information should be editable
    1. Copyright information should be versioned
  5. Administrators should be able to define a set of canonical licenses
    1. For each license basic information should be defined (name, full title, url to legal code/deed)
    2. The display of a license should be localizable

Proposed situation

Every page has an entry in the mw_page table, primarily uniquely identified by page_id.

Every file has an entry in the mw_image table, identified by img_name. Special properties about this file are stored in mw_file_props.

Every revision of every page has an entry in the mw_revision table. A revision is made when the page is created, edited, renamed or protected, or when its file properties change. Every revision contains a reference to a file properties version, which may be NULL.

Every text version of every page has an entry in the mw_text table. If a revision didn't change the text, it keeps referring to the same mw_text row.

For every file-properties version of a file there are one or more entries in the mw_file_props table. If a revision didn't change the properties, it keeps referring to the same mw_file_props row.

Licenses valid on the wiki are defined in the mw_license table, which is managed from [[Special:LicenseManager]]. Since there could potentially be many licenses, the ones choosable from the upload form are fetched from that table. The <select> would contain an <optgroup> for "Most used licenses" (top 5 or 10, ordered by lic_count) and an <optgroup> for all licenses in alphabetical order.

Links to information about the file (such as author and license) are stored in the mw_file_props table and displayed on the File-page through a centrally determined layout. Description, source, date, location and additional wikitext (like User-templates and categories) are stored in wikitext; only author and license are stored separately.

How to get there

Licenses

Table structure

Licenses have their own table (say mw_license), with columns like:

	lic_id		PRI UNIQ AI,
	lic_name	VARBINARY 255
	lic_url		VARBINARY 255
	lic_count	INT
  • The text of each license is stored in [[MediaWiki:License-NAME-text]], which contains wikitext (where NAME is mw_license.lic_name).
    When used on the File-page, the following parameters are passed:
    • $1: author (mw_file_props.fp_author)
    • $2: attribution (mw_file_props.fp_attribution; if NULL, same as author)
    • $3: title ({{int:License-NAME-title}})
  • The title of each license is stored in [[MediaWiki:License-NAME-title]], which is plain text.
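
The column list above could be sketched as an SQLite schema, written here in Python purely for illustration. The concrete types, the NOT NULL/UNIQUE constraints and the reading of lic_count as a usage counter are assumptions; the proposal only names the columns, and SQLite stands in for MediaWiki's real database layer.

```python
import sqlite3

# Hypothetical DDL: only the column names come from the proposal.
LICENSE_SCHEMA = """
CREATE TABLE mw_license (
    lic_id    INTEGER PRIMARY KEY AUTOINCREMENT,  -- "PRI UNIQ AI"
    lic_name  TEXT NOT NULL UNIQUE,               -- e.g. 'CC-BY-SA-3.0' (UNIQUE is an assumption)
    lic_url   TEXT NOT NULL,                      -- link to the legal code/deed
    lic_count INTEGER NOT NULL DEFAULT 0          -- assumed: number of files using the license
);
"""


def create_license_table(conn):
    """Create the sketched mw_license table on the given connection."""
    conn.executescript(LICENSE_SCHEMA)


def add_license(conn, name, url):
    """Insert one license and return its auto-assigned lic_id."""
    cur = conn.execute(
        "INSERT INTO mw_license (lic_name, lic_url) VALUES (?, ?)", (name, url)
    )
    return cur.lastrowid


conn = sqlite3.connect(":memory:")
create_license_table(conn)
tasl_id = add_license(conn, "TASL", "http://tasl.org/licensedeed.html")
```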

Example:

  # Database-entry
  lic_name	TASL
  lic_url	http://tasl.org/licensedeed.html

  # Message
  [[MediaWiki:License-TASL-text]]
	This file by $1 is licensed under $3. Please attribute the author as:<br />''$2''
  [[MediaWiki:License-TASL-title]]
	The Awesome Something License
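
A minimal sketch of how the $1 to $3 parameters might be filled in for the TASL example, including the rule that attribution falls back to the author when it is NULL. The function name render_license_text is hypothetical, and real MediaWiki message handling does considerably more (wikitext parsing, escaping) than this plain substitution.

```python
import re


def render_license_text(template, author, title, attribution=None):
    """Substitute $1 (author), $2 (attribution) and $3 (title) in a license message."""
    params = {
        "1": author,
        # Attribution defaults to the author when it is not set (NULL).
        "2": attribution if attribution is not None else author,
        "3": title,
    }
    # A single pass, so a value that happens to contain "$n" is not expanded again.
    return re.sub(r"\$([1-3])", lambda m: params[m.group(1)], template)


TASL_TEXT = (
    "This file by $1 is licensed under $3. "
    "Please attribute the author as:<br />''$2''"
)

rendered = render_license_text(
    TASL_TEXT, author="John Doe", title="The Awesome Something License"
)
```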

Example:

  # Database-entry
  lic_name	CC-BY-SA-3.0
  lic_url	http://creativecommons.org/licenses/by-sa/3.0/legalcode

  # Message
  [[MediaWiki:License-CC-BY-SA-3.0-text]]
	{{Cc-by-sa-3.0|attribution=$2}}
  [[MediaWiki:License-CC-BY-SA-3.0-title]]
	Creative Commons Attribution Share-Alike 3.0 License
   [[MediaWiki:License-CC-BY-SA-3.0-url]]
	http://creativecommons.org/licenses/by-sa/3.0/deed.en
 
 // The reason for keeping these in messages, separate from the database, is to allow easier translation. Example for Dutch:
   [[MediaWiki:License-CC-BY-SA-3.0-title/nl]]
	Creative Commons Naamsvermelding Gelijk-Delen 3.0 licentie
   [[MediaWiki:License-CC-BY-SA-3.0-url/nl]]
	http://creativecommons.org/licenses/by-sa/3.0/deed.nl
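
The translation idea can be illustrated with a small lookup that prefers the language subpage and otherwise uses the base message. This is a simplified sketch: the MESSAGES dictionary and the license_message helper are hypothetical, it assumes the untranslated message lives at the key without a language suffix, and MediaWiki's actual language fallback chains are more elaborate.

```python
# Stand-in for the MediaWiki: namespace, using the messages from the example above.
MESSAGES = {
    "License-CC-BY-SA-3.0-title": "Creative Commons Attribution Share-Alike 3.0 License",
    "License-CC-BY-SA-3.0-title/nl": "Creative Commons Naamsvermelding Gelijk-Delen 3.0 licentie",
    "License-CC-BY-SA-3.0-url": "http://creativecommons.org/licenses/by-sa/3.0/deed.en",
    "License-CC-BY-SA-3.0-url/nl": "http://creativecommons.org/licenses/by-sa/3.0/deed.nl",
}


def license_message(lic_name, kind, lang=None, messages=MESSAGES):
    """Return the message for a license, preferring the /lang subpage if it exists."""
    base = f"License-{lic_name}-{kind}"
    if lang is not None:
        localized = messages.get(f"{base}/{lang}")
        if localized is not None:
            return localized
    # Fall back to the untranslated message (None if that is missing too).
    return messages.get(base)
```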

License management

[[Special:LicenseManager]]
  • Lists all licenses (may be viewed by anyone, [*]). Editing is done on pages like [[Special:LicenseManager/12]] (by id, like AbuseFilter).
  • The actual texts are stored in MediaWiki:-messages, so they could contain a template to allow editing by non-sysops. Editing is limited to users with the licensemanager-modify right.
$wgGroupPermissions['*']['licensemanager-modify'] = false;
$wgGroupPermissions['sysop']['licensemanager-modify'] = true;
  • Changes, creations and removals of licenses are publicly logged at Special:Log/licensemanager.
  • Removal is only possible if the license is not in use. If a file that previously used a removed license is reverted to a state where it uses that license again, it would display {{int:License-notfound}} and be categorized internally into a category like Category:Files with previously deleted licenses.
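
The removal rules above might look roughly like this in code. Everything here is a hypothetical sketch: the function names, the in-memory dictionaries standing in for mw_license and its usage counts, and the shape of the log entries are assumptions, not an existing MediaWiki API.

```python
MODIFY_RIGHT = "licensemanager-modify"


def can_remove_license(lic_id, usage_counts, user_rights):
    """True only if the user holds the right and no file currently uses the license."""
    if MODIFY_RIGHT not in user_rights:
        return False
    return usage_counts.get(lic_id, 0) == 0


def remove_license(lic_id, licenses, usage_counts, user_rights, log):
    """Remove a license if allowed, recording the removal in the log list."""
    if lic_id not in licenses:
        return False
    if not can_remove_license(lic_id, usage_counts, user_rights):
        return False
    name = licenses.pop(lic_id)
    log.append(("licensemanager", "remove", name))
    return True


def license_label(lic_id, licenses):
    """Label for a file revision; old revisions may point at a removed license."""
    return licenses.get(lic_id, "{{int:License-notfound}}")
```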

Upload

Drop-down menu

During upload a license must be chosen from the drop-down menu. The drop-down menu is populated from the license table. The <select> would contain an <optgroup> for "Most used licenses" (top 5 or 10, ordered by lic_count) and an <optgroup> for all licenses in alphabetical order. Licenses that are marked as deleted are not shown and can't be used.

  <select>
    <optgroup label="Most used licenses">
      <!-- Option text is the content of {{int:License-CC-BY-SA-3.0-title}}; value is the lic_id -->
      <option value="4">Creative Commons Attribution Share-Alike 3.0 License</option>
      <option value="2">GNU Free Documentation License (Version 1.2 or later)</option>
    </optgroup>
    <optgroup label="All licenses alphabetically">
      <option value="1">Creative Commons Attribution 3.0 License</option>
      <option value="4">Creative Commons Attribution Share-Alike 3.0 License</option>
      <option value="2">GNU Free Documentation License (Version 1.2 or later)</option>
      <option value="12">The Awesome Something License</option>
    </optgroup>
  </select>
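
One way the two option groups could be generated from license rows, as a rough sketch. The dictionary keys (lic_id, title, lic_count, deleted) and the function name build_license_select are assumptions made for this example; in particular the proposal does not say how a "deleted" mark is stored.

```python
from html import escape


def build_license_select(licenses, top_n=5):
    """Build the <select> markup: most-used licenses first, then all alphabetically."""
    # Licenses marked as deleted are not offered at all.
    usable = [lic for lic in licenses if not lic.get("deleted")]
    most_used = sorted(usable, key=lambda lic: lic["lic_count"], reverse=True)[:top_n]
    alphabetical = sorted(usable, key=lambda lic: lic["title"].lower())

    def options(rows):
        return "".join(
            f'<option value="{row["lic_id"]}">{escape(row["title"])}</option>'
            for row in rows
        )

    return (
        "<select>"
        f'<optgroup label="Most used licenses">{options(most_used)}</optgroup>'
        f'<optgroup label="All licenses alphabetically">{options(alphabetical)}</optgroup>'
        "</select>"
    )
```

Sorting in application code keeps the sketch short; with a real table the same ordering would more likely come from two queries (ORDER BY lic_count DESC with a LIMIT, and ORDER BY title).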

File properties

Table structure

Metadata about the file itself is still kept in the mw_image table. Page content information is still kept in mw_page and mw_revision.

Information about the file as a work is kept as properties in the new mw_file_props table.

The mw_file_props is similar to the mw_text table in that it is kept per revision and only updated when needed.

A reference to the current file props is kept in the appropriate mw_revision rows, just like the reference to mw_text. When either one doesn't change, the existing reference is kept, so no duplicate sets are created in mw_file_props.

mw_file_props contains:
	fp_id		INT (mw_revision.rev_fileprops_id is a key to this column)
	fp_key		VARBINARY(255)
	fp_value_int	INT
	fp_value_text	VARBINARY(255)

EXAMPLE

	fp_id	fp_key	fp_value_int	fp_value_text
	1	author	50 (mw_user.user_id of User:Krinkle)	NULL (if empty, the username is used)
	1	author	43 (mw_user.user_id of User:Catrope)	Roan (wiki user who wants a display name different from the username)
	1	author	NULL	John Doe
	1	license	2 (mw_license.lic_id of CC-BY-SA-3.0)
	1	license	5 (mw_license.lic_id of GFDL)

This file has three authors: Krinkle, Catrope (attributed as Roan) and John Doe (not a wiki user). It is dual-licensed.
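
The example rows can be loaded into an in-memory SQLite table to show how authors and licenses would be read back. This is a sketch under assumptions: the column types, the NULL in fp_value_text for license rows, and the USERNAMES dictionary standing in for a join against mw_user are all invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE mw_file_props (
        fp_id         INTEGER NOT NULL,  -- shared by all rows of one properties version
        fp_key        TEXT NOT NULL,     -- 'author' or 'license'
        fp_value_int  INTEGER,           -- user_id or lic_id, may be NULL
        fp_value_text TEXT               -- display name, may be NULL
    )
    """
)
conn.executemany(
    "INSERT INTO mw_file_props VALUES (?, ?, ?, ?)",
    [
        (1, "author", 50, None),
        (1, "author", 43, "Roan"),
        (1, "author", None, "John Doe"),
        (1, "license", 2, None),
        (1, "license", 5, None),
    ],
)

USERNAMES = {50: "Krinkle", 43: "Catrope"}  # stand-in for mw_user


def authors_for(conn, fp_id):
    """Display names of all authors: the explicit text if set, else the username."""
    cur = conn.execute(
        "SELECT fp_value_int, fp_value_text FROM mw_file_props "
        "WHERE fp_id = ? AND fp_key = 'author' ORDER BY rowid",
        (fp_id,),
    )
    return [text if text is not None else USERNAMES[user_id] for user_id, text in cur]


def license_ids_for(conn, fp_id):
    """All lic_id values attached to one properties version."""
    cur = conn.execute(
        "SELECT fp_value_int FROM mw_file_props "
        "WHERE fp_id = ? AND fp_key = 'license' ORDER BY rowid",
        (fp_id,),
    )
    return [row[0] for row in cur]
```

The key/value layout makes multiple authors and multiple licenses plain extra rows, at the cost of needing the fp_key filter in every query.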

Management

An example of what the Wikitext of a File-page could/would look like:

{{#file-descr:
  {{en|Chiang Kai-shek Memorial Hall's gate at night in [[:en:Taipei|Taipei]].}}
  {{fr|Porte du Chiang Kai-shek Memorial Hall de nuit à [[:fr:Taipei|Taipei]].}}
}}
{{#file-date|2011-01-14}}
{{#file-source|{{Own}}}}
{{user:guillom/photos}}
[[Category:National Taiwan Democracy Memorial Hall]]
[[Category:Gate of Great Centrality and Perfect Uprightness (Taipei)]]
[[Category:MediaWiki Projects]]

The following elements:

  • Author (small textarea)
  • Date (date picker; should eventually contain a 14-digit timestamp.
    The jQuery datepicker can prevent other characters from being entered;
    a server-side check that this is a valid timestamp is also needed)
  • License (dropdown plus a button to add/remove a license; multiple licenses are allowed)
  • Attribution (small textarea)

... are kept outside of wikitext in their own respective fields (fetched from and saved to mw_file_props). These separate input fields could be editable in two ways:

  • Either on [[Special:FileProperties/File:Example.jpg]]
  • Or in additional form elements above or below the textbox on the action=edit page

When saving properties, a revision is saved (like when moving or protecting the page) with the same rev_text_id but with a new reference in rev_fileprops_id to the added row in mw_file_props. Likewise, when saving altered wikitext, a revision is saved with the same rev_fileprops_id but with a new reference in rev_text_id to the added row in mw_text.
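
The two save paths can be modelled with a small in-memory revision history. This is only a sketch: the Revision class and the save_properties/save_text helpers are hypothetical, and the file-props reference is called rev_fileprops_id here, following the table definition above.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Revision:
    rev_id: int
    rev_text_id: int
    rev_fileprops_id: Optional[int]  # None models SQL NULL


def save_properties(history: List[Revision], new_fileprops_id: int) -> Revision:
    """New revision: same text reference, new file-props reference."""
    last = history[-1]
    rev = replace(last, rev_id=last.rev_id + 1, rev_fileprops_id=new_fileprops_id)
    history.append(rev)
    return rev


def save_text(history: List[Revision], new_text_id: int) -> Revision:
    """New revision: same file-props reference, new text reference."""
    last = history[-1]
    rev = replace(last, rev_id=last.rev_id + 1, rev_text_id=new_text_id)
    history.append(rev)
    return rev


history = [Revision(rev_id=1, rev_text_id=100, rev_fileprops_id=1)]
save_properties(history, new_fileprops_id=2)  # license changed, text untouched
save_text(history, new_text_id=101)           # description edited, props untouched
```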


Idea: A user setting in the preferences decides whether the user sees his own language's description (if available - like {{LangSwitch}}) or all descriptions.

Display

When viewing a file-page the page is built like:

  • #filetoc
    • ...
  • #file
    • ...
  • #fileinformation
    • Automated page generated like this.
    • Layout is fixed (perhaps allow editing the layout in a MediaWiki-message, or don't and instead provide sufficient CSS-hooks).

[[MediaWiki:Fileinformation-template]] receives the following parameters:

  • $1: description (tag-hook)
  • $2: source (tag-hook)
  • $3: author (file props)
  • $4: license titles (file props)
  • $5: additional wikitext: all other page text (such as problem tags, user templates etc.) that wasn't filtered (for example, category and interwiki links are filtered from output)

==={{int:fileinformation-header}}===
__EMBEDMETADATA__
<div class=toccolours><legend>{{int:fileinformation-description}}</legend>
$1
</div>
<div class=toccolours><legend>{{int:fileinformation-datelocation}}</legend>
...
</div>
<div class=toccolours><legend>{{int:fileinformation-copyrightlicensing}}</legend>
{{int:fileinformation-author}}
: $3
{{int:fileinformation-copyrightstatus}}
: $4
{{int:fileinformation-source}}
: $2
</div>
==={{int:fileinformation-similarmedia}}===
<div class=toccolours><legend>{{int:fileinformation-findmore}}</legend>
...
</div>
==={{int:fileinformation-additional}}===
$5

API

imageinfo

'prop=imageinfo' result format needs to be extended to include this metadata.

  • format of results?
  • iiprop key for including/excluding it?

This is a requirement for getting the metadata to pass cleanly to InstantCommons (ForeignAPIRepo) clients.

upload

'action=upload' needs to be able to pass metadata with a new upload, just as a user uploading directly to the web site needs to be able to add license info on the web UI.

  • parameter names?
  • parameter format?
    • what's the best way to pass sets of arbitrary data like this to an api thingy? array should work?

In order for an API client upload tool to present available license options, it also needs to be able to query:

Query license options

In order to add license metadata to a new or modified upload, an upload client will need to be able to query the available settings. This same interface could/should probably also be used in things like the UploadWizard that supplement Special:Upload with more client-side ajaxy stuff.

  • module name?
  • parameters?
  • result format?

Editing

In addition to initial uploads, license metadata on an existing file may need to be altered. Changed license metadata needs to be passable either to the existing page editing method, or a dedicated one for file metadata.

  • existing or new module? module name?
  • parameter format?
    • what's the best way to pass sets of arbitrary data like this to an api thingy? array should work?

Export/import and dumps

If license metadata lives outside the page text, it may also need to be added to the Special:Export & data dump format.

  • data structure
  • add to special:export
  • make sure dump tools handle it without exploding

Hm...

Transition

Some kind of detection is required to fall back to the old way (i.e. don't render the new file-page layout, but render the page as a normal wiki page). The easiest way to detect whether a page has been converted from the old wikitext (e.g. {{Information}}) to the new system is to check mw_revision.rev_fileprops_id: if it is NULL (i.e. the file page has no entry in mw_file_props), it is an old-style file. In that case, we don't generate the File-page layout, but just parse the wikitext the good ol' way and display it on the page.

Sounds good afaik. Krinkle 17:00, 22 January 2011 (UTC)
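
The NULL check described under Transition could be sketched as follows. The function names and the two renderer callbacks are hypothetical placeholders; the only part taken from the proposal is that a NULL rev_fileprops_id means an old-style file page.

```python
from typing import Callable, Optional


def is_legacy_file_page(rev_fileprops_id: Optional[int]) -> bool:
    """Old-style pages have no file-props reference (NULL, modelled as None)."""
    return rev_fileprops_id is None


def render_file_page(
    rev_fileprops_id: Optional[int],
    wikitext: str,
    parse_wikitext: Callable[[str], str],
    render_new_layout: Callable[[int, str], str],
) -> str:
    """Pick the rendering path based on whether the page has been converted."""
    if is_legacy_file_page(rev_fileprops_id):
        # Not converted yet: parse the wikitext as a normal wiki page.
        return parse_wikitext(wikitext)
    return render_new_layout(rev_fileprops_id, wikitext)
```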