Reading/Multimedia/Structured Data

Structured Data is a new project that aims to store and retrieve information about images and media files on Wikimedia Commons and other sites as structured, machine-readable data. It is likely to be implemented with Wikidata technology, namely Wikibase and related extensions.

The Multimedia team and the Wikidata team are planning to develop this project together in 2014 and 2015, in collaboration with the volunteer communities on Wikimedia Commons and other projects impacted by this initiative.

Goals
The goals of this project are to:


 * use structured data for all media files on our sites
 * make it easier for users to read / write file info
 * enable developers to build better tools

We aim to support the use of machine-readable data to store and retrieve information for images and media files on Wikimedia Commons and other sites, so they are easier to view, search, edit, curate and use. To do that, we plan to migrate the unstructured data now on Commons into a structured format and container.

In the coming months, we propose to:
 * Plan this project and discuss it with our community
 * Design the data structure and user interface
 * Develop the code and tools needed for this project
 * Migrate unstructured data to the new format
 * Measure and adjust as needed

Groups
Target users for Structured Data include these user groups:
 * readers
 * contributors
 * curators
 * editors
 * developers

This project will impact all key stakeholders that use media files on Commons and other Wikimedia sites. We aim to support these user groups evenly, with an initial focus on Commons users, followed by contributors on Wikipedia and other sites.

Benefits
Here are some of the ways structured data can benefit our users:
 * offer a better user experience
 * find relevant content
 * drive editing and re-use
 * provide new ways to contribute
 * improve our infrastructure

Uses
Structured data can support a wide range of uses:
 * Search
 * Viewing
 * Editing
 * Uploading
 * Curating

User Stories
Here are some of the first user stories which this project is expected to support.

As a user, I want to:
 * Search for files by multiple criteria
 * Learn more with related information
 * View file information in my language
 * Edit file information more easily
 * Edit file info from other pages
 * Add more topics and data types
 * Upload / transfer files from any wiki
 * Get smart suggestions

See more user stories. These stories and the workflows they support are work in progress, and will be further defined after our first community discussions.

Examples
Here's an example of how structured data can be used for a typical file:


 * Title (Label): Hamster Berta
 * Description: My hamster Berta in 2007.
 * File name: Berta2007.jpg
 * File type: image/jpeg
 * Resolution: 800x800
 * Statements:
    * Contributor: http://commons.wikimedia.org/wiki/User:Hugo85
    * Real Name: Hugo Masters
    * Time of contribution: March 2007
    * Role: Photographer
    * Rights: http://creativecommons.org/license/CC-BY-SA-3.0-DE
    * Terms of use: BY-SA
    * Topic: hamster
    * Topic: pet
    * Type of work: photograph
    * Coordinate location: 52°31′ N, 13°22′ E
    * Location: Berlin
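
As a rough illustration, part of the record above could be expressed as machine-readable data along these lines. This is only a sketch: the key names and the `values_for` helper are assumptions made for this example, not the project's data model or the Wikibase JSON format, and only a subset of the statements is shown.

```python
import json

# Hypothetical, simplified record for the example file above.
# Key names are illustrative assumptions, not a Wikibase schema.
berta = {
    "label": "Hamster Berta",
    "description": "My hamster Berta in 2007.",
    "file_name": "Berta2007.jpg",
    "file_type": "image/jpeg",
    "resolution": "800x800",
    "statements": [
        {"property": "contributor",
         "value": "http://commons.wikimedia.org/wiki/User:Hugo85"},
        {"property": "role", "value": "Photographer"},
        {"property": "rights",
         "value": "http://creativecommons.org/license/CC-BY-SA-3.0-DE"},
        {"property": "topic", "value": "hamster"},
        {"property": "topic", "value": "pet"},
        {"property": "type_of_work", "value": "photograph"},
        {"property": "location", "value": "Berlin"},
    ],
}


def values_for(record, prop):
    """Return every statement value recorded for one property, in order."""
    return [s["value"] for s in record["statements"] if s["property"] == prop]


# The record is plain data, so it serializes to JSON without extra work.
serialized = json.dumps(berta, indent=2)
```

Because a property such as "topic" can appear more than once, the helper returns a list rather than a single value.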

See more examples

Key data
In a first phase, we would like to focus on these key data for each file:


 * title
 * description
 * work
 * contributor
 * role
 * organization
 * license
 * category
 * topics

See multimedia data list

Storage
Where might data be stored?
 * Wikibase on Commons
    * title
    * description
    * work link
    * contributor link
    * license link
    * category link
    * topic link
    * file link

 * Wikidata.org
    * work
    * contributor
    * license
    * category
    * topic
    * file

 * Commons MediaWiki
    * file name
    * format
    * resolution
    * size
    * date
    * geo-tags
    * version
    * EXIF metadata
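
One way to read the draft split above is as a lookup from field to storage location. The sketch below assumes exactly the three locations and field names listed; the identifiers `STORAGE` and `store_for` are hypothetical and exist only for this illustration.

```python
# Hypothetical routing table built from the draft list above.
# The store names and this layout are assumptions, not a decided design.
STORAGE = {
    "wikibase_on_commons": {
        "title", "description", "work link", "contributor link",
        "license link", "category link", "topic link", "file link",
    },
    "wikidata_org": {
        "work", "contributor", "license", "category", "topic", "file",
    },
    "commons_mediawiki": {
        "file name", "format", "resolution", "size", "date",
        "geo-tags", "version", "EXIF metadata",
    },
}


def store_for(field_name):
    """Return the name of the store that would hold the given field."""
    for store, fields in STORAGE.items():
        if field_name in fields:
            return store
    raise KeyError(f"no storage location drafted for {field_name!r}")
```

Note that, in this reading, Commons would hold links (for example "license link") while the linked item itself ("license") would live on Wikidata.org.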

Components
Here are some of the first modules we might work on:
 * Platform
    * High-level API
    * Media Info Page
    * File Page

First features to use structured data could include:
 * Features
    * Search
    * Upload Wizard
    * Media Viewer

Timeline
This is a first draft of a proposed roadmap for fiscal year 2014-15, for discussion purposes.


 * Q1 (July-Sep. 2014)
    * planning with Wikidata
    * community discussions
    * first data structure

 * Q2 (Oct.-Dec. 2014)
    * developer bootcamps in Berlin/Amsterdam
    * first prototypes
    * develop high-level API
    * small experiments (e.g. location)
    * first metrics (usage, migration, performance)
    * design media info page, file page

 * Q3 (Jan.-Mar. 2015)
    * develop and improve platform code modules
    * build media info page, upgrade file page
    * upgrade key features to support structured data (e.g. Search, Upload Wizard)
    * design editing tools
    * migrate data on Commons with Wikidata and community

 * Q4 (Apr.-June 2015)
    * follow-up work and bug fixes
    * design file page/editing tools
    * continue data migration on Commons
    * build edit tools
    * develop cross-wiki support

All goals above are tentative, for discussion purposes.

Discussions
In coming weeks, we will invite community members to give us more feedback on this project through a variety of discussions, which will take place onwiki, on IRC and via Google Hangouts.

We hope you can join us for one of these events, which will aim to review our plans and engage key stakeholders as active participants in this project.

To participate in these conversations, we invite you to join our public Multimedia mailing list, where we will announce discussion dates shortly.

Workgroups
We plan to form small workgroups to address open issues over time. For example:

 * Workflows
    * Which workflows should we support first?

 * Data structure
    * How do we define a basic data structure?

 * Research
    * How can we measure and validate each feature?

 * Platform
    * What modules need to be built first?
    * How do we develop the high-level API?

 * Features
    * Which features do we develop first? Who builds them?
    * How do we gradually roll out these features?

 * Migration
    * How do we coordinate the data migration?

Are you interested in participating regularly in one of the workgroups above? If so, sign up here and help start a sub-page for your workgroup.

FAQ
Here are some questions that we often hear about this project. More information will be provided as the project develops.

 * Who will develop this project?
This project will be a collaboration between the WMF’s Multimedia team, the Wikidata team, and communities on Wikimedia Commons, Wikipedia and sister projects.

 * How will the community be involved?
The Wikimedia Commons and Wikipedia communities will need to make decisions about the new data structure and will also need to help migrate data from the old structure to the new one. Some of this can be automated; some of it will be manual work.

 * How long will this take?
This is likely to be a long project, taking several years to complete. We started planning and discussions in summer 2014, will prototype and test small experiments in the fall, and aim to start major development and data migration in 2015. A product plan will be based on first test results and community discussions.

 * What will be developed first?
As a first step, we aim to create mock interfaces which behave like Wikidata interfaces but use wikitext to read/write data. The storage interface would involve basic concepts like "file", "work", "contributor", "license", etc. For UploadWizard (where we only write information, in the form of information/license templates on a newly created page), we are considering hiding the current code that generates template text behind an API similar to wbeditentity/wbcreateclaim. After an initial period of experimentation and testing, we would build a high-level API to support a range of features.
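
To make the idea of a mock interface over wikitext more concrete, here is a minimal sketch of the read side only. It assumes a very simple 'information' template call with `|key=value` parameters; the function names `read_information` and `get_claims` and the claim-like dictionary shape are hypothetical, and the parser deliberately ignores nested templates and pipes inside links.

```python
import re


def read_information(wikitext):
    """Parse the |key=value parameters of a simple {{Information}} call.

    Simplified on purpose: nested templates and piped links are not handled.
    """
    match = re.search(
        r"\{\{\s*Information\s*\|(.*?)\}\}",
        wikitext,
        re.DOTALL | re.IGNORECASE,
    )
    if match is None:
        return {}
    fields = {}
    for part in match.group(1).split("|"):
        key, sep, value = part.partition("=")
        if sep:  # skip fragments that are not key=value pairs
            fields[key.strip().lower()] = value.strip()
    return fields


def get_claims(wikitext, prop):
    """Return claim-like dicts for one property, read from wikitext."""
    value = read_information(wikitext).get(prop)
    return [] if value is None else [{"property": prop, "value": value}]


page = (
    "{{Information\n"
    "|description=My hamster Berta in 2007.\n"
    "|author=Hugo Masters\n"
    "|date=2007-03\n"
    "}}"
)
```

A caller would only see `get_claims`, so the wikitext parsing could later be replaced by a real structured store without changing calling code.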

 * Where will the structured data be stored?
The current proposal is to store structured data on a “media info” page attached to the file’s page that is similar to Wikidata’s item pages (e.g.: 'File:Berlin.jpg/info'). Commons is thereby a repository and at the same time its own client, as well as a client of wikidata.org. To learn more about this proposal, visit this page.
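
The naming scheme in that example can be sketched as a pair of small helpers. The '/info' suffix is taken from the 'File:Berlin.jpg/info' example and is only a proposal; the function names are hypothetical.

```python
INFO_SUFFIX = "/info"  # taken from the 'File:Berlin.jpg/info' example; a proposal only


def media_info_title(file_title):
    """Derive the proposed media info page title from a file page title."""
    if not file_title.startswith("File:"):
        raise ValueError(f"not a file page title: {file_title!r}")
    return file_title + INFO_SUFFIX


def file_title_for(info_title):
    """Map a media info page title back to its file page title."""
    if not info_title.endswith(INFO_SUFFIX):
        raise ValueError(f"not a media info page title: {info_title!r}")
    return info_title[: -len(INFO_SUFFIX)]
```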

 * How will people edit the structured data?
File pages would include an editing tool to make it easy to enter and edit metadata, just like on Wikidata. There might be a narrower, special-purpose editing interface for media info, but that's for later. Initially, editing would be very similar to Wikidata. We also expect to use an updated version of Upload Wizard for entering the structured data when a file is uploaded. Other editing methods may be provided later on.

 * Will we still use templates?
Templates will still be used to format the data from the media info page. For example, the 'information' template would be changed to pull information like the contributor or description from the media info page. Template parameters would only be used in rare cases to override the values from the data page. Ideally, file description pages would in the end contain only a single 'information' call with no parameters.
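
The override rule described here (values come from the media info page unless a template parameter is given) can be sketched as follows. The function name and the use of plain dictionaries are assumptions for illustration; treating an empty parameter as "not given" is also an assumption.

```python
def resolve_information(media_info, template_params=None):
    """Merge media info values with explicit template parameters.

    Values default to the media info page; a non-empty template
    parameter overrides the stored value for that key.
    """
    resolved = dict(media_info)
    for key, value in (template_params or {}).items():
        if value:  # assumption: empty parameters do not override
            resolved[key] = value
    return resolved


stored = {"author": "Hugo Masters", "description": "My hamster Berta in 2007."}
```

In the ideal case mentioned above, the template is called with no parameters and the result is exactly what is stored on the media info page.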

 * What data structure will be used?
The data structure will be developed by the Multimedia and Wikidata teams, in collaboration with community members. We expect that some of the current data used on Commons will be stored in structured Wikidata format, while other data will be stored by the MediaWiki software. Since many users and contributors seem confused by the many different data and templates now in use on our sites, we will encourage our community to help streamline and consolidate these options.

 * What interface should the Upload Wizard use to store metadata?
The Upload Wizard code should have a simple interface for storing metadata, which could at first be implemented to just add the appropriate templates to the description page. We could then later swap that implementation for one backed by Wikibase. See Mingle Ticket #309: Prepare for Wikidata Integration on Commons https://wikimedia.mingle.thoughtworks.com/projects/multimedia/cards/309
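
A swappable storage interface of this kind might look like the sketch below. All names (`MetadataStore`, `TemplateStore`, `EntityStore`, `finish_upload`) are hypothetical, and both backends keep their data in memory instead of talking to a wiki; the point is only that the calling code does not change when the backend does.

```python
from typing import Protocol


class MetadataStore(Protocol):
    """What the upload code needs from any metadata backend (hypothetical)."""

    def save(self, file_title: str, metadata: dict) -> None: ...


class TemplateStore:
    """First implementation: render metadata as an 'information' template."""

    def __init__(self):
        self.pages = {}  # stands in for description pages on the wiki

    def save(self, file_title, metadata):
        lines = ["{{Information"]
        lines += [f"|{key}={value}" for key, value in metadata.items()]
        lines.append("}}")
        self.pages[file_title] = "\n".join(lines)


class EntityStore:
    """Placeholder for a later Wikibase-backed implementation (in memory)."""

    def __init__(self):
        self.entities = {}

    def save(self, file_title, metadata):
        self.entities[file_title] = dict(metadata)


def finish_upload(store: MetadataStore, file_title, metadata):
    """Upload code depends only on the interface, not on the backend."""
    store.save(file_title, metadata)
```

Swapping backends then means passing a different object to `finish_upload`, with no change to the upload logic itself.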

 * What is the logical data structure of the future MediaInfo entity?
The logical data structure should be compatible with, but need not be tightly bound to, the high-level interface for storing (and retrieving) the metadata. The high-level interface would probably need some domain knowledge (e.g. which property would be used for the license, which property for the creator(s), etc.).
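
The "domain knowledge" mentioned here could be as small as a table that maps well-known concepts to properties. In the sketch below the property identifiers are placeholders invented for this example (real ones would be chosen by the community), and `add_statement` is a hypothetical helper.

```python
# Placeholder property identifiers, invented for this example only.
WELL_KNOWN_PROPERTIES = {
    "license": "P-license",
    "creator": "P-creator",
}


def add_statement(entity, concept, value):
    """Add a value under the property that represents a well-known concept."""
    prop = WELL_KNOWN_PROPERTIES[concept]  # KeyError for unknown concepts
    entity.setdefault("statements", {}).setdefault(prop, []).append(value)
    return entity
```

Callers speak in terms of "license" or "creator", and only this table needs to know which property is used underneath.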

Planning Tasks
In a first planning phase, we would like to focus on these tasks in coming weeks:
 * Review current discussions, proposals and related documents
 * Review current Wikidata code and technology
 * Host community discussions (onwiki, IRC, roundtables)
 * Design specification based on community and team feedback
 * Estimate development time for key tasks
 * Determine top priorities and overall plan

These planning tasks will require several weeks in June and July, before we can seriously determine our next steps based on good information.

Development Tasks
Here are some of the tasks we expect to take on this year:


 * prototyping
 * entity ID/entity per page refactoring
 * entity class refactoring
 * arbitrary access/usage tracking
 * implement media info entity
 * Commons gets phase 2
 * tag UI (Wikidata item = tag), including integration in upload, editing and search
 * filter by metadata
 * RDF
 * linked data interface

We will track our development work for this project on this Structured Data board on our Mingle planning site. Other development boards include our Current Sprint wall and Current Cycle wall.

MediaInfo Entity Draft

 * ID: based on the name of the media description page
 * Labels (multi-lang, plain text): "short caption"
 * Descriptions (multi-lang, plain text): "long caption"
 * (no Aliases!)
 * (no site links!)

 * original file name (read only) for reference. Mostly redundant to the ID, useful once the page name gets decoupled from the original file name.
 * perhaps additional "intrinsic" info (read only), e.g.
    * Media type / MIME type (read only)
    * Resolution of images
    * Duration of audio/video

 * Statements, with some well-known properties:
    * Free form description(s)
       * Possibly imported, provide source ref
       * mono-lingual (?!)
       * May want to allow inline wikitext (but no block-level markup)
    * Creator/Contributor (with role as a qualifier)
       * as URI (or have several properties for local users, Wikidata Items, plain names, etc.)
    * Time of creation (could be a qualifier to the creator property)
    * Jurisdiction of creation (could be a qualifier to the creator property)
    * Licenses (URI or Wikidata Items)?
    * Topics (Subjects, Locations, Events) (Wikidata Items)
    * Quality badges (Wikidata Items) for featured images, etc.
    * Restriction markers (Wikidata Items) for personality rights, insignia, currency, etc.
    * Source (for imported media)
    * External Identifiers (for media imported from archives, etc.)
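
For discussion purposes, the draft above could be modeled roughly as follows. This is a sketch under assumptions: the class and field names are hypothetical, the English fallback in `label` is an assumption and not part of the draft, and qualifiers are shown as a plain dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    """One statement, e.g. a creator with the role as a qualifier."""
    prop: str
    value: str
    qualifiers: dict = field(default_factory=dict)


@dataclass
class MediaInfo:
    """Draft entity: labels and descriptions only, no aliases or site links."""
    id: str  # based on the name of the media description page
    labels: dict = field(default_factory=dict)        # language code -> short caption
    descriptions: dict = field(default_factory=dict)  # language code -> long caption
    statements: list = field(default_factory=list)

    def label(self, lang, fallback="en"):
        """Return the label in the requested language, else the fallback."""
        return self.labels.get(lang, self.labels.get(fallback))


info = MediaInfo(
    id="File:Berta2007.jpg",
    labels={"en": "Hamster Berta", "de": "Hamsterin Berta"},
    statements=[
        Statement("creator", "Hugo Masters", {"role": "Photographer"}),
    ],
)
```

Keeping labels per language is what would allow a reader to view file information in their own language, with a fallback when no translation exists.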

Links
As this project ramps up in the coming weeks, we will use shared notepads to quickly gather our initial findings, then turn them into more structured wiki pages like this one. Here are a few working documents we're using to get organized:


 * Structured Data Slides (from Wikimania roundtable discussion)
 * Roundtable Notepad (from Wikimania roundtable discussion)
 * User stories
 * Structured data examples
 * Structured Data List
 * Multimedia data API
 * Wikidata for Media Info - Proposal by Daniel Kinzler
 * Structured Data Wall
 * Wikidata: Wikimedia Commons
 * Commons Wikidata Roadmap
 * Wikidata Model
 * First Planning Meeting - June 3, 2014