Reading/Multimedia/Structured Data

Structured Data is a new project that aims to implement structured, machine-readable data to store and retrieve information for images and media files on Wikimedia Commons and other sites. It is likely to be implemented with Wikidata technology, using Wikibase and related extensions.

The Multimedia team and the Wikidata team are planning to develop this project together in 2014 and 2015, in collaboration with the volunteer communities on Wikimedia Commons and other projects impacted by this initiative.

Purpose
The purpose of this project is to:
 * use structured data for all media files on our sites
 * make it easier for users to read / write file information
 * enable developers to build better tools

We aim to support the use of machine-readable data to store and retrieve information for images and media files on Wikimedia Commons and other sites, so they are easier to view, search, edit, curate and use. To do that, we plan to migrate the unstructured data now on Commons into a structured format and container.

Rationale
Today, multimedia file information on Wikimedia sites is stored using an archaic system that is hard to use and support. The current Commons database is poorly designed, very hard to search, confusing to users and impractical for new feature development. Instead of using machine-readable data as most modern sites do, Wikimedia Commons relies on a cumbersome patchwork of plain text data embedded in a range of overlapping templates, with a set of English-only categories that are incompatible with other sites.

Wikidata now offers a practical way to maintain structured data in MediaWiki, and is widely considered a major development in support of the growth of the free knowledge movement. As a result, many community members have proposed that we use this mechanism to store and retrieve media metadata on Wikimedia Commons. This would provide a wide range of benefits to all users of Wikimedia Commons, as well as to the many sites which rely on its multimedia content. Associating files with Wikidata items or geo-location would support more effective ways to search and browse files on Commons. This would also make it a lot easier to show the appropriate attribution and license information when re-using a file. A range of other benefits are listed below.

The Wikimedia Foundation's multimedia team hosted a number of roundtable discussions with community members last year to ask what it should focus on in coming years. In every roundtable, the top request from participants was to implement structured data on Commons -- even when this topic was not on the agenda to begin with. Our community advisors pointed out that search does not work well on Commons, making it hard to find what you are looking for. Others also pointed out that categories are now mostly in English, making it difficult for non-English speakers to contribute on Commons. Many have also suggested that categories be complemented with more granular topics that could be linked to Wikidata's knowledge base in your language -- as well as intersected to provide better search results.

Activities
In coming months, we propose to:
 * Plan & discuss this proposal with our communities
 * Design the data structure and user interface
 * Develop the code and tools needed for this project
 * Migrate unstructured data to the new format
 * Test, measure and adjust as needed

User Groups
Target users for Structured Data include these user groups:
 * readers
 * contributors
 * curators
 * editors
 * developers

This project will impact all key stakeholders that use media files on Commons and other Wikimedia sites. We aim to support these user groups evenly, with an initial focus on Commons users -- then contributors on Wikipedia and other sites. We will also consult with other external user groups such as content providers and site operators who are also an important part of our multimedia ecosystem.

Benefits
Here are some of the ways structured data can benefit our users:
 * offer a better user experience
 * make it easier to find relevant content
 * drive editing and (license-compliant) re-use
 * provide new ways to contribute
 * improve our infrastructure

Uses
Structured data can support a wide range of uses:
 * Search
 * Viewing
 * Editing
 * Uploads
 * Curation
 * Translations

User Stories
Here are some of the user stories which this project is expected to support.

As a user, I want to:
 * Search for files by multiple criteria
 * Learn more with related information
 * View file information in my language
 * Edit file information more easily
 * Edit file info from other pages
 * Add more topics and data types
 * Upload / transfer files from any wiki
 * Get smart suggestions

See more user stories. These stories and the workflows they support are work in progress, and will be further defined after our first community discussions.

Example
Here's an example of what metadata for a typical file could look like in a structured form:


 * Title (Label): White Russian Hamster
 * Description: The Djungarian hamster, also known as the Siberian hamster, Siberian dwarf hamster or Russian winter white dwarf hamster, is one of three species of hamster in the genus Phodopus.


 * File name: Hamster ruso blanco.JPG
 * File type: image/jpeg
 * Resolution: 4,320 × 3,240 pixels
 * Statements:
 * Contributor: Nanny99
 * Role: Photographer
 * Rights: http://creativecommons.org/license/CC-BY-SA-3.0
 * Terms of use: BY-SA
 * Topic: hamster
 * Topic: pet
 * Type of work: photograph
 * Coordinate location: 52°31′ N, 13°22′ E
 * Location: Berlin
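
As a rough illustration, most of the example above could be expressed as machine-readable data along the following lines. This is only a sketch: the key names (`label`, `statements`, `qualifiers` and so on) and the choice to attach the role to the contributor as a qualifier are assumptions made for illustration, not a defined schema.

```python
import json

# Hypothetical structured form of the example above. The key names are
# illustrative assumptions; the real data structure is still to be defined.
media_info = {
    "label": "White Russian Hamster",
    "description": (
        "The Djungarian hamster, also known as the Siberian hamster, "
        "Siberian dwarf hamster or Russian winter white dwarf hamster, "
        "is one of three species of hamster in the genus Phodopus."
    ),
    "file_name": "Hamster ruso blanco.JPG",
    "file_type": "image/jpeg",
    "resolution": {"width": 4320, "height": 3240},
    "statements": [
        # Assumption: the role is stored as a qualifier of the contributor.
        {"property": "contributor", "value": "Nanny99",
         "qualifiers": {"role": "Photographer"}},
        {"property": "rights",
         "value": "http://creativecommons.org/license/CC-BY-SA-3.0"},
        {"property": "terms of use", "value": "BY-SA"},
        {"property": "topic", "value": "hamster"},
        {"property": "topic", "value": "pet"},
        {"property": "type of work", "value": "photograph"},
        {"property": "location", "value": "Berlin"},
    ],
}


def values_for(info, prop):
    """Return every value recorded for one property, in list order."""
    return [s["value"] for s in info["statements"] if s["property"] == prop]


# Serializing to JSON and back leaves the data unchanged, which is what
# makes this kind of record easy for tools to exchange.
round_tripped = json.loads(json.dumps(media_info))

print(values_for(media_info, "topic"))
```

Because each statement is a separate record, a tool can ask for all topics or all contributors of a file without parsing template text.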

See more examples

Data structure
One of the challenges of this project is to define a data structure that does not limit the scope of items to be stored and is compatible with external tools, while avoiding redundancy and confusion.

Throughout the planning and experimentation phases, we will aim to discuss with our communities which data are most needed to support key workflows, and aim to collectively develop a comprehensive list of basic data, to make sure that all important items are accounted for.

We expect the data structure to include these main data clusters:
 * file
 * work
 * contributor
 * license
 * category
 * topics

Some basic concepts are being discussed as possible building blocks for our data structure. For example, we note that a media file can include one or more works, and that each work can have one or more contributors, as well as one or more licenses, categories or topics. This organizing principle could support a wide range of use cases, which we look forward to discussing and prototyping with our communities in coming months.
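
The organizing principle described above (a file includes one or more works, and each work has its own contributors, licenses, categories and topics) could be sketched as follows. The class and field names are hypothetical, and the sample file, works and people are invented purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Contributor:
    name: str
    role: str = ""


@dataclass
class Work:
    title: str
    contributors: List[Contributor] = field(default_factory=list)
    licenses: List[str] = field(default_factory=list)
    categories: List[str] = field(default_factory=list)
    topics: List[str] = field(default_factory=list)


def _unique(items: List[str]) -> List[str]:
    """Drop duplicates while keeping first-seen order."""
    result: List[str] = []
    for item in items:
        if item not in result:
            result.append(item)
    return result


@dataclass
class MediaFile:
    file_name: str
    works: List[Work] = field(default_factory=list)

    def contributor_names(self) -> List[str]:
        """Names of all contributors across all works in this file."""
        return _unique([c.name for w in self.works for c in w.contributors])

    def all_licenses(self) -> List[str]:
        """Every license that applies to at least one work in this file."""
        return _unique([lic for w in self.works for lic in w.licenses])


# Invented example: a photograph of a painting contains two works.
painting = Work(
    title="Example painting",
    contributors=[Contributor("Example Painter", "Painter")],
    licenses=["Public domain"],
    topics=["painting"],
)
photo = Work(
    title="Photograph of the painting",
    contributors=[Contributor("Example Photographer", "Photographer")],
    licenses=["CC-BY-SA-3.0"],
    topics=["painting", "museum"],
)
media_file = MediaFile("Example painting.jpg", works=[painting, photo])

print(media_file.contributor_names())
print(media_file.all_licenses())
```

Keeping licenses and contributors on the work rather than on the file is what lets a single file carry, for example, a public-domain artwork and a separately licensed photograph of it.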

See multimedia data list

Storage
Where might data be stored?
 * Directly in Wikibase on Commons
 * title
 * description
 * work link
 * contributor link
 * license link
 * category link
 * topic link
 * file link


 * On Commons as links to Wikidata.org concepts
 * work
 * contributor
 * license
 * category
 * topic
 * file


 * Directly on Commons MediaWiki (but not necessarily in Wikibase)
 * file name
 * format
 * resolution
 * size
 * date
 * geo-tags
 * version
 * EXIF metadata

Components
Here are some of the first modules we might work on:
 * Platform
 * High-level API
 * Media Info Page
 * File Page

First features to use structured data could include:
 * Features
 * Search
 * Upload Wizard
 * Media Viewer

Roadmap
These are some of the phases we are considering for fiscal year 2014-15, for discussion purposes.


 * Phase 1
 * planning with Wikidata
 * community discussions
 * first data structure
 * first specifications


 * Phase 2
 * developer bootcamps in Berlin/Amsterdam
 * first prototypes
 * develop high-level API
 * small experiments (e.g.: location)
 * first metrics (usage, migration, performance)
 * design media info page, file page


 * Phase 3
 * develop and improve platform code modules
 * build media info page, upgrade file page
 * upgrade key features to support structured data (e.g.: Search, Upload Wizard)
 * design editing tools
 * start data migration on Commons


 * Phase 4
 * Follow up work and bug fixes
 * design file page/editing tools
 * continue data migration on Commons
 * build edit tools
 * develop cross-wiki support

All goals above are tentative, for discussion purposes.

Discussions
In coming weeks, we will invite community members to give us more feedback on this project through a variety of discussions, which will take place onwiki, on IRC and via Google Hangouts.

We hope you can join us for one of these events, which will aim to review our plans and engage key stakeholders as active participants in this project.

To participate in these conversations, we invite you to join our public Multimedia mailing list, where we will announce discussion dates shortly.

Workgroups
We will want to form small workgroups to address open issues over time. For example:


 * Workflows
 * Which workflows should we support first?


 * Data structure
 * How do we define a basic data structure?


 * Research
 * How can we measure and validate each feature?


 * Platform
 * What modules need to be built first?
 * How do we develop the high-level API?


 * Features
 * Which features do we develop first? Who builds them?
 * How do we gradually roll out these features?


 * Migration
 * How do we coordinate the data migration?

Are you interested in participating regularly in one of the workgroups above? If so, sign up here -- and help start a sub-page for your workgroup.

FAQ
Here are some questions we often hear about this project. More information will be provided as the project develops.

 * Who will develop this project?
This project will be a collaboration between the WMF’s Multimedia team, the Wikidata team, and communities on Wikimedia Commons, Wikidata, Wikipedia and sister projects.

 * How will the community be involved?
The Wikimedia Commons, Wikidata and Wikipedia communities will need to make decisions about the new data structure and will also need to help migrate data from the old structure to the new structure. Some of this can be automated, some of it will be manual work.

 * How long will this take?
This is likely to be a long project, taking several years to complete. We started planning and discussions in summer 2014, will prototype and test small experiments in the fall, and aim to start major development and data migration in 2015. A product plan will be based on first test results and community discussions.

 * What will be developed first?
As a first step, we aim to create mock interfaces which behave like Wikidata interfaces but use wikitext to read/write data. The storage interface would involve basic concepts like "file", "work", "contributor", "license", etc. For UploadWizard (where we only write information, in the form of information/license templates on a newly created page) we are considering hiding the current code that generates template text behind an API similar to wbeditentity/wbcreateclaim. After an initial period of experimentation and testing, we would build a high-level API to support a range of features.

 * Where will the structured data be stored?
The current proposal is to store structured data on a “media info” page attached to the file’s page that is similar to Wikidata’s item pages (e.g.: 'Info:Berlin.jpg'). To learn more about this proposal, visit this page.

 * How will people edit the structured data?
File pages would include an editing tool to make it easy to enter and edit metadata, just like on Wikidata. There might be a narrower, special-purpose editing interface for media info, but that's for later. Initially, editing would be very similar to Wikidata. We also expect to use an updated version of Upload Wizard for entering the structured data when a file is uploaded. Other editing methods may be provided later on.

 * Will we still use templates?
Templates will still be used to format the data from the media info page. For example, the 'Information' template would be changed to pull information like the contributor or description from the media info page. Template parameters would only be used in rare cases to override the values from the data page. Ideally, file description pages would in the end contain only a single template call with no parameters.

 * What data structure will be used?
The data structure will be developed by the multimedia and Wikidata teams, in collaboration with community members. We expect that some of the current data used on Commons will be stored in structured Wikidata format, while others will be stored by MediaWiki software. Since many users and contributors seem confused by the many different data and templates now in use on our sites, we will encourage our community to help streamline and consolidate these options.

 * What interface should the Upload Wizard use to store metadata?
The Upload Wizard code should have a nice & simple interface for storing metadata, which could at first be implemented to just add the appropriate templates to the description page. We could then later swap that implementation for one backed by Wikibase. See Mingle Ticket #309: Prepare for Wikidata Integration on Commons https://wikimedia.mingle.thoughtworks.com/projects/multimedia/cards/309

 * What is the logical data structure of the future MediaInfo entity?
This should be compatible with, but need not be tightly bound to, the high-level interface for storing (and retrieving) the metadata. The high-level interface would probably need some domain knowledge (e.g. which property would be used for the license, which property for the creator(s), etc).
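
The FAQ above suggests that Upload Wizard could first store metadata by writing templates to the description page, and later switch to an implementation backed by Wikibase. One way to picture that swap is the sketch below. All class and function names are hypothetical, the second store only records plain (property, value) pairs in memory rather than calling any real Wikibase API, and the file name and metadata values are invented.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple

# Parameters of the 'Information' template written by this sketch.
TEMPLATE_FIELDS = ("description", "date", "source", "author")


class MetadataStore(ABC):
    """Hypothetical storage interface that upload code would depend on."""

    @abstractmethod
    def save(self, file_name: str, metadata: Dict[str, str]) -> None:
        """Persist the metadata for one file."""


class TemplateStore(MetadataStore):
    """First implementation: render the metadata as template wikitext."""

    def __init__(self) -> None:
        self.pages: Dict[str, str] = {}

    def save(self, file_name: str, metadata: Dict[str, str]) -> None:
        # Note: values are written as-is; real code would need to escape
        # characters such as "|" that have a meaning in template syntax.
        lines = ["{{Information"]
        lines += [f"|{name}={metadata[name]}"
                  for name in TEMPLATE_FIELDS if name in metadata]
        lines.append("}}")
        self.pages[file_name] = "\n".join(lines)


class StatementStore(MetadataStore):
    """Stand-in for a later structured backend (not a real Wikibase client)."""

    def __init__(self) -> None:
        self.statements: Dict[str, List[Tuple[str, str]]] = {}

    def save(self, file_name: str, metadata: Dict[str, str]) -> None:
        self.statements[file_name] = sorted(metadata.items())


def finish_upload(store: MetadataStore, file_name: str,
                  metadata: Dict[str, str]) -> None:
    """Upload code only sees the interface, so the backend can be swapped."""
    store.save(file_name, metadata)


metadata = {"description": "Skyline of Berlin", "author": "ExampleUser"}
template_store = TemplateStore()
statement_store = StatementStore()
finish_upload(template_store, "Berlin.jpg", metadata)
finish_upload(statement_store, "Berlin.jpg", metadata)

print(template_store.pages["Berlin.jpg"])
```

The calling code is identical for both stores, which is the property that would allow the template-based implementation to be replaced without changing the upload workflow.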

Planning Tasks
In a first planning phase, we would like to focus on these tasks in coming weeks:
 * Review current discussions, proposals and related documents
 * Review current Wikidata code and technology
 * Host community discussions (onwiki, IRC, roundtables)
 * Design specification based on community and team feedback
 * Estimate development time for key tasks
 * Determine top priorities and overall plan

These planning tasks will require several weeks in June and July, before we can seriously determine our next steps based on good information.

Development Tasks
Here are some of the tasks we expect to take on this year:


 * prototyping
 * entity id/entity per page refactoring
 * entity class refactoring
 * arbitrary access/usage tracking
 * implement media info entity
 * commons gets phase 2
 * topic UI (Wikidata item = topic) - includes integration in upload, editing and search
 * filter by metadata
 * rdf
 * linked data interface

We will track our development work for this project on this Structured Data board on our Mingle planning site. Other development boards include our Current Sprint wall and Current Cycle wall.

MediaInfo Entity Draft

 * ID: based on the name of the media description page
 * Labels (multi-lang, plain text): "short caption"
 * Description (multi-lang, plain text): "long caption"
 * (no Aliases!)
 * (no site links!)


 * original file name (read only) for reference. Mostly redundant to the ID, useful once the page name gets decoupled from the original file name.
 * perhaps additional "intrinsic" info (read only), e.g.
 * Media type / MIME type (read only)
 * Resolution of images
 * Duration of audio/video


 * Statements, with some well-known properties:
 * Free form description(s)
 * Possibly imported, provide source ref
 * mono-lingual (?!)
 * May want to allow inline wikitext (but no block-level markup)
 * Creator/Contributor (with role as a qualifier)
 * as URI (or have several properties for local users, Wikidata Items, plain names, etc)
 * Time of creation (could be a qualifier to the creator property)
 * Jurisdiction of creation (could be a qualifier to the creator property)
 * Licenses (URI or Wikidata Items)?
 * Topics (Subjects, Locations, Events) (Wikidata Items)
 * Quality badges (Wikidata Items) for featured images, etc
 * Restriction markers (Wikidata Items) for personality rights, insignia, currency, etc
 * Source (for imported media)
 * External Identifiers (for media imported from archives, etc)
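
To make the draft above more concrete, here is a minimal sketch of what such an entity might look like in code. It assumes the ID is simply the page name, that labels fall back to English when a language is missing, and that the role is a qualifier on the creator statement; the class names, the fallback rule and the Spanish label are illustrative assumptions, not part of the draft.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Statement:
    prop: str
    value: str
    # Qualifiers refine a statement, e.g. the role of a creator.
    qualifiers: Dict[str, str] = field(default_factory=dict)


@dataclass
class MediaInfo:
    page_name: str
    # Multi-lingual plain text, keyed by language code.
    labels: Dict[str, str] = field(default_factory=dict)
    descriptions: Dict[str, str] = field(default_factory=dict)
    statements: List[Statement] = field(default_factory=list)
    # Deliberately no aliases and no site links, as in the draft.

    @property
    def entity_id(self) -> str:
        # Assumption: the ID is the media description page name itself.
        return self.page_name

    def label(self, lang: str, fallback: str = "en") -> Optional[str]:
        """Label in the requested language, else in the fallback language."""
        return self.labels.get(lang) or self.labels.get(fallback)

    def values(self, prop: str) -> List[str]:
        """All values stated for one property, in order."""
        return [s.value for s in self.statements if s.prop == prop]


info = MediaInfo(
    page_name="Hamster ruso blanco.JPG",
    labels={"en": "White Russian Hamster", "es": "Hámster ruso blanco"},
    statements=[
        Statement("creator", "Nanny99", {"role": "Photographer"}),
        Statement("topic", "hamster"),
        Statement("topic", "pet"),
    ],
)

print(info.entity_id)
print(info.label("de"))
print(info.values("topic"))
```

A request for a German label falls back to the English one here, which is one simple way to show file information in a reader's language when a translation is missing.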

Links
As this project gets ramped up in coming weeks, we will be using shared notepads to quickly gather our initial findings, then will turn them into more structured wiki pages like this one. Here are a few working documents we're using to get organized:


 * Structured Data Slides (from Wikimania roundtable discussion)
 * Roundtable Notepad (from Wikimania roundtable discussion)
 * User stories
 * Structured data examples
 * Structured Data List
 * Multimedia data API
 * Wikidata for Media Info - Proposal by Daniel Kinzler
 * Structured Data Wall
 * Wikidata: Wikimedia Commons
 * Commons Wikidata Roadmap
 * Wikidata Model
 * First Planning Meeting - June 3, 2014