Reading/Multimedia/Structured Data

Structured Data is a new project that aims to implement structured, machine-readable data to store and retrieve information for images and media files on Wikimedia Commons and other sites. It is likely to be implemented with the Wikidata technology and Wikibase and related extensions.

The Multimedia team and the Wikidata team are planning to develop this project together in 2014 and 2015, in collaboration with the volunteer communities on Wikimedia Commons and other projects impacted by this initiative.

Purpose
The purpose of this project is to:
 * use structured data for all media files on our sites
 * make it easier for users to read / write file information
 * enable developers to build better tools

We aim to support the use of machine-readable data to store and retrieve information for images and media files on Wikimedia Commons and other sites, so they are easier to view, search, edit, curate and use. To that end, we propose to investigate this opportunity together through community discussions and small experiments. If these initial tests are successful, we would develop new tools and practices for structured data, then work with our communities to gradually migrate unstructured data into a machine-readable format over time.

Rationale
Today, multimedia file information on Wikimedia sites is stored using an aging software system that is hard to use and support. The current Commons database is poorly designed, very hard to search, confusing to users and impractical for new feature development. Instead of using machine-readable data as most modern sites do nowadays, Wikimedia Commons relies on a cumbersome patchwork of plain text data embedded in a range of overlapping templates, with a set of English-only categories that are often incompatible with other sites or tools.

Wikidata now offers a practical way to maintain structured data in MediaWiki, and is widely regarded as a useful tool to support the growth of the free knowledge movement. As a result, many community members have proposed that we use this mechanism to store and retrieve media metadata on Wikimedia Commons. This would provide a wide range of benefits to all users of Wikimedia Commons, as well as to the many sites which rely on its multimedia content. Associating files with Wikidata items or geo-location would support more effective ways to search and browse files on Commons. It would also make it much easier to show the appropriate attribution and license information when re-using a file. More benefits are listed below.

The Wikimedia Foundation's multimedia team hosted a number of roundtable discussions with community members last year to ask what it should focus on in coming years. In every roundtable, the top request from participants was to implement structured data on Commons -- even when this topic was not on the agenda to begin with. Our community advisors pointed out that search does not work well on Commons, making it hard to find what you are looking for. Others also pointed out that categories are now mostly in English, making it difficult for non-English speakers to contribute on Commons. Many have also suggested that categories be complemented with more granular topics that could be linked to Wikidata's knowledge base in your language -- as well as intersected to provide better search results.



Activities
In coming months, we propose to:
 * Plan & discuss this proposal with our communities
 * Design the data structure and user interface
 * Develop the code and tools needed for this project
 * Migrate unstructured data to the new format
 * Test, measure and adjust as needed

User Groups
Target users for Structured Data include these user groups:
 * readers
 * contributors
 * curators
 * editors
 * developers

This project will impact all key stakeholders that use media files on Commons and other Wikimedia sites. We aim to support these user groups evenly, with an initial focus on Commons users -- then contributors on Wikipedia and other sites. We will also consult with other external user groups such as content providers and site operators who are also an important part of our multimedia ecosystem.



Benefits
Here are some of the ways structured data can benefit our users:
 * offer a better user experience
 * make it easier to find relevant content
 * drive editing and re-use (license-compliant)
 * provide new ways to contribute
 * improve our infrastructure

User Stories
Here are some of the user stories which this project is expected to support.

As a user, I want to:
 * Search for files by multiple criteria
 * Learn more with related information
 * View file information in my language
 * Edit file information more easily
 * Edit file info from other pages
 * Add more topics and data types
 * Upload / transfer files from any wiki
 * Get smart suggestions

See more user stories. These stories and the workflows they support are work in progress, and will be further defined after our first community discussions.
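
As a rough illustration of the first story, searching by multiple criteria can be treated as a set intersection over the topics attached to each file. This is only a sketch, not the project's search design: the `build_topic_index` and `search_by_topics` helpers are hypothetical, and every file name except the hamster photo is made up.

```python
from collections import defaultdict

# Hypothetical sample data: file name -> topics attached to that file.
FILE_TOPICS = {
    "Hamster ruso blanco.JPG": {"hamster", "pet"},
    "Example cat.jpg": {"cat", "pet"},
    "Example wild hamster.jpg": {"hamster"},
}


def build_topic_index(file_topics):
    """Invert file -> topics into topic -> set of files."""
    index = defaultdict(set)
    for file_name, topics in file_topics.items():
        for topic in topics:
            index[topic].add(file_name)
    return index


def search_by_topics(index, *topics):
    """Return the files tagged with every requested topic (set intersection)."""
    if not topics:
        return set()
    result = set(index.get(topics[0], set()))
    for topic in topics[1:]:
        result &= index.get(topic, set())
    return result


index = build_topic_index(FILE_TOPICS)
print(sorted(search_by_topics(index, "hamster", "pet")))
# prints ['Hamster ruso blanco.JPG']
```

Because each topic maps to a set of files, adding another criterion only narrows the result, which is the behavior community members asked for when they suggested intersecting topics.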

Example
Here's an example of what metadata for a typical file could look like in a structured form:


 * Title (Label): White Russian Hamster
 * Description: The Djungarian hamster, also known as the Siberian hamster, Siberian dwarf hamster or Russian winter white dwarf hamster, is one of three species of hamster in the genus Phodopus.
 * File name: Hamster ruso blanco.JPG
 * File type: image/jpeg
 * Resolution: 4,320 × 3,240 pixels
 * Statements:
    * Contributor: Nanny99
    * Role: Photographer
    * Rights: http://creativecommons.org/license/CC-BY-SA-3.0
    * Terms of use: BY-SA
    * Topic: hamster
    * Topic: pet
    * Type of work: photograph
    * Coordinate location: 52°31′ N, 13°22′ E
    * Location: Berlin

See more examples
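
To make "machine-readable" concrete, the same record could be written as a plain data structure and serialized to JSON. This is a minimal sketch: the values come from the example above, but the key names, the flat list of statements and the `values_for` helper are assumptions for illustration, not a decided schema.

```python
import json

# Values taken from the hamster example; key names are illustrative only.
media_info = {
    "label": "White Russian Hamster",
    "file": {
        "name": "Hamster ruso blanco.JPG",
        "type": "image/jpeg",
        "resolution": {"width": 4320, "height": 3240},
    },
    "statements": [
        {"property": "contributor", "value": "Nanny99"},
        {"property": "role", "value": "Photographer"},
        {"property": "terms of use", "value": "BY-SA"},
        {"property": "topic", "value": "hamster"},
        {"property": "topic", "value": "pet"},
        {"property": "type of work", "value": "photograph"},
        {"property": "coordinate location", "value": "52°31′ N, 13°22′ E"},
        {"property": "location", "value": "Berlin"},
    ],
}


def values_for(record, prop):
    """Return every value stored for a property, in statement order."""
    return [s["value"] for s in record["statements"] if s["property"] == prop]


print(values_for(media_info, "topic"))
# prints ['hamster', 'pet']

# The whole record can be exchanged with external tools as JSON.
print(json.dumps(media_info, ensure_ascii=False, indent=2))
```

Storing statements as a list rather than a dictionary lets one property, such as the topic, carry several values.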

Data structure


One of the challenges of this project is to define a data structure that does not limit the scope of items to be stored and is compatible with external tools, while avoiding redundancy and confusion.

Throughout the planning and experimentation phases, we will aim to discuss with our communities which data are most needed to support key workflows, and aim to collectively develop a comprehensive list of basic data, to make sure that all important items are accounted for.

We expect the data structure to include these main data clusters:
 * file
 * work
 * contributor
 * license
 * category
 * topics

Some basic concepts are being discussed as possible building blocks for our data structure. For example, we note that a media file can include one or more works, and that each work can have one or more contributors, as well as one or more licenses, categories or topics. This organizing principle could support a wide range of use cases, which we look forward to discussing and prototyping with our communities in coming months.

See multimedia data list
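
The organizing principle described above (a file contains one or more works, and each work has its own contributors, licenses, categories and topics) could be sketched with a few record types. The class and field names below are assumptions for discussion, and the sample file is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Contributor:
    name: str
    role: str


@dataclass
class Work:
    title: str
    contributors: List[Contributor] = field(default_factory=list)
    licenses: List[str] = field(default_factory=list)
    categories: List[str] = field(default_factory=list)
    topics: List[str] = field(default_factory=list)


@dataclass
class MediaFile:
    file_name: str
    works: List[Work] = field(default_factory=list)

    def all_licenses(self):
        """Collect the distinct licenses of every work contained in the file."""
        seen = []
        for work in self.works:
            for lic in work.licenses:
                if lic not in seen:
                    seen.append(lic)
        return seen


# Hypothetical case: a photograph of a painting holds two works in one file.
photo = MediaFile(
    file_name="Example painting photo.jpg",
    works=[
        Work(
            title="Photograph",
            contributors=[Contributor("Example photographer", "Photographer")],
            licenses=["CC-BY-SA-3.0"],
        ),
        Work(
            title="Painting",
            contributors=[Contributor("Example painter", "Painter")],
            licenses=["Public domain"],
        ),
    ],
)
print(photo.all_licenses())
# prints ['CC-BY-SA-3.0', 'Public domain']
```

Keeping licenses on the work rather than on the file is what allows a re-user to see that both sets of terms apply to the same file.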



Storage
Where might data be stored?
 * Wikibase on Commons
    * title
    * description
    * work link
    * contributor link
    * license link
    * category link
    * topic link
    * file link

(links on Commons to Wikidata items)
 * Wikidata.org
    * work
    * contributor
    * license
    * category
    * topic
    * file

(but not necessarily in Wikibase)
 * Commons MediaWiki
    * file name
    * format
    * resolution
    * size
    * date
    * geo-tags
    * version
    * EXIF metadata
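
One way to read the tentative lists above is as a lookup table that routes each field to a store. The sketch below only transcribes those lists. The `split_by_store` helper and the "unassigned" fallback are assumptions, and nothing here reflects a final decision about where data will live.

```python
# Tentative placement of each field, transcribed from the lists above.
STORAGE = {
    "Wikibase on Commons": [
        "title", "description", "work link", "contributor link",
        "license link", "category link", "topic link", "file link",
    ],
    "Wikidata.org": ["work", "contributor", "license", "category", "topic", "file"],
    "Commons MediaWiki": [
        "file name", "format", "resolution", "size",
        "date", "geo-tags", "version", "EXIF metadata",
    ],
}


def split_by_store(record):
    """Group a flat field -> value record by the store that lists each field."""
    grouped = {}
    for field_name, value in record.items():
        store = next(
            (name for name, fields in STORAGE.items() if field_name in fields),
            "unassigned",  # field not mentioned in the proposal
        )
        grouped.setdefault(store, {})[field_name] = value
    return grouped


record = {
    "title": "White Russian Hamster",
    "file name": "Hamster ruso blanco.JPG",
    "format": "image/jpeg",
    "mood": "sleepy",  # made-up field to show the fallback
}
print(split_by_store(record))
```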

Components
Here are some of the first modules we might work on:
 * Platform
    * High-level API
    * Media Info Page
    * File Page

First features to use structured data could include:
 * Features
    * Search
    * Upload Wizard
    * Media Viewer

Roadmap
These are some of the phases we are considering for fiscal year 2014-15, for discussion purposes.

 * Phase 1
    * planning with Wikidata
    * community discussions
    * first data structure
    * first specifications

 * Phase 2
    * developer bootcamps in Berlin/Amsterdam
    * first prototypes
    * develop high-level API
    * small experiments (e.g.: location)
    * first metrics (usage, migration, performance)
    * design media info page, file page

 * Phase 3
    * develop and improve platform code modules
    * build media info page, upgrade file page
    * upgrade key features to support structured data (e.g.: Search, Upload Wizard)
    * design editing tools
    * start data migration on Commons

 * Phase 4
    * follow-up work and bug fixes
    * design file page/editing tools
    * continue data migration on Commons
    * build edit tools
    * develop cross-wiki support

All goals above are tentative, for discussion purposes.

New discussions
All community members are invited to join our upcoming discussions, which will take place here on Commons, on IRC and via Google Hangouts.

What do you think of this structured data project? Please share your feedback on this talk page.
 * Structured Data Talk Page

You're invited to join our first office hours chat on this IRC channel: #wikimedia-commons -- please sign up below if you plan to attend.
 * Structured Data Q&A (IRC) - Wed. Sep. 3 @ 19:00 UTC



We'll be hosting more discussions in coming weeks and hope you can join us for one of these events, so we can review this project and plan our next steps together.

To be notified of these events, you're welcome to subscribe to one of these public mailing lists, if you haven't already: Multimedia list or Wikidata list.

Past discussions
Here are some recent roundtable discussions about this project, which we hosted at Wikimania 2014 in London in early August:


 * Structured Data Roundtable - Thursday, August 7 - 10:00 GMT+1 - 90 mins. (Frobisher 3) (Slides | Etherpad | Photos)


 * Multimedia Roundtable - Sunday, August 10 - 11:30 GMT+1 - 90 mins. (Boardroom) (Slides | Notepad | Photos)

Workgroups
We will want to form small workgroups to address open issues over time. For example:


 * Workflows
    * Which workflows should we support first?

 * Data structure
    * How do we define a basic data structure?

 * Research
    * How can we measure and validate each feature?

 * Platform
    * What modules need to be built first?
    * How do we develop the high-level API?

 * Features
    * Which features do we develop first? Who builds them?
    * How do we gradually roll out these features?

 * Migration
    * How do we coordinate the data migration?

Are you interested in participating regularly in one of the workgroups above? If so, sign up here -- and help start a sub-page for your workgroup.

FAQ




Here are some questions we often hear about this project. More information will be provided as the project develops.

 * Who will develop this project?

This project will be a collaboration between the WMF’s Multimedia team, the Wikidata team, and communities on Wikimedia Commons, Wikidata, Wikipedia and sister projects.

 * How will the community be involved?

The Wikimedia Commons, Wikidata and Wikipedia communities will need to make decisions about the new data structure and will also need to help migrate data from the old structure to the new structure. Some of this can be automated; some of it will be manual work.

 * How long will this take?

This is likely to be a long project, taking several years to complete. We started planning and discussions in summer 2014, will prototype and test small experiments in the fall, and aim to start major development and data migration in 2015. A product plan will be based on first test results and community discussions.

 * What will be developed first?

As a first step, we aim to create mock interfaces which behave like Wikidata interfaces but use wikitext to read/write data. The storage interface would involve basic concepts like "file", "work", "contributor", "license", etc. For UploadWizard (where we only write information, in the form of information/license templates on a newly created page) we are considering hiding the current code that generates template text behind an API similar to wbeditentity/wbcreateclaim. After an initial period of experimentation and testing, we would build a high-level API to support a range of features.

 * Where will the structured data be stored?

The current proposal is to store structured data on a “media info” page attached to the file’s page that is similar to Wikidata’s item pages (e.g.: 'Info:Berlin.jpg'). To learn more about this proposal, visit this page.

 * How will people edit the structured data?

File pages would include an editing tool to make it easy to enter and edit metadata, just like on Wikidata. There might be a narrower, special-purpose editing interface for media info, but that's for later. Initially, editing would be very similar to Wikidata. We also expect to use an updated version of Upload Wizard for entering the structured data when a file is uploaded. Other editing methods may be provided later on.

 * Will we still use categories?

Categories will be fully supported by the structured data system -- and we hope to provide them in a variety of languages, not just English. Many have suggested that we also complement categories with a more granular set of 'topics', which can be linked to corresponding items on Wikidata, as well as intersected to improve search.

 * Will we still use templates?

Templates will still be used to format the data from the media info page. For example, the template would be changed to pull information like the contributor or description from the media info page. Template parameters would only be used in rare cases to override the values from the data page. Ideally, file description pages would in the end contain only a single template call with no parameters.

 * What data structure will be used?

The data structure will be developed by the multimedia and Wikidata teams, in collaboration with community members. We expect that some of the current data used on Commons will be stored in structured Wikidata format, while others will be stored by MediaWiki software. Since many users and contributors seem confused by the many different data and templates now in use on our sites, we invite community members to help streamline and consolidate these options, to eliminate redundancies.

 * What interface should the Upload Wizard use to store metadata?

The Upload Wizard code should have a simple interface for storing metadata, which could at first be implemented to just add the appropriate templates to the description page. We could then later swap that implementation for one backed by Wikibase. See Mingle Ticket #309: Prepare for Wikidata Integration on Commons https://wikimedia.mingle.thoughtworks.com/projects/multimedia/cards/309

 * What is the logical data structure of the future MediaInfo entity?

This should be compatible with, but need not be tightly bound to, the high-level interface for storing (and retrieving) the metadata. The high-level interface would probably need some domain knowledge (e.g. which property would be used for the license, which property for the creator(s), etc).
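
The Upload Wizard answer above describes a storage interface whose first implementation writes templates and could later be swapped for one backed by Wikibase. A minimal sketch of that idea follows. All class and function names are hypothetical, the template parameters are chosen for illustration, and the second backend is an in-memory stand-in that makes no real Wikibase API calls.

```python
from abc import ABC, abstractmethod


class MetadataStore(ABC):
    """Hypothetical storage interface that Upload Wizard code would call."""

    @abstractmethod
    def save(self, file_name, metadata):
        """Persist metadata for a file and return what was written."""


class TemplateStore(MetadataStore):
    """First implementation: render metadata as template wikitext."""

    def save(self, file_name, metadata):
        lines = ["{{Information"]
        # Parameter names assumed for illustration.
        for key in ("description", "date", "source", "author"):
            if key in metadata:
                lines.append("|%s=%s" % (key, metadata[key]))
        lines.append("}}")
        return "\n".join(lines)


class InMemoryEntityStore(MetadataStore):
    """Stand-in for a later Wikibase-backed implementation."""

    def __init__(self):
        self.entities = {}

    def save(self, file_name, metadata):
        self.entities[file_name] = dict(metadata)
        return self.entities[file_name]


def upload(store, file_name, metadata):
    """Caller code depends only on the MetadataStore interface."""
    return store.save(file_name, metadata)


metadata = {"description": "Skyline", "author": "Nanny99"}
print(upload(TemplateStore(), "Berlin.jpg", metadata))
print(upload(InMemoryEntityStore(), "Berlin.jpg", metadata))
```

Since `upload` never inspects which backend it was given, replacing the template writer with a structured-data backend would not require changes to the calling code.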

Planning Tasks
In a first planning phase, we would like to focus on these tasks in coming weeks:
 * Review current discussions, proposals and related documents
 * Review current Wikidata code and technology
 * Host community discussions (onwiki, IRC, roundtables)
 * Design specification based on community and team feedback
 * Estimate development time for key tasks
 * Determine top priorities and overall plan

These planning tasks will take several weeks in June and July. Only then will we have enough information to determine our next steps.

Development Tasks
Here are some of the tasks we expect to take on this year:


 * prototyping
 * entity id/entity per page refactoring
 * entity class refactoring
 * arbitrary access/usage tracking
 * implement media info entity
 * Commons gets phase 2
 * topic UI (Wikidata item = topic) - includes integration in upload, editing and search
 * filter by metadata
 * RDF
 * linked data interface

We will track our development work for this project on this Structured Data board on our Mingle planning site. Other development boards include our Current Sprint wall and Current Cycle wall.

Team
This project is a collaboration between:
 * Multimedia Team (Wikimedia Foundation)
 * Wikidata Team (Wikimedia Deutschland)
 * Wikimedia Community (from Commons and other projects)


 * Wikimedia Foundation
    * Fabrice Florin - Product Manager
    * Gilles Dubuc - Senior Software Engineer
    * Gergő Tisza - Back-end Software Engineer
    * Pau Giner - Interaction Designer
    * Keegan Peterzell - Community Liaison
    * Rob Lanphier - Platform Engineering Director
    * Erik Moeller - VP Engineering / Product

 * Wikidata
    * Lydia Pintscher - Product Manager
    * Daniel Kinzler - Senior Software Engineer
    * more to be added here

 * Wikimedia Community
    * Maarten Dammers - Community advisor
    * Derk-Jan Hartman - Community advisor
    * more to come

Links
As this project ramps up in coming weeks, we will use shared notepads to quickly gather our initial findings, then turn them into more structured wiki pages like this one. Here are a few working documents we're using to get organized:


 * Structured Data Slides (from Wikimania roundtable discussion)
 * Roundtable Notepad (from Wikimania roundtable discussion)
 * User stories
 * Structured data examples
 * Structured Data List
 * Multimedia data API
 * Wikidata for Media Info
 * Structured Data Wall
 * Wikidata: Wikimedia Commons
 * Commons Wikidata Roadmap
 * Request for comment: Structured Commons
 * Wikidata Model
 * DBpedia for commons
 * DBpedia commons extraction
 * DBpedia mappings
 * DBpedia mapping stats