Reading/Multimedia/Structured Data

Structured Data is a new project that aims to store information about images and media files on Wikimedia Commons and other sites as structured, machine-readable data that is easy to retrieve. It is likely to be implemented with Wikidata technology, including Wikibase Client and related extensions.

The Multimedia team and the Wikidata team are planning to develop this project together in 2014 and 2015, in collaboration with the volunteer communities on Wikimedia Commons and other projects impacted by this initiative.

Goals
The goal of this project is to store information about images and media files on Wikimedia Commons and other sites as structured, machine-readable data, so that this information is easier to view, search, edit, curate and use. To do that, we plan to migrate the unstructured data now on Commons into a structured format and container.

In coming months, we propose to:
 * Plan and discuss this project with our community
 * Design the data structure and user interface
 * Develop the code and tools needed for this project
 * Migrate unstructured data to the new format
 * Measure and adjust as needed

Timeline
This is a first draft of our proposed roadmap for fiscal year 2014-15, for discussion purposes.


Q1 (July-Sept. 2014):
 * planning with Wikidata
 * new metadata structure
 * community discussions
 * design file page/editing tools
 * mockups/prototypes

Q2 (Oct.-Dec. 2014):
 * developer bootcamp in Berlin
 * develop first code modules
 * upgrade Upload Wizard to support structured data
 * migrate data on Commons with Wikidata and community

Q3 (Jan.-Mar. 2015):
 * develop and improve code modules
 * continue data migration on Commons
 * prepare tools for other sites

Q4 (Apr.-June 2015):
 * follow-up work and bug fixes

Users
Our users for Structured Data include all key stakeholders who use media files on Commons:
 * readers
 * contributors
 * curators
 * editors
 * campaign organizers
 * developers

We aim to support all of these user groups, with an initial focus on Commons users, followed by contributors on Wikipedia and other sites.

Use Cases
Here are some of the first use cases which this project is expected to support.


 * Search by multiple tags - find media files by typing one or more tags in my language.
 * Search by single tag - find media files by typing one tag in my language.
 * Search by Commons category - find media files by Commons category.
 * Search by Location - find media files near this page or in a related location.
 * Multi-lingual Metadata - see metadata in my language (e.g. file title, caption, description, place name, category or tag name, etc.)
 * Full Attribution - see all relevant attributions and automatically embed them when using a file (e.g. author/source attribution, license information)
 * File Assessment - see all relevant file assessments and use them for surfacing the best content (e.g. featured, feedback, curation tags or ratings)
 * Auto-suggest media - get recommendations of media files for my page, based on other language versions of this page.
 * More information - see more information from Wikidata about a topic on Commons (or a media file's topic), to make it more interesting.

To learn more, check this first notepad. Many of these requirements are shared by the VisualEditor (VE) and Multimedia teams, and are sorted by joint priority.

More use cases and workflows will be prepared after our first community discussions.

Discussions
In the coming weeks, we will invite community members to give us more feedback on this project through a variety of discussions, which will take place on-wiki, on IRC and via Google Hangouts.

We hope you can join us for one of these events, which will aim to review our plans and engage key stakeholders as active participants in this project.

To participate in these conversations, we invite you to check this page again in early July and/or join our public Multimedia mailing list, where we will announce discussion dates shortly.

Questions
Here are preliminary answers to some of the first questions that are being asked about this project. More information will be provided as the project develops.

 * Who will develop this project?
This project will be a collaboration between the WMF’s Multimedia team, the Wikidata team, and the Wikimedia Commons and Wikipedia communities.

 * How will the community be involved?
The Wikimedia Commons and Wikipedia communities will need to make decisions about the new data structure and will also need to help migrate data from the old structure to the new structure. Some of this can be automated; some of it will be manual work.

 * How long will this take?
We are starting the planning process in June, will prototype and hold community discussions through September, and aim to start development in October 2014. This would be followed by data migration with the community, and we hope to provide a first minimum viable product in early 2015.


 * Where will the structured data be stored?
The current proposal is to store structured data on a “data” page attached to the file’s page, similar to Wikidata’s item pages (e.g. 'File:Berlin.jpg/info'). Commons would thereby be a repository and at the same time its own client, as well as a client of wikidata.org. To learn more about this proposal, visit this page.

Before committing to that path, should we consider a stand-alone database, separate from the wiki pages?

 * How will people edit the structured data?
We expect to use an updated version of Upload Wizard for entering the structured data when a file is uploaded, as well as the Wikidata editing tool that makes it easy to enter and edit metadata afterwards. These tools can help validate that the correct values are entered for each field or property. Other editing methods may be provided later on.

 * What data structure will be used?
The data structure will be developed by the Multimedia and Wikidata teams, in collaboration with community members. We expect that some of the current data used on Commons will be stored in structured Wikidata format, while other data will be stored by the MediaWiki software. Since many users and contributors seem confused by the many different kinds of data and templates now in use on our sites, we will encourage our community to help streamline and consolidate these options.

 * What interface should the Upload Wizard use to store metadata?
The Upload Wizard code should have a simple, clean interface for storing metadata, which could at first be implemented to just add the appropriate templates to the description page. We could then later swap that implementation for one backed by Wikibase. See Mingle Ticket #309: Prepare for Wikidata Integration on Commons https://wikimedia.mingle.thoughtworks.com/projects/multimedia/cards/309
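The swappable interface described above could look roughly like the following minimal sketch. It is illustrative only: `MetadataStore`, `TemplateStore`, `InMemoryStore` and `upload` are hypothetical names, the field names are assumptions, and the in-memory class merely stands in for a later Wikibase-backed implementation.

```python
from abc import ABC, abstractmethod


class MetadataStore(ABC):
    """Hypothetical storage interface that the upload code would call."""

    @abstractmethod
    def save(self, file_title: str, metadata: dict) -> str:
        """Persist metadata for a file and return what was written."""


class TemplateStore(MetadataStore):
    """First implementation: render an {{Information}} template as wikitext."""

    def save(self, file_title: str, metadata: dict) -> str:
        lines = ["{{Information"]
        for field, value in metadata.items():
            lines.append(f"|{field}={value}")
        lines.append("}}")
        return "\n".join(lines)


class InMemoryStore(MetadataStore):
    """Stand-in for a later structured backend (not real Wikibase code)."""

    def __init__(self):
        self.records = {}

    def save(self, file_title: str, metadata: dict) -> str:
        self.records[file_title] = dict(metadata)
        return file_title


def upload(store: MetadataStore, file_title: str, metadata: dict) -> str:
    # The caller depends only on the interface, so backends can be swapped.
    return store.save(file_title, metadata)
```

Because `upload` only knows about `MetadataStore`, replacing `TemplateStore` with a structured backend would not require changes to the calling code.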

 * What is the logical data structure of the future MediaInfo entity?
This should be compatible with, but need not be tightly bound to, the high-level interface for storing (and retrieving) the metadata. The high-level interface would probably need some domain knowledge (e.g. which property would be used for the license, which property for the creator(s), etc.).
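To illustrate the kind of domain knowledge mentioned above, here is a small sketch in which a high-level layer translates concepts such as "creator" and "license" into property-based claims. The mapping, the placeholder identifiers and the `to_claims` function are all assumptions made for illustration; they are not real Wikidata property IDs or an existing API.

```python
# Hypothetical mapping from domain concepts to property identifiers.
# The identifiers are placeholders, not real Wikidata property IDs.
CONCEPT_TO_PROPERTY = {
    "creator": "P-CREATOR",
    "license": "P-LICENSE",
}


def to_claims(metadata: dict) -> list:
    """Translate concept-level metadata into (property, value) pairs."""
    claims = []
    for concept, value in metadata.items():
        if concept not in CONCEPT_TO_PROPERTY:
            raise KeyError(f"No property configured for concept: {concept}")
        # A file may have several creators, so accept one value or a list.
        values = value if isinstance(value, list) else [value]
        for item in values:
            claims.append((CONCEPT_TO_PROPERTY[concept], item))
    return claims
```

Keeping the mapping in one place means the entity structure underneath can change without affecting code that only speaks in terms of creators and licenses.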

 * What will be developed first?
As a first step, we aim to create mock interfaces which behave like Wikidata interfaces but use wikitext to read/write data. The storage interface would involve concepts like "creator/contributor", "license", etc. For UploadWizard (where we only write information, in the form of information/license templates on a newly created page) we are considering hiding the current code that generates template text behind an API similar to wbeditentity/wbcreateclaim.
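A mock of this kind could be sketched as follows: a claim-style call that reads and writes plain wikitext underneath. The function names and argument shapes are assumptions that only loosely echo a claim-creating API; this is not the Wikibase implementation, and the parser handles just simple templates with one parameter per line.

```python
import re


def read_claims(wikitext: str) -> dict:
    """Read |name=value parameters from a simple, one-per-line template."""
    claims = {}
    for line in wikitext.splitlines():
        match = re.match(r"^\|\s*(\w+)\s*=\s*(.*)$", line)
        if match:
            claims[match.group(1)] = match.group(2).strip()
    return claims


def create_claim(wikitext: str, prop: str, value: str) -> str:
    """Mock claim creation: set one parameter and return the new wikitext.

    Hypothetical helper for illustration; nested templates and multi-line
    values are deliberately not supported.
    """
    claims = read_claims(wikitext)
    claims[prop] = value
    body = "\n".join(f"|{name}={val}" for name, val in claims.items())
    return "{{Information\n" + body + "\n}}"
```

For example, calling `create_claim` on a page that already has an `author` parameter keeps that parameter and adds or overwrites the one being set.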

Planning Tasks
In a first planning phase, we would like to focus on these tasks in coming weeks:
 * Review current discussions, proposals and related documents
 * Review current Wikidata code and technology
 * Host community discussions (on-wiki, IRC, roundtables)
 * Design specification based on community and team feedback
 * Estimate development time for key tasks
 * Determine top priorities and overall plan

These planning tasks will require several weeks in June and July, after which we can make well-informed decisions about our next steps.

Development Tasks
Here are some of the tasks we expect to take on this year:


 * prototyping
 * entity id/entity per page refactoring
 * entity class refactoring
 * arbitrary access/usage tracking
 * implement media info entity
 * Commons gets phase 2
 * tag UI (Wikidata item = tag) - includes integration in upload, editing and search
 * filter by metadata
 * RDF
 * linked data interface

We will track our development work for this project on this Structured Data board on our Mingle planning site. Other development boards include our Current Sprint wall and Current Cycle wall.

Links
As this project gets ramped up in coming weeks, we will be using shared notepads to quickly gather our initial findings, then will turn them into more structured wiki pages like this one. Here are a few working documents we're using to get organized:


 * First Planning Meeting - June 3, 2014
 * Examples of Structured Data for Commons
 * Wikidata for Media Info - Proposal by Daniel Kinzler