Multimedia/Structured Data

From MediaWiki.org

Structured Data is a new project that aims to use structured, machine-readable data to store and retrieve information about images and media files on Wikimedia Commons and other sites. It is likely to be built on Wikidata technology, using Wikibase and related extensions.

The Multimedia team and the Wikidata team are planning to develop this project together in 2014 and 2015, in collaboration with the volunteer communities on Wikimedia Commons and other projects impacted by this initiative.

Goals

Purpose

The purpose of this project is to:

  • use structured data for all media files on our sites
  • make it easier for users to read / write file information
  • enable developers to build better tools

We aim to support the use of machine-readable data to store and retrieve information for images and media files on Wikimedia Commons and other sites, so they are easier to view, search, edit, curate and use. To that end, we propose to investigate this opportunity together through community discussions and small experiments. If these initial tests are successful, we would develop new tools and practices for structured data, then work with our communities to gradually migrate unstructured data into a machine-readable format over time.

Rationale

Today, multimedia file information on Wikimedia sites is stored using an aging software system that is hard to use and support. The current Commons database is poorly designed, very hard to search, confusing to users and impractical for new feature development. Instead of using machine-readable data as most modern sites do nowadays, Wikimedia Commons relies on a cumbersome patchwork of plain text data embedded in a range of overlapping templates, with a set of English-only categories that are often incompatible with other sites or tools.

Wikidata now offers a practical way to maintain structured data in MediaWiki, and is widely considered a useful tool to support the growth of the free knowledge movement. As a result, many community members have proposed that we use this mechanism to store and retrieve media metadata on Wikimedia Commons. This would provide a wide range of benefits to all users of Wikimedia Commons, as well as to the many sites which rely on its multimedia content. Associating files with Wikidata items or geo-location would support more effective ways to search and browse files on Commons. This would also make it a lot easier to show the appropriate attribution and license information when re-using a file. More benefits are listed below.

The Wikimedia Foundation's multimedia team hosted a number of roundtable discussions with community members last year to ask what it should focus on in coming years. In every roundtable, the top request from participants was to implement structured data on Commons -- even when this topic was not on the agenda to begin with. Our community advisors pointed out that search does not work well on Commons, making it hard to find what you are looking for. Others also pointed out that categories are now mostly in English, making it difficult for non-English speakers to contribute on Commons. Many have also suggested that categories be complemented with more granular topics that could be linked to Wikidata's knowledge base in your language -- as well as intersected to provide better search results.

Slides: Structured Data on Commons

Activities

In coming months, we propose to:

  • Plan & discuss this proposal with our communities
  • Design the data structure and user interface
  • Develop the code and tools needed for this project
  • Migrate unstructured data to the new format
  • Test, measure and adjust as needed

Users

User Groups

Target users for Structured Data include these user groups:

  • readers
  • contributors
  • curators
  • editors
  • developers

This project will impact all key stakeholders that use media files on Commons and other Wikimedia sites. We aim to support these user groups evenly, with an initial focus on Commons users -- then contributors on Wikipedia and other sites. We will also consult with other external user groups such as content providers and site operators who are also an important part of our multimedia ecosystem.

Many Uses for Structured Data

Benefits

Learn more with Structured Data

Here are some of the ways structured data can benefit our users:

  • offer a better user experience
  • make it easier to find relevant content
  • drive editing and license-compliant re-use
  • provide new ways to contribute
  • improve our infrastructure

User Stories

Here are some of the user stories which this project is expected to support.

As a user, I want to:

  • Search for files by multiple criteria
  • Learn more with related information
  • View file information in my language
  • Edit file information more easily
  • Edit file info from other pages
  • Add more topics and data types
  • Upload / transfer files from any wiki
  • Get smart suggestions

See more user stories. These stories and the workflows they support are a work in progress, and will be further defined after our first community discussions.

Product

Sample File Metadata

Example

Here's an example of what metadata for a typical file could look like in a structured form:

  • Title (Label): White Russian Hamster
  • Description: The Djungarian hamster, also known as the Siberian hamster, Siberian dwarf hamster or Russian winter white dwarf hamster, is one of three species of hamster in the genus Phodopus.
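
As a rough illustration only, the sample above could be held as a small machine-readable record. The Python sketch below assumes per-language `labels` and `descriptions` keys and a `label_for` helper; all of these names are made up for this example and are not the project's actual data format.

```python
import json

# Hypothetical record for the sample file; the key names are illustrative only.
media_info = {
    "labels": {"en": "White Russian Hamster"},
    "descriptions": {
        "en": (
            "The Djungarian hamster, also known as the Siberian hamster, "
            "Siberian dwarf hamster or Russian winter white dwarf hamster, "
            "is one of three species of hamster in the genus Phodopus."
        )
    },
}


def label_for(record, language, fallback="en"):
    """Return the label in the requested language, else in the fallback language."""
    labels = record["labels"]
    return labels.get(language, labels.get(fallback))


if __name__ == "__main__":
    # The record serializes cleanly, so tools could exchange it as JSON.
    print(json.dumps(media_info, indent=2))
    # No German label is stored here, so the English one is returned.
    print(label_for(media_info, "de"))
```

Keeping one entry per language is one possible way to support viewing file information in the reader's own language.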

See more examples

Data structure

One of the challenges of this project is to define a data structure that does not limit the scope of items to be stored and is compatible with external tools, while avoiding redundancy and confusion.

Throughout the planning and experimentation phases, we will discuss with our communities which data are most needed to support key workflows, and collectively develop a comprehensive list of basic data, so that all important items are accounted for.

We expect the data structure to include these main data clusters:

  • file
  • work
  • contributor
  • license
  • category
  • topic

Some basic concepts are being discussed as possible building blocks for our data structure. For example, we note that a media file can include one or more works, and that each work can have one or more contributors, as well as one or more licenses, categories or topics. This organizing principle could support a wide range of use cases, which we look forward to discussing and prototyping with our communities in coming months.
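
The organizing principle described here (a file includes one or more works, and each work has contributors, licenses, categories and topics) could be modelled roughly as follows. This is a minimal Python sketch for discussion, not a proposed schema; the class names, the `all_licenses` helper and the example values are assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Work:
    """One creative work contained in a media file (hypothetical model)."""
    title: str
    contributors: List[str] = field(default_factory=list)
    licenses: List[str] = field(default_factory=list)
    categories: List[str] = field(default_factory=list)
    topics: List[str] = field(default_factory=list)


@dataclass
class MediaFile:
    """A media file that includes one or more works."""
    name: str
    works: List[Work] = field(default_factory=list)

    def all_licenses(self) -> List[str]:
        """Distinct licenses across all works, in first-seen order."""
        seen: List[str] = []
        for work in self.works:
            for lic in work.licenses:
                if lic not in seen:
                    seen.append(lic)
        return seen


# Hypothetical example: a photograph that depicts a sculpture, so the
# file contains two works with different contributors and licenses.
photo = MediaFile(
    name="Example.jpg",
    works=[
        Work("Sculpture", contributors=["Sculptor A"], licenses=["Public domain"]),
        Work(
            "Photograph",
            contributors=["Photographer B"],
            licenses=["CC BY-SA 3.0"],
            topics=["sculpture"],
        ),
    ],
)
```

Separating the file from its works is what would let a re-user see every license that applies, for example through `photo.all_licenses()`.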

See multimedia data list


Structured Data Touchpoints

Storage

Where might data be stored?

Wikibase on Commons
  • title
  • description
  • work link
  • contributor link
  • license link
  • category link
  • topic link
  • file link
Wikidata.org

(links on Commons to Wikidata items)

  • work
  • contributor
  • license
  • category
  • topic
  • file
Commons MediaWiki

(but not necessarily in Wikibase)

  • file name
  • format
  • resolution
  • size
  • date
  • geo-tags
  • version
  • EXIF metadata
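
The outline above can also be written as a simple lookup table, which may be handy when discussing where a given field would live. The field lists are copied from the outline; the dictionary layout and the `storage_for` helper are assumptions made for this Python sketch.

```python
# Field lists copied from the storage outline; the layout itself is an assumption.
STORAGE = {
    "Wikibase on Commons": [
        "title", "description", "work link", "contributor link",
        "license link", "category link", "topic link", "file link",
    ],
    "Wikidata.org": [
        "work", "contributor", "license", "category", "topic", "file",
    ],
    "Commons MediaWiki": [
        "file name", "format", "resolution", "size", "date",
        "geo-tags", "version", "EXIF metadata",
    ],
}


def storage_for(field_name):
    """Return the name of the store that lists this field, or None if unlisted."""
    for store, fields in STORAGE.items():
        if field_name in fields:
            return store
    return None
```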

Components

Platform

Here are some of the first modules we might work on:

  • High-level API
  • Media Info Page
  • File Page

Features

First features to use structured data could include:

  • Search
  • Upload Wizard
  • Media Viewer

Roadmap

These are some of the phases we are considering for fiscal year 2014-15, for discussion purposes.

  • Phase 1
    • planning with Wikidata
    • community discussions
    • first data structure
    • first specifications
  • Phase 2
    • developer bootcamps in Berlin/Amsterdam
    • first prototypes
    • develop high-level API
    • small experiments (e.g.: location)
    • first metrics (usage, migration, performance)
    • design media info page, file page
  • Phase 3
    • develop and improve platform code modules
    • build media info page, upgrade file page
    • upgrade key features to support structured data (e.g.: Search, Upload Wizard)
    • design editing tools
    • start data migration on Commons
  • Phase 4
    • follow-up work and bug fixes
    • design file page/editing tools
    • continue data migration on Commons
    • build edit tools
    • develop cross-wiki support

All goals above are tentative, for discussion purposes.

Discussions

New discussions

All community members are invited to join our upcoming discussions, which will take place here on Commons, on IRC and via Google Hangouts.

  • Structured Data Talk Page

What do you think of this structured data project? Please share your feedback on this talk page.

  • Structured Data Q&A (IRC) - Wed. Sep. 3 @ 19:00 UTC

You're invited to join our first office hours chat on this IRC channel: #wikimedia-commons -- please sign up below if you plan to attend.

  • <your user name, with link to your profile>

We'll be hosting more discussions in coming weeks and hope you can join us for one of these events, so we can review this project and plan our next steps together.

To be notified of these events, you're welcome to subscribe to one of these public mailing lists, if you haven't already: Multimedia list or Wikidata list.

Past discussions

Here are some recent roundtable discussions about this project, which we hosted at Wikimania 2014 in London in early August:

Workgroups

We will want to form small workgroups to address open issues over time. For example:

Workflows
  • Which workflows should we support first?
Data structure
  • How do we define a basic data structure?
Research
  • How can we measure and validate each feature?
Platform
  • What modules need to be built first?
  • How do we develop the high-level API?
Features
  • Which features do we develop first? Who builds them?
  • How do we gradually roll out these features?
Migration
  • How do we coordinate the data migration?

Are you interested in participating regularly in one of the workgroups above? If so, sign up here -- and help start a sub-page for your workgroup.

FAQ

Editing File Info - Before Structured Data
Editing File Info - After Structured Data

Here are some questions that we often hear about this project. More information will be provided as the project develops.

Who will develop this project?

This project will be a collaboration between the WMF’s Multimedia team, the Wikidata team, and communities on Wikimedia Commons, Wikidata, Wikipedia and sister projects.

How will the community be involved?

The Wikimedia Commons, Wikidata and Wikipedia communities will need to make decisions about the new data structure and will also need to help migrate data from the old structure to the new structure. Some of this can be automated; some of it will be manual work.

How long will this take?

This is likely to be a long project, taking several years to complete. We started planning and discussions in summer 2014, will prototype and test small experiments in the fall, and aim to start major development and data migration in 2015. A product plan will be based on first test results and community discussions.

What will be developed first?

As a first step, we aim to create mock interfaces which behave like Wikidata interfaces but use wikitext to read/write data. The storage interface would involve basic concepts like "file", "work", "contributor", "license", etc. For UploadWizard (where we only write information, in the form of information/license templates on a newly created page) we are considering hiding the current code that generates template text behind an API similar to wbeditentity/wbcreateclaim. After an initial period of experimentation and testing, we would build a high-level API to support a range of features.
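
To make the mock-interface idea more concrete, here is a minimal Python sketch of a write-only store that accepts claim-style calls but emits template wikitext. The class name, the `create_claim` method and the mapping from concepts to {{Information}} parameters are assumptions made for illustration; this is not the real wbcreateclaim API or the UploadWizard code.

```python
class WikitextClaimStore:
    """Hypothetical mock: takes claim-style writes, produces template wikitext."""

    # Assumed mapping from storage concepts to {{Information}} parameters.
    PARAMS = {
        "description": "description",
        "contributor": "author",
        "license": "permission",
    }

    def __init__(self):
        self._claims = {}

    def create_claim(self, concept, value):
        """Record one value for a known concept; reject unknown concepts."""
        if concept not in self.PARAMS:
            raise KeyError(f"unsupported concept: {concept}")
        self._claims[concept] = value

    def to_wikitext(self):
        """Render the recorded claims as a single {{Information}} call."""
        lines = ["{{Information"]
        for concept, param in self.PARAMS.items():
            if concept in self._claims:
                lines.append(f"|{param}={self._claims[concept]}")
        lines.append("}}")
        return "\n".join(lines)


store = WikitextClaimStore()
store.create_claim("description", "A hamster")
store.create_claim("contributor", "Example User")
wikitext = store.to_wikitext()
```

Callers only see `create_claim`, so the wikitext rendering could later be replaced without touching them.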

Where will the structured data be stored?

The current proposal is to store structured data on a “media info” page attached to the file’s page that is similar to Wikidata’s item pages (e.g.: 'Info:Berlin.jpg'). To learn more about this proposal, visit this page.

How will people edit the structured data?

File pages would include an editing tool to make it easy to enter and edit metadata, just like on Wikidata. There might be a narrower, special-purpose editing interface for media info, but that's for later. Initially, editing would be very similar to Wikidata. We also expect to use an updated version of Upload Wizard for entering the structured data when a file is uploaded. Other editing methods may be provided later on.

Will we still use categories?

Categories will be fully supported by the structured data system -- and we hope to provide them in a variety of languages, not just English. Many have suggested that we also complement categories with a more granular set of 'topics', which can be linked to corresponding items on Wikidata, as well as intersected to improve search.
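
Intersecting topics to narrow a search can be pictured as a set intersection. The Python sketch below uses a made-up in-memory index and a hypothetical `files_matching_all` helper; real search on Commons would not work this way internally, so treat it only as an illustration of the idea.

```python
from typing import Dict, Iterable, Set

# Hypothetical index from topic to the files tagged with that topic.
TOPIC_INDEX: Dict[str, Set[str]] = {
    "hamster": {"Hamster in snow.jpg", "Hamster on wheel.jpg"},
    "snow": {"Hamster in snow.jpg", "Snowy street.jpg"},
}


def files_matching_all(index: Dict[str, Set[str]], topics: Iterable[str]) -> Set[str]:
    """Return the files tagged with every requested topic (set intersection)."""
    topic_sets = [index.get(topic, set()) for topic in topics]
    if not topic_sets:
        return set()
    return set.intersection(*topic_sets)


if __name__ == "__main__":
    print(files_matching_all(TOPIC_INDEX, ["hamster", "snow"]))
```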

Will we still use templates?

Templates will still be used to format the data from the media info page. For example, the {{Information}} template would be changed to pull information like the contributor or description from the media info page. Template parameters would only be used in rare cases to override the values from the data page. Ideally, file description pages would in the end contain only a single {{Information}} call with no parameters.
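
The override rule described here (values come from the media info page unless a template parameter is explicitly set) might be sketched like this in Python. The function name and the plain dictionaries are assumptions; an actual template would implement this in wikitext or Lua rather than Python.

```python
def resolve_information(media_info, template_params=None):
    """Start from media info values; let non-empty template parameters override them."""
    resolved = dict(media_info)
    for name, value in (template_params or {}).items():
        # Empty or whitespace-only parameters do not count as overrides.
        if value is not None and value.strip():
            resolved[name] = value.strip()
    return resolved


# Hypothetical values for illustration.
from_data_page = {"author": "Example User", "description": "A hamster"}
no_params = resolve_information(from_data_page)
overridden = resolve_information(from_data_page, {"description": "A white hamster"})
```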

What data structure will be used?

The data structure will be developed by the multimedia and Wikidata teams, in collaboration with community members. We expect that some of the current data used on Commons will be stored in structured Wikidata format, while other data will be stored by the MediaWiki software. Since many users and contributors seem confused by the many different data and templates now in use on our sites, we invite community members to help streamline and consolidate these options, to eliminate redundancies.

What interface should the Upload Wizard use to store metadata?

The Upload Wizard code should have a simple, clean interface for storing metadata, which could at first be implemented by just adding the appropriate templates to the description page. We could later swap that implementation for one backed by Wikibase. See Mingle Ticket #309, Prepare for Wikidata Integration on Commons: https://wikimedia.mingle.thoughtworks.com/projects/multimedia/cards/309
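
One way to keep the implementation swappable is to have the upload code depend only on a small interface. The Python sketch below is an assumption-laden illustration: `MetadataStore`, `TemplateStore`, `StructuredStore` and `finish_upload` are hypothetical names, and the second backend is an in-memory stand-in rather than real Wikibase code.

```python
from typing import Dict, Protocol


class MetadataStore(Protocol):
    """Hypothetical interface the upload code would depend on."""

    def save(self, file_name: str, metadata: Dict[str, str]) -> None: ...


class TemplateStore:
    """First implementation: writes an {{Information}} call to the description page."""

    def __init__(self) -> None:
        self.pages: Dict[str, str] = {}

    def save(self, file_name: str, metadata: Dict[str, str]) -> None:
        params = "".join(f"\n|{key}={value}" for key, value in metadata.items())
        self.pages[f"File:{file_name}"] = "{{Information" + params + "\n}}"


class StructuredStore:
    """Stand-in for a later Wikibase-backed implementation (in memory only)."""

    def __init__(self) -> None:
        self.entities: Dict[str, Dict[str, str]] = {}

    def save(self, file_name: str, metadata: Dict[str, str]) -> None:
        self.entities[file_name] = dict(metadata)


def finish_upload(store: MetadataStore, file_name: str, metadata: Dict[str, str]) -> None:
    """Upload code calls the interface and does not know which backend is used."""
    store.save(file_name, metadata)
```

Swapping backends then means passing a different store object to `finish_upload`, with no change to the upload code itself.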

What is the logical data structure of the future MediaInfo entity?

The logical data structure should be compatible with, but need not be tightly bound to, the high-level interface for storing (and retrieving) the metadata. The high-level interface would probably need some domain knowledge (e.g. which property would be used for the license, which property for the creator(s), etc.).
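
The "domain knowledge" mentioned here could be as small as a table that maps roles to properties, kept inside the high-level interface so callers never handle property IDs. In this Python sketch the IDs `P_LICENSE` and `P_CREATOR` and the item IDs in the usage example are placeholders, not real Wikidata or Commons identifiers, and `build_statements` is a hypothetical helper.

```python
from typing import List, Tuple

# Placeholder property IDs: these are NOT real Wikidata or Commons properties.
PROPERTY_FOR_ROLE = {
    "license": "P_LICENSE",
    "creator": "P_CREATOR",
}


def build_statements(license_id: str, creator_ids: List[str]) -> List[Tuple[str, str]]:
    """Translate domain-level arguments into (property, value) pairs."""
    statements = [(PROPERTY_FOR_ROLE["license"], license_id)]
    statements.extend((PROPERTY_FOR_ROLE["creator"], creator) for creator in creator_ids)
    return statements


# Hypothetical usage with placeholder item IDs.
statements = build_statements("Q_LICENSE_EXAMPLE", ["Q_CREATOR_1", "Q_CREATOR_2"])
```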

Tasks

Planning Tasks

In a first planning phase, we would like to focus on these tasks in coming weeks:

  • Review current discussions, proposals and related documents
  • Review current Wikidata code and technology
  • Host community discussions (onwiki, IRC, roundtables)
  • Design specification based on community and team feedback
  • Estimate development time for key tasks
  • Determine top priorities and overall plan

These planning tasks will require several weeks in June and July, before we can seriously determine our next steps based on good information.

Development Tasks

Here are some of the tasks we expect to take on this year:

  • prototyping
  • entity id/entity per page refactoring
  • entity class refactoring
  • arbitrary access/usage tracking
  • implement media info entity
  • Commons gets phase 2
  • topic UI (Wikidata item = topic) - includes integration in upload, editing and search
  • filter by metadata
  • RDF
  • linked data interface

We will track our development work for this project on this Structured Data board on our Mingle planning site. Other development boards include our Current Sprint wall and Current Cycle wall.

Team

This project is a collaboration between:

  • Multimedia Team (Wikimedia Foundation)
  • Wikidata Team (Wikimedia Deutschland)
  • Wikimedia Community (from Commons and other projects)

Links

As this project gets ramped up in coming weeks, we will be using shared notepads to quickly gather our initial findings, then will turn them into more structured wiki pages like this one. Here are a few working documents we're using to get organized: