License integration MediaWiki/Current structure on Commons

From mediawiki.org

Unfortunately, neither author nor license information is currently stored in a structured way that would allow fetching it from the MediaWiki API.

In the case of Wikimedia Commons (commons.wikimedia.org, the media repository for Wikipedia) there is a somewhat structured way to extract it from the generated HTML.

Examples

File:Example.svg (public-domain file)

..
<td id="fileinfotpl_desc" class="fileinfo-paramfield">Description<span class="summary fn" style="display:none">Example.svg</span></td>
<td class="description">
<ul>
<li>
<div class="description mw-content-ltr en" dir="ltr" lang="en" style=""><span class="language en" title=""><b>English:</b></span> Image sample for example, in <a href="//en.wikipedia.org/wiki/SVG" class="extiw" title="en:SVG">SVG</a></div>
</li>
<li>
<div class="description mw-content-ltr fr" dir="ltr" lang="fr" style="font-family: sans-serif;"><span class="language fr" title="Français"><b>Français :</b></span> Échantillon d'image pour exemple, en <a href="//fr.wikipedia.org/wiki/SVG" class="extiw" title="fr:SVG">SVG</a></div>
</li>
</ul>
</td>
..
<td id="fileinfotpl_date" class="fileinfo-paramfield">Date</td>
<td><time class="dtstart" datetime="2006-07-10">10 July 2006</time></td>
..
<td id="fileinfotpl_src" class="fileinfo-paramfield">Source</td>
<td><span class="int-own-work">Own work</span></td>
..
<td id="fileinfotpl_aut" class="fileinfo-paramfield">Author</td>
<td><a href="/wiki/User:Nethac_DIU" title="User:Nethac DIU" class="mw-redirect">Nethac DIU</a></td>
..
<table class="licensetpl" style="display:none">
 <span class="licensetpl_short">Public domain</span>
 <span class="licensetpl_long">Public domain</span>
 <span class="licensetpl_link_req">false</span>
 <span class="licensetpl_attr_req">false</span>
</table>

File:Bustaxi.jpg (Creative Commons file)

..
<td id="fileinfotpl_desc" class="fileinfo-paramfield">Description<span class="summary fn" style="display:none">Bustaxi.jpg</span></td>
<td class="description">
<div class="description en" lang="en" style="direction:ltr;"><span class="language en" title=""><b>English:</b></span> A taxi-bus is used on bus lines with little traffic; here shown next to a 'normal' bus. Assen, the Netherlands.</div>
<div class="description de" lang="de" style="direction: ltr; font-family: sans-serif;"><span class="language de" title="Deutsch"><b>Deutsch:</b></span> Ein Taxi-Bus wird auf Bus-Linien mit wenig Verkehr verwendet; hier neben einem „normalen“ Bus in Assen, Niederlande</div>
</td>
..
<td id="fileinfotpl_date" class="fileinfo-paramfield">Date</td>
<td><time class="dtstart" datetime="2004-07">July 2004</time></td>
..
<td id="fileinfotpl_src" class="fileinfo-paramfield">Source</td>
<td><span class="int-own-work">Own work</span></td>
..
<td id="fileinfotpl_aut" class="fileinfo-paramfield">Author</td>
<td><b>Photograph:</b> <a href="/wiki/User:Andre_Engels" title="User:Andre Engels">Andre Engels</a><br>
Own picture from <a href="/wiki/User:Andre_Engels" title="User:Andre Engels">Andre Engels</a>.</td>
..
<table class="licensetpl_wrapper">
 <span class="licensetpl_aut">
  <a href="/wiki/User:Andre_Engels" title="User:Andre Engels">Andre Engels</a>
 </span>
 ..
 <span style="font-size: larger;" class="licensetpl_attr">
  <a href="/wiki/User:Andre_Engels" title="User:Andre Engels">Andre Engels</a>
 </span>
 ..
 <span class="licensetpl_link" style="display:none;">
  http://creativecommons.org/licenses/by/1.0
 </span>
 <span class="licensetpl_short" style="display:none;">CC-BY-1.0</span>
 <span class="licensetpl_long" style="display:none;">Creative Commons Attribution 1.0</span>
 <span class="licensetpl_link_req" style="display:none;">true</span>
 <span class="licensetpl_attr_req" style="display:none;">true</span>
..

Parsing

Depending on your environment, there are a couple of different approaches. Again, note that none of this is by design. What follows formed organically because of the lack of native support, using (or abusing) only what the community created with the tools available.

The File page

If your code runs as a user script or gadget in JavaScript on the actual file page on Wikimedia Commons, I recommend using jQuery to select the elements and its .text() method to extract the information that matters (ignoring any extra styling, presentational elements, etc.).

The Stockphoto gadget on Wikimedia Commons is specifically designed to extract this information to allow users to easily get a boilerplate of code to re-use an image honouring the license and attribution requirements.

A small sample:

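	// The value cell is the <td> right after the labelled cell.
	var $authorElement = $( '#fileinfotpl_aut + td' );
	var $sourceElement = $( '#fileinfotpl_src + td' );
	// .text() drops the markup; trim the surrounding whitespace.
	var authorTxt = $authorElement.text().trim();
	var sourceTxt = $sourceElement.text().trim();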
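The license fields from the examples above can be read in the same way. The following is only a minimal sketch, assuming the page carries the licensetpl_* spans shown in the examples. The function name readLicenseInfo and the property names of the returned object are made up for illustration and are not taken from any existing gadget. jQuery is passed in as a parameter so the helper does not depend on a global.

```javascript
// Hypothetical helper: collects the licensetpl_* fields from a file page.
// `$` is the jQuery function (available globally on Commons file pages).
function readLicenseInfo( $ ) {
	// Trimmed text of the first element matching the selector,
	// or an empty string when nothing matches.
	function field( selector ) {
		return $( selector ).first().text().trim();
	}
	return {
		shortName: field( '.licensetpl_short' ),
		longName: field( '.licensetpl_long' ),
		link: field( '.licensetpl_link' ),
		attribution: field( '.licensetpl_attr' ),
		// In the examples these spans hold the literal strings "true" / "false".
		linkRequired: field( '.licensetpl_link_req' ) === 'true',
		attributionRequired: field( '.licensetpl_attr_req' ) === 'true'
	};
}
```

On a file page this would be called as readLicenseInfo( jQuery ). Fields that a given license template does not emit (for instance licensetpl_link on the public-domain example) come back as empty strings.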

External

Standalone

If you're running a web service or some other server-side script that needs this information, you'll have to fetch the rendered HTML from the API (or fetch the wikitext and pass it to api.php?action=parse to get the HTML), then interpret that HTML and look for the elements manually (with very creative use of regular expressions, substring searching, or a fabulous DOM library that can handle it). Then strip the tags to extract the plain-text value.
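As a starting point for the first step, here is a rough sketch of building the request URL. The function name buildParseUrl and the default endpoint (the usual Commons api.php location) are assumptions for illustration; the parameters action=parse, page, prop=text and format=json are standard API parameters.

```javascript
// Hypothetical helper: builds an api.php?action=parse URL that asks for
// the rendered HTML of a page. The default endpoint is an assumption.
function buildParseUrl( title, endpoint ) {
	var params = new URLSearchParams( {
		action: 'parse',
		page: title,
		prop: 'text',
		format: 'json'
	} );
	return ( endpoint || 'https://commons.wikimedia.org/w/api.php' ) +
		'?' + params.toString();
}
```

For example, buildParseUrl( 'File:Example.svg' ) yields a URL that can be requested with any HTTP client. In the default JSON format the HTML should be under parse.text['*'] in the response, but check the actual response before relying on that.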

Use the above examples to know which elements to look for.
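A minimal sketch of the regular-expression route, assuming the HTML looks like the examples above. The helper names stripTags, cellAfterId and spanByClass are hypothetical. The patterns are deliberately naive: they assume plain identifiers for ids and classes, double-quoted attributes, and no nested table cells or nested spans inside the value. A real DOM library is more robust.

```javascript
// Strip tags, decode a few common entities and collapse whitespace.
// This is not a complete HTML entity decoder.
function stripTags( html ) {
	return html
		.replace( /<[^>]*>/g, '' )
		.replace( /&nbsp;/g, ' ' )
		.replace( /&lt;/g, '<' )
		.replace( /&gt;/g, '>' )
		.replace( /&quot;/g, '"' )
		.replace( /&amp;/g, '&' ) // last, so "&amp;lt;" is not decoded twice
		.replace( /\s+/g, ' ' )
		.trim();
}

// Plain text of the <td> that follows the cell with the given id
// (e.g. "fileinfotpl_aut"), or null if the pattern does not match.
function cellAfterId( html, id ) {
	var re = new RegExp(
		'<td[^>]*\\bid="' + id + '"[^>]*>[\\s\\S]*?</td>\\s*' +
		'<td[^>]*>([\\s\\S]*?)</td>'
	);
	var match = re.exec( html );
	return match ? stripTags( match[ 1 ] ) : null;
}

// Plain text of the first <span> carrying the given class
// (e.g. "licensetpl_short"), or null if there is none.
function spanByClass( html, cls ) {
	var re = new RegExp(
		'<span[^>]*\\bclass="[^"]*\\b' + cls + '\\b[^"]*"[^>]*>' +
		'([\\s\\S]*?)</span>'
	);
	var match = re.exec( html );
	return match ? stripTags( match[ 1 ] ) : null;
}
```

Given the HTML of File:Bustaxi.jpg, cellAfterId( html, 'fileinfotpl_src' ) and spanByClass( html, 'licensetpl_short' ) would pick out the source cell and the short license name. Because the underscore counts as a word character, the \b boundary keeps licensetpl_link from matching licensetpl_link_req.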

Can use Toolserver

If you're OK with using a third-party service to do it for you, you can use Magnus' Commons API (source code). Magnus has walked the brave path described above in the "Standalone" section and made it available for others to use.

Alternatively, you could use that code as a base and still do it standalone.