citoid

Citoid in VisualEditor

The citoid node.js service generates citation data given a URL, DOI, ISBN, PMID, or PMCID. It has a companion extension, Citoid, which makes the citoid service available in VisualEditor. It is currently deployed on all VisualEditor-enabled WMF wikis,[1] though the extension is only configured on a few of them.[2]

Public API

To request metadata about a URL, DOI, ISBN, PMCID or PMID, you can use the English-language API endpoint documented at https://en.wikipedia.org/api/rest_v1/#!/Citation/getCitation. For a localised response, use the same endpoint on the Wikipedia in your preferred language.
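A request to the public API could be sketched as follows. This is only an illustration: the path layout /data/citation/{format}/{query}, the "mediawiki" format name, and the helper names buildCitationUrl and fetchCitation are assumptions that are not stated on this page, so check them against the API documentation linked above before relying on them.

```javascript
// Sketch only: the path layout "/data/citation/{format}/{query}" and the
// "mediawiki" format name are assumptions; confirm them in the API docs.
function buildCitationUrl(query, lang = "en", format = "mediawiki") {
	const base = "https://" + lang + ".wikipedia.org/api/rest_v1";
	// The query (URL, DOI, ISBN, PMID or PMCID) is percent-encoded so that
	// its own slashes and colons do not break the request path.
	return base + "/data/citation/" + format + "/" + encodeURIComponent(query);
}

// Hypothetical helper: performs the request with the global fetch()
// available in current browsers and in Node.js 18 or later.
async function fetchCitation(query, lang) {
	const response = await fetch(buildCitationUrl(query, lang));
	if (!response.ok) {
		throw new Error("Citation request failed: " + response.status);
	}
	return response.json();
}
```

Passing a different language code as the second argument (for example "fr") targets that language's Wikipedia, which is how a localised response would be requested.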

Issue tracker and project management

Bugs, issues, and suggestions for improvement can be added to the Citoid Phabricator project.

Installation

Best results are obtained if the URLs that are popular on your site are already available in Zotero. If they are not, performance will be better if you create Zotero translators for your popular sites first. WMF also maintains a fork of the Zotero translators, which has slightly better coverage.

Citoid is a nodejs app that also requires a working installation of Zotero's translation server, which uses the Zotero translators library, and xulrunner. Please note that the most recent version of xpcshell doesn't work with translation-server; the latest version known to work is 29.0.

Install nodejs and npm

Install nodejs and npm. On Ubuntu, depending on the OS version, the packaged nodejs may not be the most recent version. Using nvm[1] to manage nodejs installations is recommended.

sudo apt-get install nodejs npm
nodejs --version # should now print v0.10.x; on Ubuntu Server 12.04 LTS you end up with v0.6.x instead

Warning: the current version of the citoid server requires nodejs version 6 or later.

For other systems, see:

Install from scratch

Install and configure Zotero's translation server

See: Translation-server installation instructions

Note: The most recent version of xpcshell does not work with translation-server. Install version 29.0.

Install and configure citoid service

Get the code

If you want to do an anonymous checkout:

git clone https://gerrit.wikimedia.org/r/p/mediawiki/services/citoid

Or if you plan to hack citoid, then please follow the Gerrit 'getting started' docs and use an authenticated checkout url instead, such as:

git clone ssh://<user>@gerrit.wikimedia.org:29418/mediawiki/services/citoid
JS dependencies

Install the JS dependencies. Run this command in the citoid directory:

npm install
Modify config.yaml

config.yaml contains the configuration for the citoid service. The defaults should work out of the box for development; however, they may need to be modified in a deployment set-up.

Run the translation server

You'll first need to run translation-server; see the directions on its GitHub page, but generally you should run the following from the translation-server directory:

build/run_translation-server.sh
Run Citoid server

You should be able to start the citoid web service from the citoid directory using:

node server.js

This will start the citoid service on port 1970. To test it, navigate to http://localhost:1970 in your browser. You'll be able to test sample queries from this page.

Install Citoid extension

In order to have citoid functioning on your wiki in conjunction with VisualEditor, you'll need the following: VisualEditor and Parsoid, VisualEditor's Citation Tool, and the Citoid extension.

It is recommended that you have the following extensions in your extension folder: Extension:VisualEditor, Extension:Scribunto, Extension:Cite, Extension:TemplateData, Extension:ParserFunctions, and Extension:Citoid.

VisualEditor and Citation Tool

  1. Set up MediaWiki: Manual:Installation_guide
  2. Set up Parsoid: Parsoid/Setup
  3. Set up VisualEditor: Extension:VisualEditor
  4. Set up Citation Tool: VisualEditor/Citation tool

Citoid extension

If you want to do an anonymous checkout:

git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/Citoid

Or if you plan to hack citoid, then please follow the Gerrit 'getting started' docs and use an authenticated checkout url instead, such as:

git clone ssh://<user>@gerrit.wikimedia.org:29418/mediawiki/extensions/Citoid

Then add the following line to your wiki's LocalSettings.php after you have downloaded the extension:

wfLoadExtension( 'Citoid' );

Set the location to your citoid service instance in your wiki's LocalSettings.php

// If the wiki is being served over https, the https 
// protocol is required in the citoid service url; otherwise the 
// browser will block the request.
$wgCitoidServiceUrl = 'http://localhost:1970/api';

Configure Citoid on a Citoid-enabled wiki

The citoid extension must be configured using special TemplateData maps and a special citoid-specific message. It is currently deployed in all VisualEditor-enabled WMF-Wikis, but it must be configured before it can be used.

Ensure each template to be used in MediaWiki:Citoid-template-type-map.json has a 'citoid' maps value

Warning: Each template listed in Citoid-template-type-map.json MUST have a citoid maps value! Otherwise completely empty templates will be inserted. If there is no maps value for a given template, it is better to list a different template that does have one in the MediaWiki message, even if the types don't match well.

Since Citoid has its own set of fields for each document type (for instance, the journal name is called 'publicationTitle' in citoid, but 'journal' in Template:Cite_journal), each template must have TemplateData defined that creates a map between citoid's fields and the template's fields. Calling the map 'citoid' lets the citoid extension know which map to look for. If the map 'citoid' doesn't suit your purposes for use with, say, a userscript, you may create a citoid-service-related map that is called something else; an unlimited number of maps with unique keys are allowed in the maps object. Note that you can only see TemplateData maps in edit mode; they are not visible in the TemplateData table.

The issn and isbn fields can be given as arrays [] in the citoid map; doing so ensures that only one ISBN or ISSN ends up in the field. If you do not place the parameter inside an array (i.e. isbn: "[ISBN]"), multiple ISBNs or ISSNs will be concatenated in the field (e.g. "issn: 1234-5678, 7777-7777"). All 'person' fields, e.g. author, editor, translator, contributor, etc., require 2D arrays [[]] in the citoid map. See the sample TemplateData below for examples.

Examples of map objects that are compatible with the citoid extension can be found on English Wikipedia.
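As a rough illustration of the shape described above, a maps object might look like the following, written here as a JavaScript object literal rather than raw JSON. The citoid field names used as keys (including the upper-case ISSN and ISBN spellings) and the template parameter names used as values are assumptions made for the sake of the example, and isPersonMapValid is a hypothetical helper; check the real names against your own templates.

```javascript
// Illustrative only: the citoid field names (keys) and template parameter
// names (values) are assumptions; check them against your own templates.
var templateData = {
	maps: {
		citoid: {
			title: "title",
			url: "url",
			// citoid calls the journal name 'publicationTitle';
			// Template:Cite_journal calls it 'journal'.
			publicationTitle: "journal",
			// Array: only one ISSN or ISBN is placed in the parameter.
			ISSN: ["issn"],
			ISBN: ["isbn"],
			// 2D array: one list of template parameters per person.
			author: [["first", "last"], ["first2", "last2"]]
		}
	}
};

// Hypothetical check: a 'person' field must be an array of arrays.
function isPersonMapValid(value) {
	return Array.isArray(value) && value.every(function (entry) {
		return Array.isArray(entry);
	});
}

console.log(isPersonMapValid(templateData.maps.citoid.author));
```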

Configure special MediaWiki namespace Citoid message

Warning: It is best to add this message after each template has a TemplateData map as specified in the section above. Otherwise, empty templates will be inserted in the meantime, because each template needs a TemplateData map to be usable.

You'll need to configure a special MediaWiki: namespace message. This message maps the native citoid types (website, book, journalArticle) to the appropriate template (Cite web, Cite book, Cite journal). You should match a template to every single citoid type; there is no default behaviour if no template is matched to a particular type. It's better to have a bad match (there may be some fields in common between video liner notes and a book, or video liner notes and a video, for example) than none at all.

You may consider using en wiki's Template:Citation as a catch-all for types where there is no good type match as it is designed for this situation.

A sample namespace message is found here: Citoid/MediaWiki:Citoid-template-type-map.json

Every available citoid type is listed as a key in the sample namespace message.
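The idea of the type map can be sketched as below. The handful of types shown is deliberately incomplete, the choice of templates is only an example, and findUnmappedTypes is a hypothetical helper that is not part of the extension; it simply shows how one might check that no citoid type has been left without a template.

```javascript
// Sketch of a type map: keys are citoid types, values are template names.
// The selection of types here is incomplete and purely illustrative.
var typeMap = {
	book: "Cite book",
	journalArticle: "Cite journal",
	// Catch-all template for types without a good match
	newspaperArticle: "Citation",
	encyclopediaArticle: "Citation"
};

// Hypothetical helper: list the citoid types that have no template assigned.
function findUnmappedTypes(allTypes, map) {
	return allTypes.filter(function (type) {
		return !map[type];
	});
}

// Any type reported here would have no default behaviour in the extension.
console.log(findUnmappedTypes(["book", "journalArticle", "videoRecording"], typeMap));
```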

Troubleshooting VisualEditor Extension

Inspector does not appear in the toolbar

An icon for the inspector should appear in the toolbar menu. If the icon does not appear in the toolbar, it most likely means there's a problem with MediaWiki:Citoid-template-type-map.json. If there is no message at that location, or if the JSON is invalid, the inspector will not load. Alternatively, you may need to refresh your JavaScript cache.

Empty references appear

Empty references most commonly appear when the citation template being inserted contains no maps data, or when the maps data is there but is not making it to the MediaWiki API. First, determine the template that the inspector is attempting to insert, for example, Template:Cite web/doc. View the source of the template or documentation page to verify that

"maps": {

    "citoid": { 

is present and contains fields. Then verify that these data are making it to the MediaWiki API by visiting the API page, i.e. http://localhost/api.php?action=templatedata&titles=Template:Cite%20web/doc&format=jsonfm on your local installation, or https://en.wikipedia.org/w/api.php?action=templatedata&titles=Template:Cite%20web&format=jsonfm on en wiki.

If the maps object is present in TemplateData but not in the API response, try editing the template where the TemplateData is transcluded, i.e. Template:Cite_web, and saving it without making any changes (a "null edit"). There is a known bug with transcluded TemplateData where it can take a long time for the API to update; null edits force the change.

If the response from the API looks okay, there may be an issue with the map itself.
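The check on the API response can also be scripted. The sketch below assumes the templatedata response is an object with a pages member keyed by page id, each page carrying an optional maps object; that shape, and the helper name hasCitoidMap, are assumptions, so compare them with the actual output of your wiki (requested with format=json rather than format=jsonfm).

```javascript
// Sketch: returns true if any page in a templatedata API response carries a
// non-empty maps.citoid object. The response shape used here (pages keyed by
// page id) is an assumption; compare it with your wiki's actual output.
function hasCitoidMap(apiResponse) {
	var pages = (apiResponse && apiResponse.pages) || {};
	return Object.keys(pages).some(function (pageId) {
		var maps = pages[pageId].maps;
		return Boolean(maps && maps.citoid && Object.keys(maps.citoid).length > 0);
	});
}

// Made-up sample response for illustration
var sample = {
	pages: {
		"1": { title: "Template:Cite web", maps: { citoid: { title: "title" } } }
	}
};
console.log(hasCitoidMap(sample));
```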

The inspector is still "pending" after a really long time following insertion

This typically means there is a bug. If you open your JavaScript console, you will likely find error messages that will help you debug.

Access date is formatted differently on my wiki

The dates are in ISO format, which is an international standard. On the back-end, we're sticking to ISO, and in the future all dates will be in ISO, not just the access date. This is because it is an unambiguous way to present the date in any language. If the community doesn't like the way this looks to the user, it is possible to edit the citation template to format the ISO dates as something that is standard in your language. For instance, you can add logic to the template such that if the date is detected to be in ISO yyyy-mm-dd format, the date is reformatted *to appear* as dd/mm/yyyy on the page. However, if you do this, the underlying data (i.e. what you see when you edit the wikitext or the form in VisualEditor) will still remain the same.
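The transformation itself is small. On a wiki it would be written in template logic rather than JavaScript, so the following is only a sketch of the idea, with formatIsoDateForDisplay as a hypothetical name: a full ISO date is rearranged for display and anything else is passed through untouched.

```javascript
// Sketch: turn an ISO date string (yyyy-mm-dd) into dd/mm/yyyy for display.
// Anything that does not look like a full ISO date is returned unchanged.
function formatIsoDateForDisplay(value) {
	var match = /^(\d{4})-(\d{2})-(\d{2})$/.exec(value);
	if (!match) {
		return value;
	}
	return match[3] + "/" + match[2] + "/" + match[1];
}

console.log(formatIsoDateForDisplay("2017-08-21")); // prints 21/08/2017
console.log(formatIsoDateForDisplay("August 2017")); // prints August 2017
```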

Troubleshooting the citoid service

My favourite site isn't recognised by citoid and only gets basic information

The citoid service relies on the Zotero community for much of its "magic". We use Zotero translators to convert a web page into detailed information for citations, and a translator needs to be written for each website that will be used as a source. Currently, support (existing translators) is the best for English language sources, but even that is far from complete. We need your help to increase the number of sources for which there is a translator.

Zotero translators

All translators are short pieces of code that share a similar structure, and hence are easy to create. Translators often work both in the browser and in translation-server. They can be written to support various browsers, namely Firefox, Chrome, Safari, and Internet Explorer. For citoid's use, any new translator must work in translation-server.

Zotero translators are scripts written in JavaScript that parse web pages into citations. They can be written for journal articles, manuscripts, blog posts, newspaper articles, etc. Translators are a feature of Zotero, an open-source reference management tool; a translator can be created for any site and then contributed to the Zotero repository of translators. You can see a list of all Zotero translators in that repository.

Zotero translator for Wikimedia blog

If you need to build a translator quickly and wish to skip the tutorial given below, you can have a look at this page that has a web translator example with all the code.

Setting up the development environment

Translator development can be done on the translation-server side or through Scaffold. Development through Scaffold is easier since it is interactive, but if you like to work on the console and keep things simpler, you can work on the server. We will show how to set up an environment and then proceed to writing translators.

For translation-server side development

Install Sublime Text

  1. Go to the download page of Sublime Text.
  2. Choose the link according to your operating system.
  3. Download the binary files or follow the steps as provided.

Install translation-server

  1. Open terminal and go to the location where you would like to set up the environment.
  2. Get the code for development: git clone --recursive https://github.com/zotero/translation-server
  3. Move into the cloned repository: cd translation-server
  4. Build a docker image from the Dockerfile: docker build -t translation-server .
  5. Run the docker container from the image and make the server available at localhost at port 1969: docker run --name server --rm -p 1969:1969 translation-server
  6. Use docker stop server to stop the running container.
  7. Install Firefox SDK that will be needed to run a script used for testing changes: ./fetch_sdk
  8. Compile the client code with the following set of commands.
    cd modules/zotero
    npm i
    npm run build
    

For development through IDE

Install Zotero 4.0

With Zotero 5.0, there is a single standalone application for Zotero that works with all supported browsers. It does not support Scaffold, and hence we need to install Zotero 4.0 until Zotero 5.0 is upgraded to support it. Follow the steps below for the installation:

  1. Go to the download page to get Zotero.
  2. Click on the "Download" button for suitable platform, that is, Windows, Linux, or Mac.
  3. Extract the compressed folder and double-click on "Zotero" to launch the application.

Note: In case you have already installed Zotero 5.0, you can simultaneously use the previous version in your Firefox browser. You will need to reset your database in that case. The method for achieving this is explained by the Zotero community here.

Install Scaffold

Scaffold is an integrated development environment for creating Zotero translators. It makes it easy to write and debug a translator. You can also add test cases for a translator very conveniently using Scaffold. Scaffold is a Firefox add-on. (In case you don't have Firefox on your system, get it from the Mozilla site.)

Follow the steps given below to install Scaffold:

  1. Open this link to get Scaffold in your Firefox browser.
  2. Double-click the XPI file (Extension Archive file) to download the software.
  3. If Firefox prevents the installation, choose the "Allow" option.
  4. After the add-on is downloaded and verified, click on Install.
  5. Restart Firefox to apply the changes. You can now access Scaffold from the Tools menu.

Required Concepts

There are a few concepts that will help you in creating translators. They are discussed briefly below.

HyperText Markup Language

Knowing the basics of HTML is crucial, as it makes it easy to understand the source code of the web page you want to write a translator for. HTML is a language for creating web pages and applications and, along with CSS and JavaScript, forms the foundation of web pages all over the internet. Fortunately, HTML is easy to read and understand.

HTML contains tags that group the content of a page. Tags can form markup elements, which have an opening tag <> and a closing tag </>, or empty elements, which have only the opening tag <>. Tags can also have attributes, which help in identifying elements, styling them, and so on. With HTML, web browsers can process a document and present it to the user.

[Figure: HTML tags]

Document Object Model

The DOM is a language-independent interface that structures a web page as a tree. It recognizes parts of a document as nodes and organizes them into a hierarchical structure. For example, consider the section of an HTML document below and the representation of its DOM:

<html>
    <head>
        <title>
            Hello
        </title>
    </head>
    <body>
        <h1>
            About DOM
        </h1>
        <p>
            This is for the DOM structure of this very document.
        </p>
    </body>
</html>

[Figure: DOM tree of the HTML document above]
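The hierarchy can also be pictured in code. The sketch below is a hand-written, simplified model of the tree for the HTML above (it is not produced by a real parser and leaves out whitespace-only text nodes); the outline function is a hypothetical helper that prints one indented line per node.

```javascript
// Hand-written model of the DOM tree for the HTML above (not produced by a
// real parser): each node has a name and a list of child nodes.
var tree = {
	name: "html",
	children: [
		{ name: "head", children: [
			{ name: "title", children: [{ name: "#text: Hello", children: [] }] }
		] },
		{ name: "body", children: [
			{ name: "h1", children: [{ name: "#text: About DOM", children: [] }] },
			{ name: "p", children: [{ name: "#text: This is for the DOM structure of this very document.", children: [] }] }
		] }
	]
};

// Walk the tree depth-first and return one indented line per node.
function outline(node, depth) {
	depth = depth || 0;
	var lines = ["  ".repeat(depth) + node.name];
	node.children.forEach(function (child) {
		lines = lines.concat(outline(child, depth + 1));
	});
	return lines;
}

console.log(outline(tree).join("\n"));
```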

CSS Selectors

CSS selectors are used to target specific elements in an HTML document for styling. These selectors can be used to identify HTML nodes through their class, id, attributes, DOM position and relation, and so on. Once the nodes are identified, we can scrape information from them through HTML DOM methods like querySelector() and querySelectorAll(). These methods can be invoked on a document or an element by passing CSS selectors as parameters. When called on a document, querySelector() returns the first element that matches the selectors; when called on an element, it returns the first descendant element that matches the selectors. The selectors that will be used frequently are listed below:

  1. .classname - It selects all the HTML elements that have class = "classname".
  2. #idname - It selects all the HTML elements that have id = "idname".
  3. elementType - It selects all the HTML elements that are of type "elementType".
  4. elementType1 elementType2 - It selects all HTML elements that are of type "elementType2" present inside any element of type "elementType1".
  5. elementType1, elementType2 - It selects all HTML elements that are either of type "elementType1" or of type "elementType2".
  6. [attributename = value] - It selects all the HTML elements that have an attribute named "attributename" with the specified value.

To understand how to get CSS selectors of a node in an HTML document, consider the following example:

  1. Open Citoid documentation in a new window of Firefox.
  2. Open Toolbox by pressing Ctrl+Shift+I.
  3. Inspect the title of the document with the node picker. On the selected element Right click->Copy->CSS path.
  4. The CSS path returned will be html.client-js.ve-available body.mediawiki.ltr.sitedir-ltr.mw-hide-empty-elt.ns-0.ns-subject.page-Citoid.rootpage-Citoid.skin-vector.action-view div#content.mw-body h1#firstHeading.firstHeading. We can shorten this path by ignoring most of the selectors. The title is present in a <h1> tag which has class="firstHeading" and id="firstHeading". We can either use the selector #firstHeading or .firstHeading to uniquely identify the title node. In this way, we'll try to shorten CSS paths and use them in the translators.
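The shortening described in the last step can be sketched as a small string helper. This is only a heuristic for illustration, and shortenCssPath is a hypothetical name: it looks at the last segment of the copied path and prefers the id, falling back to the first class. A class on its own is not guaranteed to be unique, so the result should still be checked in the inspector.

```javascript
// Sketch: given a full CSS path copied from the inspector, return a short
// selector for the target node, preferring its id over its classes.
function shortenCssPath(cssPath) {
	// The target node is described by the last space-separated segment.
	var segments = cssPath.trim().split(/\s+/);
	var last = segments[segments.length - 1];
	var id = /#[^.#]+/.exec(last);
	if (id) {
		return id[0]; // ids are meant to be unique within a document
	}
	var cls = /\.[^.#]+/.exec(last);
	return cls ? cls[0] : last;
}

// Shortened version of the path from the example above
console.log(shortenCssPath("html.client-js body.mediawiki div#content.mw-body h1#firstHeading.firstHeading"));
```

In a browser, the shortened selector would then be used as in doc.querySelector('#firstHeading').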

JavaScript

JavaScript is a programming language used in web browsers, servers, game development, databases, and so on. Zotero translators are primarily JS files that put the above-mentioned concepts into action. You need to have a clear idea of the following concepts of JS before starting to write a translator.

  1. Variables
  2. Statements
  3. Loops
  4. Methods
  5. Functions

Common code blocks in translators

Before we jump to writing a translator, here are the functions that are useful when preparing one. If you wish to write your web translator quickly, you can open Scaffold, copy-paste the following blocks into the code tab, and make changes as required. For filling in the metadata and testing, you can refer to the working example.

attr

This function returns the value of the given attribute for a node, or one of a set of nodes, that matches the CSS selectors. Since this function is not available in Zotero 4.0, we need to pass the document explicitly as one of the parameters. Next, we pass the selector or selectors identifying the node or nodes for which we want to get the information. An attribute name for the node, like class, id, or name, is passed as the attr parameter. If index is zero or omitted, querySelector runs and returns the first matching element; otherwise querySelectorAll runs and the element at that index is used.

function attr(docOrElem, selector, attr, index) {
	var elem = index ? docOrElem.querySelectorAll(selector).item(index) : docOrElem.querySelector(selector);
	return elem ? elem.getAttribute(attr) : null;
}

text

This function returns the text content of a specific node and its descendants, or of one of a set of nodes that match the CSS selectors. We pass the document, the selector or selectors, and the index, just as we did for the attr function above. It also serves as a polyfill until it gets included in the Zotero code.

function text(docOrElem, selector, index) {
	var elem = index ? docOrElem.querySelectorAll(selector).item(index) : docOrElem.querySelector(selector);
	return elem ? elem.textContent : null;
}

Note: You can use the minimized code for attr and text:

function attr(docOrElem,selector,attr,index){var elem=index?docOrElem.querySelectorAll(selector).item(index):docOrElem.querySelector(selector);return elem?elem.getAttribute(attr):null}function text(docOrElem,selector,index){var elem=index?docOrElem.querySelectorAll(selector).item(index):docOrElem.querySelector(selector);return elem?elem.textContent:null}
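To see how the index parameter behaves, the two helpers can be tried outside a browser against a stand-in document. In the sketch below, makeFakeDocument is a hypothetical stub invented for this example: it understands a single selector ('a.ref') and returns two made-up links for it. Inside a real translator, the doc passed to detectWeb or doWeb would be used instead.

```javascript
function attr(docOrElem, selector, attr, index) {
	var elem = index ? docOrElem.querySelectorAll(selector).item(index) : docOrElem.querySelector(selector);
	return elem ? elem.getAttribute(attr) : null;
}

function text(docOrElem, selector, index) {
	var elem = index ? docOrElem.querySelectorAll(selector).item(index) : docOrElem.querySelector(selector);
	return elem ? elem.textContent : null;
}

// Hypothetical stand-in for a document, so the helpers can be tried outside a
// browser. It knows one selector ('a.ref') that matches two made-up links.
function makeFakeDocument() {
	var links = [
		{ textContent: "First", href: "/wiki/First" },
		{ textContent: "Second", href: "/wiki/Second" }
	].map(function (link) {
		return {
			textContent: link.textContent,
			getAttribute: function (name) { return name === "href" ? link.href : null; }
		};
	});
	function matches(selector) { return selector === "a.ref" ? links : []; }
	return {
		querySelector: function (selector) { return matches(selector)[0] || null; },
		querySelectorAll: function (selector) {
			var found = matches(selector);
			return { length: found.length, item: function (i) { return found[i] || null; } };
		}
	};
}

var doc = makeFakeDocument();
console.log(text(doc, "a.ref"));            // first match: "First"
console.log(attr(doc, "a.ref", "href", 1)); // second match: "/wiki/Second"
console.log(text(doc, "h1"));               // no match: null
```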

detectWeb

The detectWeb function is used to classify the type of data on a web page. It should return one of the item types defined by Zotero. Once a web page falls in a category, its retrieval can then be carried out. There is a wide list of available item types. Each item has relevant fields which can hold data. A book item type has fields such as title, publisher, ISBN, author, edition, and the number of pages.

For example, for an article on Wikipedia, we can use "encyclopediaArticle". A complete list of the types is available here.

function detectWeb(doc, url) {
    // Adjust the inspection of url as required
	if (url.indexOf('search') != -1 && getSearchResults(doc, true)) {
		return 'multiple';
	}
	// Adjust the inspection of url as required
	else if (url.indexOf('mediawiki.org/wiki') != -1){
		return 'encyclopediaArticle';
	}
	// Add other cases if needed
}

doWeb

doWeb is a function that initiates the retrieval of data. This function is generally written such that if a page has multiple items, it calls getSearchResults (explained below) and provides the user with a pop-up window to select which items to save; if the page has a singleton entry, then it calls scrape (explained below) to save the item information.

function doWeb(doc, url) {
	if (detectWeb(doc, url) == "multiple") {
		Zotero.selectItems(getSearchResults(doc, false), function (items) {
			if (!items) {
				return true;
			}
			var articles = [];
			for (var i in items) {
				articles.push(i);
			}
			ZU.processDocuments(articles, scrape);
		});
	} else {
		scrape(doc, url);
	}
}

getSearchResults

This function contains the logic to collect multiple items. Each item is stored as a key-value pair. Generally the href (hypertext reference) of the item is chosen as the key and the title of the item as the value. Once the set of all items is ready, it is returned to the doWeb function.

function getSearchResults(doc, checkOnly) {
	var items = {};
	var found = false;
	// Adjust the CSS Selectors 
	var rows = doc.querySelectorAll('.mw-search-result-heading a');
	for (var i=0; i<rows.length; i++) {
	    // Adjust if required, use Zotero.debug(rows) to check
		var href = rows[i].href;
		// Adjust if required, use Zotero.debug(rows) to check
		var title = ZU.trimInternal(rows[i].textContent);
		if (!href || !title) continue;
		if (checkOnly) return true;
		found = true;
		items[href] = title;
	}
	return found ? items : false;
}

scrape

The scrape function is called to save a single item. It is the most interesting function to code in a translator. We first create a new item as returned by detectWeb and then store the metadata in the relevant fields of that item. Along with the metadata, attachments can be saved for an item. These attachments become available even when one is offline. In the function shown below, we make use of another translator called Embedded Metadata. We load this translator and it scrapes information from the meta tags of the web page, filling fields and reducing our work. We can always insert and update information of fields on top of what Embedded Metadata provided.

function scrape(doc, url) {
	var translator = Zotero.loadTranslator('web');
	// Embedded Metadata
	translator.setTranslator('951c027d-74ac-47d4-a107-9c3069ab7b48');
	translator.setDocument(doc);
	
	translator.setHandler('itemDone', function (obj, item) {
		// Add data for fields that are not covered by Embedded Metadata
		item.section = "News";
		item.complete();
	});

	translator.getTranslatorObject(function(trans) {
		// Adjust for multiple item types
		trans.itemType = "newspaperArticle";
		// Add custom fields if required; trans is only in scope in this callback
		trans.addCustomFields({
			'twitter:description': 'abstractNote'
		});
		trans.doWeb(doc, url);
	});
}

License block

This block should be added at the beginning of a translator if you wish to submit your translator to the Zotero upstream.

/*
	***** BEGIN LICENSE BLOCK *****

	Copyright © 2017 YourName
	
	This file is part of Zotero.

	Zotero is free software: you can redistribute it and/or modify
	it under the terms of the GNU Affero General Public License as published by
	the Free Software Foundation, either version 3 of the License, or
	(at your option) any later version.

	Zotero is distributed in the hope that it will be useful,
	but WITHOUT ANY WARRANTY; without even the implied warranty of
	MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
	GNU Affero General Public License for more details.

	You should have received a copy of the GNU Affero General Public License
	along with Zotero. If not, see <http://www.gnu.org/licenses/>.

	***** END LICENSE BLOCK *****
*/

Working example of a translator

Write the code

We will prepare a translator for mediawiki.org that scrapes information by using the functions described above. Open a text editor, create a new JavaScript file, and name it Mediawiki.js. You can use Scaffold to develop translators or test them on translation-server; both approaches are explained below. You can refer to the code snippets provided in the previous section, as they are used for the same translator that we will now be preparing.

  1. Include the attr() and text() functions at the top of the file.
  2. We will first write the detectWeb function. For a page with multiple entries (for example, a search page), notice that the URL has the substring "?search=". So we'll write an "if" clause that checks whether the URL of the target page contains the keyword "search". To prevent misidentifying pages that are not search pages but still have a similar substring in the URL (example), we will also check that getSearchResults returns true. When both conditions are satisfied, the function should return "multiple". For other pages that are wiki pages, we can check whether their URL contains the substring "mediawiki.org/wiki"; if it does, the function returns "encyclopediaArticle". To run this function we need to write getSearchResults first. After that, we can run and test it.
  3. For getSearchResults, we will generate a CSS path that matches all the items on the page. Then for each item we will take its href as the key and store the title of that particular search result as its value. Open this search page in a new tab within the same window and inspect the first search result with the node picker (Ctrl+Shift+I). Copy its CSS path. The CSS path generated by the inspector will be html.client-js.ve-not-available body.mediawiki.ltr.sitedir-ltr.mw-hide-empty-elt.ns--1.ns-special.mw-special-Search.page-Special_Search.rootpage-Special_Search.skin-vector.action-view div#content.mw-body div#bodyContent.mw-body-content div#mw-content-text div.searchresults ul.mw-search-results li div.mw-search-result-heading a, which is quite long. This can be shortened to .mw-search-result-heading a, meaning that we are looking for an <a> tag nested in a <div> element that has the class name "mw-search-result-heading"; this is enough to identify the result links in the entire document. We will pass these selectors to querySelectorAll, which returns a list of all nodes that match them, hence scraping all results of the search.
  4. Let's move to the doWeb function. This function follows the same template in almost all translators. We check for multiple entries and provide the user with a selection window containing all items provided by getSearchResults. The URLs of the items that the user selects are stored in an array (here the variable named articles); the Zotero utility processDocuments sends a GET request to each of these URLs and then passes the DOM of each page to the scrape function, which is the callback function for processDocuments. In case the page contains a single item, doWeb calls the scrape function on it directly. This is done through the else clause.
  5. The scrape function gets all the information from the DOM and saves it. Embedded Metadata.js is a translator that you can include in any of your web translators; it gets information from the meta tags that are well defined. Refer to the code snippet of scrape in the section above to see how it can be loaded. We need to create an object of the correct item type before we start storing the information. For this, we get the result from detectWeb and, based on it, create an object. For this example, we are categorizing pages as encyclopediaArticle, so we can simply create an object of that type. In case there are different options to choose from, for example if your resource holds articles on books, newspapers, journals, etc., you can use a conditional statement to check the item type and then create an appropriate object as required.
    function scrape(doc, url) {
    	var item = new Zotero.Item("encyclopediaArticle");
    	// ...
    }
    
    For the object we have created, we need to know which information we should scrape. You can find a list of all valid item types and their fields here. For the title of the article, use the node inspector to inspect it. Since this node has the id firstHeading, we can extract the title as follows.
    item.title = ZU.trimInternal(doc.getElementById('firstHeading').textContent);
    
    trimInternal() is a Zotero utility that collapses runs of whitespace and removes leading and trailing whitespace, if there is any. Articles on MediaWiki don't list any author or contributor, so these fields are skipped here, but for almost all items this is important information. A Zotero utility that often comes in handy is cleanAuthor, which splits the input into first and last name, making it easy to store names in the creator field. Next, we can store the rights under which each article is available. This is mentioned in the footer of each page. Examining it returns the CSS path html.client-js.ve-available body.mediawiki.ltr.sitedir-ltr.mw-hide-empty-elt.ns-0.ns-subject.page-Citoid.rootpage-Citoid.skin-vector.action-view div#footer ul#footer-info li#footer-info-copyright a. We can shorten it to #footer-info-copyright a and then write the code as follows. We can also hard-code this information if it is not subject to change.
    item.rights = text(doc, '#footer-info-copyright a');
    
    We can hard-code the archive as Mediawiki and get the language of the article as shown below.
    item.language = doc.documentElement.lang; // check this: showing en for all, as en is written in html node, try to fix it.
    item.archive = "Mediawiki";
    
    We can store the article tags listed under Categories at the bottom of the page. Inspect the element using the inspector and generate the CSS path. The CSS selector will be .mw-normal-catlinks ul li a. You'll notice that the division that holds the list of tags has the class name mw-normal-catlinks, which we can use to get all elements that match the specified group of selectors. We can then read the text of each element.
    var tags = doc.querySelectorAll('.mw-normal-catlinks ul li a');
    if(tags.length)
    {
        for (var i=0; i<tags.length; i++) {
    	    item.tags.push(tags[i].text);
        }
    }
    
    We can save links, files and PDFs along with the metadata as attachments. For this translator, we can save the active webpage through its url. The mime type can be set to "text/html" for links and "application/pdf" for PDFs.
    item.attachments.push({
        url : url,
    	title : "Wikimedia Snapshot",
    	type : "text/html"
    });
    
    Finally, the item can be saved with the following line of code:
    item.complete();
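
As a rough illustration of the scrape steps above, the sketch below uses plain JavaScript so that it can run outside Zotero. The helper names chooseItemType, collapseWhitespace and collectTags are hypothetical and not part of the Zotero API, and collapseWhitespace only approximates what ZU.trimInternal does.

```javascript
// Hypothetical helper: map the string returned by detectWeb to the
// item type that scrape should create. Unknown values fall back to
// encyclopediaArticle, the only type used in this tutorial.
function chooseItemType(detected) {
	switch (detected) {
		case "book":
		case "newspaperArticle":
		case "journalArticle":
			return detected;
		default:
			return "encyclopediaArticle";
	}
}

// Stand-in that approximates ZU.trimInternal: collapse runs of
// whitespace into one space and trim both ends.
function collapseWhitespace(s) {
	return s.replace(/\s+/g, " ").trim();
}

// Collect the text of every category link, as the tags loop does.
// `links` is any array-like of objects with a `text` property.
function collectTags(links) {
	var tags = [];
	for (var i = 0; i < links.length; i++) {
		tags.push(links[i].text);
	}
	return tags;
}
```

Inside a real translator, the first helper would be used as new Zotero.Item(chooseItemType(detectWeb(doc, url))).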
    

Run on translation-server

Add metadata header

Add the JSON metadata header given below at the top of Mediawiki.js and save this translator file in /translation-server/modules/zotero/translators/. The translatorID uniquely identifies a translator; one way to generate it is to run md5sum filename in the terminal, which prints an md5 hash of 32 hexadecimal digits. Enter it as the value of the translatorID field, split into groups of 8-4-4-4-12 digits. For example, if the generated md5 hash is d862e24a041b4f1480be44cfbf390346, enter it in the JSON header as d862e24a-041b-4f14-80be-44cfbf390346. No two translators may share the same identifier. You can use git grep id to check whether the generated ID is already used by another translator. If it is, generate a new md5 hash. The label should name the translator so that its purpose is easy to identify; it is convenient to use the same value for the label as for the name of the translator file. The creator field is where you enter your name. The lastUpdated field takes a date in the format YYYY-MM-DD HH:MM:SS. You can get the system time by running date "+%Y-%m-%d %H:%M:%S" in the terminal and enter the result as lastUpdated. browserSupport must include the 'v' flag, which signifies that the translator is supported on the server. For details on the other fields, refer to this section.

{
	"translatorID": "",
	"label": "Mediawiki",
	"creator": "Your name",
	"target": "^https?://www\\.mediawiki\\.org/",
	"minVersion": "3.0",
	"maxVersion": "",
	"priority": 100,
	"inRepository": true,
	"translatorType": 4,
	"browserSupport": "gcsibv",
	"lastUpdated": ""
}
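
The translatorID and lastUpdated values can also be produced programmatically. The following is a minimal sketch, not part of the Zotero tooling: formatTranslatorID and formatLastUpdated are hypothetical helper names, and the timestamp is formatted in UTC, whereas the date command prints local time.

```javascript
// Split a 32-digit hexadecimal md5 hash into the 8-4-4-4-12 grouping
// used by the example translatorID in the text.
function formatTranslatorID(hash) {
	if (!/^[0-9a-f]{32}$/i.test(hash)) {
		throw new Error("expected 32 hexadecimal digits");
	}
	return [
		hash.slice(0, 8),
		hash.slice(8, 12),
		hash.slice(12, 16),
		hash.slice(16, 20),
		hash.slice(20)
	].join("-").toLowerCase();
}

// Format a Date as "YYYY-MM-DD HH:MM:SS" (UTC, an assumption made
// here so that the result does not depend on the local time zone).
function formatLastUpdated(date) {
	return date.toISOString().slice(0, 19).replace("T", " ");
}

console.log(formatTranslatorID("d862e24a041b4f1480be44cfbf390346"));
console.log(formatLastUpdated(new Date()));
```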
Build and run the changes

Once we make any changes in the docker image, i.e. in our copy of the translation-server repository, we can test them by slightly modifying the docker run command instead of rebuilding the image over and over. For example, we saved our translator file in /translation-server/modules/zotero/translators/. To test it, run the following command to get the server running.

./build.sh && docker run --name server -p 1969:1969 -ti --rm -v `pwd`/build/app/:/opt/translation-server/app/ translation-server
Test and update

While writing or updating files, you can test them on the server at the same time. To print intermediate results on the terminal, we can use the Zotero.debug() command. The item will also be displayed as output if we save it with item.complete(). We can also use Zotero.debug(item) to print it in a form that is easier to read. For example, let us see the output of Zotero.debug(text(doc, '#footer-info-copyright a')).

  1. Press Ctrl+C to stop the server if it is running.
  2. Open Mediawiki.js and write Zotero.debug(text(doc, '#footer-info-copyright a')) in the scrape function.
  3. Save the file and re-run the server using the command from the previous section.
    Translation output on terminal for Mediawiki translator
  4. Open another terminal and send a query with curl:
    curl -d '{"url":"https://www.mediawiki.org/wiki/Citoid","sessionid":"abc123"}' --header "Content-Type: application/json" 127.0.0.1:1969/web
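
The same query can be sent from JavaScript. This is only a sketch: buildWebRequest is a hypothetical helper that assembles the request shown in the curl command, and the commented usage assumes a runtime that provides fetch (for example a recent Node.js) and a translation-server listening on 127.0.0.1:1969.

```javascript
// Assemble the POST request that the curl command above sends.
function buildWebRequest(pageUrl, sessionId) {
	return {
		endpoint: "http://127.0.0.1:1969/web",
		options: {
			method: "POST",
			headers: { "Content-Type": "application/json" },
			body: JSON.stringify({ url: pageUrl, sessionid: sessionId })
		}
	};
}

var request = buildWebRequest("https://www.mediawiki.org/wiki/Citoid", "abc123");
console.log(request.options.body);

// With the server running, the request could be sent like this:
// fetch(request.endpoint, request.options)
// 	.then(function (res) { return res.json(); })
// 	.then(function (items) { console.log(items); });
```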
    
Write test cases

Once you have finished creating a translator, it is recommended to add test cases. The test cases that you add during development are run daily once the translator is merged into Zotero upstream. The test results help to identify whether a translator needs an update, should the test cases start failing in the future. It is enough to add one test case for each item type that the translator identifies.

Test cases for multiple items

To begin with, we will test whether a page with multiple entries is translated successfully. Run the server and use the following query to test the URL of this search page.

curl -d '{"url":"https://www.mediawiki.org/w/index.php?search=Zotero+&title=Special:Search&go=Go&searchToken=2pwkmi9qkwlogcnknozyzpco1","sessionid":"abc123"}' \
 --header "Content-Type: application/json" \
127.0.0.1:1969/web

You should receive the following output if the translation is successful. It shows all the key-value pairs of URLs and their respective titles present on the search page.

{"https://www.mediawiki.org/wiki/Citoid/Determining_if_a_URL_has_a_translator_in_Zotero":"Citoid/Determining if a URL has a translator in Zotero",
"https://www.mediawiki.org/wiki/Citoid/Creating_Zotero_translators":"Citoid/Creating Zotero translators",
"https://www.mediawiki.org/wiki/Citoid/Zotero%27s_Tech_Talk":"Citoid/Zotero's Tech Talk",
"https://www.mediawiki.org/wiki/Citoid":"Citoid",
....
....
....
"https://www.mediawiki.org/wiki/Tech_talks":"Tech talks"}

To create a test case for this URL, we will write a JSON object and store it in a JavaScript variable. The type field shows that the test is for a web translator. The url field holds the URL of the source of the test case. The items field shows that the page yields multiple items.

var testCases = [
	{
		"type": "web",
		"url": "https://www.mediawiki.org/w/index.php?search=Zotero+&title=Special:Search&go=Go&searchToken=2pwkmi9qkwlogcnknozyzpco1",
		"items": "multiple"
	}
	// more test cases
]

Test cases for single item

Server side output of Zotero.debug(item) for Citoid's Mediawiki page.

The test case for a single item follows a structure similar to the one shown above. The items field now holds the information scraped and stored by the translator. For example, we will take the output of the translation of citoid's page and modify it a little. The adjacent image shows the result of Zotero.debug(item), and the following is the way we should store that information.

{
		"type": "web",
		"url": "https://www.mediawiki.org/wiki/Citoid",
		"items": [{
			"itemType": "encyclopediaArticle",
			"title": "citoid",
			"creators": [],
			"archive": "Mediawiki",
			"language": "en",
			"libraryCatalog": "Mediawiki",
			"rights": "Creative Commons Attribution-ShareAlike License",
			"attachments": [{
				"title": "Wikimedia Snapshot",
				"type": "text/html"
			}],
			"tags": [
				"Extensions with VisualEditor support",
				"WMF Projects"
			],
			"notes": [],
			"seeAlso": []
		}]
	}
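
Before submitting, it can help to sanity-check the shape of each test case. The function below is a hypothetical sketch, not Zotero's own test runner; it only checks the fields discussed above (type, url, and items being either "multiple" or a non-empty list of objects with an itemType).

```javascript
// Hypothetical structural check for a web translator test case.
function isValidTestCase(testCase) {
	if (!testCase || testCase.type !== "web" || typeof testCase.url !== "string") {
		return false;
	}
	if (testCase.items === "multiple") {
		return true;
	}
	return Array.isArray(testCase.items) &&
		testCase.items.length > 0 &&
		testCase.items.every(function (item) {
			return item !== null && typeof item === "object" &&
				typeof item.itemType === "string";
		});
}

var singleItemCase = {
	"type": "web",
	"url": "https://www.mediawiki.org/wiki/Citoid",
	"items": [{ "itemType": "encyclopediaArticle", "title": "citoid" }]
};
console.log(isValidTestCase(singleItemCase));
```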

Run on Scaffold

Scaffold is an integrated development environment provided by Zotero for writing web and import translators. Although the latest release is Zotero 5.0, Scaffold doesn't work with Zotero standalone, so we need to work with Zotero 4.0 for Firefox. It is available from the download page of Zotero.

Fill in metadata

Fill metadata and test regex in Scaffold for Mediawiki
  1. Open mediawiki.org in a new window in Firefox browser. You can open the web page you want to translate in a new tab but for Scaffold to detect the source, you will need to keep switching between tabs. So it is convenient to use another window to keep the web page as the active tab.
  2. From the menu bar of Firefox, through the Tools drop-down, open Scaffold.
  3. It has six buttons on top, to load an existing translator, to save the current translator, to run detectWeb, doWeb, detectImport and doImport, respectively.
  4. In the Metadata tab, you will see an automatically generated translator id, which is unique to each translator.
  5. In the label field, enter a name for the translator that makes it easy to recognize the source it works for. For example, for mediawiki.org, enter the label as Mediawiki.
  6. In the target regex field, enter a regular expression that matches the URLs this translator should handle, and test it.
  7. Leave the other fields at their default values. At the bottom, for the translator type, check the "Web" option since we are building a web translator.
  8. For the Browser support, it is convenient to check all the modes, but in case you want to choose a limited list, you can also do that. For citoid's use, it is compulsory for a translator to run in translation-server mode. So do check the last option that says "Server".
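
The target regex from the metadata can also be checked outside Scaffold. The sketch below assumes the target is meant to match both http and https pages on www.mediawiki.org; matchesTarget is a hypothetical helper name.

```javascript
// Assumed target pattern for the Mediawiki translator. In the JSON
// header the backslashes are written the same way, doubled.
var target = new RegExp("^https?://www\\.mediawiki\\.org/");

// Return true when the translator's target matches the given URL.
function matchesTarget(url) {
	return target.test(url);
}

console.log(matchesTarget("https://www.mediawiki.org/wiki/Citoid"));
console.log(matchesTarget("https://en.wikipedia.org/wiki/Citoid"));
```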

Fill in code and test

  1. In the code tab, enter the JavaScript code we saved in Mediawiki.js. Alternatively, you can write the functions directly in the space provided while testing them simultaneously.
    Output of detectWeb for Citoid's article on Mediawiki
  2. Once you have entered the code for detectWeb and getSearchResults, you can test the output of detectWeb for an individual article and for a search page by clicking on the eye-like button, and make changes if necessary.
    Output of doWeb For Citoid's article on Mediawiki
  3. doWeb for a single article will call the scrape function, fill the information into the fields of the item and present it. To try things out, you can use the Zotero.debug() command to print values in the test frame.
    Selection window in Scaffold for search page showing results of Zotero on Mediawiki
  4. For a page with multiple entries, Scaffold will show a selection window from where you can choose to save one or multiple items in Zotero library.

Generate test cases

Once the code of a translator is ready, it is recommended to create test cases. These test cases are run daily and help the community to find out whether a translator has started failing and needs an update or a complete rewrite. We will generate test cases for the Mediawiki translator through Scaffold.

  1. Open mediawiki in a new tab. Launch Scaffold and open the translator we have created.
  2. Open the "Testing" tab of Scaffold. We need to give a web page as input. For example, open citoid's page. Keeping this web page as the active tab, simply click on the "New Web" button. It will load the web page in the Input pane as a new unsaved test.
  3. Select the input entry and click the save button to save the output of the test as JSON data.
  4. Similarly, let's create a test case for a search page. Open this link in a new tab as the active one and then click on "New Web". Once it is loaded, save it. You can see the saved test cases in the "Testing" tab of Scaffold. For this search page, you will see a JSON object as follows.
    var testCases = [
        {
    		"type": "web",
    		"url": "https://www.mediawiki.org/w/index.php?search=Zotero+&title=Special:Search&go=Go&searchToken=2pwkmi9qkwlogcnknozyzpco1",
    		"items": "multiple"
    	}
    ]
    

Locate the translator file

Locate the local data directory of Zotero.

The translator file we create through Scaffold is saved locally.

  1. To access it, open Zotero application and choose "Edit" from the menu bar.
  2. In the dropdown list, you will find "Preferences". Click on it to open Zotero Preferences.
  3. Under the Advanced option, you will find a tab named "Files and Folders".
  4. From there click on the "Show Data Directory" button.
  5. It will open a Zotero directory, where the folder named translators will contain the file we created, named Mediawiki.js (matching the label we gave in the metadata).

Submit the translator

Once your translator is ready, submit it to Zotero's repository for translators on GitHub by creating a pull request from your fork.

Verify the deployment of translator in Citoid

The changes in the Zotero translators upstream are manually pulled into Wikimedia's mirror of these translators and deployed on Citoid. To check whether a translator works in Citoid, look at the running tests and, for translators under the "server" link, check that the required translator has the entry "Yes" under the "Supported" column. Only translators with a "Yes" entry work with Citoid. If a translator is supported and yet does not pass the tests, it is probably outdated. Wikimedia's up-to-date repository of translators is https://github.com/wikimedia/mediawiki-services-zotero-translators. Some translators that are present in the server tests may not be deployed yet and hence may be missing from Wikimedia's mirror. Though the mirror is updated regularly, you can still ping the community by creating a Phabricator ticket. An example of one such ticket is here.

Useful links

Testing for the translators using the "server" option or 'v' flag

To test with translation-server, download and install https://github.com/zotero/translation-server. Your translator will need to have the 'v' flag enabled in browserSupport; see https://www.zotero.org/support/dev/translators for more on that. As an example, see https://github.com/zotero/translators/blob/master/3news.co.nz.js: its browserSupport value is a string of letters, one of which is v, which corresponds to translation-server. Unless the translator has been tested on the server and the 'v' flag added, we won't be able to use it.

Hacking citoid

Installation

If you are using vagrant, you can enable the citoid role to hack on the citoid service. Please note that Zotero is only compatible with Ubuntu 14.04 and will not run on Debian Jessie.

If you plan to hack on the Citoid extension, you should enable the following roles:

parserfunctions
scribunto
visualeditor

After enabling the roles, you will further need to add wiki pages to get citoid working. The most expedient way to do this is to export the following pages from en-wiki:

Template:Citation
Template:Cite_web
Template:Cite_news
Template:Cite_journal
Template:Cite_book
MediaWiki:Citoid-template-type-map.json
MediaWiki:Visualeditor-cite-tool-definition.json
MediaWiki:Visualeditor-cite-tool-name-web
MediaWiki:Visualeditor-cite-tool-name-book
MediaWiki:Visualeditor-cite-tool-name-journal
MediaWiki:Visualeditor-cite-tool-name-news

And then import them: localhost:8080/wiki/Special:Import.

After importing the Templates, you may need to navigate to the template, hit the edit button, and then hit save (a "null" edit), in order for the templatedata to propagate to the DB.

Running tests

citoid service

npm test only runs jshint.

mocha runs the full set of tests.

npm run-script coverage runs the tests and reports code coverage.

For all tests to pass when you have installed Zotero manually, you must use our translator fork instead of the Zotero translators (if you are using vagrant and the citoid role, this is already done for you):

git clone https://gerrit.wikimedia.org/r/mediawiki/services/zotero/translators

Citoid will also work with the official Zotero translators repo, but the output data will be different in some cases, and this will cause some tests to fail.

Another reason why some tests may fail erroneously is a DNS that redirects invalid domains to a valid IP (such as BT Internet's DNS); in some cases, this causes a 520 response to be returned instead of a 400 response. This can be fixed by configuring your internet connection to use OpenDNS or another DNS that does not do this.

Citoid extension

See: Manual:JavaScript unit testing

See also

References

  1. CommonSettings.php on phabricator.wikimedia.org (no feature flag set)
  2. As of June 2017, the citoid service was configured for these Wikipedias: mk, ht, cs, bs, de, el, en, es, fi, fr, he, id, it, nl, no, pl, pt, simple, sl, sv, uk, and zh.