Zotero translator for Wikimedia blog

This page shows how to build a Zotero web translator for the Wikimedia blog. We will be testing it on translation-server. The fields that should be filled for blog entries are provided here.

Create translator file and add Metadata

 * 1) Set up the environment for developing translators with translation-server by following the steps explained here.
 * 2) Create a new file in your text editor named Wikimedia Blog.js and save it in translation-server/modules/zotero/translators.
 * 3) Add the metadata in the format shown at the top of the file. Generate a hash code by running md5sum "/exact-path/Wikimedia Blog.js" in the terminal and enter it as the translatorID.
 * 4) Generate the system time using the command date "+%Y-%m-%d %H:%M:%S" in the terminal and enter it as lastUpdated. Add your name under creator and leave the other fields as shown.
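
As a rough illustration of steps 3 and 4, the metadata block might look like the sketch below. In the translator file it is a bare JSON object at the very top; it is assigned to a hypothetical metadata constant here only so that the snippet is valid JavaScript on its own. The translatorID, creator and lastUpdated values are placeholders, and the target pattern assumes the blog is served from blog.wikimedia.org.

```javascript
// Sketch of the translator metadata. In the real file this object sits
// unassigned at the top; `metadata` is a hypothetical name used so the
// snippet can be run on its own.
const metadata = {
	"translatorID": "<output of md5sum>",          // placeholder
	"label": "Wikimedia Blog",
	"creator": "<your name>",                      // placeholder
	"target": "^https?://blog\\.wikimedia\\.org",  // assumed blog host
	"minVersion": "3.0",
	"maxVersion": "",
	"priority": 100,
	"inRepository": true,
	"translatorType": 4,                           // 4 = web translator
	"browserSupport": "gcsibv",
	"lastUpdated": "<output of the date command>"  // placeholder
};

// The target is a regular expression matched against the page URL.
const targetPattern = new RegExp(metadata.target);
console.log(targetPattern.test("https://blog.wikimedia.org/"));
```

The target decides which pages the translator is offered for, so it is worth checking it against a few real URLs before moving on.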

Add Licence block
Add this licence block and edit the year and name of creator in the first sentence of the block.

Add polyfill functions and detectWeb

 * 1) Next, add the polyfill functions for text and attr.
 * 2) Open this Wikimedia blog entry in a browser tab and look for a way to tell whether a given URL is a single entry. Press Ctrl+Shift+I and check the class attribute of the body tag. It can be used to identify the page type: we can use the string "single-post" to check whether a page is a single blog entry.
 * 3) Open this Search page of the blog in a browser tab and inspect the body tag to find a way to identify search results. Here the class attribute of the body tag contains substrings such as "search" and "search-results". So for any URL we receive, we can check the class attribute of the body tag to decide how to classify the document. The blog also has other multiple-entry pages, archived under categories such as technology, community and foundation. We can handle these by checking whether the class name contains the substring "archive". Following is the detectWeb code using this logic.
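
The steps above can be sketched roughly as follows. The attr and text helpers follow the polyfill commonly seen at the top of Zotero translators, and detectWeb applies the class checks described in steps 2 and 3. This is a minimal sketch rather than the exact code of the finished translator: a full translator would usually also confirm, through getSearchResults, that a multiple page really contains results, and that check is left out here so the snippet stands alone.

```javascript
// Polyfills commonly placed at the top of Zotero translators: attr() returns
// an attribute of the node matching a selector, text() its text content.
function attr(docOrElem, selector, attr, index) {
	var elem = index ? docOrElem.querySelectorAll(selector).item(index) : docOrElem.querySelector(selector);
	return elem ? elem.getAttribute(attr) : null;
}

function text(docOrElem, selector, index) {
	var elem = index ? docOrElem.querySelectorAll(selector).item(index) : docOrElem.querySelector(selector);
	return elem ? elem.textContent : null;
}

// Classify a page from the class attribute of its body tag.
function detectWeb(doc, url) {
	var bodyClass = attr(doc, "body", "class") || "";
	if (bodyClass.includes("single-post")) {
		return "blogPost";
	}
	if (bodyClass.includes("search") || bodyClass.includes("archive")) {
		return "multiple";
	}
	return false;
}
```

Checking "single-post" first means a single entry is never mistaken for a listing, even if its body class happens to contain one of the other substrings.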

Add getSearchResults and doWeb function

 * 1) For detectWeb to work, we need to write getSearchResults. This method should pick up all results from a multiple-items page and save each title with its corresponding URL as a key-value pair. We will use CSS selectors to reach all the nodes holding information about articles. Open this Search page in a new tab and press Ctrl+Shift+I. Using the node picker, inspect the title of the first article. The corresponding HTML tag will be highlighted in the Inspector window. Right-click the tag and choose Copy -> CSS path. This gives the full CSS path of the title node.
 * 2) We can shorten this path as long as it still uniquely identifies the nodes we are looking for. The last few selectors serve the same purpose as the whole path, so we will use li.article header.article-head h4.article-title a to get the nodes that hold the title as text and the URL in the href attribute. Following you can see how getSearchResults uses this path; the doWeb function template can then simply be inserted without any changes.
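
A minimal sketch of both functions, modelled on the usual Zotero web translator template, is given below. The trimInternal wrapper is a hypothetical helper added only so the snippet also runs outside Zotero; inside a translator it simply defers to ZU.trimInternal. doWeb relies on the Zotero globals and on the translator's own detectWeb and scrape functions, so it can only be called inside Zotero or translation-server.

```javascript
// Collapse runs of whitespace. Inside Zotero this defers to ZU.trimInternal;
// the fallback only exists so the sketch also runs outside Zotero.
function trimInternal(str) {
	if (typeof ZU !== "undefined") {
		return ZU.trimInternal(str);
	}
	return str.replace(/\s+/g, " ").trim();
}

// Collect url -> title pairs from a search or archive page.
function getSearchResults(doc, checkOnly) {
	var items = {};
	var found = false;
	var rows = doc.querySelectorAll("li.article header.article-head h4.article-title a");
	for (var i = 0; i < rows.length; i++) {
		var href = rows[i].href;
		var title = trimInternal(rows[i].textContent);
		if (!href || !title) continue;
		if (checkOnly) return true;
		found = true;
		items[href] = title;
	}
	return found ? items : false;
}

// Usual doWeb template: let the user pick from multiple results, or scrape
// a single entry directly. detectWeb and scrape are the translator's own.
function doWeb(doc, url) {
	if (detectWeb(doc, url) == "multiple") {
		Zotero.selectItems(getSearchResults(doc, false), function (items) {
			if (!items) return;
			ZU.processDocuments(Object.keys(items), scrape);
		});
	}
	else {
		scrape(doc, url);
	}
}
```

Passing checkOnly as true lets a caller ask only whether any result exists, without building the whole list.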

Test multiple entry pages
Once detectWeb, getSearchResults and doWeb are coded, we can test the multiple entry pages. Open a terminal and, from the translation-server directory, rebuild the image to pick up the changes and run the Docker container. Open another terminal window and pass the query for the search page we have been using. We should also make sure that our methods handle the archived articles well. Let's pass another URL to test.
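
For readers who prefer to send the query from a script, the sketch below posts a page URL to a locally running translation-server from Node (version 18 or later, for the built-in fetch). It assumes the server listens on 127.0.0.1:1969 and accepts the page URL as a plain-text body on /web; older builds may instead expect a JSON body with url and sessionid fields, so adjust it to your setup. buildWebRequest and queryTranslationServer are hypothetical helper names.

```javascript
// Build the request for translation-server's /web endpoint. The host, port
// and plain-text body are assumptions about the local setup.
function buildWebRequest(pageUrl, serverBase) {
	var base = serverBase || "http://127.0.0.1:1969";
	return {
		endpoint: base + "/web",
		options: {
			method: "POST",
			headers: { "Content-Type": "text/plain" },
			body: pageUrl
		}
	};
}

// Send the request and return the HTTP status with the parsed JSON body.
async function queryTranslationServer(pageUrl, serverBase) {
	var request = buildWebRequest(pageUrl, serverBase);
	var response = await fetch(request.endpoint, request.options);
	return { status: response.status, body: await response.json() };
}
```

With the server running, queryTranslationServer("<search page URL>").then(console.log) prints whatever the translator returned for that page.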

Add scrape method
We will use the Embedded Metadata translator (Embedded Metadata.js) to scrape information. It gets the data from the meta tags and fills the relevant fields automatically. On top of it, we will extract any other information the translator misses, if necessary. Let us first check the output provided by Embedded Metadata. Include the following code, which loads one translator into another. Rebuild the Docker image and run the server. Pass the following query, which now has a single entry page's URL for testing. You can replace item.complete() with Zotero.debug(item) to see the output in the terminal in a readable form. The Embedded Metadata translator will pull fields like url and title. For a blog entry, we can also provide the author's name, the time the article was published, its tags, and so on. Let us see how to extract some of these fields.
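
A minimal sketch of such a scrape method, following the common pattern for calling Embedded Metadata from another translator, could look like this. The ID passed to setTranslator is the translatorID of Embedded Metadata in the Zotero translators repository. The function relies on the Zotero global, so it only runs inside Zotero or translation-server.

```javascript
// Delegate scraping to the Embedded Metadata translator and complete the
// item it produces. Extra fields can be filled in inside the handler.
function scrape(doc, url) {
	var translator = Zotero.loadTranslator("web");
	// Embedded Metadata
	translator.setTranslator("951c027d-74ac-47d4-a107-9c3069ab7b48");
	translator.setDocument(doc);

	translator.setHandler("itemDone", function (obj, item) {
		// Swap this call for Zotero.debug(item) to inspect the output.
		item.complete();
	});

	translator.getTranslatorObject(function (trans) {
		// Make sure the item is saved as a blog post.
		trans.itemType = "blogPost";
		trans.doWeb(doc, url);
	});
}
```

Everything the translator adds on top of Embedded Metadata goes inside the itemDone handler, before item.complete() is called.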

Extract creator information
Open this article in a new tab. Under the title, we have author information. Inspect the first author name, "Ruby Mizrahi", with the node picker, then right-click and copy the CSS path for this node. As we can see, the node has the class name author. From the CSS path returned, we can use just the last few CSS selectors to identify all nodes that hold author information. We will edit the handler of the translator to use this path as follows. We take all the nodes that match the path, read the text they hold, and make use of a Zotero utility to turn each name into a creator.
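
The idea can be sketched as below. The selector is passed in as a parameter because the exact shortened path is not reproduced here; ".author" in the usage note is only an assumption based on the class name mentioned above. toCreator and extractCreators are hypothetical helper names. Inside Zotero the name is parsed by ZU.cleanAuthor; the fallback branch is a simplification so the snippet can run elsewhere.

```javascript
// Turn a display name into a Zotero creator object. Inside Zotero this uses
// ZU.cleanAuthor; the fallback is a simplification for running elsewhere.
function toCreator(name) {
	if (typeof ZU !== "undefined") {
		return ZU.cleanAuthor(name, "author");
	}
	var parts = name.trim().split(/\s+/);
	var lastName = parts.pop();
	return { firstName: parts.join(" "), lastName: lastName, creatorType: "author" };
}

// Gather creators from every node matched by the given selector.
function extractCreators(doc, authorSelector) {
	var nodes = doc.querySelectorAll(authorSelector);
	var creators = [];
	for (var i = 0; i < nodes.length; i++) {
		var name = nodes[i].textContent.trim();
		if (name) creators.push(toCreator(name));
	}
	return creators;
}
```

In the itemDone handler this might be used as item.creators = extractCreators(doc, ".author"); before item.complete() is called.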

Add rights
We can add the Wikimedia blog as the value of the catalog field. We can use the license given in the footer of the blog as the rights for each blog entry. Inspect the Creative Commons license with the node picker to get its CSS path. We can identify this node with a shortened form of that path and add this information to the item object as shown below.
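
As a small sketch of this step, the hypothetical helper below sets both fields on an item. The selector for the license node is supplied by the caller, since the shortened path is not reproduced here, and the catalog label "Wikimedia Blog" is an assumed spelling.

```javascript
// Fill in catalog and rights on an item. rightsSelector is supplied by the
// caller; no particular selector is assumed here.
function addCatalogAndRights(item, doc, rightsSelector) {
	item.libraryCatalog = "Wikimedia Blog";
	var node = doc.querySelector(rightsSelector);
	if (node && node.textContent.trim()) {
		item.rights = node.textContent.trim();
	}
	return item;
}
```

If the license node is missing, rights is simply left unset rather than filled with an empty string.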

Test Single entry Pages
Rebuild the container to reflect the updates in the translator and run the server. Test a single post entry using curl to make sure there are no syntax or logic bugs in the code.

Add test cases
We should add test cases for each type of item the translator can identify. Here we have identified pages as either single blog posts or multiple blog posts. After successfully testing the URLs in the sections above, we can write test cases manually from the results generated. The best way to write a test case for a single entry item is to temporarily replace item.complete() in the scrape method with Zotero.debug(item), take the output and save it in the following format.
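
The usual layout of the test case block at the end of a translator file is sketched below, with one multiple-entry case and one single-entry case. Every value in angle brackets is a placeholder to be replaced with the real URL or with the output of Zotero.debug(item); the creator is the author name inspected earlier, and the remaining fields depend on what the translator actually returns.

```javascript
/** BEGIN TEST CASES **/
var testCases = [
	{
		"type": "web",
		"url": "<search page URL>",
		"items": "multiple"
	},
	{
		"type": "web",
		"url": "<single entry URL>",
		"items": [
			{
				"itemType": "blogPost",
				"title": "<title from the debug output>",
				"creators": [
					{
						"firstName": "Ruby",
						"lastName": "Mizrahi",
						"creatorType": "author"
					}
				],
				"libraryCatalog": "Wikimedia Blog",
				"rights": "<rights from the debug output>",
				"url": "<single entry URL>",
				"attachments": [],
				"tags": [],
				"notes": [],
				"seeAlso": []
			}
		]
	}
]
/** END TEST CASES **/
```

For multiple-entry pages the items value is just the string "multiple", so only single entries need their fields written out.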