Zotero translator for Wikimedia blog

This page shows how to build a Zotero web translator for the Wikimedia blog. We will be testing it on translation-server.

Create translator file and add Metadata

 * 1) Set up the environment for developing translators with translation-server by following the steps explained here.
 * 2) Create a new file in your text editor named Wikimedia Blog.js and save it in translation-server/modules/zotero/translators.
 * 3) Add the metadata at the top of the file in the format shown. Generate a hash code by running md5sum "/exact-path/Wikimedia Blog.js" in the terminal and enter it as the translatorID.
 * 4) Generate the system time by running date "+%Y-%m-%d %H:%M:%S" in the terminal and enter it as lastUpdated. Add your name as the creator and leave the other fields as shown.
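
For orientation, here is a sketch of what the metadata header could look like. In the translator file it is written as a bare JSON object at the very top; it is assigned to a variable here only so that the snippet runs on its own. The translatorID, creator, and lastUpdated values are placeholders, and the target pattern is an assumption about the blog's address rather than a value taken from this page.

```javascript
// Sketch of the translator metadata. In Wikimedia Blog.js this object sits at
// the top of the file as plain JSON, without the variable assignment.
var metadata = {
	// Placeholder: paste the hash printed by md5sum here.
	"translatorID": "<hash-from-md5sum>",
	"label": "Wikimedia Blog",
	// Placeholder: your own name.
	"creator": "<your name>",
	// Assumed URL pattern for the blog; adjust it to the pages you want to match.
	"target": "^https?://blog\\.wikimedia\\.org",
	"minVersion": "3.0",
	"maxVersion": "",
	"priority": 100,
	"inRepository": true,
	// 4 marks a web translator.
	"translatorType": 4,
	"browserSupport": "gcsibv",
	// Placeholder: output of date "+%Y-%m-%d %H:%M:%S".
	"lastUpdated": "<YYYY-MM-DD HH:MM:SS>"
};

// Print the header in the form it would take at the top of the file.
console.log(JSON.stringify(metadata, null, "\t"));
```

The target field is a regular expression matched against page URLs, so the dots are escaped to match literally.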

Add polyfill functions and detectWeb

 * 1) Next, add the polyfill functions for text and attr.
 * 2) Open this Wikimedia blog entry in a browser tab to find a way to tell whether a given URL points to a single entry. Press Ctrl+Shift+I and check the class attribute of the body tag of the page. It can be used to identify the page type. We can use the string "single-post" to check whether a page is a single blog entry.
 * 3) Open this Search page of the blog in a browser tab and look at the body tag to find a way to identify search results. Here the class attribute of the body tag contains substrings like "search" and "search-results". So for any URL we receive, we can check the class attribute of the body tag to decide which type of document it is. The blog also has other multiple-entry pages, archived under categories such as technology, community, and foundation. We can handle these pages by checking whether their class name contains the substring "archive". Following is the detectWeb code using this logic.
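
The steps above can be sketched roughly as follows. The attr and text helpers mirror the commonly used polyfills, and detectWeb classifies a page purely from the body tag's class attribute. This is a minimal sketch under that assumption; a complete translator would usually also confirm that getSearchResults finds at least one entry before returning "multiple".

```javascript
// Polyfills for attr() and text(), for environments that do not provide them.
// Each returns null when the selector matches nothing.
function attr(docOrElem, selector, attribute, index) {
	var elem = index
		? docOrElem.querySelectorAll(selector).item(index)
		: docOrElem.querySelector(selector);
	return elem ? elem.getAttribute(attribute) : null;
}

function text(docOrElem, selector, index) {
	var elem = index
		? docOrElem.querySelectorAll(selector).item(index)
		: docOrElem.querySelector(selector);
	return elem ? elem.textContent : null;
}

// Classify a page from the class attribute of its body tag.
function detectWeb(doc, url) {
	var bodyClass = attr(doc, "body", "class") || "";
	if (bodyClass.includes("single-post")) {
		return "blogPost";
	}
	// Search pages carry "search" or "search-results"; category pages carry "archive".
	if (bodyClass.includes("search") || bodyClass.includes("archive")) {
		// Assumption: no getSearchResults(doc, true) check is made in this sketch.
		return "multiple";
	}
	return false;
}
```

Returning false tells Zotero that the translator does not apply to the page.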

Add getSearchResults and doWeb function

 * 1) For detectWeb to work, we need to write getSearchResults. This method should pick up all results from a multiple-entry page and save them as key-value pairs with the corresponding URLs. We will use CSS selectors to reach all the nodes holding information about articles. Open this Search page in a new tab and press Ctrl+Shift+I. Using the node picker, inspect the title of the first article. The HTML tag corresponding to the title will be highlighted in the Inspector window. Right-click the tag and choose Copy -> CSS path. The following will be the CSS path.
 * 2) We can shorten this path as long as it still uniquely identifies the nodes we are looking for. Picking the last few selectors serves the same purpose as the whole path. So we will use li.article header.article-head h4.article-title a to get the nodes that hold the title as text and the URL in the href attribute. Following you can see how getSearchResults uses this path; after that we can simply insert the template of the doWeb function without any changes.
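
As an illustration of these two steps, here is a sketch of getSearchResults built on the shortened selector, followed by a doWeb modeled on the usual template. Two things are assumptions: the title is cleaned with a plain regular expression so the snippet runs outside Zotero (inside a translator, ZU.trimInternal does the same job), and doWeb relies on Zotero, ZU, detectWeb, and scrape, which come from the translator environment and the rest of the file.

```javascript
// Collect title/URL pairs from a multiple-entry page.
// With checkOnly set, return true as soon as one usable entry is found.
function getSearchResults(doc, checkOnly) {
	var items = {};
	var found = false;
	var rows = doc.querySelectorAll("li.article header.article-head h4.article-title a");
	for (var i = 0; i < rows.length; i++) {
		var href = rows[i].href;
		// Collapse runs of whitespace and trim the ends of the title.
		var title = (rows[i].textContent || "").replace(/\s+/g, " ").trim();
		if (!href || !title) continue;
		if (checkOnly) return true;
		found = true;
		items[href] = title;
	}
	return found ? items : false;
}

// Dispatch on the page type. Only runs inside Zotero or translation-server,
// where Zotero and ZU exist and detectWeb and scrape are defined in the file.
function doWeb(doc, url) {
	if (detectWeb(doc, url) == "multiple") {
		Zotero.selectItems(getSearchResults(doc, false), function (items) {
			if (!items) return true;
			var articles = [];
			for (var i in items) {
				articles.push(i);
			}
			ZU.processDocuments(articles, scrape);
		});
	}
	else {
		scrape(doc, url);
	}
}
```

The returned object is keyed by URL, which is the shape Zotero.selectItems expects for its list of choices.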

Test multiple entry pages
Once detectWeb, getSearchResults and doWeb are coded, we can go ahead and test the multiple-entry pages to check that they work. Open a terminal and, from the translation-server directory, rebuild the image to take in the changes and run the docker container. Open another terminal window and pass the query for the search page we have been using. We should also make sure that our methods handle the archived articles well. Let's pass another URL to test.
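
The query step can also be scripted. The sketch below posts a page URL to a locally running translation-server from Node.js (version 18 or later, for the built-in fetch). It assumes the server listens on port 1969 and accepts the URL as a plain-text body on its /web endpoint; the function names are made up for this example.

```javascript
// Assumed address of a locally running translation-server.
var SERVER = "http://127.0.0.1:1969";

// Build the request for the /web endpoint: the page URL goes in the body as plain text.
function buildWebRequest(pageUrl) {
	return {
		endpoint: SERVER + "/web",
		options: {
			method: "POST",
			headers: { "Content-Type": "text/plain" },
			body: pageUrl
		}
	};
}

// Send the request and return the HTTP status together with the raw response text.
async function queryTranslationServer(pageUrl) {
	var request = buildWebRequest(pageUrl);
	var response = await fetch(request.endpoint, request.options);
	return { status: response.status, body: await response.text() };
}
```

With the container running, calling queryTranslationServer with the search page URL and then with an archive page URL and printing the result should show whether both kinds of multiple-entry pages are recognized.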

Add scrape method
We will use Embedded Metadata.js to scrape information. It reads the data from the meta tags and fills in the relevant fields automatically. On top of that, we will extract any other information the translator misses, if necessary. Let us first check the output provided by Embedded Metadata. Include the following code, which loads one translator into another. Rebuild the docker image and run the server. Pass the following query, which now has a single entry page's URL for testing. The Embedded Metadata translator will pull fields like url and title. For a blog entry, we can also provide the author's name, the time the article was published, its tags, etc. Let us see how to extract these fields.

Extract creator information
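
Any additional fields, creators included, are filled in inside the itemDone handler of scrape. As a starting point, here is a minimal sketch of a scrape function that loads the Embedded Metadata translator, following the usual template. Zotero is supplied by the translator environment, and setting the item type to blogPost is an assumption that fits this blog.

```javascript
// Load the Embedded Metadata translator and let it process the page.
function scrape(doc, url) {
	var translator = Zotero.loadTranslator("web");
	// ID of the Embedded Metadata translator.
	translator.setTranslator("951c027d-74ac-47d4-a107-9c3069ab7b48");

	// Called once Embedded Metadata has built the item.
	translator.setHandler("itemDone", function (obj, item) {
		// Extra fields (creators, date, tags) would be set here before completing.
		item.complete();
	});

	translator.getTranslatorObject(function (trans) {
		// Assumed item type for entries on this blog.
		trans.itemType = "blogPost";
		trans.doWeb(doc, url);
	});
}
```

Calling item.complete() is what hands the finished item back, so any changes to the item have to happen before that line.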

Open this article in a new tab. Under the title, we have author information. Inspect the first author name, "Ruby Mizrahi", with the help of the node picker. Right-click and copy the CSS path for this node. The node is Ruby Mizrahi. As we can see, it has the class name author. The CSS path returned is