User:Mine0901/sandbox

Zotero translators
Zotero translators are scripts written in JavaScript that parse web pages into citations. They can be written for journal articles, manuscripts, blog posts, newspaper articles, etc. Zotero is open-source reference management software; a translator can be created for any specific site and then submitted to the Zotero repository of translators. You can see a list of all Zotero translators at https://github.com/zotero/translators.

Creating Zotero translators
The citoid service relies on the Zotero community for much of the "magic". We use Zotero translators to convert a page link into detailed information, and these translators need to be written for each site. Support is currently best for English-language sources, and we need your help to improve coverage of other sites. All translators share a similar structure and are short pieces of code, so they are easy to create. Translators often work both in the browser and in translation-server. They can support various browsers, namely Firefox, Chrome, Safari and Internet Explorer. For citoid's use, any new translator must work in translation-server.

Required software
To make the process of creating a translator easier, we will download a few free programs as listed below.

Install Zotero
Zotero is available in several forms: the Zotero add-on for Firefox, Zotero standalone, the Zotero Bookmarklet and Zotero for portable Firefox. For writing a translator, the Firefox add-on is the best suited. Follow the steps below to install it:
 * 1) Go to the download page to get Zotero.
 * 2) Choose the Install Zotero for Firefox option.
 * 3) If Firefox prevents the installation, choose the "Allow" option.
 * 4) After the add-on is downloaded and verified, click on Install.
 * 5) Restart Firefox to apply the changes. You will now be able to see Zotero's icon, that is a 'Z', in the toolbar of Firefox.

Install Scaffold
Scaffold is an integrated development environment for creating Zotero translators. It makes it easy to write and debug a translator, and to add test cases for it. Scaffold is a Firefox add-on; if you don't have Firefox on your system, get it from the Mozilla site. Follow the steps below to install Scaffold:
 * 1) Open this link to get Scaffold in your Firefox browser.
 * 2) Double click the XPI file (Extension Archive file) to download the software.
 * 3) If Firefox prevents the installation, choose the "Allow" option.
 * 4) After the add-on is downloaded and verified, click on Install.
 * 5) Restart Firefox to apply the changes. You can now access Scaffold from the Tools menu.

Install Firebug
When writing a translator, XPaths are used to extract information from HTML or XML. Firebug is a Firefox extension that can be used to analyze any website's source code and create XPaths. Firebug is no longer maintained and is deprecated in favor of Firefox DevTools, but DevTools does not yet offer this XPath feature. For this reason, you can install Firebug or any other XPath tool. Follow the steps below to install Firebug:
 * 1) Open this link to get this add-on.
 * 2) Double click on the button that says "+ Add to Firefox".
 * 3) After the add-on is downloaded and verified, click on Install.
 * 4) Restart Firefox to apply the changes. You will now be able to inspect web pages with Firebug.

Required Concepts
There are a few concepts that you should know that will help you in creating translators. These concepts are discussed briefly.

HyperText Markup Language
Knowing the basics of HTML is crucial, as it helps you understand the source code of the web page you want to write a translator for. Fortunately, HTML is easy to read and understand. It is a language for creating web pages and applications and, along with CSS and JavaScript, forms the foundation of web pages all over the internet. HTML contains tags that group content. Tags can form markup elements, which have an opening tag and a closing tag, or empty elements, which have only an opening tag. Tags can also have attributes, which help in identifying elements, styling them, etc. Web browsers process HTML documents and present them to the user.



Document Object Model
The DOM is a language-independent interface that structures a web page into a tree-like pattern. It recognizes parts of a document as nodes and organizes them into a hierarchical structure. For example, in an HTML document the html node is the root, head and body are its children, and headings and paragraphs sit further down as descendants of body.
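As a rough illustration of this hierarchy, the sketch below models the DOM of a tiny page by hand as nested objects and lists the path from the root to every node. The page and the helper name listPaths are hypothetical, and text nodes and attributes are left out; a real browser builds a much richer tree.

```javascript
// Hypothetical page, modelled by hand as the tree a browser would build from:
// <html><head><title>Example</title></head>
//       <body><h1>Heading</h1><p>Text</p></body></html>
const tree = {
  name: "html",
  children: [
    { name: "head", children: [{ name: "title", children: [] }] },
    {
      name: "body",
      children: [
        { name: "h1", children: [] },
        { name: "p", children: [] },
      ],
    },
  ],
};

// Walk the tree depth-first and record the path from the root to every node.
function listPaths(node, prefix = "") {
  const path = prefix + "/" + node.name;
  let paths = [path];
  for (const child of node.children) {
    paths = paths.concat(listPaths(child, path));
  }
  return paths;
}

console.log(listPaths(tree).join("\n"));
```

Each printed line is a chain of nodes from the root, which is the idea that XPath builds on.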

XML Path Language
The primary purpose of XPath is to address parts of an XML document. Through the DOM, we know how to get a well-defined, unique path from the root node to any node in the document. We need to put this chain of nodes forming a path into a format that JavaScript can understand. To understand how XPaths are used in writing translators, consider the following example.
 * 1) Open Citoid documentation in a new window of Firefox.
 * 2) Open Firebug by pressing Ctrl+Shift+C.
 * 3) Inspect the title of the page, that is, "citoid", through the inspect element feature.
 * 4) Right-click and choose the option "Copy XPath".
 * 5) The XPath for the title is /html/body/div[3]/h1
To make this XPath shorter and less prone to failure, we can make some changes. For that, we will go through the building blocks of XPaths.
 * 1) HTML elements - In the XPath generated above, /html/body/div[3]/h1, the names html, body, div and h1 are HTML elements.
 * 2) Delimiter (// and /) - All XPaths starting from the root node have the forward slash '/' as their first character, and all nodes in the XPath are separated by a forward slash. A double slash '//' matches nodes at any depth in the document, so a common leading subpath can be dropped. For two XPaths, /html/body/h1/p and /html/body/h2, we can see that /html/body is the common path. We can remove this common subpath and convert them into //h1/p and //h2, respectively.
 * 3) Node specifiers - [3] in /html/body/div[3]/h1 is the node specifier. It says that we need the third div node after we have traveled down the nodes html and body. The order of the nodes in the source code matters.
 * 4) Attributes - We can identify nodes by the attributes defined inside their opening tags. We can change /html/body/div[3]/h1 to simply //h1[@class="firstHeading"].
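The building blocks above can be made concrete with a small sketch that splits an XPath into its parts. This is not an XPath engine and not part of Zotero; describeXPath is a hypothetical helper that only understands the simple forms used in this section (element names, [n] and [@attr="value"]) and would fail on attribute values that contain a slash.

```javascript
// Split a simple XPath into its building blocks.
// Supported step forms: name, name[n], name[@attr="value"].
function describeXPath(xpath) {
  // A leading "//" means "match at any depth"; a single "/" starts at the root.
  const anywhere = xpath.startsWith("//");
  const steps = xpath
    .replace(/^\/+/, "")
    .split("/")
    .map(function (step) {
      const m = step.match(/^([\w*-]+)(?:\[(\d+)\]|\[@([\w-]+)="([^"]*)"\])?$/);
      if (!m) throw new Error("Unsupported step: " + step);
      const part = { element: m[1] };
      if (m[2]) part.position = Number(m[2]); // node specifier, e.g. [3]
      if (m[3]) part.attribute = { name: m[3], value: m[4] }; // e.g. [@class="x"]
      return part;
    });
  return { anywhere: anywhere, steps: steps };
}

console.log(JSON.stringify(describeXPath("/html/body/div[3]/h1")));
console.log(JSON.stringify(describeXPath('//h1[@class="firstHeading"]')));
```

The second call shows why the attribute form is more robust: it names one element and one attribute, so it does not depend on how many div nodes precede the heading.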

JavaScript
JavaScript is a programming language used in web browsers, servers, game development, databases, etc. Zotero translators are primarily JS files that put the concepts mentioned above into action. You need a clear idea of the following JS concepts before starting to write a translator.
 * 1) Variables
 * 2) Statements
 * 3) Loops
 * 4) Methods
 * 5) Functions
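For readers new to the language, the following short sketch shows the five concepts side by side. The data and the function name tidy are made up for illustration and have nothing to do with Zotero itself.

```javascript
// Variables: named containers for values.
const titles = ["  Citoid ", "Zotero  "];
const cleaned = [];

// Function: a reusable block of statements.
function tidy(text) {
  // Method: a function that belongs to a value, here the string method trim().
  return text.trim();
}

// Loop: repeat the body once for each element of the array.
for (let i = 0; i < titles.length; i++) {
  cleaned.push(tidy(titles[i])); // statement: a single instruction
}

console.log(cleaned);
```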

detectWeb
The detectWeb function is used to classify the type of data on a web page. It should return one of the item types defined by Zotero. Once a web page falls into a category, its retrieval can then be carried out. There is a wide list of available item types. You can view it here. Each item type has relevant fields which can hold data. A book item type has fields like title, publisher, ISBN, author, edition, the number of pages, etc.

For example, for an article on Wikipedia, we can use "encyclopediaArticle".
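A minimal sketch of such a function, using mediawiki.org as the target site, might look like this. It assumes that a helper named getSearchResults is defined elsewhere in the translator and returns true when called with true as its second argument on a page that lists results; the URL checks are illustrative and may need adjusting for a real site.

```javascript
// Zotero calls detectWeb with the DOM of the page and its URL.
function detectWeb(doc, url) {
  // A search page may list several items; confirm that results really exist.
  // getSearchResults is assumed to be defined elsewhere in the translator.
  if (url.includes("search") && getSearchResults(doc, true)) {
    return "multiple";
  }
  // Ordinary wiki pages are classified as encyclopedia articles.
  if (url.includes("mediawiki.org/wiki")) {
    return "encyclopediaArticle";
  }
  // A falsy return value signals that this translator does not apply.
  return false;
}
```

The second check inside the first condition matters for pages whose URL merely contains the word "search" without listing any results.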

doWeb
doWeb is the function that initiates the retrieval of data. It is generally written so that if a page has multiple items, it calls getSearchResults (explained below) and shows the user a pop-up window to select which items to save; if the page has a single entry, it calls scrape (explained below) to save the item information.
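The usual shape of this function is sketched below. It only runs inside Zotero or Scaffold, since Zotero.selectItems and ZU.processDocuments are provided by the translator runtime, and it assumes that detectWeb, getSearchResults and scrape are defined elsewhere in the same translator.

```javascript
function doWeb(doc, url) {
  if (detectWeb(doc, url) === "multiple") {
    // Show the selection window with every item found on the page.
    Zotero.selectItems(getSearchResults(doc, false), function (items) {
      if (!items) return; // the user cancelled the dialog
      var articles = [];
      for (var href in items) {
        articles.push(href); // keys are the URLs of the chosen items
      }
      // Fetch each chosen page and hand its DOM to scrape.
      ZU.processDocuments(articles, scrape);
    });
  } else {
    // A single item: scrape the current page directly.
    scrape(doc, url);
  }
}
```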

getSearchResults
This function contains the logic to collect multiple items. Each item is stored as a key-value pair. Generally the href (hypertext reference) of the item is chosen as the key and the title of the item as the value. Once the set of all items is ready, it is returned to the doWeb function.
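A sketch of this function for a search page on mediawiki.org follows. The XPath is an assumption about the markup of the search results, built from the class name "mw-search-result-heading"; the checkOnly parameter is a common convention that lets detectWeb ask whether any result exists without building the whole list. ZU.xpath and ZU.trimInternal come from the Zotero runtime.

```javascript
function getSearchResults(doc, checkOnly) {
  var items = {};
  var found = false;
  // Assumed markup: each result title is a link inside an element
  // with class "mw-search-result-heading".
  var rows = ZU.xpath(doc, '//*[@class="mw-search-result-heading"]/a');
  for (var i = 0; i < rows.length; i++) {
    var href = rows[i].href;
    var title = ZU.trimInternal(rows[i].textContent);
    if (!href || !title) continue; // skip incomplete entries
    if (checkOnly) return true; // one valid result is enough
    found = true;
    items[href] = title; // key: URL, value: title
  }
  return found ? items : false;
}
```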

scrape
The scrape function is called to save a single item. It is the most interesting function to code in a translator. We first create a new item of the type returned by detectWeb and then store the metadata in the relevant fields of that item. Along with the metadata, attachments can be saved for an item. These attachments remain available even when one is offline.

License block
This block should be added at the top of a translator if you wish to submit your translator to the Zotero upstream.

Working example of a translator
Let us now see how everything we learned comes into play. We will build a translator for mediawiki.org and see how to get it live in production.

Filling in metadata

 * 1) Open mediawiki.org in a new window. You could open the page you want to translate in a new tab instead, but Scaffold detects the source from the active tab, so you would need to keep switching between tabs. Using a separate window keeps the web page as the active tab.
 * 2) From the menu bar of Firefox, through the Tools dropdown, open Scaffold.
 * 3) Scaffold has six buttons at the top: load an existing translator, save the current translator, and run detectWeb, doWeb, detectImport and doImport.
 * 4) In the Metadata tab, you will see an automatically generated translator id, which is unique to each translator.
 * 5) In the label field, enter a name for the translator that makes it easy to recognize the source for which it works. For example, for mediawiki.org, enter the label as Wikimedia. Also fill in the target field with a regular expression that matches the URLs the translator should handle.
 * 6) Let the other fields keep their default values. At the bottom, for the translator type, check the "Web" option since we are building a web translator.
 * 7) For the Browser support, it is convenient to check all the modes, but in case you want to choose a limited list, you can also do that. For citoid's use, it is compulsory for a translator to run in translation-server mode. So do check the last option that says "Server".
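Scaffold stores these choices as a metadata header at the top of the translator file. The sketch below writes an equivalent header as a JavaScript object so that the target regex can be tried out; every value is a placeholder or an assumption (the ID, creator, target pattern, version and date in particular), and Scaffold generates the real header for you.

```javascript
// Hypothetical metadata for the mediawiki.org translator.
const metadata = {
  translatorID: "00000000-0000-0000-0000-000000000000", // placeholder; Scaffold generates a real ID
  label: "Wikimedia",
  creator: "Your Name", // placeholder
  target: "^https?://(www\\.)?mediawiki\\.org/", // assumed target regex
  minVersion: "3.0", // placeholder
  maxVersion: "",
  priority: 100,
  inRepository: true,
  translatorType: 4, // 4 marks a web translator
  browserSupport: "gcsibv", // the "v" flag marks translation-server support
  lastUpdated: "2017-01-01 00:00:00", // placeholder
};

// Check the target regex against a sample URL before relying on it.
console.log(new RegExp(metadata.target).test("https://www.mediawiki.org/wiki/Citoid"));
```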

Writing the code

 * 1) In the Code tab, we will write the functions that will retrieve and save the metadata.
 * 2)  To begin with, we will write the detectWeb function. The "if" clause checks the URL of the target page; if it contains the keyword "search", the page may hold multiple items. To exclude pages that match the keyword but are not search pages, the function also checks that getSearchResults returns true. When both conditions are satisfied, the function returns "multiple". For other pages that are wiki pages, we check whether their URL contains the substring "mediawiki.org/wiki" and, if so, make the function return "encyclopediaArticle". Running this function requires getSearchResults, so write that first. You can then run the code in Scaffold to see how it works: open this search page in a new tab within the same window and click on the detectWeb button.
 * 3) For getSearchResults, we will write an XPath that matches all the items on the page and then, for each item, take its href as the key and store the title of that particular search result as its value. Open this search page in a new tab within the same window and inspect the first search result with Firebug. Copy its XPath. The generated XPath can be shortened to one that looks for an <a> tag nested in an element that has the class name "mw-search-result-heading".
 * 4)  We will now write the doWeb function. This function follows the same template in almost all translators. We check for multiple entries and show the user a selection window containing all the items provided by getSearchResults. The URLs of the items that the user selects are stored in an array (here the variable named articles). The Zotero utility processDocuments sends a GET request to each of these URLs and passes the DOM of each page to the scrape function, which is the callback function for processDocuments here. If the page contains a single item, doWeb calls the scrape function on it directly. This is done through the else clause.
 * 5) The scrape function gets all the information from the DOM and saves it. We need to create an object of the correct item type before we start storing the information. For this, we take the result from detectWeb and create an object based on it. In this example, we categorize pages as encyclopediaArticle, so we can simply create an object of that type. If there are different options to choose from, for instance if your resource holds articles on books, newspapers, journals, etc., you can use a conditional statement to check the item type and then create the appropriate object. For the object we have created, we need to know what information to scrape. You can find a list of all valid fields here.
For the title of the article, use Firebug to inspect it. Since this node has the id firstHeading, we can extract the title with an XPath on that id. trimInternal is a Zotero utility that collapses runs of whitespace and removes leading and trailing whitespace.
Since articles on MediaWiki don't mention any author or contributor, these fields are skipped here, but for almost all items this is important information. A Zotero utility that often comes in handy is cleanAuthor, which splits the input into first and last name, making it easy to store names in the creator field.
Next, we can store the rights under which each article is available. These are mentioned in the footer of each page. Examine the footer with Firebug to get its XPath, shorten it, and use it to read the rights text. We can also hard-code this information if it is not subject to change. We can hard-code the archive as Mediawiki and read the language of the article from the page.
We can store the article tags listed under Categories at the bottom of the page. Inspect the element using Firebug and generate the XPath. You'll notice that the division that holds the list of tags is given an id; the XPath can be shortened using this id.
We can save links, files and PDFs along with metadata as attachments. For this translator, we can save the active web page through its URL. The MIME type can be set to "text/html" for links and "application/pdf" for PDFs. Finally, the item is saved by calling its complete method.
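Putting the scraping steps together, a sketch of the scrape function for mediawiki.org could look like the following. It runs only inside Zotero or Scaffold, where Zotero.Item, ZU.xpath, ZU.xpathText and ZU.trimInternal are available. The element ids "footer-info-copyright" and "mw-normal-catlinks", the attachment title, and reading the language from the lang attribute of the html element are assumptions about the page markup, so verify them with an XPath tool before relying on them.

```javascript
function scrape(doc, url) {
  // Every page handled here is a wiki page, so the item type is fixed.
  var item = new Zotero.Item("encyclopediaArticle");

  // Title: the heading node carries the id "firstHeading".
  var title = ZU.xpathText(doc, '//h1[@id="firstHeading"]');
  if (title) item.title = ZU.trimInternal(title);

  // Rights: read from the footer (assumed id), if present.
  var rights = ZU.xpathText(doc, '//li[@id="footer-info-copyright"]');
  if (rights) item.rights = ZU.trimInternal(rights);

  // Archive is hard-coded; language is read from <html lang="..."> (assumption).
  item.archive = "Mediawiki";
  item.language = doc.documentElement.lang;
  item.url = url;

  // Tags: one per category link inside the category container (assumed id).
  var categories = ZU.xpath(doc, '//div[@id="mw-normal-catlinks"]/ul/li/a');
  for (var i = 0; i < categories.length; i++) {
    item.tags.push(ZU.trimInternal(categories[i].textContent));
  }

  // Attachment: the page itself, referenced through its URL.
  item.attachments.push({ url: url, title: "Snapshot", mimeType: "text/html" });

  // Hand the finished item over to Zotero.
  item.complete();
}
```

Fields that the page does not provide are simply left unset, so the same function still works if, for instance, a page has no categories.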