WikidataEntitySuggester/SoftwareSetup

Create a directory where you'd like to do all Entity Suggester work and setup.

Tomcat Setup
Let's begin by setting up Tomcat.

    wget http://archive.apache.org/dist/tomcat/tomcat-7/v7.0.39/bin/apache-tomcat-7.0.39.tar.gz
    tar xzf apache-tomcat-7.0.39.tar.gz
    cd apache-tomcat-7.0.39
    mkdir ROOT_backup
    mv webapps/ROOT* ./ROOT_backup/
    rm -rf work

Set some JVM parameters for Tomcat (Set the heap according to your available resources, but at least 4GB is recommended. You might need 5-6 GB to support both the Myrrix instances and the REST API, not including the memory you should keep aside for Hadoop):
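One common way to pass heap settings to Tomcat 7 is a bin/setenv.sh file, which catalina.sh sources at startup if it exists. The sketch below writes such a file. It is only an illustration under stated assumptions: the helper name `write_setenv` is hypothetical, and the default of 4g simply follows the minimum suggested above, so adjust it to your machine.

```shell
#!/bin/sh
# Sketch: create bin/setenv.sh so Tomcat starts with a fixed heap size.
# Assumptions: stock Tomcat layout; heap values are illustrative only.

# write_setenv TOMCAT_DIR [HEAP]   (hypothetical helper, HEAP defaults to 4g)
write_setenv() {
    tomcat_dir=$1
    heap=${2:-4g}
    mkdir -p "$tomcat_dir/bin"
    # catalina.sh sources bin/setenv.sh if the file is present
    cat > "$tomcat_dir/bin/setenv.sh" <<EOF
#!/bin/sh
CATALINA_OPTS="\$CATALINA_OPTS -Xms$heap -Xmx$heap"
export CATALINA_OPTS
EOF
    chmod +x "$tomcat_dir/bin/setenv.sh"
}
```

For example, running `write_setenv "$PWD" 5g` from inside the apache-tomcat-7.0.39 directory would request a 5 GB heap the next time Tomcat starts.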

There, it's ready.

Hadoop Setup
There is no shortage of information on Hadoop on the internet; it is easy to find helpful tutorials and beginner's guides on what it is and how to set it up. This section is meant to point you to the right places and get you running. You may take a look at this blog post for a more elaborate version.

Hadoop consists of four kinds of services:

 * NameNode: Stores metadata for the HDFS. This service runs on the master node.
 * DataNode: Stores and retrieves the actual data in the HDFS. This service runs on the slave nodes.
 * JobTracker: Runs on the master node; it coordinates and schedules jobs and distributes them to the TaskTracker nodes.
 * TaskTracker: Runs the tasks, performs the computations and reports its progress to the JobTracker.

Assuming there is only one node with multiple cores (say, 4 cores) in this scenario, here's how to proceed:

Disable iptables if it's enabled, or open up the Hadoop-specific ports.
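If you prefer opening ports over disabling the firewall, the rules could look roughly like the output of the sketch below. It only prints the iptables commands so you can review them before running them as root. The helper name `print_hadoop_port_rules` is hypothetical, and the two ports shown are the NameNode and JobTracker ports configured later in this guide; the daemons also listen on other ports (web interfaces, DataNode transfer ports) that you would have to add for a multi-node setup.

```shell
#!/bin/sh
# Sketch: print (not apply) iptables rules that accept TCP traffic on given ports.

# print_hadoop_port_rules PORT...   (hypothetical helper)
print_hadoop_port_rules() {
    for port in "$@"; do
        echo "iptables -I INPUT -p tcp --dport $port -j ACCEPT"
    done
}

# Ports taken from the core-site.xml and mapred-site.xml settings in this guide
print_hadoop_port_rules 54310 54311
```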

Add a user for hadoop and its own directory:

    sudo useradd -m hadoop
    sudo mkdir /hadoop
    sudo chown -R hadoop:hadoop /hadoop

Download the CDH3 Hadoop tarball from Cloudera and set it up:

    sudo su hadoop
    cd /hadoop
    wget http://archive.cloudera.com/cdh/3/hadoop-0.20.2-cdh3u6.tar.gz
    tar xzf hadoop-0.20.2-cdh3u6.tar.gz
    mv hadoop*/* ./
    rm *.tar.gz

Enable passwordless SSH authentication for the hadoop user for localhost, so that you can ssh into localhost from the shell without requiring a password. Remember to chmod the contents of ~hadoop/.ssh/ to 600.
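The step above can be sketched as follows, assuming OpenSSH's ssh-keygen is installed. The helper name `setup_local_ssh_key` is hypothetical; it generates an RSA key pair with an empty passphrase (unless one already exists), authorizes the public key for the same account, and applies the permissions mentioned above.

```shell
#!/bin/sh
# Sketch: passwordless SSH to localhost for the current user.
# Assumption: OpenSSH client tools (ssh-keygen) are available.

# setup_local_ssh_key SSH_DIR   (hypothetical helper)
setup_local_ssh_key() {
    ssh_dir=$1
    mkdir -p "$ssh_dir"
    chmod 700 "$ssh_dir"
    # Generate a key pair with an empty passphrase only if none exists yet
    if [ ! -f "$ssh_dir/id_rsa" ]; then
        ssh-keygen -q -t rsa -N "" -f "$ssh_dir/id_rsa"
    fi
    # Authorize the public key once; skip if it is already listed
    pub=$(cat "$ssh_dir/id_rsa.pub")
    grep -qxF "$pub" "$ssh_dir/authorized_keys" 2>/dev/null ||
        echo "$pub" >> "$ssh_dir/authorized_keys"
    # Restrict the files inside the directory to mode 600
    chmod 600 "$ssh_dir"/*
}
```

Run as the hadoop user, `setup_local_ssh_key "$HOME/.ssh"` followed by `ssh localhost` should then log in without asking for a password (apart from the one-time host key confirmation).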

Edit  and   so that both contain the word "localhost" without quotes.

Properly configure the JAVA_HOME variable in. Next, modify the *-site.xml files in the /hadoop/conf directory and add these:

/hadoop/conf/core-site.xml:
 * hadoop.tmp.dir: /home/hadoop/tmp
 * fs.default.name: hdfs://node1:54310

/hadoop/conf/hdfs-site.xml:
 * dfs.replication: 1 (final: true)
 * dfs.permissions: false (final: true)

/hadoop/conf/mapred-site.xml:
 * mapred.reduce.tasks: 4
 * mapred.job.reuse.jvm.num.tasks: -1
 * mapred.map.tasks: 4
 * mapred.tasktracker.map.tasks.maximum: 4
 * mapred.tasktracker.reduce.tasks.maximum: 4
 * mapred.job.tracker: hdfs://node1:54311
 * mapred.child.java.opts: -Xmx2048m
 * mapred.child.ulimit: 5012m
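The listings above give only property names and values. In the actual *-site.xml files, Hadoop expects each pair wrapped in a property element inside a configuration element. The sketch below prints that structure for any list of name/value pairs. The helper name `hadoop_site_xml` is hypothetical, and it does no XML escaping, so it is only suitable for simple values like the ones used here.

```shell
#!/bin/sh
# Sketch: print a Hadoop site file from alternating NAME VALUE arguments.
# Assumption: values contain no XML special characters (no escaping is done).

# hadoop_site_xml NAME VALUE [NAME VALUE ...]   (hypothetical helper)
hadoop_site_xml() {
    echo '<?xml version="1.0"?>'
    echo '<configuration>'
    while [ "$#" -ge 2 ]; do
        printf '  <property>\n    <name>%s</name>\n    <value>%s</value>\n  </property>\n' "$1" "$2"
        shift 2
    done
    echo '</configuration>'
}

# Example: the core-site.xml values from this guide, printed to stdout
hadoop_site_xml \
    hadoop.tmp.dir /home/hadoop/tmp \
    fs.default.name hdfs://node1:54310
```

Redirecting the output, for example to /hadoop/conf/core-site.xml, gives a file you can then review and extend by hand.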

Configuring done. Let’s format the namenode and fire up the cluster:

    sudo su hadoop
    cd /hadoop
    bin/hadoop namenode -format
    # Start the namenode and datanodes
    bin/start-dfs.sh
    # Check if the DFS has been properly started:
    bin/hadoop dfsadmin -report
    bin/hadoop dfs -df
    # Start the jobtracker and tasktrackers
    bin/start-mapred.sh

Check the log files. They are invaluable. Also, after starting up the services, check all the nodes with   to see if all the services are running correctly. Congratulations, by now you should have your own Hadoop cluster running and kicking. Do   to shut down the cluster.
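One possible way to check the services, assuming the JDK's jps tool is on your path, is to compare its output against the daemons a single-node setup is expected to run. The helper name `missing_daemons` is hypothetical, and the list of expected daemons is an assumption for a one-machine cluster started with the scripts above.

```shell
#!/bin/sh
# Sketch: report expected Hadoop daemons that do not appear in jps output.
# Assumption: a single node running all five daemons.

# missing_daemons: reads jps output on stdin, prints the names not found
missing_daemons() {
    jps_output=$(cat)
    for daemon in NameNode SecondaryNameNode DataNode JobTracker TaskTracker; do
        # jps prints "<pid> <main class>"; -w keeps NameNode from
        # matching inside SecondaryNameNode
        echo "$jps_output" | grep -qw "$daemon" || echo "$daemon"
    done
}
```

Used as `jps | missing_daemons`, it prints nothing when all five daemons are up and otherwise lists the ones to look for in the log files.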

Lastly, I would recommend Michael Noll's tutorials on setting up single-node and multi-node clusters with Ubuntu. They do not use the Cloudera distribution, but they are a pretty good resource.