WikidataEntitySuggester/SoftwareSetup

Create a directory where you'd like to do all Entity Suggester work and setup.

Tomcat Setup

Let's begin by setting up Tomcat.

wget http://archive.apache.org/dist/tomcat/tomcat-7/v7.0.39/bin/apache-tomcat-7.0.39.tar.gz
tar xzf apache-tomcat-7.0.39.tar.gz
cd apache-tomcat-7.0.39
mkdir ROOT_backup
mv webapps/ROOT* ./ROOT_backup/
rm -rf work

Set some JVM parameters for Tomcat. Size the heap according to your available resources; at least 4 GB is recommended, and you may need 5-6 GB to support both the Myrrix instances and the REST API, not counting the memory you should keep aside for Hadoop:

echo 'export CATALINA_OPTS="-Xmx6g -XX:NewRatio=12 $CATALINA_OPTS"' > bin/setenv.sh

There, it's ready.
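
If you'd like to make sure the container starts cleanly before anything is deployed on it, you can bring it up and down once and watch the main log (paths are relative to the apache-tomcat-7.0.39 directory; the exact log output will vary):

bin/startup.sh
tail -f logs/catalina.out   # press Ctrl-C once the server reports that startup is complete
bin/shutdown.sh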

Hadoop Setup

There is no dearth of information on Hadoop on the internet; it's easy to find helpful tutorials and beginner's guides on what it is, how to set it up, the works. This section is meant to point you to the right places and get you running. You may take a look at this blog post for a more elaborate version of these steps.

Hadoop consists of four kinds of services:

  • NameNode: Stores metadata for the HDFS. This runs on the master node.
  • DataNode: These services store and retrieve the actual data in the HDFS. They run on the slave nodes.
  • JobTracker: This service runs on the master node; it coordinates and schedules jobs and distributes their tasks to the TaskTracker nodes.
  • TaskTracker: Runs the tasks, performs the computations and reports its progress back to the JobTracker.

Assuming there is only one node with multiple cores (say, 4 cores) in this scenario, here's how to proceed:

Disable iptables if it's enabled, or open up the Hadoop-specific ports.
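
For example, on a Red Hat-style system you could either switch the firewall off for the duration of the setup or open just the ports used later on this page (54310 and 54311 for the NameNode and JobTracker as configured below, plus the default web UI ports 50070 and 50030). This is only a sketch under that assumption; adapt it to your distribution:

# Option 1: disable the firewall entirely (test machines only)
sudo service iptables stop
sudo chkconfig iptables off

# Option 2: open just the Hadoop ports
for port in 54310 54311 50070 50030; do
    sudo iptables -I INPUT -p tcp --dport $port -j ACCEPT
done
sudo service iptables save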

Add a user for Hadoop and give it a directory of its own:

sudo useradd -m hadoop
sudo mkdir /hadoop
sudo chown -R hadoop:hadoop /hadoop

Download the CDH3 Hadoop tarball from Cloudera and set it up:

sudo su hadoop
cd /hadoop
wget http://archive.cloudera.com/cdh/3/hadoop-0.20.2-cdh3u6.tar.gz
tar xzf hadoop-0.20.2-cdh3u6.tar.gz
mv hadoop*/* ./
rm *.tar.gz
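
As a quick sanity check that the unpacked distribution is usable, still as the hadoop user in /hadoop:

bin/hadoop version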

Enable passwordless SSH authentication for the hadoop user to localhost, so that you can ssh into localhost from the shell without being asked for a password. Remember to chmod the contents of ~hadoop/.ssh/ to 600.
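
A common way to set this up (run as the hadoop user; this assumes there is no existing key you care about):

mkdir -p ~/.ssh
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/*
ssh localhost   # should log you in without asking for a password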

Edit /hadoop/conf/masters and /hadoop/conf/slaves so that both contain the word "localhost" without quotes.
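
For a single-node cluster that simply means:

echo localhost > /hadoop/conf/masters
echo localhost > /hadoop/conf/slaves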

Properly configure the JAVA_HOME variable in /hadoop/conf/hadoop-env.sh. Next, modify the *-site.xml files in the /hadoop/conf directory as shown below. The examples use node1 as the master's hostname; make sure it resolves to this machine (for example via an /etc/hosts entry), or just use localhost instead on a single-node setup.

/hadoop/conf/core-site.xml:

<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/hadoop/tmp</value> <!-- By default it is /tmp. Change it to wherever you can find enough space. -->
    </property>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://node1:54310</value>
    </property>
</configuration>
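
Since hadoop.tmp.dir is pointed at a non-default location above, it does no harm to create that directory up front (run as the hadoop user):

mkdir -p /home/hadoop/tmp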

/hadoop/conf/hdfs-site.xml:

<configuration>
        <property>
                <name>dfs.replication</name>
                <value>1</value> <!-- Read up on this in Hadoop's docs. Basically this is how many nodes each HDFS block is replicated to; 1 is appropriate for a single-node cluster. -->
                <final>true</final>
        </property>
        <property>
                <name>dfs.permissions</name>
                <value>false</value>
                <final>true</final>
        </property>
</configuration>

/hadoop/conf/mapred-site.xml:

<configuration>
        <property>
                <name>mapred.reduce.tasks</name>
                <value>4</value> <!-- Default number of reduce tasks per job -->
        </property>
        <property>
                <name>mapred.job.reuse.jvm.num.tasks</name>
                <value>-1</value> <!-- Reuse a JVM for further mappers and reducers rather than spawning a new one. -->
        </property>
        <property>
                <name>mapred.map.tasks</name>
                <value>4</value>
        </property>
        <property>
                <name>mapred.tasktracker.map.tasks.maximum</name>
                <value>4</value> <!-- Max number of map tasks to run simultaneously on a node -->
        </property>
        <property>
                <name>mapred.tasktracker.reduce.tasks.maximum</name>
                <value>4</value> <!-- Max number of reduce tasks to run simultaneously on a node -->
        </property>
        <property>
                <name>mapred.job.tracker</name>
                <value>hdfs://node1:54311</value>
        </property>
        <property>
                <name>mapred.child.java.opts</name>
                <value>-Xmx2048m</value> <!-- Read up on the details in Hadoop's docs. I used this value on the 8G RAM labs instance for the Entity Suggester. -->
        </property>
        <property>
                <name>mapred.child.ulimit</name>
                <value>5012m</value> <!-- Read up on the details in Hadoop's docs. I used this value on the 8G RAM labs instance for the Entity Suggester. -->
        </property>
</configuration>
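
Before moving on, it can be worth checking that none of the edited files has a stray or unclosed tag (assuming xmllint is installed; the command prints nothing if all the files are well-formed):

xmllint --noout /hadoop/conf/core-site.xml /hadoop/conf/hdfs-site.xml /hadoop/conf/mapred-site.xml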

Configuration done. Let's format the NameNode and fire up the cluster:

sudo su hadoop
cd /hadoop
bin/hadoop namenode -format
# Start the namenode and datanodes
bin/start-dfs.sh
# Check if the DFS has been properly started:
bin/hadoop dfsadmin -report
bin/hadoop dfs -df
# Start the jobtracker and tasktrackers
bin/start-mapred.sh

Check the log files; they are invaluable. Also, after starting the services, check all the nodes with ps aux | grep java to see whether all the services are running correctly. Congratulations, by now you should have your own Hadoop cluster up and running. Run bin/stop-all.sh to shut the cluster down.
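
If the JDK's jps tool is on the hadoop user's PATH, it gives a quicker overview than grepping the output of ps (process IDs will differ):

jps
# On this single-node setup you should see NameNode, SecondaryNameNode, DataNode,
# JobTracker and TaskTracker listed.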

Lastly, I would recommend Michael Noll's tutorials on setting up single-node and multi-node clusters on Ubuntu. They do not use this Cloudera distribution, but they are a pretty good resource.