hadoop on ARM (part 1)

Posted: July 8, 2012 in linaro, server

Updated: A slight HTML error on my part caused missing elements in some of the XML conf files.

The next step with my Linaro-based armhf server image is to install and run Hadoop. This blog post follows the general Hadoop instructions found in the Apache wiki for Ubuntu, with updates specific to a Linaro-based install and to ARM.

First we need Java.

# apt-get install openjdk-6-jdk

After it's installed, let's verify that everything is fine.

# java -version
java version "1.6.0_24"
OpenJDK Runtime Environment (IcedTea6 1.11.1) (6b24-1.11.1-4ubuntu3)
OpenJDK Zero VM (build 20.0-b12, mixed mode)

Now we need to create a user and group for Hadoop.

# addgroup hadoop
# adduser --ingroup hadoop hduser

If you haven't already, make sure you have openssh-server installed.

# apt-get install openssh-server

Now we need to generate SSH keys for the hduser account.

# su - hduser
$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2 hduser@ubuntu
The key's randomart image is:
...
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
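
If ssh still prompts for a password in the next step, overly permissive permissions on ~/.ssh are a likely culprit; tightening them to what sshd expects usually sorts it out:

$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/authorized_keys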

Yes, we've created a key with an empty passphrase. This is a test setup, not a production setup. Now let's connect locally to make sure everything is OK.

$ slogin localhost

Be sure to connect all the way through to a shell prompt, accepting the host key if prompted. If that works, you're in good shape. Exit.

Now, on the advice of others, we're going to disable IPv6.

# echo 'net.ipv6.conf.all.disable_ipv6 = 1' >> /etc/sysctl.conf
# echo 'net.ipv6.conf.default.disable_ipv6 = 1' >> /etc/sysctl.conf
# echo 'net.ipv6.conf.lo.disable_ipv6 = 1' >> /etc/sysctl.conf

And then, to make the changes take effect:

# sysctl -p
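
To double check that the change stuck, read the setting back out of /proc; a value of 1 means IPv6 is disabled:

# cat /proc/sys/net/ipv6/conf/all/disable_ipv6
1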

Now we're ready to install Hadoop. Unfortunately, there are as of yet no Hadoop packages, so we'll have to install it from the upstream release tarball. Hadoop, as it turns out, is written in Java, so it's just a matter of unpacking it, not building from source. Download hadoop-1.0.3.tar.gz from here.

# cd /usr/local
# tar xfz hadoop-1.0.3.tar.gz
# ln -s hadoop-1.0.3 hadoop
# mkdir hadoop/logs
# chown -R hduser:hadoop hadoop

Now we need to set up some environment variables for the hduser account.

# su - hduser
$ vi ~/.bashrc

and add the following to the end of the file

# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/local/hadoop

# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-armhf

# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"

# If you have LZO compression enabled in your Hadoop cluster and
# compress job outputs with LZOP (not covered in this tutorial):
# Conveniently inspect an LZOP compressed file from the command
# line; run via:
#
# $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
#
# Requires installed 'lzop' command.
#
lzohead () {
    hadoop fs -cat "$1" | lzop -dc | head -1000 | less
}

# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin

Save the file, but before exiting the hduser account, give the new settings a quick test.
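
Re-source the file and ask Hadoop to report its version; assuming the /usr/local/hadoop layout above, you should see 1.0.3 echoed back:

$ source ~/.bashrc
$ hadoop version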

Now as root again, edit /usr/local/hadoop/conf/hadoop-env.sh and add

export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-armhf

Now edit /usr/local/hadoop/conf/core-site.xml and add the following between the configuration tags. Feel free to change the temp directory to a different location. This is where HDFS, the Hadoop Distributed File System, will put its temporary files.

<!-- In: conf/core-site.xml -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/fs/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>

Now we need to create the directory

# mkdir -p /fs/hadoop/tmp
# chown hduser:hadoop /fs/hadoop/tmp
# chmod 750 /fs/hadoop/tmp

Now edit /usr/local/hadoop/conf/mapred-site.xml and again drop the following between the configuration tags.

<!-- In: conf/mapred-site.xml -->
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>

Now edit /usr/local/hadoop/conf/hdfs-site.xml and again add the following between the configuration tags.

<!-- In: conf/hdfs-site.xml -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>

Now we are going to set up the HDFS filesystem by formatting the NameNode.

# su - hduser
$ /usr/local/hadoop/bin/hadoop namenode -format

You will see output that should resemble the following

hduser@linaro-server:~$ /usr/local/hadoop/bin/hadoop namenode -format
Warning: $HADOOP_HOME is deprecated.

12/07/09 03:58:09 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = linaro-server/127.0.1.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 1.0.3
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1335192; compiled by 'hortonfo' on Tue May  8 20:31:25 UTC 2012
************************************************************/
12/07/09 03:58:12 INFO util.GSet: VM type       = 32-bit
12/07/09 03:58:12 INFO util.GSet: 2% max memory = 19.335 MB
12/07/09 03:58:12 INFO util.GSet: capacity      = 2^22 = 4194304 entries
12/07/09 03:58:12 INFO util.GSet: recommended=4194304, actual=4194304
12/07/09 03:58:18 INFO namenode.FSNamesystem: fsOwner=hduser
12/07/09 03:58:20 INFO namenode.FSNamesystem: supergroup=supergroup
12/07/09 03:58:20 INFO namenode.FSNamesystem: isPermissionEnabled=true
12/07/09 03:58:20 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
12/07/09 03:58:20 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
12/07/09 03:58:20 INFO namenode.NameNode: Caching file names occuring more than 10 times 
12/07/09 03:58:21 INFO common.Storage: Image file of size 112 saved in 0 seconds.
12/07/09 03:58:23 INFO common.Storage: Storage directory /fs/hadoop/tmp/dfs/name has been successfully formatted.
12/07/09 03:58:23 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at linaro-server/127.0.1.1
************************************************************/

Now it's time to start our single-node cluster. Run this as hduser:

$ /usr/local/hadoop/bin/start-all.sh

If all is well you’ll see something like the following:

hduser@linaro-server:~$ /usr/local/hadoop/bin/start-all.sh
Warning: $HADOOP_HOME is deprecated.

starting namenode, logging to /usr/local/hadoop-1.0.3/libexec/../logs/hadoop-hduser-namenode-linaro-server.out
localhost: starting datanode, logging to /usr/local/hadoop-1.0.3/libexec/../logs/hadoop-hduser-datanode-linaro-server.out
localhost: starting secondarynamenode, logging to /usr/local/hadoop-1.0.3/libexec/../logs/hadoop-hduser-secondarynamenode-linaro-server.out
starting jobtracker, logging to /usr/local/hadoop-1.0.3/libexec/../logs/hadoop-hduser-jobtracker-linaro-server.out
localhost: starting tasktracker, logging to /usr/local/hadoop-1.0.3/libexec/../logs/hadoop-hduser-tasktracker-linaro-server.out
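
If curl is installed, another quick check is that the built-in web interfaces are answering; in Hadoop 1.x the NameNode UI listens on port 50070 by default and the JobTracker UI on port 50030:

$ curl -s http://localhost:50070/ | head -5
$ curl -s http://localhost:50030/ | head -5

Both should return a few lines of HTML rather than a connection error.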

Last but not least, the jps command will show you the Hadoop processes. Though technically, the jps tool shows you the Java processes on the system.
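
On this single-node setup, jps should list something like the following (the PIDs will of course differ):

$ jps
2287 NameNode
2422 DataNode
2561 SecondaryNameNode
2700 JobTracker
2843 TaskTracker
2971 Jps

When you're done experimenting, the matching stop script shuts the whole thing back down:

$ /usr/local/hadoop/bin/stop-all.sh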
