Friday, March 23, 2012

Setting up a local hadoop cluster

I am running a hadoop cluster on my local machine, for development purposes, in pseudo-distributed mode. In pseudo-distributed mode all the hadoop daemons run on a single machine, but each daemon runs in its own process.
Just thought I would outline the basic setup process here, in case someone else finds it handy. It is actually quite straightforward to get a hadoop cluster running locally. Here are the basic steps:
  • You need to download a hadoop distribution. The one I am using is from Cloudera, and you can obtain a tarball from here.
  • Expand the tarball into some convenient directory on your local machine. You will find a bin directory which contains the scripts for managing hadoop, and a conf directory where the hadoop configuration files are located. You will have to modify core-site.xml, hdfs-site.xml, and mapred-site.xml to specify the name node, replication, etc. and customize your cluster setup. More on this at the bottom.
  • You need to ensure password-less remote login, that is, check that you can do ssh localhost without being asked for a password. If you already have a public key, you can run the following command (assuming a dsa key) to enable password-less local ssh login; if you don't have a key pair yet, see the key-generation sketch after this list.

    cat $HOME/.ssh/id_dsa.pub >> $HOME/.ssh/authorized_keys

  • Set the following environment variables:

    HADOOP_HOME: Path to hadoop directory

    HADOOP_CONF_DIR: Path to hadoop conf directory

    Also add the hadoop bin directory to your path (see the export sketch after this list).

  • Create a hadooptemp directory under /tmp, matching the hadoop.tmp.dir setting in core-site.xml below, using the following command:

    sudo mkdir -m 777 /tmp/hadooptemp

  • Format the name node:

    hadoop namenode -format

  • Bring up the pseudo-distributed cluster by invoking start-all.sh. You can stop it again using stop-all.sh.
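
If you do not already have a key pair for the ssh step above, here is a minimal sketch for generating a passphrase-less dsa key and authorizing it for local login; the key type and file names are just the usual defaults, nothing hadoop-specific:

    # generate a dsa key with an empty passphrase
    ssh-keygen -t dsa -P '' -f $HOME/.ssh/id_dsa
    # authorize it for password-less login to localhost
    cat $HOME/.ssh/id_dsa.pub >> $HOME/.ssh/authorized_keys
    chmod 600 $HOME/.ssh/authorized_keys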
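
For the environment variables step, here is a sketch of the shell profile additions, assuming the tarball was expanded into $HOME/hadoop (adjust the path to wherever you put it):

    # adjust HADOOP_HOME to wherever you expanded the tarball
    export HADOOP_HOME=$HOME/hadoop
    export HADOOP_CONF_DIR=$HADOOP_HOME/conf
    # put the hadoop scripts on the path
    export PATH=$PATH:$HADOOP_HOME/bin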
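
After start-all.sh finishes, you can check that all the daemons came up using jps. In pseudo-distributed mode you should see one java process each for the name node, data node, secondary name node, job tracker, and task tracker; the output looks something like this (the process ids will differ, of course):

    jps
    # 12305 NameNode
    # 12412 DataNode
    # 12518 SecondaryNameNode
    # 12627 JobTracker
    # 12741 TaskTracker
    # 12850 Jps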
You can set up configuration overrides for hadoop following the examples below.

An example core-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
  </property>

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadooptemp/${user.name}</value>
  </property>

</configuration>
An example hdfs-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>

  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>

  <property>
    <!-- specify this so that running 'hadoop namenode -format' formats the right dir -->
    <name>dfs.name.dir</name>
    <value>/tmp/hadoop/cache/hadoop/dfs/name</value>
  </property>

</configuration>
And an example mapred-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>

  <property>
    <name>mapreduce.fileoutputcommitter.marksuccessfuljobs</name>
    <value>true</value>
  </property>

</configuration>
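
Once everything is up and configured, a quick smoke test is to copy a file into HDFS and list it back; the paths below are just examples:

    # create a home directory in HDFS and copy a local file into it
    hadoop fs -mkdir /user/$USER
    hadoop fs -put $HADOOP_CONF_DIR/core-site.xml /user/$USER/
    hadoop fs -ls /user/$USER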



Happy hacking with hadoop!
