Friday, March 23, 2012

Setting up a local hadoop cluster

I am running a hadoop cluster on my local machine, for development purposes, in pseudo-distributed mode. In pseudo-distributed mode all the hadoop daemons run on a single machine, but each daemon runs in its own process.
Just thought I would outline the basic setup process here, in case someone else finds it handy. It is actually quite straightforward to get a hadoop cluster running locally. Here are the basic steps:
  • You need to download a hadoop distribution. The one I am using is from Cloudera, and you can obtain a tarball from here.
  • Expand the tarball into some convenient directory on your local machine. You will find a bin directory which contains the scripts for managing hadoop, and a conf directory where the hadoop configuration files are located. You will have to modify core-site.xml, hdfs-site.xml, and mapred-site.xml to specify the name node, replication, etc. and customize your cluster setup. More on this at the bottom.
  • You need to ensure password-less remote login, that is, check that you can do ssh localhost without being asked for a password. If you already have a public key, you can run the following command (assuming a dsa key) to enable password-less local ssh login; if you don't have a key pair yet, see the key-generation sketch after this list.

    cat $HOME/.ssh/id_dsa.pub >> $HOME/.ssh/authorized_keys

  • Set the following environment variables:

    HADOOP_HOME: Path to hadoop directory

    HADOOP_CONF_DIR: Path to hadoop conf directory

    Also add the hadoop bin directory to your path (see the export sketch after this list).

  • Create a hadooptemp directory under /tmp, matching the hadoop.tmp.dir setting in core-site.xml below, using the following command:

    sudo mkdir -m 777 /tmp/hadooptemp

  • Format the name node:

    hadoop namenode -format

  • Bring up the pseudo-distributed cluster by invoking start-all.sh. You can stop it again using stop-all.sh.
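
If you do not already have a key pair for the ssh step above, here is a minimal sketch for generating a passphrase-less dsa key and authorizing it for local login; the key type and file names are just the usual defaults, nothing hadoop-specific:

    # generate a dsa key with an empty passphrase
    ssh-keygen -t dsa -P '' -f $HOME/.ssh/id_dsa
    # authorize it for password-less login to localhost
    cat $HOME/.ssh/id_dsa.pub >> $HOME/.ssh/authorized_keys
    chmod 600 $HOME/.ssh/authorized_keys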
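
For the environment variables step, here is a sketch of the shell profile additions, assuming the tarball was expanded into $HOME/hadoop (adjust the path to wherever you put it):

    # adjust HADOOP_HOME to wherever you expanded the tarball
    export HADOOP_HOME=$HOME/hadoop
    export HADOOP_CONF_DIR=$HADOOP_HOME/conf
    # put the hadoop scripts on the path
    export PATH=$PATH:$HADOOP_HOME/bin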
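
After start-all.sh finishes, you can check that all the daemons came up using jps. In pseudo-distributed mode you should see one java process each for the name node, data node, secondary name node, job tracker, and task tracker; the output looks something like this (the process ids will differ, of course):

    jps
    # 12305 NameNode
    # 12412 DataNode
    # 12518 SecondaryNameNode
    # 12627 JobTracker
    # 12741 TaskTracker
    # 12850 Jps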
You can set up configuration overrides for hadoop following the examples below.

An example core-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
  </property>

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadooptemp/${user.name}</value>
  </property>

</configuration>
An example hdfs-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>

  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>

  <property>
    <!-- specify this so that running 'hadoop namenode -format' formats the right dir -->
    <name>dfs.name.dir</name>
    <value>/tmp/hadoop/cache/hadoop/dfs/name</value>
  </property>

</configuration>
And an example mapred-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>

  <property>
    <name>mapreduce.fileoutputcommitter.marksuccessfuljobs</name>
    <value>true</value>
  </property>

</configuration>
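
Once everything is up and configured, a quick smoke test is to copy a file into HDFS and list it back; the paths below are just examples:

    # create a home directory in HDFS and copy a local file into it
    hadoop fs -mkdir /user/$USER
    hadoop fs -put $HADOOP_CONF_DIR/core-site.xml /user/$USER/
    hadoop fs -ls /user/$USER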



Happy hacking with hadoop!
