I am running a hadoop cluster on my local machine, for development purposes, in pseudo-distributed mode. In pseudo-distributed mode, each hadoop daemon runs in its own separate process.
Just thought I would outline the basic setup process here, in case someone else finds it handy. It is actually quite straightforward to get a hadoop cluster running locally. Here are the basic steps:
- You need to download a hadoop implementation. The one I am using is from Cloudera, and you can obtain a tarball from here.
- Expand the tarball into some convenient directory on your local machine. You will find a bin directory which contains the scripts for managing hadoop, and a conf directory where the hadoop configurations are located. You will have to modify core-site.xml, hdfs-site.xml, and mapred-site.xml to specify the name node, replication, etc. to customize your cluster setup. More on this at the bottom.
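For example (the tarball name and target directory here are just placeholders for whatever you downloaded and wherever you want hadoop to live):
mkdir -p $HOME/opt
tar xzf hadoop-*.tar.gz -C $HOME/opt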
- You need to ensure password-less remote login, that is, check that you can do ssh localhost without being asked for a password. If you already have a public key, you can run the following command (assuming a DSA key) to enable password-less local ssh login.
cat $HOME/.ssh/id_dsa.pub >> $HOME/.ssh/authorized_keys
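If you do not have a key pair yet, you can generate a passphrase-less DSA key first with:
ssh-keygen -t dsa -P '' -f $HOME/.ssh/id_dsa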
- Set the following environment variables, and also add the hadoop bin directory to your PATH:
HADOOP_HOME: Path to hadoop directory
HADOOP_CONF_DIR: Path to hadoop conf directory
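For example, in your shell profile (the paths below assume hadoop was expanded into $HOME/opt/hadoop; adjust them to wherever you put it):
export HADOOP_HOME=$HOME/opt/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/conf
export PATH=$PATH:$HADOOP_HOME/bin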
- Create the hadooptemp directory (see hadoop.tmp.dir in core-site.xml below) using the following command:
sudo mkdir -m 777 /tmp/hadooptemp
- Format the name node:
hadoop namenode -format
- Bring up pseudo-distributed mode by invoking start-all.sh. You can stop it again using stop-all.sh.
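After start-all.sh, a quick way to verify that the daemons actually came up is jps:
start-all.sh
jps   # should list NameNode, SecondaryNameNode, DataNode, JobTracker and TaskTracker
stop-all.sh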
You can set up override configurations for hadoop following the examples below.
An example core-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadooptemp/${user.name}</value>
  </property>
</configuration>
An example hdfs-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <property>
    <!-- specify this so that running 'hadoop namenode -format' formats the right dir -->
    <name>dfs.name.dir</name>
    <value>/tmp/hadoop/cache/hadoop/dfs/name</value>
  </property>
</configuration>
And an example mapred-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
  <property>
    <name>mapreduce.fileoutputcommitter.marksuccessfuljobs</name>
    <value>true</value>
  </property>
</configuration>
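Once the daemons are up, a quick smoke test is to push a file into HDFS and read it back (the file and directory names here are arbitrary):
hadoop fs -mkdir /smoketest
hadoop fs -put $HADOOP_CONF_DIR/core-site.xml /smoketest/
hadoop fs -ls /smoketest
hadoop fs -cat /smoketest/core-site.xml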
Happy hacking with hadoop!