Hadoop: the technology that has caught the eye of the big data marketers. It’s a simple concept, but one that has changed the way we process data.

Setting up a single node cluster – configuration

I would wager that for most users and businesses a single node cluster will be fine. Running things in local mode (as I did with Surgeon scripts) keeps things simple. There are times, though, when you just want to do things properly.

Assuming we have the tar file downloaded from the Apache mirrors, we can uncompress it in a directory of our choosing.

tar xvzf hadoop-x.x.x-bin.tar.gz

In the configuration directory (called “conf”) open hadoop-env.sh and edit the JAVA_HOME line:

export JAVA_HOME=/path/to/wherever/it/is

Also, a part that’s very rarely mentioned: if your sshd listens on a different port (most run on 22, and the paranoid among us run it on another) then you’ll also need to uncomment the HADOOP_SSH_OPTS line in the same file and amend it so it reads:

export HADOOP_SSH_OPTS="-p <my ssh port>"

(changing <my ssh port> to the actual port number that your sshd accepts logins on)
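If you can’t remember which port your sshd listens on, something along these lines can pull it out of the config for you. It’s only a sketch: sshd_port is a helper name I’ve made up, it assumes the config lives at /etc/ssh/sshd_config and is readable, and it ignores Include files and Match blocks.

```shell
# Hypothetical helper (not part of Hadoop or OpenSSH): print the first
# uncommented Port directive from an sshd_config file, falling back to 22
# (sshd's default) when none is set or the file cannot be read.
sshd_port() {
  config="${1:-/etc/ssh/sshd_config}"
  port=$(awk 'tolower($1) == "port" { print $2; exit }' "$config" 2>/dev/null || true)
  echo "${port:-22}"
}

# HADOOP_SSH_OPTS is the variable Hadoop's scripts pass on to ssh.
export HADOOP_SSH_OPTS="-p $(sshd_port)"
```

Do check the value it comes up with before trusting it; if the file isn’t readable you’ll simply get 22 back.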

Next, create an RSA key on your machine, making sure it has an empty passphrase.

ssh-keygen -t rsa -P ''

That should save the private key in your home directory as ~/.ssh/id_rsa, with the public key alongside it as ~/.ssh/id_rsa.pub.

Now append the public key to the authorised keys on the same machine.

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
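One thing that can still trip up a passwordless login is file permissions: with its default StrictModes setting, sshd will refuse key logins if ~/.ssh or the authorized_keys file is writable by anyone else. A minimal sketch of tightening things up, assuming the default ~/.ssh location (SSH_DIR is just a hypothetical override for the example):

```shell
# SSH_DIR is a hypothetical override for this example; by default we use ~/.ssh.
ssh_dir="${SSH_DIR:-$HOME/.ssh}"

# Make sure the directory and the authorised keys file exist.
mkdir -p "$ssh_dir"
touch "$ssh_dir/authorized_keys"

# Only the owner may read, write or enter the directory, and only the owner
# may read or write the keys file.
chmod 700 "$ssh_dir"
chmod 600 "$ssh_dir/authorized_keys"
```

After that, running ssh localhost (adding -p and your port number if you changed it) should log you in without asking for a password.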

Add Hadoop’s bin directory to your PATH.

export PATH=$PATH:/path/to/hadoop/bin

Last thing to do is to define Hadoop’s file system settings in conf/core-site.xml
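The exact contents depend on your setup, but as a minimal sketch the file only needs to tell Hadoop where the default filesystem lives. The snippet below writes one from the shell. It assumes you are in the unpacked Hadoop directory, and the hdfs://localhost:9000 address is my assumption for a single local node rather than anything Hadoop insists on. fs.default.name is the property name used by the releases that ship a conf directory (later releases call it fs.defaultFS). Be warned that it overwrites any existing core-site.xml.

```shell
# Assumption: run from the unpacked Hadoop directory, so conf/ is relative to it.
# HADOOP_CONF_DIR is used here only as a convenient override.
conf_dir="${HADOOP_CONF_DIR:-conf}"
mkdir -p "$conf_dir"

# fs.default.name tells Hadoop which filesystem to use by default.
# hdfs://localhost:9000 is an assumed address for a namenode on this machine.
cat > "$conf_dir/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF
```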


Formatting the HDFS Filesystem

From the command line run:

hadoop namenode -format

If you’re reformatting the namenode and are prompted Y or N to proceed, make sure you enter “Y” and not “y”; it’s case sensitive. That’s tripped me up many a time.

Starting and Stopping Hadoop

Nice and simple really.

To start:

start-all.sh

And to stop:

stop-all.sh

Process list of a basic job

Start up the server:

Copy the data you want to process to the HDFS filesystem:

hadoop fs -put mydata.txt mydata.txt

Run the Hadoop job:

hadoop jar /path/to/my/jar.jar --input mydata.txt --output output

Hadoop will then process the job. To see the results we need to copy the result data from HDFS back to our local filesystem.

hadoop fs -getmerge output output.txt

Stop the server (no point wasting memory).
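If you find yourself running the same sequence over and over, the middle three steps wrap up nicely in a small shell function. This is only a sketch: run and run_job are names I’ve made up, and the --input and --output arguments are simply the ones my example jar expects, so adjust them to suit yours.

```shell
# Hypothetical wrapper (not part of Hadoop): with DRY_RUN=1 it only prints
# the command, otherwise it runs it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# run_job JAR INPUT OUTPUT_DIR RESULT_FILE
# Copies INPUT into HDFS, runs the job, then merges the job's output files
# into RESULT_FILE on the local filesystem. Stops at the first failing step.
run_job() {
  jar=$1
  input=$2
  output=$3
  result=$4
  run hadoop fs -put "$input" "$input" &&
  run hadoop jar "$jar" --input "$input" --output "$output" &&
  run hadoop fs -getmerge "$output" "$result"
}
```

With the server started, run_job /path/to/my/jar.jar mydata.txt output output.txt does the lot. Prefix it with DRY_RUN=1 to just print the commands without running them.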

Next time I’ll cover something a bit more meaty, but I get asked this one a lot, so I thought I’d cover it now.