Just jumping in on part 3? You can read part 1 and part 2 to get you up to speed.

Dovima with elephants, Evening dress by Dior, Cirque d'Hiver, Pa

The picture reference gag could be wasted on some….

Bringing in Hadoop

So far we’ve used Spring XD to pull a stream of tweets into a file. And, normally, at this point we’d be looking to copy the file into HDFS and then run some MapReduce job on it.

Historically you’d be saying:

hadoop fs -put /tmp/xd/output/tweetFashion.out tweetFashion.out

Run the job and then pull the results from HDFS back into a file again.

hadoop fs -getmerge output output.txt

If you need a reminder on setting up a single node Hadoop cluster then you can read a quick intro here.

Spring XD and Hadoop

By creating a stream from within the shell we can send the data straight into HDFS without the hassle of doing it manually.

First we must tell Spring XD to default to Hadoop 1.2.1 (the version used in XD build M4 at the time of writing).

Assuming that Hadoop is installed and running you can run the single command:

xd:>stream create --name tweetLouboutins --definition "twitterstream --track='#fashion' | twitterstreamtransformer | hdfs"

With the rollover parameter we can ensure that periodic writes are happening within HDFS and not too much data is being stored in memory.

xd:>stream create --name tweetLouboutins --definition "twitterstream --track='#fashion' | twitterstreamtransformer | hdfs --rollover=1M"

Within the HDFS directory you’ll start to see the data appear very much in the same way as it was being dumped out to the tmp directory.

The MapReduce Job

The Mapper and Reducer are a very simple word-count demo, but looking for hashtags instead of words. The difference here is the output: I don’t want tab-delimited, as I’m not a great fan of it; I want a comma-separated file.
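The heart of that mapper is just pulling hashtags out of the tweet text rather than splitting on every word. A minimal sketch of the extraction step (the class name and regex here are illustrative, not the exact code from the job):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative helper: pulls hashtags out of a tweet's text so the
// mapper can emit (hashtag, 1) pairs instead of (word, 1).
public class HashtagExtractor {
    private static final Pattern HASHTAG = Pattern.compile("#\\w+");

    public static List<String> extract(String tweetText) {
        List<String> tags = new ArrayList<>();
        Matcher m = HASHTAG.matcher(tweetText);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        System.out.println(extract("Loving the new #AlexanderWang bag #fashion"));
        // [#AlexanderWang, #fashion]
    }
}
```

The mapper would then loop over the extracted tags and write each one out with a count of 1, exactly as the classic word count does.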

The property you need to set depends on the version of Hadoop you are using, so the easiest way to sort this is to cover all the bases.

static void setTextoutputformatSeparator(final Job job, final String separator) {
    final Configuration conf = job.getConfiguration();
    conf.set("mapred.textoutputformat.separator", separator);          // Hadoop 1.x
    conf.set("mapreduce.textoutputformat.separator", separator);       // Hadoop 2.x (YARN)
    conf.set("mapreduce.output.textoutputformat.separator", separator);
    conf.set("mapreduce.output.key.field.separator", separator);
    conf.set("mapred.textoutputformat.separatorText", separator);      // just in case
}
Then within the job definition code:
Job job = new Job();
job.setJarByClass(TwitterHashtagJob.class);
setTextoutputformatSeparator(job, ",");
When this job is run the output will be comma separated.
#Aksesuar,1
#AkshayJewellery,1
#AlSalamanty,1
#Alabama,1
#Aladin,1
#AlaskanWomen,2
#Alaventa,1
#Albright,8
#Albuquerque,1
#Alcohol,1
#Aldo,1
#AldoShoes,2
#AlecBaldwin,3
#Alejandro…,1
#Alencon,1
#Alert,2
#AlessandraAmbrosio!,2
#AlessiaMarcuzzi,2
#AlexSekella,1
#AlexTurner,3
#Alexa,3
#AlexaChung,8
#AlexaChungIt,1
#AlexanderMcQueen,7
#AlexanderWang,7
#Alexandra,1
#Alexandria,2
#Alexis,1
#Alfani,1
#Alfombraroja,1
#AlfredDunner,1
#Ali,1
#Alice,1

With a little command-line refining we can start to pick out the data we actually want; a visualisation of every single hashtag would be far from pleasing on the eye. For example, I want to see the counts for Asos, Primark and TopShop.

jason@bigdatagames:~$ egrep "#(asos|primark|topshop)," output20131117.txt 
#asos,42
#primark,16
#topshop,64
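The same filtering could be done in code if you’d rather not shell out to egrep. A hedged sketch that parses the comma-separated output for a handful of hashtags (the class name and the hard-coded lines are just the example values from above):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative: filter the comma-separated job output for a chosen set
// of hashtags, the programmatic equivalent of the egrep above.
public class HashtagFilter {
    public static Map<String, Integer> filter(List<String> csvLines, List<String> wanted) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : csvLines) {
            int comma = line.lastIndexOf(',');
            if (comma < 0) continue;                       // skip malformed lines
            String tag = line.substring(0, comma);
            if (wanted.contains(tag)) {
                counts.put(tag, Integer.parseInt(line.substring(comma + 1)));
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("#asos,42", "#primark,16", "#topshop,64", "#Aldo,1");
        System.out.println(filter(lines, List.of("#asos", "#topshop")));
        // {#asos=42, #topshop=64}
    }
}
```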

Next time….

We’ll get into the visualisation side of things and also look at maintaining a process to keep the output up to date.
