A quick recap

In part 1 of the tutorial we got going with some sample data and crafted the basics of the algorithms we need to run against our data. Next we’re going to take that logic and move it into a MapReduce job so it can be run with Hadoop.

More coffee may be required…


Don’t know about Hadoop?

If you don’t know Hadoop, how to install it, or how to use it, then here’s a three-step plan:

Logic into MapReduce

We have to think about a few things here. Our CSV file of users has one row per user, so there’s no need to reduce anything to a single answer. We’ve already got all the consolidated numbers; it’s just a case of working through every customer and applying the previous logic.

I’m not going to complicate matters either. There’s text coming in to the Mapper and there’s text going out of it.

public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> { }

All the previous logic is within this class. You can view the full code here on GitHub.
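If you don’t have the repository to hand, here’s a rough sketch of the kind of per-line work the mapper does. It isn’t the code from GitHub. It’s plain Java with no Hadoop in it, and it makes some loud assumptions: each row is a user id followed by 13 monthly sales figures, the mean and variance cover the first 12 months (population variance), and the “drop” is that mean minus month 13. The names SalesLineSketch and processLine are made up for illustration.

```java
// Hypothetical stand-in for the per-customer logic; not the code from the repository.
class SalesLineSketch {

    // Assumed row layout: userid, then 13 monthly sales figures, comma-separated.
    static String processLine(String csvLine) {
        String[] fields = csvLine.trim().split(",");
        if (fields.length != 14) {
            throw new IllegalArgumentException(
                "expected userid plus 13 months, got " + fields.length + " fields");
        }
        String userId = fields[0].trim();

        // Months 1 to 12 form the baseline year.
        double[] firstYear = new double[12];
        double sum = 0.0;
        for (int i = 0; i < 12; i++) {
            firstYear[i] = Double.parseDouble(fields[i + 1].trim());
            sum += firstYear[i];
        }
        double month13 = Double.parseDouble(fields[13].trim());

        double mean = sum / 12;
        double squaredDiffs = 0.0;
        int monthsBelowMean = 0;
        for (double sales : firstYear) {
            squaredDiffs += (sales - mean) * (sales - mean);
            if (sales < mean) {
                monthsBelowMean++;
            }
        }
        double variance = squaredDiffs / 12; // assumption: population variance
        double drop = mean - month13;        // assumption: positive means month 13 fell short

        // One tab-separated line of text out for every line of text in.
        return userId + "\t" + mean + "\t" + variance + "\t" + drop + "\t" + monthsBelowMean;
    }

    public static void main(String[] args) {
        System.out.println(processLine("1,10,10,10,10,10,10,10,10,10,10,10,10,4"));
    }
}
```

Inside a real Mapper the same method would be called from map(), with the result handed to the OutputCollector instead of being printed.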

One interesting note: we’re not reducing anything, just mapping each line to be processed, so it’s worth telling the job configuration that we’re not attempting to reduce.
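As a minimal sketch of what that looks like with the old org.apache.hadoop.mapred API (the one MapReduceBase belongs to), the driver below sets the number of reduce tasks to zero. The class name MapOnlyJobSketch, the job name and the pass-through mapper body are placeholders of mine, not the code from the repository, and you’ll need the Hadoop jars on the classpath to compile it.

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

// Hypothetical driver; class and job names are placeholders.
public class MapOnlyJobSketch {

    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        @Override
        public void map(LongWritable offset, Text line,
                        OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            // Placeholder only: key on the first CSV field and pass the line through.
            String userId = line.toString().split(",", 2)[0];
            output.collect(new Text(userId), line);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: MapOnlyJobSketch <input dir> <output dir>");
            System.exit(1);
        }
        JobConf conf = new JobConf(MapOnlyJobSketch.class);
        conf.setJobName("sales-map-only-sketch");

        conf.setMapperClass(Map.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);

        // No reducers: mapper output is written straight to the output directory.
        conf.setNumReduceTasks(0);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
```

With zero reduce tasks Hadoop skips the sort and shuffle stages entirely, which is all we need when every input row produces its own finished result.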



Running the job

I’ve exported the code base out to a jar file (File -> Export in Eclipse) and moved my CSV file into a directory called input (I’m not using HDFS as it’s a small job compared to most).

To run the job:

hadoop jar sdhadoop.jar uk.co.dataissexy.hadoopsales.part2.SalesMRJob input output

This tells Hadoop which jar file to use and which job to run (SalesMRJob). The data to work on is in a directory called “input” and Hadoop will dump the results in a directory called “output”.

The verbose output while the job is running is worth checking:

13/10/21 06:59:11 INFO mapred.JobClient: Map-Reduce Framework
13/10/21 06:59:11 INFO mapred.JobClient: Map input records=20000
13/10/21 06:59:11 INFO mapred.JobClient: Spilled Records=0
13/10/21 06:59:11 INFO mapred.JobClient: Total committed heap usage (bytes)=128974848
13/10/21 06:59:11 INFO mapred.JobClient: Map input bytes=715595
13/10/21 06:59:11 INFO mapred.JobClient: SPLIT_RAW_BYTES=100
13/10/21 06:59:11 INFO mapred.JobClient: Map output records=20000

Pay particular attention to the “Map input records” and “Map output records” lines: how many lines were read in and how many results were saved. We know in this example we’re not reducing anything and the mapper emits one result per row, so they should be the same. In this instance, they are.
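If you’d rather not eyeball the counters every time, a check like this can be scripted. Here’s a small sketch in plain Java that pulls the name=value counters out of log lines shaped like the ones above and compares the two record counts. The class and method names are made up, and the regular expression assumes the log format shown in this post.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for checking JobClient counter lines.
class CounterCheckSketch {

    // Matches e.g. "... INFO mapred.JobClient: Map input records=20000".
    private static final Pattern COUNTER =
        Pattern.compile("INFO mapred\\.JobClient:\\s*(.+?)=(\\d+)\\s*$");

    // Collects every "name=number" counter found in the log text.
    static Map<String, Long> parseCounters(String log) {
        Map<String, Long> counters = new LinkedHashMap<>();
        for (String line : log.split("\\r?\\n")) {
            Matcher m = COUNTER.matcher(line);
            if (m.find()) {
                counters.put(m.group(1).trim(), Long.parseLong(m.group(2)));
            }
        }
        return counters;
    }

    // True when both counters are present and equal.
    static boolean mapCountsMatch(Map<String, Long> counters) {
        Long in = counters.get("Map input records");
        Long out = counters.get("Map output records");
        return in != null && in.equals(out);
    }

    public static void main(String[] args) {
        String log =
            "13/10/21 06:59:11 INFO mapred.JobClient: Map input records=20000\n"
          + "13/10/21 06:59:11 INFO mapred.JobClient: Map output records=20000\n";
        System.out.println("counts match: " + mapCountsMatch(parseCounters(log)));
    }
}
```

Lines without a counter, such as the “Map-Reduce Framework” group heading, are simply skipped.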

A quick look at the output directory and we have results:

1 8.75 8.520833333333334 3.75 1
2 5.916666666666667 22.743055555555557 -8.083333333333332 4
3 7.583333333333333 21.243055555555557 5.583333333333333 4
4 7.833333333333333 30.472222222222225 -1.166666666666667 4
5 6.166666666666667 17.305555555555554 0.16666666666666696 2

Each line holds the user id, the mean number of sales, the variance, the sales drop in month 13 compared with the previous year, and the number of months below the mean.
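If you want to pick those columns back up in later code, a small parser does the job. This is only a sketch: the class name SalesResultSketch and its field names are my own, and it assumes the five columns are separated by whitespace (spaces or tabs), in the order described above.

```java
// Hypothetical holder for one line of job output.
class SalesResultSketch {
    final String userId;
    final double meanSales;
    final double variance;
    final double dropInMonth13;
    final int monthsBelowMean;

    SalesResultSketch(String userId, double meanSales, double variance,
                      double dropInMonth13, int monthsBelowMean) {
        this.userId = userId;
        this.meanSales = meanSales;
        this.variance = variance;
        this.dropInMonth13 = dropInMonth13;
        this.monthsBelowMean = monthsBelowMean;
    }

    // Splits on any run of whitespace, so tabs and spaces both work.
    static SalesResultSketch parse(String line) {
        String[] f = line.trim().split("\\s+");
        if (f.length != 5) {
            throw new IllegalArgumentException("expected 5 columns, got " + f.length);
        }
        return new SalesResultSketch(
            f[0],
            Double.parseDouble(f[1]),
            Double.parseDouble(f[2]),
            Double.parseDouble(f[3]),
            Integer.parseInt(f[4]));
    }

    public static void main(String[] args) {
        SalesResultSketch r =
            parse("2 5.916666666666667 22.743055555555557 -8.083333333333332 4");
        System.out.println("user " + r.userId + " was below the mean in "
            + r.monthsBelowMean + " months");
    }
}
```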

Next time…

We’ll modify the mapping job to segment customers into sales buckets depending on our criteria.