Getting beyond WordCount….

There are naysayers and doubters out there who put across the perception that Hadoop is a one-trick pony that sorts out text. It’s easy to see why, with most of the examples gravitating towards word counts, mining social media data (i.e. text) and so on. In this walkthrough I’m going to attempt to show that Hadoop can be helpful to business from a sales point of view too.

Welcome to my fictitious coffee shop!

It sells lattes and nothing else. I mean, if you have lattes, what else do you need?

A coffee shop, yesterday.


I’ve got a customer loyalty system in operation, and now that I’ve been open a year I want to start getting some insight from my data. All my sales data is output as CSV files. I’ve got 20,000 customers (yup, some coffee shop I’ve got here) but here are the first ten so you get an idea.


The first number you see is the customer id number. The next 13 numbers are the previous 13 months of sales for that customer: how many coffees they purchased each month. Why 13 months? That will become apparent in a moment.
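Here’s a minimal sketch of how one line of that file could be read into a customer id and 13 monthly values. It assumes plain comma-separated fields with no header row and no quoting. The class name SalesLineParser, the CustomerSales holder and the sample line are all made up for illustration; none of them come from the real data.

```java
import java.util.Arrays;

public class SalesLineParser {

    /** Holds one customer's id and their 13 monthly purchase counts. */
    static final class CustomerSales {
        final int customerId;
        final double[] months; // months[0..11] = months 1-12, months[12] = month 13

        CustomerSales(int customerId, double[] months) {
            this.customerId = customerId;
            this.months = months;
        }
    }

    /** Parses "id,m1,m2,...,m13" (assumed layout: 14 comma-separated fields). */
    static CustomerSales parse(String line) {
        String[] fields = line.trim().split(",");
        if (fields.length != 14) {
            throw new IllegalArgumentException(
                "Expected 14 fields but got " + fields.length);
        }
        int id = Integer.parseInt(fields[0].trim());
        double[] months = new double[13];
        for (int i = 0; i < 13; i++) {
            months[i] = Double.parseDouble(fields[i + 1].trim());
        }
        return new CustomerSales(id, months);
    }

    public static void main(String[] args) {
        // Made-up line, not taken from the real sales file.
        String line = "99,4,6,5,7,3,8,6,5,4,7,6,5,2";
        CustomerSales c = parse(line);
        System.out.println("User id: " + c.customerId);
        System.out.println("Months 1-12: "
            + Arrays.toString(Arrays.copyOfRange(c.months, 0, 12)));
        System.out.println("Month 13: " + c.months[12]);
    }
}
```

Keeping month 13 in the same array but at a known index makes it easy to average months 1 to 12 and compare month 13 against that average later.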

First things first, stakeholder questions!

Before code, before Hadoop clusters, before drowning ourselves in more coffee, we need a set of outline questions to aim towards: things we need to learn from the data we have. So let’s get some questions down.

1. How many lattes do my customers purchase?

2. What’s the month on month variation on sales?

3. If there’s a drop in sales by how much was it?

4. What’s the duration of that sales drop?

From that we can craft some action plans.

1. Calculate the average (months 1 – 12); that gives us a good baseline number.

2. Calculate the variance on the monthly purchases.

3. Take the month 13 sales and subtract that from the average month sales (1 – 12).

4. Show the number of months where the sales were 40% less than the average.

Time to craft some code….

Spot the old dullard in the room.

Yes, spot old farty pants Jase, who’s still coding in Java when all the cool kids are coding in Clojure, Scala and Python. Well, I don’t really mind what you program it in, but here it’s Java 🙂

The approach will be this. Part 1, get the very basics of our algorithm down. Part 2, look at adapting it into a MapReduce job. Part 3, get Hadoop to process all my customers.

We need a small sample set of our data to test our simple code against. I really don’t want to experiment against all 20,000 customers each time. With a simple Unix command we can sort that out.

head -n 20 salesdata.csv > sample.csv

The head command outputs the first 20 lines of the file and I then redirect that output to a new file called sample.csv. Job done.

Calculating the mean and the variance

Cast your mind back to maths classes of years gone by. Calculating the mean is achieved by adding all the numbers together and then dividing by the number of elements you added.

So user number 1 has the monthly values of:


Adding all those together and dividing by 12 gives me a mean of 8.75.

In code it looks like:

double getMean() {
  // data holds the monthly values, size is how many there are
  double sum = 0.0;
  for(double a : data) {
     sum += a;
  }
  return sum/size;
}

The variance is calculated by working out (mean – month value) * (mean – month value) for each of the months, adding those results together and then dividing by the number of months.

In code this looks like:

double getVariance() {
   double mean = getMean();
   // running total of the squared differences from the mean
   double temp = 0;
   for(double a : data) {
      temp += (mean - a)*(mean - a);
   }
   return temp/size;
}

Two functions that give us some reliable numbers to work from.  Before you know it we’ve got two items checked off the list. Now for the final two.
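To see the two calculations running together, here’s a small self-contained sketch. The class name MeanVarianceDemo and the twelve values are made up for illustration and are not from the sales file. The methods take the array as a parameter instead of using data and size fields, which is an assumption about how you might want to wire it up.

```java
public class MeanVarianceDemo {

    /** Sum of all values divided by how many there are. */
    static double mean(double[] data) {
        double sum = 0.0;
        for (double a : data) {
            sum += a;
        }
        return sum / data.length;
    }

    /** Average squared distance from the mean (divides by n, not n - 1). */
    static double variance(double[] data) {
        double mean = mean(data);
        double sumOfSquares = 0.0;
        for (double a : data) {
            sumOfSquares += (mean - a) * (mean - a);
        }
        return sumOfSquares / data.length;
    }

    public static void main(String[] args) {
        // Made-up monthly purchases for months 1-12.
        double[] data = {3, 7, 3, 7, 3, 7, 3, 7, 3, 7, 3, 7};
        System.out.println("Mean: " + mean(data));         // prints Mean: 5.0
        System.out.println("Variance: " + variance(data)); // prints Variance: 4.0
    }
}
```

Dividing by the number of values gives the population variance, which matches the functions above. If you treat the 12 months as a sample of a longer history you would divide by n - 1 instead.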

Completing the rest of the task….

We have our functions for the mean and variance so it’s a fairly trivial task to get the other functions complete.

Taking the month 13 sales away from the mean:

return (mean - sales);

And the number of months where the sales were below 40% of the mean (each month is checked against mean * 0.40):

int count = 0;
for(double a : data){
  if(a < (mean * 0.40)) count++;
}
return count;

That’s it, we’re done.  A quick test against our sample data and we come out looking nice.

User id: 1
Mean: 8.75
Variance: 8.520833333333334
Month 13 Sales Drop = 3.75
Months 40% below avg: 1
User id: 2
Mean: 5.916666666666667
Variance: 22.743055555555557
Month 13 Sales Drop = -8.083333333333332
Months 40% below avg: 4
User id: 3
Mean: 7.583333333333333
Variance: 21.243055555555557
Month 13 Sales Drop = 5.583333333333333
Months 40% below avg: 4
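If you want to try the last two calculations yourself, here’s a rough sketch that produces the same style of report for one customer. The class name LoyaltyReport, the method names and the 13 values are all made up for illustration; the real sample data will give different numbers. It assumes the 13 monthly values sit in one array with month 13 last.

```java
import java.util.Arrays;

public class LoyaltyReport {

    /** Mean of months 1-12 (the first 12 entries of the 13-month array). */
    static double meanOfFirstTwelve(double[] months) {
        return Arrays.stream(months, 0, 12).average().orElse(0.0);
    }

    /** Month 13 drop: positive means month 13 came in under the average. */
    static double salesDrop(double mean, double month13) {
        return mean - month13;
    }

    /** Counts months 1-12 whose sales were below the given fraction of the mean. */
    static int monthsBelow(double[] months, double mean, double fraction) {
        int count = 0;
        for (int i = 0; i < 12; i++) {
            if (months[i] < mean * fraction) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up values: months 1-12 followed by month 13.
        double[] months = {8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 2, 14, 5};
        double mean = meanOfFirstTwelve(months);
        System.out.println("Mean: " + mean);
        System.out.println("Month 13 Sales Drop = " + salesDrop(mean, months[12]));
        System.out.println("Months 40% below avg: " + monthsBelow(months, mean, 0.40));
    }
}
```

With these made-up values the mean is 8.0, the month 13 drop is 3.0 and one month falls below 40% of the mean. A negative drop, like user 2 in the output above, means month 13 came in higher than the average.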

So we’re getting some insight into our data now. This is a good start and it’s also a good place to stop for this section.

Next time….

The next step is to create a MapReduce job that we can execute via Hadoop, so we can work on all 20,000+ customers (and beyond, I have big plans for 2M plus customers) and get insight across the whole loyalty system.