The story so far…

This is part 3, so it might be worth reading part 1 and part 2 first. The code for these posts is up on GitHub.

We’ve created lots of customers and monthly data, run a basic set of algorithms over it, and then moved that work into a MapReduce job. Now we’re upping the ante to 2,000,000 customers (yes, it’s some coffee shop) and we’re creaking at the seams.

So who’s doing what?

Let’s pretend I have three mailing lists: one gives coupons to customers with great sales, one gives coupons to incentivise customers whose sales have dropped in more than four months of the previous year and, finally, one targets customers whose sales last month dropped by more than five units compared to their average sales.

Or to put it another way….

1. Where the mean is greater (>) than 9.

2. Where the number of months below is greater (>) than 4.

3. Where the sales drop is greater (>) than 5 against the month 1-12 average.
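As a rough sketch, the three figures behind these rules could be computed like this. It assumes each customer has an array of 13 monthly sales values (months 1-12 followed by last month) and that “months below” means months below that customer’s own mean. Both are assumptions for illustration, and the class and method names are hypothetical rather than taken from the project source.

```java
import java.util.Arrays;

class SegmentRules {
    /** Mean of months 1-12 (array indices 0-11). */
    public static double mean(int[] sales) {
        return Arrays.stream(sales, 0, 12).average().orElse(0.0);
    }

    /** Count of months 1-12 whose sales fall below the customer's own mean (assumed definition). */
    public static long monthsBelow(int[] sales) {
        double m = mean(sales);
        return Arrays.stream(sales, 0, 12).filter(s -> s < m).count();
    }

    /** Month 1-12 average minus last month's sales (array index 12). */
    public static double salesDrop(int[] sales) {
        return mean(sales) - sales[12];
    }
}
```

A customer would then qualify for a list when `mean(sales) > 9`, `monthsBelow(sales) > 4` or `salesDrop(sales) > 5`.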

Introducing MultipleOutputs.

We can introduce a quick way to “segment” these customers with Hadoop. Within the Mapper we can define named outputs that each segment can be written to. This means that while the job is running, the output can be directed to one or more output files. Handy.

private MultipleOutputs<Text,NullWritable> multi = null;

@Override
protected void setup(Context context) {
   multi = new MultipleOutputs<Text,NullWritable>(context);
}
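Two more pieces are normally needed around this setup: the MultipleOutputs instance should be closed in the Mapper’s cleanup() so the segment files are flushed, and the named output has to be registered on the job before it can be written to. The sketch below shows both. The class names, the Mapper’s input types and the key/value classes passed to addNamedOutput are assumptions for illustration, so adjust them to match the real job.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

class SegmentJobSketch {

    // Hypothetical Mapper; input types assume the default TextInputFormat.
    static class SegmentMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        private MultipleOutputs<Text, NullWritable> multi;

        @Override
        protected void setup(Context context) {
            multi = new MultipleOutputs<>(context);
        }

        // map(...) with the segment rules would go here.

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // Flush and close every segment file opened by this task.
            multi.close();
        }
    }

    // Call from the driver before submitting the job.
    static void registerSegments(Job job) {
        // "segments" must match the name used in multi.write(...).
        // Text/Text for key and value is an assumption.
        MultipleOutputs.addNamedOutput(job, "segments",
                TextOutputFormat.class, Text.class, Text.class);
    }
}
```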

Where we would normally write the output to the Mapper’s context, we can now run some conditional statements and direct it to our newly defined multiple output instead. You can read the full source code here.

// The first matching rule wins, so each customer lands in at most one segment.
if (salesdrop > 5) {
  multi.write("segments", userid, userinfo, "salesdrop");
} else if (monthsbelow > 4) {
  multi.write("segments", userid, userinfo, "monthsbelow");
} else if (mean > 9) {
  multi.write("segments", userid, userinfo, "goodsalestorewards");
}
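Because the rules are chained with else if, the order decides which file a customer ends up in when more than one rule matches. A minimal stand-alone sketch of that precedence, with a hypothetical class and method name and no Hadoop dependency, makes it easy to check:

```java
import java.util.Optional;

class SegmentChooser {
    /** Returns the segment file prefix for a customer, or empty if no rule matches. */
    public static Optional<String> choose(double mean, long monthsBelow, double salesDrop) {
        if (salesDrop > 5) {
            return Optional.of("salesdrop");
        }
        if (monthsBelow > 4) {
            return Optional.of("monthsbelow");
        }
        if (mean > 9) {
            return Optional.of("goodsalestorewards");
        }
        return Optional.empty();
    }
}
```

For example, a customer with a mean of 10 and a sales drop of 7 goes to salesdrop only, even though the mean rule also matches. If a customer should appear on every list they qualify for, independent if statements would be needed instead.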

Running the job again with our new Mapper:

hadoop jar sdhadoop.jar input output

The job will run as normal, but now take a look in the output directory.

-rwxrwxrwx 1 Jason staff 0 22 Oct 07:15 _SUCCESS
-rwxrwxrwx 1 Jason staff 1746521 22 Oct 07:15 goodsalestorewards-m-00000
-rwxrwxrwx 1 Jason staff 1711285 22 Oct 07:15 goodsalestorewards-m-00001
-rwxrwxrwx 1 Jason staff 439772 22 Oct 07:15 goodsalestorewards-m-00002
-rwxrwxrwx 1 Jason staff 2432268 22 Oct 07:15 monthsbelow-m-00000
-rwxrwxrwx 1 Jason staff 2428601 22 Oct 07:15 monthsbelow-m-00001
-rwxrwxrwx 1 Jason staff 603182 22 Oct 07:15 monthsbelow-m-00002
-rwxrwxrwx 1 Jason staff 0 22 Oct 07:15 part-m-00000
-rwxrwxrwx 1 Jason staff 0 22 Oct 07:15 part-m-00001
-rwxrwxrwx 1 Jason staff 0 22 Oct 07:15 part-m-00002
-rwxrwxrwx 1 Jason staff 8259779 22 Oct 07:15 salesdrop-m-00000
-rwxrwxrwx 1 Jason staff 8181373 22 Oct 07:15 salesdrop-m-00001
-rwxrwxrwx 1 Jason staff 2056843 22 Oct 07:15 salesdrop-m-00002
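The zero-byte part-m-* files in the listing are presumably the job’s default output, left empty because every record went to a segment instead. If they get in the way, one common option is to wrap the job’s output format in LazyOutputFormat, which only creates a default output file when something is actually written to it. A small sketch, with a hypothetical helper class and assuming text output:

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

class LazyOutputSketch {
    // Call from the driver in place of job.setOutputFormatClass(...).
    static void skipEmptyPartFiles(Job job) {
        // Default part files are created only on the first write to them.
        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
    }
}
```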

All our data is neatly split into different segment files. The user ids and, more importantly, the user data are in each. Now we can act on it, rewarding good customers or getting an incentive out to customers who are leaving.