The common thread in most data projects is that they start with a rather ambiguous statement: “well, we’ve got this data”. And without a care in the world it’s spin up the Amazon EMR instances, yes instances plural, and whack all the data up there. The mainstream tech media will focus on the commodity hardware and Yahoo having 50,000 nodes in its cluster, but the fact remains that for most customers we’re talking about a small number of nodes.


Those headline cluster sizes only apply to a handful of the companies using Hadoop (the Googles, Facebooks, Twitters and Yahoos of this world). For most people a single node may actually be fine when dealing with batch processing, especially when those runs are scheduled during quiet times when the data isn’t changing fast.

Adding nodes adds latency, whether they sit on the local network or across a wide area network. When you combine a number of latency points, network access, disk I/O and the actual processing, the total time added can start to hurt.
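To see how those latency points stack up, here’s a back-of-envelope sketch. All the figures are made-up assumptions for illustration, not measurements from any real cluster:

```python
# Rough per-block cost model. Every figure here is an assumption
# chosen purely to illustrate how small latencies accumulate.
network_ms = 2.0      # LAN round trip to fetch the block (assumed)
disk_io_ms = 8.0      # reading one block from spinning disk (assumed)
processing_ms = 40.0  # CPU time spent on the block (assumed)

per_block_ms = network_ms + disk_io_ms + processing_ms

# A modest job touching 10,000 blocks:
blocks = 10_000
total_seconds = blocks * per_block_ms / 1000
print(total_seconds)  # 500.0
```

Ten milliseconds of network and disk overhead per block looks like nothing, yet here it accounts for a fifth of the total run time, which is why shaving latency anywhere you can pays off.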

Hadoop is usually copying blocks of 64MB of data to a node for processing, done over TCP and RPC. Why 64MB? Well, it’s a Goldilocks number: not too big, not too small, but just right, large enough that sequential reads dominate seek time, small enough to spread the work across tasks.

So the key to powering up any single-node system is to find ways of reducing latency wherever you can. That might mean making HDFS a completely in-memory implementation so block reads and writes are faster, or just adding a ton more RAM (well, as much as the JVM will sensibly let you use). Machines with 48GB will find other uses for the spare memory, such as OS disk caching, and it all adds to the performance.
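As a rough sketch of where that spare memory comes from on a 48GB box, the split below is entirely assumed, but it shows why a big-RAM node ends up with a large slab of memory free for the OS page cache:

```python
# Illustrative only: how RAM on a 48GB node might divide up.
# The heap and overhead figures are assumptions, not a recommendation.
ram_gb = 48
jvm_heap_gb = 8    # heap handed to the JVM; huge heaps mean painful GC pauses
os_misc_gb = 4     # OS, daemons, and everything else on the box

spare_for_cache_gb = ram_gb - jvm_heap_gb - os_misc_gb
print(spare_for_cache_gb)  # 36
```

That 36GB isn’t idle: the OS uses it to cache recently read blocks, so repeat reads skip the disk entirely.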

Once you run out of those options you enter the realm of hardware acceleration, and a few companies are working on this now; one notable Northern Ireland company is Analytics Engines. At that point it isn’t really a Hadoop issue any more, just eking the last of the juice out of the machine you’re working on.

The Hadoop node count has become the ego point for Hadoop developers, without much thought for the cost of running such a complex cluster of machines (devops won’t save you now). I’ve found that even mid-tier racks will give you up to 6TB of storage and enough RAM to sink a small startup. In the grand scheme of things SMEs don’t have much to worry about; perhaps we don’t need everything in the cloud after all…