Like BigData tools you won’t need AI 99% of the time. #bigdata #data #machinelearning #ai #hadoop #spark #kafka

The Prologue.

Recently I’ve been very curious, and I know that alone makes people in tech really nervous. I wanted to find the first mentions of BigData and Hadoop in this blog: April 2012. The previous year I’d been doing a lot of reading on cloud technologies and, moreover, data. My thirty-year focus is data, and right now in 2017 I’m halfway through.

The edge, as I saw it, would be to go macro on data and insight; that had been my thought ten years earlier. The whole play with customer data was clear in my mind then. In 2002, though, we didn’t have the tooling, so we made it ourselves. Crude, yes. Worked, it did.

When I moved to Northern Ireland I kept talking about the data plays to mostly deaf ears. Some got it, most didn’t. “Hadoop, never heard of it”. Five years later everyone has heard of Hadoop… too late.

It’s usually about now we have a word cloud with lots of big data related words on it.

Small Data, Big Data Tools

Most of the stories I hear about Big Data adoption are just this: using Big Data tools to solve small data problems. On the face of it, the amount of data an organisation has rarely amounts to the need for huge tooling like Hadoop or Spark. My guess is (and I’ve seen it partially confirmed) that the larger platforms like Cloudera, MapR and Hortonworks compete on a very narrow field of really big customers.

Let’s be honest with ourselves: Netflix and Amazon sized data are deviations from the mean rather than the mean itself, and the probability of data like that being given to you is very small unless it’s made public.

I personally found out in 2012, when I put together Cloudatics, that big data tools are a very hard sell. Many companies just don’t care, not all understand the benefits, and those who cared still didn’t see how it would apply to them. Your pipeline is slim; at a guess a 100:1 ratio would apply, and that was optimistic then, let alone five years on.

Most of us aren’t near “Average Sized Data”, let alone Big Data.

When I first met Bruce Durling back in late 2013 (he probably regretted that coffee) we talked about all the tools, how there’s no need to write all this Java stuff when a few lines of Pig will do, and how solving a specific problem with existing big data tools was far better than trying to launch a platform (yup, know that, already tried).

What Bruce and I also know is that we work with average sized data… it’s not big data but it’s not small data. Do we need Hadoop or Spark? Probably not. Can we code and scale it on our own? Yes we can. Do we have the skills to do huge data processing? You betcha.

I sat in a room a few weeks ago where mining 40,000 tweets was classed as a monumental achievement. I don’t want to burst anyone’s bubble, but it’s not. Even 80 million tweets is not a big data problem, nor even an average sized data one. On my laptop doing sentiment analysis took under a minute.
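To give a feel for why this kind of job is small, here is a minimal sketch of a word-list sentiment scorer in plain Python. It is an assumption about one simple way to do it, not the code behind the timing above: the lexicon, the function names and the sample tweets are all invented for illustration.

```python
import re

# Hypothetical toy lexicon, invented for illustration; a real one would be far larger.
POSITIVE = {"good", "great", "love", "happy", "excellent"}
NEGATIVE = {"bad", "awful", "hate", "sad", "terrible"}


def sentiment_score(text):
    """Count positive words minus negative words in one piece of text."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum((w in POSITIVE) - (w in NEGATIVE) for w in words)


def label(text):
    """Map the raw score to a coarse label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    # Made-up sample tweets.
    tweets = [
        "I love this great phone",
        "awful, I hate it",
        "the bus is late",
    ]
    for tweet in tweets:
        print(label(tweet), "<-", tweet)
```

Each tweet costs a regex split and a handful of set lookups, so the work grows linearly with the number of tweets and needs no cluster to run.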

Now enter the all-life-saving AI!

And guess what, it looks like the same mistake is going to be repeated. This time with artificial intelligence. It’ll save lives! It’ll replace jobs! It’ll replace humans! It can’t tell the difference between a turtle and a gun! All that stuff is coming back.

If you firmly believe that a black box is going to revolutionise your business then please be my guest. Just be ready with the legals and customer service department: AI is rarely 100% accurate.

Like big data, you’ll need tons of data to train your “I have no idea how it works, it’s all voodoo” black box algorithm. The less you train, the more error prone your predictions will be. Ultimately the only thing it will harm is the organisation that ran the AI in the first place. Take it as fact that customers will point the finger straight back at you, very publicly, if you get a prediction wildly wrong.

I’ve seen Google video and Amazon Alexa voice classification neural networks do amazing things. The usual startup on the street may have access to the tools but rarely the data to train. And my key takeaway since doing all that Nectar card stuff: without quality data, and lots of it, your fight will be a hard one.

I think there are still a good few years at the R&D coalface trying to figure out where AI could fit properly. Yes, jobs will be replaced by AI, and new jobs will be created. Humans will sit alongside robotic machines that take the heavy lifting away (that was going on for a long time before the marketers got hold of AI and started scaring the s**t out of people with it).

It’s not impossible to start something in the AI space and put it on the cloud, though the costs can add up if you take your eye off the ball. The real question is, “Do you really have to do it that way? Is there an easier method?” Most crunching could be done on a database (not blockchain, may I add). Hell, even an Excel spreadsheet is capable enough for those without the programming knowledge or money to spend on services.
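As an illustration of crunching on a plain database, here is a minimal sketch using Python’s built-in sqlite3 module. The `purchases` table, its columns and the sample rows are hypothetical, made up purely to show the shape of the query.

```python
import sqlite3

# Hypothetical sample data: (customer, amount spent).
SAMPLE = [
    ("alice", 12.5),
    ("bob", 3.0),
    ("alice", 7.5),
    ("carol", 25.0),
    ("bob", 4.0),
]


def spend_by_customer(rows):
    """Load rows into an in-memory SQLite table and total spend per customer."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE purchases (customer TEXT, amount REAL)")
        conn.executemany("INSERT INTO purchases VALUES (?, ?)", rows)
        query = (
            "SELECT customer, COUNT(*), SUM(amount) "
            "FROM purchases GROUP BY customer "
            "ORDER BY SUM(amount) DESC"
        )
        return conn.execute(query).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for customer, visits, total in spend_by_customer(SAMPLE):
        print(customer, visits, total)
```

A single GROUP BY answers the “who spends the most” question with no cluster, no model and no training step.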

Popular learning methods are still based on the tried and true: decision trees, logistic regression and k-means clustering, not black boxes. The numbers can be worked out away from code as confirmation, though who does that is a different matter entirely. The most well known algorithms can be reverse engineered: for decision trees, Bayesian networks, Support Vector Machines and logistic regression there’s maths laid bare showing how they work. The rule of thumb is simple: if traditional machine learning methods are not showing good results then try a neural network (the backbone of AI), but only as a last resort, not the first go-to.
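To show how transparent the traditional methods can be, here is a minimal sketch of a decision stump, which is a decision tree with a single split. The data is hypothetical and the function names are my own for this example; every number it produces can be checked by hand with pen and paper.

```python
def fit_stump(xs, ys):
    """Pick the threshold on one numeric feature with the fewest misclassifications.

    The stump predicts 1 when x > threshold and 0 otherwise.
    Returns (threshold, errors). If every x is identical there is nothing
    to split on and the threshold comes back as None.
    """
    values = sorted(set(xs))
    # Candidate thresholds are the midpoints between neighbouring distinct values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best_threshold, best_errors = None, len(xs) + 1
    for t in candidates:
        errors = sum((x > t) != bool(y) for x, y in zip(xs, ys))
        if errors < best_errors:  # strict '<' keeps the first of any tied thresholds
            best_threshold, best_errors = t, errors
    return best_threshold, best_errors


def predict(threshold, x):
    """Apply the fitted stump to a single value."""
    return int(x > threshold)


# Hypothetical data: monthly visits and whether the customer renewed (1) or not (0).
visits = [1, 2, 3, 4, 6, 7, 8, 9]
renewed = [0, 0, 0, 1, 0, 1, 1, 1]

if __name__ == "__main__":
    threshold, errors = fit_stump(visits, renewed)
    print("split at", threshold, "with", errors, "training error(s)")
    print("prediction for 5 visits:", predict(threshold, 5))
```

The whole “model” is one number you can read, argue with and explain to a customer, which is exactly what a black box does not give you.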

If you want my advice, try the traditional, well-tested algorithms first with the small data you have. I even wrote a book to help you…

Like BigData, you more than likely don’t need AI.