Simple Linear Regression in 2 minutes. #machinelearning #linearregression #java

With certain data Simple Linear Regression wins. While the rest of the ML/AI world pushes tools far larger in scope than most of us need, sometimes our best tools are hidden in plain sight.

Apache Commons Math: old, kinda forgotten, but kinda cool. Simple Linear Regression is hiding in there and it's easy to put together.

1. Add the dependency

Put this in your pom.xml file (check Maven Central for the latest commons-math3 version; 3.6.1 at the time of writing)…..

<dependency>
  <groupId>org.apache.commons</groupId>
  <artifactId>commons-math3</artifactId>
  <version>3.6.1</version>
</dependency>

2. Import the class

In your Java class add this import statement.

import org.apache.commons.math3.stat.regression.SimpleRegression;

3. Add your two data points

I’m reading in a list of comma delimited strings so I’m parsing and converting them. The basic premise of building the model is simple though….

public SimpleRegression getLinearRegressionModel(List<String> lines) {
  SimpleRegression sr = new SimpleRegression();
  for(String s : lines) {
    String[] ssplit = s.split(",");
    double x = Double.parseDouble(ssplit[0]);
    double y = Double.parseDouble(ssplit[1]);
    sr.addData(x, y);
  }
  return sr;
}

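Under the hood this is just ordinary least squares. As a sanity check, here's a hedged, stdlib-only sketch of the same calculation that SimpleRegression does for you (the sample data is made up; the points sit exactly on y = 2x + 1, so the fit should be perfect):

```java
import java.util.Arrays;
import java.util.List;

public class LeastSquares {
    // Plain-Java ordinary least squares over "x,y" strings,
    // mirroring what SimpleRegression computes internally.
    static double[] slopeIntercept(List<String> lines) {
        double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
        int n = lines.size();
        for (String s : lines) {
            String[] p = s.split(",");
            double x = Double.parseDouble(p[0]);
            double y = Double.parseDouble(p[1]);
            sumX += x; sumY += y; sumXY += x * y; sumXX += x * x;
        }
        double slope = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
        double intercept = (sumY - slope * sumX) / n;
        return new double[] { slope, intercept };
    }

    public static void main(String[] args) {
        // Points on y = 2x + 1, so slope should be 2 and intercept 1.
        double[] fit = slopeIntercept(Arrays.asList("1,3", "2,5", "3,7", "4,9"));
        System.out.println("slope=" + fit[0] + " intercept=" + fit[1]);
    }
}
```

Handy if you ever want to check the library's numbers against the maths by hand.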
4. Make some predictions

The SimpleRegression class will give you back the slope and intercept, and from there it's plain sailing to make a prediction.

private String runPredictions(SimpleRegression sr, int runs) {
  StringBuilder sb = new StringBuilder();
  // Display the intercept of the regression.
  sb.append("Intercept: " + sr.getIntercept() + "\n");
  // Display the slope of the regression.
  sb.append("Slope: " + sr.getSlope() + "\n");
  // Display the slope standard error.
  sb.append("Standard Error: " + sr.getSlopeStdErr() + "\n");
  // Display the R squared value.
  sb.append("R2 value: " + sr.getRSquare() + "\n");
  sb.append("Running random predictions......\n");
  Random r = new Random();
  for (int i = 0 ; i < runs ; i++) {
    int rn = r.nextInt(10);
    sb.append("Input score: " + rn + " prediction: " + Math.round(sr.predict(rn)) + "\n");
  }
  return sb.toString();
}

Job done.

Now remember the key metric is the R2 score, sr.getRSquare() from your model. It's a number between 0 and 1: at 0 the model explains none of the variance and shouldn't be used, at 1 it's basically the most accurate model you can get. Below 0.5 your model explains less than half the variance in the data. Aim for a minimum of 0.8 (80%) and you're well on your way to bragging about your predictions at the pub, or on Twitter, or Facebook, or at the pub on Twitter and Facebook……




Basic calculation for hidden nodes in a Neural Network #hackthehub18 #ai #machinelearning

I have this written down in a number of notebooks but I’m leaving it here for two reasons:

  1. It just took me twenty minutes to find the right notebook.
  2. It’s Hack The Hub in Belfast and it’s all about Machine Learning and AI.

How Many Nodes In A Hidden Layer?

I want to get a rough idea of the number of nodes to use in a hidden layer in my neural network. Too few or too many can have an impact on the accuracy of your model, and you'll see it in the outputs during evaluation (accuracy and F1 scores).

There is a common equation available to give us a rough number.

The number of samples in the training set, divided by a scaling factor multiplied by the total number of input and output nodes.

In Clojure it looks like this:

user> (defn node-calc [inputs outputs sample-size scaling]
 (double (/ sample-size (* scaling (+ inputs outputs)))))

The scaling factor is just an arbitrary number between 2 and 10. It’s worth mapping through the range to get a feel for the scores.
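If Clojure isn't your thing, the same rule of thumb is a one-liner in Java too. A sketch (the figures plugged in below are the 1 input, ten outputs and 474 samples used in the worked case):

```java
public class HiddenNodes {
    // Rule of thumb: samples / (scaling * (inputs + outputs)).
    static double nodeCalc(int inputs, int outputs, int sampleSize, int scaling) {
        return (double) sampleSize / (scaling * (inputs + outputs));
    }

    public static void main(String[] args) {
        // Map the scaling factor across 2..10 to get a feel for the range.
        for (int scaling = 2; scaling <= 10; scaling++) {
            System.out.printf("scaling %d -> %.2f hidden nodes%n",
                    scaling, nodeCalc(1, 10, 474, scaling));
        }
    }
}
```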

Let’s Build a Case

My neural network has one input node and ten output nodes (ten possible prediction results), and I've got 474 instances of input data to train with. I'm going to map the scaling factor from 2 to 10 so I can see the range of node results.

user> (clojure.pprint/pprint (map (fn [s] (node-calc 1 10 474 s)) (range 2 11)))

Guess how many times I’d run the model training and evaluation? I’d test all the rounded up/down results and see how the F1 score looks.


The Fact Your #Data is Being Used Should Be a Surprise to No-one

It’s been an interesting weekend for my field of work. Especially in an industry where I do stuff with data….

Ellie Mae O’Hagen wrote a piece called “No one can pretend Facebook is harmless fun anymore” and it’s not a bad overview of where things are. The last line says it all:

“…because people with Facebook profiles aren’t the company’s customers: they are the product it sells to advertisers.”

Which is basically the worst kept secret among technology companies, entrepreneurs and tech "thought leaders". The value is in the data, and once you figure out how to monetise that, then a free product to customers is no bad thing.

Anyone who knows me knows my love of customer loyalty data. I've worked with it since 2002, mined Nectar card data and came up with recommendations, via vouchers and offers, on how to get customers to change behaviour. The Cambridge Analytica approach is far from new; it's just the domain it was applied to that's different.

Once you know you can change another person's behaviour there comes a sense of responsibility with it. As the custodians of the data you now have the power to change the course of another person's future without their knowledge. That thought alone is scary, as I know some who would exploit it for profit like squeezing a grape until no more juice would come out.

So think about it all: every card, whether it's loyalty cards, bank cards, or your medical records on the GP's system. Do the likes of Tesco/Dunnhumby have a public list of where their Clubcard data is sold? Probably not. I asked a question during a Big Data Week panel in 2015: "Who has a Clubcard?", and pretty much all the room raised a hand. "Who wouldn't mind if your shopping habits were passed on to the insurance company?", and all hands with the exception of two went down very quickly.

Telephone call logs are another, and then there's the classic line, "we may record your call for training purposes". Training what, exactly? Another customer representative, or a machine learning or AI tool to decide whether to keep your custom? We never know, because we never find out.

Will the events of the weekend turn the tide against Facebook? It's 50/50. I mean, 50m users of Facebook is only about 2.5% of the user base and most hardened cat/dog/baby picture posting users won't care. If I were to bet, I'd say probably nothing much will happen.

The only people who need to change are you and me: thinking about where our data goes, how it will be used and how to have it deleted when we're done with that service.


Setting up Org Mode and Babel for the Nervous #emacs #vi #babel

I’m claiming a moral victory for my sanity here…..

As a die-hard vi user, Emacs occasionally confuses me; I'm happy to admit that. Many a time I've typed :wq instead of C-x C-s when it comes to saving files.

Thing is, Emacs has loads of goodies that I never quite get to try out. Org mode being able to run scripts and stuff is one of them. Curiosity has now given way to requirement, so it was a wrestling match to get it working (and reading the documentation did help…. I admit). There are probably better ways to do this but it worked for me.

Installing Org mode

Open your init.el file and add the following line:

(add-to-list 'package-archives '("org" . "https://orgmode.org/elpa/") t)

I’m assuming that you’ve already got (require 'package) in your init.el file, if you don’t then add that above the add-to-list line.

Open Emacs; you now need to install the org package and the org-contrib package.

M-x list-packages

You will see org and org-contrib listed (mine are at the top). Install them both: click on the title then click "Install". Emacs will output a load of stuff but all is normally well. With that done we can now make sure that we can run bash from within Org mode.

Enabling Bash within Emacs

Open your init.el file again. You will now add the org-babel command to load the shell so it can be called from your org file. I usually add this stuff at the bottom of my init file.

(org-babel-do-load-languages 'org-babel-load-languages
 '((shell . t)))

Save the file and restart Emacs.

Testing within Org mode

So far so good, now to test.

Either open an org mode file or create one. Now add the following:

#+BEGIN_SRC bash
echo 'this is a test'
#+END_SRC

Then evaluate the block by using C-c C-c and you will be prompted "Evaluate this bash code block on your system?". Respond with yes.

Look at your org file again and you should see the output.

#+BEGIN_SRC bash
echo 'this is a test'
#+END_SRC

#+RESULTS:
: this is a test

That’s good enough for me and now I have notes on how to get it working on my work machine tomorrow morning (I’ll forget if I don’t write it down).


Walking as a debugging technique. #programming #debugging #code #learning

Kris is totally on the money, this tweet is 100% true. One story I tell to developers is from personal experience.

While working on the Sporting Life* website for the Press Association I was working on a Perl script, quite a beefy one, to populate pools coupons so people could play online.

All morning I was fixated on a bug and I couldn't see the wood for the trees. My boss sat opposite but didn't say a word, nor did I realise he was teaching me. After a while he decided the time was right: "Jase, go for a walk." I was blunt: "No, not until I've fixed this bug…." "Jase, go for a WALK!" I got the hint…..

The Press Association car park is a fair size so I did a lap, just the one. All the while during that lap I was muttering under my breath about such an absurd command from my boss. My first proper programming job and I was less than impressed…..

That all changed in an instant. I opened the door to the office, walked to my desk and before I even sat down pointed at the screen and said, “Oh look, there’s a comma missing….”, made the correction and it worked first time.

Stuck with a problem? Go for a walk.


* Two milestones of my programming career: being one of the first involved in the very first online betting platform and, second, the first online pools coupon….. this coming from the man who has no interest in sport at all.


I’m talking about streaming apps at ClojureX 4-5th December, London at @Skillsmatter #clojure #onyx #streaming #kafka #kinesis

Who Let Him Up There Again??

Last year at ClojureX I did an introduction to Onyx, this year it’s about what I really learned at the coal face. I’ll be talking about how I bled all over Onyx with a really big project.

This time though, no naff jokes, no Strictly Come Dancing and Linear Regression*, no temptation to use that Japanese War Tuba picture. It will be about designing streaming applications, task lifecycles, heartbeats, Docker deployment considerations and calculating log volume sizes for when you're on holiday.

I’m looking forward to it. If you are interested in the current schedule you can read that here, if you want more information on the conference then that’s on the SkillsMatter website.

* If you’re interested, the Darcey Coefficient is (as a Clojure function):

(ns darceycoefficient.core)
(defn predict-score [x-score]
 (+ 3.031 (* 0.6769 x-score)))
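And for the non-Lisp readers, the same linear model as a hedged Java sketch (same slope and intercept as the Clojure version above):

```java
public class DarceyCoefficient {
    // The linear model: predicted score = 3.031 + 0.6769 * x.
    static double predictScore(double x) {
        return 3.031 + 0.6769 * x;
    }

    public static void main(String[] args) {
        // A judge scoring 8 gives a prediction of roughly 8.45.
        System.out.println(predictScore(8));
    }
}
```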

Like BigData tools, you won’t need AI 99% of the time. #bigdata #data #machinelearning #ai #hadoop #spark #kafka

The Prologue.

Recently I’ve been very curious, and I know that alone makes people in tech really nervous. I was curious to find the first mentions of BigData and Hadoop in this blog: April 2012, and the previous year I’d been doing a lot of reading on cloud technologies and, moreover, data. My thirty-year focus is data, and right now in 2017 I’m halfway through.

The edge as I saw it would be to go macro on data and insight, that had been my thought ten years earlier. The whole play with customer data was clear in my mind then. In 2002 though we didn’t have the tooling, we made it ourselves. Crude, yes. Worked, it did.

When I moved to Northern Ireland I kept talking about the data plays to mainly deaf ears, some got it. Most didn’t. “Hadoop, never heard of it”. Five years later everyone has heard of Hadoop… too late.

It’s usually about now we have a word cloud with lots of big data related words on it.

Small Data, Big Data Tools

Most of the stories I hear about Big Data adoption are just this: using Big Data tools to solve small data problems. On the face of it, the amount of data an organisation has rarely amounts to the need for huge tooling like Hadoop or Spark. My guess is (and I’ve seen it partially confirmed) that the larger platforms like Cloudera, MapR and Hortonworks compete over a very narrow field of genuinely big customers.

Let’s be honest with ourselves, Netflix and Amazon sized data are more deviations of the mean than the mean itself and the probability of it being given to you is very small unless it’s made public.

I personally found out in 2012, when I put together Cloudatics, that using big data tools is a very hard sell. Many companies just don’t care, not all understand the benefits, and those who cared still didn’t see how it would apply to them. Your pipeline is slim; at a guess a 100:1 ratio would apply, and that was optimistic then, let alone five years on.

Most of us aren’t near “Averaged Sized Data” let alone Big Data.

When I first met Bruce Durling back in late 2013 (he probably regretted that coffee) we talked about all the tools: how there’s no need to write all this Java stuff when a few lines of Pig will do, and how solving a specific problem with existing big data tools was far better than trying to launch a platform (yup, know that, already tried).

What Bruce and I also know is that we work with average sized data…. it’s not big data but it’s not small data. Do we need Hadoop or Spark? Probably not. Can we code and scale it on our own? Yes we can. Do we have the skills to do huge data processing? You betcha.

I sat in a room a few weeks ago where mining 40,000 tweets was classed as a monumental achievement. I don’t want to burst anyone’s bubble, but it’s not. Even 80 million tweets is not a big data problem, nor even an average sized data one. Doing sentiment analysis on them took under a minute on my laptop.

Now enter all life saving AI!

And guess what, it looks like the same mistake is going to be repeated. This time with artificial intelligence. It’ll save lives! It’ll replace jobs! It’ll replace humans! It can’t tell the difference between a turtle and a gun! All that stuff is coming back.

If you firmly believe that a black box is going to revolutionise your business then please be my guest. Just be ready with the legals and customer service department, AI is rarely 100% accurate.

Like big data, you’ll need tons of data to train your “I have no idea how it works, it’s all voodoo” black box algorithm. The less you train, the more error prone your predictions will be. Ultimately the only thing it will harm is the organisation that ran the AI in the first place. Take it as fact that customers will point the finger straight back at you, very publicly, if you get a prediction wildly wrong.

I’ve seen Google video and Amazon Alexa voice classification neural networks do amazing things; the usual startup on the street may have access to the tools, but rarely the data to train them. And my key takeaway since doing all that Nectar card stuff: without quality data, and lots of it, your fight will be a hard one.

I think there are still a good few years at the R&D coalface trying to figure out where AI could fit properly. Yes, jobs will be replaced by AI, and new jobs will be created. Humans will sit alongside robotic machines that take the heavy lifting away (that was going on for a long time before the marketers got hold of AI and started scaring the s**t out of people with it).

It’s not impossible to start something in the AI space and put it on the cloud, though the costs can add up if you take your eye off the ball. The real question is, “do you really have to do it that way? Is there an easier method?”. Most crunching could be done on a database (not blockchain, may I add); hell, even an Excel spreadsheet is capable for some without the programming knowledge or money to spend on services.

Popular learning methods are still based on the tried and true: decision trees, logistic regression and k-means clustering, not black boxes. The numbers can be worked out away from code as confirmation, though who does that is a different matter entirely. The most well known algorithms can be reverse engineered: decision trees, Bayes networks, Support Vector Machines, logistic regression all have their maths laid bare showing how they work. The rule of thumb is simple: if traditional machine learning methods are not showing good results then try a neural network (the backbone of AI), but only as a last resort, not the first go-to.

If you want my advice, try the traditional, well tested algorithms first with the small data you have. I even wrote a book to help you…..

Like BigData, you more than likely don’t need AI.



How to run params in R scripts from Clojure – #clojure #r #datascience #data #java

You can read the main post here.

Passing parameters into Rscript.

A Mrs. Trellis of North Wales writes….

There’s always one, isn’t there? The decent chap has a point though, so let’s plough on with it now.

New R code

First a new R script to handle arguments being passed to it.

#!/usr/bin/env Rscript
args = commandArgs(trailingOnly=TRUE)

if(length(args)==0) {
 stop("No args supplied")
} else {
 print(args)
}

If I test this from the command line I get the following:

$ Rscript params.R 1 2 3 4
[1] "1" "2" "3" "4"

Okay, that works so far, now to turn our attention to the Clojure code.

The Clojure Code

Let’s assume we have a vector of numbers; these will be passed in the run command as arguments. So I need a way of converting the vector into a string that can be passed to the sh command.

(defn prepare-params [params]
 (apply str (map #(str " " %) params)))

Which gives an output like this:

rinclojure.example1> params
[1 2 3 4]
rinclojure.example1> (prepare-params params)
" 1 2 3 4"

With a little amend to the run command function (I’m going to create a new function to handle it)….

(defn run-command-with-values [r-filepath params]
 (let [format-params (prepare-params params)
       command-output (sh "Rscript" r-filepath format-params)]
   (if (= 0 (:exit command-output))
     (:out command-output)
     (:err command-output))))

Running the example passes the string in to the R script.

rinclojure.example1> (run-command-with-values filename params)
"[1] \" 1 2 3 4\"\n"

Not quite going according to plan. We have one string of arguments, meaning there’d be some parsing to do within the R script. Let’s refactor this function a little more.

(defn run-command-with-values [r-filepath params]
 (let [sh-segments (into ["Rscript" r-filepath] (mapv #(str %) params))
       command-output (apply sh sh-segments)]
   (if (= 0 (:exit command-output))
      (:out command-output)
      (:err command-output))))
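The same “build the command as a vector” idea translates to Java’s ProcessBuilder, which also wants one argument per element. A hedged sketch (the script path is made up, and actually running it assumes Rscript is on your PATH):

```java
import java.util.ArrayList;
import java.util.List;

public class RscriptCommand {
    // Build the argv list the same way the Clojure version feeds sh:
    // the command, the script path, then one element per parameter.
    static List<String> buildCommand(String rFilePath, List<Integer> params) {
        List<String> cmd = new ArrayList<>();
        cmd.add("Rscript");
        cmd.add(rFilePath);
        for (Integer p : params) {
            cmd.add(String.valueOf(p));
        }
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = buildCommand("params.R", List.of(1, 2, 3, 4));
        System.out.println(cmd); // [Rscript, params.R, 1, 2, 3, 4]
        // To actually run it: new ProcessBuilder(cmd).start();
    }
}
```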

The (prepare-params) function is now useless and removed. Using the into function we create a single vector of instructions to pass into sh; this includes the Rscript command, the file path and the stringified values of the parameters.

Instead of running sh on its own I’m applying the vector against sh. When it’s run against the R script we get the following output:

rinclojure.example1> (run-command-with-values filename params)
"[1] \"1\" \"2\" \"3\" \"4\"\n"

Now we’ve got what we’re after: separate entries being registered within the R script. The R script will have to deal with the argument input, converting the strings to numbers, but we’re passing Clojure things into R as parameters.

Mrs. Trellis, as the basics go, job done. I’m sure it could be done better. Each case is going to be different so you’ll have to prepare the vector for each R script you work on.




Reverse Engineering the Nonsense. #marketing #coco #eureka


It looks daft, doesn’t it… it’s either bats**t loonball or someone has just done their job very well indeed…. personally I think it’s marketing genius. This is how I imagine the phone call went…..

[Eureka] – “Hi, is that Coco? It’s marketing dude at Eureka here, we’ve got this idea to sell these vacuum cleaners.”

[Coco] – “Go on, I’m listening….”

[Eureka] – “Will you walk down the high street while one of our other dudes vacuums the street? We’ll give you 10% of the sales”

[Coco] – “Deal! Can I wear what I want? If I’m gonna look mad I might as well do it in style….”

[Eureka] – “Deal!”

(Disclaimer: The above is ALL MADE UP)

Back of the Beermat later….

Facebook views as I took the screenshot: 15,777,263….. nice.

One percent convert to sales? A long shot but hey, it’s madness this morning.

So 157,772 sales at $219, as let’s be honest you want the one that Coco gets someone to clean the street with…. $34,552,205.97. Nice.

Coco walks about with $3.4m in her back pocket (assuming the getup has pockets).

Not bad for an hour’s work: a bit of mockery on Facebook and Youtube, some odd headlines about you, but hey, the exposure is priceless. Eureka have saved a fortune on Youtube CPM fees and a full marketing campaign.

That doesn’t even take into account the outfit and what the baby is wearing. Now if you could scan the image into an app and find out about it…… Oh Kim’s working on that already….

How to run R scripts from Clojure – #clojure #r #datascience #data #java

An interesting conversation came up during a tea break at a London meeting this week: how do you run R scripts from within Clojure? One way was simple, the other (mine) was far more complicated (see the “More Complicated Ways” section below).

So here’s me busking my way through the simple way.

Run it from the command line

The Clojure Code

Using the clojure.java.shell package gives you access to the Java system command process tools. I’m only interested in running a script, so all I need is the sh command.

(ns rinclojure.example1
 (:use [clojure.java.shell :only [sh]]))

The sh function produces a map with three keys: an exit code (:exit), the output (:out) and an error (:err). I can evaluate the output map, check the exit code (anything that’s not zero is an error) and dump the error, or if all is well send out the output.

(defn run-command [r-filepath]
 (let [command-output (sh "Rscript" r-filepath)]
   (if (= 0 (:exit command-output))
     (:out command-output)
     (:err command-output))))

I’ve kept this function simple: I’m only interested in running Rscript and checking the exit code. If all is well then we show the output, otherwise we send out the error.

The R Code

The now preferred way to run R scripts from the command line is the Rscript command, which is bundled with the R software when you download it. If I have R scripts saved then it’s a case of running them through Rscript and evaluating the output.

Here’s my R script.

myvec <- c(1,2,3,2,3,4,5,4,3,4,3,2,1)
mean(myvec)

Not complicated I know, just a list of numbers and a function to get the average.

Running in the REPL

Remember the error is from the running of the command and not within your R code. If you mess that up then those errors will appear in the :out value.

A quick test in the REPL gives us…..

rinclojure.example1> (def f "/Users/jasonbell/work/projects/rinclojure/resources/r/meantest.R")
rinclojure.example1> (run-command f)
"[1] 2.846154\n"

Easy enough to parse by removing the \n and the [1] line number which R has generated. We’re not interacting with R, only dumping the output from it. After that there’s an amount of string manipulation to do.

Expanding to Multiline Output From R

Let’s modify the meantest.R file to give us something multiline.

myvec <- c(1,2,3,2,3,4,5,4,3,4,3,2,1)
mean(myvec)
summary(myvec)

Nothing spectacular I know but it has implications. Let’s run it through our Clojure command function.

rinclojure.example1> (def f "/Users/jasonbell/work/projects/rinclojure/resources/r/meantest.R")
rinclojure.example1> (run-command f)
"[1] 2.846154\n Min. 1st Qu. Median Mean 3rd Qu. Max. \n 1.000 2.000 3.000 2.846 4.000 5.000 \n"

Using clojure.string/split will give us the output with each line as an element of a vector.

rinclojure.example1> (clojure.string/split x #"\n")
["[1] 2.846154" " Min. 1st Qu. Median Mean 3rd Qu. Max. " " 1.000 2.000 3.000 2.846 4.000 5.000 "]

There’s still an amount of tidying up to do though. Assuming I’ve created x to hold the output from the Rscript, first split the \n’s out.

rinclojure.example1> (def foo (clojure.string/split x #"\n"))
rinclojure.example1> foo
["[1] 2.846154" " Min. 1st Qu. Median Mean 3rd Qu. Max. " " 1.000 2.000 3.000 2.846 4.000 5.000 "]

If, for example, I wanted the summary values then I have to do some string manipulation to get them.

rinclojure.example1> (nth foo 2)
" 1.000 2.000 3.000 2.846 4.000 5.000 "

Split again by the space.

rinclojure.example1> (clojure.string/split (nth foo 2) #" +")
["" "1.000" "2.000" "3.000" "2.846" "4.000" "5.000"]

The final step is then to convert the values to numbers, forgetting the first as it’s blank. So I would end up with something like:

rinclojure.example1> (map (fn [v] (Double/valueOf v)) (rest (clojure.string/split (nth foo 2) #" +")))
(1.0 2.0 3.0 2.846 4.0 5.0)

We have no referencing to what each number means: the min, max, average and so on. At this point there would be more string manipulation required, and you could convert the names to keywords or just add your own.
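One way to bolt that referencing on, sketched in Java: zip R’s summary() column order against the parsed values. The key names below are my own assumption, based on summary() printing Min., 1st Qu., Median, Mean, 3rd Qu. and Max. in that order:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SummaryParser {
    // Pair R's summary() columns with the parsed values, mirroring the
    // split-and-parse done in the REPL session above.
    static Map<String, Double> parseSummary(String line) {
        String[] names = { "min", "q1", "median", "mean", "q3", "max" };
        String[] parts = line.trim().split(" +");
        Map<String, Double> out = new LinkedHashMap<>();
        for (int i = 0; i < names.length && i < parts.length; i++) {
            out.put(names[i], Double.valueOf(parts[i]));
        }
        return out;
    }

    public static void main(String[] args) {
        // The summary line as it came back from Rscript.
        String line = " 1.000 2.000 3.000 2.846 4.000 5.000 ";
        System.out.println(parseSummary(line));
    }
}
```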

More Complicated Ways.

Within the R libraries exists the rJava package. This lets you run Java from R, and R from Java. I wrote a chapter on R in my book back in 2014.

It’s not the easiest thing to set up but worth the investment. There is a Clojure project on Github that acts as a wrapper between R and Clojure, clj-jri. Once set up you run R as an REngine and evaluate the output that way. There’s far more control but it comes at the cost of complexity.

Keeping Things Simple

Personally I think it’s easier to keep things as simple as possible. Use Rscript to run the R code, but it’s worth considering the following points.

  • Keep your R scripts as simple as possible; output to one line where possible.
  • Ensure that all your R packages are installed and working; it’s not ideal to install them during the Clojure runtime as the output will become hard to parse. Also make sure that all the libraries are running on the same instance as your Clojure code.
  • In the long run, have a set of solid string manipulation functions to hand for dealing with the R output. Remember, it’s one big string.