Quick Recipe for #Kafka Streams in #Clojure

Kafka Streams was introduced in Kafka 0.10.x and acts as a way of programmatically manipulating the data from Kafka. William Hamilton from Funding Circle introduced the concepts in a lightning talk during ClojureX. As William and I discussed, make Java interop your friend.

I’ve based my example on James Walton’s Kafka Stream example, which you can find on GitHub.

The Quick and Dirty Basic Stream Demo

First add the dependencies to your project.

[org.apache.kafka/kafka-streams "0.10.0.1"]
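The classes used in the snippets below all come from that jar, so the namespace needs the relevant Java imports. Something along these lines should work (a sketch, matching the kstream-test.core namespace that shows up in the uberjar output further down):

(ns kstream-test.core
 (:import [org.apache.kafka.streams StreamsConfig KafkaStreams]
          [org.apache.kafka.streams.kstream KStreamBuilder ValueMapper]
          [org.apache.kafka.common.serialization Serdes])
 (:gen-class))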

Configuration

First of all, some configuration. The properties we’re going to use give the application a name, the Kafka broker to work with and the key/value serde classes to use for each message (in this example they are both strings). With those properties we then create a StreamsConfig instance.

(def props
 {StreamsConfig/APPLICATION_ID_CONFIG, "my-stream-processing-application"
 StreamsConfig/BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"
 StreamsConfig/KEY_SERDE_CLASS_CONFIG, (.getName (.getClass (Serdes/String)))
 StreamsConfig/VALUE_SERDE_CLASS_CONFIG, (.getName (.getClass (Serdes/String)))})

(def config
 (StreamsConfig. props))

Creating the Builder

The main builder is defined first, then we’ll add the topic and config when the stream is created.

(def builder
 (KStreamBuilder.))

Defining the Topic

Just a string array of topic names; Kafka Streams can read from more than one topic.

(def input-topic
 (into-array String ["topic-input"]))

Working with the Stream

While the stream is running, every event passed through the topic becomes a KStream object; it’s then a case of passing that through a method to do some work on the content of the stream. In this case we’re mapping the values (.mapValues), converting each value (v) of the key/value pair to a string and counting its length. The last thing to do is print the results to System.out.

(->
 (.stream builder input-topic)
 (.mapValues (reify ValueMapper (apply [_ v] ((comp str count) v))))
 (.print))

It’s worth looking at the actual Java API for the Kafka KStream class. There are lots of methods for manipulating the data passing through; the result might be a value sent on to another Kafka topic, or just written out to a file. Take the time to study the options and you’ll save yourself time in the long run.
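For example, instead of printing, the mapped values could be sent on to another topic with .to. A rough sketch (the "topic-output" topic name is just for illustration):

;; sketch: write the mapped values to another topic instead of printing them
(->
 (.stream builder input-topic)
 (.mapValues (reify ValueMapper (apply [_ v] ((comp str count) v))))
 (.to "topic-output"))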

Setting It All Off

The final parts of the puzzle.

(def streams
 (KafkaStreams. builder config))

(defn -main [& args]
 (prn "starting")
 (.start streams)
 (Thread/sleep (* 60000 10))
 (prn "stopping"))

The main function starts the service and will keep it alive for ten minutes.
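If you want the service to shut down cleanly rather than just letting the JVM exit, a slightly extended -main (a sketch built on the same streams var) could close the streams explicitly:

(defn -main [& args]
 (prn "starting")
 (.start streams)
 ;; keep the service alive for ten minutes
 (Thread/sleep (* 60000 10))
 (prn "stopping")
 ;; shut the stream processing down cleanly
 (.close streams))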

Packaging it all up

I’m using Leiningen, so it’s a simple case of creating an uberjar.

$ lein uberjar
Compiling kstream-test.core
log4j:WARN No appenders could be found for logger (org.apache.kafka.streams.StreamsConfig).
log4j:WARN Please initialize the log4j system properly.
Created /Users/jasonbell/work/dataissexy/kstream-test/target/uberjar+uberjar/kstream-test-0.1.0-SNAPSHOT.jar
Created /Users/jasonbell/work/dataissexy/kstream-test/target/uberjar/kafka_streams.jar

Testing the Service

Straight out of the box, Kafka 0.10 is installed in /usr/local, and I’m going to run all of this as the root user (it’s just a local machine).

Start Zookeeper

$KAFKA_HOME/bin/zookeeper-server-start.sh config/zookeeper.properties

Start Kafka

$KAFKA_HOME/bin/kafka-server-start.sh config/server.properties

Create the Topic

$KAFKA_HOME/bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic topic-input


Created topic "topic-input".

Start a Producer and Add Content

$KAFKA_HOME/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic topic-input
This is some data
Thisis some more data
This is more data than the last time
asdas
asd
dfgd

Start the Uberjar’d Stream Service

$ java -jar target/uberjar/kafka_streams.jar
log4j:WARN No appenders could be found for logger (org.apache.kafka.streams.StreamsConfig).
log4j:WARN Please initialize the log4j system properly.
"starting"
null , 5
null , 3
null , 3
null , 3
null , 3
null , 6
null , 3
null , 3
null , 6
null , 0
null , 3
null , 4
null , 5
"stopping"

Concluding

A really quick walkthrough, but it gets the concepts across. Ultimately neither approach is outright better than the other. Part of me wants to stick with Onyx: the configuration works well and the graph workflow is easier to map and change. Kafka Streams is important though, and certainly worth a look if you are using Kafka 0.10.x; if you are still on 0.8 or 0.9 then Onyx, in my opinion, is still the best option.

@jennschiffer Delivers the Best Talk on the State of Tech I’ve Seen #tech #talks

I love the tech industry, the diversity, the people, everything. I don’t like the nonsense that goes with it, it’s pointless and not needed. And it’s not often I’ll watch something that leaves me peeing myself laughing but also angry about the treatment that some get.

So, Jenn Schiffer, your talk at the XOXO Festival was brilliant, thank you. As for everyone else, her talk is definitely required viewing.

So #ClojureX was excellent, now here’s my corrections – @skillsmatter @onyxplatform

My first #ClojureX done and I’ve already spent the morning thinking about what I could possibly talk about in 2017. Hands down one of the best developer conferences I’ve attended. It’s all about the community and the Clojure community gets that 100%, it showed over the two days.

Many, many thanks to everyone at SkillsMatter for looking after me, the weary traveler; the tea was helpful. Also great to meet some of the folk I regularly talk to on the Clojurians Slack channel.

A couple of things following my talk now I’ve watched it back. If you want to watch then this is the link: https://skillsmatter.com/skillscasts/9153-introducing-streaming-processing-with-kafka-and-the-onyx-platform

Firstly, when you create the Onyx app with lein it does create the docker compose file but it only has the Zookeeper element, not the Kafka one – that has to be added in afterwards.

Secondly, the percentage scheduler adds up to 100, I said zero. Brain detached for a second, thought one thing and something else came out.

Apologies, I don’t like giving out wrong information.

Craig and Darcey would have been proud, kind of, perhaps.

Here’s to 2017.

Running Investor Metrics on NI Startup @Mattermark Scores – #investing #clojure #stats #startups

In the last post I argued (mainly with myself) about how Mattermark scores could be used as a gauge of an NI startup’s performance. No response, not that I was really expecting one, but no one complained either. I did say that I would delve into the numbers a little deeper, so here it is.

Got a drink ready? Let’s go.

“So How’s [Startup] Doing?”

It’s a question I’m often asked but one I have now stopped asking myself. The main reason: location, location, location. I’m nowhere near the action.

With the best will in the world it’s either be in Belfast or pretend you are in Belfast. The thing is, when someone asks that question, "How’s so-and-so doing?", the answer is usually based on hearsay, rumour and the 99% confirmation bias of the founders: regardless of how nice they are, everything will be going fine. They’ve got to keep the positive mindset going, and I won’t knock them for doing what they have to do; it’s business.

It’s one of the reasons I’ve relied on the Mattermark scores more and more: they are a good gauge of how a startup is performing from a social and investor sentiment standpoint. It also means I can compare against others in the same sector.

So how is seesense.cc doing? Easy, inspect the growth score numbers.

core> seesense
(160 173 175 170 174 172 170 169 176 178 185 196 196 196 188 196 196 198 199 206 205 204 207 197 187 187 187 182 173 173 170 167 160 154 153 146 150 149 147 145 144 140 138 136 135 134 129 126 129)

That’s all I care about, the previous 52 week Mattermark scores.

Annual Return

Taking the start and end figures I can calculate a theoretical return (what did it make?): if I treat the Mattermark score as a stock price, for example, how much am I making over the annual period?

The annual return is easy to calculate; it’s a percentage.

end / start - 1

Seesense started at 160 and ended at 129, which gives me -0.19375 or -19.4%.
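As a quick Clojure sketch (annual-return is my name for it, not necessarily what the inv namespace calls it):

(defn annual-return
 "Treats the first and last scores as the start and end prices."
 [coll]
 (- (/ (double (last coll)) (first coll)) 1))

;; (annual-return seesense) => -0.19375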

Daily Returns

Exactly the same calculation as the annual return, just done on a reading-by-reading basis over the period.
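The daily-returns function itself isn’t shown in the post; a minimal sketch of how it could be written, assuming each reading is compared with the previous one:

(defn daily-returns
 "Return of each reading relative to the previous one."
 [coll]
 (map (fn [prev curr] (- (/ (double curr) prev) 1))
      coll
      (rest coll)))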

(inv/daily-returns seesense)
(0.08125000000000004 0.011560693641617936 -0.02857142857142858 0.02352941176470602 -0.011494252873563204 -0.011627906976744207 -0.00588235294117645 0.041420118343195034 0.01136363636363602 0.039325842696628976 0.059459459459459074 0.0 0.0 -0.04081632653061218 0.042553191489361986 0.0 0.010204081632652962 0.005050505050504972 0.035175879396984966 -0.004854368932038833 -0.004878048780487809 0.014705882352940902 -0.048309178743961345 -0.050761421319796995 0.0 0.0 -0.0267379679144385 -0.0494505494505495 0.0 -0.01734104046242768 -0.01764705882352935 -0.041916167664670656 -0.03749999999999998 -0.006493506493506551 -0.04575163398692805 0.027397260273972934 -0.00666666666666671 -0.01342281879194629 -0.013605442176870652 -0.006896551724137945 -0.02777777777777779 -0.014285714285714346 -0.01449275362318836 -0.007352941176470562 -0.007407407407407418 -0.03731343283582089 -0.023255813953488413 0.023809523809523947)

Remember these are fractional returns; multiply by 100 to read them as percentages.

Measuring Risk

Risk is measured as the standard deviation of daily returns.

(defn risk-measure [coll]
 (stats/standard-deviation (daily-returns coll)))

In the above figures we get:

:risk-measure 0.02866805600574157

A standard deviation of just under 3% from the mean, not bad going at all. It’s consistent.

Using The Sharpe Ratio

A common measurement used in finance, the Sharpe Ratio is a reward versus risk measurement. I’m basically taking the average of the daily returns, dividing it by the standard deviation, and then multiplying that ratio by the square root of the number of trading periods (52 weekly readings in this case).

There are variations on the Sharpe Ratio but this version serves me well as a reward/risk measurement.

(defn sharpe-ratio [coll trading-days]
 (let [k-num (Math/sqrt trading-days)
 dr (daily-returns coll)
 risk (risk-measure coll)]
 (* k-num (/ (stats/mean dr) risk))))

Let’s have a look with Seesense’s readings.

(inv/sharpe-ratio seesense 52)
-1.0255674077731773

I’m looking for a ratio of 1 as a generally good return with low risk; if it were 2 then my eyes would be wide open looking to invest, had the company IPO’d. As Seesense’s downward curve (why? they’re a good company in NI) is pretty consistent, I personally wouldn’t be looking to invest, and the -1 Sharpe ratio confirms it.

Take six companies

Let’s take this a little further. At the start of November I looked at the data from a number of companies from the mainland and Northern Ireland. Good companies doing great things; what’s important is that there’s a consolidated number within Mattermark for all of them.

I ran the investor metrics against the Mattermark company data and got the following results back. Disclaimer: the idea is to remove myself from the biases of the startups and get an answer purely from the numbers; the companies below are all great, I just picked them for the purpose of this blog.

Company Annual Ret Risk Sharpe
Adoreboard -0.0645 0.0171 -0.4950
Airpos 0.9833 0.0408 2.5306
App Attic 0.5625 0.2118 0.8072
Brewbot -0.25 0.0182 -2.1838
Get Invited 0.1428 0.1084 0.5443
Taggled TV -0.0064 0.0252 0.0520

Interesting results: there’s one standout "investment" and that’s Airpos. With a Sharpe ratio of 2.53 and only a 4% risk factor, an investor would be putting money on it if it were IPO’d.

Talking of risk, is App Attic the riskiest company? From the numbers it says so, at 21%, but the annual return was good too at 56%. There was plenty of volatility in the raw data to back it up, especially in the early part of the year.

From an investment standpoint the numbers, while live, are purely theoretical. The question we need to ask ourselves is this: is a VC, Angel or other investor going to use these scores as part of their due diligence process? Taken even further, could something like the gambler’s fallacy take over?

The Binary Decision

Would data-driven decision making work for long-term returns? Suppose, for example, an investor algorithm worked on two basic rules:

if risk < 10% and sharpe > 1 then invest

Out of the seven companies we’ve looked at altogether, only one comes out positive from the algorithm, and that’s Airpos.
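As a rough Clojure sketch of that rule (the map keys are my own assumption about how the metrics get passed around):

(defn invest?
 "Two-rule decision: risk under 10% and a Sharpe ratio above 1."
 [{:keys [risk sharpe]}]
 (and (< risk 0.10) (> sharpe 1)))

;; (invest? {:risk 0.0408 :sharpe 2.5306})  => true   ;; Airpos
;; (invest? {:risk 0.0171 :sharpe -0.4950}) => false  ;; Adoreboard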

Concluding

It’s been an interesting exercise and one I’ll be keeping an eye on. The question that remains unanswered, though: do actual investors use the Mattermark scores as part of their due diligence and use the metrics available to steer the basic decision to invest or decline?

Is the @Mattermark score the best NI Startup Score metric? #mattermark #data #startups #clojure

How To Measure a NI Startup?

It’s been a question on my mind for a good five years, to the point that in 2015 I bought a domain called nitechrank to collate all the data from various sources and see if it could be resold. No sooner had I started than I saw that Mattermark was gaining momentum, so I did the sensible thing and stopped.

The question still remains in my mind: is there a good metric to tell me how a Northern Ireland startup is doing? One way is to ask the CEO of the company and, hands down, 100% of the time they will talk utter rubbish and say, "yeah it’s great, we’re just about to…..". Let me make this simple: the phrase "just about to" means "we haven’t".

There aren’t many publicly listed companies from Northern Ireland; off the top of my head I can only think of two: Kainos and First Derivatives. Andor now belongs to Oxford Instruments and UTV is now nested within ITV, with the reaction from the action being that Julian has been removed from our screens. With listed companies there is a metric of performance and shareholder value. With a small startup there’s none of that, just hearsay, hype and questionable cash flow projections.

So I’ve been thinking more and more about one metric that can tell me how an early-stage Northern Ireland startup is doing. And I think it all points back to Mattermark.

All Hail The Growth Score

The Mattermark Growth score is the baseline for how a startup looks to the outside world. It’s calculated from a number of sources, but the main ones to think about are social sources like LinkedIn, Facebook, Twitter and Instagram as well as the more entrepreneurial sources like Crunchbase and Angel.co.

With this information fed in, run through a nice model and distilled into a number of key scores (Growth, Mindshare and Momentum), we have a good idea of how a company is doing from an outsider’s perspective. Interaction with the brand will increase the mindshare element of your Mattermark score, for example.

There’s Nothing Like Exposure To Public Ridicule To Galvanise The Attention*….

If you are Northern Ireland CEO/Founder and you’re not looking at your Mattermark score then I suggest you get your skates on and watch it religiously.

While it’s all very nice with the “Our wee country” schizz, no one outside of Northern Ireland really gives much of a hoot about what you are saying but they will be keeping an eye on Mattermark, Crunchbase and other sources to see how things are performing.

In all fairness, while it’s good to be on the Propel Programme and accelerators like StartPlanet NI, that’s the first foot up from the ground to the first rung on the ladder. It’s about getting you ready to be let out into the world and those two glorious words: "investor ready". Historically I’ve not seen much gain on the hype cycle once a company has completed these types of things, just a few quid in the bank to last another quarter. And you know what, fair enough.

Back to the point, what I am saying is that your Mattermark Growth Score is your stock exchange price that others will look at, judge you on and figure if you’re worth looking at.

If we look at GetInvited’s Mattermark score it’s doing okay. A growth score of 48, okay there was a dip during the year but it’s on the up again.

Now, I and others know that GI raised money. And it’s something to shout about, so get it on Crunchbase! Would that raise the growth score? Possibly. Would it show other potential investors that there was enough faith in the team to put money in it? Definitely. Showing incremental growth on these platforms sends out a strong message that things are happening within the company.

Basically every NI company needs to figure out a way of getting positive PR, up to date information on Crunchbase and some good hype on the likes of Angel.co to start getting those scores up. Once that happens then perhaps startups from “our wee country” will get in front of the noses that matter.

None of this is secret; it’s just common sense.

Coming Soon….

In the next part I’ll start exploring the growth score numbers and seeing what we can learn and apply from them.

Are mobile apps giving the bookies the edge? Probably. #secondscreen #tv #xfactor

I’ve had a hunch for a long time and I think on Sunday night it was, with a fair probability, finally confirmed. This is more a brain dump of thoughts than anything with a cohesive conclusion…. take it for what you will.

Apps Produce Data, Got That, Good.

My interest in data goes back a long time, way before apps were a thing to be touted as a startup idea. I survived the J2ME years, when applications were delivered on a 4cm x 4cm screen. There was a time when this….

…was radical. I was there; it wasn’t pretty. How times have moved on: now we have smartphones, bigger screens and a better experience. Time moved forward and the advances in mobile technology did too. Connectivity got better and storage got cheaper.

Data creation, collection and processing got scalable, cheaper and much, much faster. And that’s just one set of data; start mushing it with other data sources and we have a rich mashup of stuff.

So what better use of these technological advances than the XFactor? Hear me out; the data gets rather interesting, I think.

Voting is data and data is money

This is data collection of the highest order. It’s:

  • Rich data direct from the viewer, for free.
  • In real time: there’s a two-minute window per song, and sampled every second that’s suddenly 120 data points in a time series across n viewers.
  • GeoIP tracking, more than likely (though I am guessing), so there’s a good chance of knowing whereabouts the viewer is. I see a heat map in the making.
  • Plus the five free REAL votes about who gets saved in the bottom three each Sunday.

And that’s me just scratching the surface. I’m sure there are more metrics pouring out of this thing than I care to imagine. Tellybug seem to have got the second-screen instant feedback down to a fine, slick art.

Who has the edge?

"They who have the data, have the edge" – Jason Bell, October 2016

I’ll park that there, it might come in useful one day. If you have data that no one else has then you have an edge on everyone else, simple. That might be in business or in this case, XFactor data on who’s going home.

A good example of having the edge in TV prediction markets was on the Great British Bake Off, where some members of the production company, allegedly, were putting large volumes of cash against the winner from episode one. The betting was suspended as it became clear someone had the edge. To be fair, as Bake Off is recorded in advance, plenty of people had the edge.

So last Sunday the bottom three were left in suspense, then one act was told they were saved and there was much surprise, shock and tears…. the rapping lady was not one of them. On Twitter, though, this happened four minutes after the reveal.

How did Betfred know this? The number of votes cast on the app…. a direct line to the data perhaps?

My conclusion, rightly or wrongly, is this: the betting companies are buying in the data from either the production company or Tellybug and therefore gaining a competitive advantage over their customers. To be honest it’s no different from any other outcome prediction. I could never figure out why the rapping lady’s odds in the next-elimination markets were so out there; Betfred (and others) already knew.

With the ratings system on the app a betting company can get the edge: it knows what the audience is thinking, and as that audience is more than likely going to be voting, the odds can be generated with a fair degree of certainty.

Competitive advantage? Yes. Surprising? No. A good use of data? Absolutely.

Calculating The Darcey Coefficient – Part 4 – Live Testing #strictlycomedancing #clojure #linearregression

I promise this is the last part of The Darcey Coefficient. Having gone through linear regression, neural networks and refining the accuracy of the prediction, it was only fair that I ran the linear regressions against some live scores to see how they performed.

If you want to read the first four parts (yes four, I’m sorry) then they are all here.

Week 5 Scores

As ever the Ultimate Strictly website is on the ball, the scores are all in.

Judges scores
Couple Craig Darcey Len Bruno Total
Robert & Oksana 6 8 8 7 29
Lesley & Anton 5 6 7 6 24
Greg & Natalie 4 6 7 7 24
Anastacia & Gorka* 7 7 8 8 30
Louise & Kevin 8 8 8 9 33
Ed & Katya 2 6 6 4 18
Ore & Joanne 9 9 9 9 36
Daisy & Aljaž 8 8 8 8 32
Danny & Oti 8 9 9 9 35
Claudia & AJ 8 7 8 9 32

So we have data and the expected result. The real question is how well the regressions perform. Time to rig up some code.

Coding the Test

As the spreadsheet did the work I don’t need to reinvent the wheel. All I need to do is take the numbers and put them in a function.

First there’s the Craig -> Darcey regression.

(defn predict-score-from-craig [x-score]
 (+ 3.031 (* 0.6769 x-score)))

And then there’s the All Judges -> Darcey regression.

(defn predict-score-from-all [x-score]
 (- (* 0.2855 x-score) 1.2991))

As the predictions will not come out as integers I need a function to round up or down as required. So I nabbed this one from a StackOverflow comment as it works nicely.

(defn round2 [precision d]
 (let [factor (Math/pow 10 precision)]
   (/ (Math/round (* d factor)) factor)))

Finally I need the data, a vector of vectors with Craig, Len, Bruno and Darcey’s scores. I leave Darcey’s actual score in so I have something to test against: what the predicted score was versus what the actual score was.

;; vectors of scores, [craig, len, bruno, darcey's actual score]
(def wk14-scores [[6 8 7 8]
 [5 7 6 6]
 [4 7 7 6]
 [7 8 8 7]
 [8 8 9 8]
 [2 6 4 6]
 [9 9 9 9]
 [8 8 8 8]
 [8 9 9 9]
 [8 8 9 7]])

Predicting Against Craig’s Scores

The difference between Craig and Darcey’s scores can fluctuate depending on the judges’ comments. The dance with Ed and Katya is a good example: Craig scored 2 and Darcey scored 6, so I’m not expecting great things from this regression, but as it was our starting point let’s test it.

(defn predict-from-craig [scores]
 (map (fn [score]
   (let [craig (first score)
         expected (last score)
         predicted (round2 0 (predict-score-from-craig (first score)))]
    (println "Craig: " craig
             "Predicted: " predicted
             "Actual: " expected
             "Correct: " (if (= (int predicted) expected)
                             true
                             false)))) scores))

When run it gives us the following predictions:

strictlylinearregression.core> (predict-from-craig wk14-scores)
Craig: 6 Predicted: 7.0 Actual: 8 Correct: false
Craig: 5 Predicted: 6.0 Actual: 6 Correct: true
Craig: 4 Predicted: 6.0 Actual: 6 Correct: true
Craig: 7 Predicted: 8.0 Actual: 7 Correct: false
Craig: 8 Predicted: 8.0 Actual: 8 Correct: true
Craig: 2 Predicted: 4.0 Actual: 6 Correct: false
Craig: 9 Predicted: 9.0 Actual: 9 Correct: true
Craig: 8 Predicted: 8.0 Actual: 8 Correct: true
Craig: 8 Predicted: 8.0 Actual: 9 Correct: false
Craig: 8 Predicted: 8.0 Actual: 7 Correct: false

Okay, 50/50, but this doesn’t come as a surprise. With more scores (i.e. Len and Bruno) we might hit the mark better.

Do The Full Judges’ Scores Improve the Prediction?

Let’s find out, here’s the code:

(defn predict-from-judges [scores]
 (map (fn [score]
          (let [judges (reduce + (take 3 score))
                expected (last score)
                predicted (round2 0 (predict-score-from-all judges))]
          (println "Judges: " judges
                   "Predicted: " predicted
                   "Actual: " expected
                   "Correct: " (if (= (int predicted) expected)
                                   true
                                   false)))) scores))

By taking the sum of the first three scores (Craig, Len and Bruno) that total is then run against the predict-score-from-all function. How it performs is anyone’s guess right now.

strictlylinearregression.core> (predict-from-judges wk14-scores)
Judges: 21 Predicted: 5.0 Actual: 8 Correct: false
Judges: 18 Predicted: 4.0 Actual: 6 Correct: false
Judges: 18 Predicted: 4.0 Actual: 6 Correct: false
Judges: 23 Predicted: 5.0 Actual: 7 Correct: false
Judges: 25 Predicted: 6.0 Actual: 8 Correct: false
Judges: 12 Predicted: 2.0 Actual: 6 Correct: false
Judges: 27 Predicted: 6.0 Actual: 9 Correct: false
Judges: 24 Predicted: 6.0 Actual: 8 Correct: false
Judges: 26 Predicted: 6.0 Actual: 9 Correct: false
Judges: 25 Predicted: 6.0 Actual: 7 Correct: false

Well, that’s interesting: we get much lower predicted scores based on the combined judges’ scores. Every prediction was wrong; that would hurt if you were betting on it.

All of this leads us to a conclusion: if you want to predict what Darcey’s score is going to be, look at what Craig does first.

That’s that, case is now closed.

Refining the Coefficient. Iterative Improvements In Learning. #data #machinelearning #linearregression

Refinement is an iterative process, sometimes quick and sometimes slow. If you’ve followed the last few blog posts on score prediction (if not you can catch up here) I’ve run the data once and rolled with the prediction, basically, “that’s good enough for this”.

The kettle is on, tea = thinking time

This morning I was left wondering, as Strictly is on tonight, is there any way to improve the reliability of the linear regression from the spreadsheet? The neural network was fine, but for good machine learning you need an awful lot of data to get a good prediction fit. The neural net was level pegging with the small linear model, at about 72%.

I’ve got two choices: create more data to tighten up the neural net, or have a closer look at the original data and find a way of changing my thinking.

Change your thinking for better insights?

Let’s remind ourselves of the raw data again.

2,5,5,5
5,6,4,5
3,5,4,4
4,6,6,7
6,6,7,6
7,7,7,7

Four numbers, the scores from Craig, Len, Bruno and Darcey in that order. The original linear regression only looked at Craig’s score to see the impact on Darcey’s score.

That gave us the prediction:

y = 0.6769x + 3.031

And an R squared value of 0.792, not bad going. The neural network took into account all three scores from Craig, Len and Bruno to classify Darcey’s score; it was okay, but the lack of raw data actually let it down.

Refining the linear regression with new learning

Let’s go back to the spreadsheet and tinker with it. What happens if I combine the three scores using the SUM() function to add them together?

Very interesting, the slope is steeper for a start. The regression now gives us:

y = 0.2855x - 1.2991

And the R squared has gone up from 0.792 to 0.8742, an improvement. And as it stands this algorithm is now more accurate than the neural network I created.
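Dropping the refined regression into Clojure is a one-liner; a sketch (the function name is mine), where x is the sum of Craig, Len and Bruno’s scores:

(defn predict-darcey-from-sum [x]
 (- (* 0.2855 x) 1.2991))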

Concluding

It’s a simple change, and quite an obvious one, and we’ve taken the original hypothesis forward since the original post. How accurate is the refined linear regression? Well, I’ll find that out tonight I’m sure.

Calculating The Darcey Coefficient – Part 3 #strictlycomedancing #machinelearning #clojure #weka

The Story So Far…

This started off as a quick look at Linear Regression in spreadsheets and using the findings in Clojure code; that’s all in Part 1. Muggins here decided that wasn’t good enough and rigged up a Neural Network to keep the AI/ML kids happy; that’s all in Part 2.

Darcey, Len, Craig or Bruno haven’t contacted me with a cease and desist so I’ll carry on where I left off….. making this model better. In fact they seem rather supportive of the whole thing.

Weka Has Options.

When you create a classifier in Weka there are options available to tweak and refine the model. The Multilayer Perceptron that was put together in the previous post ran with the defaults. As Weka can automatically build the neural network I don’t have to worry about how many hidden layers to define; that will be handled for me.

I do however want to alter the number of iterations the model runs (epochs) and I want to have a little more control over the learning rate.

The clj-ml library handles the options as a map.

darceyneuralnetwork.core> (def opts {:learning-rate 0.4 :epochs 10000})
#'darceyneuralnetwork.core/opts
darceyneuralnetwork.core> (classifier/make-classifier-options :neural-network :multilayer-perceptron opts)

The code on Github is modified to take those options into account.

(defn train-neural-net [training-data-filename class-index opts]
 (let [instances (load-instance-data training-data-filename)
       neuralnet (classifier/make-classifier :neural-network :multilayer-perceptron opts)]
   (data/dataset-set-class instances class-index)
   (classifier/classifier-train neuralnet instances)))

(defn build-classifier [training-data-filename output-filename]
 (let [opts (classifier/make-classifier-options :neural-network :multilayer-perceptron
                                                {:learning-rate 0.4
                                                 :epochs 10000})
       nnet (train-neural-net training-data-filename 3 opts)]
   (utils/serialize-to-file nnet output-filename)))

Concluding

There’s not much further I can take this as it stands. The data is actually robust enough that using Linear Regression gives the kind of answers we were looking for. Another argument would be that you could use a basic decision tree to read Craig’s score and classify Darcey’s score.

If the data were all over the place in terms of scoring then using something along the lines of an artificial neural network would be worth doing. And using Weka with Clojure the whole thing is made a lot easier. It’s actually easy to do in Java which I did in my book Machine Learning: Hands on for Developers and Technical Professionals.

Rest assured this is not the last you’ll see of machine learning on this blog; there’s more to come.

Calculating The Darcey Coefficient – Part 2 #strictlycomedancing #machinelearning #clojure #weka

Previously on…..

In part 1 we looked at using linear regression, with the aid of a spreadsheet, to see if we could predict, within a reasonable tolerance, what Darcey Bussell’s score would be based on Craig Revel Horwood’s score.

No big deal, it worked quite well, it took less than five minutes and didn’t interfere with me making a cup of tea. As we concluded from a bit of data application:

y = 0.6769x + 3.031

And all was well.

Time To Up The Ante

Linear Regression is all well and good, but this is 2016, the year where every Northern Ireland company decides it’s going to do artificial intelligence and machine learning with hardly any data…. So we’re going to upgrade the Darcey Coefficient and go all Techcrunch/Google/DeepMind on it: Darcey’s predictions are now going to come from an Artificial Neural Network!

Ewwwww.

My sentiments exactly. For the readers of previous posts, both of you, my excitement for neural networks isn’t exactly up there. They’re good but held with a small amount of skepticism. My reasons? Well, like I’ve said before….

One of the keys to understanding the artificial neural network is knowing that the application of the model implies you’re not exactly sure of the relationship of the input and output nodes. You might have a hunch, but you don’t know for sure. The simple fact of the matter is, if you did know this, then you’d be using another machine learning algorithm.

We’ve already got a rough idea how this is going to pan out; the linear regression gave us a good clue. The amount of data we have isn’t huge either: the data set has 476 rows in it. So the error rate of a neural network might actually be larger than I’d like.

The fun though is in the trying. And in aid of reputation, ego and book sales, well hey, it’s worth a look. So I’m going to use the Weka machine learning framework as it’s good, solid and it just works. The neural network can be used for predicting the score of any judge, and as Len’s leaving this is the perfect opportunity to give it a whirl. For the purposes of demonstration though I’ll use Darcey’s scores, as it follows on from the previous post.

Preparing the Data

We have a CSV file but I’ve pared this down so it’s just the scores of Craig, Darcey, Len and Bruno. Weka can import CSV files, but I prefer to craft the format that Weka likes, which is the ARFF file. It spells out the format, the output class we’re expecting to predict on and so on.

@relation strictlycd

@attribute craig numeric
@attribute len numeric
@attribute bruno numeric
@attribute darcey numeric

@data
2,5,5,5
5,6,4,5
3,5,4,4
4,6,6,7
6,6,7,6
7,7,7,7..... and so on

Preparing the Project

Let’s have a crack at this with Clojure; there is a reference to the Weka framework on Clojars so this in theory should be fairly easy to sort out. Using Leiningen to create a new project, let’s go:

$ lein new darceyneuralnetwork
Warning: refactor-nrepl requires org.clojure/clojure 1.7.0 or greater.
Warning: refactor-nrepl middleware won't be activated due to missing dependencies.
Generating a project called darceyneuralnetwork based on the 'default' template.
The default template is intended for library projects, not applications.
To see other templates (app, plugin, etc), try `lein help new`.

Copy the arff file into the resources folder, or somewhere on the file system where you can find it easily, then I think we’re ready to rock and rhumba.

I’m going to open the project.clj file and add the Weka dependency in; I’m also going to add the clj-ml project, a handy Clojure wrapper for Weka. It doesn’t cover everything, but it takes the pain out of some things like loading instances and so on.

(defproject darceyneuralnetwork "0.1.0-SNAPSHOT"
 :description "FIXME: write description"
 :url "http://example.com/FIXME"
 :license {:name "Eclipse Public License"
           :url "http://www.eclipse.org/legal/epl-v10.html"}
 :dependencies [[org.clojure/clojure "1.8.0"]
                [clj-ml "0.0.3-SNAPSHOT"]
                [weka "3.6.2"]])

Training the Neural Network

In the core.clj file I’m going to start putting together the actual code for the neural network (no Web API’s here!).

Now a quick think about what we actually need to do. It’s pretty simple with Weka in control of things; a checklist is helpful all the same.

  • Open the arff training file.
  • Create instances from the training file.
  • Set the class index of the training file, ie what we are looking to predict, in this case it’s Darcey’s score.
  • Define a Multilayer Perceptron and set its options.
  • Build the classifier with training data.

The nice thing about using Weka with Clojure is that we can do REPL-driven design and do things one line at a time.

The wrapper library has a load-instances function, which takes the file location as a URL.

darceyneuralnetwork.core> (wio/load-instances :arff "file:///Users/jasonbell/work/dataissexy/darceyneuralnetwork/resources/strictlydata.arff")
#object[weka.core.Instances 0x2a5c3f7 "@relation strictlycd\n\n@attribute craig numeric\n@attribute len numeric\n@attribute bruno numeric\n@attribute darcey numeric\n\n@data\n2,5,5,5\n5,6,4,5\n3,5,4,4\n4,6,6,7\n6,6,7,6\n7,7,7,7\n6,7,7,6\n3,5,4,5\n5,6,5,5\n8,7,7,8\n5,7,5,5\n3,5,5,5\n6,6,7,8\n4,4,5,5\n6,6,6,6\n7,7,7,6\n6,6,6,6\n3,5,5,6\n6,7,7,6\n2,4,4,5\n7,8,8,7\n8,8,8,8\n5,5,5,5\n6,5,5,6\n7,6,7,6\n3,5,5,5\n7,6,6,6\n5,6,6,6\n4,7,6,5\n3,6,4,6\n3,5,4,6\n4,5,4,4\n7,7,8,7\n8,8,8,8\n7,6,6,7\n6,6,6,7\n7

Okay, with that working I’m going to add it to my code.

(defn load-instance-data [file-url]
 (wio/load-instances :arff file-url))

Note the load-instances function expects a URL, so make sure your filename begins with "file:///", otherwise it will throw an exception.

With training instances dealt with in one line of code (gotta love Clojure, it would take three in Java) we can now look at the classifier itself. The decision is to use a neural network, in this instance a Multilayer Perceptron. In Java it’s a doddle, in Clojure even more so:

darceyneuralnetwork.core> (classifier/make-classifier :neural-network :multilayer-perceptron)
#object[weka.classifiers.functions.MultilayerPerceptron 0x77bf80a0 ""]

It doesn’t actually do anything yet, but there’s a classifier ready and waiting. We have to define which attribute (Craig, Len, Bruno or Darcey) we wish to classify; Darcey is index 3 (counting from zero). Weka needs to know what you are trying to classify, otherwise it will throw an exception.

(data/dataset-set-class instances 3)

Now we can train the model.

darceyneuralnetwork.core> (classifier/classifier-train ann ds)
#object[weka.classifiers.functions.MultilayerPerceptron 0x1b800b32 "Linear Node 0\n Inputs Weights\n Threshold -0.005224665277369991\n Node 1 1.161165780729305\n Node 2 -1.0681086084010063\nSigmoid Node 1\n Inputs Weights\n Threshold -2.5314445242321613\n Attrib craig 1.3343684436571155\n Attrib len 1.290973083908637\n Attrib bruno 1.1941270206738404\nSigmoid Node 2\n Inputs Weights\n Threshold -1.508477761092395\n Attrib craig -0.73817374973773\n Attrib len -0.7490868020959697\n Attrib bruno -1.3714589018840246\nClass \n Input\n Node 0\n"]
darceyneuralnetwork.core>

The output is showing the input node weights. All looks good. We have a neural network that can predict Darcey’s score based on the other three judges’ scores.

Remember this is all within the REPL; back in my code I can now craft a function to train a neural network.

(defn train-neural-net [training-data-filename]
 (let [instances (load-instance-data training-data-filename)
       neuralnet (classifier/make-classifier :neural-network :multilayer-perceptron)]
   (data/dataset-set-class instances 3)
   (classifier/classifier-train neuralnet instances)))

All it does is create the steps I did in the REPL: load the instances, create a classifier, select the class to classify and then train the neural network.

Giving it a dry run, we run it like so from the REPL.

darceyneuralnetwork.core> (def training-data "file:///Users/jasonbell/work/dataissexy/darceyneuralnetwork/resources/strictlydata.arff")
#'darceyneuralnetwork.core/training-data
darceyneuralnetwork.core> (def nnet (train-neural-net training-data))
#'darceyneuralnetwork.core/nnet
darceyneuralnetwork.core> nnet
#object[weka.classifiers.functions.MultilayerPerceptron 0x100e60b7 "Linear Node 0\n Inputs Weights\n Threshold -0.005224665277369991\n Node 1 1.161165780729305\n Node 2 -1.0681086084010063\nSigmoid Node 1\n Inputs Weights\n Threshold -2.5314445242321613\n Attrib craig 1.3343684436571155\n Attrib len 1.290973083908637\n Attrib bruno 1.1941270206738404\nSigmoid Node 2\n Inputs Weights\n Threshold -1.508477761092395\n Attrib craig -0.73817374973773\n Attrib len -0.7490868020959697\n Attrib bruno -1.3714589018840246\nClass \n Input\n Node 0\n"]
darceyneuralnetwork.core>

All good then. So we’ve crafted a piece of code pretty quickly to train a neural network. I’d like to save the model so I don’t have to go through the pain of training it every time I want to use it. The utils namespace has a function to serialize the model to a file.

(utils/serialize-to-file nnet output-filename)

The nice thing with Weka is the process is the same for most of the different machine learning types.

  • Load the instances
  • Create a classifier
  • Set the output class
  • Train the model
  • Save the model

So let’s park that there; we have built a neural network. Time to move on to predicting some scores. If you want to have a look at the code, I’ve put it up on GitHub.

Predicting with the Neural Network

With our model (rough, ready and in need of refinement) we can do some predicting. It’s just a case of creating a new instance based on the training instances and running it against the neural network to get a score.

The make-instance function will take a defined instance type and apply data from a vector to create a new instance. Then it’s a case of running that against the model with the classifier-classify function.

darceyneuralnetwork.core> (def to-classify (data/make-instance instances [8 8 8 0]))
#'darceyneuralnetwork.core/to-classify
darceyneuralnetwork.core> (classifier/classifier-classify model to-classify)
7.776474338751103

So we have a score; if we rounded it up we’d get an 8, which is about right. If Craig were to throw a humdinger of a score in, the model performs well under the circumstances.

darceyneuralnetwork.core> (def to-classify (data/make-instance instances [3 8 8 0]))
#'darceyneuralnetwork.core/to-classify
darceyneuralnetwork.core> (classifier/classifier-classify model to-classify)
6.769847804443958

Let’s remember this model is created with defaults; it’s far from perfect, but with the amount of data we have it’s not a bad effort. There’s more we can do, but I can hear the kettle.

In Part 3…..

Yes, there’s a part 3. We’ll take this model and have a go at some tweaking and refining to make the predictions even better.