Machine Learning book update…. #ai #machinelearning #java #clojure #kafka #dl4j #weka #spark

The first draft of the second edition is complete and a lot of the editing work is done. There are a couple of big changes….

  • SpringXD has been replaced with Kafka, including how to build self-training machine learning models (with Neural Networks, Linear Regression and Decision Trees).
  • The chapter on Spark got a rewrite to bring it up to date (oh and bye Scala….)
  • New chapters on math/stats and stuff, machine learning with text and machine learning with images.
  • A whole new chapter on data preparation, extraction with Apache Tika and so on.
  • Some jokes have been removed…..

You can see the page on the Wiley website for more information; it's being updated while edits are ongoing, so it might not be 100% accurate.

If you want to preorder on Amazon (thank you in advance) then you can do so here.

A Little Gotcha in Luminus request in Layout Renders….. #clojure #webdev #luminus

It's been a while since I looked at the Luminus web development framework for Clojure. I know some folk find it a bit too heavy but I really like it. There have been some changes in the releases since I built DeskHoppa, so I thought I'd better leave this here in case someone else gets confused by it.

So, the basic page render used to look like this:

(layout/render "mytemplate.html")

Then you could pass in a map for values that would appear in the page.

(layout/render "mytemplate.html" {:key1 value1 :key2 value2})

In more recent releases it now takes the request object.

(layout/render request "mytemplate.html")

Now on first inspection the automatic assumption would be to add key/values to the request.

(assoc request :key3 value3)

Adding to the request has no effect: the render function doesn't pick up extra keys on the request when it renders the page. You can, however, pass a map of values after the template name is declared.

(layout/render request "mytemplate.html" {:key1 value1 :key2 value2 :key3 value3})

And the map will get passed to the page so you can use it.

<p>Hi, {{key1}}, you're visiting from {{key2}}!</p>

And so on.
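For context, a complete route handler would look something like the sketch below. This is my own minimal example rather than anything lifted from the Luminus docs; the namespace, handler name and values are made up for illustration.

(ns myapp.routes.home
  (:require [myapp.layout :as layout]))

;; hypothetical handler: the request goes through to render untouched,
;; and the template values travel in the trailing map
(defn home-page [request]
  (layout/render request "mytemplate.html"
                 {:key1 "Jase" :key2 "Belfast"}))

With that in place, {{key1}} and {{key2}} resolve in the template exactly as above.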

I only mention it because it caught me out for a good half an hour……. I thought it might be useful to someone else.

Adding file logging to Kafka Connect – #Kafka #Connect #Streaming #Data #Devops

More of a memory aid for me as I’ll forget…..

Kafka Connect's default logging goes to the console; I prefer tailing and grepping files. I'm using Confluent Kafka 5.1.2, but the file locations will be much the same if you're using the Apache version.

Open up the connect-log4j.properties file in the $KAFKA_HOME/etc/kafka directory.

Add the following lines.

log4j.appender.logFile=org.apache.log4j.DailyRollingFileAppender
log4j.appender.logFile.DatePattern='.'yyyy-MM-dd-HH
log4j.appender.logFile.File=/tmp/connect-worker.log
log4j.appender.logFile.layout=org.apache.log4j.PatternLayout
log4j.appender.logFile.layout.ConversionPattern=[%d] %p %m (%c)%n

Change:

log4j.rootLogger=INFO, stdout

To:

log4j.rootLogger=INFO, logFile

Then restart the Kafka Connect worker. The new log file will appear in the /tmp folder.

Also worth noting…..

The connect-standalone.properties file seems to have changed from version to version. The Connect jar path in 5.1.2, for example, is share/java, where it's /usr/share/java in 5.3.1; I spent a while trying to figure out why no Connect plugins were working at all. So it's worth setting up a symlink, copying the jars to that location, or, simpler still, changing the path in the properties file.
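If you go down the properties-file route, the setting to check is plugin.path in the worker properties, something along these lines (adjust the path to wherever your install actually keeps the Connect jars):

# point Connect at the plugin jars explicitly rather than symlinking or copying them
plugin.path=/usr/share/java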

Jason, Jase and Jade – The @skillsmatter tribute #ClojureX #conferences #events

The sad news began to stream in yesterday afternoon that Skillsmatter had gone into administration. I'm sure some of us have been in an organisation when this kind of thing has happened; it's no fun.

What I did witness was the outpouring of thanks and support, which was wonderful to see.

Five years ago I became aware of ClojureX. When it was suggested that I talk, I refused: "I'm no good to talk at that!" was my reply. I'd been coding Clojure for six months at that point and the talk titles I'd seen were on another planet. It wasn't for another year that I took to the stage.

Jade is born….. by mistake

My name is Jason; everyone calls me Jase (at my request: there's another Jason Bell, a very good photographer, we know each other, and me being Jase makes things easier). By an autocorrect quirk of fate, Chris Howe-Jones turned Jase into Jade. A meme was born, and everyone in the Clojure community knew me as Jade.

The one thing Jade noticed was that everyone at Skillsmatter was completely kind, caring and professional, always wanting to make sure that you were okay. If you were happy, the conference was always better. They made sure everyone was happy; it felt more like family than organisers.

Bring on the Hecklers….

At ClojureX I could let go a little and, being in December, I could get away with some very silly call and response. It became near-pantomime last year, with one lady shouting at the top of her voice, "IT'S BEHIND YOU!". I folded over laughing.

The first row ended up being a gallery of heckling rogues. It was hilarious, and I've never delivered a plain talk since; there has to be some audience participation if they're willing to join in. I know it's not for everyone, but it makes a huge difference.

ClojureX was always close to my heart, along with Strata London, to the point that this year I stopped doing all talks but those two.

So……..

A huge thank you to Wendy, Carla, Sam, Nicole and everyone else who made sure I was happy, fed, watered and looked after, and who let me nod off after very early morning flights.

To John, Gaibile, Chris, Bruce and the rest of the committee – thanks for the hard work you’ve put in over the years.

Lastly, to those who gave me the grace to stand up, entertain and educate, thank you. The positive feedback meant the world to me.

SkillsMatter, it was an honour. Thank you!

(Everyone will still call me Jade…. I know).

“Where have you been hiding?”…. #machinelearning #books #writing

Last posted 15th August…..

…basically I didn’t have anything interesting to say.

I have been busy working on this, though:

The second edition of Machine Learning: Hands on for Developers and Technical Professionals is well underway; the first drafts are going to be completed over the next couple of weeks. A lot has changed since the first edition back in 2014 (the description text is liable to change between now and March 2020).

I took 2019 off from doing talks; I wore myself out in 2018 and really didn't want to repeat the whole episode. After the book is complete I'll be slowing down again into 2020.

Once the book is complete I'll be blogging more frequently. There are a lot more Kafka things to talk about.

Is Googlewhacking Still Possible? cc @DaveGorman :) #data #bigdata #davegorman #googlewhack

You Need to See This!….

A kind of ritual viewing has happened with me and the teen recently, especially as there's a shared interest in statistics and comedy between the pair of us. They'll suggest one thing, we'll watch it, and then I've gone through Monty Python, Billy Connolly, Jasper Carrott, Bill Bailey and so on….. I'll get shown Game Theory and Film Theory in equal measure.

Then one evening, it hit me….. YOU NEED TO SEE THIS!

Dave Gorman’s Googlewhack Adventure……

(Now as this isn’t an official version of the live show I encourage you to venture here for the DVD and here for the book.)

Francophile namesakes

I owned the DVD when it came out in 2004 and it was a wonder to watch; even after being involved in the web and data industry since 1995, I was still mesmerised.

It's the true story of Dave Gorman, tasked by his friend, also called Dave Gorman, with finding a continuous connection, a chain, of ten Googlewhacks. Meeting each Googlewhacker in person, they supply Dave with two further Googlewhacks of their own finding. It is, in all seriousness, compelling viewing. And I know what you're thinking….

What’s a Googlewhack? We need to go back in time a bit.

Googlewhack is a contest for finding a Google search query consisting of exactly two words without quotation marks that returns exactly one hit. A Googlewhack must consist of two actual words found in a dictionary. A Googlewhack is considered legitimate if both of the searched-for words appear in the result page. (From Wikipedia)

Watching it again in 2019, it's still a brilliant story, but it's also interesting to see how much the internet has changed, some for the better and some for the far worse. There's a more important question though.

Can It Still Be Done?

Have our accelerated lives, data, data shadows and other digital fingerprints rendered all of this history? Or is there a minute glimmer of hope that it could still be done?

The Oxford Dictionary contains 171,476 words in current use. From this point on it's a combinatorics problem: how many word pair combinations actually exist? Back in 2004 I wouldn't even have known how to ask that question, let alone find an answer for it….. oh how my life has changed.

There are 14,701,923,550 word pairs (171,476 × 171,475 / 2) that could be searched in an attempt to find a Googlewhack. Fourteen billion….. and from my point of view that's not a big data problem, it's an average-sized data problem. How long would it take, though?

A quick Google search on “Francophile Namesakes” tells us two interesting facts.

Firstly, there are 59,800 results…. no longer a Googlewhack by any stretch of the imagination, and second, the result took 0.33 seconds to find. (14,701,923,550 * 0.33) / 60 gives us 80,860,579 minutes to do all the word pair searches, roughly 1.35m hours. Basically a single computer would take about 154 years just to hit Google with all the pairs in search of a Googlewhack.

In our world of clustered computing, with loads of computers doing the job at the same time, I could deploy 1,000 machines and it would still take over fifty days to do the work.
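For what it's worth, the back-of-envelope arithmetic above fits in a few lines of Clojure; the figures are the rough estimates from this post (the OED word count and that single 0.33 second search), not measurements.

(def words 171476)                                    ;; OED words in current use
(def pairs (/ (* words (dec words)) 2))               ;; n(n-1)/2 => 14701923550 word pairs
(def total-minutes (/ (* pairs 0.33) 60))             ;; ≈ 80,860,580 minutes of searching
(def years-one-machine (/ total-minutes 60 24 365))   ;; ≈ 154 years on a single machine
(def days-on-1000 (/ (* years-one-machine 365) 1000)) ;; ≈ 56 days across 1,000 machines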

Ultimately, it doesn't matter. It's been done already, by Dave, in a time when you could easily do those kinds of searches, when human connection was the default standard of communication. And that's when I was reminded of what the internet has lost for me: the humanity of data. With Facebook, Twitter and all the other social networks, the social aspect is, to me, lost; it's a broadcast medium for those who want to listen. Back in 2004 the landscape was much different…… The Googlewhack Adventure just reminded me how much I missed it.

Thanks Dave. Sadly, I've no idea what that's done to the graph.

I Shut Down DeskHoppa, here’s why. #DeskHoppa #startups

Yesterday I shut down DeskHoppa. It wasn’t an easy decision but it was the right decision. It surprised a few people that I’d do such a thing and there were a few messages from dear friends wondering if I’d made the right call.

And no, I didn’t delete the code but I did delete all the data.

Marketplace Startups Are Hard

That's the plain and simple fact. While it's all very well knowing that there are buyers and sellers out in the marketplace, actually tying them together via your service is really hard. You are effectively marketing to two sides of the coin; it's not a simple equation to complete either.

One of the hardest things to solve is the initial stage. In DeskHoppa's case you need hosts listing in order to get users searching. Hosts were the hardest customers to get on board; they require convincing, and the harsh reality is that most won't trust you until you really convince them.

It’s a Numbers Game

Everyone I spoke to was lovely: "That's a great idea, I needed that yesterday!". The problem is that kind words do not put money in the bank. So you have to start with a figure in mind, £100,000 turnover for example, and work backwards….

There are 260 working days a year, so that's my frame of reference. £100,000 / 260 = £384.61 a day; that's what I need to be doing on average.

If my fee is £1.20 for every £10 booked (card fees are applied afterwards so they don't chew into my margin), then I'm looking at 321 bookings a day. Now look at the real-world side of that market: the funnel of users.

Search -> View -> Book. 

Assuming my booked users are the 321, I'm guessing the conversion rate is 3% from view to book (users just looking around do that: just look around). I need 10,700 host views a day based on my 321 at 3%. That, however, is not the end of the story. Not everyone is going to be looking all the time; so far the assumption is that 100% of users are searching, which is unrealistic. It's probably 3% again, at best.

So what I'm really saying is I need 356,666 signed-up users to make £100k/year, or 3.56m users to make a million in revenue a year. That doesn't even take hosts into account….
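As a sketch, here's that funnel arithmetic in Clojure; the conversion rates are the assumptions from this post, nothing measured.

(def target-turnover 100000)   ;; £ per year
(def working-days 260)
(def fee-per-booking 1.20)     ;; £1.20 earned on every £10 booked
(def view->book 0.03)          ;; assumed view-to-book conversion
(def searching 0.03)           ;; assumed share of signed-up users searching on a given day

(def daily-revenue (/ target-turnover working-days))                ;; ≈ £384.61
(def daily-bookings (Math/ceil (/ daily-revenue fee-per-booking)))  ;; 321
(def daily-views (/ daily-bookings view->book))                     ;; ≈ 10,700
(def users-needed (/ daily-views searching))                        ;; ≈ 356,666 signed-up users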

Facebook, Instagram, Twitter, Linkedin and GoogleAds…..

This was the first time I ran some experiments on ads. The ability to narrow in on target segments is critical; get that wrong and your spend vanishes in hours as a bunch of underage users poke around to see what you are doing.

Linkedin spends are quite expensive, but at least they give a rough idea of the conversion rate (for my scenarios it was about 0.79%).

Ultimately, boosting posts didn't return much: some nice users in the US and two hosts who enquired. Once again, though, it's a high-volume numbers game; you need money to make money. I knew that all along.

There’s a Skill to Knowing When to Call it a Day

When I embarked on DeskHoppa I was under no illusions: building the service is the easy bit (well, it is for me, I can write code quickly). The key was always eyeballs, and they're really hard to get. If you fool yourself that folk always care, there's a hard reality: the majority don't, and it takes time to get their attention and trust.

Knowing when to say "that's enough" comes from various iterations of history; I've let things run too long before. Idea validation is the hard part, and I don't believe it's about product/market fit, it's about market/product fit. You have to build the market first; if that market doesn't exist, you'll spend a long while creating it. The first person I heard flip the whole product/market thing was Gretta Van Reil of SkinnyMeTea.

After reviewing the numbers and looking at what it might take to get things where they needed to be, the right decision was made. There's a worse position to be in: a service that trickles money in but doesn't quite break even, the signal that something is happening but not at the volumes you need….. things can become a millstone quickly.

Finally….

There are some wonderful, supportive people out there. Ones who gave feedback, lists of improvements, shouted out repeatedly on social media. Ones who were blunt and told me the reasons why they wouldn’t host desks…. it was all valuable.

I emailed everyone the final email, to say thank you. You can’t just close a service and not say thanks. Some of the responses were lovely.

Thank you.

“Where have you been hiding?…..” #nitech #ni #machinelearning #ai #customerloyalty #clojure #java

For those who’ve been asking why I’m not so active in NI…..

Errrm, I haven't been hiding. So far 2019 has thrown some joyous curve balls, some good, some challenging, but the pointers to learn from were in plain sight.

Not Much Conference Talking….

Last year, 2018, was full of tech talks, and as much as I love doing them it felt like I was treading old ground, a bit like keeping the old classics in the set even though you hate playing them.

In terms of local talks, I stopped. The transaction costs weren't that high, but I certainly wasn't getting any value back. Plus, the number of sponsored meetups, hackathons and events was pushing out any realistic assessment of the AI/ML landscape locally; just an opinion.

I'd lost my joy for conference talks; I wasn't talking about the things that mattered, and it wasn't until I backtracked to my roots that I realised how much I was missing talking about real-world retail and customer loyalty….. I know some of you had asked about me doing more of that; I'm still finding an interesting angle. (That and no one asks me now.)

This year also saw me more involved in the international conferences that I do love. I’m now part of the programme committees for O’Reilly’s Strata Data Conference in London and San Jose, and also ClojureX in London.

And remember, no one should feel pressured to talk. If you want to do it, do it. If you don’t, then don’t.

Machine Learning Book 2nd Edition

Work has now started on the update to Machine Learning: Hands On for Developers and Technical Professionals. More machine learning at scale on the JVM (in Java and Clojure) and more on Deep Learning, Kafka, Image recognition and text mining.

Release won't be until the end of the year or into 2020. Not my call, depends on how fast I can type…. If you don't see me then I'm probably typing.

Apache Storm: From Clojure to Java….. some thoughts. #clojure #java #storm

The route to Clojure can be an odd one. For some it falls straight into the developer's way of working ("It's Lisp, I geddit!"). Others, like me, with our old Java-based OOP heads, struggled for the penny to drop.

If Java is your main language then the move to Clojure can be difficult; it's a different way of thinking. If Clojure is your first/main language then doing Java interop in Clojure is going to melt your head (I've seen this a lot, and I found it surprising too).

For me the penny dropped when my then boss, Bruce Durling, put it to me like this: “Data goes in to the function, data goes out of the function”. After that everything made sense and if you make functions small, separate and testable then it’s a joy to use.
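A trivial made-up illustration of what that clicked into for me: small pure functions where data goes in and data comes out, each one testable on its own.

;; toy example: nothing hidden, no state, just data in and data out
(defn add-vat
  "Takes a price, returns the price with 20% VAT added."
  [price]
  (* price 1.2))

(defn order-total
  "Takes a collection of prices, returns one number."
  [prices]
  (reduce + (map add-vat prices)))

(order-total [10 20 30]) ;; => 72.0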

There's one issue, though, that has always been a challenge, not just for Clojure but for other languages too: mainstream adoption.

It’s better for a developer to have two or three languages in their toolbox, not just one. The reason…. well the Apache Storm project dropped the mic.

https://storm.apache.org/2019/05/30/storm200-released.html

“While Storm’s Clojure implementation served it well for many years, it was often cited as a barrier for entry to new contributors.”

Yup get that completely.

Clojure Takes Time….

Clojure takes time to learn and to do well. There's a group of folk who just get confused by too many parentheses; I was one of them. Another thing I've found is that the adoption route can be made harder by the documentation in projects; too many times I've come across things you were just supposed to know, and it wasn't helpful.

I suffered huge, huge, huge imposter syndrome with the Clojure community; they talked in a different language, and my mental reaction was "I don't fit in here". They spoke about solutions that were just plain confusing. Over the last four years of this blog I've done my best to break stuff down and explain it in plain English, to give the next poor sod a chance. I was actually scared of doing my first talk at ClojureX, petrified actually; the audience in the room knew far more than I did.

Finding Clojure developers is pretty much an uphill struggle; it's a small circle. Finding good ones is harder, though that could be said of Scala and the like too. It's easier to cross-train someone from Java into Clojure, but that takes time and most companies are not in a position to wait; there's work to be done. Recently I was talking to a company who were potentially interested in hiring, but they made one thing very clear: "We wouldn't want you to do anything in Clojure, no one here can support it." I totally agree; the bus number is key.

So with something like Apache Storm this doesn't come as a surprise; Apache projects need adopters, and that is a numbers game. Build a project on a language with minority adoption and there's a good chance the project will wither and die. Actually, I didn't realise Storm was written in Clojure until I read the announcement.

The Bottom Line is I Love Clojure

Knowing what I know now, I find it hard to move away from Clojure. DeskHoppa is 100% Clojure, but I know I'll be developing that for the time being. I've realised that it's a niche, especially when it comes to things like the Strata Data Conference, where I've always put things in Java and some Clojure; I've had to, otherwise my talks get rejected.

I never wanted to learn Haskell…….

Finding #pi with #montecarlo method and #Clojure – #math #justmath

I was reading a post this morning on the Towards Data Science blog by Tirthajyoti Sarkar, on using mathematical programming to build up skills in data science. While the article was based around Python, it didn't use any of the popular frameworks like NumPy or SciPy.

Now, with a bit of a lull, I wanted to keep my brain ticking over, so the thought of doing the math in Clojure appealed to me. I'm not saying one language is better than the other; the best language for data science is the one you know. The main key to data science is having a good grounding in the math behind it, not in the frameworks that make it easier.

Calculating Pi By Simulating Random Dart Board Throws

The Monte Carlo method is the concept of emulating a random process. When the process is repeated a large number of times, it gives rise to an approximation of some mathematical quantity of interest.

If you imagine a square dart board…..

Now imagine a square dart board with a circle inside the square; the edges of the circle touch the square…..

If you throw enough darts at the board, some will land within the circle and some outside it; the original article illustrates this graphically.

These are random throws; you might throw 10 times, you might throw 1 million times. At the end you count the number of darts within the circle, divide that by the number of throws (10, 1m, etc.) and then multiply it by 4.

As the original article states: the probability of a dart falling inside the circle is just the ratio of the area of the circle to that of the area of the square board.

The more throws we do, the better the chance of getting a number near Pi; the law of large numbers at work.
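To spell out where the multiply-by-four comes from: for a circle of radius r sitting inside a square of side 2r, the area ratio is (π × r²) / (2r)² = π/4, so scaling the hit ratio by 4 gets you back to Pi. A quick REPL sanity check of that constant (my own aside, not from the original article):

;; circle area over square area for r = 1 should be π/4
(let [r 1.0]
  (/ (* Math/PI r r) (Math/pow (* 2 r) 2)))
;; => 0.7853981633974483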

Throwing a Dart at The Board

I’m going to create a function that simulates a single dart throw. I want to break down my Clojure code into as many simple functions as possible. This makes testing and bug finding far easier in my opinion.

(declare calc-position) ;; defined just below

(defn throw-dart []
  {:x (calc-position 0)
   :y (calc-position 0)})

What I'm creating is an x,y coordinate with a 0,0 centre point, passing the x and the y coords through another function, calc-position, to calculate the position.

(def side-of-square 2)

(defn calc-position [v]
  (* (/ (+ v side-of-square) 2) (+ (- 1) (* 2 (Math/random)))))

The calc-position function takes the value of either x or y and applies the calculation, giving a random position somewhere between -side-of-square/2 and +side-of-square/2 around the centre point.

Running this function in a REPL we can see the x or y positions.

mathematical.programming.examples.montecarlo> (calc-position 0)
0.4298901518005238

Is The Dart Within The Circle?

Now that I have an x,y position as a map, {:x some random throw value :y some random throw value}, I want to confirm that the throw landed within the circle.

Using the side-of-square value again (hence it's a def), I can figure out if the dart hits within the circle. I'll pass in the map with the x,y coords and take the square root of the sum of the squared coordinates.

(defn is-within-circle [m]
  (let [distance-from-center (Math/sqrt (+ (Math/pow (:x m) 2) (Math/pow (:y m) 2)))]
     (< distance-from-center (/ side-of-square 2))))

This function will return true or false. If I check this in the REPL it looks like this:

mathematical.programming.examples.montecarlo> (throw-dart)
{:x 0.22535085231582297, :y 0.04203583357796781}
mathematical.programming.examples.montecarlo> (is-within-circle *1)
true

Now Throw Lots of Darts

So far there are functions to simulate a dart throw and confirm it’s within the circle. Now I need to repeat this process as many times as required.

I'm creating two functions: compute-pi-throwing-dart to run a desired number of throws, and throw-range to do the actual work of counting the hits inside the circle.

(defn throw-range [throws]
  (filter (fn [t] (is-within-circle (throw-dart))) (range 0 throws)))

(defn compute-pi-throwing-dart [throws]
  (double (* 4 (/ (count (throw-range throws)) throws))))

The throw-range function executes throw-dart for each element of the range, and is-within-circle evaluates the resulting map to true or false. The filter returns the range values whose throws landed inside the circle, so, for example, if out of ten throws the first, third and fifth land inside I'll get (0 2 4) back (the range starts at zero); what matters is the count, not the values themselves.

Calling compute-pi-throwing-dart sets all this in motion. As I said at the start, taking the number of darts in the circle, dividing that by the number of throws, and multiplying by four should give a number close to Pi.

The more throws you do, the closer it should get.

mathematical.programming.examples.montecarlo> (compute-pi-throwing-dart 10)
3.2
mathematical.programming.examples.montecarlo> (compute-pi-throwing-dart 10)
3.2
mathematical.programming.examples.montecarlo> (compute-pi-throwing-dart 10)
3.6
mathematical.programming.examples.montecarlo> (compute-pi-throwing-dart 10)
2.4
mathematical.programming.examples.montecarlo> (compute-pi-throwing-dart 10)
4.0
mathematical.programming.examples.montecarlo> (compute-pi-throwing-dart 10)
2.8
mathematical.programming.examples.montecarlo> (compute-pi-throwing-dart 100)
2.92
mathematical.programming.examples.montecarlo> (compute-pi-throwing-dart 1000)
3.136
mathematical.programming.examples.montecarlo> (compute-pi-throwing-dart 10000)
3.138
mathematical.programming.examples.montecarlo> (compute-pi-throwing-dart 100000)
3.15456
mathematical.programming.examples.montecarlo> (compute-pi-throwing-dart 1000000)
3.13834
mathematical.programming.examples.montecarlo> (compute-pi-throwing-dart 10000000)
3.1419096

Let’s Build a Simulation

Via the REPL there's evidence of an emergent behaviour: an approximation of Pi comes out of the large number of throws at the dart board.

The last thing I’ll do is build a function to run the simulation.

(defn run-simulation [iter]
  (map (fn [i]
    (let [throws (long (Math/pow 10 i))]
      (compute-pi-throwing-dart throws))) (range 0 iter)))

If I run 4 simulations I'll get 1, 10, 100 and 1000 throws computed; the results are returned as a list. If I run 9 simulations (which can take some time depending on the machine you're using) I get the following in the REPL:

mathematical.programming.examples.montecarlo> (run-simulation 9)
(0.0 3.6 3.28 3.128 3.1176 3.1428 3.142932 3.1425368 3.14173752)

That's a nice approximation; Pi is 3.14159265…, so getting a Monte Carlo method to compute Pi from purely random evaluations like this holds up well.