The Next Five Years of Machine Learning. #machinelearning #data #bigdata @brianmacnamee @digitalcircle


Last night I attended the Royal Irish Academy lecture “Show me your data, and I’ll tell you who you are” at Ulster University’s Magee Campus. It was an interesting lecture by Dr Brian MacNamee, one that sidestepped any technicalities and aimed for a general audience. It was a very good, informative and entertaining lecture.

One thing that I did notice was the mix of audience, students from school, members of the public and some lecturers from the university, no entrepreneurs in the room that I could see which was a shame though.

It was the amount of school ties in the room that inspired and prompted me to ask the question during the Q&A, “What do you see as the challenges to Machine Learning over the next 5 to 10 years?“.

Some Relics

Some of the machine learning algorithm that we’re using are old, take the decision tree for example, the ID3 algorithm designed by Ross Quinlan goes back to 1986, it’s on its thirtieth birthday. Threshold Logic, which because the foundation for Neural Networks dates back from the work of Warren McCulloch and Walter Pitts back in 1943, that’s 73 years ago.

Most of our modern machine learning systems are based on old technologies and algorithms, is there an opportunity to refine and redevelop these technologies? Is there an opening for new algorithms? I believe there is and the ones who will carry that torch are possibly the ones sat in the Magee lecture room last night proudly wearing their uniforms.

Thanks to Digital Circle for listing the event, I wouldn’t have known about it otherwise.



WhatsApp, encrypted messages don’t matter, your number does.


I’m always bemused when users read tech press comments, company gets acquired and the ringing message in the piece is, “this acquisition won’t change the way users interact with our product”. Of course it will…..

Facebook acquires Instagram, rapid user growth, it made sense. Facebook acquires WhatsApp, okay wasn’t expecting that one (and certainly not for $19Bn) and “it won’t change the service”…..Of course it will….

Phone Number + Facebook Profile = Targeted Insight

The relaxation of WhatsApp interaction with Facebook was always going to happen, you’d have to be pretty naive to think it wasn’t. Even with the encrypted messages that merely skims the surface of messaging as an advertising medium. WhatApp have your number at signup and if you join that with your phone number on your Facebook profile…. well you’ve got the advertising segments, likes, friends, events, movements, checkins and so on. WhatsApp just joined the big old advertising platform graph.


And if you think that the WhatsApp is generated, which it is, remember that Facebook own the whole shebang and will easily be able to generate the same. Even when you mangle it through a MD5 digest it’s still an id of sorts. Perform the same digest on the phone number of your Facebook profile you’ll get a match.

MD5 ("+447900316123plusmyreallylongIMEI") = f6dde26f21be9acef37667d770c95869

Non of this should be a surprise. Expect the ads to roll in soon and remember the old adage, “if you are using a service for free, you are the product”.



The Onyx Chronicles: onyx-template gotchas…. #onyx #clojure #data #kafka

In the previous posts on Onyx (you can read part 1 and part 2 on how to setup a Kafka Streaming application with Onyx), there are a couple of things worth noting that I didn’t mention in the original articles.

Blocking Jobs

When you submit your job you are submitting it to Zookeeper, then when the peers are running it sees what jobs are available to execute. The template code’s -main method does a good job of separating these actions out.

This line:

(onyx.test-helper/feedback-exception! peer-config job-id)

is looking out for exception messages on the job. If all is well it will block any form of exiting and wait for someone to kill it. Removing the line will effectively submit the job to Zookeeper, exit out and that’s that. When the peer is running it will then pick up the job. In our Kafka job the peer will pick up the job and then do it’s Kafka input job. Kill the peer and start it again, the job should pick up from the last offset.

The Greedy and the Balanced

The default template works on a greedy job scheduler, meaning that if you run ten peers and the job demands three, then the remaining seven will be blocked from use.

Fine when there’s only one job submitted but a pain when there’s two or more. In the config.edn file it’s a case of changing the job scheduler from greedy to balanced. The balanced scheduler will keep the remaining available peers and allocate them to the other jobs as they come on.

:onyx.peer/job-scheduler :onyx.job-scheduler/greedy


:onyx.peer/job-scheduler :onyx.job-scheduler/balanced



The Tattered Tail of City of Derry to Dublin Flights (LDY > DUB)


Routes are 90% marketing and the small matter of staff, an aeroplane and permission to do the route. And while my suspicion that the Derry to Dublin route was going to get shelved goes way back before a Brexit vote, Brexit is the very reason used as to why it’s not going ahead. I’ve yet to see one shred of evidence to support that.

Was It Ever Going To Happen?

Well there was a subsidy application in late 2015. Citywing operating with aircraft from Van Air Europe, Citywing don’t own aircraft they are classed as a “virtual airline” where they take the bookings and Van Air Europe hold the routes, the air operator’s certificate and provide the planes, crew, maintenance and insurance.

So the real negotiation has nothing to do with Citywing and everything to do with Van Air Europe. Digging around for that info though is hard work.

Yes, it was in the works to happen though I feel the press might have jumped on it a wee bit too soon. Nothing at that point was set in stone.

City of Derry Needs Good News Stories

I’ve posted before about City of Derry Airport and what I think needed to happen to revive it’s place in Northern Ireland as a viable route and destination. And the shelving of the original route to Dublin in 2011 (operated by British Airways but provided by Logan Air) did not help matters. It stopped for the reason many routes stopped, the subsidy ran out.

It’s no surprise CoD pushed the PR boat out. The headlines about the route starting in April 2016 filtered to the press and everyone in Co. Londonderry smiled. Thing is these things are never straightforward.

While there is evidence of the subsidy application happening I’ve yet to see any firm response or confirmation of a subsidy. Chances are it gets hard to find that info as there’s due diligence and legals that you’d not be putting online.

Bad News, Blame It On Brexit

It’s now in fashion, there’ll be a four page Vogue special feature soon, to blame things not happening on Brexit. For CoD to blame Brexit on this, well I don’t buy it. Things started to seep out slowing in March…


All that time though Citywing were pushing the LDY > DUB route on their website, you just couldn’t book anything. So there was probably another reason. There were three potential reasons I believe, first of all Citywing realised the route wasn’t commercially viable, especially if the subsidy wasn’t going to happen. So they spent a lot of time mulling over on what to do. At this point, we still don’t know if the subsidy was offered or not.

Secondly, when your airline has four aircraft then route planning becomes a logistical operation. It’s just not Northern Ireland, it’s the Isle of Man, it’s Cardiff and all the other route combinations. It’s a fun maths problem, but that’s not what I’m going to do now.

Thirdly and most importantly, you need pilots to operate plans….. Van Air were on the look out as late as May.



While Van Air don’t say the routes the pilot would be operating, Van Air also fly out of Belfast City Airport, there’s that good chance it was to fulfil new routes.

City of Derry cited “technical problems” in the Mark Patterson Facebook thread.


Technical problems are usually down to the wings falling off the plane or you can’t find someone to fly it. I’m going for the latter.

Until I see some actual evidence of Brexit being a problem for this route then I shall remain skeptical about the reasons.

Using onyx-template to craft a Kafka Streaming application. Part 2. #clojure #onyx #kafka #streaming #data


The Story So Far….. And Beyond

In Part 1 I covered that basic setting up of the Onyx platform configuration, starting Zookeeper and deploying the peer and a basic job. In this post I’m going to plug in the Kafka components into the code base and setup a three broker cluster on a local machine. Then we’ll kick everything off and send some messages for processing.

Code Amendments To Our Application.

Before we dive into code a little bit of planning needs to be thought about. First of all how many peers we need to run. At present my workflow is pretty simple:

:in :inc :out

A core.async channel for the input, a function to increment the value passed in and then an output channel (again via core.async). I’m going to modify this so we have a Kafka topic being read (though the :in channel), I’ll amend the math.clj code for ease of understanding so it just shows to deserialised message as a Clojure map. The output channel I’ll leave as is.

My Kafka cluster has three brokers and one partition. If a broker dies then another will be elected the lead and the data stream doesn’t suffer. Onyx operates on a one peer per partition, if you allocate too many peers against it then it will throw errors so I’m making that clear now.

One peer per partition.

Let’s make the code changes.

Changing the Project Dependencies

First of all let’s add the Kafka plugin to the project.clj file.

[org.onyxplatform/onyx-kafka-0.8 ""]
[cheshire "5.5.0"]

There are separate plugins for Kafka 0.8 and 0.9. As my cluster is then I’m going for the 0.8 plugin.

Changing the Basic Job Code

Second let’s make changes to the basic job. Job one is to add the dependencies to my :require line.

[onyx.plugin.kafka :as kafka]
 [onyx.tasks.kafka :as kafka-task]

Next in the basic-job function I need to change the base job configuration. The input channel. :in, is now a onyx.tasks.kafka/consumer and will read the topic stream that we give it.

(-> base-job
 (add-task (kafka-task/consumer :in kafka-opts))
 (add-task (math/process-kafka :inc batch-settings))
 (add-task (core-async-task/output :out batch-settings)))

I need to beef up the settings for the Kafka channel so I’m adding kafka-opts to the let statement.

{:onyx/name :in
 :onyx/plugin :onyx.plugin.kafka/read-messages
 :onyx/type :input
 :onyx/medium :kafka
 :kafka/topic "my-message-stream"
 :kafka/group-id "onyx-consumer"
 :kafka/fetch-size 307200
 :kafka/chan-capacity 1000
 :kafka/zookeeper ""
 :kafka/offset-reset :smallest
 :kafka/force-reset? true
 :kafka/empty-read-back-off 500
 :kafka/commit-interval 500
 :kafka/deserializer-fn :testapp.shared/deserialize-message-json
 :kafka/wrap-with-metadata? false
 :onyx/min-peers 1
 :onyx/max-peers 1
 :onyx/batch-size 100
 :onyx/doc "Reads messages from a Kafka topic"}

I’m passing in the topic, consumer group and peer size information. I also need to add a deserialiser function to convert the message stream to a map for me.

Adding a Deserialiser

(I’m English so I’m using “s” instead of “z”….)🙂

I’ve borrowed this quick Clojure code from one of Mastodon C’s repos, it works and I’m happy with it so I reuse it for safety.

(ns testapp.shared
 (:require [taoensso.timbre :as timbre]
 [cheshire.core :as json]))

(defn deserialize-message-json [bytes]
 (let [as-string (String. bytes "UTF-8")]
 (json/parse-string as-string true)
 (catch Exception e
 {:parse_error e :original as-string}))))

(defn serialize-message-json [segment]
 (.getBytes (json/generate-string segment)))

(def logger (agent nil))

(defn log-batch [event lifecycle]
 (let [task-name (:onyx/name (:onyx.core/task-map event))]
 (doseq [m (map :message (mapcat :leaves (:tree (:onyx.core/results event))))]
 (send logger (fn [_] (timbre/debug task-name " segment: " m)))))

(def log-calls
 {:lifecycle/after-batch log-batch})

So for every message stream that Onyx reads it will pass through the deserialiser (as we configured in the :in workflow. All that’s left to do is change the original math.clj file to print the map to the console.

Amending The Process Function

Let’s keep this really simple. It just prints the deserialised map to the console.

(ns testapp.tasks.math
 (:require [schema.core :as s]))

(defn get-data [fn-data]
 (println fn-data))

(s/defn process-kafka
 ([task-name :- s/Keyword task-opts]
 {:task {:task-map (merge {:onyx/name task-name
 :onyx/type :function
 :onyx/fn ::get-data}

I’m reusing the original code from the template, obviously in any other scenario you’d be tidying up naming as you went along.

So that’s the code taken care of. Added the Kafka dependency, changed the job spec around a bit, added a deserialiser for the messages and amended the processing function.

Use leiningen to clean and create an uberjar.

$ lein clean ; lein uberjar
Compiling testapp.core
Compiling testapp.core
Created target/testapp-0.1.0-SNAPSHOT.jar
Created target/peer.jar

All done. Now to setup the Kafka cluster.

Setting Up a Three Node Kafka Cluster

Kafka runs nicely as single node cluster but I want to use three brokers to give it some real world exposure. The config directory has a file called I’m going to copy that for the other two brokers I want.

$ cp
$ cp

There are three things to change in each of the new properties file.

Setting 1 2
port 9093 9094
log.dirs /tmp/kafka-logs1 /tmp/kafka-logs2

You can leave the original file alone, it will still use port 9092 as default.

Now For The Big Test

So this is the order we’re going to run in:

  • Start Zookeeper
  • Start Kafka Broker 0
  • Start Kafka Broker 1
  • Start Kafka Broker 2
  • Add a new topic
  • Start the Onyx Peers
  • Tail the onyx.log file
  • Submit the job.
  • Send some messages to the Kafka topic.

Assuming you’ve been using Zookeeper before it’s worth clearing out the structure as we’re only testing things.

$ rm -rf /tmp/zookeeper
$ rm -rf /tmp/kafka-logs*

Starting Zookeeper

With a clean sheet we can start Zookeeper:

$KAFKA_HOME/bin/ config/

Give it a second or two to start up and the open another terminal window.

Start Kafka Broker 0

The first Kafka broker will act as the leader.

$KAFKA_HOME/bin/ config/

[2016-08-03 18:03:59,792] INFO [Kafka Server 0], started (kafka.server.KafkaServer)
[2016-08-03 18:03:59,844] INFO New leader is 0 (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)

Start Kafka Broker 1

As a JMX_PORT is already set from broker 0 you will need to specify a new port number for Broker 1.

JMX_PORT=9997 $KAFKA_HOME/bin/ config/

[2016-08-03 18:04:29,616] INFO Registered broker 1 at path /brokers/ids/1 with address (kafka.utils.ZkUtils$)
[2016-08-03 18:04:29,631] INFO [Kafka Server 1], started (kafka.server.KafkaServer)

Start Kafka Broker 2

As Brokers 0 and 1 have their own JMX ports, one again you’ll need to specify a new one for broker 2.

JMX_PORT=9998 $KAFKA_HOME/bin/ config/

[2016-08-03 18:04:51,572] INFO Registered broker 2 at path /brokers/ids/2 with address (kafka.utils.ZkUtils$)
[2016-08-03 18:04:51,586] INFO [Kafka Server 2], started (kafka.server.KafkaServer)

Create a new topic

While Kafka can create new topics on the fly when messages are sent to them Onyx doesn’t always behave when the job is submitted but the topic isn’t there. So it’s safer to create the topic ahead of time.

$KAFKA_HOME/bin/ --create --zookeeper localhost:2181 --topic my-message-stream --replication-factor 3 --partitions 1

Created topic "my-message-stream".

Notice the partition count is 1, that needs to match the number of peers on the :in channel of the Kafka reader.

Start the Onyx Peers

We’ve already create the uberjar so it’s just a case of firing up the peers.

$ java -cp target/peer.jar testapp.core start-peers 3 -c resources/config.edn -p :default
Starting peer-group
Starting env
Starting peers
Attempting to connect to Zookeeper @
Started peers. Blocking forever.

Now create a new terminal window so we can tail the logs.

Tail the onyx.log File

A lot of the output is dumped in the onyx.log file so it’s worth tailing it for errors and info. If the peer count is incorrect then it’ll show up here.

$ tail -f onyx.log
16-Aug-03 18:08:25 mini.local INFO [onyx.log.zookeeper] - Starting ZooKeeper client connection. If Onyx hangs here it may indicate a difficulty connecting to ZooKeeper.
16-Aug-03 18:08:25 jmini.local INFO [onyx.static.logging-configuration] - Starting Logging Configuration
16-Aug-03 18:08:25 mini.local INFO [onyx.log.zookeeper] - Starting ZooKeeper client connection. If Onyx hangs here it may indicate a difficulty connecting to ZooKeeper.
16-Aug-03 18:08:27 mini.local INFO [onyx.peer.communicator] - Starting Log Writer
16-Aug-03 18:08:27 mini.local INFO [onyx.peer.communicator] - Starting Replica Subscription
16-Aug-03 18:08:27 mini.local INFO [onyx.static.logging-configuration] - Starting Logging Configuration
16-Aug-03 18:08:27 mini.local INFO [onyx.messaging.acking-daemon] - Starting Acking Daemon
16-Aug-03 18:08:27 mini.local INFO [onyx.messaging.aeron] - Starting Aeron Messenger
16-Aug-03 18:08:27 mini.local INFO [onyx.peer.virtual-peer] - Starting Virtual Peer ec2d6a86-d221-4e98-b9c4-e01e845486cc
16-Aug-03 18:08:27 mini.local INFO [onyx.messaging.acking-daemon] - Starting Acking Daemon
16-Aug-03 18:08:27 mini.local INFO [onyx.messaging.aeron] - Starting Aeron Messenger
16-Aug-03 18:08:27 mini.local INFO [onyx.peer.virtual-peer] - Starting Virtual Peer 4d1b1c70-8d1c-4229-843c-42ef9a16e432
16-Aug-03 18:08:27 mini.local INFO [onyx.messaging.acking-daemon] - Starting Acking Daemon
16-Aug-03 18:08:27 mini.local INFO [onyx.messaging.aeron] - Starting Aeron Messenger
16-Aug-03 18:08:27 mini.local INFO [onyx.peer.virtual-peer] - Starting Virtual Peer 44be7413-dbce-46b5-a104-f161473f5bd6

Submit The Job

Submit the job, remember the job name has now changed from basic-job to kafka-job.

$ java -cp target/peer.jar testapp.core submit-job "kafka-job" -c resources/config.edn -p :default
16-Aug-03 18:11:40 jasebellmacmini.local INFO [onyx.log.zookeeper] - Starting ZooKeeper client connection. If Onyx hangs here it may indicate a difficulty connecting to ZooKeeper.
16-Aug-03 18:11:41 jasebellmacmini.local INFO [onyx.log.zookeeper] - Stopping ZooKeeper client connection
Successfully submitted job: #uuid "c4997d0d-a75f-4d26-9819-41782a50fcae"
Blocking on job completion...

Once deployed look at the onyx.log file again. You should see the Kafka offset position being reported.

16-Aug-03 18:11:43 mini.local INFO [onyx.plugin.kafka] - Kafka consumer is starting at offset 0

We’re ready to send some messages.

Send Some Messages to the Kafka Topic

There are some useful scripts in the Kafka bin directory, one includes a shell for writing messages to a topic.

$KAFKA_HOME/bin/ --broker-list localhost:9092,localhost:9093,localhost:9094 --topic my-message-stream

The broker list is comprised of all our running brokers. When connected start pumping out some messages:

[2016-08-03 18:15:01,904] WARN Property topic is not valid (kafka.utils.VerifiableProperties)
{"firstname":"Jase", "lastname":"Bell", "github":""}
{"firstname":"Jase", "lastname":"Bell", "github":""}
{"firstname":"Jase", "lastname":"Bell", "github":""}
{"firstname":"Jase", "lastname":"Bell", "github":""}
{"firstname":"Jase", "lastname":"Bell", "github":""}
{"firstname":"Jase", "lastname":"Bell", "github":""}
{"firstname":"Jase", "lastname":"Bell", "github":""}
{"firstname":"Jase", "lastname":"Bell", "github":""}
{"firstname":"Jase", "lastname":"Bell", "github":""}

Now look at the terminal window where the peer is running, you should see the deserialised output.

Attempting to connect to Zookeeper @
Started peers. Blocking forever.
{:firstname Jase, :lastname Bell, :github}
{:firstname Jase, :lastname Bell, :github}
{:firstname Jase, :lastname Bell, :github}
{:firstname Jase, :lastname Bell, :github}
{:firstname Jase, :lastname Bell, :github}
{:firstname Jase, :lastname Bell, :github}
{:firstname Jase, :lastname Bell, :github}
{:firstname Jase, :lastname Bell, :github}
{:firstname Jase, :lastname Bell, :github}

Our work here is done.

I’m Going To Park That There….

Okay that was a long gig for anyone. We’ve covered integrating the Onyx Kafka plugin to a project, amending the code to allow it to consume Kafka messages from a topic and deserialise the to a Clojure map.

Next time, when I get to it. I’ll do something funky with the map and give it some real world use case.

A cup of tea is in order….






Using onyx-template to craft a Kafka Streaming application. Part 1. #clojure #onyx #kafka #streaming #data


Over the past couple of months I’ve been using the Onyx Platform with Mastodon C an awful lot. I’ve developed a bit of a love/hate relationship with, when it works it’s a beauty, when it doesn’t it’s the most frustrating thing ever. Like I said about my Kafka post yesterday, sometimes we plug all these bits in without really knowing what’s going on properly.

So this short series while I have some time off, this is hobby mode research not work btw, is going to attempt to show you how to build a Onyx Kafka application that consumes a Kafka stream and kicks out the messages to the console. Nothing flashy but more a “here’s the concepts”.

In part 1 I’m going to look at getting the Onyx template working and using the Onyx Dashboard. Part 2, and possibly beyond, I’ll then add the Kafka connection and consume from a topic.

Your Shopping List

We’ll look at the Onyx things via git so there’s nothing to download. You will need to download Kafka.

Important thing is that it’s 0.8.x as 0.9 isn’t supported yet that I’m aware of. Let’s get started.

Onyx Dashboard

Before we do anything do yourself a huge favour and download the Onyx Dashboard, it makes things an awful lot easier to see what’s going on with the cluster.

$ git clone
Cloning into 'onyx-dashboard'...
remote: Counting objects: 3815, done.
remote: Total 3815 (delta 0), reused 0 (delta 0), pack-reused 3814
Receiving objects: 100% (3815/3815), 3.37 MiB | 2.30 MiB/s, done.
Resolving deltas: 100% (2079/2079), done.
Checking connectivity... done.
$ cd onyx-dashboard/
$ lein uberjar
Successfully compiled "resources/public/js/app.js" in 76.2 seconds.
Created target/onyx-dashboard-
Created target/onyx-dashboard.jar

The dashboard will compile, build and be saved as a jar file in the target directory. You may get some ClojureScript warnings but don’t worry about them, it will still build okay. Don’t start anything just yet….

The Onyx Application

The easiest way to create a new Onyx application is to use the Onyx template. It gives you the main structure of the application and the deployment of the peer and the job (more on that later). You don’t have to download anything in advance just use lein to create it.

$ lein new onyx-app testapp -- +docker
Warning: refactor-nrepl requires org.clojure/clojure 1.7.0 or greater.
Warning: refactor-nrepl middleware won't be activated due to missing dependencies.
Generating fresh Onyx app.
Building a new onyx app with: ("+docker")

The template makes some assumptions, which you’d expect, so I’m going to edit the configuration file a bit to make it work under my conditions. From the template the config file (in the resources folder) looks like:

 {:onyx/tenancy-id #profile {:default "1"
 :docker #env ONYX_ID}
 :onyx.bookkeeper/server? true
 :onyx.bookkeeper/local-quorum? true
 :onyx.bookkeeper/delete-server-data? true
 :onyx.bookkeeper/local-quorum-ports [3196 3197 3198]
 :onyx.bookkeeper/port 3196
 :zookeeper/address #profile {:default ""
 :docker "zookeeper:2181"}
 :zookeeper/server? #profile {:default true
 :docker false}
 :zookeeper.server/port 2188
 :onyx.log/config #profile {:default nil
 :docker {:level :info}}}
 {:onyx/tenancy-id #profile {:default "1"
 :docker #env ONYX_ID}
 :zookeeper/address #profile {:default ""
 :docker "zookeeper:2181"}
 :onyx.peer/job-scheduler :onyx.job-scheduler/greedy
 :onyx.peer/zookeeper-timeout 60000
 :onyx.messaging/allow-short-circuit? #profile {:default false
 :docker true}
 :onyx.messaging/impl :aeron
 :onyx.messaging/bind-addr #or [#env BIND_ADDR "localhost"]
 :onyx.messaging/peer-port 40200
 :onyx.messaging.aeron/embedded-driver? #profile {:default true
 :docker false}
 :onyx.log/config #profile {:default nil
 :docker {:level :info}}}}

First of all I’m using my own Zookeeper instance (the one that comes with the Kafka download) so I’m not going to use the embedded one.

:zookeeper/server? #profile {:default true
 :docker false}

Now becomes:

:zookeeper/server? #profile {:default false
 :docker false}

And the port will need changing too in both the env-config and peer-config blocks..

:zookeeper/address #profile {:default ""
 :docker "zookeeper:2181"}

Now becomes:

:zookeeper/address #profile {:default ""
 :docker "zookeeper:2181"}

I have no need for a BookKeeper server to be running at this moment in time so that can be disabled too.

:onyx.bookkeeper/server? true

Now becomes

:onyx.bookkeeper/server? false

As far as I can see those are all the changes I need to do. Feel free to use the embedded Zookeeper but I prefer to start as I mean to go on and use the version that I’ll be using in the wild.

Next is to the build the basic template, I want to see how it’s going to behave and how to deploy it, which is life gets really interesting. Once again we’ll use Lein to build the application.

$ lein clean ; lein uberjar
Compiling testapp.core
Compiling testapp.core
Created target/testapp-0.1.0-SNAPSHOT.jar
Created target/peer.jar

All built okay. The next step is to start everything up.

Running the Application

Before I start here’s something to clear up, the relationship between peers and jobs as I feel the documentation doesn’t make it that clear. Jobs are submitted to Zookeeper, they don’t require any peers running at the time. When a peer is then started it will go to Zookeeper looking for jobs to run. And if it took my colleagues “a while for the penny to drop” well firstly there’s hope for me yet and secondly it may have been buried in the docs somewhere.

I’m not going to submit data to the job, I just want to get the job running and get a peer running too. Then I’ll inspect the dashboard to see what’s going on.

Starting Zookeeper

I’m going to my Kafka distribution and going to start the Zookeeper instance. On my Mac OSX developer environment I start it as root so there’s no faffing about with permissions. Yes I know Docker exists… that comes later.

bin/ config/

Starting a Peer

Open up a new terminal window and go to the application directory again. I’m going to start a single peer up.

$ java -cp target/peer.jar testapp.core start-peers 1 -c resources/config.edn
Starting peer-group
Starting env
Starting peers
Attempting to connect to Zookeeper @
Started peers. Blocking forever.

Once you see “Started peers. Blocking forever” we have a registered peer on the cluster.

Submitting the Job

Order doesn’t matter, originally I always thought it did. I can submit a job to Zookeeper for a running peer to pick it up later. Don’t forget that Zookeeper will persist the job when a peer is shutdown. So in theory you only need to submit the job once.

Saying all that: IF you have never started a peer before then there’s no Onyx job scheduler registered in Zookeeper and this will cause you pain as the job will have no scheduler to check against, it will time out.

$ java -cp target/peer.jar testapp.core submit-job "basic-job" -c resources/config.edn
16-Jul-31 09:45:50 jasebellmacmini.local INFO [onyx.log.zookeeper] - Starting ZooKeeper client connection. If Onyx hangs here it may indicate a difficulty connecting to ZooKeeper.
16-Jul-31 09:45:50 jasebellmacmini.local INFO [onyx.log.zookeeper] - Stopping ZooKeeper client connection
Successfully submitted job: #uuid "0201c86b-93ff-4470-9e58-7c2f8a78e080"
Blocking on job completion...
16-Jul-31 09:45:51 jasebellmacmini.local INFO [onyx.log.zookeeper] - Starting ZooKeeper client connection. If Onyx hangs here it may indicate a difficulty connecting to ZooKeeper.
16-Jul-31 09:45:51 jasebellmacmini.local INFO [onyx.log.zookeeper] - Starting ZooKeeper client connection. If Onyx hangs here it may indicate a difficulty connecting to ZooKeeper.

Once running it will sit there merrily waiting for the job to complete.

Starting the Dashboard

To confirm my assumptions I’m going to fire up the dashboard and have a look around. All I need to do is start up the jar file and pass in the Zookeeper address.

$ java -jar target/onyx-dashboard.jar localhost:2181
Starting Sente
Starting HTTP Server
Http-kit server is running at http://localhost:3000/

Once started we can open a browser and have a look at the dashboard.


From the drop down you will see a list of the running peer specifications. I’ve only got the one running. When selected it will load up the information about the peer and the added jobs.

Inspecting the Peer and Job

So I’ve a browser open and I’ve loaded up http://localhost:3000


The raw cluster activity shows us everything that’s happening. Click on the left hand side of the log entry will show you more information. The jobs are listed on the left hand side of the page. You can see there’s only one submitted job, if you click on it will show you more detail.


So we have a running job, you can also see the workflow of the job with it’s input, inc and output flows.

In Part 2

I’ll take this basic job and make it consume a Kafka topic and show the contents of the message load. I may even make it do something useful.






Getting clj-kafka consumer example to work. #clojure #kafka #streaming #data

As a data engineer and a software developer I’m spending a lot of time working out things over a number of different technologies. It might be Spark, Sparkling and S3 one day and Cassandra, Clojure and ElasticSearch the next.

Historically I’ve been a huge advocate of RabbitMQ, I still am, but more and more I’m using Kafka. The original reason for not wanting to use Kafka wasn’t Kafka, it was Zookeeper, something else to monitor. Ultimately you may find yourself knowing 90% of one thing well and not so much about the underlying tools and services. Change that.

Make Data Streams Your Friend

Get used to them, they don’t have to real time, in fact any startup that says they’re “doing real time as it’s gotta be realtime”, well 99 times out of 100 that’s usually not the case. So give them the standard Andrew Bolster gif of the century….


I’d say that 99% of my work is working with high volume streams (and some low volume ones too).

Anyways, I digress.

Okay So I Was Wrong About Kafka

It’s glorious, really flipping glorious. While I can’t go into detail about how I’m using it, blimey it’s glorious. I did wake up the morning still feeling I’d not got enough background so I cracked open the note book and went hunting for information…. seven pages of notes later I’m in a much happier place.


In Clojure we use clj-kafka.

The clj-kafka wrapper is for Kafka 0.8.x (don’t attempt to use 0.9.x it will not work).

There is however one wee little error in the introduction documentation. The consumer call doesn’t work it merely returns an empty sequence. To get the consumer working you need a doall in you with-resource call.

 kafka-testing.core> (def config {"zookeeper.connect" "localhost:2181"
 "" "clj-test" 
 "auto.offset.reset" "smallest"
 "auto.commit.enable" "false"})
;; => #'kafka-testing.core/config
kafka-testing.core> (use 'clj-kafka.consumer.zk)
kafka-testing.core> (use 'clj-kafka.core)
kafka-testing.core> (with-resource [c (consumer config)]
                        (doall (take 2 (messages c "test"))))
;; => ({:topic "test",
 :offset 0,
 :partition 0,
 :key nil,
 [116, 104, 105, 115, 32, 105, 115, 32, 109, 121, 32, 109, 101, 115, 115, 97, 103, 101, 32, 49]}
 {:topic "test",
 :offset 1,
 :partition 0,
 :key nil,
 [116, 104, 105, 115, 32, 105, 115, 32, 109, 121, 32, 109, 101, 115, 115, 97, 103, 101, 32, 50]})

There a small note of help if anyone gets stuck with it. Next time, Kafka with the Onyx platform.

The NiTechrank Retrospective. #opendata #nijobs #clojure #opensource #data

The @nitechrank was a simple index to report the changes in programming jobs in Northern Ireland. It wasn’t anything scientific and for all my hailing of the automating of all things possible, meaningless and trivial…. well it was me editing the tweets every morning, with a fresh cup of tea, half asleep and in my dressing gown 99% of the time*. Today I ran the last index as the picture became clear….



The Good News

In linear regression terms the jobs outlook is positive. It slopes upwards over time, not to say that there aren’t down days, the start of June was a bit of a surprise but also was the peak which happened on the day after the referendum. So with a low of 42 jobs and a high of 92 jobs yeah it got a little varied but nothing troubling.

Keep in mind though, the only data source, seems to list larger companies so you can assume with a certain amount of confidence that startups won’t list there and look for developers word of mouth. One of the reasons that Java may favour so highly in the results as a whole.

For Developers

If you are interested in Clojure then the repository for the code is available on my Github account. Take from it what you will, it’s slung together to get the numbers out. Note there’s no support, the README should be clear enough to get things working. I won’t be doing any updates on this repo, it’s just there for anyone one who wants to look. I also included the original shell script that did the ranking originally but only for historical purposes.

For Data Folk

Everyday the NiTechrank updated a CSV file so in the data directory is the CSV data of all the indexes that were run. Have fun with it, I’ve not done any analysis on it.


Yeah, it was a giggle.

*I was dressed for the other 1%

From Onyx Template to Working App in Thirty Seconds #onyx #clojure

UPDATE: You can ignore this post now. I’ve done a FULL WALKTHROUGH!

It’s early days for the Onyx Platform and I’m using it in a number of projects as it plugs in with Kafka well. Though getting it up and running can be a bit confusing. So as an aide memoir for me more than anything, here’s the thirty second version.

I’m assuming you have leiningen installed. No tutorial or walkthrough, maybes later when I’ve got some more time and can present something a little more worthy but in the meantime here’s the basics.

With regards to Zookeeper Onyx is using it’s own in-memory version so there’s no need to spin one up for this quick overview.

Create The Application with Onyx Template

$ lein new onyx-app funky-avocado
Generating fresh Onyx app.
Building a new onyx app with:

Create The Uberjar

$ cd funky-avocado/
funky-avocado $ lein uberjar
Compiling funky-avocado.core
Compiling funky-avocado.core
Created /work/funky-avocado/target/funky-avocado-0.1.0-SNAPSHOT.jar
Created /work/funky-avocado/target/peer.jar

Running The Application

$ java -cp target/peer.jar funky_avocado.core start-peers 10 -c resources/config.edn
Starting peer-group
Starting env
Starting peers
Attempting to connect to Zookeeper @
Started peers. Blocking forever.

Do TripAdvisor ratings mean anything over the long term? #ratings #data

Averages, lovely things. Ratings, lovely things. If it helps the consumer then I’m all for it. Do I trust Trip Advisor ratings, well on the whole yes and then there are some gotcha’s.


Averages Can Distort the Actual Truth

Distorted averages are hardly new, if you want to learn more the Skillswise website has a good explanation. Want to earn an average of £63,000, as discussed in a recent #factbait? Chances are you might, or you might not, who knows. At the end of the day it’s just an average.

They are something to think about in reviews especially when an event may change the way a company operates.

So I’ll use the real world example of Burger Club, I’d read good things, nah, great things about this place. And anyone who knows me I like a good burger. So a visit was in order. Okay the experience didn’t go as planned….

TripAdvisor Averages

So I’m assuming that the rating for a location on Trip Advisor is calculated as an average. Take all the community ratings and then work out the average. That’s all well and good if the averages. So Burger Club was a 4/5 rating which I’m happy with but you have to delve into the detail a little closer. The Recency of ratings will tell an awful lot.

In this case a change of management was leading to a string of one and two star reviews. With it being a recent change it meant the stronger ratings were still giving a higher rating average. Nothing wrong with that if it’s a blip. The warning signs are in the actual reviews which I wasn’t reading until I was waiting for my food (not making that mistake again).

How To Fix It?

There’s a simple fix, give an overall rating (the average) and then ratings for the last six months and the last month. Then it’s down to the consumer to decide whether to go ahead eating/drinking/staying there or not. If I’d taken the time to look at the recent ratings before hand then things might have been different. That’s my fault and no one else’s, and certainly not Trip Advisor’s.

Recency is everything though especially when there’s a radical change in the structure of a place. Different people lead to different outcomes.

I hope the Burger Club just read their reviews (and not the rogue five star review from the chef’s mate) and do great things in Portstewart again.