The Onyx Chronicles: onyx-template gotchas…. #onyx #clojure #data #kafka

In the previous posts on Onyx (you can read part 1 and part 2 on how to set up a Kafka streaming application with Onyx), there are a couple of things worth noting that I didn’t mention in the original articles.

Blocking Jobs

When you submit your job you are submitting it to Zookeeper; when the peers are running they see what jobs are available to execute. The template code’s -main method does a good job of separating these actions out.

This line:

(onyx.test-helper/feedback-exception! peer-config job-id)

is looking out for exception messages on the job. If all is well it will block any form of exiting and wait for someone to kill it. Removing the line will effectively submit the job to Zookeeper, exit out and that’s that. When the peer is running it will then pick up the job. In our Kafka job the peer will pick up the job and then do its Kafka input work. Kill the peer and start it again and the job should pick up from the last offset.
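
To make that concrete, here’s a minimal sketch of what the submit flow in -main is doing (the names come from the template as I remember it, so treat it as an illustration rather than the template’s verbatim code):

(let [{:keys [job-id]} (onyx.api/submit-job peer-config job)]
  ;; this blocks until the job completes or an exception is fed back;
  ;; remove it and the process exits straight after submitting to Zookeeper
  (onyx.test-helper/feedback-exception! peer-config job-id)
  (println "Job" job-id "finished"))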

The Greedy and the Balanced

The default template works on a greedy job scheduler, meaning that if you run ten peers and the job demands three, then the remaining seven will be blocked from use.

Fine when there’s only one job submitted but a pain when there are two or more. In the config.edn file it’s a case of changing the job scheduler from greedy to balanced. The balanced scheduler will keep the remaining peers available and allocate them to other jobs as they come in.

:onyx.peer/job-scheduler :onyx.job-scheduler/greedy

becomes…

:onyx.peer/job-scheduler :onyx.job-scheduler/balanced

The Tattered Tail of City of Derry to Dublin Flights (LDY > DUB)


Routes are 90% marketing, plus the small matter of staff, an aeroplane and permission to fly the route. And while my suspicion that the Derry to Dublin route was going to get shelved goes way back before the Brexit vote, Brexit is the very reason being given as to why it’s not going ahead. I’ve yet to see one shred of evidence to support that.

Was It Ever Going To Happen?

Well, there was a subsidy application in late 2015. Citywing operate with aircraft from Van Air Europe; Citywing don’t own aircraft, they are classed as a “virtual airline”: they take the bookings while Van Air Europe hold the routes and the air operator’s certificate and provide the planes, crew, maintenance and insurance.

So the real negotiation has nothing to do with Citywing and everything to do with Van Air Europe. Digging around for that info though is hard work.

Yes, it was in the works to happen though I feel the press might have jumped on it a wee bit too soon. Nothing at that point was set in stone.

City of Derry Needs Good News Stories

I’ve posted before about City of Derry Airport and what I think needed to happen to revive its place in Northern Ireland as a viable route and destination. And the shelving of the original route to Dublin in 2011 (operated by British Airways but provided by Logan Air) did not help matters. It stopped for the reason many routes stop: the subsidy ran out.

It’s no surprise CoD pushed the PR boat out. The headlines about the route starting in April 2016 filtered to the press and everyone in Co. Londonderry smiled. Thing is, these things are never straightforward.

While there is evidence of the subsidy application happening, I’ve yet to see any firm response or confirmation of a subsidy. Chances are that info is hard to find, as there’s due diligence and legals that you’d not be putting online.

Bad News, Blame It On Brexit

It’s now in fashion, there’ll be a four-page Vogue special feature soon, to blame things not happening on Brexit. For CoD to blame Brexit for this, well, I don’t buy it. Things started to seep out slowly in March…

[Screenshot: Mark Patterson Facebook thread]

All that time, though, Citywing were pushing the LDY > DUB route on their website; you just couldn’t book anything. So there was probably another reason. There were three potential reasons, I believe. First of all, Citywing realised the route wasn’t commercially viable, especially if the subsidy wasn’t going to happen, so they spent a lot of time mulling over what to do. At this point, we still don’t know if the subsidy was offered or not.

Secondly, when your airline has four aircraft then route planning becomes a logistical operation. It’s not just Northern Ireland; it’s the Isle of Man, it’s Cardiff and all the other route combinations. It’s a fun maths problem, but that’s not what I’m going to do now.

Thirdly, and most importantly, you need pilots to operate planes… Van Air were on the lookout as late as May.

[Screenshot: Van Air pilot recruitment ad]

While Van Air don’t say which routes the pilot would be operating (Van Air also fly out of Belfast City Airport), there’s a good chance it was to fulfil new routes.

City of Derry cited “technical problems” in the Mark Patterson Facebook thread.

[Screenshot: City of Derry Airport’s response]

Technical problems usually come down to the wings falling off the plane or not being able to find someone to fly it. I’m going for the latter.

Until I see some actual evidence of Brexit being a problem for this route, I shall remain sceptical about the reasons.

Using onyx-template to craft a Kafka Streaming application. Part 2. #clojure #onyx #kafka #streaming #data


The Story So Far….. And Beyond

In Part 1 I covered the basic setting up of the Onyx platform configuration, starting Zookeeper and deploying the peer and a basic job. In this post I’m going to plug the Kafka components into the code base and set up a three-broker cluster on a local machine. Then we’ll kick everything off and send some messages for processing.

Code Amendments To Our Application

Before we dive into code, a little bit of planning is needed. First of all, how many peers do we need to run? At present my workflow is pretty simple:

:in :inc :out

A core.async channel for the input, a function to increment the value passed in and then an output channel (again via core.async). I’m going to modify this so we have a Kafka topic being read (through the :in channel), and I’ll amend the math.clj code for ease of understanding so it just shows the deserialised message as a Clojure map. The output channel I’ll leave as is.
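
For reference, the workflow that wires those three tasks together is just a vector of edges, which in Onyx looks like this:

[[:in :inc]
 [:inc :out]]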

My Kafka cluster has three brokers and one partition. If a broker dies then another will be elected the leader and the data stream doesn’t suffer. Onyx operates on one peer per partition; if you allocate too many peers against it then it will throw errors, so I’m making that clear now.

One peer per partition.

Let’s make the code changes.

Changing the Project Dependencies

First of all let’s add the Kafka plugin to the project.clj file.

[org.onyxplatform/onyx-kafka-0.8 "0.9.9.1-SNAPSHOT"]
[cheshire "5.5.0"]

There are separate plugins for Kafka 0.8 and 0.9. As my cluster is 0.8.2.2, I’m going for the 0.8 plugin.

Changing the Basic Job Code

Second, let’s make changes to the basic job. Job one is to add the dependencies to my :require line.

[onyx.plugin.kafka :as kafka]
[onyx.tasks.kafka :as kafka-task]

Next, in the basic-job function I need to change the base job configuration. The input channel, :in, is now an onyx.tasks.kafka/consumer and will read the topic stream that we give it.

(-> base-job
    (add-task (kafka-task/consumer :in kafka-opts))
    (add-task (math/process-kafka :inc batch-settings))
    (add-task (core-async-task/output :out batch-settings)))

I need to beef up the settings for the Kafka channel so I’m adding kafka-opts to the let statement.

{:onyx/name :in
 :onyx/plugin :onyx.plugin.kafka/read-messages
 :onyx/type :input
 :onyx/medium :kafka
 :kafka/topic "my-message-stream"
 :kafka/group-id "onyx-consumer"
 :kafka/fetch-size 307200
 :kafka/chan-capacity 1000
 :kafka/zookeeper "127.0.0.1:2181"
 :kafka/offset-reset :smallest
 :kafka/force-reset? true
 :kafka/empty-read-back-off 500
 :kafka/commit-interval 500
 :kafka/deserializer-fn :testapp.shared/deserialize-message-json
 :kafka/wrap-with-metadata? false
 :onyx/min-peers 1
 :onyx/max-peers 1
 :onyx/batch-size 100
 :onyx/doc "Reads messages from a Kafka topic"}

I’m passing in the topic, consumer group and peer size information. I also need to add a deserialiser function to convert the message stream to a map for me.

Adding a Deserialiser

(I’m English so I’m using “s” instead of “z”…) :)

I’ve borrowed this quick Clojure code from one of Mastodon C’s repos; it works and I’m happy with it, so I reuse it for safety.

(ns testapp.shared
  (:require [taoensso.timbre :as timbre]
            [cheshire.core :as json]))

(defn deserialize-message-json [bytes]
  (let [as-string (String. bytes "UTF-8")]
    (try
      (json/parse-string as-string true)
      (catch Exception e
        {:parse_error e :original as-string}))))

(defn serialize-message-json [segment]
  (.getBytes (json/generate-string segment)))

(def logger (agent nil))

(defn log-batch [event lifecycle]
  (let [task-name (:onyx/name (:onyx.core/task-map event))]
    (doseq [m (map :message (mapcat :leaves (:tree (:onyx.core/results event))))]
      (send logger (fn [_] (timbre/debug task-name " segment: " m)))))
  {})

(def log-calls
  {:lifecycle/after-batch log-batch})
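
A quick REPL sanity check of the round trip (a hypothetical session, but the behaviour follows from cheshire’s generate-string and parse-string):

user> (-> {:firstname "Jase"}
          serialize-message-json
          deserialize-message-json)
;; => {:firstname "Jase"}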

So every message that Onyx reads will pass through the deserialiser (as we configured on the :in task). All that’s left to do is change the original math.clj file to print the map to the console.

Amending The Process Function

Let’s keep this really simple: it just prints the deserialised map to the console and passes the segment along.

(ns testapp.tasks.math
  (:require [schema.core :as s]))

(defn get-data [fn-data]
  (println fn-data)
  ;; return the segment so downstream tasks still receive it
  ;; (println on its own would return nil)
  fn-data)

(s/defn process-kafka
  ([task-name :- s/Keyword task-opts]
   {:task {:task-map (merge {:onyx/name task-name
                             :onyx/type :function
                             :onyx/fn ::get-data}
                            task-opts)}}))

I’m reusing the original code from the template; obviously in any other scenario you’d be tidying up naming as you went along.

So that’s the code taken care of. Added the Kafka dependency, changed the job spec around a bit, added a deserialiser for the messages and amended the processing function.

Use leiningen to clean and create an uberjar.

$ lein clean ; lein uberjar
Compiling lib-onyx.media-driver
Compiling testapp.core
Compiling lib-onyx.media-driver
Compiling testapp.core
Created target/testapp-0.1.0-SNAPSHOT.jar
Created target/peer.jar

All done. Now to setup the Kafka cluster.

Setting Up a Three Node Kafka Cluster

Kafka runs nicely as a single-node cluster but I want to use three brokers to give it some real-world exposure. The config directory has a file called server.properties. I’m going to copy that for the other two brokers I want.

$ cp server.properties server1.properties
$ cp server.properties server2.properties

There are three things to change in each of the new properties files.

Setting     server1.properties    server2.properties
broker.id   1                     2
port        9093                  9094
log.dirs    /tmp/kafka-logs1      /tmp/kafka-logs2

You can leave the original server.properties file alone; it will still use port 9092 by default.
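
For example, the edited lines in server1.properties end up looking like this (server2.properties follows the same pattern with broker.id 2, port 9094 and kafka-logs2):

broker.id=1
port=9093
log.dirs=/tmp/kafka-logs1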

Now For The Big Test

So this is the order we’re going to run in:

  • Start Zookeeper
  • Start Kafka Broker 0
  • Start Kafka Broker 1
  • Start Kafka Broker 2
  • Add a new topic
  • Start the Onyx Peers
  • Tail the onyx.log file
  • Submit the job
  • Send some messages to the Kafka topic

Assuming you’ve been using Zookeeper before, it’s worth clearing out the data directories as we’re only testing things.

$ rm -rf /tmp/zookeeper
$ rm -rf /tmp/kafka-logs*

Starting Zookeeper

With a clean sheet we can start Zookeeper:

$KAFKA_HOME/bin/zookeeper-server-start.sh config/zookeeper.properties

Give it a second or two to start up and then open another terminal window.

Start Kafka Broker 0

The first Kafka broker will act as the leader.

$KAFKA_HOME/bin/kafka-server-start.sh config/server.properties

[2016-08-03 18:03:59,792] INFO [Kafka Server 0], started (kafka.server.KafkaServer)
[2016-08-03 18:03:59,844] INFO New leader is 0 (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)

Start Kafka Broker 1

As the JMX_PORT is already taken by broker 0, you will need to specify a new port number for broker 1.

JMX_PORT=9997 $KAFKA_HOME/bin/kafka-server-start.sh config/server1.properties

[2016-08-03 18:04:29,616] INFO Registered broker 1 at path /brokers/ids/1 with address 192.168.1.91:9093. (kafka.utils.ZkUtils$)
[2016-08-03 18:04:29,631] INFO [Kafka Server 1], started (kafka.server.KafkaServer)

Start Kafka Broker 2

As brokers 0 and 1 have their own JMX ports, once again you’ll need to specify a new one for broker 2.

JMX_PORT=9998 $KAFKA_HOME/bin/kafka-server-start.sh config/server2.properties

[2016-08-03 18:04:51,572] INFO Registered broker 2 at path /brokers/ids/2 with address 192.168.1.91:9094. (kafka.utils.ZkUtils$)
[2016-08-03 18:04:51,586] INFO [Kafka Server 2], started (kafka.server.KafkaServer)

Create a new topic

While Kafka can create new topics on the fly when messages are sent to them, Onyx doesn’t always behave when the job is submitted before the topic exists. So it’s safer to create the topic ahead of time.

$KAFKA_HOME/bin/kafka-topics.sh --create --zookeeper localhost:2181 --topic my-message-stream --replication-factor 3 --partitions 1

Created topic "my-message-stream".

Notice the partition count is 1; that needs to match the number of peers on the :in channel of the Kafka reader.
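
If you want to double-check the topic afterwards, the same script will describe it, showing the partition count and which broker is the leader:

$KAFKA_HOME/bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic my-message-stream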

Start the Onyx Peers

We’ve already created the uberjar so it’s just a case of firing up the peers.

$ java -cp target/peer.jar testapp.core start-peers 3 -c resources/config.edn -p :default
Starting peer-group
Starting env
Starting peers
Attempting to connect to Zookeeper @ 127.0.0.1:2181
Started peers. Blocking forever.

Now create a new terminal window so we can tail the logs.

Tail the onyx.log File

A lot of the output is dumped in the onyx.log file so it’s worth tailing it for errors and info. If the peer count is incorrect then it’ll show up here.

$ tail -f onyx.log
16-Aug-03 18:08:25 mini.local INFO [onyx.log.zookeeper] - Starting ZooKeeper client connection. If Onyx hangs here it may indicate a difficulty connecting to ZooKeeper.
16-Aug-03 18:08:25 jmini.local INFO [onyx.static.logging-configuration] - Starting Logging Configuration
16-Aug-03 18:08:25 mini.local INFO [onyx.log.zookeeper] - Starting ZooKeeper client connection. If Onyx hangs here it may indicate a difficulty connecting to ZooKeeper.
16-Aug-03 18:08:27 mini.local INFO [onyx.peer.communicator] - Starting Log Writer
16-Aug-03 18:08:27 mini.local INFO [onyx.peer.communicator] - Starting Replica Subscription
16-Aug-03 18:08:27 mini.local INFO [onyx.static.logging-configuration] - Starting Logging Configuration
16-Aug-03 18:08:27 mini.local INFO [onyx.messaging.acking-daemon] - Starting Acking Daemon
16-Aug-03 18:08:27 mini.local INFO [onyx.messaging.aeron] - Starting Aeron Messenger
16-Aug-03 18:08:27 mini.local INFO [onyx.peer.virtual-peer] - Starting Virtual Peer ec2d6a86-d221-4e98-b9c4-e01e845486cc
16-Aug-03 18:08:27 mini.local INFO [onyx.messaging.acking-daemon] - Starting Acking Daemon
16-Aug-03 18:08:27 mini.local INFO [onyx.messaging.aeron] - Starting Aeron Messenger
16-Aug-03 18:08:27 mini.local INFO [onyx.peer.virtual-peer] - Starting Virtual Peer 4d1b1c70-8d1c-4229-843c-42ef9a16e432
16-Aug-03 18:08:27 mini.local INFO [onyx.messaging.acking-daemon] - Starting Acking Daemon
16-Aug-03 18:08:27 mini.local INFO [onyx.messaging.aeron] - Starting Aeron Messenger
16-Aug-03 18:08:27 mini.local INFO [onyx.peer.virtual-peer] - Starting Virtual Peer 44be7413-dbce-46b5-a104-f161473f5bd6

Submit The Job

Submit the job, remembering the job name has now changed from basic-job to kafka-job.

$ java -cp target/peer.jar testapp.core submit-job "kafka-job" -c resources/config.edn -p :default
16-Aug-03 18:11:40 jasebellmacmini.local INFO [onyx.log.zookeeper] - Starting ZooKeeper client connection. If Onyx hangs here it may indicate a difficulty connecting to ZooKeeper.
16-Aug-03 18:11:41 jasebellmacmini.local INFO [onyx.log.zookeeper] - Stopping ZooKeeper client connection
Successfully submitted job: #uuid "c4997d0d-a75f-4d26-9819-41782a50fcae"
Blocking on job completion...

Once deployed, look at the onyx.log file again. You should see the Kafka offset position being reported.

16-Aug-03 18:11:43 mini.local INFO [onyx.plugin.kafka] - Kafka consumer is starting at offset 0

We’re ready to send some messages.

Send Some Messages to the Kafka Topic

There are some useful scripts in the Kafka bin directory; one of them is a console shell for writing messages to a topic.

$KAFKA_HOME/bin/kafka-console-producer.sh --broker-list localhost:9092,localhost:9093,localhost:9094 --topic my-message-stream

The broker list comprises all our running brokers. When connected, start pumping out some messages:

[2016-08-03 18:15:01,904] WARN Property topic is not valid (kafka.utils.VerifiableProperties)
{"firstname":"Jase", "lastname":"Bell", "github":"https://github.com/jasebell"}
{"firstname":"Jase", "lastname":"Bell", "github":"https://github.com/jasebell"}
{"firstname":"Jase", "lastname":"Bell", "github":"https://github.com/jasebell"}
{"firstname":"Jase", "lastname":"Bell", "github":"https://github.com/jasebell"}
{"firstname":"Jase", "lastname":"Bell", "github":"https://github.com/jasebell"}
{"firstname":"Jase", "lastname":"Bell", "github":"https://github.com/jasebell"}
{"firstname":"Jase", "lastname":"Bell", "github":"https://github.com/jasebell"}
{"firstname":"Jase", "lastname":"Bell", "github":"https://github.com/jasebell"}
{"firstname":"Jase", "lastname":"Bell", "github":"https://github.com/jasebell"}

Now look at the terminal window where the peer is running; you should see the deserialised output.

Attempting to connect to Zookeeper @ 127.0.0.1:2181
Started peers. Blocking forever.
{:firstname Jase, :lastname Bell, :github https://github.com/jasebell}
{:firstname Jase, :lastname Bell, :github https://github.com/jasebell}
{:firstname Jase, :lastname Bell, :github https://github.com/jasebell}
{:firstname Jase, :lastname Bell, :github https://github.com/jasebell}
{:firstname Jase, :lastname Bell, :github https://github.com/jasebell}
{:firstname Jase, :lastname Bell, :github https://github.com/jasebell}
{:firstname Jase, :lastname Bell, :github https://github.com/jasebell}
{:firstname Jase, :lastname Bell, :github https://github.com/jasebell}
{:firstname Jase, :lastname Bell, :github https://github.com/jasebell}

Our work here is done.

I’m Going To Park That There….

Okay, that was a long gig for anyone. We’ve covered integrating the Onyx Kafka plugin into a project, amending the code to allow it to consume Kafka messages from a topic and deserialise them to a Clojure map.

Next time, when I get to it, I’ll do something funky with the map and give it a real-world use case.

A cup of tea is in order….


Using onyx-template to craft a Kafka Streaming application. Part 1. #clojure #onyx #kafka #streaming #data


Over the past couple of months I’ve been using the Onyx Platform with Mastodon C an awful lot. I’ve developed a bit of a love/hate relationship with it: when it works it’s a beauty, when it doesn’t it’s the most frustrating thing ever. Like I said in my Kafka post yesterday, sometimes we plug all these bits in without really knowing what’s going on properly.

So this short series, written while I have some time off (this is hobby-mode research, not work, btw), is going to attempt to show you how to build an Onyx Kafka application that consumes a Kafka stream and kicks the messages out to the console. Nothing flashy, more a “here’s the concepts”.

In part 1 I’m going to look at getting the Onyx template working and using the Onyx Dashboard. Part 2, and possibly beyond, I’ll then add the Kafka connection and consume from a topic.

Your Shopping List

We’ll look at the Onyx things via git so there’s nothing to download. You will need to download Kafka.

The important thing is that it’s 0.8.x, as 0.9 isn’t supported yet as far as I’m aware. Let’s get started.

Onyx Dashboard

Before we do anything, do yourself a huge favour and download the Onyx Dashboard; it makes it an awful lot easier to see what’s going on with the cluster.

$ git clone git@github.com:onyx-platform/onyx-dashboard.git
Cloning into 'onyx-dashboard'...
remote: Counting objects: 3815, done.
remote: Total 3815 (delta 0), reused 0 (delta 0), pack-reused 3814
Receiving objects: 100% (3815/3815), 3.37 MiB | 2.30 MiB/s, done.
Resolving deltas: 100% (2079/2079), done.
Checking connectivity... done.
$ cd onyx-dashboard/
$ lein uberjar
Successfully compiled "resources/public/js/app.js" in 76.2 seconds.
Created target/onyx-dashboard-0.9.9.0.jar
Created target/onyx-dashboard.jar

The dashboard will compile, build and be saved as a jar file in the target directory. You may get some ClojureScript warnings but don’t worry about them; it will still build okay. Don’t start anything just yet…

The Onyx Application

The easiest way to create a new Onyx application is to use the Onyx template. It gives you the main structure of the application and the deployment of the peer and the job (more on that later). You don’t have to download anything in advance; just use lein to create it.

$ lein new onyx-app testapp -- +docker
Warning: refactor-nrepl requires org.clojure/clojure 1.7.0 or greater.
Warning: refactor-nrepl middleware won't be activated due to missing dependencies.
Generating fresh Onyx app.
Building a new onyx app with: ("+docker")

The template makes some assumptions, which you’d expect, so I’m going to edit the configuration file a bit to make it work under my conditions. From the template the config file (in the resources folder) looks like:

{:env-config
 {:onyx/tenancy-id #profile {:default "1"
 :docker #env ONYX_ID}
 :onyx.bookkeeper/server? true
 :onyx.bookkeeper/local-quorum? true
 :onyx.bookkeeper/delete-server-data? true
 :onyx.bookkeeper/local-quorum-ports [3196 3197 3198]
 :onyx.bookkeeper/port 3196
 :zookeeper/address #profile {:default "127.0.0.1:2188"
 :docker "zookeeper:2181"}
 :zookeeper/server? #profile {:default true
 :docker false}
 :zookeeper.server/port 2188
 :onyx.log/config #profile {:default nil
 :docker {:level :info}}}
 :peer-config
 {:onyx/tenancy-id #profile {:default "1"
 :docker #env ONYX_ID}
 :zookeeper/address #profile {:default "127.0.0.1:2188"
 :docker "zookeeper:2181"}
 :onyx.peer/job-scheduler :onyx.job-scheduler/greedy
 :onyx.peer/zookeeper-timeout 60000
 :onyx.messaging/allow-short-circuit? #profile {:default false
 :docker true}
 :onyx.messaging/impl :aeron
 :onyx.messaging/bind-addr #or [#env BIND_ADDR "localhost"]
 :onyx.messaging/peer-port 40200
 :onyx.messaging.aeron/embedded-driver? #profile {:default true
 :docker false}
 :onyx.log/config #profile {:default nil
 :docker {:level :info}}}}

First of all I’m using my own Zookeeper instance (the one that comes with the Kafka download) so I’m not going to use the embedded one.

:zookeeper/server? #profile {:default true
 :docker false}

Now becomes:

:zookeeper/server? #profile {:default false
 :docker false}

And the port will need changing too, in both the env-config and peer-config blocks.

:zookeeper/address #profile {:default "127.0.0.1:2188"
 :docker "zookeeper:2181"}

Now becomes:

:zookeeper/address #profile {:default "127.0.0.1:2181"
 :docker "zookeeper:2181"}

I have no need for a BookKeeper server to be running at this moment in time so that can be disabled too.

:onyx.bookkeeper/server? true

Now becomes:

:onyx.bookkeeper/server? false

As far as I can see those are all the changes I need to make. Feel free to use the embedded Zookeeper, but I prefer to start as I mean to go on and use the version that I’ll be using in the wild.

Next is to build the basic template; I want to see how it’s going to behave and how to deploy it, which is where life gets really interesting. Once again we’ll use lein to build the application.

$ lein clean ; lein uberjar
Compiling lib-onyx.media-driver
Compiling testapp.core
Compiling lib-onyx.media-driver
Compiling testapp.core
Created target/testapp-0.1.0-SNAPSHOT.jar
Created target/peer.jar

All built okay. The next step is to start everything up.

Running the Application

Before I start, here’s something to clear up: the relationship between peers and jobs, as I feel the documentation doesn’t make it that clear. Jobs are submitted to Zookeeper; they don’t require any peers running at the time. When a peer is then started it will go to Zookeeper looking for jobs to run. And if it took my colleagues “a while for the penny to drop”, well, firstly there’s hope for me yet and secondly it may have been buried in the docs somewhere.

I’m not going to submit data to the job, I just want to get the job running and get a peer running too. Then I’ll inspect the dashboard to see what’s going on.

Starting Zookeeper

I’m going to my Kafka distribution to start the Zookeeper instance. On my Mac OS X developer environment I start it as root so there’s no faffing about with permissions. Yes, I know Docker exists… that comes later.

bin/zookeeper-server-start.sh config/zookeeper.properties

Starting a Peer

Open up a new terminal window and go to the application directory again. I’m going to start a single peer up.

$ java -cp target/peer.jar testapp.core start-peers 1 -c resources/config.edn
Starting peer-group
Starting env
Starting peers
Attempting to connect to Zookeeper @ 127.0.0.1:2181
Started peers. Blocking forever.

Once you see “Started peers. Blocking forever” we have a registered peer on the cluster.

Submitting the Job

Order doesn’t matter; originally I always thought it did. I can submit a job to Zookeeper for a running peer to pick up later. Don’t forget that Zookeeper will persist the job when a peer is shut down. So in theory you only need to submit the job once.

Saying all that: IF you have never started a peer before then there’s no Onyx job scheduler registered in Zookeeper, and this will cause you pain as the job will have no scheduler to check against and will time out.

$ java -cp target/peer.jar testapp.core submit-job "basic-job" -c resources/config.edn
16-Jul-31 09:45:50 jasebellmacmini.local INFO [onyx.log.zookeeper] - Starting ZooKeeper client connection. If Onyx hangs here it may indicate a difficulty connecting to ZooKeeper.
16-Jul-31 09:45:50 jasebellmacmini.local INFO [onyx.log.zookeeper] - Stopping ZooKeeper client connection
Successfully submitted job: #uuid "0201c86b-93ff-4470-9e58-7c2f8a78e080"
Blocking on job completion...
16-Jul-31 09:45:51 jasebellmacmini.local INFO [onyx.log.zookeeper] - Starting ZooKeeper client connection. If Onyx hangs here it may indicate a difficulty connecting to ZooKeeper.
16-Jul-31 09:45:51 jasebellmacmini.local INFO [onyx.log.zookeeper] - Starting ZooKeeper client connection. If Onyx hangs here it may indicate a difficulty connecting to ZooKeeper.

Once running it will sit there merrily waiting for the job to complete.

Starting the Dashboard

To confirm my assumptions I’m going to fire up the dashboard and have a look around. All I need to do is start up the jar file and pass in the Zookeeper address.

$ java -jar target/onyx-dashboard.jar localhost:2181
Starting Sente
Starting HTTP Server
Http-kit server is running at http://localhost:3000/

Once started we can open a browser and have a look at the dashboard.

[Screenshot: the Onyx Dashboard deployment drop-down]

From the drop-down you will see a list of the running peer specifications. I’ve only got the one running. When selected, it will load up the information about the peer and the added jobs.

Inspecting the Peer and Job

So I’ve a browser open and I’ve loaded up http://localhost:3000

[Screenshot: Onyx Dashboard raw cluster activity]

The raw cluster activity shows us everything that’s happening. Clicking on the left-hand side of a log entry will show you more information. The jobs are listed on the left-hand side of the page. You can see there’s only one submitted job; clicking on it will show you more detail.

[Screenshot: Onyx Dashboard job detail]

So we have a running job, and you can also see the workflow of the job with its input, inc and output flows.

In Part 2

I’ll take this basic job and make it consume a Kafka topic and show the contents of the message load. I may even make it do something useful.

Getting clj-kafka consumer example to work. #clojure #kafka #streaming #data

As a data engineer and a software developer I’m spending a lot of time working out things over a number of different technologies. It might be Spark, Sparkling and S3 one day and Cassandra, Clojure and ElasticSearch the next.

Historically I’ve been a huge advocate of RabbitMQ, I still am, but more and more I’m using Kafka. The original reason for not wanting to use Kafka wasn’t Kafka, it was Zookeeper, something else to monitor. Ultimately you may find yourself knowing 90% of one thing well and not so much about the underlying tools and services. Change that.

Make Data Streams Your Friend

Get used to them; they don’t have to be real time. In fact, any startup that says they’re “doing real time as it’s gotta be realtime”, well, 99 times out of 100 that’s not the case. So give them the standard Andrew Bolster gif of the century…


I’d say that 99% of my work is working with high volume streams (and some low volume ones too).

Anyways, I digress.

Okay So I Was Wrong About Kafka

It’s glorious, really flipping glorious. While I can’t go into detail about how I’m using it, blimey it’s glorious. I did wake up one morning still feeling I’d not got enough background, so I cracked open the notebook and went hunting for information… seven pages of notes later I’m in a much happier place.

[Photo: seven pages of Kafka notes]

In Clojure we use clj-kafka.

The clj-kafka wrapper is for Kafka 0.8.x (don’t attempt to use it with 0.9.x; it will not work).

There is, however, one wee error in the introductory documentation. The consumer call doesn’t work as written; it merely returns an empty sequence, because the message sequence is lazy and the consumer is shut down before it’s realised. To get the consumer working you need a doall in your with-resource call.

kafka-testing.core> (def config {"zookeeper.connect" "localhost:2181"
                                 "group.id" "clj-test"
                                 "auto.offset.reset" "smallest"
                                 "auto.commit.enable" "false"})
;; => #'kafka-testing.core/config
kafka-testing.core> (use 'clj-kafka.consumer.zk)
kafka-testing.core> (use 'clj-kafka.core)
kafka-testing.core> (with-resource [c (consumer config)]
                      shutdown
                      (doall (take 2 (messages c "test"))))
;; => ({:topic "test",
;;      :offset 0,
;;      :partition 0,
;;      :key nil,
;;      :value [116, 104, 105, 115, 32, 105, 115, 32, 109, 121, 32, 109, 101, 115, 115, 97, 103, 101, 32, 49]}
;;     {:topic "test",
;;      :offset 1,
;;      :partition 0,
;;      :key nil,
;;      :value [116, 104, 105, 115, 32, 105, 115, 32, 109, 121, 32, 109, 101, 115, 115, 97, 103, 101, 32, 50]})
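
The :value comes back as raw bytes; for a quick look in the REPL you can turn them back into a string (fine for plain UTF-8/ASCII payloads like these):

user> (apply str (map char [116 104 105 115 32 105 115 32 109 121 32 109 101 115 115 97 103 101 32 49]))
;; => "this is my message 1"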

There’s a small note of help in case anyone gets stuck with it. Next time, Kafka with the Onyx platform.

The NiTechrank Retrospective. #opendata #nijobs #clojure #opensource #data

The @nitechrank was a simple index to report the changes in programming jobs in Northern Ireland. It wasn’t anything scientific, and for all my hailing of the automating of all things possible, meaningless and trivial… well, it was me editing the tweets every morning, with a fresh cup of tea, half asleep and in my dressing gown 99% of the time*. Today I ran the last index as the picture became clear…

[Chart: the final NiTechrank index run]

The Good News

In linear regression terms the jobs outlook is positive: it slopes upwards over time. Not to say that there weren’t down days, the start of June was a bit of a surprise, as was the peak, which happened the day after the referendum. So with a low of 42 jobs and a high of 92 jobs, yeah, it got a little varied but nothing troubling.

Keep in mind, though, that the only data source, NiJobs.com, seems to list larger companies, so you can assume with a certain amount of confidence that startups won’t list there and will look for developers by word of mouth. That’s one of the reasons Java may feature so highly in the results as a whole.

For Developers

If you are interested in Clojure then the repository for the code is available on my Github account. Take from it what you will; it’s slung together to get the numbers out. Note there’s no support, but the README should be clear enough to get things working. I won’t be doing any updates on this repo, it’s just there for anyone who wants to look. I’ve also included the original shell script that did the ranking, but only for historical purposes.

For Data Folk

Every day the NiTechrank updated a CSV file, so in the data directory is the CSV data of all the indexes that were run. Have fun with it; I’ve not done any analysis on it.
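
If you want a starting point, here’s a minimal sketch for pulling the data into a REPL using the clojure.data.csv library (the filename is hypothetical; check the data directory for the real one):

(require '[clojure.data.csv :as csv]
         '[clojure.java.io :as io])

;; read every row of the index data into a realised seq of rows
(with-open [r (io/reader "data/nitechrank.csv")]
  (doall (csv/read-csv r)))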

Conclusions

Yeah, it was a giggle.

*I was dressed for the other 1%

From Onyx Template to Working App in Thirty Seconds #onyx #clojure

UPDATE: You can ignore this post now. I’ve done a FULL WALKTHROUGH!

It’s early days for the Onyx Platform and I’m using it in a number of projects as it plugs in with Kafka well. Though getting it up and running can be a bit confusing. So as an aide-mémoire for me more than anything, here’s the thirty-second version.

I’m assuming you have leiningen installed. No tutorial or walkthrough, maybe later when I’ve got some more time and can present something a little more worthy, but in the meantime here are the basics.

With regards to Zookeeper, Onyx uses its own in-memory version, so there’s no need to spin one up for this quick overview.

Create The Application with Onyx Template

$ lein new onyx-app funky-avocado
Generating fresh Onyx app.
Building a new onyx app with:

Create The Uberjar

$ cd funky-avocado/
funky-avocado $ lein uberjar
Compiling lib-onyx.media-driver
Compiling funky-avocado.core
Compiling lib-onyx.media-driver
Compiling funky-avocado.core
Created /work/funky-avocado/target/funky-avocado-0.1.0-SNAPSHOT.jar
Created /work/funky-avocado/target/peer.jar

Running The Application

$ java -cp target/peer.jar funky_avocado.core start-peers 10 -c resources/config.edn
Starting peer-group
Starting env
Starting peers
Attempting to connect to Zookeeper @ 127.0.0.1:2188
Started peers. Blocking forever.

Do TripAdvisor ratings mean anything over the long term? #ratings #data

Averages, lovely things. Ratings, lovely things. If it helps the consumer then I’m all for it. Do I trust TripAdvisor ratings? Well, on the whole yes, and then there are some gotchas.


Averages Can Distort the Actual Truth

Distorted averages are hardly new; if you want to learn more, the Skillswise website has a good explanation. Want to earn an average of £63,000, as discussed in a recent #factbait? Chances are you might, or you might not, who knows. At the end of the day it’s just an average.

They are something to think about in reviews especially when an event may change the way a company operates.

So I’ll use the real-world example of Burger Club. I’d read good things, nah, great things about this place. And anyone who knows me knows I like a good burger. So a visit was in order. Okay, the experience didn’t go as planned…

TripAdvisor Averages

So I’m assuming that the rating for a location on TripAdvisor is calculated as an average: take all the community ratings and then work out the average. That’s all well and good on the surface. Burger Club had a 4/5 rating, which I’m happy with, but you have to delve into the detail a little closer. The recency of ratings will tell you an awful lot.

In this case a change of management was leading to a string of one- and two-star reviews. With it being a recent change, the stronger ratings were still giving a higher average. Nothing wrong with that if it’s a blip. The warning signs were in the actual reviews, which I wasn’t reading until I was waiting for my food (not making that mistake again).

How To Fix It?

There’s a simple fix: give an overall rating (the average) and then separate ratings for the last six months and the last month. Then it’s down to the consumer to decide whether to go ahead eating/drinking/staying there or not. If I’d taken the time to look at the recent ratings beforehand then things might have been different. That’s my fault and no one else’s, and certainly not TripAdvisor’s.
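
As a sketch of what that fix might look like in Clojure (the data shape is hypothetical: each review as a map of :stars and :days-ago):

;; overall average plus windowed averages over recent reviews
(defn avg [xs]
  (when (seq xs)
    (/ (reduce + xs) (double (count xs)))))

(defn rating-summary [reviews]
  (let [stars-within (fn [days]
                       (map :stars (filter #(<= (:days-ago %) days) reviews)))]
    {:overall       (avg (map :stars reviews))
     :last-6-months (avg (stars-within 182))
     :last-month    (avg (stars-within 30))}))

;; old glowing reviews vs a recent slump:
;; (rating-summary [{:stars 5 :days-ago 400} {:stars 5 :days-ago 300}
;;                  {:stars 1 :days-ago 20} {:stars 2 :days-ago 5}])
;; => {:overall 3.25, :last-6-months 1.5, :last-month 1.5}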

Recency is everything though especially when there’s a radical change in the structure of a place. Different people lead to different outcomes.

I hope the Burger Club just read their reviews (and not the rogue five star review from the chef’s mate) and do great things in Portstewart again.

Airline Seat Auctions… What to bid? Part 1. #travel #airlines #auctions #clojure


Airlines have perfected their yield algorithms over the years to determine flight prices and the break-even point. The basic rule of thumb is the closer you get to the travel date, the more expensive your flight will be.

Going, Going, Gone…. again.

So it looks like seat auctions are on their way back: bidding for a better seat once you’ve already bought one. The Economist ran a report yesterday on the increasing use of auctions on premium seats, a rather nifty way of upselling to the traveller once they’ve already parted with the cash. The airline, and whether you need to be enrolled on one of their programmes, will determine what your options are.

Interestingly Richard Kerr, aka The Points Guy, offered a handy calculation as a rough guide to what to bid. So I thought I’d give it a whirl. Previously I’ve done auction systems in the airline industry, but that was for the entire aircraft, not a single seat. Fun, fun, fun indeed.

A Working Example

Okay, suppose I want to fly from London Heathrow to Dubai (LHR -> DXB). What I first need is the basic economy price and then the price for the premium economy.

A quick look on cheapflights.co.uk gives me the following:

Economy LHR->DXB: £352

Premium Economy LHR->DXB: £968

So the question is, what to bid? The Points Guy offers a simple equation that gives a sensible guide price.

Bid offer = (premium seat price - economy seat price paid) * percentage

The percentage can be anything you want but the guide set by The Points Guy is between 20% and 40%. Implementing this in Clojure is easy enough; it can be done in one line.

user> (defn calc-bid [premecon econ pc] (* pc (- premecon econ)))

So using the prices I have for the LHR->DXB flight, I’m going to take the difference plus 20% and see how that looks.

user> (calc-bid 968 352 1.2)
;; => 739.1999999999999

A suggested bid price of £739.20, a potential saving of £228.80 on the premium economy class price. That’s promising, but it does not take into account factors such as how many other people are bidding, scarcity and so on.
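
As an aside, the 20-40% guide can also be read as a straight fraction of the fare difference rather than the difference plus a premium, in which case the same function gives a much lower number:

user> (calc-bid 968 352 0.2)
;; => roughly 123.2, i.e. a £123.20 bid under that reading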

There’s nothing to stop you bidding the absolute minimum of £1 for example, depending on the number of bidders (which is highly controlled when you think about it). Perhaps you were flying on a Boeing 777, which can average a capacity of 382 seats depending on configuration; with an 80% load factor that’s 306 people who could bid, but only a small percentage, 8% or so, would actually want to (not everyone is competitive or has the cash, for example). So theoretically 24 people actually bid…

Then you’re into bid psychology and proper statistics; this post really needs a part 2…

When everyone is saying #no doesn’t mean you should stop. #startups #bigdata #hadoop

“I want to be the new Clubcard in town…”

In 2009/2010 I was working on customer loyalty technology with a view to doing something on the phone. So uVoucher was born, with probably too much fanfare and talking. The value for me wasn’t the iPhone app for the retailer; it was the data and the learning that could come off the back of it. Data to me was always the cornerstone of business decisions.


At that time the word “Hadoop” kept cropping up around processing large volumes of data with commodity hardware. The idea of making one large computer out of lots of small ones to do the processing appealed, especially where volumes of retail data are concerned.

Waterskiing On The Data Lake

The more I looked at Hadoop the more I saw huge potential, but I also saw a huge gap between the technology and real-world users. Configuration was a big pain to do well, and as a technology it was way above the reach of the, what I would call, common user.

That spawned Cloudatics, the human version of Hadoop: say where your data is, choose from the drop-down list of things you wanted to do to your data, start the job and then wait for the output. Simple… it seemed obvious to me that was the way things were going: towards data platforms.

I made one fundamental error, I feel: I listened to other people’s opinions too much. At that time I liked to gauge the opinion of people that I trusted. Some folk got it and some folk really didn’t get it, and when I say “didn’t get it” I mean at all… “who’d use that!?”.

Another “learn from that for future reference” moment was when I entered a pitch competition. Yet again, one person got it, “yeah, data mining for everyone”, and the other slammed it right in my face: “no business would use this”. I left the room with the pair of them arguing. Northern Ireland wasn’t ready for big data or Hadoop… so pitching it was a bad idea. I’ve never pitched since and never will; build, ship and sell is the only way.

So fast forward five years and below is a picture of the Expo hall on Thursday 2nd/Friday 3rd June 2016 at Strata Conference in London. Pretty much every stand down there is a data platform (with the exception of O’Reilly) or a company that is highly integrated with a data platform.

[Photo: the Expo hall at Strata Conference, London, June 2016]

Have a Hunch and the Data to Back It Up?

Then go for it, and I’m not saying don’t listen to anyone. Yes, sometimes you have to be bloody-minded and forge ahead to see what the challenges are (in fact, some would say I made a career out of it), but sometimes it is wise just to get that sounding-board feedback.

Be careful where and who you pitch to. While every competition is happy for you to stand and do your three minutes, not every judge is going to get it or support what you are saying. And I don’t go for the whole “gut” thing either.

I’m certainly not bitter or unhappy; quite the opposite, I love where I work. I love the big data community once you get past the marketing “it’s just a gimmick” naysayers.

Ultimately it’s about the right message, at the right time at the right place.
