[#Kafka Diaries] Topic Level Settings You Can’t Ignore Part 1. – #Data #Streaming

For the majority of users the defaults are there and they kind of work, your messages are small and there’s enough volume on the box to be able to relax. If you are working on local development then there’s a good chance you don’t even consider such things, once things go live though then it’s a different matter.

Message Retention

There are two methods that are available to you for setting retention of messages on Kafka, firstly by time a message is in the log and then by log size.

log.retention.hours, log.retention.minutes, log.retention.ms

Yes there are three but they all do the same thing. How long messages are retained in the log by time. The default is 168 hours (which is seven days). You can use either hours, minutes or milliseconds as they all set the same thing. If more than one setting is present then the lowest unit size is used.


You can retain messages expressed as a the total number of bytes of the messages in the log. The retention bytes is set and applies to per partition so a topic with three partitions and a log.retention.bytes of 1GB is 3GB bytes retained at the very most. If you increased the partition count by one on the topic for example then the retention bytes will then increase to 4GB.

The two types of log retention, size and time, can be used together. If both are set then messages are removed when the either of the settings are satisfied. If you have a retention time of 1 day and 2GB retention in size then the log rules will be applied if you have over 2GB before the one day period is up.


Producers are limited to the size of messages they can produce. The default is 1mb in size, if a producer is sent a message over that then it will not be accepted. The setting refers to the compressed size of the message, so the message itself can be over the set size unpressed.

While you can set Kafka to use larger message sizes this does have performance impact across the network and I/O throughput. So it’s worth sitting down with a pen and paper (or a spreadsheet) to gauge the average message sizes and adjust the settings accordingly.


Realtime & Streaming Workflows and The Rule of 72 #data #streaming #kafka #kinesis


In Investment, the Rule of 72

The Rule of 72 is a simple calculation used in accountancy and investment. Simply, if you divide the interest rate by 72 you get the number of periods your investment will double. I have £100 and an interest rate of 10% per year then it will take 7.2 years (72/10 = 7.2) for my money to double.

In Realtime and Streaming Applications

Workflow throughput is everything. How consumers perform and behave will have a knock on effect to the number of messages you can process. So please allow me to present, The Streaming Rule of 72.

In realtime and streaming applications, the rule of 72, the rule of 70 and the rule of 69.3 are methods for estimating the volume of messages able to be processed doubling time. The rule number (e.g., 72) is divided by the percentage gain  per period to obtain the approximate number of periods (usually seconds) required for doubling.

For example if 100 messages are flowing through the system per second and by changing the workflow timeouts, increase a container shared memory volume or extend a heartbeat time out for example, then measure throughput again and we process 130 messages a second. That’s a 30% increase ((130-100)/100) so with the streaming rule of 72….. we will double the message volume every 2,4 seconds.


Monitoring Consumer Offsets in #Kafka.

With a bunch of applications acting as consumers to a Kafka stream it appears to be a Google dark art to find any decent information of what’s going where and doing what. The big question is, where is my application up to in the topic log?

After hours of try, test, rinse, repeat, tea, pulling hair, more tea, Stackoverflow (we all do, get over it) and yet more tea, this dear digital plumber was looking like this….


….but in the male form, less angry and not using tables, just more tea.

The /consumer node in Zookeeper is a bit of a red herring, your application consumer group ids don’t show up there but ones from the Kafka shell console do. This makes running the ConsumerGroupCommand class a bit of a dead end.

Consumer Offsets Hidden in Plain Sight

It does exist though! It’s right there, looking at you…

$ bin/zookeeper-shell.sh localhost:2181
Connecting to localhost:2181
Welcome to ZooKeeper!
JLine support is disabled


WatchedEvent state:SyncConnected type:None path:null
ls /brokers/topics
[__consumer_offsets, topic-input]

..just not plainly obvious.

Most consumers are basically while loops collecting a number of records and using the poll() method to update the consumer offset of any records not dealt with, basically saying “I’m up to here boss! I didn’t read these though”. It also acts as the initial link with the Kafka Group Coordinator to register a new consumer group. Those consumer groups do not show up where you expect them too.

Finding Out Where the Offset Is

At this point Zookeeper isn’t much help to me, using the Zookeeper shell doesn’t give me much to go on.

get /brokers/topics/__consumer_offsets/partitions/44/state
cZxid = 0x4e
ctime = Sat Mar 04 07:41:28 GMT 2017
mZxid = 0x4e
mtime = Sat Mar 04 07:41:28 GMT 2017
pZxid = 0x4e
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 72
numChildren = 0

So I need something else…. and help is at hand. It just takes a little jiggery pokery.

Using The Kafka Consumer Console to Read Offset Information

We can use Kafka’s console tools to read the __consumer_offsets. First thing to do is create a config file in a temporary directory.

$ echo "exclude.internal.topics=false" > /tmp/consumer.config

Then we can start the console.

$ bin/kafka-console-consumer.sh --consumer.config /tmp/consumer.config --formatter "kafka.coordinator.GroupMetadataManager\$OffsetsMessageFormatter" --zookeeper localhost:2181 --topic __consumer_offsets --from-beginning

Any consumer applications you have running should show up in the offset log. In this example I have two applications running from the same topic (topic-input) on one partition. So I can see from here that my-stream-processing-application is up to offset 315 in the topic while my-other-processing-application is further ahead at 504. That could potentially tell us there’s an issue with the first application as it appears to be way behind in the topic.

[my-stream-processing-application,topic-input,0]::[OffsetMetadata[189,NO_METADATA],CommitTime 1488613422905,ExpirationTime 1488699822905]
[my-stream-processing-application,topic-input,0]::[OffsetMetadata[252,NO_METADATA],CommitTime 1488613901498,ExpirationTime 1488700301498]
[my-stream-processing-application,topic-input,0]::[OffsetMetadata[315,NO_METADATA],CommitTime 1488614472422,ExpirationTime 1488700872422]
[my-other-processing-application,topic-input,0]::[OffsetMetadata[378,NO_METADATA],CommitTime 1488614576300,ExpirationTime 1488700976300]
[my-other-processing-application,topic-input,0]::[OffsetMetadata[441,NO_METADATA],CommitTime 1488614606314,ExpirationTime 1488701006314]
[my-other-processing-application,topic-input,0]::[OffsetMetadata[504,NO_METADATA],CommitTime 1488615237410,ExpirationTime 1488701637410]

The frustration at this point is that it’s hard to know the total number of records in the log, we know where the offset is up to but not the total and what the lag is.

The search continues…..




My first foray in to #novel #writing with #AI. – #Tensorflow #AI

TL;DR – Quick Summary

For all interested writers, authors and creative writing types…. I think it’s fairly safe to assume you’re safe for the time being.

Have a good day.

Can AI Write Me a Book?

First of all, this isn’t really about code it’s just about process. So there are no juicy code snippets or scripts to get all hot under the collar about. This whole (stupid) episode started out with a couple of questions.

  1. Could AI write a novel or a novella of a quality that it could be entered into a writing competition?
  2. Is it possible to make 50 Shades of Grey readable?

So before my usual working day I downloaded some recurrent neural network code, installed Tensor Flow and trained it on Shakespeare and left the laptop alone to do it’s thing. Yup, training takes a long time, in my case it was eight hours on a commodity Toshiba C70D laptop running Ubuntu. If you want to read more about RNN’s then there’s an excellent explanation here. Generating samples of text generated from the RNN is a doddle….. takes seconds.

That Flipping Book.

So, rinse and repeat with some different text. How about 161,528 words of that book. Now, I have a confession I’ve never read that book, or in fact novels, for some reason my brain is firmly planted in non-fiction. Now I’m wondering if I can get AI to write me an O’Reilly book…. wonder what animal I’d get?

Another eight hours pass, overnight this time.

So how quick is it to generate 500 words of AI driven wordery? No time at all it seems…..

In fact with some simple bash scripting I can write an 11,500 Novella in under two minutes, on a cheap laptop. I shall be rich after all!

Not So Fast….

While the output was, well okay, it needs A LOT OF WORK to make it actually work on a human readable level. The main reason is in the training, if you look at that book it trundles in at 900k long in text file length, for training that’s way too small. In the samples the AI would get stuck in a look at repeat the same phrases over and over. Sometimes it would actually add to the paragraph, most times it repeated so often it didn’t make sense.

“He looks so remorseful, and in the same color as the crowd arrives and in my apartment. The thought is crippling. But and I don’t want to go to me that I want to be beholden to you. I don’t want to be beholden to you. I don’t want to be beholden to you. I don’t want to be beholden to you. I don’t want to be beholden to you. I don’t want to be beholden to you. I don’t want to be beholden to you. I don’t want to be beholden to you. I don’t want to be beholden to you. I don’t want to be beholden to you. I don’t want to be beholden to you. I don’t want to be beholden to you. I don’t want to be beholden to you. I don’t want to be beholden to you. I don’t want to be beholden to you. I don’t want to be beholden to you. I don’t want to be beholden to you. I don’t want to be beholden to you. I don’t want to be beholden to you. I don’t want to be beholden to you. I don’t want to be beholden to you. I don’t want to be beholden to you. I don’t want to be beholden to you. I don’t want to be beholden to him — and I can tell him about 17 miles a deal.”

The only thing I can think of that came close was Adrian Belew’s repetative shouting of “I repeat myself when under stress, I repeat myself when under stress…..” (Warning: Youtube link contains King Crimson and a Chapman Stick).

Regardless, part of me thinks that’s naff, part of me thinks that’s rather cool, AI did that. So in theory and with a little cleaning up it’s possible to craft something.

What About Topic and Flow?

This is the thing with creative text, it has characters, themes and a story flow. What I’ve done so far doesn’t address any of that and that’s where everything falls flat on it’s bum for AI. Without some hefty topic wrangling it’s going to be difficult to craft something that’s actually going to flow and make sense.

My favourite book on text mining by a country mile is not one that has tons of code in it, it’s The Bestseller Code by Jodie Archer and Matthew Jockers. It’s a good attempt, while by their admission could be improved, investigation using multivariate analysis, NLP and other text mining tools to see if there were patterns in the best seller list.


Topic is important, that goes without saying. My AI version has no plot line whatsoever as it plainly isn’t told of such matters, if you want to know more about plot lines then there are the seven basic plot lines that are widely used. Baking those plot lines to an AI will take work.

The more text you generate the worse it’s going to be to get a basic plot going. A way to get the AI to focus on generating certain aspects of the story over a timeline would be beneficial but hard to do. Once again though, nothing is impossible.

An Industry Of Authorship, Automated?

The future automation of all things literature I think is a long way off. Though, let’s look at this from a 30,000ft view. I can generate an eleven thousand word book, while ropey, showing some promise if only needing an editor to sort the wording out.

API’s that exist now, well I could pick of for words to form a title. “rules are a hostile anthem” came out…. one Google Image Search for creative commons photos….


And automatically pick an image that fits a certain AI criteria. No text… pass that on to an overlay of the title of the book and the author’s name, “Alan Inglis” (geddit A.I.) and package that up in a Kindle format (that can be automated too) and off it goes to an Amazon account.

*I did check on Alan Inglis, there are no authors of that name but, rather ironically, one within clinical neurosciences….. I should write an algorithm to create a surname that doesn’t exist really. I just guessed this one out.

Perhaps Not Fiction Then….

Perhaps not, but with texts that tend to take the same form it could be easy to create fairly accurate drafts which require some form of editorial gaze afterwards. News reports, Invest NI jobs created press releases, term sheets and even business plans. Yup I think there’s sufficient scope for something to happen. I don’t think you’ll replace them human element but then again you don’t really want to.

Back, To Answer My Original Question

So, could my AI submit something to writing competition? Perhaps it could if it were less than 10,000 words. With enough corpus text it would be possible to do something of a quality that could be considered readable. Would a judge notice it as AI writing, who knows. There are some bits with the samples I’ve generated that are quite interesting, it looks like there’s heart in the prose but it’s simply not true.

I think the Alan Inglis’s of this world are safe for the time being….. I suppose I should go and read The Circle by Dave Eggers to see what my future holds.





The 30 Second Bayesian Classifier #machinelearning #bayes #classification

I’m putting this up as I got a nice email from a reader who was having trouble with running the Britney example. And as developers know, bad examples are enough to put people off…. actually they’re toxic.


See what I did there…


The Classifier4J library is old so it’s not on any Maven repository I’m aware of. So we have to go old school and go old fashion download jar file. You can find the Classifier4J library at http://classifier4j.sourceforge.net/

If you don’t have the code for the book you can download it from https://github.com/jasebell/mlbook


Open a terminal window and go to the example code for the book. In chapter2 is the Britney code. Keep a note of where you’ve downloaded the Classifier4J jar file as you’ll need this in the Java compile command.

$ javac -cp /path/to/Classifier4J-0.6.jar BritneyDilemma.java


There should be a .class file in your directory now. Running the Java class is a simple matter. Notice we’re referencing the package and class we want to execute.

$ java -cp .:..:/path/to/Classifier4J-0.6.jar chapter2.BritneyDilemma
brittany spears = 0.7071067811865475
brittney spears = 0.7071067811865475
britany spears = 0.7071067811865475
britny spears = 0.7071067811865475
briteny spears = 0.7071067811865475
britteny spears = 0.7071067811865475
briney spears = 0.7071067811865475
brittny spears = 0.7071067811865475
brintey spears = 0.7071067811865475
britanny spears = 0.7071067811865475
britiny spears = 0.7071067811865475
britnet spears = 0.7071067811865475
britiney spears = 0.7071067811865475
christina aguilera = 0.0
britney spears = 0.9999999999999998

About the Book


You can find out more about the book on the Wiley website. As well as machine learning practical examples it also has sections on Hadoop, Streaming Data, Spark and R.

#Snapchat – The #IPO that will probably never profit.

The App That I Still Don’t Understand is IPOing

It used to be music that instantly classified you as “old”, with me it appears to be this app.

Snapchat has finally revealed plans for the much publicised IPO…..

And there’s one sentence that says everything about Silicon Valley IPO’s. You’ve read it a thousand times probably but here Snapchat were pretty plain about it.

“to incur operating losses in the future, and may never achieve or maintain profitability”

Which is about as wide open as you can get. The IPO therefore has the single purpose, as other Silicon Valley IPO’s tend to, recoup the money for the initial investors. There’s 24 of them… the idea of buying shares as investment is thinking that over the long term that the company will be striving to make a profit for the shareholders and therefore paying a dividend in the future. A company claiming it won’t make a profit to traditional shareholders is…..


The hope then is the share price is more than what you paid for it and sell it on.

And if you think the unicorn eye-watering valuation of $25 billion is impressive then the loss figures are equally sphincter tightening. You can also see this in Uber, Lyft and other media luvvied ventures. These aren’t businesses, they’re planned investor flip schemes where profit isn’t a measure, just a bonus. The focus is on the IPO.

What is Business?

If you need a reminder.

“Business is a very specific, limited activity, whose defining purpose is maximising owner value over the long term by selling goods or services.  Accordingly business is not an association to promote social welfare, spiritual fulfillment or full employment; such organisations are legitimate, but they are not businesses” Elaine Sternberg – Just Business – Business Ethics In Action

I’m just wondering if the buyers of Snap shares will be left with a donkey, not a unicorn once the original investors have cashed out. It’s a well documented approach, hardly new.

The ongoing issue of data transparency in Northern Ireland. #data #opendata #charity

There’s an interesting data parallel happening at the moment, the ability to champion open data and all the trimmings: reports, dashboards, findings (with smatterings of bias thrown in for good measure). And on the other hand the one in the cogs of government, this kind of thing.


During the Detail Data Conference last week the main word that delegates were interested in most of all was one of “transparency” and while Northern Ireland is getting there in some measure, there’s a portal of open data, I can even see Translink real time rail data now, something un-imaginable five years ago.

If you look at the mainstream media it appears to be the opposite, data is requested, demanded and downright expected so someone can back up their claims. It’s been an interesting few weeks indeed.

The harsh reality is that while things like the open data portal for Northern Ireland are a great idea, under good management and have the potential to let others do good things, well we are far behind with some catching up to do.

Three Key Requirements for 2017

Open up the Postcodes

It comes up time and time again but on Wednesday it was coming from some heavy hitters in the open data sector. The GrantNav utility in the 360Giving website lets you find details on all levels of giving in the UK. Their main issue (if I picked it up correctly) is the ease of searching on Northern Ireland, with postcode searches still requiring a licence it makes things very difficult for them and others on the mainland.

The work around was to map to ward level, it feels like a stop gap to me though, I’m hoping from NI’s side of things it can be resolved. The mainland is taking interest in what we do, there are times we just make it difficult to access the information.

file-15-01-2017-11-46-14Open Up Grant Funding

I know that Detail Data got part the way with this, here’s the report on Invest NI funding.  These usually require freedom of information requests, phone calls, sarcastic tweets and so on. There has to be an easier way. Who benefits from the RHI money? And I’m not just talking about the ones with boilers, I mean the suppliers too? Invest NI money, as it’s essentially public money to aid the economy, so the old eSynergy fund or the newer TechStartNI management. The Propel Programme, who’s going through that and what were they getting at the end? StartPlanetNI? I’m scratching the surface…. there’ll be hundreds.

Now Tell Me The Ones That Got Turned Down

One conversation I had during the Detail Data Conference was with 360Giving CEO Rachel Rank, while the site has all the parties who received money it doesn’t list the ones who applied and got turned away. I was interested to see if i) that data existed and ii) would it be publicly available.

The short answer is no. I personally believe this needs to happen. There’s actually a lot to learn here and I’m thinking on a algorithmic level. Neural networks predicting Darcey Bussell’s scores are all well and good but it would be better for all to put it to good use.

How about a startup who securely feeds their idea into a system that can predict their probability of getting funding from various bodies? To train a system like that you need the accepted, the declined and the undecideds.

Northern Ireland’s Panama Papers Moment?

It’s better to start opening up as much data as possible. What could follow is an opening up of data by other means. Back door sifting and publishing with connections to all the funding received. The media would love this and baking cakes with pictures on will be a walk in the park compared to the explaining some could potentially have to do.

Rest assured, it won’t be me having the Panama Papers moment, I have enough to do.


Filtering the Streaming #Twitter API with #Onyx Plugin – #clojure #twitter #data #firehose #onyx

Twitter Fashion Analytics Revisited?

Not quite but I have been working on revisiting the original work I did in SpringXD and moving it to the Onyx Framework.


The Twitter Plugin for Onyx Workflows

The original Onyx Twitter plugin acted as a handy input stream from the Twitter firehose. Nice and easy to setup too, just put your Twitter app credentials as environment variables and add a catalog. For example:

{:twitter/consumer-key (env :twitter-consumer-key)
 :twitter/consumer-secret (env :twitter-consumer-secret)
 :twitter/access-token (env :twitter-access-token)
 :twitter/access-secret (env :twitter-access-secret)
 :twitter/keep-keys [:id :text]}

The plugin used the .sample call from Twitter4J library which gives you the 1% of tweets from the public feed. Fine but the data coming out was wayward, it’s very random.

The .filter function

For a more refined look at the Twitter firehose you’re better off using the .filter function which takes a FilterQuery and uses an array of Strings that you want to monitor, this then refines the matches against the firehose and then you get 1% of the matching tweets and not a bunch of random noise.

So, to that end, I’m delighted to say that the Onyx Twitter plugin now supports the tracking of strings in the stream.

Just add:

:twitter/track ["#fashion" "#louboutins" "#shoes"]

..to the catalog and you’re away. If you leave this option off then you get the usual random sample from the public stream.

I might get around to revisiting the whole Twitter Fashion analytics thing in Clojure, with the Onyx Platform soon.

Quick Recipe for #Kafka Streams in #Clojure


Kafka Streams were introduced in Kafka 0.10.x and act as a way of programatically manipulating the data from Kafka. William Hamilton from Funding Circle introduced the concepts in a lightening talk during ClojureX. As discussed by myself and William, make Java Interop your friend.

I’ve based my example from James Walton’s Kafka Stream example which you can find on GitHub.

The Quick and Dirty Basic Stream Demo

First add the dependencies to your project.

[org.apache.kafka/kafka-streams ""]


First of all some configuration, the properties we’re going to use give the application a name, the Kafka broker to work with and the key/value classes to use for each message (in this example they are both strings). With those properties we then create a StreamsConfig class.

(def props
 {StreamsConfig/APPLICATION_ID_CONFIG, "my-stream-processing-application"
 StreamsConfig/BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"
 StreamsConfig/KEY_SERDE_CLASS_CONFIG, (.getName (.getClass (Serdes/String)))
 StreamsConfig/VALUE_SERDE_CLASS_CONFIG, (.getName (.getClass (Serdes/String)))})

(def config
 (StreamsConfig. props))

Creating the Builder

The main builder is defined first then we’ll add the topic and config on when the stream is created.

(def builder

Defining the Topic

Just a string array of topic names, Kafka Streams can read more than one topic.

(def input-topic
 (into-array String ["topic-input"]))

Working with the Stream

While the stream is running every event passed through the topic becomes a KStream object, it’s a case of passing that through a method to do some work on the content of that stream. In this case we’re mapping the values (.mapValues) and converting the value of the key/pair (v) to a string then counting the length. That thing to do is print out the results to the System.out.

 (.stream builder input-topic)
 (.mapValues (reify ValueMapper (apply [_ v] ((comp str count) v))))

It’s worth looking at the actual Java API for the Kafka KStream class. There are lots of methods to manipulate the data passing through, this might result in a value being sent to another Kafka topic or it just being written out to a file. Take the time to study the options, you’ll save yourself time in the long run.

Setting It All Off

The final parts of the puzzle.

(def streams
 (KafkaStreams. builder config))

(defn -main [& args]
 (prn "starting")
 (.start streams)
 (Thread/sleep (* 60000 10))
 (prn "stopping"))

The main function starts the service and will keep it alive for ten minutes.

Packaging it all up

I’m using leiningen, it’s a simple case of creating an uberjar.

$ lein uberjar
Compiling kstream-test.core
log4j:WARN No appenders could be found for logger (org.apache.kafka.streams.StreamsConfig).
log4j:WARN Please initialize the log4j system properly.
Created /Users/jasonbell/work/dataissexy/kstream-test/target/uberjar+uberjar/kstream-test-0.1.0-SNAPSHOT.jar
Created /Users/jasonbell/work/dataissexy/kstream-test/target/uberjar/kafka_streams.jar

Testing the Service

So straight out of the box, Kafka 0.10 is installed in /usr/local, I’m going to be the root user while I run all this (it’s just a local machine).

Start Zookeeper

$KAFKA_HOME/bin/zookeeper-server-start.sh config/zookeeper.properties

Start Kafka

$KAFKA_HOME/bin/kafka-server-start.sh config/server.properties

Create the Topic

$KAFKA_HOME/bin/kafka-topics.sh --create -zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic topic-input

Created topic "topic-input".


Start a Producer and Add Content

$KAFKA_HOME/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic topic-input
This is some data
Thisis some more data
This is more data than the last time

Start the Uberjar’d Stream Service

$ java -jar target/uberjar/kafka_streams.jar
log4j:WARN No appenders could be found for logger (org.apache.kafka.streams.StreamsConfig).
log4j:WARN Please initialize the log4j system properly.
null , 5
null , 3
null , 3
null , 3
null , 3
null , 6
null , 3
null , 3
null , 6
null , 0
null , 3
null , 4
null , 5


A really quick walkthrough but it gets the concepts across. Ultimately there’s no way of doing things that’s better than the other. Part of me wants to stick with Onyx, the configuration works well and the graph workflow is easier to map and change. Kafka Streams is important though and certainly worth a look if you are using Kafka 0.10.x, if you are still on 0.8 or 0.9 then Onyx, in my opinion, is still the best option.



@jennschiffer Delivers the Best Talk on the State of Tech I’ve Seen #tech #talks


I love the tech industry, the diversity, the people, everything. I don’t like the nonsense that goes with it, it’s pointless and not needed. And it’s not often I’ll watch something that leaves me peeing myself laughing but also angry about the treatment that some get.

So, Jenn Schiffer, your talk at the XOXO Festival, brilliant and thank you. As for everyone else, her talk is definitely required viewing.