#nitechrank – The Northern Ireland tech jobs index. (@nitechrank)

The History

The #nitechrank index came about for two reasons: firstly, it was prompted by a piece by Naomi McMullan on The Profit Margin podcast; secondly, I had an unused Twitter account, so I might as well put it to some use.


Why Do This?

I’ve not seen a reliable gauge of the NI tech jobs market, so I thought it would be nice to give something back and provide one. You never know, it might even become a regular segment on The Profit Margin….

The Methodology

The index is calculated from a selection of programming language jobs that are listed on the nijobs.com website. The score is a combination of the search results for the following languages/skills:

  • Java
  • Hadoop
  • PHP
  • Perl
  • Python
  • Ruby
  • iOS
  • C++

Today is the first day, so this is just to establish the baseline.

All subsequent index calculations will be based on today’s baseline figure and then posted to Twitter every day. I’ll automate it at some point, but for now I’ll publish them while I’m having a cup of tea and waking up.

You can follow the @nitechrank Twitter account to get the daily updates; nothing else will be posted there. If anyone has any ideas I’m all ears too, and you can email me at jasonbell@datasentiment.com



Co-Founder/Mentor Selection, could you “Tinderise” it? #startups #math #data @fryrsquared

The Story So Far…

For those who’ve had the sheer bravery to keep up with my ramblings, I salute you. Annoyingly, I created another LinkedIn profile, as it appears you’re a professional leper if you don’t have one. So I’m there in name and spirit, but I’m not accepting any recruiter spam; I’m 100% happy where I am, thank you.

The net result of reopening the Pandora’s box of professional ego storage was that I got two offers of being a mentor on “a fantastic new product”, and even more annoyingly for you, dear reader, that got me thinking. How did a founder come to that conclusion so quickly? Did my LinkedIn invitation just trigger that thought? Was it the catalyst of, “Oh, there’s Jase, I should ask him”, or was the founder playing a game of, “the next annoyed idiot that emails me is going to get asked”?

All possible scenarios, and I’ve seen all of them played out.

When in Doubt, Fry R Square It.

When it comes to matching one set of people (in this instance, founders) with another (in this instance, potential co-founders or mentors), I think of the work of Dr Hannah Fry.

Dr Fry attempting to eat a Subway Foot Long, assuming it’s Chicken Teriyaki or something along those lines…. I hope she doesn’t read this then tell me she’s vegetarian.

So my semi fiendish plan to pair founders with the mentors of their dreams/nightmares is to use something I first learned from Dr Fry, optimal stopping theory.

My interest in maths and stats happened way after I left school; I code and I read a lot. After writing the book on Machine Learning* I’ve wanted to extend what I know. Now I have the OU phoning me up: “those Maths modules you chose, the next bit is paying for them…”. Anyway, I digress slightly.

At some mad networking event you decide that you are not leaving until you find your mentor. So first work out how many people you are going to talk to. Okay, there are 50 people in the room…. the golden number you need to remember right now is 37%. That’s your lottery ticket right there.

In other words, fifty multiplied by 37%… she or he is your target to pin down, they are the chosen one. Okay, 18.5, let’s round up and say that you have to shout, “YOU’RE NUMBER 19, YOU ARE THE ONE, SIGN HERE!”, you could even do all this with Paul Hardcastle playing in the background. The rest of them, well, that’s just a risk you’ll have to take.
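
The 37% rule can be sketched in a few lines of Clojure. This is a toy sketch of optimal stopping only; the `scores` ratings below are made-up numbers, not anything from the post:

```clojure
;; Optimal stopping sketch: skip the first ~37% of candidates,
;; remember the best of them, then commit to the first later
;; candidate who beats that benchmark (or settle for the last).
(defn optimal-stop [scores]
  (let [n    (count scores)
        skip (max 1 (Math/round (* n 0.37)))
        best (apply max (take skip scores))]
    (or (first (filter #(> % best) (drop skip scores)))
        (last scores))))

;; Ten people in the room, rated as you meet them:
(optimal-stop [3 7 2 9 4 8 6 5 10 1])
;; => 10 (the first person who beats the best of the skipped four)
```

With 50 people in the room, `(Math/round (* 50 0.37))` gives 19, matching the “YOU’RE NUMBER 19” shout above.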

Note: Normally I find that investors, mentors and potential co founders, well they talk to you about their entire life story, how great it is to be in the tech sector (they obviously weren’t there for the dot com bust, I was, it was hell), and how AI chatbots are the next new thing that’s going to replace email. Then they hand you a card and say….. “email me”. 

So You Aren’t At The Networking Event

A Yorkshireman living 70 miles from Belfast, where the networking meetings happen: it’s not going to pay dividends. Not that I’m looking, but let’s imagine.

So I need another method. If you watched “Horizon: How To Find Love Online” you’ll know that the app Tinder was used in the brutal writing off of the 37%. With the likes of Linkedin and Angel.co does it not make sense to apply the same to mentor/cofo searching? Another ranked pool of people seeking out other interested people for evenings of meetings, funding applications and investor pitches.


Swipe your way through (this is apparently what the kids do) 37% of a number of users you’re willing to write off and…. the mentor of your dreams! So much work taken out, you’ll be thanking me, actually no, Dr Fry, later. I cannot, and will not, take any credit for this.

Now what would be funny would be if Elon Musk came up within the first n results and you had to move on, you’d be kicking yourself for days. There’s got to be another blog post just on working out the probability of such a thing happening. Another time perhaps, right now we have the Timmy Number to contend with.

I Can Just See It Now….

I’m expecting a phone call in two or three weeks’ time saying that someone applied the optimal stopping rule at a networking meeting and it’s all gone horribly wrong.

Well, I’m not responsible; you had the choice to ignore this blog post completely. I’m just wondering who’ll give it a go :)

If you want to know more about optimal stopping theory and other interesting choosing algorithms then Dr Hannah Fry’s book, “The Mathematics of Love: Patterns, Proofs and the Search for the Ultimate Equation”, is definitely worth a look.

* And for those who want a challenge, the Chinese and Korean translations of my book are coming out in the next few months.

Why are we allowed to #bet on already concluded #events?

Recently I posted on the probability of David Cameron leaving Downing Street over the Panama Papers scandal. I suggested a more accurate way of looking at the probability was to see who was putting money on it.

What is alarming is the other games on the betting sites….

For a Man Who Doesn’t Like Sport……

….I’ve worked on a lot of sporting themed stuff. My first internet job was in 1997 working at PA/Sporting Life on the website. I learned Perl on the back of a KitKat wrapper, I worked on one of the first digital pools coupons and I also worked on the very first online betting site, BetOnline.

The team were into horses; some owned them, some knew the finer detail of the numbers of the sport, and I’d say 95% of them had betting accounts. I didn’t bet, it just wasn’t in my nature to, and to be honest it still isn’t.

Later I worked at SportsFusion on a bunch of sites, including prediction markets. I learned way too much about PIN machines.

Guessing Stuff

Is a certain outcome going to happen? That’s it in a nutshell.

The Slots

Random chance, like slot machines, adds to the excitement (if you like that kinda thing). The odds on the old mechanical slot machines could be calculated with a degree of certainty: three reels, ten pictures on a reel, so the odds of the jackpot were going to be 1 in 10³, basically 1 in 1,000. Even with that 0.1% chance of the event happening, it does limit the payout a company would put up.

To make it more exciting, just add more pictures to the reels. They did, upping it to 22 per reel (22³ is 10,648 combinations), making it roughly a 0.01% chance. Still not exciting enough for some, I’m sure.

Things changed in the digital revolution: with the introduction of random number generators the odds became longer and longer. In certain countries there is a percentage of funds a machine has to pay out, but it’s still less certain than the old mechanical slots.
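
The reel arithmetic above is trivial to check. A throwaway sketch, not based on any real machine spec:

```clojure
;; Jackpot odds on a mechanical slot: one winning combination
;; out of symbols-per-reel raised to the number of reels.
(defn jackpot-odds [symbols reels]
  (/ 1.0 (Math/pow symbols reels)))

(jackpot-odds 10 3) ;; => 0.001, i.e. 1 in 1,000, a 0.1% chance
(jackpot-odds 22 3) ;; 1 in 10,648, roughly a 0.01% chance
```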

Sports Betting

Regardless of whether it’s horses, greyhounds, football (or any form of Sportsball for that matter) you can bet on an outcome.  Fixed odds is fairly normal and there’s also Parimutuel betting where all the money for the bet is pooled together and is then split against the number of people who picked the correct outcome.

You can bet on just about anything now, politics, fantasy sports, digital virtual sports and television programmes. And it’s the last one that is open to certain amounts of abuse.

The Great British Bet Off

So when a TV show kicks off a new series with a competitive element, the bookies will very quickly have a market open for it. All very well if it’s live, though I still don’t think it’s a good idea. Recorded programmes, like The Great British Bake Off, are made by production companies and broadcast by other companies, so guess what: they already know the outcome but can still bet on it.

It looks bad when the production company, allegedly, starts opening up betting accounts and betting on the winner.

Would you bet on a drama? Why would you bet on a drama? Who’s going to get bumped off on Game of Thrones, yup there’s a market for that too.

Just seems to me that if something’s already recorded with an outcome then all bets are off, someone somewhere knows the outcome and therefore it’s not really a random outcome.

Minor rant over…..


Calculating the Timmy Number…. in the mind of @TimTendo.

It all started with this….

The Hypothesis


My Immediate Response


The Short Answer

This is not the first time Tim has led me down a Liz Hurley shaped rabbit hole; lest we forget the shoe size estimated from the heel alone…. the internet does not forget, Tim!

So, Tim, to answer your question. Yes I could…. the quality of the question determines the quality of the answer.

Job done, everyone celebrate with a drink, break out the bubbly, Cava all round! (I wonder if he’ll keep reading down…….)






















The Slightly Longer More Involved Answer


While the “yes I could” answer still stands, the quality of the answer is a different matter altogether, mainly because the variables are completely wayward at this point. Take the population of London, 8.539m, and the two friends Tim’s bumped into.

(1 / 8,539,000) * 100 = 0.0000117%

We really need a better set of variables to work this out in a more refined manner.

  • How many people does Tim know?
  • What time of day was the Tube journey?
  • What’s the average capacity of a full London Underground train?

How Many People Does Tim Know?

Psychologist Richard Wiseman says we know about 300 people by first name. Now Tim’s Twitter profile maintains he has 1,400+ followers but let’s be fair he could have bought some of those😉 – so I’m sticking with 300.

What’s The Average Capacity of a Full London Underground Train?

Different trains, well they have different carriages and capacities. So we’ll go with an average. There’s a nice list on Wikipedia. So I’ve got a Steve Reich (let’s see who gets that joke!) number of 816.

What Time of Day Was Tim Travelling?

Passenger volume on the Underground is not a constant. You can have a percentage of capacity (or over capacity in the morning/evening rush hour), as the Economist has previously reported.


So the line, the time of day and the number of stops between departure and destination all have a very large bearing on how many travellers Tim will encounter. As I don’t have that information, our result is going to have a fairly wide tolerance, so it won’t be that accurate.

Let’s Try and Work Something Out

So let’s assume all of Tim’s 300 friends are on the Underground at the same time, at a busy station at a busy time of day. There’s no really nice way to work this out; looking around there are different theories, but I’ll plump for this one.

(Timmy’s Friends + People On The Train) / London Population

(300 + 816) / 8,539,000 = 0.01307%

Now that assumes a lot: the train is full and no one gets on or off between the stations. Even if, over 10 stops, 200 people get off and 200 new people get on, that gives us 816 + (10 × 200) = 2,816 people.

(300 + 2816) / 8,539,000 = 0.03649%

Still a small amount.

But that’s not the end, because Tim bumped into two people he knows.

(299 + 2816) / 8,538,999 = 0.03648%

0.03649 * 0.03648 = 0.001331%
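
Since we’re already guessing, the whole back-of-envelope sum fits in a scrap of Clojure. The figures are the guesses from above, nothing better:

```clojure
;; Chance (as a percentage) that one person encountered on the
;; journey is one of Tim's friends, using the guessed figures.
(def london-pop 8539000)

(defn encounter-pct [friends on-train population]
  (* 100.0 (/ (+ friends on-train) population)))

(def first-friend  (encounter-pct 300 2816 london-pop))
(def second-friend (encounter-pct 299 2816 (dec london-pop)))

;; Both friends on the same journey:
(* first-friend second-friend) ;; ~0.00133, i.e. about 0.00133%
```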

I’m still not 100% convinced that’s correct, for a varying number of reasons. The variables, for a start, are so inaccurate; we don’t really know them, we’re making guesses. You could safely add a 20% tolerance each side and still be way off the mark.

As Lee Schneider pointed out in his post on the same question:

“First, it’s amazing that we can put a number to something that you might think of as random, like running into a friend. Second, the number delivered by our spectacular calculation was meaningless. No way everyone in New York is going to be outside at the same time and distributed randomly so I could run into them in a controlled way. Fuggaboutit! As the statistics professor put it, “These assumptions are ridiculous, of course!””

But hey, Tim, we had a crack at it.





The Incoming #ChatBot #Revolution – Don’t Forget the Most Important Question…


It’s been a busy week for chatbots; not surprising, though, as apps have now become the big-lake/minnow effect and everyone has realised there’s probably not enough money in the “App Economy”.

How Much Interest?

Enough that it looks pretty obvious in Google Trends.


Now IRC and AIM Messenger chatbots have been around for years, I was writing AIM bots to do common tasks (Weather, RSS Feed News and Flight info) back in 2002 at the same time I was getting the data mining bug. The one thing we didn’t have then was someone else hosting the platform, it was hosted and run by the owner.

Now we have the added fairy dust of AI, high volume data processing and a tech press and investor structure who’ll throw money at the next big thing.


The issue is that the deployment and hosting landscape has changed considerably. With Facebook and Microsoft firmly planting their flags in the bot arena, it’s the big companies that are offering the tools to make your next fortune (in theory). Those frameworks come at a cost: data privacy.

Who Owns The Conversation?

This will be the most important question, and where you host the chatbot of your dreams is going to be pretty important. As with apps, the main challenge is going to be the distribution of your chatbot. A chatbot is just another app, and people will line up with their own app stores. Telegram already have an API for integrating chatbots.

Training the AI is hard if you don’t have the experience, which is probably why I’ve had a few enquiries over the last week. I don’t invest, I don’t want to be your tech co-founder (that old chestnut) and I’m not going to build it for you. There’ll be a host of accelerators popping up to get your chatbot to market. The majority will dodge the most important question, who owns the conversation, as it’ll be all about being first out of the door and investor ready.

Putting your product on someone else’s platform is always a brush with danger; regardless of what you may think, you’re losing control of the core platform. While I agree that the larger companies will have already done the training involved, there’s the risk of that Nest moment where your bot is shuttered for no reason whatsoever. Lest we forget the big Twitter API cherry-pick back in 2012.

Azeem Azhar nailed it in his excellent Exponential View newsletter: “There are many reasons to follow the new KISS. (Keep it Separate, Stupid).” If you’re not reading it, then subscribe here; it’s become my essential Sunday morning reading.

Is it really AI?

Over time people will realise that doing AI is hard. The problem right now is the transition phase where everyone is reading about AI in the media and the explosion of startups waiting in the wings. Remember that annoying quote about BigData and sex? Just rinse and repeat that with AI chatbots.

“An AI chatbot is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it…”

As I guessed a few weeks ago, there’s a good chance that you just need to expect the obvious and realise that where there’s an “AI” thing, there’s more than likely a human somewhere along the line. Some of the time it’s to keep training the algorithm, and that’s fine (and needed). What I believe you’ll see is the rise of companies and entrepreneurs touting AI and just having a human responding to a web/phone/text request.

Lest we forget the sorry tale of SpinVox, whose Voice Message Conversion System (VMCS) was sending voicemails off for human conversion, usually outside the EU, which landed them in a lot of bother.

The Coming wave….

So, the coming wave of AI startups – a mixture of technology, hype (and let’s be honest, it’ll be mainly hype) and human interaction. It’s like the App explosion of 2007/08. No one will be making $400k from a farting AI though.


Training the black box, unleashing the branded chat bots. #clojure #nlp #slack #brands #marketing

Are chatbots the beginning of the new internet? No, I don’t believe so; it’s just another realignment of how users interact with the internet. “The new websites” is something I don’t buy into: a chatbot is just another black box dependent on an input stream of information from a user. Let me explain….

Slack is just another input layer

Like our own input devices (eyes, ears, nose, skin and so on), the internet has a set of input devices of its own: web forms, SMS messages (and WhatsApp, iMessage, Kik etc.), voice recognition, Slack channels and so on. What they have in common is that information fed through these channels has to be handled, interpreted and, in some cases, responded to.


The early days were pretty basic call and response stuff. And chatbots of one type or another have been around for a long time. Who remembers writing bots for IRC and AIM Messenger back in the day, just me…. I don’t think so.

Slack offers the addition of a signed up audience, you’ve already registered an interest to join the channel so that’s a lot of work done already. Especially if you are interacting with a brand, the rest is a matter of reading the input. To be fair brands should be doing this with all the input channels they have, web forms, email, slack and Twitter etc. And in most cases it can be automated.

The important part of all of this is the black box which is decoding and defining the response.

Reading the Input Streams of Customer Thought

Let’s consider the following….

“Hi, my name is Jason and I’d like to book a restaurant table in Belfast please on 1st July”.


There’s an action (a user wants to book something), a type (in this case a restaurant), a date (1st July) and a location. The type could be interchangeable at this point; it could be a hair appointment or a flight booking. All we need is an AI that can handle all these things coming in. At this point we know the phone number, email address or Slack name of the user, so we don’t need to go finding that; we’ve already got it.

Using NLP to find the who, what, when and where

I’m using the clojure-opennlp library and its NLP models to tokenize the text and extract the details that I’m looking for. And for ease I’ll do it all from the REPL.

The first thing I have at my disposal is a set of predefined models that have already been trained, so for things like names and locations I’m not having to do lots of work.

wibble.chatty> (require '[opennlp.nlp :as nlp])
;; => nil
wibble.chatty> (def tokenizer (nlp/make-tokenizer "/opt/models/en-token.bin"))
;; => #<Var@6eb72feb: #function[opennlp.nlp/eval689$fn--690$tokenizer--691]>
wibble.chatty> (def names (nlp/make-name-finder "/opt/models/en-ner-person.bin"))
;; => #<Var@16ede515: #function[opennlp.nlp/eval719$fn--721$name-finder--723]>
wibble.chatty> (def locations (nlp/make-name-finder "/opt/models/en-ner-location.bin"))
;; => #<Var@15b5b602: #function[opennlp.nlp/eval719$fn--721$name-finder--723]>

Let’s define our input text. We’d normally be reading this from an API of some form, or via email reading, which can be easily automated.

wibble.chatty> (def input-text "Hi, my name is Jason and I'd like to book a restaurant table in Belfast please on 1st July")
;; => #<Var@62be6517: 
 "Hi, my name is Jason and I'd like to book a restaurant table in Belfast please on 1st July">

With my models loaded and my input text defined we can have a go at extracting some information.

wibble.chatty> (def enquiry-name (names (tokenizer input-text)))
;; => #<Var@2081c12: ("Jason")>
wibble.chatty> (def enquiry-location (locations (tokenizer input-text)))
;; => #<Var@7d92da99: ("Belfast")>

Even with that information alone we can form a response, say “we’re on it”, and send that back to the user. Keep in mind the response mechanism is only another API to the input sensor provider (email, Slack, SMS etc.). In other words, it can be easily handled programmatically.

wibble.chatty> (format "Hi %s - looking for restaurants in %s." (first enquiry-name) (first enquiry-location))
;; => Hi Jason - looking for restaurants in Belfast.

Remember that OpenNLP is returning sequences of words that are in the model. At this point we could fire off an API call to Yelp or Open Table and look for restaurants in that location with the resulting first ten results being sent back to the user with TripAdvisor reviews.

This is the surface scratched.

Chatbots as they currently stand are an easy technology: input data arrives from an API, goes into some decoding model, and a response is generated. The challenge is to have a meaningful conversation. Do I hear “Turing Test”, anyone?

So the question remains: why Slack as an input medium? Well, it’s an easy route to knowing who the customer is in the first instance. Plus the UI is built, the API is defined and it’s easy to integrate with.

If you’re looking for the real holy grail, you want to raise the UI to the next level and start looking at speech as the input sensor. And that means waiting for Apple to release a Siri API (developers have been waiting a long time for that) and Google to release the Now API.

I’ll be tinkering with this a little more in the future.


Will Cameron Go? Using #Clojure, betting odds, Sherman Kent and probability.

I don’t trust social media hype, simple as that. While the internet is the perfect platform for saying how you feel, that’s no guarantee that action will take place.

Cameron Will Be Gone Tomorrow(?)

So when I see things like this:


It’s not my default thought to think that’s that, it’s a done deal. No way, never, nope not me.

There is a Better Gauge

Twitter, with its automated tweets and “let’s see if we can get this trending” games, is something I certainly don’t trust. It wasn’t a reliable gauge of X-Factor or The Voice winners, so I don’t really give it much hope for anything else these days.

Edward Snowden, on the other hand, did hit the sentiment right.


An even better gauge on predicting the future is to see who’s putting real money on it.

Show Me The Odds….


And lo and behold, there are a couple of open books on when David Cameron will no longer be Prime Minister. For him to leave in 2016 there are two sets of odds on offer: 2/1 and 7/4.

Betting odds are just another way of showing probability, and they’re easily worked out too. Take 2/1: we can call A = 2 and B = 1, i.e. A/B. The percentage probability of those odds is then % = B / (A + B).

I can even wrap that up in a Clojure function:

(defn calc-prob [a b] (double (/ b (+ a b))))

So Ladbrokes are giving us 2/1, let’s have a look with my new function.

user> (calc-prob 2 1)
;; => 0.3333333333333333

So 33% chance…. ok, let’s look at Betfair at 7/4.

user> (calc-prob 7 4)
;; => 0.3636363636363636

Let’s Consult Sherman Kent

Sherman Kent retired from the CIA in 1967; one of his legacies, though, was a chart, a real simple one, on the potential outcome based on a probability. A “fair chance” of success had been defined by one advisor as 3 to 1 against; Kent needed a way of interpreting what phrases like “fair chance” and “probable” meant coming from advisors, so he came up with this table.

Certainty (%) General Area of Possibility
100% Certain
93% (+/- 6%) Almost certain
75% (+/- 12%) Probable
50% (+/- 10%) Chances about even
30% (+/- 10%) Probably not
7% (+/- 5%) Almost certainly not
0% Impossible

So, looking at our 33-36% betting probability of Mr Cameron packing up in 2016, it’s looking like a “probably not”.
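
Kent’s table drops neatly into Clojure too. Here’s a sketch (my own encoding of the table above, not Kent’s) that snaps a probability to the nearest phrase:

```clojure
;; Kent's words of estimative probability, keyed by the
;; centre of each band.
(def kent-table
  [[1.00 "Certain"]
   [0.93 "Almost certain"]
   [0.75 "Probable"]
   [0.50 "Chances about even"]
   [0.30 "Probably not"]
   [0.07 "Almost certainly not"]
   [0.00 "Impossible"]])

;; Pick the phrase whose band centre is closest to p.
(defn kent-phrase [p]
  (second (apply min-key #(Math/abs (- p (first %))) kent-table)))

(kent-phrase 0.33) ;; => "Probably not"
```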

To Summarise

When people put money on things they have a certain confidence that the event is going to happen. That sorts out the serious folk from the armchair opinions straight away. So it’s a good idea to consult these odds to see if they agree or disagree with the hypothesis. It’s just another piece of information.

Thrown in with a quick Clojure lesson, betting probability and a history lesson on Sherman Kent, well that’s not a bad evening’s work.


Are Artificial Intelligence Frameworks the new Web Frameworks? #AI #MachineLearning #Data #ArtificialIntelligence

I spent most of Sunday morning working on some updates and a couple of corrections on my book, Machine Learning: Hands on for Developers and Technical Professionals. The comments and feedback all centred around a common theme.


Validating the Muse.

Interestingly a lot of the comments centred around decision trees which at least proves they are still popular but also proves that you can sit down with pen and paper and do some of the grunt work yourself to validate the model.

Now most people will load a dataset into something like Weka and let the system do all the work. And you know what, that’s okay; there’s nothing wrong with that. At the same time, though, with some effort I could work out the information gain myself with a calculator, find the potential root node of the tree, and prove the model was good.

The same could be said for things like Apriori algorithms, Naive Bayes and Bayes Networks, Linear Regression, K-Means clustering and, at a push, Linear Support Vector Machines. If you have a pen, paper and a calculator you can start working things out.

When we get to neural networks that’s where things start to get hazy, really hazy.

The Media’s Love Affair with Neural Nets

Oh boy, the tech press like a good ANN story, whether it be the DeepMind team beating Lee Se-dol in three games of Go, IBM’s Watson doing the whole Jeopardy thing, or Google’s self-driving car. They are all, without doubt, sexy AI stories that are going to generate discussion and debate. And the joy of debate is that it raises up polar opposites of public opinion: love it or loathe it, it’s going to help us or it’s going to destroy us.

The core concepts of neural networks, from perceptron weights to activation functions, are quite easy to grasp. The problems arise once the models have been created: the maths can become so black-box-like that the models are difficult to prove or write off. One thing’s for sure: over time, and with enough iterations, it will get better.


Like I said in the book, “One of the keys to understanding the artificial neural network is knowing that the application of the model implies you’re not exactly sure of the relationship of the input and output nodes. You might have a hunch, but you don’t know for sure. The simple fact of the matter is, if you did know this, then you’d be using another machine learning algorithm.”

Though there are good books on the subject, the models themselves are always difficult to prove; with enough training you’ll get results. Even the publications I hold in high esteem, such as “Data Mining” by Witten, Frank and Hall, merely skirt around the mechanics of neural nets, which, to be fair, made me feel a whole lot better when writing the book at the time.

Artificial Intelligence as Frameworks

What I believe we are seeing is the chasm point of artificial intelligence frameworks. Some have been around for a while, Weka and RapidMiner for instance, and there are others that are new on the scene, such as TensorFlow. The common thread, though, is that they provide a starting point for machine learning and AI for the mass-market developer.

It’s very much like the web frameworks of the Web 2.0 era of websites. The main tipping point was Ruby on Rails which obscured a lot of the hard work that was going on under the covers. This led to a plethora of web frameworks in a variety of languages where you really didn’t need to know what was going on technically, it was just a case of downloading, setting a few things up and then going through the motions of creating the objects you needed. There came a time where it was more important to know how to get the framework working than the underlying language that was doing all the work. I believe we’re at the same point with artificial intelligence.

Data Velocity + AI + Lack of Algorithmic Knowledge = Concern

While some of the machine learning algorithms have been around for forty-plus years, it’s only recently they’ve come into vogue, due to computing processing power, the vast amounts of data generated and the need for corporations to push the bottom line down to keep stakeholders happy over the long term.

As we push this AI capability into the hands of developers who may have no prior knowledge of how the stuff really works, is this a good idea? I don’t believe so, and any corporation saying “we need to do data science” to a team that doesn’t know what it’s doing is committing commercial suicide in my eyes.


If you look at Google and Tesla, they’ve been analysing data over a long period of time. They’ve got the right people involved, whether that be developers, quants or the hardcore maths folk, to measure, refine and deliver. Even then it goes wrong. The first crash caused by a self-driving car, well, it was bound to happen at some point. You’re working on a probability, and regardless of the odds it can, and will, at some point go the way you weren’t expecting. The point with AI, though, is that you’re not delving into the algorithm to tweak it at that stage; if you do, then what is the knock-on effect of the change? You don’t really know. That far down the line, all you can do is provide more data for the algorithm to learn from.

Basically you’re left with unanswered questions, but you just have to go with the system because the algorithm says so.

AI and Machine Learning Costs Money

To be done right, these technologies take time to develop and deliver. They also need repeat testing to ensure that the application is behaving as you’d expect. That’s all fine with supervised learning, as you’ve already defined the outcomes of the training data you have. Unsupervised learning comes with its own set of issues that need to be closely looked at before it’s deployed to the real world.

While any developer can download these tools and use them I still firmly believe it’s vitally important to have a knowledge of how these algorithms work. It’s not the easiest thing in the world to do either. I’d rather have an explanation of what happened rather than shrugging my shoulders with a blank look on my face.

“The plane crashed because the algorithm did it” is just not a reasonable excuse.




NI’s Air Passenger Duty, #Opendata and #Clojure

A long while ago the powers that be decided to cut air passenger duty (APD) on long-haul flights. Yesterday the Scottish government started a consultation on cutting APD. So the question is, how many flights are actually affected by this APD cut in Northern Ireland?


Northern Ireland’s Air Passenger Duty

Any flight from Northern Ireland that flies direct to a destination over 2000 miles is classed as long haul and exempt from APD. Changeover routes, for example if you flew from Belfast International to London Heathrow and then legged it to Dubai, well that doesn’t count.

So how do we find out which flights apply? With some open data and some Clojure.

Airports and Routes

Open flights has CSV files for airports, airlines, routes and potentially schedules. I'm only interested in the airports and the routes.

Airports have a name, IATA code and location information.

465,"Belfast Intl","Belfast","United Kingdom","BFS","EGAA",54.6575,-6.215833,268,0,"E","Europe/London"

Routes are a source and destination airport by IATA code and an operating airline.
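
A route row looks something like this (an illustrative example in the Open flights format, not taken from this post's data):

BA,1355,SIN,3316,LHR,507,,0,744

That's the airline code, airline ID, source airport and its ID, destination airport and its ID, the codeshare flag, number of stops and equipment.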


A Quick Checklist

So, we’ve got a question and we’ve got some open data. Now I need a check list of what needs doing to get to the answer.

  • Load the CSV files.
  • Get the airport info of the departure airport (BFS in our case)
  • Get the matching routes where BFS is the source airport.
  • Calculate the distance between the two lat/lon points of each airport.
  • Check the flight is classed as long haul.
  • Calculate the average.

Loading CSV Files

It's always worth knowing how to open a CSV file in Clojure and convert each row to a map with keys. First of all we need the header information for each CSV file; as the routes and airports files don't include a header row, we have to work it out and define our own field names.

(def airport-fields [:airportid :name :city :country :iata :icao :lat :lon :alt :timezone :dst :tz])
(def route-fields [:airline :airlineid :source-airport :source-airport-id :dest-airport :dest-airport-id :codeshare :stops :equip])

The actual loading is done with Clojure's data.csv library.

;; Things that will load files for us.
;; Assumes [clojure.data.csv :as csv] and [clojure.java.io :as io] in the ns :require.
(defn load-csv [filename]
  (with-open [in-file (io/reader (io/resource filename))]
    ;; doall forces the lazy sequence before with-open closes the reader
    (doall (csv/read-csv in-file))))

I need a way of connecting the field names with the data; using zipmap will let me do that, so I'm going to create two functions that handle the airports and the routes respectively.

(defn load-airports [filename]
  (->> (load-csv filename)
       (map #(zipmap airport-fields %))))

(defn load-routes [filename]
  (->> (load-csv filename)
       (map #(zipmap route-fields %))))
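
As a quick sketch of what zipmap is doing here, pairing each field keyword with the matching column of the Belfast International row we saw earlier:

```clojure
;; zipmap pairs field keywords with CSV columns; note every value stays a string.
(def airport-fields
  [:airportid :name :city :country :iata :icao :lat :lon :alt :timezone :dst :tz])

;; The Belfast International row from the airports CSV, already split into columns.
(def sample-row
  ["465" "Belfast Intl" "Belfast" "United Kingdom" "BFS" "EGAA"
   "54.6575" "-6.215833" "268" "0" "E" "Europe/London"])

(zipmap airport-fields sample-row)
;; => a map with :iata "BFS", :name "Belfast Intl", :lat "54.6575" and so on
```

Everything stays a string, which is why the lat/lon values need Double/parseDouble later on.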

Finding Specific Airport Information

With an IATA code I can find out information about the airport from the data we've already loaded. The get-airport function takes the IATA code as a string parameter and returns a map of the airport info.

(defn get-airport [iata-code airports]
  (first (filter #(= iata-code (:iata %)) airports)))

The filter function rattles through each entry and returns a sequence of the maps that match the criteria, so it's just a case of taking the first entry (there should only be one anyway).
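
In miniature, with a couple of hand-rolled airport maps (illustrative data, and repeating the definition so the snippet stands alone):

```clojure
(defn get-airport [iata-code airports]
  (first (filter #(= iata-code (:iata %)) airports)))

;; Two cut-down airport maps, just enough to show the lookup.
(def some-airports [{:iata "BFS" :name "Belfast Intl"}
                    {:iata "LDY" :name "City of Derry"}])

(get-airport "BFS" some-airports)
;; => {:iata "BFS", :name "Belfast Intl"}

;; An unknown code falls through the filter and first returns nil.
(get-airport "XXX" some-airports)
;; => nil
```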

Get The Matching Routes For the Source Airport

Now we’ve got some helper functions to do some of the work we can get to the meat of what needs to happen. Finding the matching routes is a case of filtering all the routes and finding ones that match our source airport.

Assuming the routes CSV file is loaded in, we can use the filter function like so:

(filter #(= (:iata dept-airport) (:source-airport %)) routes)

Once I have the routes I need to map over each of them with the aim of creating a map with the source, the destination, the lat/lon for each airport, a distance, the airline and whether the flight is long haul or not. While the function looks complex it's actually fairly simple.

(defn find-routes [departure-airport]
  (let [airports (load-airports "airports.csv")
        routes (load-routes "routes.csv")
        dept-airport (get-airport departure-airport airports)
        matching-routes (filter #(= (:iata dept-airport) (:source-airport %)) routes)]
    (->> matching-routes
         (map (fn [route]
                (try
                  (let [dest-airport (get-airport (:dest-airport route) airports)
                        distance (calc-distance (Double/parseDouble (:lat dept-airport))
                                                (Double/parseDouble (:lon dept-airport))
                                                (Double/parseDouble (:lat dest-airport))
                                                (Double/parseDouble (:lon dest-airport)))
                        long-haul (is-long-haul? distance)]
                    {:dept (:name dept-airport)
                     :dept-iata (:iata dept-airport)
                     :dept-lat (:lat dept-airport)
                     :dept-lon (:lon dept-airport)
                     :dest (:name dest-airport)
                     :dest-iata (:iata dest-airport)
                     :dest-lat (:lat dest-airport)
                     :dest-lon (:lon dest-airport)
                     :distance distance
                     :long-haul long-haul
                     :airline (:airline route)})
                  ;; rows with missing airports or unparsable coordinates become nil
                  (catch Exception e nil)))))))

It may be better to look at the formatted code in Github: https://github.com/jasebell/ni-airpassengerduty/blob/master/src/ni_apd/core.clj#L57 

There’s two functions I need to create before we can test this, the distance and if the flight is classed as long haul.

Calculating Distances

With two latitude and longitude points we can calculate the distance in miles. I actually covered this a long while ago when I was first dabbling with Clojure (thrown in at the deep end might be a better description) after I started working with Mastodon C.

(defn- deg2rad [deg]
  (/ (* deg (. Math PI)) 180))

(defn- rad2deg [rad]
  (/ (* rad 180) (. Math PI)))

(defn get-distance [lat1 lon1 lat2 lon2]
  (+ (* (Math/sin (deg2rad lat1)) (Math/sin (deg2rad lat2)))
     (* (Math/cos (deg2rad lat1))
        (Math/cos (deg2rad lat2))
        (Math/cos (deg2rad (- lon1 lon2))))))

(defn calc-distance [lat1 lon1 lat2 lon2]
  (int (* 60 1.1515 (rad2deg (Math/acos (get-distance lat1 lon1 lat2 lon2))))))

You can read the original blog post for a better explanation.
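
As a sanity check, here's a self-contained sketch of the same spherical law of cosines calculation, using Java's built-in Math/toRadians and Math/toDegrees instead of the hand-rolled conversions, fed with the Belfast Intl and Newark coordinates that turn up in the REPL output further down:

```clojure
;; Spherical law of cosines: cos c = sin(lat1) sin(lat2) + cos(lat1) cos(lat2) cos(lon1 - lon2).
;; 60 * 1.1515 turns degrees of arc into (approximate) statute miles.
(defn miles-between [lat1 lon1 lat2 lon2]
  (let [l1 (Math/toRadians lat1)
        l2 (Math/toRadians lat2)
        dl (Math/toRadians (- lon1 lon2))]
    (int (* 60 1.1515
            (Math/toDegrees
             (Math/acos (+ (* (Math/sin l1) (Math/sin l2))
                           (* (Math/cos l1) (Math/cos l2) (Math/cos dl)))))))))

;; Belfast Intl (54.6575, -6.215833) to Newark Liberty Intl (40.6925, -74.168667).
(miles-between 54.6575 -6.215833 40.6925 -74.168667)
;; => 3168
```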

Is the flight long haul?

With a distance we can figure this out quite easily. My threshold is 2000 miles so I’m going to wrap that up as a piece of data in Clojure.

(def apd-threshold 2000)

Next it’s case of finding out if the distance is greater than the threshold.

(defn is-long-haul? [distance]
  (> distance apd-threshold))

Calculating the Percentage

With the total number of routes and the total number of routes that are classed as long haul I can work out a percentage. Using a mixture of Clojure's count and filter functions we can find out with ease.

Total number of long haul flights / Total number of flights * 100

My Clojure function looks like this:

(defn get-longhaul-percentage [departure-airport]
  (let [routes (find-routes departure-airport)]
    (double (* 100 (/ (count (filter #(= true (:long-haul %)) routes))
                      (count routes))))))

I’m passing in the source airport IATA code and using the find-routes function created earlier to get a list of the all the matching routes from the source airport. The first count is filtered on whether the :long-haul value is true, the second count is the total number of routes in the list.

That’s the checklist complete. Now it’s time to see what percentage of the routes for an airport are actually classed as long haul.

Testing With the REPL

First of all I’m going to test the find-routes function, that will give you an idea of the data structure as it looks before the percentage is calculated.

ni-apd.core> (find-routes "LDY")
;; => ({:airline "FR", :dest-iata "BHX", :long-haul false, :dest-lon "-1.748028", :dest-lat "52.453856", :distance 284, :dest "Birmingham", :dept-lat "55.042778", :dept "City of Derry", :dept-lon "-7.161111", :dept-iata "LDY"}
 {:airline "FR", :dest-iata "FAO", :long-haul false, :dest-lon "-7.965911", :dest-lat "37.014425", :distance 1246, :dest "Faro", :dept-lat "55.042778", :dept "City of Derry", :dept-lon "-7.161111", :dept-iata "LDY"}
 {:airline "FR", :dest-iata "LPL", :long-haul false, :dest-lon "-2.849722", :dest-lat "53.333611", :distance 210, :dest "Liverpool", :dept-lat "55.042778", :dept "City of Derry", :dept-lon "-7.161111", :dept-iata "LDY"}
 {:airline "FR", :dest-iata "PIK", :long-haul false, :dest-lon "-4.586667", :dest-lat "55.509444", :distance 106, :dest "Prestwick", :dept-lat "55.042778", :dept "City of Derry", :dept-lon "-7.161111", :dept-iata "LDY"}
 {:airline "FR", :dest-iata "STN", :long-haul false, :dest-lon "0.235", :dest-lat "51.885", :distance 374, :dest "Stansted", :dept-lat "55.042778", :dept "City of Derry", :dept-lon "-7.161111", :dept-iata "LDY"})

A quick test with City of Derry Airport (LDY) shows us the map and its keys. So that's working, and no long haul flights there; no surprise either.

Looking at Belfast International we get a lot more data.

ni-apd.core> (find-routes "BFS")
;; => ({:airline "LH", :dest-iata "EWR", :long-haul true, :dest-lon "-74.168667", :dest-lat "40.6925", :distance 3168, :dest "Newark Liberty Intl", :dept-lat "54.6575", :dept "Belfast Intl", :dept-lon "-6.215833", :dept-iata "BFS"}
 {:airline "LS", :dest-iata "ACE", :long-haul false, :dest-lon "-13.605225", :dest-lat "28.945464", :distance 1814, :dest "Lanzarote", :dept-lat "54.6575", :dept "Belfast Intl", :dept-lon "-6.215833", :dept-iata "BFS"}
 {:airline "LS", :dest-iata "AGP", :long-haul false, :dest-lon "-4.499106", :dest-lat "36.6749", :distance 1245, :dest "Malaga", :dept-lat "54.6575", :dept "Belfast Intl", :dept-lon "-6.215833", :dept-iata "BFS"}
 {:airline "LS", :dest-iata "ALC", :long-haul false, :dest-lon "-0.558156", :dest-lat "38.282169", :distance 1162, :dest "Alicante", :dept-lat "54.6575", :dept "Belfast Intl", :dept-lon "-6.215833", :dept-iata "BFS"}
 {:airline "LS", :dest-iata "DBV", :long-haul false, :dest-lon "18.268244", :dest-lat "42.561353", :distance 1384, :dest "Dubrovnik", :dept-lat "54.6575", :dept "Belfast Intl", :dept-lon "-6.215833", :dept-iata "BFS"}
 {:airline "LS", :dest-iata "FAO", :long-haul false, :dest-lon "-7.965911", :dest-lat "37.014425", :distance 1221, :dest "Faro", :dept-lat "54.6575", :dept "Belfast Intl", :dept-lon "-6.215833", :dept-iata "BFS"}
 {:airline "LS", :dest-iata "MJV", :long-haul false, :dest-lon "-0.812389", :dest-lat "37.774972", :distance 1193, :dest "Murcia San Javier", :dept-lat "54.6575", :dept "Belfast Intl", :dept-lon "-6.215833", :dept-iata "BFS"}
 {:airline "LS", :dest-iata "PMI", :long-haul false, :dest-lon "2.727778", :dest-lat "39.55361", :distance 1122, :dest "Son Sant Joan", :dept-lat "54.6575", :dept "Belfast Intl", :dept-lon "-6.215833", :dept-iata "BFS"}
 {:airline "LS", :dest-iata "REU", :long-haul false, :dest-lon "1.167172", :dest-lat "41.147392", :distance 992, :dest "Reus", :dept-lat "54.6575", :dept "Belfast Intl", :dept-lon "-6.215833", :dept-iata "BFS"}
 {:airline "LS", :dest-iata "TFS", :long-haul false, :dest-lon "-16.572489", :dest-lat "28.044475", :distance 1910, :dest "Tenerife Sur", :dept-lat "54.6575", :dept "Belfast Intl", :dept-lon "-6.215833", :dept-iata "BFS"}
 {:airline "TCX", :dest-iata "DLM", :long-haul true, :dest-lon "28.7925", :dest-lat "36.713056", :distance 2061, :dest "Dalaman", :dept-lat "54.6575", :dept "Belfast Intl", :dept-lon "-6.215833", :dept-iata "BFS"}
 {:airline "TCX", :dest-iata "LCA", :long-haul true, :dest-lon "33.62485", :dest-lat "34.875117", :distance 2336, :dest "Larnaca", :dept-lat "54.6575", :dept "Belfast Intl", :dept-lon "-6.215833", :dept-iata "BFS"}
 {:airline "TCX",
 :dest-iata "NBE",
 :long-haul false,
 :dest-lon "10.438611",
 :dest-lat "36.075833",
 :distance 1508,
 :dest "Enfidha - Zine El Abidine Ben Ali International Airport",
 :dept-lat "54.6575",
 :dept "Belfast Intl",
 :dept-lon "-6.215833",
 :dept-iata "BFS"}
 {:airline "TCX", :dest-iata "PMI", :long-haul false, :dest-lon "2.727778", :dest-lat "39.55361", :distance 1122, :dest "Son Sant Joan", :dept-lat "54.6575", :dept "Belfast Intl", :dept-lon "-6.215833", :dept-iata "BFS"}
 {:airline "TCX", :dest-iata "TFS", :long-haul false, :dest-lon "-16.572489", :dest-lat "28.044475", :distance 1910, :dest "Tenerife Sur", :dept-lat "54.6575", :dept "Belfast Intl", :dept-lon "-6.215833", :dept-iata "BFS"}
 {:airline "U2", :dest-iata "AGP", :long-haul false, :dest-lon "-4.499106", :dest-lat "36.6749", :distance 1245, :dest "Malaga", :dept-lat "54.6575", :dept "Belfast Intl", :dept-lon "-6.215833", :dept-iata "BFS"}
 {:airline "U2", :dest-iata "ALC", :long-haul false, :dest-lon "-0.558156", :dest-lat "38.282169", :distance 1162, :dest "Alicante", :dept-lat "54.6575", :dept "Belfast Intl", :dept-lon "-6.215833", :dept-iata "BFS"}
 {:airline "U2", :dest-iata "AMS", :long-haul false, :dest-lon "4.763889", :dest-lat "52.308613", :distance 479, :dest "Schiphol", :dept-lat "54.6575", :dept "Belfast Intl", :dept-lon "-6.215833", :dept-iata "BFS"}
 {:airline "U2", :dest-iata "BCN", :long-haul false, :dest-lon "2.078464", :dest-lat "41.297078", :distance 997, :dest "Barcelona", :dept-lat "54.6575", :dept "Belfast Intl", :dept-lon "-6.215833", :dept-iata "BFS"}
 {:airline "U2", :dest-iata "BHX", :long-haul false, :dest-lon "-1.748028", :dest-lat "52.453856", :distance 238, :dest "Birmingham", :dept-lat "54.6575", :dept "Belfast Intl", :dept-lon "-6.215833", :dept-iata "BFS"}
 {:airline "U2", :dest-iata "BRS", :long-haul false, :dest-lon "-2.719089", :dest-lat "51.382669", :distance 268, :dest "Bristol", :dept-lat "54.6575", :dept "Belfast Intl", :dept-lon "-6.215833", :dept-iata "BFS"}
 {:airline "U2", :dest-iata "CDG", :long-haul false, :dest-lon "2.55", :dest-lat "49.012779", :distance 539, :dest "Charles De Gaulle", :dept-lat "54.6575", :dept "Belfast Intl", :dept-lon "-6.215833", :dept-iata "BFS"}
 {:airline "U2", :dest-iata "EDI", :long-haul false, :dest-lon "-3.3725", :dest-lat "55.95", :distance 143, :dest "Edinburgh", :dept-lat "54.6575", :dept "Belfast Intl", :dept-lon "-6.215833", :dept-iata "BFS"}
 {:airline "U2", :dest-iata "FAO", :long-haul false, :dest-lon "-7.965911", :dest-lat "37.014425", :distance 1221, :dest "Faro", :dept-lat "54.6575", :dept "Belfast Intl", :dept-lon "-6.215833", :dept-iata "BFS"}
 {:airline "U2", :dest-iata "GLA", :long-haul false, :dest-lon "-4.433056", :dest-lat "55.871944", :distance 109, :dest "Glasgow", :dept-lat "54.6575", :dept "Belfast Intl", :dept-lon "-6.215833", :dept-iata "BFS"}
 {:airline "U2", :dest-iata "KRK", :long-haul false, :dest-lon "19.784836", :dest-lat "50.077731", :distance 1134, :dest "Balice", :dept-lat "54.6575", :dept "Belfast Intl", :dept-lon "-6.215833", :dept-iata "BFS"}
 {:airline "U2", :dest-iata "LGW", :long-haul false, :dest-lon "-0.190278", :dest-lat "51.148056", :distance 348, :dest "Gatwick", :dept-lat "54.6575", :dept "Belfast Intl", :dept-lon "-6.215833", :dept-iata "BFS"}
 {:airline "U2", :dest-iata "LPL", :long-haul false, :dest-lon "-2.849722", :dest-lat "53.333611", :distance 164, :dest "Liverpool", :dept-lat "54.6575", :dept "Belfast Intl", :dept-lon "-6.215833", :dept-iata "BFS"}
 {:airline "U2", :dest-iata "LTN", :long-haul false, :dest-lon "-0.368333", :dest-lat "51.874722", :distance 308, :dest "Luton", :dept-lat "54.6575", :dept "Belfast Intl", :dept-lon "-6.215833", :dept-iata "BFS"}
 {:airline "U2", :dest-iata "MAN", :long-haul false, :dest-lon "-2.27495", :dest-lat "53.353744", :distance 183, :dest "Manchester", :dept-lat "54.6575", :dept "Belfast Intl", :dept-lon "-6.215833", :dept-iata "BFS"}
 {:airline "U2", :dest-iata "MLA", :long-haul false, :dest-lon "14.4775", :dest-lat "35.857497", :distance 1630, :dest "Luqa", :dept-lat "54.6575", :dept "Belfast Intl", :dept-lon "-6.215833", :dept-iata "BFS"}
 {:airline "U2", :dest-iata "NCE", :long-haul false, :dest-lon "7.215872", :dest-lat "43.658411", :distance 969, :dest "Cote D'Azur", :dept-lat "54.6575", :dept "Belfast Intl", :dept-lon "-6.215833", :dept-iata "BFS"}
 {:airline "U2", :dest-iata "NCL", :long-haul false, :dest-lon "-1.691667", :dest-lat "55.0375", :distance 181, :dest "Newcastle", :dept-lat "54.6575", :dept "Belfast Intl", :dept-lon "-6.215833", :dept-iata "BFS"}
 {:airline "U2", :dest-iata "PMI", :long-haul false, :dest-lon "2.727778", :dest-lat "39.55361", :distance 1122, :dest "Son Sant Joan", :dept-lat "54.6575", :dept "Belfast Intl", :dept-lon "-6.215833", :dept-iata "BFS"}
 {:airline "U2", :dest-iata "STN", :long-haul false, :dest-lon "0.235", :dest-lat "51.885", :distance 328, :dest "Stansted", :dept-lat "54.6575", :dept "Belfast Intl", :dept-lon "-6.215833", :dept-iata "BFS"}
 {:airline "UA", :dest-iata "EWR", :long-haul true, :dest-lon "-74.168667", :dest-lat "40.6925", :distance 3168, :dest "Newark Liberty Intl", :dept-lat "54.6575", :dept "Belfast Intl", :dept-lon "-6.215833", :dept-iata "BFS"})

Now to turn our attention to the final percentage. We know find-routes is working (if you look at the long haul key on the last entry for BFS, it's true and the distance is 3168 miles).

So what’s the percentage of long haul flights from Belfast International?

ni-apd.core> (get-longhaul-percentage "BFS")
;; => 11.11111111111111

There we are, 11%. So 89% of flights from BFS are still liable for APD. Back to the consultation in Scotland: I wonder what Edinburgh looks like.

ni-apd.core> (get-longhaul-percentage "EDI")
;; => 5.882352941176471

It turns out to be less than Belfast, at 5.88%.

Imagine scrapping long haul APD at Heathrow?

ni-apd.core> (get-longhaul-percentage "LHR")
;; => 59.58254269449715

No, me neither. Not at 59%….


Openflights Data: http://openflights.org/data.html

Github Repository for this blog: https://github.com/jasebell/ni-airpassengerduty


Show me your data and workings and I might believe you.

The media has done a really good job of jumping on data, data science, big data, Hadoop, Spark and all the rest of the words that cluster around the core word, "data".


What I’m finding more and more is that we’re expected to accept what is presented to us as verified and right. Guess what, that might not be the case. Your TED talk, your piece in the Economist and so on is all very well but if you can’t publish your data, your model and your path to getting to the conclusion then why should I believe you?

For any article I'm looking for the evidence: a link that takes me back to the beginning of your thoughts and processes. What was your hypothesis? "We mined 55,000 variables…." Great! What are they? Where can I see them? Can I give you some feedback?

As we peddle more data stories (let's put infographics to one side, they tend to be rubbish most of the time anyway) and expect the masses to accept what's presented as truth, I'll be the outlier on the sidelines, shouting and annoyingly poking a stick at your side: "Where's your raw data and where's the model you used?" Why? Because I'm interested and want to know, but I want to verify it for myself.

“Do not accept what you cannot verify for yourself……..”


