It’s The End Of The World As We Know It….? #clojure


What does your to-do list look like today? Don’t worry, it’ll all be over soon, according to a group reported in the Guardian yesterday. Now I, for one, am not amused by this news at all, not today of all days. There’s too much cool stuff coming up over the next while, so the end of the world can just stop right there.

Saying that, we have been here before a few times.

All The Coin Flips, Dead or Alive?

I’ll keep it simple: there are only two outcomes. We’re either still breathing or we’re all done for. Now, there are the best part of 40 predictions that I’ve seen for the end of the world. So 0.5 to the power of 40…..

user> (Math/pow 0.5 40)
9.094947017729282E-13

That’s a lot of zeros. A roughly 0.00000000009% chance it is then. I think I’ll get a bottle of milk in the morning after all.

From The Department Of It Was Obvious But I Just Had To…..

Creating St Vincent Lyrics And Northern Ireland Assembly Questions With Markov Chains. #clojure @st_vincent #spark #opendata

The Story So Far

In previous posts I’ve covered the basics of loading data into Spark (with Sparkling in Clojure) and doing some half-funky stuff with it. That’s all very well, and a good starting point, but it’s a touch limiting. Ultimately it’s very easy to get some numbers out, crack some percentages and plot a 2D graph, Google Map or infographic.


What I want to do is something far more interesting than that (in my eyes), use some machine learning to create new things based on what we have.

Markov Chains

With a sufficient amount of text we can do some interesting things. The nice thing about Markov chains is that they are simple in terms of how they work.

With a corpus of text loaded we can create some fresh output text: more text, better results. A Markov chain will randomly walk a lookup table built from the corpus text, randomly selecting the next word to use. By looking at the previous words in the original corpus, the chain can weight what the next random word should be.
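As a quick sketch of the idea (in Python rather than Clojure, and not the library this post actually uses), building the lookup table and then randomly walking it looks something like this:

```python
import random

def build_chain(words):
    """Map each word to the list of words that follow it in the corpus.
    Duplicates are kept, so more frequent successors get picked more often."""
    chain = {}
    for current, following in zip(words, words[1:]):
        chain.setdefault(current, []).append(following)
    return chain

def generate(chain, start, length=8):
    """Randomly walk the lookup table from a starting word."""
    word, out = start, [start]
    for _ in range(length - 1):
        successors = chain.get(word)
        if not successors:
            break  # dead end: the final corpus word has no successor
        word = random.choice(successors)
        out.append(word)
    return " ".join(out)

corpus = "the cat sat on the mat and the cat slept".split()
chain = build_chain(corpus)
print(generate(chain, "the"))
```

Because "the" is followed by "cat" twice and "mat" once in that toy corpus, the walk picks "cat" twice as often, which is all the weighting a basic chain needs.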

Examples I’ve seen have created Paul Graham startup stories and Garfield cartoons. I could create my own St Vincent song, in fact that’s what I’ll do.

How To Create New St Vincent Songs

“Jase, I think you might like this….”, said my dear friend, sound engineer and my soundscape recordist, Dez Rae. He was right. That was in 2010/2011 before rock royalty beckoned for Annie Clark (and rightly so)… I bought what I could on the spot, it was so unique.


The great thing is the variety of songs, no two come near each other and no two albums are the same.

The Corpus of Annie Clark

In a text editor I’ve copied/pasted the lyrics from the Strange Mercy album.

I spent the summer on my back
Another attack
Stay in just to get along
Turn off the TV, wade in bed
A blue and a red
A little something to get along
Best find a surgeon
Come cut me open
Dressing, undressing for the wall
If mother calls
She knows well we don't get along

An album full of lyrics (all copyright to Annie Clark, I hasten to add), with all the blank lines taken out: that’s our corpus.
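Stripping the blank lines is a one-liner in most languages; here’s a quick Python sketch of the clean-up (the lyric text is just the excerpt above):

```python
def strip_blank_lines(text):
    """Drop the blank lines so every line of the corpus is a lyric line."""
    return "\n".join(line for line in text.splitlines() if line.strip())

lyrics = "I spent the summer on my back\n\nAnother attack\n"
print(strip_blank_lines(lyrics))
```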

Markov Chain Code In Clojure

Now I need some code to do the Markov chain. I’m not writing it this time; someone else has done the work far better than I could have in Clojure, so I’m using his.

You can look at it here:

Like I said, with a corpus of text loaded in, the program will look at the next words and create a lookup of words and scores. When I generate new sentences the next word will be governed by the lookup table and word scores. Simple.

I’m going to loop 15 times to create a song.

(defn -main [& args]
  (let [markov (transform (lazy-lines (first args)))]
    ;; doall forces the lazy for sequence, so the sentences are
    ;; realised even when run outside the REPL
    (doall
     (for [loopcount (range 15)]
       (generate-sentence markov)))))

From the REPL I can run:

markov.core> (-main "/Users/jasonbell/Documents/stvincentlyrics.txt")
("Oh little one I guess it makes my mulling days, through my lesson" "Chloe in just to get along" "Your hometown is" "I've told whole lies" "Let's not a party I owe you ever really care for me?" "But when you ever really stare at you could take us?" "Chloe in the tiger" "My own heels" "Did you say it was the piles\"" "While you" "Heal my clothes on" "But when you went off the tiger" "I've told whole lies" "Bodies, can't you can limp beside you ever really stare?" "Tried so they left more")

Which looks pretty neat….

Oh little one I guess it makes my mulling days, through my lesson
Chloe in just to get along
Your hometown is
I’ve told whole lies
Let’s not a party I owe you ever really care for me?
But when you ever really stare at you could take us?
Chloe in the tiger
My own heels
Did you say it was the piles
While you
Heal my clothes on
But when you went off the tiger
I’ve told whole lies
Bodies, can’t you can limp beside you ever really stare?
Tried so they left more

It’s still copyright to Annie Clark; they’re still her words, just a little more random. If I were going for a title, “My Mulling Days” would be a front runner.

I could have put in all the lyrics from all the albums and come up with a more refined lyric set, but as a test, and a wee tribute to one of my favourite artists, it’s a good start.

Do We Need An Executive?

So it looks like Stormont is getting a longer break than was originally planned, which means that NI open data is going to be thin on the ground for new MLA questions. So in the meantime let’s turn the building into a data centre (we could ask Arlene if INI will fund it; she’s still there, she’s managed to hold on to things….)


So I’ve got my new data centre.

With no MLAs asking questions, though, we want to generate some to give the impression that something is happening up there. All those potential FDI clients will want to see the powerhouse working…. If we did a good enough job we could let the Markov chains do the work altogether, but let’s not get ahead of ourselves just yet.

Repurposing NIAssembly Spark Code

I’m going to extract the question text from the MLA questions. I’m going to use the NI Assembly Spark code (you can read part 1 and part 2 if you want to know the inner workings) and extract just the text.

mlas.core> (def members (load-members sc members-path))
mlas.core> (def questions (load-questions sc questions-path))
mlas.core> (def mqrdd (join-members-questions members questions))

That gives me a [key, [value, value]] set of members with questions for that member. Now I need to map through each member, then map each question block and extract the question text.

mlas.core> (def qtext (spark/map (s-de/key-val-val-fn 
                                  (fn [k m qs] 
                                    (map (fn [question] (:questiontext question)) qs))) mqrdd))
#'mlas.core/qtext
mlas.core> (spark/first qtext)
("To ask the First Minister and deputy First Minister for an update on the delivery of their Programme for Government 11/15 commitments." "To ask the First Minister and deputy First Minister for an update on the delivery of their Programme for Government 11/15 commitments." "To ask the Minister of Enterprise, Trade and Investment whether any of his departmental responsibilities have been affected by the actions of any proscribed organisations since 2011.")
mlas.core>

That’s the first element of the RDD and it has three questions. There’s a lot more…. a whole lot more.

I want to save this out as a text file which requires a bit more mapping.

mlas.core> (def textarrays (spark/collect qtext))
mlas.core> (doseq [qs textarrays]
             (spit "/Users/jasonbell/Documents/mlaquestions.txt" (apply str (interpose "\n" qs)) :append true))

That now gives me a large text file of MLA questions throughout history.

Jasons-Mac-mini:Documents jasonbell$ wc mlaquestions.txt
   94056 3007106 18327959 mlaquestions.txt
Jasons-Mac-mini:Documents jasonbell$


Random MLA Question Generation

With 94,000+ questions to train my Markov chain I’m expecting some interesting results. I only want to generate one question at a time, so I can remove the loop where I was generating 15 lines for the St Vincent lyrics.

I’m going to run this from the REPL so I’m not reloading and reindexing all the text. Let’s create some MLA questions for next week.

markov.core> (def markov (transform (lazy-lines "/Users/jasonbell/Documents/mlaquestions.txt")))
markov.core> (generate-sentence markov)
"To ask the First Minister of Finance and deputy First Minister what steps are entitled to ensure greater weight is the reasons that no reports into the Housing Executive Gateway Reviews his Department has been allocated to outline the Minister for Social Services and to a CCEA test; and (vi) South Armagh city area."
markov.core> (generate-sentence markov)
"To ask the Ethnic Development what recruitment process used to detail, broken down by (i) who are assessed as possible help graduates in the Minister and Personnel for each spouse or not personally signed off a whole."
markov.core> (generate-sentence markov)
"To ask the cost, of Ulster in the Minister of order an organisation, broken down by Health and Learning for exemption."
markov.core> (generate-sentence markov)
"To ask the Minister of the last three years."
markov.core> (generate-sentence markov)
"To ask the First Minister what sentences would bring forward to July bonfires on the progression on planning application for rural area of Health, Social Services Directive; and location and what they are assisting these guidelines; and Leisure for Social Services and Rural Development what additional counselling, including those in 2008/09."
markov.core> (generate-sentence markov)
"To ask the First Minister and (ii) if so (ii) whether students with identities outside the number of the Employment and whether the Office of the Environment Minister."

To be honest that was far too much fun!

Taking It Further

If you have access to plenty of text then you can run Markov chains to produce new content with little difficulty. For a more refined method it’s worth looking at artificial neural networks, which are being used by some publishers for content creation.

All in all, to save Northern Ireland from having no news whatsoever…. well I’ve done my bit :)


So you wanna be a #datascientist? Well, apply for this then. #jobfairy

Channel Your Inner Nate Silver

So you’ve read the Smart Cities book, you’ve followed every Nate Silver post in 538….. now to put it all into practice. An opportunity to do some very serious future cities planning with the Greater London Authority and MastodonC.

Good luck!

The Full Job Posting Details

Salary: £41,209 per annum
Contract type: Full-time, fixed term
Reference: GLA2981
Interview Date: Monday 28 September 2015
Date posted: 28 August 2015
Closing date: 20 September 2015

Would you like to join an ambitious and forward looking unit of analysts, researchers and data experts, working for one of the world’s truly global cities?

The Greater London Authority (GLA) is working with the big data analytics specialist, Mastodon C to create a solution ‘Witan’ which allows subject experts and policy makers to integrate different types of hard and soft model, in order to explore scenarios for the futures of their cities.

You will play a key role in the project, working closely with GLA staff and Mastodon C’s team. You will have the opportunity to help build up a secure City Data counterpart to the GLA’s award winning open data London DataStore. As well as designing reproducible procedures to shape and clean the data, you will actively seek opportunities to link datasets together as part of creating an analytical data store.

You will also gather user stories from policy teams and analysts and devise/apply tests for Witan modules as they progress through Alpha and Beta releases.

This is a great opportunity to develop your skills and experience, but you will need to bring with you a strong technical background including practical application of data science in a work setting.

In addition to a good salary package, we offer an attractive range of benefits including 30 days annual leave, interest free season ticket loan, interest free bicycle loan, childcare voucher scheme and a career average pension scheme.

London’s diversity is its biggest asset and we strive to ensure our workforce reflects London’s diversity at all levels. Applications from Black, Asian and Minority ethnic candidates will be particularly welcomed as they are currently under-represented in this area of our organisation.

If you have a question about the role then please contact the Resourcing Team by email on quoting reference GLA2981.

Closing date for completed applications is midnight Sunday 20 September 2015.

Interviews will take place on Monday 28 September 2015.

Using Bayes Theorem for NI Startup Probabilities (#startups #clojure #statistics)

This is probably as close to serious as I’ll ever get on the subject, so hold on to your hipster pork pie hats…. The title headings are based on a fairly common path for Northern Ireland startups, other territories will have their own methods I’m sure. Regardless, I need a picture….


The Odds Are Against You.

The harsh reality is that the odds are stacked against you for succeeding. I’ll be ultra liberal with my probabilities and say 4% (I should really be saying 2% but it’s a Bank Holiday Weekend and I’m in a good mood and not my grumpy self). This number could be quantified by mining all the previous startups and seeing who lasted longer than 3 years for example. So four in every hundred isn’t a bad starting point. Let’s call this our prior probability.

What we’re trying to establish is that if an event happens during the startup journey what will that do to the existing probability. The nice thing with Bayes, as you’ll see, is that for every milestone event (or any event) we can re-run the numbers. Ready?

Wow! We Got Proof Of Concept 40k!

Current Prior Probability: 4%

Great news! The good folks at TechstartNI have shone the light on your idea and given you the clawback funds to build via a respected development house or developer. Will that have an effect on our posterior probability? It may do, but PoC is not really a confirmation of your startup, just access to build. What we can do, though, is use Bayes Theorem to recalculate the probability now we have a new event to include.

Bayes In Clojure

So Bayes works on three inputs: the prior probability (in our case the 4% figure we started with), a positive impact on the hypothesis that you’ll last longer than three years, and a negative impact on the hypothesis that you won’t.

Assuming that x is our prior, y is the positive event and z is the negative event, we can use the formula: (x * y) / ((x * y) + (z * (1 - x)))

If I were to code that up in Clojure it would look like this:

(ns very-simple-bayes.core)

(defn calculate-bayes [prior-prob newevent-positive-hypothesis newevent-negative-hypothesis]
  (double (/ (* prior-prob newevent-positive-hypothesis) 
             (+ (* prior-prob newevent-positive-hypothesis) 
                (* newevent-negative-hypothesis (- 1 prior-prob))))))

(defn calculate-bayes-as-percent [prior-prob newevent-positive-hypothesis newevent-negative-hypothesis]
  (* 100 (calculate-bayes prior-prob newevent-positive-hypothesis newevent-negative-hypothesis)))

The first function does the actual Bayes calculation and the second merely converts that into a percentage for me.

Right, back to our TechstartNI PoC. Let’s see how that affects our chances of survival.

Just because PoC funds give you some ground to build a product, it has little impact on the survival of the company as a whole. Being liberal again, let’s say the positive impact on the hypothesis is 90% and the negative 10%. It can only be a good thing to have a product to sell.

very-simple-bayes.core> (calculate-bayes-as-percent 0.04 0.9 0.1)

While PoC has a huge effect on you getting product out of the door (do I dare utter the letters M, V and P at this point?) it has little effect on your long-term survival. So your 4% chance of three-year survival has gone to 27.2% (0.036 / 0.132 ≈ 0.272). A positive start, but all you have is a product.

Put the Champagne on ice just don’t open it…..

Propelling Forward

Current Prior Probability: 27.2%

The next logical step is to look at something like the Propel Programme to get you into the sales and marketing mindset, but also to make you “investor ready”, which is what I see the real aim of Propel to be. So with the new event we can recalculate our survival probability. The 20k doesn’t make a huge dint in your survival score, but it helps you get through, so I will take that into account.

As I’ve not experienced Propel first hand it’s unfair of me to say how things will pan out; you’ll have to ask someone who’s done it. It doesn’t, though, stop me plucking some numbers out of the air to test against, and you should really do the same.

Propel will have a positive impact on your startup, no doubt, there’s a lot to learn and you’ll be in the same room as others going through the same process. The “up to” £20k is good to know but there’s no 100% certainty, apart from death and taxes, that you’ll get the full amount.

Propel’s positive probability on hypothesis: 40%

Propel’s false positive probability on the hypothesis: 80%

Running the numbers through Bayes again, let’s see what the new hypothesis probability is looking like.

very-simple-bayes.core> (calculate-bayes-as-percent 0.272 0.4 0.8)

That brought us back down to earth a bit: a 15.74% chance of a positive hypothesis. No reflection on Propel at all, that’s just how the numbers came out. Now I could be all biased and say that if you do Propel you’re gonna be a unicorn-hipster-star, but the reality is far from that.

The false positive is interesting, doing these things can sometimes fool the founder into thinking they’re doing far better than they think they are. If 100 startups went through Propel and 40 are still trading today then our positive event probability is right. And that’s the nice thing about applying Bayes in this way, we can make some fairly reasoned assumptions that we can use to calculate our scores with.

I’m Doing Springboard too Jase!

Okay! And the nice thing is these things can happen in parallel, but let’s treat it as sequential to preserve sanity in the probability.

Current Prior Probability: 15.74%

I think the same event +/- probabilities would apply here. Springboard is good for mentorship and contacts. My hand on heart gut thinks the numbers are going to be the same as Propel’s for what we are looking at here.

Springboard positive hypothesis: 40%

Springboard as a false positive on hypothesis: 80%

Let’s run the numbers again (now you can see why I wrote a Clojure program first).

very-simple-bayes.core> (calculate-bayes-as-percent 0.1574 0.4 0.8)

Interestingly, according to the numbers, doing the two programmes has a negative impact on your startup if you have no revenue and no customers. Well, the numbers say so.

Getting That First Seed.

Current Prior Probability: 8.54%

Okay, so you’ve your proof of concept in hand, grafted through Propel and then gone through Springboard until you get that nice picture on the website. “Investor ready” is an odd term: markets can change, fashions come and go and investors go looking for different things as time goes by. So all the while you’ve been slaving, investors could be looking for something else.

So the opportunity arrives to pitch to one of the NI-based VCs/angels for some “proper” money. Once again, a fairly normal route to go down. It could be a mixture of different funding places (Crescent, Kernel or Techstart). If accepted, the goalposts change as there are now people on the board and a results focus (i.e. are you hitting target month on month?).

The average figure for a seed round in NI is between £150k and £300k, but I’ll head for the upper figure. Regardless, money in the bank (even though it’s not yours) is a good thing if you are prepared to give up some equity. Saying that, investments can go bad, so we need to yin and yang this out a bit. So I’ve put in a 20% chance of the investment being a false positive. Once again, if you had the term sheets of 20 companies you’d be able to do some maths yourself and get a better idea.

Seed round has a positive outcome on hypothesis: 70%

Seed round is a false positive on hypothesis: 20%

Investment is good PR, and the hype cycle loves a good startup investment story. It opens up the doors to talking guff in far more places than you did before. How does that affect our probability though?

very-simple-bayes.core> (calculate-bayes-as-percent 0.0854 0.7 0.2)

Positive indeed. From 4.0% to 24.6% is good: from a 1 in 25 to a 1 in 4 chance of lasting three years, though the majority of the gain hinged on the investment coming in at the latter stages. There’s a chance by this point you’d be two-thirds of the way into the three-year plan.

Blessed Are The 2%

At the start I used 4% as a very optimistic probability of a startup lasting more than three years. I wonder what would happen if I went for a realistic start point of 2%?

Proof Of Concept Stage

very-simple-bayes.core> (calculate-bayes-as-percent 0.02 0.9 0.1)

Propel Stage

very-simple-bayes.core> (calculate-bayes-as-percent 0.1551 0.4 0.8)

Springboard Stage

very-simple-bayes.core> (calculate-bayes-as-percent 0.0840 0.4 0.8)

Investor Stage

very-simple-bayes.core> (calculate-bayes-as-percent 0.0438 0.7 0.2)

So a 1 in 7 chance of you lasting 3 years….. you can also finally put your “passion” line to bed as well.
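For anyone following along without a Clojure REPL, the whole 2% run can be replayed in a few lines. This is a Python sketch of the same formula; note it rounds to four decimal places at each stage rather than truncating, so the last digit can differ slightly from the numbers above:

```python
def bayes(prior, pos, neg):
    """P(survival | event) = (prior * pos) / (prior * pos + neg * (1 - prior))"""
    return (prior * pos) / (prior * pos + neg * (1 - prior))

# Chain the four events from the 2% starting point, carrying the
# rounded posterior forward as the next stage's prior.
stages = [(0.9, 0.1),   # proof of concept
          (0.4, 0.8),   # Propel
          (0.4, 0.8),   # Springboard
          (0.7, 0.2)]   # seed round
p = 0.02
for pos, neg in stages:
    p = round(bayes(p, pos, neg), 4)
    print(f"{p * 100:.2f}%")
```

The final figure lands around 13.85%, which is where the 1 in 7 chance comes from.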


You can see why some folk decide to hold startup events. The risk is far lower and the chances of sponsorship are increased and repeatable. Saying that, we know that those who want to work 17 hours a day by the seat of their knickers will continue to do so. Depending on your point of view it is indeed easier to sell the shovels instead.

And As For That Champagne….

Marilyn drank it.











Focus on the MVDP – The Minimum Viable Data Product #startups #lean #data

Since the release of “The Lean Startup” there are three letters that have been preached over and over… MVP: Minimum Viable Product. The much-copied mantra of “what do you need to do to get this thing out of the door?”.

And in part it’s valid; there’s method in the madness. In part I also believe that it’s flawed, as customers buy into a complete product, not iterations used as little experiments to see what happens when you tweak the model. Life in startup land is never as clean cut as some people believe the lean model to be. My issue with MVP is that the focus is on the product, but I don’t see that as the most valuable asset to the business. For most startups the app, the website or the API is just a delivery method for what really matters: the data.

So instead of the Minimum Viable Product, why do I never hear about the Minimum Viable Data Product? Let me give you a simple example.

Disrupting Bacon! The Sequel!

You can read the original bacon disruption post here. Sandwich ordering for me is simple: I go into a shop, order what I want, wait a bit for it to be made, pay, and leave. If you really want to appify the process then fair play to you.


Some people will look at that concept and think, “Simple Jase!, have the menus on the app, select what you want and pay within the app and pick it up. Job done and I’ll take my 8% fees from the sandwich shop owner.”

I proved in the original post that this was going to be very hard work: growth of willing retailers has to be at least 75% month on month, and then there’s the small matter of getting users to download the app in the first place. Remember, Just Eat didn’t come wading into the UK with £10m in their back pocket for no reason; UK-wide marketing costs a lot of money.


Shifting The Focus To The Minimum Viable Data Product

Think about it: the data collected by the likes of Nectar and Clubcard is not really there for the customer’s benefit; that’s merely the pay-off and by-product of being given enough permission to mine every basket and every item that passes through the point of sale. The real value in that data comes once it’s sliced and diced and you have customer profiles, regional segments of buying behaviour and detailed time blocks of seasonal purchases of everyday items. At that point the main sell is to the suppliers of the products.

Designing The MVDP

So think about it for sandwich and bacon disruption alike: if the first focus were on the minimum viable data product, how would it look? There’s usually some form of sign-up process with a minimum level of information.

  • Firstname
  • Lastname
  • Email address

If the user, assuming it’s an app, is giving you permission to track their location to find the nearest vendors, then location is a wonder metric. With permission there’s no reason you can’t store the app ID along with the location data. Does that customer move around a lot or are they stuck within a certain radius every day, every hour? Could I push information on the most popular eateries wherever they are if they move around a lot?

  • Location
  • Radius movement

The frequency and value of the customer are highly important. Are they using your service every day (signups is NOT a good metric, active users IS)? What’s the value of their order? Can you predict long-term customer value based on the purchase data? Can you split it down per supplier? No? Well, find someone who can…

  • Date/time of purchase.
  • Purchase value ($/£)

What did the customer purchase? Even Subway miss this by a long shot, and so do Starbucks (and they know what I’m buying). So here’s your chance for glory. Does the customer order the same thing every day, creatures of habit, or is the customer varied in their purchases? Do they use the same supplier day in, day out, or do they move around the different ones in the area?

  • Retailer ID/name
  • Item list of ordered items.

There’s a possibility that an app will already be picking up all this data but has no idea how to process it, and that’s fine. Just store it, and when you get to the point of thinking you have enough data, find someone who can help you understand it.

Defining The Questions

There are going to be two types of questions: ones that you’ll ask yourself:

  • “Which suppliers are generating us the most revenue?”
  • “Who are our top purchasing customers?”
  • “How many orders a day is our app fulfilling?”

Those aren’t unreasonable questions, and they’re ones that I’d expect the board and management team to be asking every time you have a meeting.

The retailers, your suppliers, on the other hand will have their own questions.

  • “How many customers purchased via the app?”
  • “What’s the sales volume?”
  • “How much in fees did I incur to fulfil those sales?”
  • “What were the top performing items in inventory month on month?”

Even with a simple list (trust me, there’s loads more) there are some wins in there for the retailer, especially in pushing distressed stock or capacity planning for more perishable goods. No point holding too much cottage cheese when no one is buying it.

From The Department Of “How Are We Going To Make Money?”

From the top: data has value. Not everything in life has to be open data. I made this very clear to a website owner many years ago. “The value is in my website,” he said, to which my reply was, “No it’s not, the value is in the data your users are entering into your website. Don’t give it away.”

So as a new business (I’m going off the word Startup by this point) the key questions I’d be asking are:

  1. What is our minimum viable data model; what do we need to collect to make better decisions later?
  2. Can we mine, slice and dice this data and sell it on to our suppliers?
  3. What is the value of the data we will have? Do we sell it on as an extra or as part of the product package?
  4. What is the collection method to acquire the data? Is it a website, an app, a smart TV application or something else? Will the data model change if the collection method changes?

And Finally…..

Customer buy-in is crucial and it happens right at the start, the moment they sign up. From that point on everything can be tracked to aid your commercial goals. What’s important, though, is not to chop and change the data model around once you get going; you need to make time to focus on the MVDP before you go into launch mode. If you change course halfway through and start asking for gender information, users might get a bit irked and wonder if there’s really a customer segment where ordering sandwiches depends on whether someone is transgender, female or male.

Just be careful what you ask for and how you ask for it.

The #NIStartupCommandments….. straight from the pinch of salt department.

Just thoughts, take them for what you will…. I know there were only ten commandments, but tablets have got more storage capacity these days, so here’s a baker’s dozen.

1. Have a software developer willing to help/work for sweat equity. Without them you’re dead.

2. Figure out revenue, not funding opportunities, from day one. Target market segments etc.

3. The app economy is dead so get those ideas out of your head. Tell me which NI apps have “made it” with £1m+ rev?

4. Use PoC funds wisely; dev houses are expensive, so make sure you are in control and own the code. If not, go elsewhere.

5. Treat startup events with caution. What will you learn and who will you meet? Doing the rounds wastes time.

6. The community is small, so don’t mess about, be nice to everyone you meet. Karma etc.

7. SBRI competitions are good blocks of revenue while you get up and running. Good for the network too.

8. Beware the Church of Lean.

9. Coffee shops are expensive, buy a kettle. Keep expenses low low low.

10. Download the @Mattermark app for iOS and look at your competitors, their growth score and what they raised. Wake up call.

11. Look at all the destination airports from Belfast. Those are your first tier customers.

12. Don’t apply for an incubator or accelerator without a tech dev; you can’t sell something that doesn’t exist.

13. If you’re going to be an analytics company, the average minimum seed is about $5m.

NI Open Data – Mining Prescription Data Part 2. – #opendata #spark #clojure


The Story So Far….

You can read part 1 here.

A few weeks ago I started on finding out which were the most popular items that a GP practice would prescribe. Once again I turned to Sparkling and Clojure to do the grunt work for me.

The Practice List

What I didn’t have at the time was the practice list. You can download that from the HSC Business Services Organisation.

I’m going to create another pair RDD with the practice number as the key and the information as its value.

(def practice-fields [:pracno :partnershipno :practicename :address1 :address2 :address3 :postcode :telno :lcg])

(defn load-practices [sc filepath] 
  (->> (spark/text-file sc filepath)
       (spark/map #(->> (csv/read-csv %) first))
       (spark/filter #(not= (first %) "PracNo"))
       (spark/map #(zipmap practice-fields %))
       (spark/map-to-pair (fn [rec] 
                            (let [practice-id (:pracno rec)]
                              (spark/tuple practice-id rec))))))

It’s very similar to the original function I used to load the prescription CSV.

Joining The RDDs

With two pair RDDs I can safely perform the join.

(defn join-practice-prescriptions [practices prescriptions]
  (spark/join practices prescriptions))

This will give me a Pair RDD with a [key, [value1, value2]] form. I’ll need to tweak the function that works out the frequencies so it takes into account this new data structure.

(defn def-practice-prescription-freq [prescriptiondata]
  (->> prescriptiondata
       (spark/map-to-pair
        (s-de/key-val-val-fn
         (fn [k v pr]
           (let [freqmap (map (fn [rec] (:vmp_nm rec)) pr)]
             (spark/tuple (str (:practicename v) " " (:postcode v))
                          (apply list (take 10 (reverse (sort-by val (frequencies freqmap))))))))))))

The last thing to do is to wrap up the process-data function so it loads in the practice list as well.

(defn process-data [sc filepath practicefile outputpath]
  (let [prescription-rdd (load-prescription-data sc filepath)
        practice-rdd (load-practices sc practicefile)]
    (->> (join-practice-prescriptions practice-rdd prescription-rdd)
         (def-practice-prescription-freq)
         (spark/coalesce 1)
         (spark/save-as-text-file outputpath))))

Testing From The REPL

Testing these changes is fairly trivial from the REPL.

; CIDER 0.8.1 (package: 20141120.1746) (Java 1.7.0_40, Clojure 1.6.0, nREPL 0.2.10)
nipresciptions.core> (def c (-> (conf/spark-conf)
             (conf/master "local[3]")
             (conf/app-name "niprescriptions-sparkjob")))
  (def sc (spark/spark-context c))
nipresciptions.core> (def practice-file "/Users/Jason/work/data/Northern_Ireland_Practice_List_0107151.csv")
nipresciptions.core> (def prescription-path "/Users/Jason/work/data/niprescriptions")
nipresciptions.core> (process-data sc prescription-path practice-file "/Users/Jason/testoutput")

The Spark job will take a bit of time as there are 347 practices and a lot of prescription data…..


Detail Data Prescription CSV’s:

Github Repo for this project:

Practice List:




NI Open Data – Mining Prescription Data – #opendata #spark #clojure

Moving On From The NI Assembly

There was plenty of scope left from the NI Assembly blog posts I did last time (you can read part 1 and part 2 for the background). While I received a lot of messages with “why don’t you do this” and “can you find xxxxxx subject out”, it’s not something I wish to do. Kicking hornets’ nests isn’t really part of my job description.

Saying that, when there’s open data for the taking then it’s worth looking at. Recently the Detail Data project opened up a number of datasets to be used. Buried within are the prescriptions that the GPs, or nurses within each practice, have prescribed.

Acquiring the Data

The details of the prescription data are here: (though the data would suggest it’s not really statistics, just raw CSV data). The files are large, but nothing I’m worrying about in the “BIG DATA” scheme of things; this is small in relative terms. I’ve downloaded October 2014 to March 2015, a good six months’ worth of data.

Creating a Small Sample Test File

When developing these kinds of jobs it’s worth having a look at the data itself before jumping into any code. See how many lines of data there are; this time, as it’s a CSV file, I know it’ll be one record per line.

Jason-Bells-MacBook-Pro:niprescriptions Jason$ wc -l 2014-10.csv 
 459188 2014-10.csv

Just under half a million lines for one month; that’s okay, but too much for testing. I want to knock it down to a small sample for testing. The UNIX head command will sort us out nicely.

head -n20 2014-10.csv > sample.csv

So for the time being I’ll be using my sample.csv file for development.

Loading Data In To Spark

First thing I need to do is define the header row of the CSV as a set of map keys. When Spark loads the data in, I’ll use zipmap to pair the values to the keys for each row of the data.

(def fields [:practice :year :month :vtm_nm :vmp_nm :amp_nm :presentation :strength :total-items :total-quantity :gross-cost :actual-cost :bnfcode :bnfchapter :bnfsection :bnfparagraph :bnfsub-paragraph :noname1 :noname2])

You might have noticed the final two keys, noname1 and noname2. The reason for this is simple: there are trailing commas on the header row but no names, so I’ve forced them to have a name to keep the importing simple.

PRACTICE,Year,Month,VTM_NM,VMP_NM,AMP_NM,Presentation,Strength,Total Items,Total Quantity,Gross Cost (£),Actual Cost (£),BNF Code,BNF Chapter,BNF Section,BNF Paragraph,BNF Sub-Paragraph,,

With that I can now create the function that loads in the data.

(defn load-prescription-data [sc filepath]
  (->> (spark/text-file sc filepath)
       (spark/map #(->> (csv/read-csv %) first))
       (spark/filter #(not= (first %) "PRACTICE"))
       (spark/map #(zipmap fields %))
       (spark/map-to-pair (fn [rec]
                            (let [practicekey (:practice rec)]
                              (spark/tuple practicekey rec))))
       (spark/group-by-key)))

Whereas the NI Assembly data was in JSON format so I had the keys already defined, this time I need to use the zipmap function to mix the header keys and the row values together. This gives us a handy map to reference instead of relying on the element number of the CSV line. As you can see, I’m grouping all the prescriptions by their GP key.
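To see what zipmap actually gives us, here’s a toy row with just a few of the fields (the values are made up for illustration):

```clojure
;; zipmap pairs keys with values positionally, giving one map per CSV row
(zipmap [:practice :year :month] ["1" "2014" "10"])
;; => {:practice "1", :year "2014", :month "10"}
```

With the map in hand we can pull values out by keyword rather than by column index.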


Counting The Prescription Frequencies

This function is very similar to the frequency function I used in the NI Assembly project, by mapping each prescription record and retaining the item prescribed I can then use the frequencies function to get counts for each distinct type.

(defn def-practice-prescription-freq [prescriptiondata]
  (->> prescriptiondata
       (spark/map-to-pair
        (s-de/key-value-fn
         (fn [k v]
           (let [freqmap (map (fn [rec] (:vmp_nm rec)) v)]
             (spark/tuple k (frequencies freqmap))))))))

Getting The Top 10 Prescribed Items For Each GP

Suppose I want to find out the top ten prescribed items for each GP location. As the function I’ve got already computes the frequencies, with a little tweaking it can return what I need. First I use sort-by on val, which sorts smallest to largest; the reverse function then flips it on its head to give largest to smallest. As I only want ten items, I then use the take function to return the first ten items in the sequence.
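The same take/reverse/sort-by pipeline works on any frequency map, so a toy example (data made up) shows the mechanics before we bolt it into Spark:

```clojure
;; frequencies -> sort-by val (ascending) -> reverse -> take the top n
(->> ["a" "b" "a" "c" "a" "b"]
     frequencies
     (sort-by val)
     reverse
     (take 2))
;; => (["a" 3] ["b" 2])
```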

(defn def-practice-prescription-freq [prescriptiondata]
  (->> prescriptiondata
       (spark/map-to-pair
        (s-de/key-value-fn
         (fn [k v]
           (let [freqmap (map (fn [rec] (:vmp_nm rec)) v)]
             (spark/tuple k
                          (take 10 (reverse (sort-by val (frequencies freqmap)))))))))))

Creating The Output File Process

So with two simple functions we have the workings of a complete Spark job. I’m going to create a function to do all the hard work for us and save us repeating lines in the REPL. This function will take in the Spark context, the file path of the raw data files (or file if I want) and an output directory path where the results will be written.

(defn process-data [sc filepath outputpath]
  (let [prescription-rdd (load-prescription-data sc filepath)]
    (->> prescription-rdd
         (def-practice-prescription-freq)
         (spark/coalesce 1)
         (spark/save-as-text-file outputpath))))

What’s going on here then? First of all we load the raw data into a Spark pair RDD, then using the thread-last macro we calculate the item frequencies and merge all the partitions down to a single partition with the coalesce function, so we get one output file. Finally we write everything to the output path. First of all I’ll test it from the REPL with the sample data I created earlier.

nipresciptions.core> (process-data sc "/Users/Jason/work/data/niprescriptions/sample.csv" "/Users/Jason/Desktop/testoutput/")

Looking at the file part-00000 in the output directory you can see the output.

(1,(["Blood glucose biosensor testing strips" 4] ["Ostomy skin protectives" 4] ["Macrogol compound oral powder sachets NPF sugar free" 3] ["Clotrimazole 1% cream" 2] ["Generic Dermol 200 shower emollient" 2] ["Chlorhexidine gluconate 0.2% mouthwash" 2] ["Clarithromycin 500mg modified-release tablets" 2] ["Betamethasone valerate 0.1% cream" 2] ["Alendronic acid 70mg tablets" 2] ["Two piece ostomy systems" 2]))

So we know it’s working okay…. now for the big test, let’s do it against all the data.

Running Against All The Data

First things first, don’t forget to remove the sample.csv file if it’s in your data directory or it will get processed along with the other raw files.

$ rm sample.csv

Back to the REPL and this time my input path will just be the data directory and not a single file, this time all files will be processed (Oct 14 -> Mar 15).

nipresciptions.core> (process-data sc "/Users/Jason/work/data/niprescriptions/" "/Users/Jason/Desktop/output/")

This will take a lot longer as there’s much more data to process. When it does finish have a look at the part-00000 file again.

(610,(["Gluten free bread" 91] ["Blood glucose biosensor testing strips" 62] ["Isopropyl myristate 15% / Liquid paraffin 15% gel" 29] ["Lymphoedema garments" 27] ["Macrogol compound oral powder sachets NPF sugar free" 25] ["Ostomy skin protectives" 24] ["Gluten free mix" 21] ["Ethinylestradiol 30microgram / Levonorgestrel 150microgram tablets" 20] ["Gluten free pasta" 20] ["Carbomer '980' 0.2% eye drops" 19]))
(625,(["Blood glucose biosensor testing strips" 62] ["Gluten free bread" 38] ["Gluten free pasta" 27] ["Ispaghula husk 3.5g effervescent granules sachets gluten free sugar free" 24] ["Macrogol compound oral powder sachets NPF sugar free" 20] ["Isopropyl myristate 15% / Liquid paraffin 15% gel" 20] ["Isosorbide mononitrate 25mg modified-release capsules" 18] ["Alginate raft-forming oral suspension sugar free" 18] ["Isosorbide mononitrate 50mg modified-release capsules" 18] ["Oxycodone 40mg modified-release tablets" 16]))
(661,(["Blood glucose biosensor testing strips" 55] ["Gluten free bread" 55] ["Macrogol compound oral powder sachets NPF sugar free" 24] ["Salbutamol 100micrograms/dose inhaler CFC free" 20] ["Colecalciferol 400unit / Calcium carbonate 1.5g chewable tablets" 19] ["Venlafaxine 75mg modified-release capsules" 18] ["Isosorbide mononitrate 25mg modified-release capsules" 18] ["Isosorbide mononitrate 60mg modified-release tablets" 18] ["Alginate raft-forming oral suspension sugar free" 18] ["Venlafaxine 150mg modified-release capsules" 18]))
(17,(["Blood glucose biosensor testing strips" 55] ["Gluten free bread" 35] ["Macrogol compound oral powder sachets NPF sugar free" 29] ["Colecalciferol 400unit / Calcium carbonate 1.5g chewable tablets" 24] ["Gluten free biscuits" 22] ["Ispaghula husk 3.5g effervescent granules sachets gluten free sugar free" 21] ["Diclofenac 1.16% gel" 19] ["Sterile leg bags" 19] ["Glyceryl trinitrate 400micrograms/dose pump sublingual spray" 19] ["Ostomy skin protectives" 18]))

There we are: GP practice 610 had gluten free bread prescribed 91 times over the six month period. The blood glucose testing strips are also high on the agenda, but that would come as no surprise to anyone who is diabetic.

So Which GP’s Are Prescribing What?

The first number in the raw data is the GP id. In the DetailData notes for the prescription data I read:

“Practices are identified by a Practice Number. These can be cross-referenced with the GP Practices lists.”

As with the NI Assembly data I can load in the GP listing data and join the two by their key. Sadly on this occasion though I can’t, the data just isn’t there on the page. I’m not sure if it’s broken or removed on purpose. Shame but I’m not going to create a scene.


DetailData Prescription Data

Github Repo for this project

*** Note: I spelt “prescriptions” wrong in the Clojure project but as this is a throwaway kinda thing I won’t be altering it…. ***

NIAssembly Open Data – Part 2 – Sankey Diagrams #opendata #clojure #spark #sankey

In the first part of this walk through I showed you how to use the excellent NI Assembly open data platform to find out the frequency of departments members were asking questions to.

A picture speaks a thousand words so they say, so it makes sense to attempt to visualise a diagram of the data we’ve worked on.

What’s A Sankey Diagram?

A Sankey diagram is basically a collection of node labels with connections; these connections are weighted by value, so the higher the value, the thicker the connection.



The data is based on a CSV file with a source node, target node and a value. Simple as that.
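For example, a minimal sankey.csv (these rows are made up for illustration) would look like:

```
source,target,value
Beggs Roy,Department of Education,131
Beggs Roy,Department of Justice,38
```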

Reusing the Spark Work

In the previous post we left Spark and the data in quite a nice position. A Pair RDD with the member id as a key and a vector of department names with the question frequencies.

The first job is to transform this into a CSV file. As we left it we had a Pair RDD of [k, v] with the v being a map of department names, and the value of each map entry the frequency. So in reality we’ve got [k, {k v, k v, …, k v}].

For example, let’s look at the first element in our RDD after doing a spark/collect on the Pair RDD.

mlas.core> (first dfvals)
["8" {"Department of Culture, Arts and Leisure" 32, "Department of the Environment" 96, "Department for Social Development" 76, "Department of Agriculture and Rural Development" 53, "Department for Employment and Learning" 40, "Department for Regional Development" 128, "Northern Ireland Assembly Commission" 18, "Department of Education" 131, "Department of Health, Social Services and Public Safety" 212, "Department of Justice " 38, "Department of Finance and Personnel" 105, "Office of the First Minister and deputy First Minister" 151, "Department of Enterprise, Trade and Investment" 66}]

The first element is the member id and second element is the department/frequency map. Remember this is for one MLA, there are still 103 in the RDD altogether.

mlas.core> (def x (first dfvals))
mlas.core> (second x)
{"Department of Culture, Arts and Leisure" 32, "Department of the Environment" 96, "Department for Social Development" 76, "Department of Agriculture and Rural Development" 53, "Department for Employment and Learning" 40, "Department for Regional Development" 128, "Northern Ireland Assembly Commission" 18, "Department of Education" 131, "Department of Health, Social Services and Public Safety" 212, "Department of Justice " 38, "Department of Finance and Personnel" 105, "Office of the First Minister and deputy First Minister" 151, "Department of Enterprise, Trade and Investment" 66}

We can use Clojure’s map function to process each key/pair of the map data.

mlas.core> (map (fn [[k v]] (println k " > " v)) (second x))
Department of Culture, Arts and Leisure > 32
Department of the Environment > 96
Department for Social Development > 76
Department of Agriculture and Rural Development > 53
Department for Employment and Learning > 40
Department for Regional Development > 128
Northern Ireland Assembly Commission > 18
Department of Education > 131
Department of Health, Social Services and Public Safety > 212
Department of Justice > 38
Department of Finance and Personnel > 105
Office of the First Minister and deputy First Minister > 151
Department of Enterprise, Trade and Investment > 66
(nil nil nil nil nil nil nil nil nil nil nil nil nil)

A couple of things to note. First, notice the use of [k v] destructuring in the function passed to map. Secondly, as I’m using the println function, the result of the map function is going to be a sequence of nils; that last line is the return value of the Clojure map function.
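If you want the transformed pairs back rather than a list of nils, return a value instead of printing; a toy version of the same traversal:

```clojure
;; mapping over a map yields [k v] entries; destructure and return a value
(map (fn [[k v]] (str k " > " v))
     {"Department of Education" 131 "Department of Justice" 38})
;; => ("Department of Education > 131" "Department of Justice > 38")
```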

So we’ve already got two thirds of the CSV output done: the target node and the value. I need to redo the Spark function so that instead of the member id being the key, I use the name of the MLA in question.

(defn mlaname-department-frequencies-rdd [members-questions-rdd]
  (->> members-questions-rdd
       (spark/map-to-pair
        (s-de/key-val-val-fn
         (fn [key member questions]
           (let [freqmap (map (fn [question] (:departmentname question)) questions)]
             (spark/tuple (:membername member) (frequencies freqmap))))))))

When I run this function with the existing member/question Pair RDD I get a new Pair RDD with the following:

mlas.core> (def mmdep-freq (mlaname-department-frequencies-rdd mq-rdd))
mlas.core> (spark/first mmdep-freq)
#sparkling/tuple ["Beggs, Roy" {"Department of Culture, Arts and Leisure" 32, "Department of the Environment" 96, "Department for Social Development" 76, "Department of Agriculture and Rural Development" 53, "Department for Employment and Learning" 40, "Department for Regional Development" 128, "Northern Ireland Assembly Commission" 18, "Department of Education" 131, "Department of Health, Social Services and Public Safety" 212, "Department of Justice " 38, "Department of Finance and Personnel" 105, "Office of the First Minister and deputy First Minister" 151, "Department of Enterprise, Trade and Investment" 66}]

With that RDD we have all the elements for the required CSV file: a source node (the MLA’s name), a target node (the department) and a value (the frequency). Notice that I’m also removing the commas from the MLA name and the department name, otherwise I’ll break the Sankey diagram when it’s rendered on screen.

(defn generate-csv-output [mddep-freq]
  (->> mddep-freq
       (spark/map (s-de/key-value-fn
                   (fn [k v]
                     (let [mlaname k]
                       (map (fn [[department frequency]]
                              [(str/replace mlaname #"," "")
                               (str/replace department #"," "")
                               frequency]) v)))))))

And then a function to write the actual CSV file.

(defn write-csv-file [filepath data]
  (with-open [out-file (io/writer (str filepath "sankey.csv"))]
    (csv/write-csv out-file data)))

To test I’m just going to write out the first MLA in the vector.

mlas.core> (def csv-to-output (generate-csv-output mmdep-freq))
mlas.core> (first csv-to-output)
(["Beggs, Roy" "Department of Culture, Arts and Leisure" 32] ["Beggs, Roy" "Department of the Environment" 96] ["Beggs, Roy" "Department for Social Development" 76] ["Beggs, Roy" "Department of Agriculture and Rural Development" 53] ["Beggs, Roy" "Department for Employment and Learning" 40] ["Beggs, Roy" "Department for Regional Development" 128] ["Beggs, Roy" "Northern Ireland Assembly Commission" 18] ["Beggs, Roy" "Department of Education" 131] ["Beggs, Roy" "Department of Health, Social Services and Public Safety" 212] ["Beggs, Roy" "Department of Justice " 38] ["Beggs, Roy" "Department of Finance and Personnel" 105] ["Beggs, Roy" "Office of the First Minister and deputy First Minister" 151] ["Beggs, Roy" "Department of Enterprise, Trade and Investment" 66])
mlas.core> (write-csv-file "/Users/Jason/Desktop/" (first csv-to-output))

So far so good, checking on my desktop and there’s a CSV file ready for me to use. I just need to add the header (source,target,value) to the top line. In all honesty I should really insert that header row at the start of the vector.
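A minimal sketch of that header insertion, assuming the rows are already [source target value] triples (with-header is my name for it, not part of the project):

```clojure
;; prepend a header row before handing the rows to the CSV writer
(defn with-header [rows]
  (cons ["source" "target" "value"] rows))

(with-header [["Beggs Roy" "Department of Education" 131]])
;; => (["source" "target" "value"] ["Beggs Roy" "Department of Education" 131])
```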

Creating The Sankey Diagram

Where possible it’s best to learn from example and in all honesty I’m not a visualisation kinda guy. So when the going gets tough, the tough Google D3 examples.

So there’s a handy Sankey-diagrams-with-CSV-files example that I can use. A small amount of copy/paste to create the index.html and sankey.js files, and all I have to do is copy in the sankey.csv that Spark just output for us. I’ve extended the length of the canvas that the Sankey diagram is painted on.

Appending a couple of the CSV output files to sankey.csv gives us a starting point. If I reload the page (Dropbox doubles as a very handy web server for static pages if you put HTML files in the Public directory) you end up with something like the following.



Okay, it’s not perfect but it’s certainly a starting point. Just imagine how it would look with all the MLAs….. maybe later.


Once again I’ve rattled through some Spark and Clojure but we’re essentially reusing what we have. The D3 outputs take some experimentation and time to get right. Keep in mind if you have a lot of nodes (notice how I’m only dealing with two MLA’s at the moment) the rendering can take some time.


Sankey with D3 and CSV files:

Github Repo of this project:



NIAssembly Open Data – #opendata #ni #spark #clojure

Data From Stormont

The Northern Ireland Assembly has opened up a fair chunk of data as a web service, returning results in either XML or JSON format. And from first plays with it, it’s rather well put together.


What I’ve also learned is that the team listen: a small suggestion was implemented almost as soon as they returned to work on the Monday morning. It only goes to show that the team want this to succeed.

The web service is split up in to various areas:

  • Members – current and historical MLA’s
  • Questions – written and verbal questions.
  • Organisations
  • Plenary
  • Hansard – contributions by members during plenary debate

The data is released under an Open Northern Ireland Assembly Licence, so you’re free to use it as long as you provide a link back to it.

The Project

I’m going to setup a Clojure/Spark project and start processing some of this data. I want to do the following items:

  1. Load the current members data.
  2. Save the questions for each member.
  3. Load the saved questions for each member.
  4. Join the data together by the member id.
  5. Find the frequency of departments specific member questions are directed at.

Setting Up Spark

Before I can do any of that I need to set up the required Spark context so I can handle the data how I want.

(comment
  (def c (-> (conf/spark-conf)
             (conf/master "local[3]")
             (conf/app-name "niassemblydata-sparkjob")))
  (def sc (spark/spark-context c)))

The reason I put this in a comment block is that it’s there for copy/pasting in the REPL when I need it.

Loading the Members

The Members API has details of the members past and present. Right now I’m only concerned about the current ones. So the end point I want to use is:

This returns the current members in JSON format with name, display name, organisation and so on. So that I’m not hammering the API with a request each time, I’ve downloaded the JSON contents and saved them to a file.

$ curl > members.json

Next I want to load this file in to Spark and create a pair RDD with the member id as the key and the JSON data as a map for that member as the value.

(defn load-members [sc filepath]
  (->> (spark/whole-text-files sc filepath)
       (spark/flat-map (s-de/key-value-fn
                        (fn [key value]
                          (-> (json/read-str value
                                             :key-fn (fn [k] (-> k str/lower-case keyword)))
                              (get-in [:allmemberslist :member])))))
       (spark/map-to-pair (fn [rec] (spark/tuple (:personid rec) rec)))))

As you can’t guarantee that a JSON file is going to be neatly one object per line (and I know this one isn’t), I’ll use Spark’s wholeTextFiles method to load one whole file per element. This returns a pair RDD of [filename, contents-of-file]; we then iterate through each element, using Clojure’s JSON library to read in each member (nested within the AllMembersList->Member JSON array), converting each key name to a Clojure map key and converting that name to lower case.
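The key conversion described above is just a small function on its own; pulled out for illustration (json-key->keyword is my name for it):

```clojure
(require '[clojure.string :as str])

;; turn a JSON key like "PersonId" into a lower-case Clojure keyword
(defn json-key->keyword [k]
  (-> k str/lower-case keyword))

(json-key->keyword "PersonId")
;; => :personid
```

This is the shape of function that json/read-str’s :key-fn option expects.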

In the Clojure REPL I can test this easily:

mlas.core> (def members (load-members sc "/Users/Jason/work/data/niassembly/members.json"))
mlas.core> (spark/first members)
#sparkling/tuple ["307" {:partyorganisationid "111", :memberfulldisplayname "Mr S Agnew", :membertitle "MLA - North Down", :memberimgurl "", :membersortname "AgnewSteven", :memberlastname "Agnew", :personid "307", :constituencyname "North Down", :memberprefix "Mr", :constituencyid "11", :partyname "Green Party", :membername "Agnew, Steven", :affiliationid "2482", :memberfirstname "Steven"}]

So I have a tuple of members with the memberid being the key and a map of key/values for the actual record.

Saving the Questions For Each Member

With the members safely dealt with I can now turn my attention to the questions. There is a web service within the API that will return a JSON set of question for a given member, all I have to do is pass in the member ID as a value on the end point.

So, for example, if I want to get the questions for member ID 90 I would call the service with (copy/paste the url below in to browser to see the actual output):

As I want to load the questions for each member, I’m going to iterate my pair RDD of members and use the key (as it’s the member id) to pull the data via URL with Clojure’s slurp function, and then save the JSON response to disk with Clojure’s spit function.

(defn save-question-data [pair-rdd] 
 (->> pair-rdd 
      (spark/map (s-de/key-value-fn (fn [key value] 
               (spit (str questions-path key ".json") 
                   (slurp (str api-questions-by-member key))))))))

I can run this from the REPL easily but it will take a bit of time.

mlas.core> (spark/collect (save-question-data members))

At this point all I’ve done is call the web service and save the questions to disk. I now need to load them in to Spark and create another pair RDD for each question and I want to use the :tablerpersonid as the key for the tuple.

Loading The Question Data Into Spark

In the same way that we loaded the members data as a filename/filecontent pair, we’ll load the question data. This time there’s a whole directory of files.

At present, as I’m in development mode, I’m making an assumption here: I’m assuming that every question set has questions in it. There are, though, some MLAs that don’t like asking questions for one reason or another; sorting the directory by file size shows them up.

$ ls -l -S
-rw-r--r-- 1 Jason staff 22 21 Jul 20:18 131.json
-rw-r--r-- 1 Jason staff 22 21 Jul 20:18 5223.json
-rw-r--r-- 1 Jason staff 22 21 Jul 20:17 5225.json
-rw-r--r-- 1 Jason staff 22 21 Jul 20:19 71.json
-rw-r--r-- 1 Jason staff 22 21 Jul 20:21 95.json

So for the minute I’m going to delete those and concentrate only on members who ask questions.

I’ve written a function that will iterate each member and load the questions file for that member ID.

(defn load-questions [sc questions-path]
  (->> (spark/whole-text-files sc questions-path)
       (spark/flat-map (s-de/key-value-fn
                        (fn [key value]
                          (-> (json/read-str value
                                             :key-fn (fn [k] (-> k str/lower-case keyword)))
                              (get-in [:questionslist :question])))))
       (spark/map-to-pair (fn [rec] (spark/tuple (:tablerpersonid rec) rec)))
       (spark/group-by-key)))

Notice the spark/group-by-key at the end, which gives us a key with a vector of maps: [k, [v1, v2, …, vn]]. If it was left out we’d have a pair RDD with one row per question. With that loaded into Spark we can have a look at what we have.
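Clojure’s own group-by shows the shape we end up with; spark/group-by-key is the distributed equivalent. A toy version with made-up records:

```clojure
;; grouping records by :tablerpersonid gives [k [v1 v2 ...]] style entries
(def grouped
  (group-by :tablerpersonid
            [{:tablerpersonid "104" :reference "AQW 141/07"}
             {:tablerpersonid "104" :reference "AQW 142/07"}
             {:tablerpersonid "8" :reference "AQW 1/07"}]))

(count (get grouped "104"))
;; => 2
```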

mlas.core> (def questions (load-questions sc "/Users/Jason/work/data/niassembly/questions"))
mlas.core> (spark/count questions)
mlas.core> (spark/first questions)
#sparkling/tuple ["104" [{:tablerpersonid "104", :departmentid "76", :questiondetails "", :documentid "2828", :reference "AQW 141/07", :departmentname "Department of Agriculture and Rural Development", :tableddate "2007-05-22T00:00:00+01:00", :questiontext "To ask the Minister of Agricultur...............}]]

So far we have two pair RDD’s one for members and one for questions, both have the member id as the key. This is a good place to be as it means we can easily join the data.

Joining RDD Datasets

Using Spark’s join functionality we get a pair RDD with the key and then the two joined values: [key, [left-rdd-value, right-rdd-value]]. If there is both a left and a right element for a key then the join will happen; if not, that key is left out. If we were to use left-outer-join instead, the left RDD would be preserved even if the right-hand side has no value that matches the key.
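A plain-Clojure sketch of the difference, using maps in place of RDDs (the keys and values here are made up):

```clojure
;; inner join keeps only keys present on both sides; left-outer join keeps
;; every left key, with nil where the right side has no match
(def left-side  {"8" :member-a, "104" :member-b})
(def right-side {"8" :questions-a})

(def inner-joined
  (for [[k lv] left-side :when (contains? right-side k)]
    [k [lv (get right-side k)]]))

(def left-outer-joined
  (for [[k lv] left-side]
    [k [lv (get right-side k)]]))
```

Here inner-joined only contains the "8" entry, while left-outer-joined also keeps "104" paired with nil.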

mlas.core> (def member-questions-rdd (spark/join members questions))
mlas.core> (spark/count member-questions-rdd)

A good rule of thumb to check is if the quantity of the joined data is more than the original left hand side RDD count. If that’s the case then check for duplicate member id’s in the member RDD.
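One way to run that duplicate check, sketched on a collected seq of member ids rather than the live RDD (duplicate-keys is my name for it):

```clojure
;; any member id appearing more than once in the key list is a duplicate
(defn duplicate-keys [ids]
  (->> ids
       frequencies
       (filter (fn [[_ n]] (> n 1)))
       (map first)))

(duplicate-keys ["8" "104" "8"])
;; => ("8")
```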

Calculating Department Frequencies

First of all let’s discuss what we’re trying to achieve. For every member we want to see the frequency of departments that MLA’s are directing their questions at. Now we have the joined RDD with the members and questions we can run Spark to find out for us.

As Spark RDDs are immutable you will end up chaining several Spark map steps to get to an answer. You’ve seen so far that we’ve done a map to load the members, a map to load the questions and a join to give us a [key, [member, questions]] pair RDD.

What I really want to do is refine this a bit further. One of the nice things with using Spark under Clojure is Sparkling Destructuring which gives you a handy set of functions for dealing with iterating the data. So for our key/value/value pair RDD we can use s-de/key-val-val-fn and this will give us an iterable function with the key, the first value (member) and the second value (questions) as accessible values.

(defn department-frequencies-rdd [members-questions-rdd]
  (->> members-questions-rdd
       (spark/map-to-pair
        (s-de/key-val-val-fn
         (fn [key member questions]
           (let [freqmap (map (fn [question] (:departmentname question)) questions)]
             (spark/tuple key (frequencies freqmap))))))))

When I run this I get the following output:

mlas.core> (def freqs (department-frequencies-rdd members-questions-rdd))
mlas.core> (spark/first freqs)
#sparkling/tuple ["8" {"Department of Culture, Arts and Leisure" 32, "Department of the Environment" 96, "Department for Social Development" 76, "Department of Agriculture and Rural Development" 53, "Department for Employment and Learning" 40, "Department for Regional Development" 128, "Northern Ireland Assembly Commission" 18, "Department of Education" 131, "Department of Health, Social Services and Public Safety" 212, "Department of Justice " 38, "Department of Finance and Personnel" 105, "Office of the First Minister and deputy First Minister" 151, "Department of Enterprise, Trade and Investment" 66}]

We could extend this a little further with some more information on the member but we’ve essentially achieved what we proposed at the start.


This is merely scratching the surface of the NI Assembly data sets that are available. With some simple Clojure and Spark usage we’ve managed to pull the member data and questions, do a simple join and the find the frequency of departments.

Most data science is about consolidating large sets of data down to simple numbers that can be presented. Just by looking at the raw data I wouldn’t have known that a member had asked 212 questions to the Department of Health, Social Services and Public Safety.

Now I can.

You can download the source code for this project from the Github repository.


