Using Bayes Theorem for NI Startup Probabilities (#startups #clojure #statistics)

This is probably as close to serious as I’ll ever get on the subject, so hold on to your hipster pork pie hats…. The title headings are based on a fairly common path for Northern Ireland startups, other territories will have their own methods I’m sure. Regardless, I need a picture….

[Image: Nathan Barley]

The Odds Are Against You.

The harsh reality is that the odds are stacked against you for succeeding. I’ll be ultra liberal with my probabilities and say 4% (I should really be saying 2% but it’s a Bank Holiday Weekend and I’m in a good mood and not my grumpy self). This number could be quantified by mining all the previous startups and seeing who lasted longer than 3 years for example. So four in every hundred isn’t a bad starting point. Let’s call this our prior probability.

What we’re trying to establish is this: if an event happens during the startup journey, what does that do to the existing probability? The nice thing with Bayes, as you’ll see, is that for every milestone event (or any event) we can re-run the numbers. Ready?

Wow! We Got Proof Of Concept 40k!

Current Prior Probability: 4%

Great news! The good folks at TechstartNI have shone the light on your idea and given you the clawback funds to build via a respected development house or developer. Will that have an effect on our posterior probability? It may do, but PoC isn’t really a validation of your startup, just access to funds to build. What we can do though is use Bayes Theorem to recalculate the probability now we have a new event to include.

Bayes In Clojure

So Bayes works on three inputs, the prior probability (in our case the 4% figure we started with), a positive impact on the hypothesis that you’ll last longer than three years and a negative impact on the hypothesis that you won’t last longer than three years.

Assuming that x is our prior, y is the positive event and z is the negative event, we can use the formula: (x * y) / ((x * y) + (z * (1 - x)))

If I were to code that up in Clojure it would look like this:

(ns very-simple-bayes.core)

(defn calculate-bayes [prior-prob newevent-positive-hypothesis newevent-negative-hypothesis]
  (double (/ (* prior-prob newevent-positive-hypothesis) 
             (+ (* prior-prob newevent-positive-hypothesis) 
                (* newevent-negative-hypothesis (- 1 prior-prob))))))

(defn calculate-bayes-as-percent [prior-prob newevent-positive-hypothesis newevent-negative-hypothesis]
  (* 100 (calculate-bayes prior-prob newevent-positive-hypothesis newevent-negative-hypothesis)))

The first function does the actual Bayes calculation and the second merely converts the result into a percentage for me.

Right, back to our TechstartNI PoC. Let’s see how that affects our chances of survival.

PoC funds give you some ground to build a product, but they have little impact on the survival of the company as a whole. Being liberal again, let’s say the positive impact on the hypothesis is 90% and the negative is 10%. It can only be a good thing to have a product to sell.

very-simple-bayes.core> (calculate-bayes-as-percent 0.04 0.9 0.1)
27.272727272727277

While PoC has a huge effect on you getting a product out of the door (do I dare utter the letters M, V and P at this point) it has less of an effect on your long term survival. So your 4% chance of three year survival has gone to 27.2%. A positive start, but all you have is a product.

Put the Champagne on ice just don’t open it…..

Propelling Forward

Current Prior Probability: 27.2%

The next logical step is to look at something like the Propel Programme to get you into the sales and marketing mindset, but also to make you “investor ready”, which is what I see as the real aim of Propel. So with the new event we can recalculate our survival probability. The 20k doesn’t make a huge dent in your survival score, but it helps you get through, so I will take that into account.

As I’ve not experienced Propel first hand it’s unfair of me to say how things will pan out; you’ll have to ask someone who’s done it. It doesn’t, though, stop me plucking some numbers out of the air to test against, and you should really do the same.

Propel will have a positive impact on your startup, no doubt, there’s a lot to learn and you’ll be in the same room as others going through the same process. The “up to” £20k is good to know but there’s no 100% certainty, apart from death and taxes, that you’ll get the full amount.

Propel’s positive probability on hypothesis: 40%

Propel’s false positive probability on the hypothesis: 80%

Running the numbers through Bayes again, let’s see what the new hypothesis probability is looking like.

very-simple-bayes.core> (calculate-bayes-as-percent 0.272 0.4 0.8)
15.74074074074074

That brought us back down to earth a bit: a 15.74% chance of a positive hypothesis. No reflection on Propel at all, that’s just how the numbers came out. Now I could be all biased and say that if you do Propel you’re gonna be a unicorn-hipster-star, but the reality is far from that.

The false positive is interesting; doing these things can sometimes fool the founder into thinking they’re doing far better than they actually are. If 100 startups went through Propel and 40 are still trading today then our positive event probability is about right. And that’s the nice thing about applying Bayes in this way: we can make some fairly reasoned assumptions to calculate our scores with.

I’m Doing Springboard too Jase!

Okay! And the nice thing is these things can happen in parallel, but let’s treat it as a sequential matter to preserve sanity in the probability.

Current Prior Probability: 15.74%

I think the same event +/- probabilities would apply here. Springboard is good for mentorship and contacts. Hand on heart, my gut says the numbers are going to be the same as Propel’s for what we are looking at here.

Springboard positive hypothesis: 40%

Springboard as a false positive on hypothesis: 80%

Let’s run the numbers again (now you can see why I wrote a Clojure program first).

very-simple-bayes.core> (calculate-bayes-as-percent 0.1574 0.4 0.8)
8.54227721697601

Interestingly, according to the numbers, doing the two programmes has a negative impact on your startup if you have no revenue and no customers. Well, that’s what the numbers say.

Getting That First Seed.

Current Prior Probability: 8.54%

Okay, so you’ve got your Proof of Concept in hand, grafted through Propel and then gone through Springboard until you get that nice picture on the website. “Investor Ready” is an odd term; markets can change, fashions come and go and investors go looking for different things as time goes by. So all the while you’ve been slaving away, investors could have started looking for something else.

So the opportunity arrives to pitch to one of the NI based VCs/angels for some “proper” money. Once again a fairly normal route to go down. It could be a mixture of different funding places (Crescent, Kernel or Techstart). If accepted, the goalposts change: there are now people on the board and a results focus (i.e. are you hitting target month on month?).

The average figure for a seed round in NI is between £150-£300k but I’ll head for the upper figure. Regardless, money in the bank (even though it’s not yours) is a good thing if you are prepared to give up some equity. Saying that, investments can go bad, so we need to yin and yang this out a bit. So I’ve put in a 20% chance of the investment being a false positive. Once again, if you had the term sheets of 20 companies you’d be able to do some maths yourself and get a better idea.

Seed round has positive outcome on hypothesis: 70%

Seed round as a false positive on hypothesis: 20%

Investment is good PR and the hype cycle loves a good startup investment story. It opens up the doors to talking guff in far more places than you did before. How does that affect our probability though?

very-simple-bayes.core> (calculate-bayes-as-percent 0.0854 0.7 0.2)
 24.631231973629998

Positive indeed. Going from 4.0% to 24.6% is good: from a 1 in 25 to roughly a 1 in 4 chance of lasting three years, though most of that uplift hinged on the investment event in the latter stages. There’s a chance that by this point you’d be two thirds of the way into the three year plan.

Blessed Are The 2%

At the start I used 4% as a very optimistic probability of a startup lasting more than three years. I wonder what would happen if I went for a realistic start point of 2%?

Proof Of Concept Stage

very-simple-bayes.core> (calculate-bayes-as-percent 0.02 0.9 0.1)
15.517241379310345

Propel Stage

very-simple-bayes.core> (calculate-bayes-as-percent 0.1551 0.4 0.8)
8.40695972681446

Springboard Stage

very-simple-bayes.core> (calculate-bayes-as-percent 0.0840 0.4 0.8)
4.3841336116910234

Investor Stage

very-simple-bayes.core> (calculate-bayes-as-percent 0.0438 0.7 0.2)
13.817034700315457
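
Since each stage simply feeds its posterior back in as the next prior, the whole journey can be folded into a single call. Here’s a minimal sketch using the calculate-bayes function from earlier; the event probabilities are the same numbers plucked from the air above, not measured figures.

(defn chain-events
  "Folds a sequence of [positive negative] event likelihoods over a starting prior."
  [prior events]
  (reduce (fn [p [pos neg]] (calculate-bayes p pos neg)) prior events))

;; PoC -> Propel -> Springboard -> Seed, starting from the 2% prior
(* 100 (chain-events 0.02 [[0.9 0.1] [0.4 0.8] [0.4 0.8] [0.7 0.2]]))
;; => roughly 13.8, slightly different from the stage-by-stage runs above
;;    because I rounded the prior at each step there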

So a 1 in 7 chance of you lasting 3 years….. you can also finally put your “passion” line to bed as well.

Concluding…..

You can see why some folk decide to hold startup events. The risk is far lower and the chances of sponsorship are higher and repeatable. Saying that, we know that those who want to work 17 hours a day by the seat of their knickers will continue to do so. Depending on your point of view it is indeed easier to sell the shovels instead.

And As For That Champagne….

Marilyn drank it.


Focus on the MVDP – The Minimum Viable Data Product #startups #lean #data

Since the release of “The Lean Startup” there are three letters that have been preached over and over: MVP, Minimum Viable Product. The oft-copied mantra of “what do you need to do to get this thing out of the door?”.

And in part it’s valid, there’s method in the madness. In part I also believe it’s flawed, as customers buy into a complete product, not iterations used as little experiments to see what happens when you tweak the model. Life in startup land is never as clean cut as some people believe the lean model to be. My issue with MVP is that the focus is on the product, but I don’t see that as the most valuable asset to the business. For most startups the app, the website or the API is just a delivery method for what really matters: the data.

So instead of the Minimum Viable Product, why do I never hear about the Minimum Viable Data Product? Let me give you a simple example.

Disrupting Bacon! The Sequel!

You can read the original bacon disruption post here. Sandwich ordering for me is simple: I go into a shop, order what I want, wait a bit for it to be made, pay and leave again. If you really want to appify the process then fair play to you.


Some people will look at that concept and think, “Simple, Jase! Have the menus on the app, select what you want, pay within the app and pick it up. Job done, and I’ll take my 8% fee from the sandwich shop owner.”

I showed in the original post that it was going to be very hard work: growth of willing retailers has to be at least 75% month on month, and then there’s the small matter of getting users to download the app in the first place. Remember, Just Eat didn’t come wading into the UK with £10m in their back pocket for no reason; UK wide marketing costs a lot of money.


Shifting The Focus To The Minimum Viable Data Product

Think about it: the data collected by the likes of Nectar and Clubcard is not really there for the customer’s benefit, that’s merely the pay off and by-product of being given enough permission to mine every basket and every item that passes through the point of sale. The real value in that data comes once it’s sliced and diced and you have customer profiles, regional segments of buying behaviour and detailed time blocks of seasonal purchases of everyday items. At that point the main sell is to the suppliers of the products.

Designing The MVDP

So, thinking about it for sandwich and bacon disruption alike, if the first focus was on the minimum viable data product, how would it look? There’s usually some form of sign up process with a minimum level of information.

  • Firstname
  • Lastname
  • Email address

If the user, assuming it’s an app, is giving you permission to track their location to find the nearest vendors, then location is a wonderful metric. With permission there’s no reason you can’t store the app ID along with the location data. Does that customer move around a lot or are they stuck within a certain radius every day, every hour? Could I push information on the most popular eateries near wherever they happen to be if they move around a lot?

  • Location
  • Radius movement

The frequency and value of the customer is highly important. Are they using your service every day (signups is NOT a good metric, active users IS), what’s the value of their order, can you predict long term customer value based on the purchase data? Can you split it down per supplier? No? Well, find someone who can…

  • Date/time of purchase.
  • Purchase value ($/£)

What did the customer purchase? Even Subway miss this by a long shot, and so do Starbucks (and they know what I’m buying). So here’s your chance for glory. Does the customer order the same thing every day, a creature of habit, or are their purchases varied? Do they use the same supplier day in day out or do they move around the different ones in the area?

  • Retailer ID/name
  • Itemised list of what was ordered.

There’s a possibility that an app will already be picking all this data up but has no idea how to process it, and that’s fine. Just store it and when you get to a point of thinking you have enough data find someone who can help you understand it.
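
Pulling the fields above together, one order record in the minimum viable data product might look something like the map below. This is only a sketch; the field names and values are illustrative rather than a fixed schema.

(def example-order
  {:customer    {:firstname "Jane"
                 :lastname  "Smith"
                 :email     "jane@example.com"}
   :location    {:lat 54.5973 :lon -5.9301}   ; captured with the user's permission
   :retailer-id "shop-42"
   :order-time  "2015-08-31T08:45:00Z"
   :order-value 4.50                          ; purchase value in £
   :items       ["Bacon sandwich" "Flat white"]})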

Defining The Questions

There are going to be two types of questions. First, the ones that you’ll ask yourself:

  • “Which suppliers are generating us the most revenue?”
  • “Who are our top purchasing customers?”
  • “How many orders a day is our app fulfilling?”

Those aren’t unreasonable questions, and they’re ones that I’d expect the board and management team to be asking every time you have a meeting.

The retailers, your suppliers, on the other hand will have their own questions.

  • “How many customers purchased via the app?”
  • “What’s the sales volume?”
  • “How much in fees did I incur to fulfil those sales?”
  • “What were the top performing items in inventory month on month?”

Even with a simple list (trust me, there’s loads more) there are some wins in there for the retailer, especially in pushing distressed stock or capacity planning for more perishable goods. No point holding too much cottage cheese when no one is buying it.
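
To show how little machinery some of those answers need, here’s a hedged sketch that answers two of them over a collection of order maps shaped like the hypothetical example-order record earlier; it’s an illustration, not a production query layer.

(defn revenue-by-retailer
  "Total order value per retailer from a seq of order maps."
  [orders]
  (reduce (fn [totals {:keys [retailer-id order-value]}]
            (update totals retailer-id (fnil + 0) order-value))
          {} orders))

(defn top-items
  "The n most frequently ordered items across all the orders."
  [orders n]
  (->> orders
       (mapcat :items)
       frequencies
       (sort-by val >)
       (take n)))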

From The Department Of “How Are We Going To Make Money?”

From the top: data has value. Not everything in life has to be open data. I made this very clear to a website owner many years ago. To his “The value is in my website”, my reply was, “No it’s not, the value is in the data your users are entering into your website. Don’t give it away”.

So as a new business (I’m going off the word Startup by this point) the key questions I’d be asking are:

  1. What is our minimum viable data model, what do we need to collect to make better decisions later?
  2. Can we mine, slice, dice this data and sell it on to our suppliers?
  3. What is the value of the data we will have? Do we sell it on as an extra or as part of the product package?
  4. What is the collection method to acquire the data? Is it a website, an app, a smart TV application or something else? Will the data model change if the collection method changes?

And Finally…..

Customer buy-in is crucial and it happens right at the start, the moment they sign up. From that point on everything can be tracked to aid your commercial goals. What’s important, though, is that it’s hard to chop and change the data model around once you get going, so you need to make time to focus on the MVDP before you go into launch mode. If you change course halfway through and start asking for gender information, users might get a bit irked and wonder if there’s really a customer segment where ordering sandwiches hinges on whether someone is transgender, female or male.

Just be careful what you ask for and how you ask for it.

The #NIStartupCommandments….. straight from the pinch of salt department.

Just thoughts, take them for what you will…. I know there were only ten commandments, but tablets have got more storage capacity these days so here’s a baker’s dozen.

1. Have a software developer willing to help/work for sweat equity. Without them you’re dead.

2. Figure out revenue, not funding opportunities, from day one. Target market segments etc.

3. The app economy is dead so get those ideas out of your head. Tell me which NI apps have “made it” with £1m+ rev?

4. Use PoC funds wisely, dev houses are expensive and make sure you are in control and own code. If not, go elsewhere

5. Treat startup events with caution. What will you learn and who will you meet? Doing the rounds wastes time.

6. The community is small, so don’t mess about, be nice to everyone you meet. Karma etc.

7. SBRI competitions are good blocks of revenue while you get up and running. Good for the network too.

8. Beware the Church of Lean.

9. Coffee shops are expensive, buy a kettle. Keep expenses low low low.

10. Download the @Mattermark app for iOS and look at your competitors, their growth score and what they raised. Wake up call.

11. Look at all the destination airports from Belfast. Those are your first tier customers.

12. Don’t apply for an incubator or accelerator without a tech dev; you can’t sell something that doesn’t exist.

13. If you’re going to be an analytics company, the average minimum seed is about $5m.

NI Open Data – Mining Prescription Data Part 2. – #opendata #spark #clojure


The Story So Far….

You can read part 1 here.

A few weeks ago I started finding out which were the most popular items that a GP practice would prescribe. Once again I turned to Sparkling and Clojure to do the grunt work for me.

The Practice List

What I didn’t have at the time was the practice list. You can download that from the HSC Business Services Organisation.

I’m going to create another PairRDD with the practice number as the key and the practice information as its value.

(def practice-fields [:pracno :partnershipno :practicename :address1 :address2 :address3 :postcode :telno :lcg])

(defn load-practices [sc filepath] 
  (->> (spark/text-file sc filepath)
       (spark/map #(->> (csv/read-csv %) first))
       (spark/filter #(not= (first %) "PracNo"))
       (spark/map #(zipmap practice-fields %))
       (spark/map-to-pair (fn [rec] 
                            (let [practice-id (:pracno rec)]
                              (spark/tuple practice-id rec))))))

It’s very similar to the original function I used to load the prescription CSV.

Joining The RDD’s

With two pair RDD’s I can safely perform the join.

(defn join-practice-prescriptions [practices prescriptions]
  (spark/join practices prescriptions))

This will give me a Pair RDD in [key, [value1, value2]] form. I’ll need to tweak the function that works out the frequencies so it takes this new data structure into account.

(defn def-practice-prescription-freq [prescriptiondata]
  (->> prescriptiondata
       (spark/map-to-pair
        (s-de/key-val-val-fn
         (fn [k practice prescriptions]
           ;; count each prescribed item for the practice and keep the ten most frequent
           (let [freqmap (map (fn [rec] (:vmp_nm rec)) prescriptions)]
             (spark/tuple (str (:practicename practice) " " (:postcode practice))
                          (apply list (take 10 (reverse (sort-by val (frequencies freqmap))))))))))))

The last thing to do is wrap up the process-data function to load in the practice list.

(defn process-data [sc filepath practicefile outputpath] 
  (let [prescription-rdd (load-prescription-data sc filepath)
        practice-rdd (load-practices sc practicefile)]
    (->> (join-practice-prescriptions practice-rdd prescription-rdd)
         (def-practice-prescription-freq)
         (spark/coalesce 1)
         (spark/save-as-text-file outputpath))))

Testing From The REPL

To test these changes is fairly trivial from the REPL.

; CIDER 0.8.1 (package: 20141120.1746) (Java 1.7.0_40, Clojure 1.6.0, nREPL 0.2.10)
nipresciptions.core> (def c (-> (conf/spark-conf)
             (conf/master "local[3]")
             (conf/app-name "niprescriptions-sparkjob")))
  (def sc (spark/spark-context c))
#'nipresciptions.core/c
#'nipresciptions.core/sc
nipresciptions.core> (def practice-file "/Users/Jason/work/data/Northern_Ireland_Practice_List_0107151.csv")
#'nipresciptions.core/practice-file
nipresciptions.core> (def prescription-path "/Users/Jason/work/data/niprescriptions")
#'nipresciptions.core/prescription-path
nipresciptions.core> (process-data sc prescription-path practice-file "/Users/Jason/testoutput")

The Spark job will take a bit of time as there’s 347 practices and a lot of prescription data…..

References:

Detail Data Prescription CSV’s: http://data.nicva.org/dataset/prescribing-statistics

Github Repo for this project: https://github.com/jasebell/niprescriptiondata

Practice List: http://www.hscbusiness.hscni.net/services/1816.htm


NI Open Data – Mining Prescription Data – #opendata #spark #clojure

Moving On From The NI Assembly

There was plenty of scope left over from the NI Assembly blog posts I did last time (you can read part 1 and part 2 for the background). While I received a lot of messages along the lines of “why don’t you do this” and “can you find xxxxxx out”, it’s not something I wish to do. Kicking hornets’ nests isn’t really part of my job description.

Saying that, when there’s open data for the taking it’s worth looking at. Recently the Detail Data project opened up a number of datasets to be used. Buried within are the prescriptions that GPs, or nurses within the practice, have prescribed.

Acquiring the Data

The details of the prescription data are here: http://data.nicva.org/dataset/prescribing-statistics (though the data would suggest it’s not really statistics, just raw CSV data). The files are large but nothing I’m worrying about in the “BIG DATA” scheme of things; this is small in relative terms. I’ve downloaded October 2014 to March 2015, a good six months’ worth of data.

Creating a Small Sample Test File

When developing these kinds of jobs, before jumping into any code it’s worth having a look at the data itself. See how many lines of data there are; as it’s a CSV file this time, I know it’ll be one object per line.

Jason-Bells-MacBook-Pro:niprescriptions Jason$ wc -l 2014-10.csv 
 459188 2014-10.csv

Just under half a million lines for one month. That’s okay, but too much for testing; I want to knock it down to 200 lines for testing. The UNIX head command will sort us out nicely.

head -n200 2014-10.csv > sample.csv

So for the time being I’ll be using my sample.csv file for development.

Loading Data In To Spark

The first thing I need to do is define the header row of the CSV as a set of map keys. When Spark loads the data in, I’ll use zipmap to pair the values to the keys for each row of the data.

(def fields [:practice :year :month :vtm_nm :vmp_nm :amp_nm :presentation :strength :total-items :total-quantity :gross-cost :actual-cost :bnfcode :bnfchapter :bnfsection :bnfparagraph :bnfsub-paragraph :noname1 :noname2])

You might have noticed the final two keys, noname1 and noname2. The reason for this is simple: there are trailing commas on the header row with no names after them, so I’ve forced those columns to have a name to keep the importing simple.

PRACTICE,Year,Month,VTM_NM,VMP_NM,AMP_NM,Presentation,Strength,Total Items,Total Quantity,Gross Cost (£),Actual Cost (£),BNF Code,BNF Chapter,BNF Section,BNF Paragraph,BNF Sub-Paragraph,,
1,2015,3,-,-,-,-,-,19,0,755.00,737.28,-,99,0,0,0,,
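
To make the zipmap pairing concrete, here’s roughly what it does to the single data row above (result abridged):

(zipmap fields ["1" "2015" "3" "-" "-" "-" "-" "-" "19" "0"
                "755.00" "737.28" "-" "99" "0" "0" "0" "" ""])
;; => {:practice "1", :year "2015", :month "3", ... :total-items "19",
;;     :gross-cost "755.00", :actual-cost "737.28", :bnfchapter "99", ... :noname2 ""}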

With that I can now create the function that loads in the data.

(defn load-prescription-data [sc filepath] 
 (->> (spark/text-file sc filepath)
      (spark/map #(->> (csv/read-csv %) first))
      (spark/filter #(not= (first %) "PRACTICE"))
      (spark/map #(zipmap fields %))
      (spark/map-to-pair (fn [rec]
         (let [practicekey (:practice rec)]
           (spark/tuple practicekey rec))))
      (spark/group-by-key)))

Whereas the NI Assembly data was in JSON format so I had the keys already defined, this time I need to use the zipmap function to zip the values and the header keys together. This gives us a handy map to reference instead of relying on the element number of the CSV line. As you can see, I’m grouping all the prescriptions by their GP practice key.

 

Counting The Prescription Frequencies

This function is very similar to the frequency function I used in the NI Assembly project: by mapping each prescription record and retaining the item prescribed, I can then use the frequencies function to get counts for each distinct type.

(defn def-practice-prescription-freq [prescriptiondata]
 (->> prescriptiondata
   (spark/map-to-pair (s-de/key-value-fn (fn [k v] 
     (let [freqmap (map (fn [rec] (:vmp_nm rec)) v)]
       (spark/tuple k (frequencies freqmap))))))))

Getting The Top 10 Prescribed Items For Each GP

Suppose I want to find out the top ten prescribed items for each GP location. As the function I’ve got already produces the frequencies, with a little tweaking it can return what I need. First I’m using sort-by on the value; this gives me a sort from smallest to largest, and the reverse function then flips it on its head to give me largest to smallest. As I only want ten items, I then use the take function to return the first ten items in the sequence.

(defn def-practice-prescription-freq [prescriptiondata]
 (->> prescriptiondata
    (spark/map-to-pair (s-de/key-value-fn (fn [k v] 
       (let [freqmap (map (fn [rec] (:vmp_nm rec)) v)]
           (spark/tuple k 
                        (take 10 (reverse (sort-by val (frequencies freqmap)))))))))))

Creating The Output File Process

So with two simple functions we have the workings of a complete Spark job. I’m going to create a function to do all the hard work for us and save us repeating lines in the REPL. This function will take in the Spark context, the file path of the raw data files (or file if I want) and an output directory path where the results will be written.

(defn process-data [sc filepath outputpath] 
 (let [prescription-rdd (load-prescription-data sc filepath)]
       (->> prescription-rdd
            (def-practice-prescription-freq)
            (spark/coalesce 1)
            (spark/save-as-text-file outputpath))))

What’s going on here then? First of all we load the raw data into a Spark Pair RDD, then using the thread-last macro we calculate the item frequencies and reduce everything down to a single partition with the coalesce function. Finally we output everything to our output path. To start with I’ll test it from the REPL with the sample data I created earlier.

nipresciptions.core> (process-data sc "/Users/Jason/work/data/niprescriptions/sample.csv" "/Users/Jason/Desktop/testoutput/")
nil

Looking at the file part-00000 in the output directory you can see the output.

(1,(["Blood glucose biosensor testing strips" 4] ["Ostomy skin protectives" 4] ["Macrogol compound oral powder sachets NPF sugar free" 3] ["Clotrimazole 1% cream" 2] ["Generic Dermol 200 shower emollient" 2] ["Chlorhexidine gluconate 0.2% mouthwash" 2] ["Clarithromycin 500mg modified-release tablets" 2] ["Betamethasone valerate 0.1% cream" 2] ["Alendronic acid 70mg tablets" 2] ["Two piece ostomy systems" 2]))

So we know it’s working okay…. now for the big test, let’s do it against all the data.

Running Against All The Data

First things first, don’t forget to remove the sample.csv file if it’s in your data directory or it will get processed with the other raw files.

$ rm sample.csv

Back to the REPL and this time my input path will just be the data directory and not a single file, this time all files will be processed (Oct 14 -> Mar 15).

nipresciptions.core> (process-data sc "/Users/Jason/work/data/niprescriptions/" "/Users/Jason/Desktop/output/")

This will take a lot longer as there’s much more data to process. When it does finish have a look at the part-00000 file again.

(610,(["Gluten free bread" 91] ["Blood glucose biosensor testing strips" 62] ["Isopropyl myristate 15% / Liquid paraffin 15% gel" 29] ["Lymphoedema garments" 27] ["Macrogol compound oral powder sachets NPF sugar free" 25] ["Ostomy skin protectives" 24] ["Gluten free mix" 21] ["Ethinylestradiol 30microgram / Levonorgestrel 150microgram tablets" 20] ["Gluten free pasta" 20] ["Carbomer '980' 0.2% eye drops" 19]))
(625,(["Blood glucose biosensor testing strips" 62] ["Gluten free bread" 38] ["Gluten free pasta" 27] ["Ispaghula husk 3.5g effervescent granules sachets gluten free sugar free" 24] ["Macrogol compound oral powder sachets NPF sugar free" 20] ["Isopropyl myristate 15% / Liquid paraffin 15% gel" 20] ["Isosorbide mononitrate 25mg modified-release capsules" 18] ["Alginate raft-forming oral suspension sugar free" 18] ["Isosorbide mononitrate 50mg modified-release capsules" 18] ["Oxycodone 40mg modified-release tablets" 16]))
(661,(["Blood glucose biosensor testing strips" 55] ["Gluten free bread" 55] ["Macrogol compound oral powder sachets NPF sugar free" 24] ["Salbutamol 100micrograms/dose inhaler CFC free" 20] ["Colecalciferol 400unit / Calcium carbonate 1.5g chewable tablets" 19] ["Venlafaxine 75mg modified-release capsules" 18] ["Isosorbide mononitrate 25mg modified-release capsules" 18] ["Isosorbide mononitrate 60mg modified-release tablets" 18] ["Alginate raft-forming oral suspension sugar free" 18] ["Venlafaxine 150mg modified-release capsules" 18]))
(17,(["Blood glucose biosensor testing strips" 55] ["Gluten free bread" 35] ["Macrogol compound oral powder sachets NPF sugar free" 29] ["Colecalciferol 400unit / Calcium carbonate 1.5g chewable tablets" 24] ["Gluten free biscuits" 22] ["Ispaghula husk 3.5g effervescent granules sachets gluten free sugar free" 21] ["Diclofenac 1.16% gel" 19] ["Sterile leg bags" 19] ["Glyceryl trinitrate 400micrograms/dose pump sublingual spray" 19] ["Ostomy skin protectives" 18]))

There we are: GP practice 610 had gluten free bread prescribed 91 times over the six month period. The blood glucose testing strips are also high on the agenda, but that would come as no surprise to anyone who is diabetic.

So Which GP’s Are Prescribing What?

The first number in the raw data is the GP id. In the DetailData notes for the prescription data I read:

“Practices are identified by a Practice Number. These can be cross-referenced with the GP Practices lists.”

As with the NI Assembly data, I could load in the GP listing data and join the two by their key. Sadly on this occasion I can’t; the data just isn’t there on the page. I’m not sure if it’s broken or removed on purpose. A shame, but I’m not going to create a scene.

Resources

DetailData Prescription Data: http://data.nicva.org/dataset/prescribing-statistics

Github Repo for this project: https://github.com/jasebell/niprescriptiondata

*** Note: I spelt “prescriptions” wrong in the Clojure project but as this is a throwaway kind of thing I won’t be altering it…. ***

NIAssembly Open Data – Part 2 – Sankey Diagrams #opendata #clojure #spark #sankey

In the first part of this walk through I showed you how to use the excellent NI Assembly open data platform to find out the frequency of departments members were asking questions to.

A picture speaks a thousand words, so they say, so it makes sense to attempt to visualise the data we’ve worked on.

What’s A Sankey Diagram?

A sankey diagram is basically a collection of node labels with connections; these connections are weighted by value, and the higher the value the thicker the connection.

[Image: example sankey diagram]

The data is based on a CSV file with a source node, target node and a value. Simple as that.
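
For example, here are a couple of rows in the shape D3 expects, with the values lifted from the output generated later in this post (commas stripped from the names, as explained further down):

source,target,value
Beggs Roy,Department of Education,131
Beggs Roy,Department for Regional Development,128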

Reusing the Spark Work

In the previous post we left Spark and the data in quite a nice position: a Pair RDD with the member id as the key and a map of department names to question frequencies.

The first job is to transform this into a CSV file. As we left it we had a Pair RDD of [k, v] with v being a map of department names to frequencies. So in reality we’ve got [k, {k v, k v, … k v}].

For example, let’s look at the first element in our RDD after doing a spark/collect on the Pair RDD.

mlas.core> (first dfvals)
["8" {"Department of Culture, Arts and Leisure" 32, "Department of the Environment" 96, "Department for Social Development" 76, "Department of Agriculture and Rural Development" 53, "Department for Employment and Learning" 40, "Department for Regional Development" 128, "Northern Ireland Assembly Commission" 18, "Department of Education" 131, "Department of Health, Social Services and Public Safety" 212, "Department of Justice " 38, "Department of Finance and Personnel" 105, "Office of the First Minister and deputy First Minister" 151, "Department of Enterprise, Trade and Investment" 66}]

The first element is the member id and the second element is the department/frequency map. Remember this is for one MLA; there are still 103 in the RDD altogether. In the REPL below, x is bound to that first tuple.

mlas.core> (first x)
"8"
mlas.core> (second x)
{"Department of Culture, Arts and Leisure" 32, "Department of the Environment" 96, "Department for Social Development" 76, "Department of Agriculture and Rural Development" 53, "Department for Employment and Learning" 40, "Department for Regional Development" 128, "Northern Ireland Assembly Commission" 18, "Department of Education" 131, "Department of Health, Social Services and Public Safety" 212, "Department of Justice " 38, "Department of Finance and Personnel" 105, "Office of the First Minister and deputy First Minister" 151, "Department of Enterprise, Trade and Investment" 66}

We can use Clojure’s map function to process each key/pair of the map data.

mlas.core> (map (fn [[k v]] (println k " > " v)) (second x))
Department of Culture, Arts and Leisure > 32
Department of the Environment > 96
Department for Social Development > 76
Department of Agriculture and Rural Development > 53
Department for Employment and Learning > 40
Department for Regional Development > 128
Northern Ireland Assembly Commission > 18
Department of Education > 131
Department of Health, Social Services and Public Safety > 212
Department of Justice > 38
Department of Finance and Personnel > 105
Office of the First Minister and deputy First Minister > 151
Department of Enterprise, Trade and Investment > 66
(nil nil nil nil nil nil nil nil nil nil nil nil nil)
mlas.core>

A couple of things to note: notice the use of [k v] destructuring in the function passed to map. Secondly, as I’m using the println function, the result of the map function is going to be nil; the last line is the result of the Clojure map function.

We’ve already got two thirds of the CSV output sorted with the target node and the value; I need to redo the Spark function so that instead of the member id being the key, I get the name of the MLA in question.

(defn mlaname-department-frequencies-rdd [members-questions-rdd]
  (->> members-questions-rdd
       (spark/map-to-pair
        (s-de/key-val-val-fn
         (fn [key member questions]
           (let [freqmap (map (fn [question] (:departmentname question)) questions)]
             (spark/tuple (:membername member) (frequencies freqmap))))))))

When I run this function with the existing member/question Pair RDD I get a new Pair RDD with the following:

mlas.core> (def mmdep-freq (mlaname-department-frequencies-rdd mq-rdd))
#'mlas.core/mmdep-freq
mlas.core> (spark/first mmdep-freq)
#sparkling/tuple ["Beggs, Roy" {"Department of Culture, Arts and Leisure" 32, "Department of the Environment" 96, "Department for Social Development" 76, "Department of Agriculture and Rural Development" 53, "Department for Employment and Learning" 40, "Department for Regional Development" 128, "Northern Ireland Assembly Commission" 18, "Department of Education" 131, "Department of Health, Social Services and Public Safety" 212, "Department of Justice " 38, "Department of Finance and Personnel" 105, "Office of the First Minister and deputy First Minister" 151, "Department of Enterprise, Trade and Investment" 66}]

With that RDD we have all the elements for the required CSV file: a source node (the MLA’s name), a target node (the department) and a value (the frequency). Notice that I’m also removing the commas from the MLA name and the department name, otherwise I’ll break the sankey diagram when it’s rendered on screen.

(defn generate-csv-output [mddep-freq]
  (->> mddep-freq 
       (spark/map (s-de/key-value-fn (fn [k v] (let [mlaname k]
           (map (fn [[department frequency]]
                        [(str/replace mlaname #"," "")
                         (str/replace department #"," "")
                         frequency]) v)))))
       (spark/collect)))

And then a function to write the actual csv file.

(defn write-csv-file [filepath data]
 (with-open [out-file (io/writer (str filepath "sankey.csv"))]
 (csv/write-csv out-file data)))

To test I’m just going to write out the first MLA in the vector.

mlas.core> (def csv-to-output (generate-csv-output mmdep-freq))
#'mlas.core/csv-to-output
mlas.core> (first csv-to-output)
(["Beggs, Roy" "Department of Culture, Arts and Leisure" 32] ["Beggs, Roy" "Department of the Environment" 96] ["Beggs, Roy" "Department for Social Development" 76] ["Beggs, Roy" "Department of Agriculture and Rural Development" 53] ["Beggs, Roy" "Department for Employment and Learning" 40] ["Beggs, Roy" "Department for Regional Development" 128] ["Beggs, Roy" "Northern Ireland Assembly Commission" 18] ["Beggs, Roy" "Department of Education" 131] ["Beggs, Roy" "Department of Health, Social Services and Public Safety" 212] ["Beggs, Roy" "Department of Justice " 38] ["Beggs, Roy" "Department of Finance and Personnel" 105] ["Beggs, Roy" "Office of the First Minister and deputy First Minister" 151] ["Beggs, Roy" "Department of Enterprise, Trade and Investment" 66])
mlas.core> (write-csv-file "/Users/Jason/Desktop/" (first csv-to-output))
nil
mlas.core>

So far so good, checking on my desktop and there’s a CSV file ready for me to use. I just need to add the header (source,target,value) to the top line. In all honesty I should really insert that header row at the start of the vector.
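
If I did want to do that programmatically, a minimal sketch is to cons the header row on before handing the rows to write-csv-file:

(defn write-sankey-csv
  "Prepends the source,target,value header row before writing the CSV."
  [filepath rows]
  (write-csv-file filepath (cons ["source" "target" "value"] rows)))

;; e.g. (write-sankey-csv "/Users/Jason/Desktop/" (first csv-to-output))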

Creating The Sankey Diagram

Where possible it’s best to learn from example and in all honesty I’m not a visualisation kinda guy. So when the going gets tough, the tough Google D3 examples.

So there’s a handy Sankey diagram with CSV files example that I can use. A small amount of copy/paste to create the index.html and sankey.js files, and all I have to do is copy in the sankey.csv that Spark just output for us. I’ve extended the length of the canvas that the sankey diagram is painted on to.

Appending a couple of CSV output files to sankey.csv will give us a starting point. If I reload the page (Dropbox doubles as a very handy web server for static pages if you put html files in the Public directory) you end up with something like the following.

[Image: rendered sankey diagram for two MLAs]

Okay, it’s not perfect but it’s certainly a starting point. Just imagine how it would look with all the MLA’s….. maybe later.

Conclusion

Once again I’ve rattled through some Spark and Clojure but we’re essentially reusing what we have. The D3 outputs take some experimentation and time to get right. Keep in mind if you have a lot of nodes (notice how I’m only dealing with two MLA’s at the moment) the rendering can take some time.

References

Sankey with D3 and CSV files: http://bl.ocks.org/d3noob/c9b90689c1438f57d649

Github Repo of this project: https://github.com/jasebell/niassembly-spark


NIAssembly Open Data – #opendata #ni #spark #clojure

Data From Stormont

The Northern Ireland Assembly has opened up a fair chunk of data as a web service, returning results in either XML or JSON format. And from first plays with it, it’s rather well put together.


What I’ve also learned is that the team listen; a small suggestion was implemented no sooner had they returned to work on the Monday morning. It only goes to show that the team want this to succeed.

The web service is split up in to various areas:

  • Members – current and historical MLA’s
  • Questions – written and verbal questions.
  • Organisations
  • Plenary
  • Hansard – contributions by members during plenary debate

The data is under an Open Northern Ireland Assembly Licence; you can use it as long as you provide a link back to it.

The Project

I’m going to setup a Clojure/Spark project and start processing some of this data. I want to do the following items:

  1. Load the current members data.
  2. Save the questions for each member.
  3. Load the saved questions for each member.
  4. Join the data together by the member id.
  5. Find the frequency of departments specific member questions are directed at.

Setting Up Spark

Before I can do any of that I need to set up the required Spark context so I can handle the data how I want.

(comment
 (def c (-> (conf/spark-conf)
 (conf/master "local[3]")
 (conf/app-name "niassemblydata-sparkjob")))
 (def sc (spark/spark-context c)))

The reason I put this in a comment block is that it’s there for copy/pasting in the REPL when I need it.

Loading the Members

The Members API has details of the members past and present. Right now I’m only concerned about the current ones. So the end point I want to use is:

http://data.niassembly.gov.uk/members_json.ashx?m=GetAllCurrentMembers

This returns the current members in JSON format with name, display name, organisation and so on. So that I’m not hammering the API with a request each time, I’ve downloaded the JSON contents and saved them to a file.

$ curl http://data.niassembly.gov.uk/members_json.ashx?m=GetAllCurrentMembers > members.json

Next I want to load this file in to Spark and create a pair RDD with the member id as the key and the JSON data as a map for that member as the value.

(defn load-members [sc filepath] 
 (->> (spark/whole-text-files sc filepath)
      (spark/flat-map (s-de/key-value-fn (fn [key value] 
          (-> (json/read-str value :key-fn (fn [key] (-> key
               str/lower-case
               keyword)))
               (get-in [:allmemberslist :member])))))
       (spark/map-to-pair (fn [rec] (spark/tuple (:personid rec) rec)))))

As you can’t guarantee that a JSON file is going to be neatly one object per line, which I know it isn’t in this case, I’ll use Spark’s wholeTextFiles method to load each file as a single record. This returns a pair RDD of [filename, contents-of-file]; I then iterate through each record using Clojure’s JSON library to read in each member (nested within the AllMembersList->Member JSON array), converting each key name to a Clojure map key and lower casing it.

In the Clojure REPL I can test this easily:

mlas.core> (def members (load-members sc "/Users/Jason/work/data/niassembly/members.json"))
#'mlas.core/members
mlas.core> (spark/first members)
#sparkling/tuple ["307" {:partyorganisationid "111", :memberfulldisplayname "Mr S Agnew", :membertitle "MLA - North Down", :memberimgurl "http://aims.niassembly.gov.uk/images/mla/307_s.jpg", :membersortname "AgnewSteven", :memberlastname "Agnew", :personid "307", :constituencyname "North Down", :memberprefix "Mr", :constituencyid "11", :partyname "Green Party", :membername "Agnew, Steven", :affiliationid "2482", :memberfirstname "Steven"}]

So I have a tuple of members with the memberid being the key and a map of key/values for the actual record.

Saving the Questions For Each Member

With the members safely dealt with I can now turn my attention to the questions. There is a web service within the API that will return a JSON set of questions for a given member; all I have to do is pass the member ID to the end point.

So, for example, if I want to get the questions for member ID 90 I would call the service with (copy/paste the url below into a browser to see the actual output):

http://data.niassembly.gov.uk/questions_json.ashx?m=GetQuestionsByMember&personId=90

As I want to load the questions for each member, I’m going to iterate over my pair RDD of members and use the key (as it’s the member id) to pull the data via URL with Clojure’s slurp function, then save the JSON response to disk with Clojure’s spit function.

(defn save-question-data [pair-rdd] 
 (->> pair-rdd 
      (spark/map (s-de/key-value-fn (fn [key value] 
               (spit (str questions-path key ".json") 
                   (slurp (str api-questions-by-member key))))))))

I can run this from the REPL easily but it will take a bit of time.

mlas.core>(spark/collect  (save-question-data members))

At this point all I’ve done is call the web service and save the questions to disk. I now need to load them into Spark and create another pair RDD for the questions, and I want to use the :tablerpersonid as the key for the tuple.

Loading The Question Data Into Spark

In the same way we loaded the members data as a filename/filecontent pair we are going to do the same with the question data. This time there’s a whole directory of files.

Now, as I’m in development mode, I’m making an assumption here: I’m assuming that every question set has questions in it. There are, though, some MLA’s that don’t like asking questions for one reason or another.

$ ls -l -S
-rw-r--r-- 1 Jason staff 22 21 Jul 20:18 131.json
-rw-r--r-- 1 Jason staff 22 21 Jul 20:18 5223.json
-rw-r--r-- 1 Jason staff 22 21 Jul 20:17 5225.json
-rw-r--r-- 1 Jason staff 22 21 Jul 20:19 71.json
-rw-r--r-- 1 Jason staff 22 21 Jul 20:21 95.json

So for the minute I’m going to delete those and only concentrate on members who ask questions.

I’ve written a function that will iterate over the files and load the questions for each member ID.

(defn load-questions [sc questions-path] 
 (->> (spark/whole-text-files sc questions-path)
      (spark/flat-map (s-de/key-value-fn (fn [key value] 
         (-> (json/read-str value
                  :key-fn (fn [key] 
                           (-> key 
                               str/lower-case 
                               keyword)))
             (get-in [:questionslist :question])))))
       (spark/map-to-pair (fn [rec] (spark/tuple (:tablerpersonid rec) rec)))
       (spark/group-by-key)))

Notice the spark/group-by-key, which gives us a key with a vector of maps: [k, [v1, v2, …, vn]]. If it was left out we’d have a pair RDD with lots of rows. With that loaded into Spark we can have a look at what we have.

mlas.core> (def questions (load-questions sc "/Users/Jason/work/data/niassembly/questions"))
#'mlas.core/questions
mlas.core> (spark/count questions)
103
mlas.core> (spark/first questions)
#sparkling/tuple ["104" [{:tablerpersonid "104", :departmentid "76", :questiondetails "http://data.niassembly.gov.uk/questions.asmx/GetQuestionDetails?documentId=2828", :documentid "2828", :reference "AQW 141/07", :departmentname "Department of Agriculture and Rural Development", :tableddate "2007-05-22T00:00:00+01:00", :questiontext "To ask the Minister of Agricultur...............}]]
mlas.core>

So far we have two pair RDD’s one for members and one for questions, both have the member id as the key. This is a good place to be as it means we can easily join the data.

Joining RDD Datasets

Using Spark’s join functionality we get a pair RDD with the key and then the two joined values, [key, [left-value, right-value]]. If there is a left and a right element then the join will happen; if not, that key is left out. If we were to use left-outer-join instead, the left RDD would be preserved even if the right hand side has no value that matches the key.
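
As a quick sketch, and assuming Sparkling wraps leftOuterJoin as left-outer-join in the same way it wraps join, keeping every member regardless of whether they have any questions would look like this:

;; keep all members, even those with no questions on the right hand side
(def all-member-questions-rdd (spark/left-outer-join members questions))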

mlas.core> (def member-questions-rdd (spark/join members questions))
#'mlas.core/member-questions-rdd
mlas.core> (spark/count member-questions-rdd)
103
mlas.core>

A good rule of thumb is to check whether the count of the joined data is more than the original left hand side RDD count. If it is, check for duplicate member id’s in the member RDD.
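
One way to check for duplicates, as a rough sketch (assuming Sparkling also wraps Spark’s distinct):

;; if these two counts differ, there are duplicate member ids on the left hand side
(spark/count members)
(->> members
     (spark/map (s-de/key-value-fn (fn [k v] k)))
     spark/distinct
     spark/count)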

Calculating Department Frequencies

First of all let’s discuss what we’re trying to achieve. For every member we want to see the frequency of departments that MLA’s are directing their questions at. Now we have the joined RDD with the members and questions we can run Spark to find out for us.

As Spark RDDs are immutable, you will end up chaining several Spark map steps to get to an answer. You’ve seen so far that we’ve done a map to load the members, a map to load the questions and a join to give us a [key, [member, questions]] pair RDD.

What I really want to do is refine this a bit further. One of the nice things about using Spark under Clojure is Sparkling’s destructuring support, which gives you a handy set of functions for iterating the data. So for our key/value/value pair RDD we can use s-de/key-val-val-fn, and this gives us a function with the key, the first value (member) and the second value (questions) as accessible arguments.

(defn department-frequencies-rdd [members-questions-rdd]
  (->> members-questions-rdd
       (spark/map-to-pair
        (s-de/key-val-val-fn
         (fn [key member questions]
           (let [freqmap (map (fn [question] (:departmentname question)) questions)]
             (spark/tuple key (frequencies freqmap))))))))

When I run this I get the following output:

mlas.core> (def freqs (department-frequencies-rdd members-questions-rdd))
#'mlas.core/freqs
mlas.core> (spark/first freqs)
#sparkling/tuple ["8" {"Department of Culture, Arts and Leisure" 32, "Department of the Environment" 96, "Department for Social Development" 76, "Department of Agriculture and Rural Development" 53, "Department for Employment and Learning" 40, "Department for Regional Development" 128, "Northern Ireland Assembly Commission" 18, "Department of Education" 131, "Department of Health, Social Services and Public Safety" 212, "Department of Justice " 38, "Department of Finance and Personnel" 105, "Office of the First Minister and deputy First Minister" 151, "Department of Enterprise, Trade and Investment" 66}]
mlas.core>

We could extend this a little further with some more information on the member but we’ve essentially achieved what we proposed at the start.

Concluding

This is merely scratching the surface of the NI Assembly data sets that are available. With some simple Clojure and Spark usage we’ve managed to pull the member data and questions, do a simple join and the find the frequency of departments.

Most data science is about consolidating large sets of data down to simple numbers that can be presented. Just by looking at the raw data I wouldn’t have known that a member has asked 212 questions to the Department of Health, Social Services and Public Safety.

Now I can.

You can download the source code for this project from the Github repository.

 

Using iBeacons for Airside Retail Loyalty – #airport #marketing #retail #loyalty


Confession, I Love Airport Retail Spaces

My fascination with loyalty goes way back. I’ve pushed and prodded the corners of this area and it’s enabled me to learn, and eventually work, in some very cool areas such as mobile and, more importantly to me, data science and engineering.

So I always keep an eye on things as you never know when opportunity will present itself. Over the weekend, thinking yet again about airport retail and mobile loyalty, I glanced over the editorial press releases in Airports Magazine (yes, there is a magazine for airports) and read a press release about Eye Airports partnering with Proxama to deploy 200 iBeacons in 8 UK airports.

Is This How Passengers Want to Be Treated?

I’ve done a lot of flying this year, so I’ve spent a stupid amount of time sat in departure lounges, with plenty of time to observe. There are three issues with iBeacons that have always bugged me.

  • It’s an Apple based technology
  • What’s the battery drain impact?
  • What are the privacy implications?

It’s an Apple Based Technology

Ultimately the iBeacon is an Apple creation; while the underlying technology isn’t anything more than Bluetooth 4 scanning for phones and devices, Apple are going to great lengths to cut out other vendors, so much so that Google are developing their own for Android. No retail space wants to start keeping two sets of beacon hardware just to keep two vendors happy. And if there’s anything I’ve noticed recently, it’s that a lot of travellers are using Android devices.

Battery Implications

Air travel is stressful, the main aim is to get through security relatively unscathed.


Once you’re through then stress levels drop while you enter the retail space known as the departure lounge. Even when you board your flight the stress level is far lower than getting through security.

A new element in the mix is device battery life. In the era of electronic boarding cards, preserving battery life is now a clean cut fly/can’t fly decision. Unless you can find a charging point (Newcastle does this well, at Leeds Bradford I struggled a bit, and Belfast International has one table at Starbucks that holds plug point nirvana) you will do everything to preserve battery, and the first thing to go is Bluetooth. Once this happens, every marketing play by beacons is redundant. You always need a Plan B for how you’re going to reach a customer.

Privacy Implications

Privacy will always be in the back of people’s minds when it comes to this form of technology. There’s a pseudo opt-in mechanism: you’ll need some app, developed by Proxama for example, in order to be picked up by the beacons.

But every push and pull will be recorded so the data that can be gleaned is going to be retail gold dust to those that can analyse it. And recently customers are turning away from deals, daily deals and being force fed “buy this”.

There are a couple of instances where I’d want to be pushed ads, but that’s from an operational standpoint where certain events could happen.

The Trial

Eye Airports and Proxama are on a two year contract to roll this out and see how it performs. I’ve got reservations about the number of passengers who will actually use the technology, based on the basic logic of how passengers behave in airports with mobile technology, the technology stack itself, and whether passengers want to be creeped out by being tracked.

I’ll never get to see the final metrics I’m sure, but I think it will make interesting reading to those who can.


Invest NI’s “new jobs” headlines…. how many in a lifetime?

I received a question from Boris Drakemandrillsquirrelhugger*: “Jase, you do data science, how many new jobs have Invest Northern Ireland announced in total?”.

“Bless My Cotton Socks I’m In The News”

First we need headlines and in one line of Linux we can have the whole lot.

$ for i in {1..314}; do curl http://www.investni.com/news/index.html?page=$i > news_$i.html; done

This is exactly the same as how I pulled nijobs.com data in a previous blog post. Each page is 10 headlines and there are 3138 headlines, so 314 pages will be fine. While that’s pulling all the html down you may as well get a cuppa….


Messing With The Output

The output is basically a set of html pages. You could fire up Python and BeautifulSoup parsers and anything else that takes your fancy, or just use good old command line data science.

egrep -ohi "\d+ new jobs" *.html | egrep -o "\d+" | awk '{ sum+=$1} END {print sum}'

I’m piping three Linux commands together, starting with two egreps, the first to pull out “[a number] new jobs”. The -o flag shows only the matching string from the regular expression, -i ignores the case (“New jobs” and “new jobs” would be treated differently otherwise) and -h drops the filename from the output.

58 new jobs
61 new jobs
61 new jobs
84 new jobs
84 new jobs
84 new jobs
30 new jobs
30 new jobs
10 new jobs

The second egrep just pulls out the figure.

30
30
30
40
82
82
15
300
300
23
540
540
36
125
125

And the exciting part is the awk command at the end where it adds up the stream numbers.

70758

Now that last figure is what we’re after. One caveat: any headline with a comma in the figure got ignored…. the first regexp will need tweaking…. you can play with that. So a rough estimate is to say that since June 2003 there have been over 70,000 new jobs announced in INI headlines.

The number you won’t get is how many were filled.

* The names have been changed to protect the innocent, in fact, just made up….. no one asked at all.

Taylor’s Power Law and Apple’s Small Change Moves.

Artists can command power, it’s a universal law. Madonna did it, Lady Gaga did it and now Taylor’s doing it too. Fine, but this time it didn’t go far enough.

While correctly arguing that all artists should be paid for their creativity and, so it seems, getting Apple to reverse a decision on not paying artists during the streaming trial period, Taylor hasn’t changed the fact that smaller artists still lose out in the long run.

The power law in action once again, only the top artists will make the income, the rest will scramble around the long tail.

[Image: the long tail distribution]

What should really have been discussed is the value of each stream across an entire lifetime; it falls way below anything an artist got from traditional CD sales. And while the internet has created a vast distribution network, the long term payouts aren’t that great.

Taylor should have added another paragraph about the amount of money paid to artists.

Just my tuppence.
