NIAssembly Open Data – Part 2 – Sankey Diagrams #opendata #clojure #spark #sankey

In the first part of this walkthrough I showed you how to use the excellent NI Assembly open data platform to find out how frequently members were asking questions of each department.

A picture speaks a thousand words, so they say, so it makes sense to attempt to visualise the data we’ve worked on.

What’s A Sankey Diagram?

A Sankey diagram is basically a collection of node labels joined by connections; each connection is weighted by a value, and the higher the value the thicker the connection.

sankeydemo

 

The data is based on a CSV file with a source node, target node and a value. Simple as that.
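As a rough illustration, using figures that appear later in this post, the CSV that feeds the diagram looks something like this (a header row plus one row per connection):

source,target,value
Beggs Roy,Department of Education,131
Beggs Roy,Department for Regional Development,128
Beggs Roy,Department of the Environment,96

Each row becomes one connection, with the value controlling its thickness.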

Reusing the Spark Work

In the previous post we left Spark and the data in quite a nice position: a pair RDD with the member id as the key and, as the value, a map of department names to question frequencies.

The first job is to transform this into a CSV file. As we left it we had a pair RDD of [k, v] with v being a map whose keys are department names and whose values are the frequencies. So in reality we’ve got [k, {k v, k v, k v ... k v}].

For example, let’s look at the first element in our RDD after doing a spark/collect on the Pair RDD.

mlas.core> (first dfvals)
["8" {"Department of Culture, Arts and Leisure" 32, "Department of the Environment" 96, "Department for Social Development" 76, "Department of Agriculture and Rural Development" 53, "Department for Employment and Learning" 40, "Department for Regional Development" 128, "Northern Ireland Assembly Commission" 18, "Department of Education" 131, "Department of Health, Social Services and Public Safety" 212, "Department of Justice " 38, "Department of Finance and Personnel" 105, "Office of the First Minister and deputy First Minister" 151, "Department of Enterprise, Trade and Investment" 66}]

The first element is the member id and the second element is the department/frequency map. Remember this is for one MLA; there are still 103 in the RDD altogether.

mlas.core> (first x)
"8"
mlas.core> (second x)
{"Department of Culture, Arts and Leisure" 32, "Department of the Environment" 96, "Department for Social Development" 76, "Department of Agriculture and Rural Development" 53, "Department for Employment and Learning" 40, "Department for Regional Development" 128, "Northern Ireland Assembly Commission" 18, "Department of Education" 131, "Department of Health, Social Services and Public Safety" 212, "Department of Justice " 38, "Department of Finance and Personnel" 105, "Office of the First Minister and deputy First Minister" 151, "Department of Enterprise, Trade and Investment" 66}

We can use Clojure’s map function to process each key/value pair of the map data.

mlas.core> (map (fn [[k v]] (println k " > " v)) (second x))
Department of Culture, Arts and Leisure > 32
Department of the Environment > 96
Department for Social Development > 76
Department of Agriculture and Rural Development > 53
Department for Employment and Learning > 40
Department for Regional Development > 128
Northern Ireland Assembly Commission > 18
Department of Education > 131
Department of Health, Social Services and Public Safety > 212
Department of Justice > 38
Department of Finance and Personnel > 105
Office of the First Minister and deputy First Minister > 151
Department of Enterprise, Trade and Investment > 66
(nil nil nil nil nil nil nil nil nil nil nil nil nil)
mlas.core>

A couple of things to note. First, notice the [k v] destructuring being passed in to the map function, one map entry at a time. Secondly, as I’m using the println function each call returns nil, so the overall result of the map function is a sequence of nils; that’s the last line of the output above.

So we’ve already got two thirds of the CSV output done: the target node and the value. I need to redo the Spark function so that instead of the member id being the key, the key is the name of the MLA in question.

(defn mlaname-department-frequencies-rdd [members-questions-rdd]
  (->> members-questions-rdd
       (spark/map-to-pair
        (s-de/key-val-val-fn
         (fn [key member questions]
           (let [freqmap (map (fn [question] (:departmentname question)) questions)]
             (spark/tuple (:membername member) (frequencies freqmap))))))))

When I run this function with the existing member/question Pair RDD I get a new Pair RDD with the following:

mlas.core> (def mmdep-freq (mlaname-department-frequencies-rdd mq-rdd))
#'mlas.core/mmdep-freq
mlas.core> (spark/first mmdep-freq)
#sparkling/tuple ["Beggs, Roy" {"Department of Culture, Arts and Leisure" 32, "Department of the Environment" 96, "Department for Social Development" 76, "Department of Agriculture and Rural Development" 53, "Department for Employment and Learning" 40, "Department for Regional Development" 128, "Northern Ireland Assembly Commission" 18, "Department of Education" 131, "Department of Health, Social Services and Public Safety" 212, "Department of Justice " 38, "Department of Finance and Personnel" 105, "Office of the First Minister and deputy First Minister" 151, "Department of Enterprise, Trade and Investment" 66}]

With that RDD we have all the elements for the required CSV file: a source node (the MLA’s name), a target node (the department) and a value (the frequency). Notice that I’m also removing the commas from the MLA name and the department name, otherwise I’ll break the sankey diagram when it’s rendered on screen.

(defn generate-csv-output [mddep-freq]
  (->> mddep-freq 
       (spark/map (s-de/key-value-fn (fn [k v] (let [mlaname k]
           (map (fn [[department frequency]]
                        [(str/replace mlaname #"," "")
                         (str/replace department #"," "")
                         frequency]) v)))))
       (spark/collect)))

And then a function to write the actual CSV file.

(defn write-csv-file [filepath data]
 (with-open [out-file (io/writer (str filepath "sankey.csv"))]
 (csv/write-csv out-file data)))

To test I’m just going to write out the first MLA in the vector.

mlas.core> (def csv-to-output (generate-csv-output mmdep-freq))
#'mlas.core/csv-to-output
mlas.core> (first csv-to-output)
(["Beggs, Roy" "Department of Culture, Arts and Leisure" 32] ["Beggs, Roy" "Department of the Environment" 96] ["Beggs, Roy" "Department for Social Development" 76] ["Beggs, Roy" "Department of Agriculture and Rural Development" 53] ["Beggs, Roy" "Department for Employment and Learning" 40] ["Beggs, Roy" "Department for Regional Development" 128] ["Beggs, Roy" "Northern Ireland Assembly Commission" 18] ["Beggs, Roy" "Department of Education" 131] ["Beggs, Roy" "Department of Health, Social Services and Public Safety" 212] ["Beggs, Roy" "Department of Justice " 38] ["Beggs, Roy" "Department of Finance and Personnel" 105] ["Beggs, Roy" "Office of the First Minister and deputy First Minister" 151] ["Beggs, Roy" "Department of Enterprise, Trade and Investment" 66])
mlas.core> (write-csv-file "/Users/Jason/Desktop/" (first csv-to-output))
nil
mlas.core>

So far so good: checking on my desktop, there’s a CSV file ready for me to use. I just need to add the header (source,target,value) to the top line. In all honesty I should really insert that header row at the start of the data before writing it out.
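A minimal sketch of doing just that, assuming the csv-to-output and write-csv-file definitions from above: cons the header row onto the data before it’s written.

;; Prepend the header row so the file is ready for D3 without any hand editing.
(write-csv-file "/Users/Jason/Desktop/"
                (cons ["source" "target" "value"] (first csv-to-output)))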

Creating The Sankey Diagram

Where possible it’s best to learn from example and in all honesty I’m not a visualisation kinda guy. So when the going gets tough, the tough Google D3 examples.

So there’s a handy example of a Sankey diagram built from a CSV file that I can use. A small amount of copy/paste to create the index.html and sankey.js files, and all I have to do is copy in the sankey.csv that Spark just output for us. I’ve extended the size of the canvas that the sankey diagram is painted on to.

Appending a couple of CSV output files to sankey.csv will give us a starting point. If I reload the page (Dropbox doubles as a very handy web server for static pages if you put html files in the Public directory) you end up with something like the following.

sankey2

 

Okay, it’s not perfect but it’s certainly a starting point. Just imagine how it would look with all the MLAs….. maybe later.

Conclusion

Once again I’ve rattled through some Spark and Clojure but we’re essentially reusing what we have. The D3 outputs take some experimentation and time to get right. Keep in mind that if you have a lot of nodes (notice how I’m only dealing with two MLAs at the moment) the rendering can take some time.

References

Sankey with D3 and CSV files: http://bl.ocks.org/d3noob/c9b90689c1438f57d649

Github Repo of this project: https://github.com/jasebell/niassembly-spark

 

 

NIAssembly Open Data – #opendata #ni #spark #clojure

Data From Stormont

The Northern Ireland Assembly has opened up a fair chunk of data as a web service, returning results in either XML or JSON format. And from first plays with it, it’s rather well put together.

StormontChamber

What I’ve also learned is that the team listen: a small suggestion was implemented no sooner had they returned to work on the Monday morning. It only goes to show that the team want this to succeed.

The web service is split up in to various areas:

  • Members – current and historical MLA’s
  • Questions – written and verbal questions.
  • Organisations
  • Plenary
  • Hansard – contributions by members during plenary debate

The data is released under an Open Northern Ireland Assembly Licence, so you’re free to use it as long as you provide a link back to it.

The Project

I’m going to setup a Clojure/Spark project and start processing some of this data. I want to do the following items:

  1. Load the current members data.
  2. Save the questions for each member.
  3. Load the saved questions for each member.
  4. Join the data together by the member id.
  5. Find the frequency of departments specific member questions are directed at.

Setting Up Spark

Before I can do any of that I need to set up the required Spark context so I can handle the data how I want.

(comment
 (def c (-> (conf/spark-conf)
 (conf/master "local[3]")
 (conf/app-name "niassemblydata-sparkjob")))
 (def sc (spark/spark-context c)))

The reason I put this in a comment block is that it’s there for copy/pasting in the REPL when I need it.
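For reference, the namespace aliases assumed in the snippets below look something like this; it’s a sketch, the exact ns form is in the Github repo.

;; Hypothetical ns declaration showing the aliases used in this post.
(ns mlas.core
  (:require [sparkling.core :as spark]
            [sparkling.conf :as conf]
            [sparkling.destructuring :as s-de]
            [clojure.data.json :as json]
            [clojure.string :as str]))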

Loading the Members

The Members API has details of the members past and present. Right now I’m only concerned about the current ones. So the end point I want to use is:

http://data.niassembly.gov.uk/members_json.ashx?m=GetAllCurrentMembers

This returns the current members in JSON format with name, display name, organisation and so on. So that I’m not hammering the API with a request each time, I’ve downloaded the JSON contents and saved them to a file.

$ curl http://data.niassembly.gov.uk/members_json.ashx?m=GetAllCurrentMembers > members.json

Next I want to load this file in to Spark and create a pair RDD with the member id as the key and the JSON data as a map for that member as the value.

(defn load-members [sc filepath] 
 (->> (spark/whole-text-files sc filepath)
      (spark/flat-map (s-de/key-value-fn (fn [key value] 
          (-> (json/read-str value :key-fn (fn [key] (-> key
               str/lower-case
               keyword)))
               (get-in [:allmemberslist :member])))))
       (spark/map-to-pair (fn [rec] (spark/tuple (:personid rec) rec)))))

As you can’t guarantee that a JSON file is going to be neatly one object per line (which I know it isn’t in this case), I’ll use Spark’s wholeTextFiles method to load each file in whole. This returns a pair RDD of [filename, contents-of-file]. I then iterate through each element, using Clojure’s JSON library to read in each member (nested within the AllMembersList->Member JSON array) and converting each key name to a lower case Clojure keyword.

In the Clojure REPL I can test this easily:

mlas.core> (def members (load-members sc "/Users/Jason/work/data/niassembly/members.json"))
#'mlas.core/members
mlas.core> (spark/first members)
#sparkling/tuple ["307" {:partyorganisationid "111", :memberfulldisplayname "Mr S Agnew", :membertitle "MLA - North Down", :memberimgurl "http://aims.niassembly.gov.uk/images/mla/307_s.jpg", :membersortname "AgnewSteven", :memberlastname "Agnew", :personid "307", :constituencyname "North Down", :memberprefix "Mr", :constituencyid "11", :partyname "Green Party", :membername "Agnew, Steven", :affiliationid "2482", :memberfirstname "Steven"}]

So I have a tuple of members with the memberid being the key and a map of key/values for the actual record.

Saving the Questions For Each Member

With the members safely dealt with I can now turn my attention to the questions. There is a web service within the API that will return a JSON set of questions for a given member; all I have to do is pass the member ID as a value on the end point.

So, for example, if I want to get the questions for member ID 90 I would call the service with (copy/paste the url below in to browser to see the actual output):

http://data.niassembly.gov.uk/questions_json.ashx?m=GetQuestionsByMember&personId=90

As I want to load the questions for each member, I’m going to iterate my pair RDD of members and use the key (as it’s the member id) to pull the data via URL with Clojure’s slurp function, then save the JSON response to disk with Clojure’s spit function.

(defn save-question-data [pair-rdd] 
 (->> pair-rdd 
      (spark/map (s-de/key-value-fn (fn [key value] 
               (spit (str questions-path key ".json") 
                   (slurp (str api-questions-by-member key))))))))

I can run this from the REPL easily but it will take a bit of time.

mlas.core>(spark/collect  (save-question-data members))

At this point all I’ve done is call the web service and save the questions to disk. I now need to load them in to Spark and create another pair RDD for each question and I want to use the :tablerpersonid as the key for the tuple.

Loading The Question Data Into Spark

In the same way we loaded the members data as a filename/filecontent pair we are going to do the same with the question data. This time there’s a whole directory of files.

Now, as I’m in development mode, I’m making an assumption here: I’m assuming that every question set has questions in it. There are, though, some MLAs that don’t like asking questions for one reason or another.

$ ls -l -S
-rw-r--r-- 1 Jason staff 22 21 Jul 20:18 131.json
-rw-r--r-- 1 Jason staff 22 21 Jul 20:18 5223.json
-rw-r--r-- 1 Jason staff 22 21 Jul 20:17 5225.json
-rw-r--r-- 1 Jason staff 22 21 Jul 20:19 71.json
-rw-r--r-- 1 Jason staff 22 21 Jul 20:21 95.json

So for the minute I’m going to delete those and concentrate only on members who ask questions.

I’ve written a function that will iterate each member and load the questions file for that member ID.

(defn load-questions [sc questions-path] 
 (->> (spark/whole-text-files sc questions-path)
      (spark/flat-map (s-de/key-value-fn (fn [key value] 
         (-> (json/read-str value
                  :key-fn (fn [key] 
                           (-> key 
                               str/lower-case 
                               keyword)))
             (get-in [:questionslist :question])))))
       (spark/map-to-pair (fn [rec] (spark/tuple (:tablerpersonid rec) rec)))
       (spark/group-by-key)))

Notice the spark/group-by-key, which gives us a key with a vector of maps: [k, [v1, v2 ... vn]]. If it was left out we’d have a pair RDD with lots of rows, one per question. With that loaded into Spark we can have a look at what we have.

mlas.core> (def questions (load-questions sc "/Users/Jason/work/data/niassembly/questions"))
#'mlas.core/questions
mlas.core> (spark/count questions)
103
mlas.core> (spark/first questions)
#sparkling/tuple ["104" [{:tablerpersonid "104", :departmentid "76", :questiondetails "http://data.niassembly.gov.uk/questions.asmx/GetQuestionDetails?documentId=2828", :documentid "2828", :reference "AQW 141/07", :departmentname "Department of Agriculture and Rural Development", :tableddate "2007-05-22T00:00:00+01:00", :questiontext "To ask the Minister of Agricultur...............}]]
mlas.core>

So far we have two pair RDDs, one for members and one for questions, and both have the member id as the key. This is a good place to be as it means we can easily join the data.

Joining RDD Datasets

Using Spark’s join functionality we get a pair RDD with the key and then the two joined values, [key, [left-rdd, right-rdd]]. If there is both a left and a right element for a key then the join will happen; if not then that key is left out. If we were to use left-outer-join instead, the left RDD would be preserved even if the right hand side has no value that matches the key.
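As a sketch, that alternative would look like this (not used here, since the members without questions were removed earlier):

;; Hypothetical: keep every member even when there are no matching questions.
(def all-member-questions-rdd (spark/left-outer-join members questions))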

mlas.core> (def member-questions-rdd (spark/join members questions))
#'mlas.core/member-questions-rdd
mlas.core> (spark/count member-questions-rdd)
103
mlas.core>

A good rule of thumb is to check whether the count of the joined data is more than the original left hand side RDD count. If it is, check for duplicate member ids in the members RDD.
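One way to sanity check that from the REPL, as a sketch assuming the members RDD from earlier:

;; If the distinct key count is lower than the straight count there are duplicate member ids.
(= (spark/count members)
   (->> members
        (spark/map (s-de/key-value-fn (fn [k _] k)))
        spark/distinct
        spark/count))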

Calculating Department Frequencies

First of all let’s discuss what we’re trying to achieve. For every member we want to see the frequency of departments that MLA’s are directing their questions at. Now we have the joined RDD with the members and questions we can run Spark to find out for us.

As Spark RDDs are immutable you will end up doing several Spark map stages to get to an answer. You’ve seen so far that we’ve done a map to load the members, a map to load the questions and a join to give us a [key, [member, questions]] pair RDD.

What I really want to do is refine this a bit further. One of the nice things with using Spark under Clojure is Sparkling Destructuring which gives you a handy set of functions for dealing with iterating the data. So for our key/value/value pair RDD we can use s-de/key-val-val-fn and this will give us an iterable function with the key, the first value (member) and the second value (questions) as accessible values.

(defn department-frequencies-rdd [members-questions-rdd]
  (->> members-questions-rdd
       (spark/map-to-pair
        (s-de/key-val-val-fn
         (fn [key member questions]
           (let [freqmap (map (fn [question] (:departmentname question)) questions)]
             (spark/tuple key (frequencies freqmap))))))))

When I run this I get the following output:

mlas.core> (def freqs (department-frequencies-rdd members-questions-rdd))
#'mlas.core/freqs
mlas.core> (spark/first freqs)
#sparkling/tuple ["8" {"Department of Culture, Arts and Leisure" 32, "Department of the Environment" 96, "Department for Social Development" 76, "Department of Agriculture and Rural Development" 53, "Department for Employment and Learning" 40, "Department for Regional Development" 128, "Northern Ireland Assembly Commission" 18, "Department of Education" 131, "Department of Health, Social Services and Public Safety" 212, "Department of Justice " 38, "Department of Finance and Personnel" 105, "Office of the First Minister and deputy First Minister" 151, "Department of Enterprise, Trade and Investment" 66}]
mlas.core>

We could extend this a little further with some more information on the member but we’ve essentially achieved what we proposed at the start.

Concluding

This is merely scratching the surface of the NI Assembly data sets that are available. With some simple Clojure and Spark usage we’ve managed to pull the member data and questions, do a simple join and then find the frequency of departments.

Most data science is about consolidating large sets of data down to simple numbers that can be presented. Just by looking at the raw data I wouldn’t have known that a member had asked 212 questions of the Department of Health, Social Services and Public Safety.

Now I can.

You can download the source code for this project from the Github repository.

 

Using iBeacons for Airside Retail Loyalty – #airport #marketing #retail #loyalty

6-image

 

Confession, I Love Airport Retail Spaces

My fascination with loyalty goes way back. I’ve pushed and prodded the corners of this area and it’s enabled me to learn and eventually work in some very cool areas such as mobile and, more importantly to me, data science and engineering.

So I always keep an eye on things as you never know when opportunity will present itself. Over the weekend yet again I was thinking about airport retail and mobile loyalty and I glanced over the editorial press releases in Airports Magazine (yes there is a magazine for airports) and I read a press release from Eye Airports partnering with Proxama to deploy 200 iBeacons in 8 UK airports.

Is This How Passengers Want to Be Treated?

I’ve done a lot of flying this year so I’ve spent a stupid amount of time sat in departure lounges, so plenty of time to observe. There are three issues with iBeacons that have always bugged me.

  • It’s an Apple based technology
  • What’s the battery drain impact?
  • What are the privacy implications?

It’s an Apple Based Technology

Ultimately the iBeacon is an Apple creation. While the underlying technology isn’t anything more than Bluetooth 4 scanning for phones and devices, Apple are going to great lengths to cut out other vendors, so much so that Google are developing their own for Android. Now no retail space wants to start keeping two sets of beacon devices just to keep two vendors happy. If there’s anything I’ve noticed recently it’s that there are a lot of Android devices being used by travellers.

Battery Implications

Air travel is stressful, the main aim is to get through security relatively unscathed.

pragmaimage3

Once you’re through then stress levels drop while you enter the retail space known as the departure lounge. Even when you board your flight the stress level is far lower than getting through security.

A new element in the mix is that of device battery life. In the era of electronic boarding cards, preserving battery life is now a clean cut fly/can’t fly decision. Unless you can find a charging point (Newcastle does this well, at Leeds Bradford I struggled a bit, and Belfast International has one table at Starbucks that holds plug point nirvana) you will do everything to preserve battery, and first things first is turning Bluetooth off. Once this happens every marketing play by beacons is redundant. You always need a Plan B for how you’re going to reach a customer.

Privacy Implications

Privacy will always be in the back of people’s minds when it comes to this form of technology. There’s a pseudo opt-in mechanism, you’ll need some app developed by Proxama for example in order to be picked up by the beacons.

But every push and pull will be recorded so the data that can be gleaned is going to be retail gold dust to those that can analyse it. And recently customers are turning away from deals, daily deals and being force fed “buy this”.

There are a couple of instances where I’d want ads pushed at me, but that’s from an operational standpoint where certain events could happen.

The Trial

Eye Airports and Proxama are on a two year contract to roll this out and see how it performs. I’ve got reservations about the number of passengers that will actually use the technology, based on the basic logic of how passengers behave in airports with mobile technology, the technology stack itself, and whether passengers want to be creeped out by being tracked.

I’ll never get to see the final metrics I’m sure, but I think it will make interesting reading to those who can.

 

 

 

 

Invest NI’s “new jobs” headlines…. how many in a lifetime?

I received a question for Boris Drakemandrillsquirrelhugger*, “Jase, you do data science, how many new jobs have Invest Northern Ireland announced in total?”.

“Bless My Cotton Socks I’m In The News”

First we need headlines and in one line of Linux we can have the whole lot.

$ for i in {1..314}; do curl http://www.investni.com/news/index.html?page=$i > news_$i.html; done

This is exactly the same as how I pulled nijobs.com data in a previous blog post. Each page is 10 headlines and there are 3,138 headlines, so 314 pages will be fine. While that’s pulling all the HTML down you may as well get a cuppa….

1950s-woman-smiling-holding-platter-of-hors-d-oeuvres-snacks

Messing With The Output

The output is basically html pages. You could fire up Python and BeautifulSoup parsers and anything else that takes your fancy, or just use good old command line data science.

egrep -ohi "\d+ new jobs" *.html | egrep -o "\d+" | awk '{ sum+=$1} END {print sum}'

I’m piping together three Linux commands. The two egreps come first: the first pulls out “[a number] new jobs”. The -o flag shows only the matching string from the regular expression, -i ignores case (“New jobs” and “new jobs” would otherwise be treated as different) and -h drops the filename from the output.

58 new jobs
61 new jobs
61 new jobs
84 new jobs
84 new jobs
84 new jobs
30 new jobs
30 new jobs
10 new jobs

The second egrep pulls out just the figure.

30
30
30
40
82
82
15
300
300
23
540
540
36
125
125

And the exciting part is the awk command at the end, which adds up the numbers in the stream.

70758

Now that last figure is what we’re after. One caveat to that, any headline with a comma in the figure got ignored…. the first regexp will need tweaking…. you can play with that. So a rough estimate is to say that since June 2003 there have been over 70,000 new jobs announced in INI headlines.

The number you won’t get is how many were filled.

* The names have been changed to protect the innocent, in fact, just made up….. no one asked at all.

Taylor’s Power Law and Apple’s Small Change Moves.

Artists can command power, it’s a universal law. Madonna did it, Lady Gaga did it and now Taylor’s doing it too. Fine, but this time it didn’t go far enough.

swift14f-1-web

While correctly arguing that all artists should be paid for their creativity and, so it seems, getting Apple to reverse a decision on not paying artists for the streaming trial period, smaller artists still lose out in the long run.

The power law in action once again, only the top artists will make the income, the rest will scramble around the long tail.

300px-Long_tail.svg

What should have really been discussed is the value for each stream across the entire lifetime. It falls way below anything that an artist got in traditional CD sales. And while the internet has created the vast distribution network the long term payouts aren’t that great.

Taylor should have added another paragraph about the amount of money paid to artists.

Just my tuppence.

 

Processing JSON with Sparkling – #sparkling #spark #bigdata #clojure

Spark-logo-192x100px

While many developers crave the loveliness and simplicity of JSON data, it can come with its own set of problems. This is very true when using tools like Spark for consuming data, as you cannot guarantee that one line of a text file contains one complete JSON object for processing. Resilient Distributed Datasets (RDDs) can never be trusted to be complete for processing.

For many Spark is becoming the data processing engine of choice. While the support is based around Scala, Python and Java there are other languages getting their own support too.  I’m pretty much 100% using Clojure now for doing big data work and the Sparkling project is excellent for getting Spark working under Clojure.

Spark has JSON support under the SparkSQL library but this involves loading in JSON data and assuming it as a table for queries. I’m not after that…

Normally you would load data into Spark (in Clojure) like this:

(spark/text-file sc "/Path/To/My/Files")

This will load text into RDD blocks which can make JSON parsing difficult as you can’t assume that all JSON objects are going to be equal and nicely placed on one line.

Spark does have a function called wholeTextFiles which will load a single file, or a directory full of text files, using the file path/URL as the key and the file contents as the value. This functionality has now been included in Sparkling 1.2.2.

(spark/whole-text-files sc "/Path/To/My/Files" 4)

This loads each text file into its own element of the pair RDD, so you end up with a JavaPairRDD with the key being the file path. With Sparkling destructuring you can map through the files easily. So to load the files in, parse the JSON and set the keys up (converting to lower case for tidiness) you end up with something like this:

(->> (spark/whole-text-files sc filepath 4)
     (spark/map (s-de/key-value-fn (fn [k v]
       (-> v
           (clojure.data.json/read-str
            :key-fn (fn [key]
                      (-> key
                          str/lower-case
                          keyword))))))))

Obviously, with large JSON files each going into a single RDD element, the processing can take some time, so be careful with huge files on a single cluster.

Cassandra Invalid Token for Murmur3Partitioner Problems. #cassandra

If you are manually booting an Apache Cassandra server and you get the following message:

Fatal configuration error; unable to start server. See log for stacktrace.
 INFO 21:29:27,638 Announcing shutdown
ERROR 21:29:27,639 Exception in thread Thread[StorageServiceShutdownHook,5,main]
java.lang.IllegalArgumentException: Invalid token for Murmur3Partitioner. Got true but expected a long value (unsigned 8 bytes integer).
 at org.apache.cassandra.dht.Murmur3Partitioner$1.fromString(Murmur3Partitioner.java:190)
 at org.apache.cassandra.service.StorageService.getTokensFor(StorageService.java:1456)
 at org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:1518)
 at org.apache.cassandra.service.StorageService.onChange(StorageService.java:1355)
 at org.apache.cassandra.gms.Gossiper.doOnChangeNotifications(Gossiper.java:1145)
 at org.apache.cassandra.gms.Gossiper.addLocalApplicationState(Gossiper.java:1374)
 at org.apache.cassandra.gms.Gossiper.stop(Gossiper.java:1400)
 at org.apache.cassandra.service.StorageService$1.runMayThrow(StorageService.java:584)
 at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
 at java.lang.Thread.run(Thread.java:724)

No need to panic. Chances are the server is still running, it’s just a case of killing the process running it and starting the server up again.

Sounds daft but it saves Googling around trying to find the answer….

 

Harvesting Data Collaboration in Northern Ireland #data #startups #prediction

There’s plenty of talk about data, analytics, Big Data, artificial intelligence, deep learning and so on. Nerdy conversations that tend to keep the geeks, the marketing department and the press release writers happy but the rest of the population completely cold.

Who’s the Real Data Audience?

Let’s remind ourselves where the rest of the population actually are.

normal_distribution_500x263

Can you guess? A hint for you: they reside within two standard deviations of the average and make up the majority of people.

All the talk of open data, with developers taking to social media to press the likes of Translink for their god given right to open data, is all very well, but it doesn’t resonate with the key stakeholders…. the public, the businesses and the day to day humdrum of Northern Ireland.

There’s excellent work going on with the open data initiatives from DETI and other interested parties. Progress may be slow but I’d expect it to be slow (with expected public service cuts don’t expect Translink data to be high on anyone’s list). I know the tech heads are itchy to do things and hackathons are happening (the Urban Hackathon is coming up at the end of June). The real questions are these: Does it resonate with the public? Where’s the win win? What’s in it for them?

With all the talk of opening data up (“do you do open data? everyone does open data”) there’s little talk of the potential data collaboration between small and medium enterprises (SMEs) in the province. Does it matter? Yes, of course it does.

A Simple Collaborative Example

Let’s take a hotel: Hastings Hotels operate a number of locations in the province. Is it possible to predict room rates 40 days out based on certain factors? Of course it is, there’s last year’s bookings and repeat visitors. That’s looking back though; I’m more interested in predicting forward. Assuming occupancy is at 80%, what will it take to hit 90%?

Now I could rest on laurels and assume that Game of Thrones is going to push up the numbers, stick my finger in the air and see which way the wind is blowing.

Even better though would be to take a feed from somewhere that has plenty of rich event data, large scale events and smaller ones in the area. With a feed of dates and event types you could calculate the peak nights of occupancy. Data from What’s On NI (http://www.whatsonni.com) is about as rich as it gets: local events, big events and major events all get listed. That data has value.

So the question is: Taking a feed from whatsonni.com can I (Hastings) calculate room rates for the next 40 days based on peak event data?

I believe it’s possible and a win for both parties, whatsonni.com could gain revenue from the feed for each of the hotels and if Hastings could raise peak pricing even by 7-10% on an average time of year room rate, the multiples involved would be a big win for them.

Concluding

I’ve dreamt up one example, a simple but highly effective one. It’s an easy sell to both parties, “I’d love both of you to increase revenue by collaborating, let’s do a trial for six months”. Now have a think about all the other businesses out there, data interconnecting and collaborating with each other. A series of paid for end points where everyone else could potentially benefit. This sort of thinking will raise NI’s bottom line and it’s all possible.

It’s also a perfect fit for proof of concept grants, where there is a solid basis of potential to see real benefit in all business sectors, not just development of what I would consider limited use mobile applications.

You’ll still need the help from the nerds, there’ll always be a need.

 

Sliding Window calculations in #Clojure

For time series calculations the sliding window is a tool for applying some calculation against the numbers in incremental stages.

This could be calculating the average temperature across a series of readings, or a heart rate, or something similar.

A set of numbers…. here you are in your REPL.

user> (def readings [1 3 4 5 3 4 2 6 5 4 3 5 7 8])
#'user/readings

The partition function will split those numbers up into a sequence of sequences. This is effectively your set of sliding windows.

user> (partition 3 1 readings)
((1 3 4) (3 4 5) (4 5 3) (5 3 4) (3 4 2) (4 2 6) (2 6 5) (6 5 4) (5 4 3) (4 3 5) (3 5 7) (5 7 8))

You can see the partition function has created a set for the first three numbers, then stepped one number to the right and created another set. It does that for the entire sequence of numbers you supply it.

Perhaps you want to calculate the average of each set of numbers. You can now apply a map function to work on each set the partition function has given you.

user> (map (fn [window] (double (/ (apply + window) (count window)))) (partition 3 1 readings))

(2.666666666666667 4.0 4.0 4.0 3.0 4.0 4.333333333333333 5.0 4.0 4.0 5.0 6.666666666666667)

Handy for monitoring internet of things readings and getting the average. Actually there are loads of uses once you start thinking of the possibilities.
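If you find yourself doing this a lot it’s easy to wrap up; here’s a small sketch of a reusable helper built on the same partition/map idea.

;; Moving average over any window size.
(defn moving-average [window-size readings]
  (map (fn [window] (double (/ (apply + window) (count window))))
       (partition window-size 1 readings)))

;; (moving-average 3 readings) gives the same sequence as above.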

Myth Busting Growth Figures….

Let’s consider this really quickly…..

tweetmadmen

I respect this sort of thing being tweeted by @naomhs; on the whole it’s actually a good piece on taking the social media advantage, increasing eyeballs and digital engagement. That’s fine.

Anything with a percentage sign is like catnip to me though, especially when it’s about growth. In fact any press release where an organisation claims to have n% growth gets my attention because I’m always looking for two things, a starting number and an ending number.

For example:

Clicks last month = 10

Clicks this month = 100

Growth is ((this month – last month) / last month) * 100.

So ((100 -10) / 10) * 100 = 900% = DRAFT A PRESS RELEASE!

Even if last months figure was 1 and this month is 10, it’s still a 900% growth rate! DRAFT ANOTHER PRESS RELEASE WITH KITTENS THIS TIME!
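The same calculation as a throwaway Clojure function, if you want to sanity check press releases from the REPL (a sketch):

;; Percentage growth from one period to the next.
(defn growth-pct [last-month this-month]
  (* 100.0 (/ (- this-month last-month) last-month)))

;; (growth-pct 10 100) => 900.0
;; (growth-pct 1 10)   => 900.0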

 
