In the first part of this walkthrough I showed you how to use the excellent NI Assembly open data platform to find out how frequently members asked questions of each department.

A picture speaks a thousand words, so they say, so it makes sense to attempt to visualise the data we’ve worked on.

What’s A Sankey Diagram?

A Sankey diagram is basically a collection of labelled nodes joined by connections. Each connection is weighted by a value: the higher the value, the thicker the connection.

[Image: sankeydemo]

The data is based on a CSV file with a source node, target node and a value. Simple as that.
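To make the file shape concrete, here is a minimal sketch in plain Clojure that renders a few rows into that layout. The member and department names are made up for illustration, and `csv-lines` is a hypothetical helper rather than anything from the project.

```clojure
(require '[clojure.string :as str])

;; Hypothetical links: each row is [source target value].
(def links
  [["Member A" "Department X" 3]
   ["Member A" "Department Y" 1]
   ["Member B" "Department X" 2]])

(defn csv-lines
  "Renders a header line plus one comma-separated line per link.
  No quoting is done, so fields must not contain commas."
  [rows]
  (str/join "\n" (cons "source,target,value"
                       (map #(str/join "," %) rows))))

(println (csv-lines links))
```

Each line after the header describes one connection, and the third column is what decides how thick that connection is drawn.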

Reusing the Spark Work

In the previous post we left Spark and the data in quite a nice position: a Pair RDD with the member id as the key and a map of department names to question frequencies as the value.

The first job is to transform this into a CSV file. As we left it, each element of the Pair RDD is [k, v], with v being a map whose keys are department names and whose values are frequencies. So in reality we’ve got [k, {k v, k v, k v … k v}].

For example, let’s look at the first element in our RDD after doing a spark/collect on the Pair RDD.

mlas.core> (first dfvals)
["8" {"Department of Culture, Arts and Leisure" 32, "Department of the Environment" 96, "Department for Social Development" 76, "Department of Agriculture and Rural Development" 53, "Department for Employment and Learning" 40, "Department for Regional Development" 128, "Northern Ireland Assembly Commission" 18, "Department of Education" 131, "Department of Health, Social Services and Public Safety" 212, "Department of Justice " 38, "Department of Finance and Personnel" 105, "Office of the First Minister and deputy First Minister" 151, "Department of Enterprise, Trade and Investment" 66}]

The first element is the member id and the second element is the department/frequency map. Remember this is for one MLA; there are 103 in the RDD altogether. In the snippets below, x is that first element of the collected RDD.

mlas.core> (first x)
"8"
mlas.core> (second x)
{"Department of Culture, Arts and Leisure" 32, "Department of the Environment" 96, "Department for Social Development" 76, "Department of Agriculture and Rural Development" 53, "Department for Employment and Learning" 40, "Department for Regional Development" 128, "Northern Ireland Assembly Commission" 18, "Department of Education" 131, "Department of Health, Social Services and Public Safety" 212, "Department of Justice " 38, "Department of Finance and Personnel" 105, "Office of the First Minister and deputy First Minister" 151, "Department of Enterprise, Trade and Investment" 66}

We can use Clojure’s map function to process each key/value pair of the map data.

mlas.core> (map (fn [[k v]] (println k " > " v)) (second x))
Department of Culture, Arts and Leisure > 32
Department of the Environment > 96
Department for Social Development > 76
Department of Agriculture and Rural Development > 53
Department for Employment and Learning > 40
Department for Regional Development > 128
Northern Ireland Assembly Commission > 18
Department of Education > 131
Department of Health, Social Services and Public Safety > 212
Department of Justice > 38
Department of Finance and Personnel > 105
Office of the First Minister and deputy First Minister > 151
Department of Enterprise, Trade and Investment > 66
(nil nil nil nil nil nil nil nil nil nil nil nil nil)
mlas.core>

A couple of things to note. First, the anonymous function destructures each map entry with [[k v]], so the key and the value arrive as separate names. Second, println returns nil, so the result of the map function is a sequence of nils; that is the last line of the output.
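As a small sketch of the same destructuring outside the REPL session, the snippet below returns data instead of printing it. It uses a cut-down map with two of the values from the output above; `dept-freqs` and `pairs` are names chosen just for this example.

```clojure
;; A cut-down department/frequency map using two values from the output above.
(def dept-freqs
  {"Department of Education" 131
   "Department of the Environment" 96})

;; [[k v]] destructures each map entry into its key and value.
;; Returning a vector instead of calling println keeps the data.
(def pairs
  (mapv (fn [[k v]] [k v]) dept-freqs))

;; doseq is the usual choice when only the printing side effect is wanted;
;; it returns nil once rather than a sequence of nils.
(doseq [[k v] dept-freqs]
  (println k ">" v))
```

Using mapv here also forces the work straight away, whereas map is lazy and only prints at the REPL because the REPL realises the sequence to display it.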

The department/frequency map already gives us two thirds of the CSV output: the target node and the value. For the source node I need to redo the Spark function so that the key is the name of the MLA in question rather than the member id.

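(defn mlaname-department-frequencies-rdd [members-questions-rdd]
  (->> members-questions-rdd
       (spark/map-to-pair
        (s-de/key-val-val-fn
         (fn [key member questions]
           ;; collect the department name of every question this member asked
           (let [departments (map (fn [question] (:departmentname question)) questions)]
             ;; key the tuple on the MLA's name rather than the member id
             (spark/tuple (:membername member) (frequencies departments))))))))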

When I run this function with the existing member/question Pair RDD I get a new Pair RDD with the following:

mlas.core> (def mmdep-freq (mlaname-department-frequencies-rdd mq-rdd))
#'mlas.core/mmdep-freq
mlas.core> (spark/first mmdep-freq)
#sparkling/tuple ["Beggs, Roy" {"Department of Culture, Arts and Leisure" 32, "Department of the Environment" 96, "Department for Social Development" 76, "Department of Agriculture and Rural Development" 53, "Department for Employment and Learning" 40, "Department for Regional Development" 128, "Northern Ireland Assembly Commission" 18, "Department of Education" 131, "Department of Health, Social Services and Public Safety" 212, "Department of Justice " 38, "Department of Finance and Personnel" 105, "Office of the First Minister and deputy First Minister" 151, "Department of Enterprise, Trade and Investment" 66}]
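As an aside, the counting step inside that function can be tried without Spark. The sketch below assumes question records are maps with a :departmentname key, as in the function above; the three records themselves are invented for the example.

```clojure
;; Hypothetical question records; only :departmentname matters here.
(def questions
  [{:departmentname "Department of Education"}
   {:departmentname "Department of Justice"}
   {:departmentname "Department of Education"}])

;; A keyword is itself a function, so (map :departmentname ...) is
;; equivalent to (map (fn [q] (:departmentname q)) ...).
(def dept-counts
  (frequencies (map :departmentname questions)))
;; => {"Department of Education" 2, "Department of Justice" 1}
```

frequencies is part of clojure.core and returns a map from each distinct item to the number of times it appears, which is exactly the shape of the second element in the tuple above.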

With that RDD we have all the elements for the required CSV file: a source node (the MLA’s name), a target node (the department) and a value (the frequency). Notice that in the function below I’m also removing the commas from the MLA name and the department name, otherwise I’ll break the Sankey diagram when it’s rendered on screen.

(defn generate-csv-output [mddep-freq]
  (->> mddep-freq 
       (spark/map (s-de/key-value-fn (fn [k v] (let [mlaname k]
           (map (fn [[department frequency]]
                        [(str/replace mlaname #"," "")
                         (str/replace department #"," "")
                         frequency]) v)))))
       (spark/collect)))
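The row-building part of that function can be checked on a single name/map pair without Spark. This is only a sketch: `strip-commas` and `mla->rows` are hypothetical helper names, and the input is trimmed down to one department.

```clojure
(require '[clojure.string :as str])

(defn strip-commas
  "Removes every comma from s."
  [s]
  (str/replace s #"," ""))

(defn mla->rows
  "Turns one [name dept-freq-map] pair into [source target value] rows."
  [[mlaname dept-freqs]]
  (mapv (fn [[department frequency]]
          [(strip-commas mlaname) (strip-commas department) frequency])
        dept-freqs))

(mla->rows ["Beggs, Roy" {"Department of Culture, Arts and Leisure" 32}])
;; => [["Beggs Roy" "Department of Culture Arts and Leisure" 32]]
```

Testing the transformation on plain data like this is a quick way to confirm the commas really are gone before running it over the whole RDD.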

And then a function to write the actual CSV file.

(defn write-csv-file [filepath data]
 (with-open [out-file (io/writer (str filepath "sankey.csv"))]
 (csv/write-csv out-file data)))

To test I’m just going to write out the first MLA in the vector.

mlas.core> (def csv-to-output (generate-csv-output mldep-rdd))
#'mlas.core/csv-to-output
mlas.core> (first csv-to-output)
(["Beggs, Roy" "Department of Culture, Arts and Leisure" 32] ["Beggs, Roy" "Department of the Environment" 96] ["Beggs, Roy" "Department for Social Development" 76] ["Beggs, Roy" "Department of Agriculture and Rural Development" 53] ["Beggs, Roy" "Department for Employment and Learning" 40] ["Beggs, Roy" "Department for Regional Development" 128] ["Beggs, Roy" "Northern Ireland Assembly Commission" 18] ["Beggs, Roy" "Department of Education" 131] ["Beggs, Roy" "Department of Health, Social Services and Public Safety" 212] ["Beggs, Roy" "Department of Justice " 38] ["Beggs, Roy" "Department of Finance and Personnel" 105] ["Beggs, Roy" "Office of the First Minister and deputy First Minister" 151] ["Beggs, Roy" "Department of Enterprise, Trade and Investment" 66])
mlas.core> (write-csv-file "/Users/Jason/Desktop/" (first csv-to-output))
nil
mlas.core>

So far so good: checking my desktop, there’s a CSV file ready for me to use. I just need to add the header (source,target,value) as the top line. In all honesty I should really insert that header row at the start of the vector.
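Inserting the header in code could look something like the sketch below. It assumes the collected output is a sequence of per-MLA row groups, as the REPL output above suggests; `with-header`, `sample-groups` and the "Member B" row are hypothetical and only there for illustration.

```clojure
(def header ["source" "target" "value"])

(defn with-header
  "Flattens the per-MLA row groups into one sequence, header row first."
  [row-groups]
  (cons header (apply concat row-groups)))

;; Hypothetical stand-in for the collected output of generate-csv-output.
(def sample-groups
  [[["Beggs Roy" "Department of Education" 131]]
   [["Member B" "Department of Justice" 38]]])

(with-header sample-groups)
;; => (["source" "target" "value"]
;;     ["Beggs Roy" "Department of Education" 131]
;;     ["Member B" "Department of Justice" 38])
```

Passing the result to write-csv-file in place of (first csv-to-output) would then, under the same assumption, write every MLA’s rows into one file with the header already in place.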

Creating The Sankey Diagram

Where possible it’s best to learn from example and in all honesty I’m not a visualisation kinda guy. So when the going gets tough, the tough Google D3 examples.

There’s a handy “Sankey diagrams with CSV files” example that I can use. After a small amount of copy/paste to create the index.html and sankey.js files, all I have to do is copy over the sankey.csv that Spark just output for us. I’ve also extended the length of the canvas that the Sankey diagram is painted on.

Appending a couple of CSV output files to sankey.csv will give us a starting point. If I reload the page (Dropbox doubles as a very handy web server for static pages if you put html files in the Public directory) you end up with something like the following.

[Image: sankey2, the resulting Sankey diagram]

Okay, it’s not perfect but it’s certainly a starting point. Just imagine how it would look with all the MLAs… maybe later.

Conclusion

Once again I’ve rattled through some Spark and Clojure, but we’re essentially reusing what we have. The D3 outputs take some experimentation and time to get right. Keep in mind that if you have a lot of nodes (notice how I’m only dealing with two MLAs at the moment) the rendering can take some time.

References

Sankey with D3 and CSV files: http://bl.ocks.org/d3noob/c9b90689c1438f57d649

Github Repo of this project: https://github.com/jasebell/niassembly-spark

 

 
