
While many developers appreciate the simplicity of JSON data, it comes with its own set of problems. This is especially true when consuming data with tools like Spark, as you cannot guarantee that one line of a text file contains one complete JSON object. The lines of a Resilient Distributed Dataset (RDD) can never be trusted to each hold a complete object for processing.

For many, Spark is becoming the data processing engine of choice. While support centres on Scala, Python and Java, other languages are gaining support too. I'm pretty much 100% using Clojure now for big data work, and the Sparkling project is excellent for getting Spark working under Clojure.

Spark has JSON support in the Spark SQL library, but that involves loading JSON data and treating it as a table for queries. That's not what I'm after…

Normally you would load data into Spark (in Clojure) like this:

(spark/text-file sc "/Path/To/My/Files")

This loads the text line by line into RDD blocks, which can make JSON parsing difficult: you can't assume that every JSON object will be equal in size or sit neatly on a single line.
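To see why, consider a minimal sketch (the file contents and keyword are hypothetical): a pretty-printed JSON object spans several lines, so any single line handed to the parser is incomplete and `clojure.data.json` will throw.

```clojure
;; A pretty-printed JSON file looks like this on disk:
;;
;;   {
;;     "Name": "spark",
;;     "Version": 1
;;   }
;;
;; so a line-based load hands the parser fragments like the one below.
(require '[clojure.data.json :as json])

(try
  (json/read-str "{\"Name\": \"spark\",")   ; one line of the file
  (catch Exception _ :incomplete-json))
;; the partial object throws, so this evaluates to :incomplete-json
```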

Spark does have a function called wholeTextFiles, which loads a single text file, or a directory of them, using the file path/URL as the key and the file contents as the value. This functionality is included in Sparkling as of 1.2.2.

(spark/whole-text-files sc "/Path/To/My/Files" 4) ;; 4 = minimum number of partitions

This loads each text file into its own single record: you end up with a JavaPairRDD with the file path as the key and the file contents as the value. With Sparkling's destructuring you can map over the files easily. So, to load the files, parse the JSON, and tidy the keys (converting them to lower-case keywords), you end up with something like this:

(->> (spark/whole-text-files sc filepath 4)
     (spark/map (s-de/key-value-fn
                  (fn [k v]
                    (-> v
                        (clojure.data.json/read-str
                          :key-fn (fn [key]
                                    (-> key
                                        str/lower-case
                                        keyword))))))))

Obviously, with large JSON files each going into a single record, processing can take some time, so be careful with huge files on a single cluster.
