One of the highlights of my job as a Data Engineer (I’m not a data scientist) is that I get to do some very cool stuff with text mining and all that data schizz.

So to that end I’m using Apache Spark, Clojure and Sparkling a lot. With that in mind I do a lot of bag-of-words, word vectors and such things to get topics and classifications from text documents. And it’s at this point that SparkML fails like a complete worn-out donkey, on one of those small, overlooked details that you only come across once in a while.

In topic modelling though it’s nice to know (actually pretty important) which document was labelled with which terms. But anything using SparkML’s hashingTF function keeps no trace of which document a term frequency came from. Which is rather pointless and, let’s face it, pretty annoying.
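
To make the complaint concrete, here is a minimal sketch of what I’d want instead: term frequencies that stay attached to a document id. This is plain Python, not Spark or Sparkling code, and it is not how SparkML implements hashingTF. The names `hashing_tf` and `tf_by_document`, the `doc-1`/`doc-2` ids and the 16-bucket size are all made up for illustration, and `zlib.crc32` is swapped in as a simple deterministic hash.

```python
import zlib
from collections import Counter


def hashing_tf(terms, num_features=16):
    """Hashing trick: map a list of terms to a sparse {bucket: count} dict."""
    counts = Counter()
    for term in terms:
        # crc32 is stable across runs, unlike Python's built-in hash() for str.
        # Assumption: any deterministic hash will do for this illustration.
        bucket = zlib.crc32(term.encode("utf-8")) % num_features
        counts[bucket] += 1
    return dict(counts)


def tf_by_document(docs, num_features=16):
    """Keep every document id paired with its term-frequency vector."""
    return {
        doc_id: hashing_tf(terms, num_features)
        for doc_id, terms in docs.items()
    }


# Hypothetical, already-tokenised documents keyed by id.
docs = {
    "doc-1": ["spark", "clojure", "spark"],
    "doc-2": ["topic", "model"],
}

tf = tf_by_document(docs)
for doc_id, vector in tf.items():
    print(doc_id, vector)
```

The only real point is that the id travels with the vector the whole way through, so you can still say which document produced which term frequencies when the topics come out the other end.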

There, got that off my chest….