London Fashion Week kicked off today, which is a fine excuse to run some more mining algorithms. One thing that becomes apparent very quickly is that hashtags have no agreed definition. Hashtags (#thosethings) are defined by nobody but the author of the tweet. That’s all very well for normal readers of the tweets, who can read past the hashtags (a bit like bad spelling: you read the sentence and skip over them).
For machine learning programs it causes all sorts of issues. Consider Antoni & Alison, who did a digital catwalk today. Here are the hashtags we pulled out with the Cloudatics platform (we gathered our data via the excellent Repknight).
As you may have noticed, there’s an issue with the word “and”: there are multiple ways of writing the same thing. “And”, “+” and “&” all mean “and”. The trouble starts when you mine the data with the likes of Hadoop/MapReduce: each of the tags above is treated as a separate classification with its own count.
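To make the counting problem concrete, here is a minimal sketch of a MapReduce-style tag count in plain Python. It is not Hadoop or Cloudatics code, and the sample tags are made up for illustration (they are not the tags we actually pulled out); the `mapper` and `reducer` names are just stand-ins for the two phases.

```python
from itertools import groupby

# Hypothetical sample tags, invented for this sketch.
tags = ["#antoniandalison", "#antoni&alison", "#antoni+alison", "#antoni&alison"]

def mapper(tag):
    """Map phase: emit a (key, 1) pair for each tag."""
    yield (tag, 1)

def reducer(key, values):
    """Reduce phase: sum the counts for one key."""
    return key, sum(values)

# Shuffle/sort: bring identical keys together, as a MapReduce framework would.
pairs = sorted(pair for tag in tags for pair in mapper(tag))

counts = dict(
    reducer(key, (value for _, value in group))
    for key, group in groupby(pairs, key=lambda pair: pair[0])
)

for tag, count in counts.items():
    print(tag, count)
```

Because the keys are compared as raw strings, the three spellings end up as three separate counts even though they refer to the same show.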
Extract, transform and load (ETL) now becomes very important for getting a good working set of data.
So a mapping should connect “and” to “+” and “&” within the context of the tweet.
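As a rough idea of what that mapping could look like in the transform step, here is a small Python sketch. It assumes we normalise towards the word “and” (rewriting “+” and “&”) and lower-case the tag first; the `normalise` function and the sample tags are hypothetical, not part of Cloudatics or Repknight.

```python
import re
from collections import Counter

# Hypothetical sample tags, invented for this sketch.
tags = ["#AntoniAndAlison", "#antoni&alison", "#antoni+alison", "#LFW"]

def normalise(tag: str) -> str:
    """Lower-case a hashtag and rewrite '+' and '&' as the word 'and'."""
    tag = tag.lower()
    # Replace the symbol plus any whitespace around it with "and".
    return re.sub(r"\s*[+&]\s*", "and", tag)

raw_counts = Counter(tags)
clean_counts = Counter(normalise(tag) for tag in tags)

print(len(raw_counts), "distinct tags before normalising")
print(len(clean_counts), "distinct tags after normalising")
```

Going from the symbols to the word, rather than the other way round, avoids mangling tags that merely contain the letters “and” (a tag such as #brandnew is left alone). It still ignores the wider context of the tweet, so treat it as a first pass only.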
I’ll be looking at this in some detail over the next few blog posts, along with how we can put it to use with Hadoop and Cloudatics in general.