It's All a Bet

Jason Bell – Author, Advisor and Practitioner in Machine Learning and Artificial Intelligence

Over 475 machine learning citations and 50+ patent citations on machine learning.

Twitter Fashion Analytics in Spring XD [Part 1] #BigData #Fashion @editd

Careers Teachers are Dangerous

[Photo: 20091019-DSCF0212-copy]

Careers teachers carry more responsibility than most. They can empower or crush a teenager’s dreams in a heartbeat. Mine were crushed in an instant: my teacher laughed at my wish to be a fashion photographer and told me that if I ever became one it would only be “mundane catalogue work”. So much for aspiring dreams….

Two things happened that day. First, I took up the bass guitar as some form of rebellion against photography, and second, I took my computing work a lot more seriously. The fashion thing never really left me, though I never got into the industry in the end. I did go back to photography though (I know your minds are racing, it wasn’t about the models). Through the work of Datasentiment, the fashion industry was pivotal to my data design choices: fashion represented “distressed stock/inventory” items that had time-critical value. Where are the sales peaks? When do the discounts start? Do the percentages slide? I’ve got notebooks full of this stuff.

Fast forward to November 2013.


To my mind EditD are one of the best companies doing realtime insights with data. I’ve not had the joy of seeing the whole application (and I probably never will), so I have to stand on the sidelines and read the reports. They’re all excellent and very well informed by data collected from all over the web, retail and beyond.

The reports for fashion weeks and brand updates make reference to online mentions and I’m assuming this suggests the usual suspects, Twitter and Facebook. I don’t do the Facebook thing so I’ll put that to one side. But oh oh oh, mentions, sentiment and all that other garb, yup the Twitterati in full flow give us enough information to keep us occupied for a long time.

But wait Jase!

Yes, I know what you’re thinking, I’ve done all this before. Well, I have in part: Twitter sentiment analysis in 30 Seconds (done in R) and the Raspberry Pi Twitter Sentiment Server (in R and Python). And yes, I’ve done shoe sizes before as well….. but this is different. Those were based on searches, not streams, and they were also a bit clunky, though acceptable.

Previously, doing this sort of stuff meant a big technical overhead just to get the data in and stored. Whether that was to a normal database server, HDFS or even a text file, it was a pain.

Spring XD

Spring call XD a “unified, distributed and extensible system for data ingestion, real time analytics, batch processing and data export”. Sold to the man with the curly hair…. so let’s get cracking.

Right now, in part 1, I’m only bothered about getting Twitter data into a file. In part 2 we’ll start to do things with the data.

Spring XD supports both the Twitter Streaming and Search APIs; we’ll use the Streaming API for our needs. We’re going to set up some streams for a few shoe brands.

Download the Spring XD application from the Spring download site.  Once you have unzipped that to a directory we can start getting everything together.

Defining the Twitter Application

Firstly we need a Twitter application with consumer key and secret. So you’ll need a Twitter account and a developer account.  Create an application and make note of the consumer key/secret and access token/secret. We’ll be using those in a minute.

[Image: twittersetup]

Getting Spring XD Started

Grab yourself two terminal windows, trust me you’ll need them.

In the first terminal window we’ll get the server started:

user@myserver:/home/jason/spring-xd-1.0.0.M3/xd/bin# ./xd-singlenode

This will get the XD server running in single-node mode, which is fine for my needs.

[Image: terminal1]

Setting up the Twitter credentials for XD

In the spring-xd-1.0.0.M3/xd/config folder is a file called twitter.properties. Take the consumer key/secret and the access token/secret and paste them into the correct places (the properties file clearly marks which values go where).
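Before starting a stream it can help to confirm that nothing in the file was left blank. Here is a minimal Python sketch, assuming the file uses plain key=value lines with # comments. The function names are my own, and the check deliberately does not assume any particular key names, since the real ones are whatever your twitter.properties file contains.

```python
from pathlib import Path


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def missing_values(text):
    """Return the keys that do not have a value yet."""
    return [key for key, value in parse_properties(text).items() if not value]


def check_file(path):
    """Read a properties file from disk and list any keys left empty."""
    return missing_values(Path(path).read_text(encoding="utf-8"))
```

Calling check_file("spring-xd-1.0.0.M3/xd/config/twitter.properties") should return an empty list once all four values have been pasted in.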

Starting the client shell

At this point all I want to do is get Twitter streaming data saved to a file. In the next part we’ll start coding some modules to do things with the data as it comes in.

I liken XD to a massive unix pipe command. This time, though, we can give these streams (pipes) names so we can configure what happens to the data in them. XD provides a shell program for us to do the work on the streams.

user@myserver:~/spring-xd-1.0.0.M3/shell/bin$ ./xd-shell

[Image: terminal2]

Consuming Data

So far we’ve got the server running, the client running and our Twitter credentials set up in the configuration. The final step is to create the stream to consume some Twitter data. Within the console type in the following (watch out for the position of single and double quotes). It should look something like below:

xd:>stream create --name tweetLouboutins --definition "twitterstream --track='#louboutins'| file"
Created new stream 'tweetLouboutins'

xd:>stream create --name tweetJimmyChoo --definition "twitterstream --track='#jimmychoo'| file"
Created new stream 'tweetJimmyChoo'

The --name flag defines the name XD will use to refer to the stream. The definition is what XD is expected to do. In this case it’s a Twitter stream (‘twitterstream’) with a target keyword to track, here #jimmychoo and #louboutins. Lastly, the stream is piped through to a file. The filename will be the same as the --name, with a .out extension.

There are other options too, such as restricting tweets by location, following specific users and setting the filter level.
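If you plan to track more than a couple of brands, typing each command by hand gets tedious. As a rough sketch, the following Python helpers build the same commands as strings, ready to paste into the XD shell. The function names are hypothetical and only the command syntax is taken from the examples in this post.

```python
def create_command(name, hashtag):
    """Build a 'stream create' command that pipes a tracked hashtag to a file."""
    if "'" in hashtag or '"' in hashtag:
        # Quotes inside the keyword would break the quoting of the definition.
        raise ValueError("hashtag must not contain quote characters")
    definition = "twitterstream --track='{}' | file".format(hashtag)
    return 'stream create --name {} --definition "{}"'.format(name, definition)


def destroy_command(name):
    """Build the matching 'stream destroy' command for a named stream."""
    return "stream destroy --name {}".format(name)


def commands_for(brands):
    """Given a {stream name: hashtag} mapping, return all create commands."""
    return [create_command(name, tag) for name, tag in brands.items()]
```

For example, commands_for({"tweetLouboutins": "#louboutins", "tweetJimmyChoo": "#jimmychoo"}) gives back the two commands used in this post, one per list item.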

Once those streams are created they go to work. If your target keyword is quite generic, your storage volume will start filling up quickly, so be careful. The data is stored in /tmp/xd/output:

user@myserver:/tmp/xd/output$ ls -l
total 14068
-rw-r--r-- 1 user user    17326 Nov  9 11:25 tweetJimmyChoo.out
-rw-r--r-- 1 user user 14361376 Nov  9 11:25 tweetLouboutins.out
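Since a busy hashtag can eat disk space quickly, a small watchdog is handy. This is a minimal Python sketch that lists the size of each .out file in a directory and flags any that are over a limit. The function names and the 10 MB threshold are my own choices for illustration, not anything XD provides.

```python
from pathlib import Path


def output_sizes(directory):
    """Return {filename: size in bytes} for every .out file in the directory."""
    folder = Path(directory)
    if not folder.is_dir():
        return {}
    return {path.name: path.stat().st_size for path in sorted(folder.glob("*.out"))}


def over_limit(sizes, limit_bytes):
    """Return the filenames whose size is greater than limit_bytes."""
    return [name for name, size in sizes.items() if size > limit_bytes]
```

With the listing above, over_limit(output_sizes("/tmp/xd/output"), 10 * 1024 * 1024) would flag tweetLouboutins.out only, as it is the one file larger than 10 MB.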

A quick inspection of either file shows the entire JSON output of the streams. We have data, so now we can do things with it. Normally we’d let this run, but for now I’m going to stop my streams to preserve disk space.
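For a first look at what was captured, here is a minimal Python sketch that reads one of the output files and counts the hashtags seen. It assumes each line of the file holds one complete tweet as JSON, and that hashtags sit under entities.hashtags as in Twitter's tweet JSON. I haven't confirmed either against the actual files, so treat both as assumptions. The function names are my own.

```python
import json
from collections import Counter


def parse_tweets(lines):
    """Yield one dict per line that contains a valid JSON object."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            tweet = json.loads(line)
        except ValueError:
            continue  # skip partial lines, e.g. a stream stopped mid-write
        if isinstance(tweet, dict):
            yield tweet


def hashtag_counts(tweets):
    """Count lower-cased hashtags across all tweets."""
    counts = Counter()
    for tweet in tweets:
        entities = tweet.get("entities") or {}
        for tag in entities.get("hashtags") or []:
            text = tag.get("text")
            if text:
                counts[text.lower()] += 1
    return counts
```

To try it, open /tmp/xd/output/tweetLouboutins.out with encoding="utf-8", pass the file object to parse_tweets, and call most_common(10) on the result of hashtag_counts.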

xd:>stream destroy --name tweetLouboutins
Destroyed stream 'tweetLouboutins'
xd:>stream destroy --name tweetJimmyChoo
Destroyed stream 'tweetJimmyChoo'
xd:>

The nice thing with XD is that it can, with relative ease, ingest just about anything. That could be RSS feeds, HTTP calls, web site pages, email, unix monitoring commands and social media. Will you become the next EditD? Well, we can all dream, can’t we…. In the same way I dreamt of being a photographer, and I did do some in the end, like that photo at the start of the post.

Next time….

In part 2 we’ll get programmatic and start creating processing modules to manipulate the data and do some analysis on it.

7 responses to “Twitter Fashion Analytics in Spring XD [Part 1] #BigData #Fashion @editd”

  1. […] Part 1 I introduced you to Spring XD and its lovely ways of being able to pull in streaming Twitter […]

  2. Great four part series Jason – really good introduction to the new Spring XD world, I really think this project will open a lot of doors. I’m not sure if it’s my Twitter application, but now when I try to create more than one stream I get disconnected with: {"disconnect":{"code":7,"stream_name":"Sion_Smith-statuses1889513","reason":"admin logout"}}. Can you check your application to see if this is consistent or whether it’s just me? I checked the logs and I get:
    22:52:20,441 WARN task-scheduler-4 client.RestTemplate:566 - GET request for "https://stream.twitter.com/1.1/statuses/filter.json?track=%23louboutins" resulted in 420 (Client Error (420)); invoking error handler

    Any help would be gratefully received.

    Thanks
    Sion

  3. […] the output is pretty uniformed I set a SpringXD stream to pull the scores from Twitter as they […]

  4. Hi, I’m getting the same 420 error as the previous post. Some other people also have the same problem (https://twittercommunity.com/t/error-420-on-first-request/35506). If anyone has a solution or explanation, this would be helpful. Thanks.

  5. Solved. I got the error using an old version of Spring XD (same as in this post: spring-xd-1.0.0.M3). I instead downloaded the latest version available (1.2.0.RC1), created a twitterstream and had no error this time. It works, I got the stream filling my file with JSON tweets.

  6. Alain, apologies for not commenting sooner as I was travelling when you sent your first comment. The whole series of posts needs another look and edit because I know SpringXD changed quite significantly in configuration since I wrote those posts and also the chapter in the Machine Learning book. Annoyingly, it’s time which is not on my side right now.

    Thanks for taking the time to investigate and let me know how you got on, much appreciated.
