
Jason Bell

~ Kafka, Clojure, Java, Spark, Hadoop and Process Automation

Monthly Archives: November 2013

Amazon and the warehouse, you and your job: if it can be automated, it will be automated.

Saturday 30 Nov 2013

Posted by Jason Bell in Uncategorized

A few days ago I watched the BBC Panorama programme “Amazon: The Truth Behind the Click” (I’ve put the iPlayer link up for UK folks, but don’t forget it’ll expire soon). None of it came as a great surprise to me: while I’m happy to click and order my books of choice, my experience in supply chain and micro-analysed, cost-based reduction means I know the lengths companies will go to in order to protect the bottom line.

The more we click, the more we acknowledge that we agree to this system. “I want it cheap and I want it tomorrow”. Well, what were you expecting?

It’s not just Amazon who are guilty. Think of anything where price is the issue: supermarkets, cut-price airlines and so on, and you can be safely assured that a squeezed human cost is involved in making our lives all the happier.

If it can be automated then it will be automated

Think about computer programming: when you drill down to the basics, it exists to do one thing, handle repetitive tasks. It saves us time doing them (we’d get bored and make mistakes in the long run). That’s it, nothing more, nothing less. Okay, there’s all the funky stuff afterwards, but when it comes down to it, it’s about repetition.

It’s hardly been a silent takeover either: over the last 80 years job growth in the US has declined. And while digital technology has boosted economies, it has also ripped the beating heart out of the mid-level service sector. Payroll, accounts, bookkeeping, postal work, bank clerks and cashiers are all vulnerable. There’s one real conclusion: if it can be automated, then it will be automated.

Kiva Systems, the robot automation system used in warehouses, was set up in 2002 and sold to Amazon in 2012 for $775m. Amazon have plans.

And what is it you do…?

Think about your job, what you do for a living and then ask yourself this next question seriously, even take a few days to ponder it. “Will my job run the risk of being automated in the next five years?”

Even as a computer programmer (not a sexy job title, may I add, regardless of how others may dress it up) there’s a pressure to constantly reinvent yourself while gripping onto the past. Luckily for me Java is making good inroads again (thanks to BigData, Android and other JVM-related things). Even so, I’m constantly having to look five years out and make educated guesses on what’s got longevity.

I agree there is a shortage of computer programmers, and I agree that programming needs to be taught in schools. The key is also for teaching establishments to be honest and try to figure out what’s going to be relevant by the time the pupils leave.

Without a manufacturing base (yes, that can be, and is, automated) we’re left with high-paid jobs and low-paid jobs and nothing in the middle.

If there’s a time to look forward and reinvent yourself, then it’s probably now.

“…the one good thing we’re good at, machines can’t do” (@cimota)

Twitter Fashion Analytics in Spring XD [Part 4] #BigData #Fashion

Sunday 17 Nov 2013

Posted by Jason Bell in BigData, Data and Statistics, Hadoop, Java, Spring XD, Twitter

The final phase: time to visualise.

Visualisation is not always the be-all and end-all of BigData, but it is important in terms of telling the story.

At the end of the last post we had mined the data, got counts of all the hashtags and also whittled it down to three brands we wanted to focus on.

Visualising in D3

D3 (Data-Driven Documents) is a JavaScript library that takes a lot of the work out of creating graphs, maps and all sorts of other things. From our perspective it’s handy as we can load in CSV/TSV files with ease.

Loading D3 is as simple as

<script src="http://d3js.org/d3.v3.min.js"></script>

I’m going to create a pie chart based on an example on the D3 site. You can have a look at the HTML in my GitHub repo for this blog post series.

Spring XD/Hadoop/D3 Considerations

In the four posts in this series we’ve covered data consumption, storage, processing and visualisation. With Spring XD it’s going to continue gathering data until we say stop. The Hadoop job we ran was a one-off, but there’s nothing to stop us putting these things in a cron job to update every hour, though we do have to keep an eye on how long those jobs take to run.

Also remember that D3 takes time to load and parse on the client side, so don’t overload the user: from an aesthetic point of view, too much information is confusing for the reader.

As with all these things it’s a case of getting your hands dirty and trying things out.

Twitter Fashion Analytics in Spring XD [Part 3] #BigData #Fashion

Sunday 17 Nov 2013

Posted by Jason Bell in BigData, Data and Statistics, Hadoop, Java, Sentiment Analysis, Spring XD, Twitter

Just jumping in on part 3? You can read part 1 and part 2 to get you up to speed.

Dovima with elephants, Evening dress by Dior, Cirque d'Hiver, Pa

The picture reference gag could be wasted on some….

Bringing in Hadoop

So far we’ve used Spring XD to pull a stream of tweets into a file. And, normally, at this point we’d be looking to copy the file into HDFS and then run some MapReduce job on it.

Historically you’d be saying:

hadoop fs -put /tmp/xd/output/tweetFashion.out tweetFashion.out

Run the job and then pull the results from HDFS into a file again.

hadoop fs -getmerge output output.txt

If you need a reminder on setting up a single node Hadoop cluster then you can read a quick intro here.

Spring XD and Hadoop

By creating a stream from within the shell we can send the data straight into HDFS without the hassle of doing it manually.

First we must tell Spring XD to default to Hadoop 1.2.1 (in XD build M4 at the time of writing).

Assuming that Hadoop is installed and running you can run the single command:

xd:>stream create --name tweetLouboutins --definition "twitterstream --track='#fashion'| twitterstreamtransformer | hdfs"

With the rollover parameter on the hdfs sink we can ensure that periodic writes happen within HDFS and that not too much data is held in memory.

Within the HDFS directory you’ll start to see the data appear very much in the same way as it was being dumped out to the tmp directory.

The MapReduce Job

The Mapper and Reducer are a very simple word-count demo, but looking for hashtags instead of words. The difference here is the output: I don’t want tab-delimited output, as I’m not a great fan of it; I want a comma-separated file.
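The hashtag-counting logic itself is easy to sketch in plain Java. Here's an illustrative stand-alone version of what the Mapper and Reducer do between them; the class and method names are mine, not taken from the series' repo:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HashtagCount {
    // Same token shape the MapReduce job looks for: '#' followed by word characters.
    private static final Pattern HASHTAG = Pattern.compile("#\\w+");

    // Extract the hashtags from one tweet's text and accumulate counts.
    public static void countHashtags(String tweetText, Map<String, Integer> counts) {
        Matcher m = HASHTAG.matcher(tweetText);
        while (m.find()) {
            counts.merge(m.group(), 1, Integer::sum);
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        countHashtags("Loving these #fashion finds #louboutins", counts);
        countHashtags("More #fashion on the way", counts);
        // Emit in the same comma-separated form the job produces.
        counts.forEach((tag, n) -> System.out.println(tag + "," + n));
    }
}
```

In the real job the Mapper emits each hashtag with a count of 1 and the Reducer does the summing; the regex and the merge here stand in for both halves.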

The version of Hadoop you are using determines which property you need to set, so the easiest way to sort this is to cover all the bases.

static void setTextoutputformatSeparator(final Job job, final String separator) {
    final Configuration conf = job.getConfiguration();
    conf.set("mapred.textoutputformat.separator", separator);          // prior to Hadoop 2 (YARN)
    conf.set("mapreduce.textoutputformat.separator", separator);       // Hadoop v2+ (YARN)
    conf.set("mapreduce.output.textoutputformat.separator", separator);
    conf.set("mapreduce.output.key.field.separator", separator);
    conf.set("mapred.textoutputformat.separatorText", separator);      // belt and braces
}
Then within the job definition code:

Job job = new Job();
job.setJarByClass(TwitterHashtagJob.class);
setTextoutputformatSeparator(job, ",");

When this job is run the output will be comma separated.
#Aksesuar,1
#AkshayJewellery,1
#AlSalamanty,1
#Alabama,1
#Aladin,1
#AlaskanWomen,2
#Alaventa,1
#Albright,8
#Albuquerque,1
#Alcohol,1
#Aldo,1
#AldoShoes,2
#AlecBaldwin,3
#Alejandro…,1
#Alencon,1
#Alert,2
#AlessandraAmbrosio!,2
#AlessiaMarcuzzi,2
#AlexSekella,1
#AlexTurner,3
#Alexa,3
#AlexaChung,8
#AlexaChungIt,1
#AlexanderMcQueen,7
#AlexanderWang,7
#Alexandra,1
#Alexandria,2
#Alexis,1
#Alfani,1
#Alfombraroja,1
#AlfredDunner,1
#Ali,1
#Alice,1

With a little command-line refining we can start to pick out the data we actually want. A visualisation of every single hashtag would hardly be pleasing on the eye; for example, I just want to see the counts for Asos, Primark and TopShop.

jason@bigdatagames:~$ egrep "#(asos|primark|topshop)," output20131117.txt 
#asos,42
#primark,16
#topshop,64
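For anything more involved than a one-off grep, the same refinement is trivial in plain Java too. A minimal sketch, with the CSV lines inlined and the brand hashtags hard-coded purely for illustration:

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class BrandFilter {
    // Keep only the CSV lines whose hashtag column matches one of the chosen brands.
    public static List<String> filter(List<String> csvLines, Set<String> brands) {
        return csvLines.stream()
                .filter(line -> brands.contains(line.split(",", 2)[0].toLowerCase()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = List.of("#asos,42", "#Alabama,1", "#primark,16", "#topshop,64");
        System.out.println(filter(lines, Set.of("#asos", "#primark", "#topshop")));
    }
}
```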

Next time….

We’ll get into the visualisation side of things and also look at maintaining a process to keep the output up to date.

Twitter Fashion Analytics in Spring XD [Part 2] #BigData #Fashion

Sunday 10 Nov 2013

Posted by Jason Bell in BigData, Data and Statistics, Hadoop, Java, Sentiment Analysis, Spring XD, Twitter

In Part 1 I introduced you to Spring XD and its lovely ways of being able to pull in streaming Twitter data. Think of it like a continuous catwalk of data….

The story so far….

We’ve got streams of data coming into the server and they are being stored. All very well, but Twitter streaming responses are huge chunks of JSON data, and when they come in thick and fast they take up disk space, and quickly.

I’m only bothered about two things in all this data: firstly the date/time of the tweet, and secondly the content.

Within the grand data chunk I can see that “created_at” and “text” are what we really need.

Transformers

We can write custom pieces of code that act as extra stages in the pipe and manipulate the data as it comes in. We’ve established that we’re looking for two things, and I want them output to the text file.

So where I currently have:

xd:>stream create --name tweetLouboutins --definition "twitterstream --track='#louboutins'| file"

I want to add a transformer to strip back all the JSON and just give me the bits I want. We can create a transformer in code and then deploy it to our Spring XD node.

The code and bean definition.

Here’s the main body of the code:

ObjectMapper mapper = new ObjectMapper(); // Jackson, as implied by the TypeReference below
StringBuilder sb = new StringBuilder();
Map<String, Object> tweet = mapper.readValue(payload, new TypeReference<Map<String, Object>>() {});
sb.append(tweet.get("created_at").toString());
sb.append("|");
sb.append(tweet.get("text").toString());
return sb.toString();

If you want to read the full class you can, as the project is on GitHub.

The last thing we need before deploying is an XML file that defines our transformation class.

<?xml version="1.0" encoding="UTF-8"?>
<beans:beans xmlns="http://www.springframework.org/schema/integration"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xmlns:beans="http://www.springframework.org/schema/beans"
  xsi:schemaLocation="http://www.springframework.org/schema/beans
http://www.springframework.org/schema/beans/spring-beans.xsd
http://www.springframework.org/schema/integration
http://www.springframework.org/schema/integration/spring-integration.xsd">
  <channel id="input"/>
  <transformer input-channel="input" output-channel="output">
    <beans:bean class="co.uk.dataissexy.xd.samples.TwitterStreamTransform" />
  </transformer>
  <channel id="output"/>
</beans:beans>

Deployment

Spring XD wants your code to be packaged as a jar file and placed in the xd/lib directory. The XML definition file needs to be placed in the xd/modules/processor directory; then restart the server for the changes to take effect.

Now we can run the transformer in a stream. Where before we had:
xd:>stream create --name tweetLouboutins --definition "twitterstream --track='#fashion'| file"

We now need to add in our new transformer.

xd:>stream create --name tweetLouboutins --definition "twitterstream --track='#fashion'| twitterstreamtransformer | file"
A quick inspection of the data directory now shows the data is a lot more manageable:
Sun Nov 10 11:42:06 +0000 2013|RT @GliStolti: Starry Night bag http://t.co/Xa1342og1G #fashion #trend #style #design #handmade #handicraft #shopping #rome #italy #madeini…
Sun Nov 10 11:42:16 +0000 2013|RT @GliStolti: #VanGogh necklace http://t.co/08v0Jwd4r7 #fashion #trend #style #design #handmade #handicraft #madeinitaly #shopping #rome #…
Sun Nov 10 11:42:16 +0000 2013|Was at @StuntDolly yesterday getting the #XMASCARBOOT organised for Sat the 16th! Be sure to come! #fashion #dalston http://t.co/lT6aLyf60r
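Anything consuming these records downstream can split them back into the two fields. A minimal sketch (the limit argument of 2 matters here, since the tweet text can itself contain a '|'):

```java
public class RecordSplit {
    // Split a "created_at|text" record into its two fields. A limit of 2
    // stops split() from also breaking on any '|' inside the tweet text.
    public static String[] fields(String line) {
        return line.split("\\|", 2);
    }

    public static void main(String[] args) {
        String[] parts = fields("Sun Nov 10 11:42:06 +0000 2013|Starry Night bag #fashion");
        System.out.println(parts[0]); // the created_at field
        System.out.println(parts[1]); // the tweet text
    }
}
```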

Next time….

In Part 3 I’m going to bring Hadoop into the fold and collate the hashtags and attempt to create some form of visualisation with D3.

Twitter Fashion Analytics in Spring XD [Part 1] #BigData #Fashion @editd

Saturday 09 Nov 2013

Posted by Jason Bell in Android Development, BigData, Data and Statistics, Java, Retail and Loyalty, RSS/XML, Sentiment Analysis, Spring XD, Startups/Business, Twitter

Careers Teachers are Dangerous

20091019-DSCF0212-copy

Careers teachers are people with responsibility beyond most. They can either empower or crush the dreams of a teenager in a heartbeat. Mine were crushed in an instant: my teacher laughed at my wish to be a fashion photographer, and if I were to become one then (according to him) it would only be “mundane catalogue work”. So much for aspiring dreams….

Two things happened that day: firstly I took up the bass guitar as some form of rebellion against photography, and I also took my computing work a lot more seriously. The fashion thing never really left me, though I never got into the industry in the end. I did go back to photography though (I know your minds are racing; it wasn’t about the models). Through the work of Datasentiment the fashion industry was pivotal to my data design choices: fashion represented “distressed stock/inventory”, items with time-critical value. Where are the sales peaks? When do the discounts start? Do the percentages slide? I’ve got notebooks full of this stuff.

Fast forward to November 2013.

To my mind EditD are one of the best companies doing realtime insights with data. I’ve not had the joy of seeing the whole application (and probably never will), so I have to stand on the sidelines and read the reports. They’re all excellent and very well informed, built from data collected all over the web, retail and beyond.

The reports for fashion weeks and brand updates make reference to online mentions, and I’m assuming this means the usual suspects, Twitter and Facebook. I don’t do the Facebook thing so I’ll put that to one side. But oh oh oh, mentions, sentiment and all that other garb: yup, the Twitterati in full flow give us enough information to keep us occupied for a long time.

But wait Jase!

Yes, I know what you’re thinking: I’ve done all this before. Well, I have in part. Twitter sentiment analysis in 30 seconds (done in R), the Raspberry Pi Twitter Sentiment Server (in R and Python). And yes, I’ve done shoe sizes before as well….. but this is different. Those were based on searches, not streams; it was also a bit clunky, though acceptable.

Previously, doing this sort of stuff meant a big technical overhead just to get to the point of having the data in and stored. Whether that was to a normal database server, HDFS or even a text file, it was a pain.

Spring XD

Spring call XD a “unified, distributed and extensible system for data ingestion, real time analytics, batch processing and data export”. Sold to the man with the curly hair…. so let’s get cracking.

Right now, part 1, I’m only bothered about getting Twitter data into a file.  Part 2 we’ll start to do things with the data.

Spring XD implements the Twitter Streaming and Search APIs; we’ll use the Streaming API for our needs. We’re going to set up some streams for a few shoe brands.

Download the Spring XD application from the Spring download site.  Once you have unzipped that to a directory we can start getting everything together.

Defining the Twitter Application

Firstly we need a Twitter application with consumer key and secret. So you’ll need a Twitter account and a developer account.  Create an application and make note of the consumer key/secret and access token/secret. We’ll be using those in a minute.

Getting Spring XD Started

Grab yourself two terminal windows; trust me, you’ll need them.

In the first terminal window we’ll get the server started:

user@myserver:/home/jason/spring-xd-1.0.0.M3/xd/bin# ./xd-singlenode

This will get the XD server running under single node, fine for my needs.

Setting up the Twitter credentials for XD

In the spring-xd-1.0.0.M3/xd/config folder is a file called twitter.properties. Take the values of the consumer key/secret and the access token/secret and paste them into the correct places (the properties file clearly marks which values go where).

Starting the client shell

At this point all I want to do is get Twitter streaming data saved to a file. In the next part we’ll start coding some modules to do things with the data as it comes in.

I liken XD to a massive unix pipe command. This time though we can give these streams (pipes) names, so we can configure what happens to the data in those pipes. XD provides a shell program for us to do the work on the streams.

user@myserver:~/spring-xd-1.0.0.M3/shell/bin$ ./xd-shell

Consuming Data

So far we’ve got the server running, the client running and our Twitter credentials set up in the configuration. The final part is to create the stream to consume some Twitter data. Within the console type in the following (watch out for the position of the single and double quotes). It should look something like this:

xd:>stream create --name tweetLouboutins --definition "twitterstream --track='#louboutins'| file"
Created new stream 'tweetLouboutins'

xd:>stream create --name tweetJimmyChoo --definition "twitterstream --track='#jimmychoo'| file"
Created new stream 'tweetJimmyChoo'

The --name flag defines the name that XD will refer to the stream by. The definition is what XD is expected to do: in this case it’s a Twitter stream (‘twitterstream’) with a target keyword to track, here #jimmychoo and #louboutins. Lastly the definition is piped through to a file. The filename will be the same as the --name.

There are other options like refining the location of the tweets, whether to include follows and set the filter level.

Once those streams are created they go to work and if your target keyword is quite generic then your storage volume will start filling up quickly, so be careful.  The data is stored in /tmp/xd/output:

user@myserver:/tmp/xd/output$ ls -l
total 14068
-rw-r--r-- 1 user user    17326 Nov  9 11:25 tweetJimmyChoo.out
-rw-r--r-- 1 user user 14361376 Nov  9 11:25 tweetLouboutins.out

A quick inspection of either of the files shows the entire JSON output of the streams. We have data; now we can do things with it. In a normal world we’d let this run, but for now I’m going to stop my streams and preserve disk space.

xd:>stream destroy --name tweetLouboutins
Destroyed stream 'tweetLouboutins'
xd:>stream destroy --name tweetJimmyChoo
Destroyed stream 'tweetJimmyChoo'
xd:>

The nice thing about XD is that it can, with relative ease, ingest just about anything: RSS feeds, HTTP calls, web site pages, email, unix monitoring commands and social media. Will you become the next EditD? Well, we can all dream, can’t we…. in the same way I dreamt of being a photographer. I did do some in the end, like that photo at the start of the post.

Next time….

In part 2 we’ll get programmatic and start creating processing modules to manipulate the data and do some analysis on it.

Speaking at @devbash about big shiny #BigData and #Hadoop

Tuesday 05 Nov 2013

Posted by Jason Bell in Uncategorized

I’ll be speaking at the Devbash on Wednesday 6th November at the Black Box in Belfast. Once again (some may groan and some may sigh) I’ll be talking about Hadoop: the good, the bad and the ugly (but mainly the good).

More details of the more sensible speakers here.
