Invest NI’s “new jobs” headlines…. how many in a lifetime?

I received a question from Boris Drakemandrillsquirrelhugger*, “Jase, you do data science, how many new jobs have Invest Northern Ireland announced in total?”.

“Bless My Cotton Socks I’m In The News”

First we need headlines and in one line of Linux we can have the whole lot.

for i in {1..314}; do curl "http://www.investni.com/news/index.html?page=$i" > "news_$i.html"; done

This is exactly the same as how I pulled nijobs.com data in a previous blog post. Each page holds 10 headlines and there are 3,138 headlines, so 314 pages will cover them. While that’s pulling all the html down you may as well get a cuppa….


Messing With The Output

The output is basically html pages. You could fire up Python and BeautifulSoup parsers and anything else that takes your fancy, or just use good old command line data science.

egrep -ohi "[0-9]+ new jobs" *.html | egrep -o "[0-9]+" | awk '{ sum+=$1} END {print sum}'

I’m piping three Linux commands together: two egreps and an awk. The first egrep pulls out “[a number] new jobs”. The -o flag shows only the string matched by the regular expression, -i ignores case (otherwise “New jobs” and “new jobs” are treated as different) and -h drops the filename from the output.

58 new jobs
61 new jobs
61 new jobs
84 new jobs
84 new jobs
84 new jobs
30 new jobs
30 new jobs
10 new jobs

The second egrep just pulls out the figure.

30
30
30
40
82
82
15
300
300
23
540
540
36
125
125

And the exciting part is the awk command at the end, where it adds up the stream of numbers.

70758

Now that last figure is what we’re after. One caveat to that, any headline with a comma in the figure got ignored…. the first regexp will need tweaking…. you can play with that. So a rough estimate is to say that since June 2003 there have been over 70,000 new jobs announced in INI headlines.
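If you do want to play with the comma problem, here’s a rough sketch of one way to tweak it. The function name sum_new_jobs is my own invention for illustration, and it assumes figures are written with plain commas as thousands separators (1,200 rather than 1.200 or 1 200). It lets the first pattern accept commas inside the number and strips them out before awk does the adding up.

```shell
# Sum every "<number> new jobs" figure read from stdin.
# Assumption: thousands separators, where present, are plain commas.
sum_new_jobs() {
  grep -Eoi '[0-9][0-9,]* new jobs' \
    | grep -Eo '[0-9][0-9,]*' \
    | tr -d ',' \
    | awk '{ sum += $1 } END { print sum + 0 }'   # "+ 0" prints 0 when nothing matched
}

# Usage against the downloaded pages:
# cat news_*.html | sum_new_jobs
```

It’s still only counting what the headlines say, so treat the total as a rough estimate.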

The number you won’t get is how many were filled.

* The names have been changed to protect the innocent, in fact, just made up….. no one asked at all.

Taylor’s Power Law and Apple’s Small Change Moves.

Artists can command power, it’s a universal law. Madonna did it, Lady Gaga did it and now Taylor’s doing it too. Fine, but this time it didn’t go far enough.

While she was correctly arguing that all artists should be paid for their creativity and, so it seems, getting Apple to reverse its decision not to pay artists for the streaming trial period, smaller artists still lose out in the long run.

The power law in action once again, only the top artists will make the income, the rest will scramble around the long tail.


What should have really been discussed is the value for each stream across the entire lifetime. It falls way below anything that an artist got in traditional CD sales. And while the internet has created the vast distribution network the long term payouts aren’t that great.

Taylor should have added another paragraph about the amount of money paid to artists.

Just my tuppence.

 

Processing JSON with Sparkling – #sparkling #spark #bigdata #clojure


While many developers crave the loveliness and simplicity of JSON data, it can come with its own set of problems. This is very true when using tools like Spark for consuming data, as you cannot guarantee that one line of the text file contains one complete JSON object for processing. A single line in a Resilient Distributed Dataset (RDD) can never be trusted to be complete for processing.

For many, Spark is becoming the data processing engine of choice. While the support is based around Scala, Python and Java there are other languages getting their own support too. I’m pretty much 100% using Clojure now for doing big data work and the Sparkling project is excellent for getting Spark working under Clojure.

Spark has JSON support under the SparkSQL library but this involves loading in JSON data and treating it as a table for queries. I’m not after that…

Normally you would load data into Spark (in Clojure) like this:

(spark/text-file sc "/Path/To/My/Files")

This will load the text line by line into an RDD, which can make JSON parsing difficult as you can’t assume that all JSON objects are going to be equal and nicely placed on one line.

Spark does have a function called wholeTextFiles which will load in a single text file or a directory full of them, using the filepath/url as the key and the file contents as the value. This functionality has now been included in Sparkling 1.2.2.

(spark/whole-text-files sc "/Path/To/My/Files" 4)

This loads each text file as a single record in the RDD. You end up with a JavaPairRDD with the key being the file path and the value being the file contents. With Sparkling destructuring you can map through the files easily. So to load the file in, parse the JSON and set the keys up (converting to lower case for tidiness) you end up with something like this:

;; Assumes the aliases spark (sparkling.core), s-de (sparkling.destructuring) and str (clojure.string).
(->> (spark/whole-text-files sc filepath 4)
     (spark/map (s-de/key-value-fn (fn [k v]
       (-> v
          (clojure.data.json/read-str
             :key-fn (fn [key]
               (-> key
                   str/lower-case
                   keyword))))))))

Obviously with large JSON files each going into a single record the processing can take some time, so be careful with huge files on a single cluster.

Cassandra Invalid Token for Murmur3Partitioner Problems. #cassandra

If you are manually booting an Apache Cassandra server and you get the following message:

Fatal configuration error; unable to start server. See log for stacktrace.
 INFO 21:29:27,638 Announcing shutdown
ERROR 21:29:27,639 Exception in thread Thread[StorageServiceShutdownHook,5,main]
java.lang.IllegalArgumentException: Invalid token for Murmur3Partitioner. Got true but expected a long value (unsigned 8 bytes integer).
 at org.apache.cassandra.dht.Murmur3Partitioner$1.fromString(Murmur3Partitioner.java:190)
 at org.apache.cassandra.service.StorageService.getTokensFor(StorageService.java:1456)
 at org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:1518)
 at org.apache.cassandra.service.StorageService.onChange(StorageService.java:1355)
 at org.apache.cassandra.gms.Gossiper.doOnChangeNotifications(Gossiper.java:1145)
 at org.apache.cassandra.gms.Gossiper.addLocalApplicationState(Gossiper.java:1374)
 at org.apache.cassandra.gms.Gossiper.stop(Gossiper.java:1400)
 at org.apache.cassandra.service.StorageService$1.runMayThrow(StorageService.java:584)
 at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
 at java.lang.Thread.run(Thread.java:724)

No need to panic. Chances are the server is still running; it’s just a case of killing the process running it and starting the server up again.

Sounds daft but it saves Googling around trying to find the answer….

 

Harvesting Data Collaboration in Northern Ireland #data #startups #prediction

There’s plenty of talk about data, analytics, Big Data, artificial intelligence, deep learning and so on. Nerdy conversations that tend to keep the geeks, the marketing department and the press release writers happy but leave the rest of the population completely cold.

Who’s the Real Data Audience?

Let’s remind ourselves where the rest of the population actually are.

[Image: normal distribution curve]

Can you guess? A hint for you, they reside within two standard deviations of the average and make up the majority of people.

All the talk of open data, developers taking to social media to ask the likes of Translink for their god-given right to open data, is all very well but it doesn’t resonate with the key stakeholders…. the public, the businesses and the day to day humdrum of Northern Ireland.

There’s excellent work going on with the open data initiatives from DETI and other interested parties. Progress may be slow but I’d expect it to be slow (with expected public service cuts don’t expect Translink data to be high on anyone’s list). I know the tech heads are itchy to do things and hackathons are happening (the Urban Hackathon is coming up at the end of June). The real questions are these: Does it resonate with the public? Where’s the win win? What’s in it for them?

With all the talk of opening data up, “do you do open data, everyone does open data”, there’s little talk of the potential data collaboration between small and medium enterprises (SMEs) in the province. Does it matter? Yes of course it does.

A Simple Collaborative Example

Let’s take a hotel. Hastings Hotels operate a number of locations in the province. Is it possible to predict room rates 40 days out based on certain factors? Of course it is, there’s last year’s bookings and repeat visitors. That’s looking back, I’m more interested in predicting forward. Assuming occupancy is at 80%, what will it take to hit 90%?

Now I could rest on laurels and assume that Game of Thrones is going to push up the numbers, stick my finger in the air and see which way the wind is blowing.

Even better though would be to take a feed from somewhere that has plenty of rich event data, covering large scale events and smaller ones in the area. With a feed of dates and event types you could calculate the peak nights of occupancy. Data from What’s On NI (http://www.whatsonni.com) is about as rich as it gets: local events, big events and major events get listed. That data has value.

So the question is: Taking a feed from whatsonni.com can I (Hastings) calculate room rates for the next 40 days based on peak event data?

I believe it’s possible and a win for both parties, whatsonni.com could gain revenue from the feed for each of the hotels and if Hastings could raise peak pricing even by 7-10% on an average time of year room rate, the multiples involved would be a big win for them.

Concluding

I’ve dreamt up one example, a simple but highly effective one. It’s an easy sell to both parties, “I’d love both of you to increase revenue by collaborating, let’s do a trial for six months”. Now have a think about all the other businesses out there, data interconnecting and collaborating with each other. A series of paid for end points where everyone else could potentially benefit. This sort of thinking will raise NI’s bottom line and it’s all possible.

It’s also a perfect fit for proof of concept grants, where there is a solid basis of potential to see real benefit in all business sectors, not just development of what I would consider limited use mobile applications.

You’ll still need the help from the nerds, there’ll always be a need.

 

Sliding Window calculations in #Clojure

For time series calculations the sliding window is a tool for applying some calculation against the numbers in incremental stages.

This could be calculating the average temperature across a series of readings, or heart rate or something similar.

A set of numbers…. here you are in your REPL.

user> (def readings [1 3 4 5 3 4 2 6 5 4 3 5 7 8])
#'user/readings

The partition function will split those numbers up into a sequence of sequences. This is effectively your set of sliding windows.

user> (partition 3 1 readings)
((1 3 4) (3 4 5) (4 5 3) (5 3 4) (3 4 2) (4 2 6) (2 6 5) (6 5 4) (5 4 3) (4 3 5) (3 5 7) (5 7 8))

You can see the partition function has created a set for the first three numbers, then stepped one number to the right and created another set. It does that for the entire sequence of numbers you supply it.

Perhaps you want to calculate the average of each set of numbers. You can now apply a map function to work on each set the partition function has given you.

user> (map (fn [window] (double (/ (apply + window) (count window)))) (partition 3 1 readings))

(2.666666666666667 4.0 4.0 4.0 3.0 4.0 4.333333333333333 5.0 4.0 4.0 5.0 6.666666666666667)
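For anyone who prefers the command line to the REPL, the same windowing can be sketched with awk. This is only a rough equivalent under my own assumptions: the function name sliding_mean is made up for illustration, the step is fixed at 1, and the output is rounded to two decimal places instead of showing full doubles.

```shell
# Print the mean of each sliding window (step 1) over the readings given as arguments.
# Usage: sliding_mean WINDOW_SIZE reading [reading ...]
sliding_mean() {
  local w="$1"; shift
  printf '%s\n' "$@" | awk -v w="$w" '
    {
      idx = NR % w                   # slot in a circular buffer of size w
      if (NR > w) sum -= buf[idx]    # drop the reading that just left the window
      buf[idx] = $1
      sum += $1                      # add the newest reading
      if (NR >= w) printf "%.2f\n", sum / w
    }'
}

# Usage with the readings from the REPL session:
# sliding_mean 3 1 3 4 5 3 4 2 6 5 4 3 5 7 8
```

Keeping a running sum means each new reading costs one subtraction and one addition, instead of re-adding the whole window every time.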

Handy for monitoring internet of things readings and getting the average. Actually there are loads of uses when you start thinking of the possibilities.

Myth Busting Growth Figures….

Let’s consider this really quickly…..

[Image: screenshot of the tweet in question]

I respect this sort of thing being tweeted by @naomhs, on the whole it’s actually a good piece on taking the social media advantage, increasing eyeballs and digital engagement. That’s fine.

Anything with a percentage sign is like catnip to me though, especially when it’s about growth. In fact any press release where an organisation claims to have n% growth gets my attention because I’m always looking for two things, a starting number and an ending number.

For example:

Clicks last month = 10

Clicks this month = 100

Growth is ((this month - last month) / last month) * 100.

So ((100 - 10) / 10) * 100 = 900% = DRAFT A PRESS RELEASE!

Even if last month’s figure was 1 and this month’s is 10, it’s still a 900% growth rate! DRAFT ANOTHER PRESS RELEASE WITH KITTENS THIS TIME!
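To check a claimed growth figure yourself, all you need are the two numbers. Here’s a minimal sketch in shell; growth_pct is a name I’ve made up for illustration, and it simply applies the formula above, refusing to divide when the starting number is zero.

```shell
# Percentage growth from a starting number to an ending number.
# Usage: growth_pct LAST_MONTH THIS_MONTH
growth_pct() {
  awk -v prev="$1" -v curr="$2" 'BEGIN {
    if (prev == 0) { print "undefined"; exit }   # no starting number, no growth rate
    printf "%.1f\n", (curr - prev) / prev * 100
  }'
}

growth_pct 10 100   # prints 900.0
growth_pct 1 10     # prints 900.0
```

Same percentage both times, which is exactly why the starting and ending numbers matter more than the headline figure.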

 

The most important piece of advice I can give to a #startup: own your #code.

I’m going to get supported and slated in equal measure I feel but I’ve seen this so many times now that it’s becoming the elephant in the room so I’m going to comment all the same.

Dear Founder…..

What we do know, especially in Northern Ireland, is that there’s a lack of developer talent that is willing to work on a startup from the initial stages, no sweating out the product in the small hours. Founders have little option other than to go to development houses to get their concepts built so that they can be proven to the market.

Buyer Beware

When you are shopping around make sure you ask this simple question:

“The work that you do, do I own ALL the code?”

Hint, if they say “no” or “we use some of our own custom software” politely end the discussion immediately and walk away.

The emphasis in the question is on the word ALL. In order for your business to survive you have to be able to adapt your code at any time. Software houses are not there with your best interests at heart (regardless of what they might actually say to you, they’re a business and they need to survive too, it’s all about recurring revenue). If you don’t own ALL the code then you can’t adapt quickly or adapt at all.

If open source libraries are mentioned check the licensing agreements on them, not all open source is free. And make sure your developer in waiting shows you what libraries they are using and get the links so you can see them too.

I’ve seen many a company start well and within time end up like

[Image: a scared lady]

 

Trust me, it will hurt your revenue far more than it will hurt the development house.

My Advice To You, Founder

With the big wedge of cash (yours, an investor’s or the government’s funding) you are the customer who can call the shots. So demand 100% source code ownership, in your hands, in a GitHub account. In the event you need someone else to do some work as you grow, well then you can.

Even better is get friendly with a coder, even if they have a full time job, coders like to code so if you offer them a rate they’ll support you too. Have a developer fallback plan, you owe it to your business and your investor if she/he is putting the money in.

Review any SLAs you have with development houses and see exactly what you are getting for your money. Insist on a monthly statement of how many hours were actually spent on your business. Complain bitterly if you need to, the customer is king here though every development house would make you think you are nothing without them.

Basically you need the following for a fairly run of the mill web/mobile tech startup:

  • Someone who knows your web side code (PHP, Ruby, Java, Python or what have you)
  • An iOS developer if you have an Apple supported mobile product.
  • A developer who knows good Android development.
  • A server guy or gal who’ll advise, stress test and update your server (a lot of development houses will steer clear of the hard stuff and just code)

One person can cover all of those roles; they are rare but they do exist. I’d look to spread the workload where possible. In Northern Ireland we are acutely aware of a complete lack of good CTO material for startups but try and find a technical person that can articulate comments and ideas to the development house, there’s nothing a development house hates more than a person who knows what they do.

With so many new ideas coming out development houses are only too happy to greet you with open arms and discuss your dreams and visions. Corny as it might sound it’s a long term relationship so make sure you’ve done a bit of dating first to find a suitable match.

Ultimately though, make sure you’ve got your prenup in order for when you want to move, you need 100% of your code with you in order to continue your life once the developer separation happens.

 

 

Remember your daft ideas, well they’re not that daft after all. (@foldingathome @WiredUK)

My office is littered with notebooks, mainly Moleskines as I’m a colossal hipster nerd (sans beard) and I only write in them with Uniball Eye Micro pens (I knew you were all thinking it).

On the 6th June 2013 I scribbled some notes about using mobile phones as cluster node devices to do small chunks of processing, I even wrote a blog post. My rationale was simple, with all the phones on the planet in standby wouldn’t it make sense to use the downtime like we did with screen savers, trying to solve a bigger problem, a medical one for instance.

It didn’t get very far in my mind, too complex and while a few had tried Hadoop on Android the phones just couldn’t handle the load in a decent way. I toyed with using Zookeeper or RabbitMQ to do the communication work. Regardless of which way I was to do it, it was going to be hard for one man and a kitchen table to do this sort of thing. Not impossible, just hard.

Mistake 1 – I kept the idea to myself. 

Well that’s not 100% true, I did email one person and tell another. On the grounds of it being too complex for one person to handle I closed the notebook and left it at that, a fresh page waiting for the next idea.

That was a mistake….

Mistake 2 – Didn’t share with my collaboration partners

When I say “collaboration partners” I mean my developer network, me mates. Perhaps they’d see something I didn’t. Was I just way ahead of the right time? (Usually the way, far too early stage for my own good sometimes.) Regardless, I didn’t show it to them.

Mistake 3 – Believing your own internal critic

After staring at the said page in my notebook for a few days I left it as a bad idea. Too ahead of the curve, limited use and I couldn’t see that anyone would be interested.

Sometimes the inner rascal will tell you something is right when it’s wrong, sometimes it will tell you the opposite.

With all that in mind, I just found out something.

Turns out I was completely wrong……

The new issue of Wired landed on my door mat this afternoon, yes the print edition, can’t be doing with the digital edition of these sorts of things (apart from HBR). At the bottom of one page was something about an app that cures cancer while you sleep. The premise is built on using mobiles as a cluster….. beautiful, and the numbers are highly encouraging. “…at one point 53,000 phones were working at one time. This is twice as fast as any supercomputer in the US”.

So, Folding@Home out of Stanford University: well done for daring, poking the devices and seeing what was possible. Vijay Pande I salute you! Well done and brilliant work.

If you want to download the app and put your device to work, have a look at the Folding@Home website.

 

 

 

Reflections beyond Big Data Week – #bigdata #bdwbelfast

I’ve had some time to reflect on a few things recently, one of which was the Big Data Week Belfast panel.


It was a delight to sit with my friend Tom Gray, CTO of Kainos, Adele Marshall director of research at Centre for Statistical Science at Queens University and Padraic Sheerin of the Prudential who had the unenviable task of keeping us all in order, he did a good job.

Now for the record I usually make sure of two things when I’m asked to do a panel. Firstly, I’ll be an independent voice and I certainly don’t arrive to toe a sales department’s line or check with PR or HR about what can or can’t be said. Spade’s a spade and all that. Secondly, I tend not to hang about afterwards…..

I made a few points but time was precious so I didn’t get time to elaborate as much as I could have.

“A new set of cliches”

I think we’re at a point now where big data is just data. The real mission is how to, if there’s a case to, process it. BigData has now resolved itself to a worn out marketing term but it’s fair to say that it’s the term that companies still look up to.

So I firmly believe it’s time to use a new set of cliches and that means coming up with a new set of terms first.

“What advice for SME’s?”

We did establish that the price of utility computing is coming down, something I also emphasised in The Profit Margin interview. What I will say again and again is know what question you’re trying to answer. Then ensure you have the right data to hand. I’ve come across companies who want to know their target audience but never actually retained the customer data in order to get that answer.

I know it sounds daft but the harsh reality is that a lot of companies don’t know. Also, these kinds of questions are rarely from a technical perspective but come from all parts of the board, C level and employee levels of the company.

I liken a data project to a web design project, we’ve got to that point. I don’t believe one tool will save you but a suite of skills that may come from different people. Storage costs are down, processing costs are down, putting it in the cloud costs are down…. it’s the brains to make it all work that are the main cost.

“The Health Opportunities…..”

A huge talking point at the moment is the link between health, in terms of prediction, savings and monitoring, and the over hyped Internet of Things (IoT). To be honest I didn’t say much here; this is really more Tom and Adele’s gig than mine. Though saying that, I did throw in a curve ball.

I used the Clubcard data as an example (don’t I always), this time though with the emphasis on data collaboration. The majority of hands went up when I asked who was a Clubcard holder. With those hands up I threw the ball, “who’d be happy for that transaction data to be sold for health monitoring purposes?”.

No surprises, 99% of the hands went down. I believe this is what’s next, data collaboration on a massive scale. Tesco sharing data with insurance companies (they have one already so chances are it’s already happening to a point). A health startup being able to see the rough levels and categories of food shopping, alcohol consumption and so on. Could you predict a family’s health outlook by their shopping habits? You probably could. If a supermarket can determine what trimester a pregnant mother is in, then yes anything is possible.

“A New Name for Big Data”

Gerald, Bruce…. I’ll settle with Data Analytics.