Cutting down on the verbosity of #Spark messages

Spark shell is great, but one of its major issues is the amount of logging it dishes out. It can get frustrating when you are trying to debug things.

Easily solved though.

In your SPARK_HOME/conf directory you’ll find a template for the log4j properties file. Make a copy of it.


Edit with your favourite text editor and change:

log4j.rootCategory=INFO, console

to

log4j.rootCategory=WARN, console

When you restart the Spark shell you’ll have a fighting chance of seeing the output.
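If you would rather script the change than open an editor, here is a minimal Python sketch of the same edit. It works on the text of the properties file; the function name set_root_level is my own invention for illustration, and reading and writing the actual file is left to you.

```python
import re


def set_root_level(properties_text: str, level: str = "WARN") -> str:
    """Return the properties text with the log4j root logging level replaced."""
    # Match the start of the rootCategory line and the level that follows it.
    pattern = re.compile(r"^(log4j\.rootCategory\s*=\s*)\w+", re.MULTILINE)
    return pattern.sub(lambda m: m.group(1) + level, properties_text)


# The line from the post, before and after.
before = "log4j.rootCategory=INFO, console\n"
after = set_root_level(before)
print(after, end="")
```

Lines that do not set log4j.rootCategory pass through untouched, so the rest of the file stays as it was.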



NI Software Skills – Reality Check Time (@PathXL)

Seems to me that things are hitting not-quite-crisis-point. So what I’m about to say is opinion and not a criticism of any of the fine companies involved. My word to NI tech companies is simple: Wanting to be a programmer is a choice, not an expectation because of demand.

“NI students ‘training for wrong careers’ says PathXL head.”

The first headline I read this morning……


The irony is that the photo probably illustrates the reason why so many people don’t want to be programmers. It comes across as a grey and boring profession. And to be fair, those views are at times justified.

I’ve said during my 27 year career as an engineer, programmer, technologist and big data/machine learning nerd that it takes a certain type of person to do this job. And while Mr Speed has a very good point, those mid tier jobs will be automated over the next 5-10 years, his words sound like a cry of a chief exec at the point of outsourcing.

It’s All About the Percentages

In December 2013 I was invited by Momentum to speak at one of their BringItOn sessions. As it was local to me, I duly obliged as a civic duty to inform. Two things struck me that day.

1. No Belfast company bothered their arse to attend. If you want programmers so bad then you are going to have to find them, they may not come to you.

2. Out of a room of 400+ students about 10% stuck their hands up at the end about wanting to explore this career further. This didn’t come as a complete shock to me but I was talking to the students about it and asked why some of them didn’t put their hands up, “well, no offence but it looks boring”.

The Skills Shortage

Mr Speed is right though, jobs will go unfilled. Not just at PathXL but at AllState, Citi, Kainos, Liberty IT and all the other companies that have made announcements. All great companies, great stories and great results. Lovely people to boot.

Focusing on education to sort it all out? Well, the UK has been here before. In the late 90s the universities were pumping out computer science graduates like lemmings. All fine until the dot com bubble finally burst and the supply crashed to the floor.

What you cannot do is streamline a production line of government forced education to create programmers to satisfy the needs of companies. It’s not the done thing.

Education should be the rich fabric of disciplines, from science to art and everything in-between. Just because students want to study law or become a teacher does not mean they are wasting their time. Their prize could be the flight departing from Belfast International to a new life in another country. Who said they’d stay in the first place?

The Conundrum

The second one in six months, I’m doing well.

1. You can’t tell people what to do. Telling an individual they’re doing the wrong course or degree just to satisfy your company’s skills demands, well, it’s damaging in the long term. Workers will get bored if their hearts are not in the profession and then leave. That leaves the company with the same problem further down the line.

2. Experienced programmers are very difficult to find. Always have been and always will be.

3. Even on the mainland the question is popped, “Are you willing to relocate?”. So ask that here, would a good programmer be willing to relocate to Belfast? Let me put it this way, I’m classed as the rank outsider being in Limavady and knowing what I know.


(I’m the one on the far left of the graph).


Your money, your great working environment, your blow football table, your team get togethers…. they are not selling points but additional extras. Real programmers only are bothered about the challenge of the task in hand. Everything else is a little extra.

In terms of the skills gap I don’t think you can quickly educate your way out of it. Certainly not in the short term, perhaps in the 5-10 year bracket. So coding in schools is a good start but you’ll hit the same issue again and again….

….ultimately, programming and coding is not for everyone.

(And Mr Speed, I’m happy to talk this over any time. Here’s my phone number, 07900 316333).

Jason Bell is a Data/Hadoop consultant based in Northern Ireland but helps companies globally with various BigData, Hadoop and Spark projects. He also offers training on Hadoop, the Hadoop Ecosystem and Spark to developers and anyone interested in what these technologies can do. He’s also the author of “Machine Learning – Hands On For Developers and Technical Professionals”.




#Oscars – How did I do?

Yesterday evening I posted a bunch of predictions without resorting to data mining, Twitter analysis or reading anything by Nate Silver. Just good old guessing. As results go, guessing didn’t do too badly by me: 14/24 (58.3%).

[Update: Turns out that FiveThirtyEight’s predictions in the “top six” were the same as mine. We both got 5/6 (83%) and missed on best director. I would have liked to have seen how Nate and Co. managed on the other categories, which are much harder to predict.]


Best Picture
Winner: Birdman. Prediction: Birdman.

Actor in a Leading Role
Winner: Eddie Redmayne. Prediction: Eddie Redmayne.

Actress in a Leading Role
Winner: Julianne Moore. Prediction: Julianne Moore.

Actor in a Supporting Role
Winner: J.K. Simmons. Prediction: J.K. Simmons.

Actress in a Supporting Role
Winner: Patricia Arquette. Prediction: Patricia Arquette.

Animated Feature Film
Winner: Big Hero 6. Prediction: How To Train Your Dragon 2.

Cinematography
Winner: Birdman. Prediction: Birdman.

Costume Design
Winner: The Grand Budapest Hotel. Prediction: The Grand Budapest Hotel.

Directing
Winner: Alejandro Gonzalez Inarritu (Birdman). Prediction: Richard Linklater.

Documentary Feature
Winner: CitizenFour. Prediction: CitizenFour.

Documentary Short Subject
Winner: Crisis Line: Veterans Press 1. Prediction: Crisis Line: Veterans Press 1.

Film Editing
Winner: Whiplash. Prediction: The Imitation Game.

Foreign Language Film
Winner: Ida. Prediction: Ida.

Makeup and Hairstyling
Winner: The Grand Budapest Hotel. Prediction: The Grand Budapest Hotel.

Music, Original Score
Winner: Alexandre Desplat. Prediction: Hans Zimmer.

Music, Original Song
Winner: “Glory” from Selma. Prediction: “I’m Not Gonna Miss You”.

Production Design
Winner: The Grand Budapest Hotel. Prediction: Into The Woods.

Short Film, Animated
Winner: Feast. Prediction: Feast.

Short Film, Live Action
Winner: The Phone Call. Prediction: Boogaloo and Graham.

Sound Editing
Winner: American Sniper. Prediction: The Hobbit: Battle of the Five Armies.

Sound Mixing
Winner: Whiplash. Prediction: American Sniper.

Visual Effects
Winner: Interstellar. Prediction: Guardians Of The Galaxy.

Writing, Adapted Screenplay
Winner: The Imitation Game. Prediction: The Imitation Game.

Writing, Original Screenplay
Winner: Birdman. Prediction: Birdman.
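For anyone who wants to check the score rather than take my word for it, here is a small Python sketch that tallies the list above. The (winner, prediction) pairs are copied from the results in the same order; the variable names are just for illustration.

```python
# (winner, prediction) pairs in the order listed above.
results = [
    ("Birdman", "Birdman"),
    ("Eddie Redmayne", "Eddie Redmayne"),
    ("Julianne Moore", "Julianne Moore"),
    ("J.K. Simmons", "J.K. Simmons"),
    ("Patricia Arquette", "Patricia Arquette"),
    ("Big Hero 6", "How To Train Your Dragon 2"),
    ("Birdman", "Birdman"),
    ("The Grand Budapest Hotel", "The Grand Budapest Hotel"),
    ("Alejandro Gonzalez Inarritu", "Richard Linklater"),
    ("CitizenFour", "CitizenFour"),
    ("Crisis Line: Veterans Press 1", "Crisis Line: Veterans Press 1"),
    ("Whiplash", "The Imitation Game"),
    ("Ida", "Ida"),
    ("The Grand Budapest Hotel", "The Grand Budapest Hotel"),
    ("Alexandre Desplat", "Hans Zimmer"),
    ("Glory", "I'm Not Gonna Miss You"),
    ("The Grand Budapest Hotel", "Into The Woods"),
    ("Feast", "Feast"),
    ("The Phone Call", "Boogaloo and Graham"),
    ("American Sniper", "The Hobbit: Battle of the Five Armies"),
    ("Whiplash", "American Sniper"),
    ("Interstellar", "Guardians Of The Galaxy"),
    ("The Imitation Game", "The Imitation Game"),
    ("Birdman", "Birdman"),
]

# A prediction is correct when it matches the winner exactly.
correct = sum(1 for winner, prediction in results if winner == prediction)
total = len(results)
print(f"{correct}/{total} ({100 * correct / total:.1f}%)")  # 14/24 (58.3%)
```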


#Oscar Winner Predictions


No BigData, no machine learning, no Hadoop and no Spark. Just outright guesses and gut instinct. I’m not going to stream Twitter for hashtags…. nothing, nada, just have a best guess.

There’s no method here. I must check the betting odds….. let’s have another look in the morning to see how it all turned out.

Best Picture

Birdman

Actor in a Leading Role

Eddie Redmayne

Actress in a Leading Role

Julianne Moore

Actor in a Supporting Role

J.K. Simmons

Actress in a Supporting Role

Patricia Arquette

Animated Feature Film

How To Train Your Dragon 2

Cinematography

Birdman

Costume Design

The Grand Budapest Hotel

Directing

Richard Linklater

Documentary Feature

CitizenFour

Documentary Short Subject

Crisis Line: Veterans Press 1

Film Editing

The Imitation Game

Foreign Language Film

Ida

Makeup and Hairstyling

The Grand Budapest Hotel

Music, Original Score

Hans Zimmer

Music, Original Song

“I’m Not Gonna Miss You”

Production Design

Into The Woods

Short Film, Animated

Feast

Short Film, Live Action

Boogaloo and Graham

Sound Editing

The Hobbit: Battle of the Five Armies

Sound Mixing

American Sniper

Visual Effects

Guardians Of The Galaxy

Writing, Adapted Screenplay

The Imitation Game

Writing, Original Screenplay

Birdman





Running Scala scripts in #Spark

The Spark shell serves us all well. You can quickly prototype some simple lines of Scala (or Python with PySpark) and quit the program with a little more insight than you started with.


There are points in time when those scraps of code are handy enough to warrant keeping hold of them. Scala is nice in the sense that you can either run the script without compiling or you can compile your code to a full application.

WordCount From The Shell

Take the (classic) word count functionality. With Spark it’s a doddle…

scala> val text = sc.textFile("/Users/Jason/coffee.csv")
scala> val counts = text.flatMap(line => line.split(" ")).map(word => (word,1)).reduceByKey(_+_)
scala> counts.collect

15/02/21 14:52:55 INFO DAGScheduler: Job 0 finished: collect at <console>:17, took 0.898995 s
res0: Array[(String, Int)] = Array((Tea,66461), (Latte,8324), (Capuccino,8391), (Flat_White,8499), (Americano,8325))

It’s not fun retyping that in every time you want to do a quick word count though.
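If the chain of flatMap, map and reduceByKey looks opaque, here is a rough sketch of what it computes, written in plain Python with no Spark involved. The two sample lines are made up for illustration and are not the contents of coffee.csv.

```python
from collections import Counter

# Made-up stand-in for the lines of the input file.
lines = ["Tea Latte Tea", "Americano Tea"]

# flatMap: split every line on spaces and flatten into one list of words.
words = [word for line in lines for word in line.split(" ")]

# map + reduceByKey: pair each word with 1 and add up the 1s per word.
counts = Counter(words)

print(sorted(counts.items()))  # [('Americano', 1), ('Latte', 1), ('Tea', 3)]
```

The difference with Spark is that each of those steps is spread across the cluster, which is the whole point once the file stops fitting on one machine.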

WordCount From The Command Line with a Script

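Saving the lines you ran in the shell as a script is easy enough to do. Create a text file, let’s call this one wc.scala, containing those same lines:

val text = sc.textFile("/Users/Jason/coffee.csv")
val counts = text.flatMap(line => line.split(" ")).map(word => (word,1)).reduceByKey(_+_)
counts.collect
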
To run it from the command line is just a case of firing up the shell again, using the -i flag to specify an input file.

$SPARK_HOME/bin/spark-shell -i wc.scala

Note that the shell doesn’t exit. So edit your wc.scala file and add an exit call as the last line.






Sigma for Programmers (#math #programming #machinelearning #bigdata)

During 2014, while I was writing the book “Machine Learning – Hands-On For Developers and Technical Professionals”, it became very clear that I was going to have to tackle an issue that I’d done well to avoid for most of my 27-year career in computing…. In the UK the concept of mathematical notation was never really made clear, it was just put in the whole “algebra” camp and left at that. Things may have changed now (hopefully) but it left a bit of a gap when I actually needed it.

Scary Monsters

Writing the book I kept coming across mathematical notation used to prove concepts; it was everywhere. And to the ageing programmer who was well versed in experience but not so much in academic training, well, it got a bit scary. Especially this big foreboding scary one…. Σ

There’s Something About Sigma

Perhaps it’s because it’s big, it looks serious and it looks like it means business. It means “sum”, add it all up. That’s it. Something so foreboding to show something so simple. Let’s take a simple example:

$$\sum_{i=1}^{100} 3i$$

What’s being said here is “add up every value of i * 3 for i from 1 to 100”, or (1*3) + (2*3) + (3*3) + … + (99*3) + (100*3).

From a programming perspective what I have here is a for loop: an iterator starting at 1 and finishing at 100. The iterator starts at the value below the sigma (1) and runs each time until the top value (100) is reached. The action performed on each value of i is to the right of the sigma. In Java it would look like:

public class SigmaTest {
    public static void main(String[] args) {
        int total = 0;
        // Add up i * 3 for every i from 1 to 100 inclusive.
        for (int i = 1; i <= 100; i++) {
            total = total + (i * 3);
        }
        System.out.println(total);
    }
}

Or Python (using the Python shell) it would look like:

>>> total = 0
>>> for i in range(1,101):
...     total = total + (i*3)
...
>>> total
15150
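The same idea generalises to any sigma. As a rough sketch, the helper below (the name sigma and its parameters are mine, not from any library) takes the number under the sigma, the number on top, and the expression to its right as a function. The closed form 3 * n(n+1)/2 is included only as a cross-check.

```python
def sigma(lower, upper, term):
    """Add up term(i) for every integer i from lower to upper inclusive."""
    return sum(term(i) for i in range(lower, upper + 1))


# The example from above: the sum of 3i for i from 1 to 100.
total = sigma(1, 100, lambda i: i * 3)

# Cross-check: 1 + 2 + ... + n equals n(n+1)/2, so the sum is three times that.
closed_form = 3 * 100 * 101 // 2

print(total, closed_form)  # 15150 15150
```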

Ultimately sigma isn’t that scary after all, I should know, it’s all over the place in the book :)


Lottery Frequencies…. #hadoop #datamining

Not Impossible, Just Improbable

Lotteries with prizes have been kicking around since the 15th Century but the idea of randomly drawing lots goes back way further to the Chinese Han Dynasty around about 200BC give or take a few years.


The six ball National Lottery in the UK (6 out of 49) gives you a 1 in 13,983,816 chance of winning the jackpot….. slim but doable.
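That figure is simply the number of ways to choose 6 balls from 49 when order doesn’t matter. A quick check in Python (math.comb needs Python 3.8 or later):

```python
import math

# Number of distinct sets of 6 balls that can be drawn from 49.
combinations = math.comb(49, 6)

print(f"1 in {combinations:,}")  # 1 in 13,983,816
```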

Number Frequencies

So which number is drawn the most? Easy to find out: you can pull the last 180 days of draw data as a CSV file and have a look.

I’m only interested in the main six numbers, not the bonus ball.

41 9 8 35 21 10
17 35 3 6 18 15
10 46 13 22 40 17
33 7 39 44 10 16
3 34 17 4 24 30
44 8 19 4 49 35
6 34 45 1 32 37
49 47 46 42 37 29
36 17 29 28 33 20
43 13 14 41 24 16
20 23 10 5 12 4
10 18 19 15 17 31
49 37 38 22 12 18
46 28 11 23 30 32
13 47 44 9 48 7
18 23 32 42 40 22
19 33 46 2 35 24
18 33 30 48 34 38
14 15 47 36 31 42
19 45 23 49 40 43
17 35 4 37 19 25
11 42 18 19 6 38
49 41 30 29 28 26
15 5 29 22 3 2
2 39 36 35 15 38
42 30 26 28 5 44
9 5 44 41 13 10
7 22 27 42 6 35
21 25 34 5 36 2
35 3 9 47 28 5
31 14 13 17 25 49
43 15 11 17 49 30
42 5 28 24 36 47
30 47 40 22 1 33
19 43 24 6 26 42
26 2 32 23 8 5
28 34 27 4 43 29
24 34 4 18 36 48
5 4 47 1 18 7
46 20 3 19 1 7
33 48 29 38 8 4
22 6 26 2 33 48
34 9 41 19 46 22
30 29 12 15 35 22
24 31 13 16 18 43
11 37 32 48 29 40
35 22 27 23 12 34
1 20 35 46 19 30
17 29 5 49 37 36
11 42 38 7 1 41
12 24 35 47 15 6

Finally! A use for Hadoop’s Wordcount!

I could write a program to work out the frequencies but there’s something in Hadoop that’s much ridiculed but will do the job perfectly, our friend the word count example.

$ /usr/local/hadoop-1.2.1/bin/hadoop jar /usr/local/hadoop-1.2.1/hadoop-examples-1.2.1.jar wordcount lottery.txt lotteryout

Running the script sets off a local Hadoop job and gives us the following output:

$ sort -k2rn part-r-00000
35 12
19 10
22 10
5 10
17 9
18 9
29 9
30 9
42 9
15 8
24 8
34 8
4 8
47 8
49 8
28 7
33 7
36 7
46 7
6 7
1 6
10 6
13 6
2 6
23 6
37 6
38 6
41 6
43 6
48 6
7 6
11 5
12 5
26 5
3 5
32 5
40 5
44 5
9 5
20 4
31 4
8 4
14 3
16 3
25 3
27 3
21 2
39 2
45 2

So ball number 35 has been drawn 12 times in the last 51 draws (23.5%).
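Hadoop is overkill for 51 lines, of course. As a point of comparison, here is a small Python sketch of the same frequency count. To keep it short it only uses the first three draws from the list above, so its counts are not the full results.

```python
from collections import Counter

# First three draws from the data above (a sample, not the full 51).
draws_text = """41 9 8 35 21 10
17 35 3 6 18 15
10 46 13 22 40 17"""

draws = [line.split() for line in draws_text.splitlines()]

# Count how many times each ball number appears across all draws.
counts = Counter(number for draw in draws for number in draw)

for number, count in counts.most_common(3):
    share = 100 * count / len(draws)
    print(f"{number}\t{count}\t{share:.1f}% of draws")
```

The percentage is worked out the same way as the 23.5% above: appearances divided by the number of draws, which for ball 35 over the full data is 12 / 51.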

What Chances Do My Numbers Have?

I can check the frequencies of my numbers against the last batch of results (the ones I’ve just processed with Hadoop) by crafting a really quick bash script.

$ for i in 15 17 21 23 25 27; do grep -E "^$i"$'\t' part-r-00000; done
15 8
17 9
21 2
23 6
25 3
27 3

So there are a couple of numbers in there that have done well, 15 and 17. The rest remain a surprise.
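The same lookup can be sketched in Python, assuming the word count output is tab-separated "number, count" lines as shown earlier. The function name parse_wordcount is made up for this example, and the sample text holds just a handful of lines copied from the output above rather than the whole file.

```python
def parse_wordcount(text):
    """Turn 'number<TAB>count' lines into a {number: count} dictionary."""
    frequencies = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        number, count = line.split()
        frequencies[int(number)] = int(count)
    return frequencies


# A few lines copied from the output above.
sample_output = "35\t12\n15\t8\n17\t9\n21\t2\n23\t6\n25\t3\n27\t3\n"
frequencies = parse_wordcount(sample_output)

my_numbers = [15, 17, 21, 23, 25, 27]
for number in my_numbers:
    # A ball that never appeared in the output counts as zero.
    print(number, frequencies.get(number, 0))
```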

It’s All Random

Let’s not forget, there’s no secret solution, no method. It’s all random, though looking at the number frequencies you would wonder why ball 35 crops up 12 times compared to ball 45, which only appeared twice.

This is only a small sample size too, 51 draws over the last 180 days. The UK lottery has been in operation since 1994 so there’s been many draws. When all the results are analysed you’d expect the frequencies to even out over time.
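To put a number on "even out": with 6 balls drawn from 49, each ball should turn up about 51 × 6 / 49 ≈ 6.2 times in 51 draws. The Python sketch below simulates fair draws to show how wide the spread is for a small sample and how it narrows, relative to the expected count, for a large one. It is a simulation with a fixed seed of my choosing, not real lottery data.

```python
import random
from collections import Counter


def simulate(draws, seed=0):
    """Simulate fair 6-from-49 draws and count how often each ball appears."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(draws):
        counts.update(rng.sample(range(1, 50), 6))
    return counts


for draws in (51, 5000):
    counts = simulate(draws)
    expected = draws * 6 / 49  # average appearances per ball
    lowest = min(counts[ball] for ball in range(1, 50))
    highest = max(counts[ball] for ball in range(1, 50))
    print(f"{draws} draws: expected {expected:.1f}, lowest {lowest}, highest {highest}")
```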

And if that isn’t random enough, the Bulgarian lottery of 2009 saw the same six numbers drawn two weeks in a row. It wasn’t fraud, or a fix, it was random.

