I received a question from Boris Drakemandrillsquirrelhugger*: “Jase, you do data science, how many new jobs have Invest Northern Ireland announced in total?”

“Bless My Cotton Socks I’m In The News”

First we need headlines and in one line of Linux we can have the whole lot.

$ for i in {1..314}; do curl "http://www.investni.com/news/index.html?page=$i" > "news_$i.html"; done

This is exactly how I pulled the nijobs.com data in a previous blog post. Each page holds 10 headlines and there are 3138 headlines, so 314 pages will cover the lot. While that’s pulling all the html down you may as well get a cuppa….
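If you want to check the page count rather than take my word for it, it is just the headline total divided by the page size, rounded up. A minimal sketch in shell arithmetic; pages_needed is a made-up helper name, not something from the site:

```shell
# pages_needed TOTAL PER_PAGE
# Hypothetical helper: integer ceiling division, so a part-filled
# last page still counts as a page.
pages_needed() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

# 3138 headlines at 10 per page
pages_needed 3138 10   # prints 314
```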


Messing With The Output

The output is basically html pages. You could fire up Python and BeautifulSoup parsers and anything else that takes your fancy, or just use good old command line data science.

egrep -ohi "[0-9]+ new jobs" *.html | egrep -o "[0-9]+" | awk '{ sum += $1 } END { print sum }'

I’m piping three Linux commands together: two egreps and an awk. The first egrep pulls out “[a number] new jobs”. The -o flag shows only the string that matched the regular expression, -i ignores case (otherwise “New jobs” and “new jobs” are treated as different) and -h drops the filename from the output.

58 new jobs
61 new jobs
61 new jobs
84 new jobs
84 new jobs
84 new jobs
30 new jobs
30 new jobs
10 new jobs
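To see what the flags are doing, you can try them on a single line. This is only a sketch: the sample line is made up rather than taken from the real pages, and I’ve written grep -E, which is the same thing as egrep. The -h flag isn’t shown because it only matters when grep is reading several files.

```shell
# A made-up line standing in for one headline in a downloaded page.
sample='<a href="#">Company announces 58 New Jobs in Derry</a>'

# Without -o grep prints the whole matching line.
printf '%s\n' "$sample" | grep -Ei "[0-9]+ new jobs"

# With -o only the matched text is printed: 58 New Jobs
printf '%s\n' "$sample" | grep -Eoi "[0-9]+ new jobs"

# Without -i the capitalised "New Jobs" is not matched at all.
printf '%s\n' "$sample" | grep -Eo "[0-9]+ new jobs" || echo "no match"
```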

The second egrep strips that down to just the figure.

30
30
30
40
82
82
15
300
300
23
540
540
36
125
125

And the exciting part is the awk command at the end, which adds up the stream of numbers.

70758

Now that last figure is what we’re after. One caveat: any headline with a comma in the figure gets miscounted, because the match only starts after the comma (“1,200 new jobs”, say, would be counted as 200)…. the first regexp will need tweaking…. you can play with that. So as a rough estimate, since June 2003 there have been over 70,000 new jobs announced in INI headlines.
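If you do fancy playing with it, here is one way the tweak might look. It’s a sketch, not something I ran against the real pages: sum_new_jobs is a made-up helper name, the test headlines are invented, and it assumes commas only ever appear as thousands separators.

```shell
# sum_new_jobs: hypothetical helper that reads text on stdin and adds up
# every "N new jobs" figure, allowing commas inside N (e.g. 1,200).
sum_new_jobs() {
  grep -Eoi "[0-9][0-9,]* new jobs" |          # e.g. "1,200 new jobs"
    grep -Eo "[0-9][0-9,]*" |                  # keep the figure: "1,200"
    tr -d ',' |                                # drop the commas: "1200"
    awk '{ sum += $1 } END { print sum + 0 }'  # + 0 prints 0 on empty input
}

# Try it on a couple of invented headlines.
printf '%s\n' "Firm to create 1,200 new jobs" "58 New Jobs for Belfast" | sum_new_jobs
```

Against the downloaded pages it would be run as cat news_*.html | sum_new_jobs.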

The number you won’t get is how many were filled.

* The names have been changed to protect the innocent. In fact, they were just made up….. no one asked at all.
