I’ve been reading up on R and playing with it here and there. With little time on my hands I can only turn to small, self-made challenges to help me learn something new. I also thought it might be nice to share the process and see where we go…

There be data in thems hills.

 First of all we need data. So, off I tramps to the excellent Lee Munroe creation www.tweetni.com to rip the backside out of it… in a nice way.  I don’t want names, I want numbers.  So let’s scrape the follower numbers from the tweeters listed.

A simple shell script will pull the data for us and save each page as an HTML file.  All good data mining starts with good data and with finding a way to make mundane, repetitive processes easier to do.  So…

jasonbell$ for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ; do curl "http://tweetni.com/people?order=followers&page=$i" > page$i.html ; done

The shell script loops over the page numbers we have listed. I know that there are 19 pages of data, so I can use curl to pull each HTML page and save it to its own file.  Easy.
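If typing out all 19 numbers feels like a chore, the same loop can be sketched with a counter. This is only a sketch: the URL pattern is copied from the command above, while the function names page_url and fetch_pages are made up for illustration. The URL is kept in quotes so the shell doesn't read the & as "run this in the background".

```shell
#!/bin/sh
# Sketch only: page_url and fetch_pages are hypothetical helper names.

# Print the listing URL for one page number.
page_url() {
    printf 'http://tweetni.com/people?order=followers&page=%s\n' "$1"
}

# Fetch pages 1..N into page1.html .. pageN.html in the current directory.
fetch_pages() {
    last=$1
    i=1
    while [ "$i" -le "$last" ]; do
        # -s silences the progress meter, -o names the output file.
        # The quotes keep the & inside the URL instead of backgrounding curl.
        curl -s -o "page$i.html" "$(page_url "$i")"
        i=$((i + 1))
    done
}
```

Calling fetch_pages 19 would then pull the same 19 pages as the one-liner.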

That gives us the following files to play with:
jasonbell$ ls -l
total 1888
-rw-r--r--  1 jasonbell  staff  51184 24 Aug 21:04 page1.html
-rw-r--r--  1 jasonbell  staff  51054 24 Aug 21:04 page10.html
-rw-r--r--  1 jasonbell  staff  51227 24 Aug 21:05 page11.html
-rw-r--r--  1 jasonbell  staff  51265 24 Aug 21:05 page12.html
-rw-r--r--  1 jasonbell  staff  50695 24 Aug 21:05 page13.html
-rw-r--r--  1 jasonbell  staff  50403 24 Aug 21:05 page14.html
-rw-r--r--  1 jasonbell  staff  50367 24 Aug 21:05 page15.html
-rw-r--r--  1 jasonbell  staff  49950 24 Aug 21:05 page16.html
-rw-r--r--  1 jasonbell  staff  49987 24 Aug 21:05 page17.html
-rw-r--r--  1 jasonbell  staff  48480 24 Aug 21:05 page18.html
-rw-r--r--  1 jasonbell  staff  11798 24 Aug 21:05 page19.html
-rw-r--r--  1 jasonbell  staff  51008 24 Aug 21:04 page2.html
-rw-r--r--  1 jasonbell  staff  51031 24 Aug 21:04 page3.html
-rw-r--r--  1 jasonbell  staff  50878 24 Aug 21:04 page4.html
-rw-r--r--  1 jasonbell  staff  50493 24 Aug 21:04 page5.html
-rw-r--r--  1 jasonbell  staff  51293 24 Aug 21:04 page6.html
-rw-r--r--  1 jasonbell  staff  50982 24 Aug 21:04 page7.html
-rw-r--r--  1 jasonbell  staff  50991 24 Aug 21:04 page8.html
-rw-r--r--  1 jasonbell  staff  51165 24 Aug 21:04 page9.html

Getting the data we need.
If you open one of the html files you’ll see the table row with the user, twitter name, their bio and, most important to us, the number of followers.

<td class="alignright">41221</td>

This is the only piece of data we’re interested in, so it makes sense to make life easier for ourselves and pull just those lines out.

Once again a quick shell script will help us out:

jasonbell$ for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19; do grep "alignright" page$i.html >> collate.html ; done

The >> in the shell script appends to our file collate.html and doesn’t create a new file each time. Not perfect but fine for our needs…
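One wrinkle with >> is that running the loop a second time appends everything again, so every number ends up in collate.html twice. A minimal sketch of one way round that is to redirect the whole loop once with a single >, so the file is emptied at the start and then filled. The function name collate_pages is made up for illustration, and it assumes the pageN.html files sit in the current directory.

```shell
#!/bin/sh
# Sketch only: collate_pages is a hypothetical helper name.
# Usage: collate_pages LAST_PAGE OUTPUT_FILE

collate_pages() {
    last=$1
    out=$2
    i=1
    while [ "$i" -le "$last" ]; do
        # Print every line that mentions the follower-count cell.
        grep 'alignright' "page$i.html"
        i=$((i + 1))
    done > "$out"   # one redirect for the whole loop: truncate once, then fill
}
```

With this, collate_pages 19 collate.html can be rerun as often as you like and the output always holds one copy of each line.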

So we’re left with….

    <td class="alignright">188246</td>
    <td class="alignright">41221</td>
    <td class="alignright">26779</td>
    <td class="alignright">23372</td>
    <td class="alignright">18931</td>
    <td class="alignright">18197</td>
    <td class="alignright">17950</td>
    <td class="alignright">17069</td>
    <td class="alignright">16692</td>
    <td class="alignright">15816</td>
    <td class="alignright">12373</td>
    <td class="alignright">10611</td>
    <td class="alignright">9974</td>
    <td class="alignright">9550</td>
    <td class="alignright">9413</td>
    <td class="alignright">9398</td>

Hack out the HTML….
With a text editor of choice we need to strip out the TD markup; we don’t need it.
I’m using vi so my search and replace goes like:
:%s/    <td class="alignright">//g
:%s/<\/td>//g

It doesn’t matter much but I rename collate.html to collate.txt…. it’s not html anymore. 🙂
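If you’d rather not open an editor at all, the same two substitutions can be sketched with sed. This assumes every line looks exactly like the sample above (optional leading spaces, the opening td tag with plain double quotes, the number, the closing tag); strip_cells is a made-up name.

```shell
#!/bin/sh
# Sketch only: strip_cells is a hypothetical helper name.
# Reads lines like
#     <td class="alignright">41221</td>
# from the named files (or stdin) and prints just the number.

strip_cells() {
    sed -e 's|^[[:space:]]*<td class="alignright">||' \
        -e 's|</td>[[:space:]]*$||' "$@"
}
```

Something like strip_cells collate.html > collate.txt then does the clean-up and the rename in one go. Using | as the sed delimiter saves escaping the slash in the closing tag.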

Doing the funky stuff with R
R is an open source program for statistical analysis. You can download it for your machine here.

The interface is unforgiving but there is plenty of documentation about. The opening screen is pictured below.

R_screen1
Loading the data in R
We want to read our newly created collate.txt file into R and associate all that data with a name.  So we’ll have an object called "tweetni" and load the data in with:

> tweetni <- read.table(file("[path to your data files]/collate.txt"))

Fun with numbers in no time at all.
We can get the real quick basics of our data like the mean (average), median and other goodies with one simple command.

> summary(tweetni)

And we get…

R_screen2

Good eh!

Standard Deviation
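Standard Deviation is a doddle as well… One thing to watch: newer versions of R want sd() pointed at a column of numbers rather than the whole data frame, and read.table has named our single column V1.

> sd(tweetni$V1)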

Gives us….

R_screen3
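As a sanity check on what R reports, the mean and standard deviation can be roughly reproduced from the shell with awk. This is a sketch under assumptions: the file holds one number per line, stats is a made-up name, and the n - 1 divisor is used because that is what R’s sd() uses. The sum-of-squares shortcut is fine for follower counts but isn’t the most numerically careful way to do it.

```shell
#!/bin/sh
# Sketch only: stats is a hypothetical helper name.
# Reads one number per line from the named files (or stdin) and prints
# the count, the mean and the sample standard deviation.

stats() {
    awk '
        NF { n++; sum += $1; sumsq += $1 * $1 }   # skip blank lines
        END {
            if (n < 2) { print "need at least two values"; exit 1 }
            mean = sum / n
            # Sample variance with the n - 1 divisor, as in R sd().
            var = (sumsq - n * mean * mean) / (n - 1)
            printf "n=%d mean=%.2f sd=%.2f\n", n, mean, sqrt(var)
        }
    ' "$@"
}
```

For example, printf '1\n2\n3\n' | stats prints n=3 mean=2.00 sd=1.00, and stats collate.txt should land on the same figures as summary() and sd() in R, give or take rounding.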

Summary

Scratching the surface (or perhaps barrel), well yes.  But there’s some shell scripting, some basic vi and some basic R. Not bad for free…. more later.
