The Story So Far
In previous posts I’ve covered basically loading data in Spark (with Sparkling in Clojure) and doing some half funky stuff with it. That’s all very well and a good starting point, but it’s a touch limiting. Ultimately it’s very easy to get some numbers out, crack some percentages and plot a 2d graph, Google Map or infographic.
What I want to do is something far more interesting than that (in my eyes): use some machine learning to create new things based on what we have.
With a sufficient amount of text we can do some interesting things. The nice thing about Markov Chains is that they are simple in terms of how they work.
With a corpus of text loaded we can create some fresh output text. More text, better results. A Markov Chain will randomly walk an existing lookup, built from the corpus text, and randomly select the next word to use. By looking at which words followed the previous words in the original corpus, the chain can weight what the next random word should be.
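To make the idea concrete, here is a minimal sketch of that lookup and random walk. It is my own illustration rather than the code used later in this post, and the names build-chain and walk are made up for the example. Each word maps to a vector of the words that followed it in the corpus; repeats are kept, so picking at random from the vector is already weighted by frequency.

```clojure
(require '[clojure.string :as str])

(defn build-chain
  "Map each word to a vector of the words that follow it in the corpus.
  Repeats are kept, so frequent followers are more likely to be picked."
  [lines]
  (reduce (fn [chain line]
            (let [words (str/split (str/trim line) #"\s+")]
              (reduce (fn [c [w nxt]]
                        (update c w (fnil conj []) nxt))
                      chain
                      (partition 2 1 words))))
          {}
          lines))

(defn walk
  "Start at `start` and follow random transitions until a word with no
  followers is reached or `max-words` words have been produced."
  [chain start max-words]
  (loop [word start
         out  [start]]
    (let [followers (get chain word)]
      (if (or (empty? followers) (>= (count out) max-words))
        (str/join " " out)
        (let [nxt (rand-nth followers)]
          (recur nxt (conj out nxt)))))))

;; Hypothetical three-line corpus, just to show the shape of the lookup.
(def chain (build-chain ["the cat sat" "the cat ran" "the dog sat"]))
```

With that toy corpus, (get chain "the") holds "cat" twice and "dog" once, so a walk starting at "the" goes to "cat" about two times in three.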
Examples I’ve seen have created Paul Graham startup stories and Garfield cartoons. I could create my own St Vincent song, in fact that’s what I’ll do.
How To Create New St Vincent Songs
“Jase, I think you might like this….”, said my dear friend, sound engineer and my soundscape recordist, Dez Rae. He was right. That was in 2010/2011 before rock royalty beckoned for Annie Clark (and rightly so)… I bought what I could on the spot, it was so unique.
The great thing is the variety of songs, no two come near each other and no two albums are the same.
The Corpus of Annie Clark
In a text editor I’ve copied/pasted the lyrics from the Strange Mercy album.
I spent the summer on my back Another attack Stay in just to get along Turn off the TV, wade in bed A blue and a red A little something to get along Best find a surgeon Come cut me open Dressing, undressing for the wall If mother calls She knows well we don't get along
An album full of lyrics (all copyright to Annie Clark I hasten to add), all the blank lines taken out, that’s our corpus.
Markov Chain Code In Clojure
Now I need some code to do the Markov Chain. I’m not writing it this time; someone else has done the work in Clojure far better than I could have, so I’m using his.
You can look at it here: https://gist.github.com/dbasch/8345424
Like I said, with a corpus of text loaded in the program will look at next words and create a lookup of words and scores. When I generate new sentences the next word will be governed by the lookup table and word scores. Simple.
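If you want to see what a lookup of words and scores might look like, here is a rough sketch. It is not taken from the gist; follower-counts, weighted-pick and next-word are names I have invented for illustration. The score for each candidate word is simply how many times it followed the current word in the corpus.

```clojure
(require '[clojure.string :as str])

(defn follower-counts
  "Build {word {next-word count}} from a sequence of corpus lines."
  [lines]
  (reduce (fn [table [w nxt]]
            (update-in table [w nxt] (fnil inc 0)))
          {}
          (mapcat #(partition 2 1 (str/split (str/trim %) #"\s+")) lines)))

(defn weighted-pick
  "Pick a key of `counts` with probability proportional to its count.
  `r` is a number in [0, total); pass (rand total) for a random pick."
  [counts r]
  (loop [[[word n] & more] (seq counts)
         r r]
    (if (or (< r n) (empty? more))
      word
      (recur more (- r n)))))

(defn next-word
  "Choose the word to follow `word`, or nil if nothing ever followed it."
  [table word]
  (when-let [counts (get table word)]
    (weighted-pick counts (rand (reduce + (vals counts))))))

;; Hypothetical corpus: "the" is followed by "cat" twice and "dog" once.
(def table (follower-counts ["the cat sat" "the cat ran" "the dog sat"]))
```

Passing the random number in as an argument to weighted-pick is a design choice for the sketch: it makes the weighting easy to check by hand without relying on luck.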
I’m going to loop 15 times to create a song.
(defn -main [& args]
  (let [markov (transform (lazy-lines (first args)))]
    (for [loopcount (range 15)]
      (generate-sentence markov))))
From the REPL I can run:
markov.core> (-main "/Users/jasonbell/Documents/stvincentlyrics.txt")
("Oh little one I guess it makes my mulling days, through my lesson"
 "Chloe in just to get along"
 "Your hometown is"
 "I've told whole lies"
 "Let's not a party I owe you ever really care for me?"
 "But when you ever really stare at you could take us?"
 "Chloe in the tiger"
 "My own heels"
 "Did you say it was the piles\""
 "While you"
 "Heal my clothes on"
 "But when you went off the tiger"
 "I've told whole lies"
 "Bodies, can't you can limp beside you ever really stare?"
 "Tried so they left more")
Which looks pretty neat….
Oh little one I guess it makes my mulling days, through my lesson
Chloe in just to get along
Your hometown is
I’ve told whole lies
Let’s not a party I owe you ever really care for me?
But when you ever really stare at you could take us?
Chloe in the tiger
My own heels
Did you say it was the piles
Heal my clothes on
But when you went off the tiger
I’ve told whole lies
Bodies, can’t you can limp beside you ever really stare?
Tried so they left more
It’s still copyright to Annie Clark, they’re still her words just a little more random. If I was going for a title, “My Mulling Days” would be a front runner.
I could have put all the lyrics from all the albums in and come up with a more refined lyric set, but as a test and a wee tribute to one of my favourite artists, it’s a good start.
Do We Need An Executive?
So it looks like Stormont is getting a longer break than was originally planned. Which means that NI open data is going to be thin on the ground for new MLA questions. So in the meantime let’s turn the building into a Data Centre (we could ask Arlene if INI will fund it, she’s still there, she’s managed to hold on to things….)
So I’ve got my new data centre.
With no MLAs asking questions though we want to generate some to give the impression that something is happening up there. All those potential FDI clients will want to see the powerhouse working…. If we do a good enough job we could let the Markov Chains just do the work altogether but let’s not get ahead of ourselves just yet.
Repurposing NIAssembly Spark Code
I’m going to extract the question text from the MLA questions. I’m going to use the NI Assembly Spark code (you can read part 1 and part 2 if you want to know the inner workings) and extract just the text.
mlas.core> (def members (load-members sc members-path))
#'mlas.core/members
mlas.core> (def questions (load-questions sc questions-path))
#'mlas.core/questions
mlas.core> (def mqrdd (join-members-questions members questions))
#'mlas.core/mqrdd
mlas.core>
That gives me a [key, [value, value]] set of members, each paired with the questions for that member. Now I need to map through each member, then map over each question block and extract the question text.
mlas.core> (def qtext (spark/map
                       (s-de/key-val-val-fn
                        (fn [k m qs]
                          (map (fn [question] (:questiontext question)) qs)))
                       mqrdd))
#'mlas.core/qtext
mlas.core> (spark/first qtext)
("To ask the First Minister and deputy First Minister for an update on the delivery of their Programme for Government 11/15 commitments."
 "To ask the First Minister and deputy First Minister for an update on the delivery of their Programme for Government 11/15 commitments."
 "To ask the Minister of Enterprise, Trade and Investment whether any of his departmental responsibilities have been affected by the actions of any proscribed organisations since 2011.")
mlas.core>
That’s the first element of the RDD and it has three questions. There’s a lot more…. a whole lot more.
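If the Spark destructuring is hard to picture, the same extraction can be sketched in plain Clojure with no Spark involved. The data below is made up purely to mimic the [key, [member, questions]] shape; the ids and the :name field are hypothetical, and only :questiontext comes from the real records.

```clojure
;; Hypothetical stand-in for the joined RDD: [member-id [member questions]].
(def joined
  [["m1" [{:name "Member One"}
          [{:questiontext "To ask the Minister about A."}
           {:questiontext "To ask the Minister about B."}]]]
   ["m2" [{:name "Member Two"}
          [{:questiontext "To ask the Minister about C."}]]]])

(defn question-texts
  "Pull the :questiontext out of every question for one joined record."
  [[_member-id [_member questions]]]
  (mapv :questiontext questions))

;; One collection of question strings per member, like the RDD above.
(def qtext (map question-texts joined))
```

Here (first qtext) is the two question strings for the first member, which is the same shape as the (spark/first qtext) result.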
I want to save this out as a text file which requires a bit more mapping.
mlas.core> (def textarrays (spark/collect qtext))
#'mlas.core/textarrays
mlas.core> ;; doseq rather than map: the writes are side effects, so they
           ;; should not depend on a lazy sequence being realised.
           (doseq [qs textarrays]
             (spit "/Users/jasonbell/Documents/mlaquestions.txt"
                   (apply str (interpose "\n" qs))
                   :append true))
That now gives me a large text file of MLA questions throughout history.
Jasons-Mac-mini:Documents jasonbell$ wc mlaquestions.txt
   94056 3007106 18327959 mlaquestions.txt
Jasons-Mac-mini:Documents jasonbell$
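One thing to watch, as far as I can tell: appending each member’s block with spit puts newlines between the questions inside a block but nothing between one block and the next, so the last question for one member can end up on the same line as the first question for the next. A possible alternative, sketched here with a made-up write-questions! helper, opens the file once and writes a newline after every question.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

(defn write-questions!
  "Write every question in `blocks` (a seq of seqs of strings) to `path`,
  one question per line, replacing any existing file."
  [path blocks]
  (with-open [^java.io.Writer w (io/writer path)]
    (doseq [block blocks
            q     block]
      (.write w (str q "\n")))))

;; Usage with a temporary file and placeholder questions.
(def written-lines
  (let [f (java.io.File/createTempFile "questions" ".txt")]
    (write-questions! f [["Q1" "Q2"] ["Q3"]])
    (str/split-lines (slurp f))))
```

Since the Markov code reads the corpus line by line, keeping one question per line means every generated sentence starts from the start of a real question.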
Random MLA Question Generation
With 94,000+ questions to train my Markov Chain I’m expecting some interesting results. I only want to generate one question at a time so I can remove the loop (where I was generating 15 lines for the St Vincent lyrics).
I’m going to run this from the REPL so I’m not reloading and reindexing all the text. Let’s create some MLA questions for next week.
markov.core> (def markov (transform (lazy-lines "/Users/jasonbell/Documents/mlaquestions.txt")))
#'markov.core/markov
markov.core> (generate-sentence markov)
"To ask the First Minister of Finance and deputy First Minister what steps are entitled to ensure greater weight is the reasons that no reports into the Housing Executive Gateway Reviews his Department has been allocated to outline the Minister for Social Services and to a CCEA test; and (vi) South Armagh city area."
markov.core> (generate-sentence markov)
"To ask the Ethnic Development what recruitment process used to detail, broken down by (i) who are assessed as possible help graduates in the Minister and Personnel for each spouse or not personally signed off a whole."
markov.core> (generate-sentence markov)
"To ask the cost, of Ulster in the Minister of order an organisation, broken down by Health and Learning for exemption."
markov.core> (generate-sentence markov)
"To ask the Minister of the last three years."
markov.core> (generate-sentence markov)
"To ask the First Minister what sentences would bring forward to July bonfires on the progression on planning application for rural area of Health, Social Services Directive; and location and what they are assisting these guidelines; and Leisure for Social Services and Rural Development what additional counselling, including those in 2008/09."
markov.core> (generate-sentence markov)
"To ask the First Minister and (ii) if so (ii) whether students with identities outside the number of the Employment and whether the Office of the Environment Minister."
markov.core>
To be honest that was far too much fun!
Taking It Further
If you have access to plenty of text then you can run Markov Chains to produce new content with little difficulty. For a more refined method it’s worth looking at Artificial Neural Networks, which are being used by some publishers for content creation.
All in all, to save Northern Ireland from having no news whatsoever…. well I’ve done my bit 🙂