Editor's note: SAS programming concepts in this and other Free Data Friday articles remain useful, but SAS OnDemand for Academics has replaced SAS University Edition as a free e-learning option. Hit the orange button below to start your journey with SAS OnDemand for Academics:
Access Now
One of my favourite movie franchises is the Star Wars series. Of those, The Empire Strikes Back is my favourite. It's the first movie I remember seeing as a kid in the theatre, and I was fascinated by the various characters. I’d never seen characters like that before. I adored Yoda and wanted to be his friend. Luke and Leia I wanted as aunt and uncle, and I would’ve given anything for a ride on a tauntaun. So I decided to see if I could do some basic text analytics with the script, and I actually found out some things I didn’t know!
Get the data
You can get a text-based script of the movie here.
Get started with SAS OnDemand for Academics
In this 9-minute tutorial, SAS instructor @DomWeatherspoon shows you how to get your data into SAS OnDemand for Academics and other key steps:
Get Started
Get the data ready
I copied the text from the site and pasted it into Excel. I then did a couple of additional things to clean up the data:
Filtered and deleted all blank rows
Did a TRIM on each row to remove spaces, and then did a copy/paste as values. I then deleted the original column, added a RowID and imported the data without an issue.
The results
Here’s the data as it appears in my file:
It’s good, and we could do some basic analytics, but I want to dig a little more into it and see what else we can tease out.
I found the following code in this thread, specifically the response from @slchen (which I modified to suit my data):
work.starwars2;
set work.starwars;
do i=1 by 0;
new=scan(string,i,' ');
if missing(new) then leave;
output;
i+1;
end;
keep new;
run;
Now when I run this, I get something more reasonable for analysis:
(As an aside, there are 27986 words in the ESB script, in case that ever comes up in Jeopardy!)
So let’s start playing around with this. first, let’s see how many lines each character has. This could be done with either dataset, but I find it a little faster on the new one (otherwise, SAS has to do a search using the % wildcard).
proc sql;
select
count(case when new = 'YODA:' then 1 end) as Yoda,
count(case when new = 'LUKE:' then 1 end) as Luke,
count(case when new = 'LEIA:' then 1 end) as Leia,
count(case when new = 'VADER:' then 1 end) as DarthVader,
count(case when new = 'HAN:' then 1 end) as Han
from work.starwars2;
quit;
Before scrolling down, who do you think had the most lines in the movie? I was convinced it was Luke.
Not even close! Han has more, which surprised me. I was also really surprised to find out that Yoda has a measly 36 lines – for a character to have such a profound impact on me, with hardly saying anything, is very interesting.
The other one I was curious about was how often “The Force” is referred to – for such a key piece of the Star Wars universe, I’d expect to be mentioned quite a bit.
Here’s a simple query where I’m counting the instances of the word force in the new dataset, and as a validation of the data, I select all rows where it occurs in the text; obviously, you can do a count using the original data.
proc sql;
select count(*) from work.starwars2 where new = 'Force';
select * from work.starwars where string like '%Force%';
quit;
My first query tells me that “Force” has only been used 17 times in the movie; here’s the result of the second query.
Because I provided a RowID, I know that the word “Force” is not used until line 1,680 of the script. From just doing a quick scan, Yoda does appear to use it more often.
One final query I wanted to run is to look at word frequency. Here I'm creating a new table, doing a count, and then selecting all rows from the new table sorting the Count column in descending order:
proc sql;
create table work.starwars3 as
select new, count(*) as cnt
from work.starwars2
group by new
order by new;
select *
from work.starwars3
order by cnt desc;
quit;
No surprise on the first bunch; I am surprised though that Luke is referred to in the script the same number of times as Han speaks.
I am excited to discover this functionality in SAS as I am going to be starting text analytics as part of my daily job, and having this DATA step to split the text into rows will make life a lot easier for me.
Now it’s your turn!
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
... View more