Editor's note: SAS programming concepts in this and other Free Data Friday articles remain useful, but SAS OnDemand for Academics has replaced SAS University Edition as a free e-learning option. Hit the orange button below to start your journey with SAS OnDemand for Academics:
Access Now
As a diehard Star Wars Geek, I'm always looking for ways to out-geek my friends; whether it's by useless facts (did you know John Ratzenberger, who plays Cliff Clavin on Cheers, is in Empire Strikes Back? He's the guard that tells Han not to go out looking for Luke because it's too cold out!) to having numerous posters in my office (current total is 6), I am always looking for creative ways to show my love for the franchise.
I saw a blog about text analysis of the original Star Wars scripts (using another stats package) and decided to extend my previous article to see what I could find; I presented it to the Toronto Area SAS Society, and have attached my presentation at the end of this article.
Get the Data
The data is found on a GitHub repository (here) and needs to be copied and pasted into either a TXT file or Excel.
Get Started with SAS OnDemand for Academics
In this 9-minute tutorial, SAS instructor @DomWeatherspoon shows you how to get your data into SAS OnDemand for Academics and other key steps:
Get Started
Get the data ready
I must admit, I still use Excel to do a lot of my data cleaning; it's what I've been using for over 25 years (gah!) and I can think and do a lot faster than in SAS. Having said that, some really interesting issues came up with the three files.
Episode 4: A New Hope - There are varying number of spaces between the character's name and the spoken line.
Episode 5: Empire Strikes Back - A colon separates the character name from the text; descriptive text is in brackets.
Episode 6: Return of the Jedi - Double quotes separate the columns; line numbers are included.
After cleaning the data, as well as adding in 2 dummy variables (EPISODE and ID), the data finally looked like this:
The results
One tip that I learnt from my last post on Star Wars, and have used repeatedly since, is how to take the text and split it so each word is on a separate row. This is critical for doing frequency analysis and other types of reporting.
data work.starwars2;
set work.starwars;
do i=1 by 0;
new=scan(Text, i, ' ');
if missing(new) then leave;
output;
i+1;
end;
keep new;
run;
This takes my Text column and looks for a space; every time it encounters one, it moves to the next row. Here's what I get when I run it on my data:
The first thing I am curious about is line count - who has the most number of lines overall?
proc sql;
create table work.line_count as
select character, count(*) as Count
from work.starwars
group by character
order by character;
quit;
proc sort data=work.line_count;
by descending Count;
run;
ods graphics/ reset width = 6.4in height=4.8in imagemap;
proc sgplot data=work.line_count (where=(count>10));
scatter x=character y=Count / ;
xaxis grid;
yaxis grid;
run;
ods graphics / reset;
Here's the output. What I find interesting is that there is a significant difference between Han (2nd) and Threepio (3rd).
The other question I have is around the use of the word "Force"; for those unfamiliar with the franchise, the Force is the invisible energy that permeates and exists in, through and around everything. It's able to be controlled (moving objects etc.) and other abilities to those who are trained.
proc sql;
select character,
count(case when (episode='IV') then 1 end) as A_New_Hope,
count(case when (episode='V') then 1 end) as Empire,
count(case when (episode='VI') then 1 end) as Return_of_Jedi
from work.starwars
where text like '%Force%'
group by character
order by character;
quit;
So even though The Force is a key component to the franchise, in fact it's not directly mentioned very often at all.
Now it's your turn!
Did you find something else interesting in this data? Share in the comments. I'm glad to answer any questions!
... View more