Editor's note: SAS programming concepts in this and other Free Data Friday articles remain useful, but SAS OnDemand for Academics has replaced SAS University Edition as a free e-learning option. Hit the orange button below to start your journey with SAS OnDemand for Academics:
As a diehard Star Wars Geek, I'm always looking for ways to out-geek my friends; whether it's by useless facts (did you know John Ratzenberger, who plays Cliff Clavin on Cheers, is in Empire Strikes Back? He's the guard that tells Han not to go out looking for Luke because it's too cold out!) to having numerous posters in my office (current total is 6), I am always looking for creative ways to show my love for the franchise.
I saw a blog about text analysis of the original Star Wars scripts (using another stats package) and decided to extend my previous article to see what I could find; I presented it to the Toronto Area SAS Society, and have attached my presentation at the end of this article.
The data is found on a GitHub repository (here) and needs to be copied and pasted into either a TXT file or Excel.
I must admit, I still use Excel to do a lot of my data cleaning; it's what I've been using for over 25 years (gah!) and I can think and do a lot faster than in SAS. Having said that, some really interesting issues came up with the three files.
Episode 4: A New Hope - There are varying number of spaces between the character's name and the spoken line.
Episode 5: Empire Strikes Back - A colon separates the character name from the text; descriptive text is in brackets.
Episode 6: Return of the Jedi - Double quotes separate the columns; line numbers are included.
After cleaning the data, as well as adding in 2 dummy variables (EPISODE and ID), the data finally looked like this:
One tip that I learnt from my last post on Star Wars, and have used repeatedly since, is how to take the text and split it so each word is on a separate row. This is critical for doing frequency analysis and other types of reporting.
data work.starwars2;
set work.starwars;
do i=1 by 0;
new=scan(Text, i, ' ');
if missing(new) then leave;
output;
i+1;
end;
keep new;
run;
This takes my Text column and looks for a space; every time it encounters one, it moves to the next row. Here's what I get when I run it on my data:
The first thing I am curious about is line count - who has the most number of lines overall?
proc sql;
create table work.line_count as
select character, count(*) as Count
from work.starwars
group by character
order by character;
quit;
proc sort data=work.line_count;
by descending Count;
run;
ods graphics/ reset width = 6.4in height=4.8in imagemap;
proc sgplot data=work.line_count (where=(count>10));
scatter x=character y=Count / ;
xaxis grid;
yaxis grid;
run;
ods graphics / reset;
Here's the output. What I find interesting is that there is a significant difference between Han (2nd) and Threepio (3rd).
The other question I have is around the use of the word "Force"; for those unfamiliar with the franchise, the Force is the invisible energy that permeates and exists in, through and around everything. It's able to be controlled (moving objects etc.) and other abilities to those who are trained.
proc sql;
select character,
count(case when (episode='IV') then 1 end) as A_New_Hope,
count(case when (episode='V') then 1 end) as Empire,
count(case when (episode='VI') then 1 end) as Return_of_Jedi
from work.starwars
where text like '%Force%'
group by character
order by character;
quit;
So even though The Force is a key component to the franchise, in fact it's not directly mentioned very often at all.
Did you find something else interesting in this data? Share in the comments. I'm glad to answer any questions!
I feel the enthusiastic data force in this post... Nice work @DarthPathos!
Thanks @MichelleHomes always appreciate your support 8-)
So C3P0 gets 'shortened' to Threepio? What was wrong with 3P0? echuta!
@tomrvincent Surprisingly that's how it was in the scripts (ie. not modified by me). Also from a simple length perspective C3P0 is obviously shorter and easier to type, so no idea why that decision was made. (Appreciate the echuta by the way ;-))
Chris
@tomrvincent, Threepio doesn't give me too much heartburn. That's how I pictured it when Luke shrieked, "THREEPIO!!!" into his comlink when he, Han and Leia were in the trash compactor: https://www.youtube.com/watch?v=6u3QInIMVME
@BeverlyBrown I have a book at home called "From another Perspective" (C3P0 quote) and it's short stories from side-characters, including one about the creature living in the garbage compactor and how they came to live there. Always love stuff like that!
Oh my gosh, @DarthPathos, that's gotta be a fun read!
@BeverlyBrown haven't read the whole thing but the stories I have read are a lot of fun. I've always wanted to write a fanfic about a Database Admin that works on the Death Star - I wonder if they use SAS LOL.
SAS serving the dark side... @DarthPathos I can already sense a ripple in the Force (my PR counterparts getting twitchy 😉
Actually the Death Star Analytics Dept probably uses an Open Source stats software - that's how they missed the vulnerability in the thermal exhaust that Luke used to blow it up :')
Save $250 on SAS Innovate and get a free advance copy of the new SAS For Dummies book! Use the code "SASforDummies" to register. Don't miss out, May 6-9, in Orlando, Florida.
The rapid growth of AI technologies is driving an AI skills gap and demand for AI talent. Ready to grow your AI literacy? SAS offers free ways to get started for beginners, business leaders, and analytics professionals of all skill levels. Your future self will thank you.