BookmarkSubscribeRSS Feed

Text analysis of scripts involving a galaxy far, far away

Started ‎06-28-2019 by
Modified ‎08-04-2021 by
Views 5,029

Editor's note: SAS programming concepts in this and other Free Data Friday articles remain useful, but SAS OnDemand for Academics has replaced SAS University Edition as a free e-learning option. Hit the orange button below to start your journey with SAS OnDemand for Academics:

 

Access Now

Free Data Friday goes to space.jpg

As a diehard Star Wars Geek, I'm always looking for ways to out-geek my friends; whether it's by useless facts (did you know John Ratzenberger, who plays Cliff Clavin on Cheers, is in Empire Strikes Back? He's the guard that tells Han not to go out looking for Luke because it's too cold out!) to having numerous posters in my office (current total is 6), I am always looking for creative ways to show my love for the franchise.

 

I saw a blog about text analysis of the original Star Wars scripts (using another stats package) and decided to extend my previous article to see what I could find; I presented it to the Toronto Area SAS Society, and have attached my presentation at the end of this article.

 

Get the Data

The data is found on a GitHub repository (here) and needs to be copied and pasted into either a TXT file or Excel. 

 

Get Started with SAS OnDemand for Academics

 
In this 9-minute tutorial, SAS instructor @DomWeatherspoon shows you how to get your data into SAS OnDemand for Academics and other key steps:

Get Started

 

Get the data ready

I must admit, I still use Excel to do a lot of my data cleaning; it's what I've been using for over 25 years (gah!) and I can think and do a lot faster than in SAS. Having said that, some really interesting issues came up with the three files.

 

Episode 4: A New Hope - There are varying number of spaces between the character's name and the spoken line. 

 

starwars1.PNG

Episode 5: Empire Strikes Back - A colon separates the character name from the text; descriptive text is in brackets.

starwars2.PNG

Episode 6: Return of the Jedi - Double quotes separate the columns; line numbers are included. 
starwars3.PNG 

 

After cleaning the data, as well as adding in 2 dummy variables (EPISODE and ID), the data finally looked like this:

 

starwars4a.pngFreeDataFriday_graphic.jpg

 

The results

One tip that I learnt from my last post on Star Wars, and have used repeatedly since, is how to take the text and split it so each word is on a separate row. This is critical for doing frequency analysis and other types of reporting.

 

data work.starwars2;
  set work.starwars;
  do i=1 by 0;
  new=scan(Text, i, ' ');
  if missing(new) then leave;
  output;
  i+1;
  end;
  keep new;
run;

This takes my Text column and looks for a space; every time it encounters one, it moves to the next row.  Here's what I get when I run it on my data:

 

starwars5.png 

The first thing I am curious about is line count - who has the most number of lines overall? 

proc sql; 
create table work.line_count as 
select character, count(*) as Count
from work.starwars
group by character
order by character;
quit;

proc sort data=work.line_count;
by descending Count;
run;

ods graphics/ reset width = 6.4in height=4.8in imagemap;
proc sgplot data=work.line_count (where=(count>10));
  scatter x=character y=Count / ;
  xaxis grid;
  yaxis grid;
run; 
ods graphics / reset;

 

Here's the output.  What I find interesting is that there is a significant difference between Han (2nd) and Threepio (3rd).

 

starwars6.png 

 

The other question I have is around the use of the word "Force"; for those unfamiliar with the franchise, the Force is the invisible energy that permeates and exists in, through and around everything.  It's able to be controlled (moving objects etc.) and other abilities to those who are trained.

 

proc sql;
select character, 
count(case when (episode='IV') then 1 end) as A_New_Hope, 
count(case when (episode='V') then 1 end) as Empire, 
count(case when (episode='VI') then 1 end) as Return_of_Jedi
from work.starwars
where text like '%Force%'
group by character
order by character;
quit;

So even though The Force is a key component to the franchise, in fact it's not directly mentioned very often at all. 

starwars7.png

 

Now it's your turn!

 

Did you find something else interesting in this data?  Share in the comments.  I'm glad to answer any questions!

 

 

Comments

I feel the enthusiastic data force in this post... Nice work @DarthPathos!

Thanks @MichelleHomes always appreciate your support 8-)

So C3P0 gets 'shortened' to Threepio?  What was wrong with 3P0?  echuta!

@tomrvincent Surprisingly that's how it was in the scripts (ie. not modified by me).  Also from a simple length perspective C3P0 is obviously shorter and easier to type, so no idea why that decision was made.  (Appreciate the echuta by the way ;-))

Chris

@tomrvincent, Threepio doesn't give me too much heartburn. That's how I pictured it when Luke shrieked, "THREEPIO!!!" into his comlink when he, Han and Leia were in the trash compactor: https://www.youtube.com/watch?v=6u3QInIMVME

@BeverlyBrown I have a book at home called "From another Perspective" (C3P0 quote) and it's short stories from side-characters, including one about the creature living in the garbage compactor and how they came to live there.  Always love stuff like that!

Oh my gosh, @DarthPathos, that's gotta be a fun read! 

@BeverlyBrown haven't read the whole thing but the stories I have read are a lot of fun.  I've always wanted to write a fanfic about a Database Admin that works on the Death Star - I wonder if they use SAS LOL.  

SAS serving the dark side... @DarthPathos I can already sense a ripple in the Force (my PR counterparts getting twitchy 😉 

Actually the Death Star Analytics Dept probably uses an Open Source stats software - that's how they missed the vulnerability in the thermal exhaust that Luke used to blow it up :')

Version history
Last update:
‎08-04-2021 08:51 AM
Updated by:

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!

Free course: Data Literacy Essentials

Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning  and boost your career prospects.

Get Started

Article Tags