We’re smarter together. Learn from this collection of community knowledge and add your expertise.

How I used SAS University Edition to analyse a Star Wars script

by Regular Contributor 2 weeks ago (1,060 Views)

One of my favourite movie franchises is the Star Wars series. Of those, The Empire Strikes Back is my favourite. It's the first movie I remember seeing as a kid in the theatre, and I was fascinated by the various characters. I’d never seen characters like that before. I adored Yoda and wanted to be his friend. Luke and Leia I wanted as aunt and uncle, and I would’ve given anything for a ride on a tauntaun. So I decided to see if I could do some basic text analytics with the script, and I actually found out some things I didn’t know!

 

Get the Data

 

FreeDataFriday_graphic.jpgYou can get a text-based script of the movie here

 

Get SAS University Edition

If you don’t already have University Edition, get it here and follow the instructions from the pdf carefully. If you need help with almost any aspect of using University Edition, check out these video tutorials. Additional resources are available in this article.

 

Get the data ready

I copied the text from the site and pasted it into Excel. I then did a couple of additional things to clean up the data:

  1. Filtered and deleted all blank rows
  2. Did a TRIM on each row to remove spaces, and then did a copy/paste as values. I then deleted the original column, added a RowID and imported the data without an issue.

The Results

Here’s the data as it appears in my file:

 

1.png

 

It’s good, and we could do some basic analytics, but I want to dig a little more into it and see what else we can tease out.

 

I found the following code in this thread, specifically the response from @slchen (which I modified to suit my data):

 

work.starwars2;
   set work.starwars;
   do i=1 by 0;
   new=scan(string,i,' ');
   if missing(new) then leave;
   output;
   i+1;
   end;
   keep new;
run;

 

 

Now when I run this, I get something more reasonable for analysis:

 

2.png

 

(As an aside, there are 27986 words in the ESB script, in case that ever comes up in Jeopardy!)

 

So let’s start playing around with this. first, let’s see how many lines each character has. This could be done with either dataset, but I find it a little faster on the new one (otherwise, SAS has to do a search using the % wildcard).

 

proc sql;
select
count(case when new = 'YODA:' then 1 end) as Yoda,
count(case when new = 'LUKE:' then 1 end) as Luke,
count(case when new = 'LEIA:' then 1 end) as Leia,
count(case when new = 'VADER:' then 1 end) as DarthVader,
count(case when new = 'HAN:' then 1 end) as Han
from work.starwars2;
quit;

 

 

Before scrolling down, who do you think had the most lines in the movie? I was convinced it was Luke.

3.PNG

 

Not even close! Han has more, which surprised me. I was also really surprised to find out that Yoda has a measly 36 lines – for a character to have such a profound impact on me, with hardly saying anything, is very interesting. 

 

The other one I was curious about was how often “The Force” is referred to – for such a key piece of the Star Wars universe, I’d expect to be mentioned quite a bit.

 

Here’s a simple query where I’m counting the instances of the word force in the new dataset, and as a validation of the data, I select all rows where it occurs in the text; obviously, you can do a count using the original data.

 

proc sql;
select count(*) from work.starwars2 where new = 'Force';
select * from work.starwars where string like '%Force%';
quit;

 

 

My first query tells me that “Force” has only been used 17 times in the movie; here’s the result of the second query.

 

4.png

 

Because I provided a RowID, I know that the word “Force” is not used until line 1,680 of the script. From just doing a quick scan, Yoda does appear to use it more often.

 

One final query I wanted to run is to look at word frequency. Here I'm creating a new table, doing a count, and then selecting all rows from the new table sorting the Count column in descending order:

 

proc sql;

create table work.starwars3 as

select new, count(*) as cnt

from work.starwars2

group by new

order by new;

select *

from work.starwars3

order by cnt desc;

quit;

 

 

5.PNG

 

No surprise on the first bunch; I am surprised though that Luke is referred to in the script the same number of times as Han speaks. 

 

I am excited to discover this functionality in SAS as I am going to be starting text analytics as part of my daily job, and having this DATA step to split the text into rows will make life a lot easier for me.

 

Now it’s your turn!

 

Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.

 

Need data for learning?

 

The SAS Communities Library has a growing supply of free data sources that you can use in your training to become a data scientist. The easiest way to find articles about data sources is to type "Data for learning" in the communities site search field like so:

 

9.png

 We publish all articles about free data sources under the Analytics U label in the SAS Communities Library. Want email notifications when we add new content? Subscribe to the Analytics U label by clicking "Find A Community" in the right nav and selecting SAS Communities Library at the bottom of the list. In the Labels box in the right nav, click Analytics U:

 

10.png

Click Analytics U, then select "Subscribe" from the Options menu.

 

Happy Learning!

 

Comments
by Super User
2 weeks ago

Why use Excel at all? Download the file and tell SAS to read the text directly as well as strip the blanks, remove blank lines and add the row counter. Of course the filename statement below needs to match your operating environment such as pointing to /myfolders and such:

filename esb "d:\data\junk\esb.txt" encoding=utf8;

data work.esbraw;
   infile esb lrecl=1000;
   informat line $1000.;
   input;
   line=_infile_;
   line=strip(line);
   if missing(line) then delete;
   RowId+1;
run;

 

And of course one of the fun things with SAS is there are often many ways to get to same place. You might try alternates to see performance. For instance instead of Proc Sql to create work.starwars3 Proc Freq would be one other option:

proc freq data=work.starwars2 noprint order=freq;
   tables new /out= work.starwars3;
run;

Which will also have a percent of word count.

 

 

Since the scene is prefaced with an all capitol letter line starting with either INTERIOR: or EXTERIOR: we can count how many scenes and of which type:

data scenes;
   set work.starwars ;

   Scene= (scan(string,1) in ('INTERIOR:','EXTERIOR:'));
   if Scene=1 then SceneType= scan(string,1);
   if Scene=1; /*keep only the records with a scene header*/
run;

proc freq data=scenes;

  tables SceneType;
run;

For the interested reader: if it isn't a scene or title then the lines in all capitol letters are credits. So you could look for how many stars, co-stars and crew members are mentioned.

 

by Regular Contributor
2 weeks ago

@ballardw Wow.  Didn't think of that; I've had friends run into issues importing TXT files, so didn't want to risk running into problems, but that would've been a much more efficient process.  I admit I spent the bulk of my time writing this figuring out the data step; I thought I could use Regular Expressions and ended up going down a rabbit hole of very confusing syntax that got me nowhere. 

 

I wish I could write this things as a full-time job as it would allow me to really dig in to the topic; but then that's why people like yourself are great, you supplement with such great info. 

 

have a good weekend!

Chris

by Trusted Advisor
2 weeks ago

And for some further Star Wars SAS analysis check out @themaryosborne's text analytics blog post and SAS Global Forum paper - https://blogs.sas.com/content/sascom/2016/05/04/may-the-4th-be-with-you-2016-edition/

by Super Contributor
2 weeks ago

This is a really interesting analysis - one improvement I could suggest is finding a way to exclude prepositions from your word frequency count as they're always going to be the most common words in any lengthy text and don't really add anything to our understanding.

by Regular Contributor
2 weeks ago

@MichelleHomes Shall check out tomorrow, thanks so much!  Text Analytics has been something I've fiddled with but have been intimidated by the complexity - now that I have this code, it'll allow me a good starting point.

 

@ChrisBrooks thanks!  Totally agree - I admit I was rushing to get it posted (did it at work, needed to get my day started :-)).  The other thing i realized I should do is strip out commas, periods, etc. as well as put words starting in caps to all lower case (I'd keep the YODA: etc that indicate a character's lines, but Yoda etc. would be standardized to yoda).

 

Appreciate your feedback and thanks for reading!

Chris

Your turn
Sign In!

Want to write an article? Sign in with your profile.