One of my favourite movie franchises is the Star Wars series. Of those, The Empire Strikes Back is my favourite. It's the first movie I remember seeing as a kid in the theatre, and I was fascinated by the various characters. I’d never seen characters like that before. I adored Yoda and wanted to be his friend. Luke and Leia I wanted as aunt and uncle, and I would’ve given anything for a ride on a tauntaun. So I decided to see if I could do some basic text analytics with the script, and I actually found out some things I didn’t know!
You can get a text-based script of the movie here.
If you don’t already have University Edition, get it here and follow the instructions from the pdf carefully. If you need help with almost any aspect of using University Edition, check out these video tutorials. Additional resources are available in this article.
I copied the text from the site and pasted it into Excel. I then did a couple of additional things to clean up the data:
Here’s the data as it appears in my file:
It’s good, and we could do some basic analytics, but I want to dig a little more into it and see what else we can tease out.
work.starwars2; set work.starwars; do i=1 by 0; new=scan(string,i,' '); if missing(new) then leave; output; i+1; end; keep new; run;
Now when I run this, I get something more reasonable for analysis:
(As an aside, there are 27986 words in the ESB script, in case that ever comes up in Jeopardy!)
So let’s start playing around with this. first, let’s see how many lines each character has. This could be done with either dataset, but I find it a little faster on the new one (otherwise, SAS has to do a search using the % wildcard).
proc sql; select count(case when new = 'YODA:' then 1 end) as Yoda, count(case when new = 'LUKE:' then 1 end) as Luke, count(case when new = 'LEIA:' then 1 end) as Leia, count(case when new = 'VADER:' then 1 end) as DarthVader, count(case when new = 'HAN:' then 1 end) as Han from work.starwars2; quit;
Before scrolling down, who do you think had the most lines in the movie? I was convinced it was Luke.
Not even close! Han has more, which surprised me. I was also really surprised to find out that Yoda has a measly 36 lines – for a character to have such a profound impact on me, with hardly saying anything, is very interesting.
The other one I was curious about was how often “The Force” is referred to – for such a key piece of the Star Wars universe, I’d expect to be mentioned quite a bit.
Here’s a simple query where I’m counting the instances of the word force in the new dataset, and as a validation of the data, I select all rows where it occurs in the text; obviously, you can do a count using the original data.
proc sql; select count(*) from work.starwars2 where new = 'Force'; select * from work.starwars where string like '%Force%'; quit;
My first query tells me that “Force” has only been used 17 times in the movie; here’s the result of the second query.
Because I provided a RowID, I know that the word “Force” is not used until line 1,680 of the script. From just doing a quick scan, Yoda does appear to use it more often.
One final query I wanted to run is to look at word frequency. Here I'm creating a new table, doing a count, and then selecting all rows from the new table sorting the Count column in descending order:
create table work.starwars3 as
select new, count(*) as cnt
group by new
order by new;
order by cnt desc;
No surprise on the first bunch; I am surprised though that Luke is referred to in the script the same number of times as Han speaks.
I am excited to discover this functionality in SAS as I am going to be starting text analytics as part of my daily job, and having this DATA step to split the text into rows will make life a lot easier for me.
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
The SAS Communities Library has a growing supply of free data sources that you can use in your training to become a data scientist. The easiest way to find articles about data sources is to type "Data for learning" in the communities site search field like so:
We publish all articles about free data sources under the Analytics U label in the SAS Communities Library. Want email notifications when we add new content? Subscribe to the Analytics U label by clicking "Find A Community" in the right nav and selecting SAS Communities Library at the bottom of the list. In the Labels box in the right nav, click Analytics U:
Click Analytics U, then select "Subscribe" from the Options menu.