Editor's note: SAS programming concepts in this and other Free Data Friday articles remain useful, but SAS OnDemand for Academics has replaced SAS University Edition as a free e-learning option. Hit the orange button below to start your journey with SAS OnDemand for Academics:
One of my favourite movie franchises is the Star Wars series. Of those, The Empire Strikes Back is my favourite. It's the first movie I remember seeing as a kid in the theatre, and I was fascinated by the various characters. I’d never seen characters like that before. I adored Yoda and wanted to be his friend. Luke and Leia I wanted as aunt and uncle, and I would’ve given anything for a ride on a tauntaun. So I decided to see if I could do some basic text analytics with the script, and I actually found out some things I didn’t know!
You can get a text-based script of the movie here.
I copied the text from the site and pasted it into Excel. I then did a couple of additional things to clean up the data:
Here’s the data as it appears in my file:
It’s good, and we could do some basic analytics, but I want to dig a little more into it and see what else we can tease out.
I found the following code in this thread, specifically the response from @slchen (which I modified to suit my data):
work.starwars2;
set work.starwars;
do i=1 by 0;
new=scan(string,i,' ');
if missing(new) then leave;
output;
i+1;
end;
keep new;
run;
Now when I run this, I get something more reasonable for analysis:
(As an aside, there are 27986 words in the ESB script, in case that ever comes up in Jeopardy!)
So let’s start playing around with this. first, let’s see how many lines each character has. This could be done with either dataset, but I find it a little faster on the new one (otherwise, SAS has to do a search using the % wildcard).
proc sql;
select
count(case when new = 'YODA:' then 1 end) as Yoda,
count(case when new = 'LUKE:' then 1 end) as Luke,
count(case when new = 'LEIA:' then 1 end) as Leia,
count(case when new = 'VADER:' then 1 end) as DarthVader,
count(case when new = 'HAN:' then 1 end) as Han
from work.starwars2;
quit;
Before scrolling down, who do you think had the most lines in the movie? I was convinced it was Luke.
Not even close! Han has more, which surprised me. I was also really surprised to find out that Yoda has a measly 36 lines – for a character to have such a profound impact on me, with hardly saying anything, is very interesting.
The other one I was curious about was how often “The Force” is referred to – for such a key piece of the Star Wars universe, I’d expect to be mentioned quite a bit.
Here’s a simple query where I’m counting the instances of the word force in the new dataset, and as a validation of the data, I select all rows where it occurs in the text; obviously, you can do a count using the original data.
proc sql;
select count(*) from work.starwars2 where new = 'Force';
select * from work.starwars where string like '%Force%';
quit;
My first query tells me that “Force” has only been used 17 times in the movie; here’s the result of the second query.
Because I provided a RowID, I know that the word “Force” is not used until line 1,680 of the script. From just doing a quick scan, Yoda does appear to use it more often.
One final query I wanted to run is to look at word frequency. Here I'm creating a new table, doing a count, and then selecting all rows from the new table sorting the Count column in descending order:
proc sql;
create table work.starwars3 as
select new, count(*) as cnt
from work.starwars2
group by new
order by new;
select *
from work.starwars3
order by cnt desc;
quit;
No surprise on the first bunch; I am surprised though that Luke is referred to in the script the same number of times as Han speaks.
I am excited to discover this functionality in SAS as I am going to be starting text analytics as part of my daily job, and having this DATA step to split the text into rows will make life a lot easier for me.
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
Why use Excel at all? Download the file and tell SAS to read the text directly as well as strip the blanks, remove blank lines and add the row counter. Of course the filename statement below needs to match your operating environment such as pointing to /myfolders and such:
filename esb "d:\data\junk\esb.txt" encoding=utf8; data work.esbraw; infile esb lrecl=1000; informat line $1000.; input; line=_infile_; line=strip(line); if missing(line) then delete; RowId+1; run;
And of course one of the fun things with SAS is there are often many ways to get to same place. You might try alternates to see performance. For instance instead of Proc Sql to create work.starwars3 Proc Freq would be one other option:
proc freq data=work.starwars2 noprint order=freq; tables new /out= work.starwars3; run;
Which will also have a percent of word count.
Since the scene is prefaced with an all capitol letter line starting with either INTERIOR: or EXTERIOR: we can count how many scenes and of which type:
data scenes; set work.starwars ; Scene= (scan(string,1) in ('INTERIOR:','EXTERIOR:')); if Scene=1 then SceneType= scan(string,1); if Scene=1; /*keep only the records with a scene header*/ run; proc freq data=scenes; tables SceneType; run;
For the interested reader: if it isn't a scene or title then the lines in all capitol letters are credits. So you could look for how many stars, co-stars and crew members are mentioned.
@ballardw Wow. Didn't think of that; I've had friends run into issues importing TXT files, so didn't want to risk running into problems, but that would've been a much more efficient process. I admit I spent the bulk of my time writing this figuring out the data step; I thought I could use Regular Expressions and ended up going down a rabbit hole of very confusing syntax that got me nowhere.
I wish I could write this things as a full-time job as it would allow me to really dig in to the topic; but then that's why people like yourself are great, you supplement with such great info.
have a good weekend!
Chris
And for some further Star Wars SAS analysis check out @themaryosborne's text analytics blog post and SAS Global Forum paper - https://blogs.sas.com/content/sascom/2016/05/04/may-the-4th-be-with-you-2016-edition/
This is a really interesting analysis - one improvement I could suggest is finding a way to exclude prepositions from your word frequency count as they're always going to be the most common words in any lengthy text and don't really add anything to our understanding.
@MichelleHomes Shall check out tomorrow, thanks so much! Text Analytics has been something I've fiddled with but have been intimidated by the complexity - now that I have this code, it'll allow me a good starting point.
@ChrisBrooks thanks! Totally agree - I admit I was rushing to get it posted (did it at work, needed to get my day started :-)). The other thing i realized I should do is strip out commas, periods, etc. as well as put words starting in caps to all lower case (I'd keep the YODA: etc that indicate a character's lines, but Yoda etc. would be standardized to yoda).
Appreciate your feedback and thanks for reading!
Chris
Registration is open! SAS is returning to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team. Register for just $495 by 12/31/2023.
If you are interested in speaking, there is still time to submit a session idea. More details are posted on the website.
Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning and boost your career prospects.