Editor's note: SAS programming concepts in this and other Free Data Friday articles remain useful, but SAS OnDemand for Academics has replaced SAS University Edition as a free e-learning option. Hit the orange button below to start your journey with SAS OnDemand for Academics:
One of my favourite movie franchises is the Star Wars series. Of those, The Empire Strikes Back is my favourite. It's the first movie I remember seeing as a kid in the theatre, and I was fascinated by the various characters. I’d never seen characters like that before. I adored Yoda and wanted to be his friend. Luke and Leia I wanted as aunt and uncle, and I would’ve given anything for a ride on a tauntaun. So I decided to see if I could do some basic text analytics with the script, and I actually found out some things I didn’t know!
You can get a text-based script of the movie here.
I copied the text from the site and pasted it into Excel. I then did a couple of additional things to clean up the data:
Here’s the data as it appears in my file:
It’s good, and we could do some basic analytics, but I want to dig a little more into it and see what else we can tease out.
work.starwars2; set work.starwars; do i=1 by 0; new=scan(string,i,' '); if missing(new) then leave; output; i+1; end; keep new; run;
Now when I run this, I get something more reasonable for analysis:
(As an aside, there are 27986 words in the ESB script, in case that ever comes up in Jeopardy!)
So let’s start playing around with this. first, let’s see how many lines each character has. This could be done with either dataset, but I find it a little faster on the new one (otherwise, SAS has to do a search using the % wildcard).
proc sql; select count(case when new = 'YODA:' then 1 end) as Yoda, count(case when new = 'LUKE:' then 1 end) as Luke, count(case when new = 'LEIA:' then 1 end) as Leia, count(case when new = 'VADER:' then 1 end) as DarthVader, count(case when new = 'HAN:' then 1 end) as Han from work.starwars2; quit;
Before scrolling down, who do you think had the most lines in the movie? I was convinced it was Luke.
Not even close! Han has more, which surprised me. I was also really surprised to find out that Yoda has a measly 36 lines – for a character to have such a profound impact on me, with hardly saying anything, is very interesting.
The other one I was curious about was how often “The Force” is referred to – for such a key piece of the Star Wars universe, I’d expect to be mentioned quite a bit.
Here’s a simple query where I’m counting the instances of the word force in the new dataset, and as a validation of the data, I select all rows where it occurs in the text; obviously, you can do a count using the original data.
proc sql; select count(*) from work.starwars2 where new = 'Force'; select * from work.starwars where string like '%Force%'; quit;
My first query tells me that “Force” has only been used 17 times in the movie; here’s the result of the second query.
Because I provided a RowID, I know that the word “Force” is not used until line 1,680 of the script. From just doing a quick scan, Yoda does appear to use it more often.
One final query I wanted to run is to look at word frequency. Here I'm creating a new table, doing a count, and then selecting all rows from the new table sorting the Count column in descending order:
create table work.starwars3 as
select new, count(*) as cnt
group by new
order by new;
order by cnt desc;
No surprise on the first bunch; I am surprised though that Luke is referred to in the script the same number of times as Han speaks.
I am excited to discover this functionality in SAS as I am going to be starting text analytics as part of my daily job, and having this DATA step to split the text into rows will make life a lot easier for me.
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
Registration is open! SAS is returning to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team. Register for just $495 by 12/31/2023.
If you are interested in speaking, there is still time to submit a session idea. More details are posted on the website.
Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning and boost your career prospects.