Editor's note: SAS programming concepts in this and other Free Data Friday articles remain useful, but SAS OnDemand for Academics has replaced SAS University Edition as a free e-learning option. Hit the orange button below to start your journey with SAS OnDemand for Academics:
Enough about the US elections. We spent three weeks poring over data related directly or tangentially to the most contentious American election in recent memory. Let's escape to the movies. IMDB is great for showing you some complex querying
techniques.
Get the Data
I got the data from one of my favourite data repositories - Kaggle. Data was already in a CSV file and imported to SAS University Edition without issue.
Getting the data ready
The data was already in a format that I could use, and there were no missing or clearly incorrect data.
The Results
PROC SQL is my favourite method to query and review data, and allows for fairly complex functionality. One of the things I used to struggle with was pulling data based on a summary statistic that changes – for example, how to pull all the costs that are above the average, when the average changes every day / hour / minute? Sure, you can put the data into a temp table and re-query it, but that’s resource heavy. You can try doing something like “select * from table_A where cost > avg(cost)” but then you’ll get this error:
ERROR: Summary functions are restricted to the SELECT and HAVING clauses only.
Hmmm. OK, so the other option is to do something along the lines of:
Proc sql;
Select movie_title, cost-avg(cost)
From work.import
Group by movie_title
Having cost-avg(cost)>0
Order by movie_title;
Quit;
This would work but is very clunky and if you’re dealing with a massive amount of data, would be inefficient (SQL evaluates the WHERE clause first, so in this case it’s running the cost-avg(cost) on all your rows).
So, is there an easier way of doing this? Of course! The simple examples I show here do not do this justice, but does provide you groundwork for more complex queries. I have three different queries, all doing the same basic function but with different variables.
proc sql;
title 'Maximum Duration';
select movie_title, duration, actor_1_name,
title_year, imdb_score
from work.import
where duration =
(select max(duration) from work.import);
title 'Maximum Budget';
select movie_title, budget format=dollar20.,
actor_1_name, title_year, imdb_score
from work.import
where budget =
(select max(budget) from work.import);
title 'Maximum Most Cast Facebook Likes';
select movie_title, cast_total_facebook_likes,
actor_1_name, title_year, imdb_score
from work.import
where cast_total_facebook_likes =
(select max(cast_total_facebook_likes) from work.import);
quit;
One side note – the TITLE above the SELECT statement is a nice easy way to keep track of your output; if you’re running 10 or more SQL queries and one doesn’t return any data, knowing which one it is can be hard. Putting in titles makes it easy for tracing back to the code. You’ll also note I had to use the format= statement for the budget column in the second query. This just adds the $ and the commas in the right places.
Here’s the output:
First off, there’s an 8 ½ hour long movie called "Trapped"? I bet you feel trapped watching it! (I am of course now curious and will be diligently trying to track it down!) The other thing that jumps out is the budget for "The Host" - $12,215,500,000? Yikes! Having seen this movie, the only thing I can think of is that because this movie was filmed in Korea, the exchange rate must have somehow come into play.
Although a relatively simple example, you can use any of the SQL functions in the “nested” query. Here’s an example of using the AVG to find the top 10 movies that are above the overall average. The AVG_DOLLAR column I have included is simply for show, I wouldn’t include it as an actual column.
proc sql outobs=10;
select movie_title, budget format=dollar16., avg(budget) format=dollar16. as Avg_Dollar
from work.import
where budget> (select avg(budget) from work.import)
order by budget desc;
quit;
Here’s the output:
Most of these movies are from Asia (and I highly recommend "The Host," "Princess Mononoke" and "Akira!") Again, the exchange rate may explain the exhorbitant budgets.
Now it’s your turn!
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
Nice illustration of Proc SQL subqueries.
I think I can explain the strangely high budget figures - the costs in the file you've used are not all in US $ - for example if you check the IMDB site itself the cost of The Host is given in South Korean Won and Princess Mononoke is given in Japanese Yen http://www.imdb.com/title/tt0468492/?ref_=fn_al_tt_2 and http://www.imdb.com/title/tt0119698/?ref_=nv_sr_1
Hey @ChrisBrooks thanks for the clarifying - I wasn't sure if it was conversion or reporting in Won or USD. Appreciate your time and reading my article 🙂
Have a great weekend and happy coding!
Chris
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9.
Lock in the best rate now before the price increases on April 1.
Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning and boost your career prospects.