BookmarkSubscribeRSS Feed

A Look at IMDB Data Using Nested Queries in SAS

Started ‎11-18-2016 by
Modified ‎08-04-2021 by
Views 9,560

Editor's note: SAS programming concepts in this and other Free Data Friday articles remain useful, but SAS OnDemand for Academics has replaced SAS University Edition as a free e-learning option. Hit the orange button below to start your journey with SAS OnDemand for Academics:

 

Access Now

 

Enough about the US elections. We spent three weeks poring over data related directly or tangentially to the most contentious American election in recent memory. Let's escape to the movies. IMDB is great for showing you some complex querying SAS University Edition goes to the movies.jpg
techniques.

 

Get the Data

I got the data from one of my favourite data repositories - Kaggle.  Data was already in a CSV file and imported to SAS University Edition without issue.

 

 

Get Started with SAS OnDemand for Academics

 
In this 9-minute tutorial, SAS instructor @DomWeatherspoon shows you how to get your data into SAS OnDemand for Academics and other key steps:

Get Started

 

 

GetFDF Tile.PNGting the data ready

The data was already in a format that I could use, and there were no missing or clearly incorrect data.

 

The Results

PROC SQL is my favourite method to query and review data, and allows for fairly complex functionality.  One of the things I used to struggle with was pulling data based on a summary statistic that changes – for example, how to pull all the costs that are above the average, when the average changes every day / hour / minute?  Sure, you can put the data into a temp table and re-query it, but that’s resource heavy.  You can try doing something like “select * from table_A where cost > avg(cost)” but then you’ll get this error:

 

ERROR: Summary functions are restricted to the SELECT and HAVING clauses only.

 

 

Hmmm. OK, so the other option is to do something along the lines of:

 

Proc sql;
Select movie_title, cost-avg(cost)
From work.import
Group by movie_title
Having cost-avg(cost)>0
Order by movie_title;
Quit;

 

This would work but is very clunky and if you’re dealing with a massive amount of data, would be inefficient (SQL evaluates the WHERE clause first, so in this case it’s running the cost-avg(cost) on all your rows).

 

So, is there an easier way of doing this?  Of course!  The simple examples I show here do not do this justice, but does provide you groundwork for more complex queries.  I have three different queries, all doing the same basic function but with different variables.

 

proc sql;
title 'Maximum Duration';
select movie_title, duration, actor_1_name, 
            title_year, imdb_score
from work.import
where duration = 
            (select max(duration) from work.import);
 
title 'Maximum Budget';
select  movie_title,  budget format=dollar20., 
            actor_1_name, title_year, imdb_score
from work.import
where budget = 
            (select max(budget) from work.import);
 
title 'Maximum Most Cast Facebook Likes';
select  movie_title, cast_total_facebook_likes, 
            actor_1_name, title_year, imdb_score
from work.import
where cast_total_facebook_likes = 
            (select max(cast_total_facebook_likes) from work.import);
quit;

 

One side note – the TITLE above the SELECT statement is a nice easy way to keep track of your output; if you’re running 10 or more SQL queries and one doesn’t return any data, knowing which one it is can be hard.  Putting in titles makes it easy for tracing back to the code.  You’ll also note I had to use the format= statement for the budget column in the second query.  This just adds the $ and the commas in the right places.

 

Here’s the output:

image 1.png

 

First off, there’s an 8 ½ hour long movie called "Trapped"?  I bet you feel trapped watching it! (I am of course now curious and will be diligently trying to track it down!)  The other thing that jumps out is the budget for "The Host" - $12,215,500,000?  Yikes!  Having seen this movie, the only thing I can think of is that because this movie was filmed in Korea, the exchange rate must have somehow come into play. 

 

Although a relatively simple example, you can use any of the SQL functions in the “nested” query.  Here’s an example of using the AVG to find the top 10 movies that are above the overall average.  The AVG_DOLLAR column I have included is simply for show, I wouldn’t include it as an actual column.

 

proc sql outobs=10;
select movie_title, budget format=dollar16., avg(budget) format=dollar16. as Avg_Dollar
from work.import
where budget> (select avg(budget) from work.import)
order by budget desc;
quit;

 

Here’s the output:

 image2.png

 

Most of these movies are from Asia (and I highly recommend "The Host," "Princess Mononoke" and "Akira!") Again, the exchange rate may explain the exhorbitant budgets.

 

Now it’s your turn!

 

Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.

 

 

Comments

Nice illustration of Proc SQL subqueries.

 

I think I can explain the strangely high budget figures - the costs in the file you've used are not all in US $ - for example if you check the IMDB site itself the cost of The Host is given in South Korean Won and Princess Mononoke is given in Japanese Yen http://www.imdb.com/title/tt0468492/?ref_=fn_al_tt_2 and http://www.imdb.com/title/tt0119698/?ref_=nv_sr_1

Hey @ChrisBrooks thanks for the clarifying - I wasn't sure if it was conversion or reporting in Won or USD.  Appreciate your time and reading my article 🙂

 

Have a great weekend and happy coding!

Chris

Version history
Last update:
‎08-04-2021 07:22 AM
Updated by:

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

Free course: Data Literacy Essentials

Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning  and boost your career prospects.

Get Started

Article Tags