We’re smarter together. Learn from this collection of community knowledge and add your expertise.

A Look at IMDB Data Using Nested Queries in SAS University Edition

by Super Contributor on ‎11-18-2016 12:45 PM (425 Views)

Enough about the US elections. We spent three weeks poring over data related directly or tangentially to the most contentious American election in recent memory. Let's escape to the movies. IMDB is great for showing you some complex querying SAS University Edition goes to the movies.jpg
techniques.

 

Get the Data

I got the data from one of my favourite data repositories - Kaggle.  Data was already in a CSV file and imported to SAS University Edition without issue.

 

How to go about getting SAS University Edition

If you don’t already have University Edition, get it here and follow the instructions from the pdf carefully. If you need help with almost any aspect of using University Edition, check out these video tutorials. Additional resources are available in this article.

 

GetFDF Tile.PNGting the data ready

The data was already in a format that I could use, and there were no missing or clearly incorrect data.

 

The Results

PROC SQL is my favourite method to query and review data, and allows for fairly complex functionality.  One of the things I used to struggle with was pulling data
based on a summary statistic that changes – for example, how to pull all the costs that are above the average, when the average changes every day / hour / minute?  Sure, you can put the data into a temp table and re-query it, but that’s resource heavy.  You can try doing something like “select * from table_A where cost > avg(cost)” but then you’ll get this error:

 

ERROR: Summary functions are restricted to the SELECT and HAVING clauses only.

 

 

Hmmm. OK, so the other option is to do something along the lines of:

 

Proc sql;
Select movie_title, cost-avg(cost)
From work.import
Group by movie_title
Having cost-avg(cost)>0
Order by movie_title;
Quit;

 

This would work but is very clunky and if you’re dealing with a massive amount of data, would be inefficient (SQL evaluates the WHERE clause first, so in this case it’s running the cost-avg(cost) on all your rows).

 

So, is there an easier way of doing this?  Of course!  The simple examples I show here do not do this justice, but does provide you groundwork for more complex queries.  I have three different queries, all doing the same basic function but with different variables.

 

proc sql;
title 'Maximum Duration';
select movie_title, duration, actor_1_name, 
            title_year, imdb_score
from work.import
where duration = 
            (select max(duration) from work.import);
 
title 'Maximum Budget';
select  movie_title,  budget format=dollar20., 
            actor_1_name, title_year, imdb_score
from work.import
where budget = 
            (select max(budget) from work.import);
 
title 'Maximum Most Cast Facebook Likes';
select  movie_title, cast_total_facebook_likes, 
            actor_1_name, title_year, imdb_score
from work.import
where cast_total_facebook_likes = 
            (select max(cast_total_facebook_likes) from work.import);
quit;

 

One side note – the TITLE above the SELECT statement is a nice easy way to keep track of your output; if you’re running 10 or more SQL queries and one doesn’t return any data, knowing which one it is can be hard.  Putting in titles makes it easy for tracing back to the code.  You’ll also note I had to use the format= statement for the budget column in the second query.  This just adds the $ and the commas in the right places.

 

Here’s the output:

image 1.png

 

First off, there’s an 8 ½ hour long movie called "Trapped"?  I bet you feel trapped watching it! (I am of course now curious and will be diligently trying to track it down!)  The other thing that jumps out is the budget for "The Host" - $12,215,500,000?  Yikes!  Having seen this movie, the only thing I can think of is that because this movie was filmed in Korea, the exchange rate must have somehow come into play. 

 

Although a relatively simple example, you can use any of the SQL functions in the “nested” query.  Here’s an example of using the AVG to find the top 10 movies that are above the overall average.  The AVG_DOLLAR column I have included is simply for show, I wouldn’t include it as an actual column.

 

proc sql outobs=10;
select movie_title, budget format=dollar16., avg(budget) format=dollar16. as Avg_Dollar
from work.import
where budget> (select avg(budget) from work.import)
order by budget desc;
quit;

 

Here’s the output:

 image2.png

 

Most of these movies are from Asia (and I highly recommend "The Host," "Princess Mononoke" and "Akira!") Again, the exchange rate may explain the exhorbitant budgets.

 

Now it’s your turn!

 

Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.

 

Need data for learning?

 

The SAS Communities Library has a growing supply of free data sources that you can use in your training to become a data scientist. The easiest way to find articles about data sources is to type "Data for learning" in the communities site search field like so:

 

Image11.png

We publish all articles about free data sources under the Analytics U label in the SAS Communities Library. Want email notifications when we add new content? Subscribe to the Analytics U label by clicking "Find A Community" in the right nav and selecting SAS Communities Library at the bottom of the list. In the Labels box in the right nav, click Analytics U:

 IMAGE12.png

 

Click Analytics U, then select "Subscribe" from the Options menu.

 

Happy Learning!

Comments
by Super Contributor
on ‎11-18-2016 07:21 PM

Nice illustration of Proc SQL subqueries.

 

I think I can explain the strangely high budget figures - the costs in the file you've used are not all in US $ - for example if you check the IMDB site itself the cost of The Host is given in South Korean Won and Princess Mononoke is given in Japanese Yen http://www.imdb.com/title/tt0468492/?ref_=fn_al_tt_2 and http://www.imdb.com/title/tt0119698/?ref_=nv_sr_1

by Super Contributor
on ‎11-18-2016 09:41 PM

Hey @ChrisBrooks thanks for the clarifying - I wasn't sure if it was conversion or reporting in Won or USD.  Appreciate your time and reading my article :-)

 

Have a great weekend and happy coding!

Chris

Your turn
Sign In!

Want to write an article? Sign in with your profile.


Looking for the Ask the Expert series? Find it in its new home: communities.sas.com/askexpert.