Enough about the US elections. We spent three weeks poring over data related directly or tangentially to the most contentious American election in recent memory. Let's escape to the movies. IMDB is great for showing you some complex querying
Get the Data
I got the data from one of my favourite data repositories - Kaggle. Data was already in a CSV file and imported to SAS University Edition without issue.
How to go about getting SAS University Edition
If you don’t already have University Edition, get it here and follow the instructions from the pdf carefully. If you need help with almost any aspect of using University Edition, check out these video tutorials. Additional resources are available in this article.
Getting the data ready
The data was already in a format that I could use, and there were no missing or clearly incorrect data.
PROC SQL is my favourite method to query and review data, and allows for fairly complex functionality. One of the things I used to struggle with was pulling data
based on a summary statistic that changes – for example, how to pull all the costs that are above the average, when the average changes every day / hour / minute? Sure, you can put the data into a temp table and re-query it, but that’s resource heavy. You can try doing something like “select * from table_A where cost > avg(cost)” but then you’ll get this error:
ERROR: Summary functions are restricted to the SELECT and HAVING clauses only.
Hmmm. OK, so the other option is to do something along the lines of:
Proc sql; Select movie_title, cost-avg(cost) From work.import Group by movie_title Having cost-avg(cost)>0 Order by movie_title; Quit;
This would work but is very clunky and if you’re dealing with a massive amount of data, would be inefficient (SQL evaluates the WHERE clause first, so in this case it’s running the cost-avg(cost) on all your rows).
So, is there an easier way of doing this? Of course! The simple examples I show here do not do this justice, but does provide you groundwork for more complex queries. I have three different queries, all doing the same basic function but with different variables.
proc sql; title 'Maximum Duration'; select movie_title, duration, actor_1_name, title_year, imdb_score from work.import where duration = (select max(duration) from work.import); title 'Maximum Budget'; select movie_title, budget format=dollar20., actor_1_name, title_year, imdb_score from work.import where budget = (select max(budget) from work.import); title 'Maximum Most Cast Facebook Likes'; select movie_title, cast_total_facebook_likes, actor_1_name, title_year, imdb_score from work.import where cast_total_facebook_likes = (select max(cast_total_facebook_likes) from work.import); quit;
One side note – the TITLE above the SELECT statement is a nice easy way to keep track of your output; if you’re running 10 or more SQL queries and one doesn’t return any data, knowing which one it is can be hard. Putting in titles makes it easy for tracing back to the code. You’ll also note I had to use the format= statement for the budget column in the second query. This just adds the $ and the commas in the right places.
Here’s the output:
First off, there’s an 8 ½ hour long movie called "Trapped"? I bet you feel trapped watching it! (I am of course now curious and will be diligently trying to track it down!) The other thing that jumps out is the budget for "The Host" - $12,215,500,000? Yikes! Having seen this movie, the only thing I can think of is that because this movie was filmed in Korea, the exchange rate must have somehow come into play.
Although a relatively simple example, you can use any of the SQL functions in the “nested” query. Here’s an example of using the AVG to find the top 10 movies that are above the overall average. The AVG_DOLLAR column I have included is simply for show, I wouldn’t include it as an actual column.
proc sql outobs=10; select movie_title, budget format=dollar16., avg(budget) format=dollar16. as Avg_Dollar from work.import where budget> (select avg(budget) from work.import) order by budget desc; quit;
Here’s the output:
Most of these movies are from Asia (and I highly recommend "The Host," "Princess Mononoke" and "Akira!") Again, the exchange rate may explain the exhorbitant budgets.
Now it’s your turn!
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
Need data for learning?
The SAS Communities Library has a growing supply of free data sources that you can use in your training to become a data scientist. The easiest way to find articles about data sources is to type "Data for learning" in the communities site search field like so:
We publish all articles about free data sources under the Analytics U label in the SAS Communities Library. Want email notifications when we add new content? Subscribe to the Analytics U label by clicking "Find A Community" in the right nav and selecting SAS Communities Library at the bottom of the list. In the Labels box in the right nav, click Analytics U:
Click Analytics U, then select "Subscribe" from the Options menu.