We’re smarter together. Learn from this collection of community knowledge and add your expertise.

PROC SQL Continued - Joins, Aliases and Basketball!

by Regular Contributor on ‎12-18-2015 12:04 PM (986 Views)

Get the Data

 

Last week's installment of Free Data Friday used PROC SQL with NASA data. Today, we come back down to earth with a look at open source basketball data.

 

I found the website http://www.opensourcesports.com/ completely by accident, and promptly bookmarked it for future reference. Free Data Friday and .jpg  After checking out a couple of different datasets, I decided to go with Basketball (http://www.opensourcesports.com/basketball/).  There’s a lot of good stuff there, so I highly recommend checking it out if you’re a sports fan.           

 

How to go about getting SAS University Edition

If you don’t already have University Edition, get it here and follow the instructions from the pdf carefully. If you need help with almost any aspect of using University Edition, check out these video tutorials. Additional resources are available in this article.

 

 

Getting the data ready

I’ve imported the tables into a Library named “bookdata”, and have reviewed the data.  There’s a couple of key things to note, but I’ll highlight them when they are relevant.  Having said that, I can’t stress enough the importance of doing a review of your data before you import, as well as after – I’ve had many experiences where I thought the data imported incorrectly, spent a long time working to fix it, only to realise the raw dataset was incorrect. 

 

The Results

I was born, raised and live in or just outside Toronto, Ontario, Canada, and so although I’m not a huge sports fan, Toronto will always be my team of choice.  So using the Basketball data, I want to see all the players that have played for Toronto. 

 FreeDataFriday_graphic.jpg

I’ve covered the basics of SQL Syntax in my previous two posts, so won’t go over them here.  I want to focus on a couple of key lines; specifically, the first two lines of the WHERE statement.  Because the data is split into multiple tables (Players, Teams, Managers, etc), we need some way to join them together.  This is the first example, and by far the easiest – I’m asking SAS to show me the playerID and year, where the tmID and year in the PLAYERS table matches the tmID and year in the TEAMS table. 

 Picture1.png

Below are the SAS results; if you had only provided the tmID as the joining variable, you would have had all players showing for all years (I highly recommend you try this, so you can see what “wrong” results look like).  This is good, but doesn’t tell us much and is not the easiest to interpret.  We need more information, and that’s what we’ll do next.

 Picture2.png

My next SQL statement brings in a third table – MASTER.  The only significant difference is the line that reads “and players.playerID = master.bioID”.  This is one of the “pitfalls” I mentioned above – without doing a preliminary data exploration of the data, I didn’t know that the same columns have different names in different tables.  Knowing this, I can run my query and expect I’ll get the variables I’ve asked for in the SELECT statement.

 Picture3.png

The results are below and are perfect, this is exactly what I was hoping for.  I now have basic information that is somewhat useful – I know that Douglas Christie was born in the US, and is 78” tall. 

 Picture4.png

I’m going to pause here for a moment to show a technique in SQL called “Aliasing”.  This means you give the tables in your FROM statement names that are easier to remember, making maintenance a lot easier.  In a real-world example, I’m working on a query at work that is currently up to 1200 lines of code – without aliasing, it would be a lot more typing, and therefore way more room for error!

 

You’ll note that the FROM statement now has “as pl”, “as tm” and “as ma”.  These are your alias statements.  Note that my SELECT statement still has “players.year” – this is simply to show that both will work, but using the alias is a lot shorter.  Depending on your preference, you can use A, B, C, or anything else that you can remember.  For my database at work, which as over 800 tables, I use the first letter of each word – so the table patientimaginghistory becomes pih; feel free to try different ways and see what’s comfortable.

 Picture5.png

We have aliasing, and we have basic joins.  Let’s take it up a notch, and add one more table, PLAYER_ALLSTAR, which I’ve aliased as pla.

Picture6.png

 

Because of the type of join we’re doing (a = b, or return only the data where the ID in a matches the ID in b), I get the following results:

 Picture7.png

But what if we wanted to see all the players, but see the Conference for those that played in the Allstar game?  We change our tactic – you’ll notice the WHERE statement only has the criteria that we want just Toronto.  But what the heck happened to the FROM statement?  I’ve used the INNER and LEFT joins to build the relationships.  INNER JOIN is the same as saying a = b; it’s an exact match on the variables you specify.  Because we want to see all players, plus any data on players that were in the Allstar game, we use a LEFT JOIN – all the rows from the table on the left side of the relationship, plus any rows from the right that are a match based on the variables.

 Picture8.png

Admittedly, I had to scroll down quite a ways before I found players from Toronto who played in the Allstar game, but here is one example.  Benjamin Lenard has not played in the Allstar in 1996, and Billups Chauncey did in 1997 for the East. 

 Picture9.png

Now it’s your turn!

 

Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.

 

Need data for learning?

 

The SAS Communities Library has a growing supply of free data sources that you can use in your training to become a data scientist. The easiest way to find articles about data sources is to type "Data for learning" in the communities site search field like so:

 Picture10.png

We publish all articles about free data sources under the Analytics U label in the SAS Communities Library. Want email notifications when we add new content? Subscribe to the Analytics U label by clicking "Find A Community" in the right nav and selecting SAS Communities Library at the bottom of the list. In the Labels box in the right nav, click Analytics U:

 Picture11.png

 

Click Analytics U, then select "Subscribe" from the Options menu.

 

Happy Learning!

 

 

Comments
by Regular Contributor
on ‎12-23-2015 05:40 PM

Hi all,

 

Someone posted a question on Facebook and in reviewing my code for the last example, I realised I missed a join.  

 

Here's the code that I should have used:

 

basketball Edit 1.png

 

You'll note the LEFT JOIN has been updated to join the YEAR from the Players table with the SEASON_ID from the Players_Allstar table.

 

Apologies to anyone who was working with this and having problems :-)

Chris

Your turn
Sign In!

Want to write an article? Sign in with your profile.