Get the Data
Last week's installment of Free Data Friday used PROC SQL with NASA data. Today, we come back down to earth with a look at open source basketball data.
I found the website http://www.opensourcesports.com/ completely by accident, and promptly bookmarked it for future reference. After checking out a couple of different datasets, I decided to go with Basketball (http://www.opensourcesports.com/basketball/). There’s a lot of good stuff there, so I highly recommend checking it out if you’re a sports fan.
How to go about getting SAS University Edition
If you don’t already have University Edition, get it here and follow the instructions from the pdf carefully. If you need help with almost any aspect of using University Edition, check out these video tutorials. Additional resources are available in this article.
Getting the data ready
I’ve imported the tables into a Library named “bookdata”, and have reviewed the data. There’s a couple of key things to note, but I’ll highlight them when they are relevant. Having said that, I can’t stress enough the importance of doing a review of your data before you import, as well as after – I’ve had many experiences where I thought the data imported incorrectly, spent a long time working to fix it, only to realise the raw dataset was incorrect.
I was born, raised and live in or just outside Toronto, Ontario, Canada, and so although I’m not a huge sports fan, Toronto will always be my team of choice. So using the Basketball data, I want to see all the players that have played for Toronto.
I’ve covered the basics of SQL Syntax in my previous two posts, so won’t go over them here. I want to focus on a couple of key lines; specifically, the first two lines of the WHERE statement. Because the data is split into multiple tables (Players, Teams, Managers, etc), we need some way to join them together. This is the first example, and by far the easiest – I’m asking SAS to show me the playerID and year, where the tmID and year in the PLAYERS table matches the tmID and year in the TEAMS table.
Below are the SAS results; if you had only provided the tmID as the joining variable, you would have had all players showing for all years (I highly recommend you try this, so you can see what “wrong” results look like). This is good, but doesn’t tell us much and is not the easiest to interpret. We need more information, and that’s what we’ll do next.
My next SQL statement brings in a third table – MASTER. The only significant difference is the line that reads “and players.playerID = master.bioID”. This is one of the “pitfalls” I mentioned above – without doing a preliminary data exploration of the data, I didn’t know that the same columns have different names in different tables. Knowing this, I can run my query and expect I’ll get the variables I’ve asked for in the SELECT statement.
The results are below and are perfect, this is exactly what I was hoping for. I now have basic information that is somewhat useful – I know that Douglas Christie was born in the US, and is 78” tall.
I’m going to pause here for a moment to show a technique in SQL called “Aliasing”. This means you give the tables in your FROM statement names that are easier to remember, making maintenance a lot easier. In a real-world example, I’m working on a query at work that is currently up to 1200 lines of code – without aliasing, it would be a lot more typing, and therefore way more room for error!
You’ll note that the FROM statement now has “as pl”, “as tm” and “as ma”. These are your alias statements. Note that my SELECT statement still has “players.year” – this is simply to show that both will work, but using the alias is a lot shorter. Depending on your preference, you can use A, B, C, or anything else that you can remember. For my database at work, which as over 800 tables, I use the first letter of each word – so the table patientimaginghistory becomes pih; feel free to try different ways and see what’s comfortable.
We have aliasing, and we have basic joins. Let’s take it up a notch, and add one more table, PLAYER_ALLSTAR, which I’ve aliased as pla.
Because of the type of join we’re doing (a = b, or return only the data where the ID in a matches the ID in b), I get the following results:
But what if we wanted to see all the players, but see the Conference for those that played in the Allstar game? We change our tactic – you’ll notice the WHERE statement only has the criteria that we want just Toronto. But what the heck happened to the FROM statement? I’ve used the INNER and LEFT joins to build the relationships. INNER JOIN is the same as saying a = b; it’s an exact match on the variables you specify. Because we want to see all players, plus any data on players that were in the Allstar game, we use a LEFT JOIN – all the rows from the table on the left side of the relationship, plus any rows from the right that are a match based on the variables.
Admittedly, I had to scroll down quite a ways before I found players from Toronto who played in the Allstar game, but here is one example. Benjamin Lenard has not played in the Allstar in 1996, and Billups Chauncey did in 1997 for the East.
Now it’s your turn!
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
Need data for learning?
The SAS Communities Library has a growing supply of free data sources that you can use in your training to become a data scientist. The easiest way to find articles about data sources is to type "Data for learning" in the communities site search field like so:
We publish all articles about free data sources under the Analytics U label in the SAS Communities Library. Want email notifications when we add new content? Subscribe to the Analytics U label by clicking "Find A Community" in the right nav and selecting SAS Communities Library at the bottom of the list. In the Labels box in the right nav, click Analytics U:
Click Analytics U, then select "Subscribe" from the Options menu.