What’s this data?
We kick off (pun intended) our Free Data Friday series with a Rushing Defense data set from Pro-Football-Reference. University Edition can help show you how your favorite team stands up against the run.
Get University Edition
If you don’t already have University Edition, get it here. If you download and use it through virtualization software, follow the appropriate Quick Start Guide closely (a PDF on the download page). If you need help with almost any aspect of using University Edition, check out these video tutorials.
Get the data and prep it for analysis
Download the data -- the Rushing Defense data set -- from Pro-Football-Reference. From that web page, click the Export option to create a CSV file that contains a text version of the data. You'll want to save that file in the "myfolders" area that you've designated for your SAS University Edition installation. Here are more details on how you copy data to the myfolders area.
Now, let’s start coding. You will need the location of your saved file, so make sure you grab the file path. A filename statement assigns that path to a short reference name, and the infile statement then uses that name to bring in the file. Normally, a data step reads an existing SAS dataset with a set statement, but since this is a CSV and not a SAS dataset, an infile statement is needed.
Next, specify options to help SAS understand how to read the CSV. The dlm= option specifies your delimiter. The dsd option treats two consecutive delimiters as a missing value (and removes quotation marks around values), so an empty field is recorded as missing instead of being filled with the value from the next non-missing column, which would record the wrong values for each variable. The missover option keeps SAS from jumping to the next line to look for values when a row ends early; the remaining variables are set to missing instead. Then use the firstobs=2 option so that SAS starts reading at the second row. The first row holds only the column names, which aren’t needed.
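To see what these options do without touching the real file, here is a minimal sketch that reads a few inline rows. The team names, column names, and numbers are made up for illustration and are not from the Pro-Football-Reference export.

```sas
/* Hypothetical three-column sample, not the real export. */
data demo_infile;
    /* Same options as described above, applied to inline rows. */
    infile datalines dlm=',' dsd missover firstobs=2;
    input Team :$22. R_ATT :3. R_TD :2.;
    datalines;
Team,Att,TD
Team A,410,9
Team B,,12
Team C,455
;
run;

proc print data=demo_infile;
run;
```

If the options behave as described, the header row should be skipped, Team B's attempts should be missing (the empty field between two commas), and Team C's touchdowns should be missing (the row ends early). Removing dsd or missover and rerunning is a quick way to see why each one matters.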
In your input statement, list each variable in the order its column appears in the file, followed by the variable’s informat (the instruction that tells SAS how to read the value). Each informat here is preceded by a colon; the colon is not part of the informat, but a modifier telling SAS to stop reading the value once the delimiter is encountered. In most cases, the informat is just the width of the variable. For rank, for instance, the informat is 2. since the value is always two digits long or shorter.
However, when there are decimals in the values, you write the informat as N.X. N represents the total number of characters in the value, including the decimal point, commas, or any other character, and X is the number of digits after the decimal. (When a value in the file already contains a decimal point, SAS keeps that decimal point as written.) All informats end in a period except those written as N.X, where the period in the middle is enough.
Last, to read a variable as character instead of numeric, put a $ in front of the width. $20. is an example of a character informat.
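The informat rules above can be tried on a couple of inline rows before pointing SAS at the real file. This is only a sketch: the two teams and their values are invented, and the variable names are borrowed from the program in this article.

```sas
/* Hypothetical rows for illustration only. */
data demo_informats;
    infile datalines dlm=',' dsd;
    /* 2.   : numeric, up to two characters wide          */
    /* $22. : character, up to 22 characters wide         */
    /* 3.1  : numeric, three characters wide, one decimal */
    /* 5.1  : numeric, five characters wide, one decimal  */
    input R_Rank :2. Team :$22. R_Y_A :3.1 R_Yds_per_G :5.1;
    datalines;
1,Example City Hawks,3.4,85.2
12,Sample Town Bears,4.1,101.9
;
run;

proc print data=demo_informats;
run;
```

The colon in front of each informat is what lets the team names differ in length without throwing off the columns that follow.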
filename Rush_D "/folders/myfolders/my_data/NFL Rushing Defense.csv";

data Rush_D2;
    infile Rush_D dlm=',' dsd missover firstobs=2;
    input R_Rank :2. Team :$22. G :2. R_ATT :3. R_YDS :4. R_TD :2. R_Lng :5.
          R_Y_A :3.1 R_Yds_per_G :5.1 fmb :3. R_Exp_Pts_Cont :6.2;
    if anydigit(R_Rank) = 0 then delete;
run;

proc corr data=rush_d2;
    var R_YDS R_ATT R_TD;
run;

proc sgplot data=rush_d2;
    bubble x=R_YDS y=R_ATT size=R_TD;
run;
quit;
What we’re analyzing
The anydigit function returns the position of the first numeric digit in a value, or zero if there are no digits. I used this function to eliminate the summary rows at the bottom, since we want only the teams and their individual numbers. Proc corr shows the correlations of the variables in the var statement. Proc sgplot makes the bubble plot shown.
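Here is a small sketch of the anydigit filter on its own. It assumes, as in the program above, that summary rows arrive with an empty rank column; the rows themselves are hypothetical.

```sas
/* Hypothetical rows: two teams followed by two summary rows. */
data demo_filter;
    infile datalines dlm=',' dsd missover;
    input R_Rank :2. Team :$22. R_TD :2.;
    /* Same test as the article: no digit in the rank means a summary row. */
    if anydigit(R_Rank) = 0 then delete;
    datalines;
1,Team A,9
2,Team B,12
,Avg Team,10
,League Total,21
;
run;

proc print data=demo_filter;
run;
```

Only the two team rows should remain. Because R_Rank is numeric, SAS converts it to text before anydigit looks at it, which writes a conversion note to the log. Testing with missing(R_Rank) instead is one possible alternative that avoids the note, though it is not what the original program uses.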
What does the output mean?
From these graphs, we see that there is a strong positive correlation between yards and attempts. The bubble size is determined by the number of touchdowns each team allowed against the run. As you can see, teams that allowed fewer yards also tended to allow fewer touchdowns.
There are exceptions. The Kansas City Chiefs allowed 2,036 yards against the run, making them 28th out of 32 teams in yardage allowed. However, they allowed the fewest rushing touchdowns of any team, with 4.
Now it’s your turn!
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
Need data for learning?
The SAS Communities Library has a growing supply of free data sources that you can use in your training to become a data scientist. The easiest way to find articles about data sources is to type "Data for learning" in the communities site search field.
We publish all articles about free data sources under the Analytics U label in the SAS Communities Library. Want email notifications when we add new content? Subscribe to the Analytics U label by clicking "Find A Community" in the right nav and selecting SAS Communities Library at the bottom of the list.
In the Labels box in the right nav, click Analytics U, then select "Subscribe" from the Options menu.