What’s this data?
If you're considering where to apply to college and money's an issue, this week's installment of Free Data Friday is for you! We look at how schools compare based on patterns in federal student financial aid data,
How to download
If you don’t already have University Edition, get it here and follow the instructions from the pdf carefully. If you need help with almost any aspect of using University Edition, check out these video tutorials. Additional resources are available in this article.
Click here to download the data. Save it to your hard drive and it's ready to load up into University Edition.
How to get the data and prep it for analysis
First thing we do is use a proc import to get the data in. We specify dbms= xls this time since we have an excel file. Use the namerow= function to tell SAS which row you want it to use for the column headers. There are multiple rows of headers so this is necessary to avoid confusion in the default variable naming. The datarow= statement tells SAS to start filling in the dataset at the row you set it equal to, for us it's at row five.
In the log you can see that there were some invalid characters in the variable names as they're written in the excel file. To fix this SAS replaces most of them with underscores by default. Another issue we have with the naming convention is that the headers we excluded some header rows leaving us with just the sub headers which are the same in some cases. SAS by default will number them off if they are the same. Now we have to make a datastep and rename these variables. Put a prefix on each variable to represent the headers we cut from the top of the data. Then drop all the variables we just renamed.
Now we need to use the glm procedure to make a prediction on a variable, in this case FWS_Disbursements. To predict this, we also need to include all other Federal Work-Study data as predictors. In addition to those variables I have included School_Type to see if there is a difference in the amount of financial aid given based on School_Type. Use your class statement to separate the output by, School_Type. Then in the model statement put in the variable you want to predict first then after the equals sign you put in every variable you want to use as a predictor. Finally we output these results into a dataset using the output statement. The out= statement is where you name the new dataset. The p= and residual= are the variables we're adding from this procedure, and we're simply naming them with these statements. End the proc glm with a run and a quit. Now we can do the univariate procedure. The only new thing for us in this procedure is the id statement. This shows the most extreme observations from the var statement, and what their values are for the variable specified in the id statement.
proc import datafile="/folders/myfolders/my_data/Financial Aid.xls" out=fin_aid dbms=xls; namerow=4; datarow=5; run; data financial; set fin_aid; FWS_Recipients=Recipients; FWS_Federal_Award=__Federal_Award; FWS_Disbursements=Disbursements; PL_Recipients=Recipients_1; PL_Federal_Award=__Federal_Award_1; PL_Disbursements=Disbursements_1; FSEOG_Recipients=Recipients_2; FSEOG_Federal_Award=__Federal_Award_2; FSEOG_Disbursements=Disbursements_2; drop O Recipients __Federal_Award Disbursements Recipients_1 __Federal_Award_1 Disbursements_1 Recipients_2 __Federal_Award_2 Disbursements_2; run; proc glm data=financial; class School_Type; model FWS_Disbursements=FWS_Federal_Award School_Type FWS_Recipients; output out=predictions p=predicted residual=residual; run; quit; proc univariate data=predictions; var predicted; id school; run;
What does this output mean?
The table on the left is part of the output from the glm procedure. This shows us the significance level of the predictive variables. From this we can see that all the variables are statistically significant (all with p-value < .0001). This is not surprising with the number of observations being this high, most variable will be statistically significant. Using these figures in conjunction with the rest of the output and the dataset that was output from the glm procedure, there is more interesting information to be found.
The table on the right is part of the output from the univariate procedure. This shows a few key marks in the distribution of the predicted variable. Notice that some of the predictions are negative. This is because the glm procedure predicts solely off of what it sees. This is because there are many schools that are receiving little to no funding. These schools have been predicted to lose money based on the model created in the glm procedure. Since they had so little to start with they're projected to lose more than they had to begin with, and that's why they are listed as negative. Since we know there can't be negative dollars given to a school we can treat those as zeroes.
Now that we have these benchmarks, you can look up your favorite school and see where it ranks. If you don't see the number you're expecting, fear not; there are two other financial aid programs in this dataset. Maybe your school is leading the way in a different program.
Now it’s your turn!
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
Need data for learning?
The SAS Communities Library has a growing supply of free data sources that you can use in your training to become a data scientist. The easiest way to find articles about data sources is to type "Data for learning" in the communities site search field like so:
We publish all articles about free data sources under the Analytics U label in the SAS Communities Library. Want email notifications when we add new content? Subscribe to the Analytics U label by clicking "Find A Community" in the right nav and selecting SAS Communities Library at the bottom of the list. In the Labels box in the right nav, click Analytics U:
Click Analytics U, then select "Subscribe" from the Options menu.