phuzface
Fluorite | Level 6

I've got a final project for my multivariate stats class. I have a data set of 265 US colleges with variables describing incoming students, enrolled students, costs, tuition, etc.

 

I'm trying to do a classification analysis by region. I don't understand why this is happening, but the lowest prediction error rate I'm getting is by including ALL of the variables.  In other words, the error rate increases when I remove highly correlated variables.  For instance, there are several mean SAT variables (Math, verbal, 1st quartile, 3rd quartile) that are all highly correlated with overall SAT mean score.  If I take any of these out, the error rate goes up. 

 

Is there something going on here that I'm missing? I don't think I'm going to do well on this project if I can't take variables out of the model and discuss why. Any help is appreciated.

 

Data set is attached and my code is below.

 

proc corr data=schools;

run;

 

proc discrim data=schools crossvalidate pool=yes;
class Region;
var
Public_Private
Mean_Math_SAT
Mean_Verbal_SAT
Mean_SAT
First_Quartile_Math_SAT
Third_Quartile_Math_SAT
First_Quartile_Verbal_SAT
Third_Quartile_Verbal_SAT
Applications_Received
Applicants_Accepted
New_Students_Enrolled
Percentage_of_students_accepted
Percent_accepted_stud_enrolled
Stud_top_ten_of_HS
Stud_top_twentyfive_percent_HS
Fulltime_Undergrads
Parttime_Undergrads
Percentage_Fulltime
InState_Tuition
OutofState_Tuition
RoomBoard_Cost
Additional_Fees
Estimated_Books_Cost
Estimated_Personal_Spending
Percentage_of_Faculty_with_PhD
Percent_Faculty_terminaldegree
StudentFaculty_Ratio
Percentage_of_alumni_who_donate
Instructional_ExpenditureStudent
Graduation_Rate
;
run;

2 REPLIES
PGStats
Opal | Level 21

You can find a better (lower error rate) set of discriminant variables with proc stepdisc:

 

proc stepdisc data=schools method=stepwise;
class Region;
var
Public_Private
Mean_Math_SAT
Mean_Verbal_SAT
Mean_SAT
First_Quartile_Math_SAT
Third_Quartile_Math_SAT
First_Quartile_Verbal_SAT
Third_Quartile_Verbal_SAT
Applications_Received
Applicants_Accepted
New_Students_Enrolled
Percentage_of_students_accepted
Percent_accepted_stud_enrolled
Stud_top_ten_of_HS
Stud_top_twentyfive_percent_HS
Fulltime_Undergrads
Parttime_Undergrads
Percentage_Fulltime
InState_Tuition
OutofState_Tuition
RoomBoard_Cost
Additional_Fees
Estimated_Books_Cost
Estimated_Personal_Spending
Percentage_of_Faculty_with_PhD
Percent_Faculty_terminaldegree
StudentFaculty_Ratio
Percentage_of_alumni_who_donate
Instructional_ExpenditureStudent
Graduation_Rate
;
run;

proc discrim data=schools crossvalidate pool=yes;
class Region;
var &_stdvar;
run;
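
To see which variables the stepwise search kept before they are passed to PROC DISCRIM, the macro variables that PROC STEPDISC creates can be written to the log. A minimal sketch, assuming the PROC STEPDISC step above has already run in the same SAS session:

```sas
/* _StdVar holds the names of the selected variables;   */
/* _StdNum holds how many variables were selected.      */
%put NOTE: STEPDISC kept &_StdNum variables: &_StdVar;
```

If the list is empty, no variable met the entry criterion and the PROC DISCRIM step would have nothing to work with.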
PG
PaigeMiller
Diamond | Level 26

"I don't understand why this is happening, but the lowest prediction error rate I'm getting is by including ALL of the variables."

 

I think you need to be open to what the data is telling you. This is what the data is telling you.

 

HOWEVER

 

There is such a thing as overfitting. When a model is overfit, you have added at least one variable that is essentially being fit to the random noise in the data rather than to the signal. So, if you have overfitting, you ought to remove terms from the model, which will give you WORSE fit statistics but a more "stable" model (to phrase things differently, a model that is less variable). Avoiding overfitting therefore gives you a WORSE fit but a better model on other measures.

 

How do you avoid overfitting in PROC DISCRIM? You can use the CROSSVALIDATE option which will show you the classifications using cross-validation; if those are poor, then you can remove terms from the model until the cross-validation statistics are closer to perfect classification (realizing that perfect classification isn't really possible).

 

There is an example in the PROC DISCRIM documentation where the cross-validation error rates are much higher than the error rates of the model, and this indicates the model has been overfit.
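
One way to run the same check on the schools data is to fit a smaller model and compare the resubstitution and cross-validation error-count tables that PROC DISCRIM prints. A minimal sketch; keeping only Mean_SAT from the block of SAT variables is an assumption made for illustration, not a choice recommended anywhere in this thread:

```sas
/* Reduced model: Mean_SAT stands in for the six other SAT variables. */
/* Which variables to drop is an illustrative assumption.             */
/* CROSSLISTERR lists the observations that are misclassified under   */
/* cross-validation, in addition to the error-count summaries.        */
proc discrim data=schools crossvalidate crosslisterr pool=yes;
   class Region;
   var Public_Private
       Mean_SAT
       Applications_Received
       Applicants_Accepted
       New_Students_Enrolled
       Percentage_of_students_accepted
       Percent_accepted_stud_enrolled
       Stud_top_ten_of_HS
       Stud_top_twentyfive_percent_HS
       Fulltime_Undergrads
       Parttime_Undergrads
       Percentage_Fulltime
       InState_Tuition
       OutofState_Tuition
       RoomBoard_Cost
       Additional_Fees
       Estimated_Books_Cost
       Estimated_Personal_Spending
       Percentage_of_Faculty_with_PhD
       Percent_Faculty_terminaldegree
       StudentFaculty_Ratio
       Percentage_of_alumni_who_donate
       Instructional_ExpenditureStudent
       Graduation_Rate;
run;
```

If the gap between the resubstitution and cross-validation error rates is large for the full model and smaller here, that would point toward the full model being overfit.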

 

I have never been a fan of stepwise methods, and I avoid them like the plague. Google "problems with stepwise". What would I use? I would use PLS Discriminant Analysis (PLS-DA), which is PROC PLS with dummy variables for Y that indicate which region each observation belongs to.
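
A rough sketch of how that PLS-DA setup might look for the schools data. Using PROC TRANSREG to build the dummy variables, the macro variable name xvars, the data set name coded, and the CV=ONE setting are all assumptions for illustration; the post itself only specifies PROC PLS with dummy Y variables:

```sas
/* Predictor list copied from the PROC DISCRIM step in the question. */
/* The macro variable name xvars is hypothetical.                    */
%let xvars = Public_Private Mean_Math_SAT Mean_Verbal_SAT Mean_SAT
   First_Quartile_Math_SAT Third_Quartile_Math_SAT
   First_Quartile_Verbal_SAT Third_Quartile_Verbal_SAT
   Applications_Received Applicants_Accepted New_Students_Enrolled
   Percentage_of_students_accepted Percent_accepted_stud_enrolled
   Stud_top_ten_of_HS Stud_top_twentyfive_percent_HS
   Fulltime_Undergrads Parttime_Undergrads Percentage_Fulltime
   InState_Tuition OutofState_Tuition RoomBoard_Cost Additional_Fees
   Estimated_Books_Cost Estimated_Personal_Spending
   Percentage_of_Faculty_with_PhD Percent_Faculty_terminaldegree
   StudentFaculty_Ratio Percentage_of_alumni_who_donate
   Instructional_ExpenditureStudent Graduation_Rate;

/* Code Region as one 0/1 indicator column per level.               */
/* ZERO=NONE keeps a column for every level; the ID statement       */
/* carries the predictors through to the output data set.           */
proc transreg data=schools design;
   model class(Region / zero=none);
   id &xvars;
   output out=coded;
run;

/* With DESIGN, PROC TRANSREG stores the indicator column names in  */
/* the macro variable _trgind. CV=ONE asks PROC PLS for             */
/* leave-one-out cross-validation when choosing the factor count.   */
proc pls data=coded method=pls cv=one;
   model &_trgind = &xvars;
run;
```

To turn the fit into a classifier, each college would then be assigned to the region whose indicator has the largest predicted value; that step is not shown here.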

--
Paige Miller
