Discriminant analysis: removing highly correlated variables causes prediction error to increase!

phuzface · Posted 06-03-2018 03:17 PM

I've got a final project for one of my multivariate stats classes. I have a data set with 265 US colleges and various variables about incoming students, enrolled students, costs, tuition, etc.

I'm trying to do a classification analysis by region. I don't understand why this is happening, but the lowest prediction error rate I'm getting is by including ALL of the variables. In other words, the error rate increases when I remove highly correlated variables. For instance, there are several mean SAT variables (Math, verbal, 1st quartile, 3rd quartile) that are all highly correlated with overall SAT mean score. If I take any of these out, the error rate goes up.

Is there something going on here that I'm missing? I don't think I'm going to do well on this project if I'm not able to take variables out of the model and discuss why. Any help is appreciated.

Data set is attached and my code is below.

/* Pairwise correlations among all numeric variables */
proc corr data=schools;
run;

proc discrim data=schools crossvalidate pool=yes;
class Region;
var
Public_Private
Mean_Math_SAT
Mean_Verbal_SAT
Mean_SAT
First_Quartile_Math_SAT
Third_Quartile_Math_SAT
First_Quartile_Verbal_SAT
Third_Quartile_Verbal_SAT
Applications_Received
Applicants_Accepted
New_Students_Enrolled
Percentage_of_students_accepted
Percent_accepted_stud_enrolled
Stud_top_ten_of_HS
Stud_top_twentyfive_percent_HS
Fulltime_Undergrads
Parttime_Undergrads
Percentage_Fulltime
InState_Tuition
OutofState_Tuition
RoomBoard_Cost
Additional_Fees
Estimated_Books_Cost
Estimated_Personal_Spending
Percentage_of_Faculty_with_PhD
Percent_Faculty_terminaldegree
StudentFaculty_Ratio
Percentage_of_alumni_who_donate
Instructional_ExpenditureStudent
Graduation_Rate
;
run;

PGStats · Posted 06-03-2018 05:47 PM

You can find a better (lower error rate) set of discriminant variables with proc stepdisc:

proc stepdisc data=schools method=stepwise;
class Region;
var
Public_Private
Mean_Math_SAT
Mean_Verbal_SAT
Mean_SAT
First_Quartile_Math_SAT
Third_Quartile_Math_SAT
First_Quartile_Verbal_SAT
Third_Quartile_Verbal_SAT
Applications_Received
Applicants_Accepted
New_Students_Enrolled
Percentage_of_students_accepted
Percent_accepted_stud_enrolled
Stud_top_ten_of_HS
Stud_top_twentyfive_percent_HS
Fulltime_Undergrads
Parttime_Undergrads
Percentage_Fulltime
InState_Tuition
OutofState_Tuition
RoomBoard_Cost
Additional_Fees
Estimated_Books_Cost
Estimated_Personal_Spending
Percentage_of_Faculty_with_PhD
Percent_Faculty_terminaldegree
StudentFaculty_Ratio
Percentage_of_alumni_who_donate
Instructional_ExpenditureStudent
Graduation_Rate
;
run;

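/* &_stdvar is a macro variable created by PROC STEPDISC;
   it holds the names of the variables selected above */
proc discrim data=schools crossvalidate pool=yes;
class Region;
var &_stdvar;
run;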

PG

PaigeMiller · Posted 06-04-2018 08:24 AM

phuzface wrote: "I don't understand why this is happening, but the lowest prediction error rate I'm getting is by including ALL of the variables."

I think you need to be open to what the data is telling you. This is what the data is telling you.

HOWEVER

There is such a thing as overfitting. When a model is overfit, at least one variable (possibly more than one) is essentially being fit to the random noise in the data rather than to the signal in the data. So, if you have overfitting, you ought to remove terms from the model. That will give you WORSE fit statistics on the data used to build the model, but a more "stable" model (or, to phrase things differently, a model whose predictions are less variable). So avoiding overfitting gives you WORSE fit but a better model on other measures.

How do you avoid overfitting in PROC DISCRIM? You can use the CROSSVALIDATE option, which will show you the classifications using cross-validation. If those are poor, you can remove terms from the model until the cross-validation statistics are closer to perfect classification (realizing that perfect classification isn't really possible).

There is an example in the PROC DISCRIM documentation where the cross-validation error rates are much higher than the error rates of the model, and this indicates the model has been overfit.
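As a rough sketch of the comparison described above, you could fit a reduced model, save the cross-validated classifications with OUTCROSS=, and compute the misclassification rate yourself. The reduced variable list is only an illustrative choice, and the data set names cv_reduced and cv_flag and the variable Misclassified are made up for this example. Schools that PROC DISCRIM did not classify are excluded from the rate. Note that this is a simple overall proportion, so it may not match the prior-weighted total error rate that PROC DISCRIM prints.

```sas
/* Reduced model: only Mean_SAT is kept from the block of SAT variables
   (an illustrative choice, not a recommendation) */
proc discrim data=schools pool=yes crossvalidate outcross=cv_reduced noprint;
   class Region;
   var Public_Private Mean_SAT Percentage_of_students_accepted
       InState_Tuition OutofState_Tuition RoomBoard_Cost
       StudentFaculty_Ratio Graduation_Rate;
run;

/* _INTO_ holds the cross-validated predicted region for each school */
data cv_flag;
   set cv_reduced;
   where not missing(Region) and not missing(_INTO_);
   Misclassified = (Region ne _INTO_);   /* 1 = wrong, 0 = right */
run;

/* The mean of the 0/1 flag is the overall cross-validated error rate */
proc means data=cv_flag n mean;
   var Misclassified;
run;
```

Running the same steps with the full variable list gives a second number to compare against, using the same cross-validated yardstick for both models.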

I have never been a fan of stepwise methods, and I avoid them like the plague. Google "problems with stepwise". What would I use? I would use PLS Discriminant Analysis (PLS-DA), which is PROC PLS with dummy variables for Y to indicate which region the observation is in.
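A minimal sketch of the PLS-DA idea, assuming Region is a character variable with the four hypothetical levels shown below (replace them with the actual levels in the data) and using only an illustrative subset of the predictors. The data set names schools_pls, pls_pred, and pls_class, the R_ and P_ variables, and Predicted_Region are made up for this example. CV=ONE asks PROC PLS to choose the number of factors by leave-one-out cross-validation; the final cross-tabulation is a resubstitution classification, not a cross-validated one.

```sas
/* Step 1: build one 0/1 dummy response per region.
   ASSUMPTION: the four level names are hypothetical placeholders. */
data schools_pls;
   set schools;
   where not missing(Region);
   R_Northeast = (Region = 'Northeast');
   R_Midwest   = (Region = 'Midwest');
   R_South     = (Region = 'South');
   R_West      = (Region = 'West');
run;

/* Step 2: PLS regression of the dummies on the predictors
   (illustrative subset of the predictors only) */
proc pls data=schools_pls method=pls cv=one;
   model R_Northeast R_Midwest R_South R_West =
         Mean_SAT Percentage_of_students_accepted InState_Tuition
         OutofState_Tuition RoomBoard_Cost StudentFaculty_Ratio
         Graduation_Rate;
   output out=pls_pred predicted=P_Northeast P_Midwest P_South P_West;
run;

/* Step 3: assign each school to the region with the largest predicted value */
data pls_class;
   set pls_pred;
   length Predicted_Region $ 12;
   array p{4} P_Northeast P_Midwest P_South P_West;
   array lab{4} $ 12 _temporary_ ('Northeast' 'Midwest' 'South' 'West');
   best = 1;
   do i = 2 to 4;
      if p{i} > p{best} then best = i;
   end;
   /* Leave the prediction blank when no predicted values are available */
   if not missing(p{1}) then Predicted_Region = lab{best};
   drop i best;
run;

/* Step 4: actual versus predicted region */
proc freq data=pls_class;
   tables Region*Predicted_Region / norow nocol nopercent;
run;
```

Because PLS builds a small number of factors from the predictors, highly correlated variables such as the SAT measures can stay in the model together instead of being dropped one at a time.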

--
Paige Miller
