
Discriminant analysis: removing highly correlated variables causes p...


Posted 06-03-2018 03:17 PM (1839 views)

I've got a final project for one of my multivariate stats classes. I have a data set with 265 US colleges and various variables about incoming students, costs, tuition, etc.

I'm trying to do a classification analysis by region. I don't understand why this is happening, but the lowest prediction error rate I'm getting is from including ALL of the variables. In other words, the error rate increases when I remove highly correlated variables. For instance, there are several mean SAT variables (math, verbal, 1st quartile, 3rd quartile) that are all highly correlated with the overall mean SAT score. If I take any of these out, the error rate goes up.

Is there something going on here that I'm missing? I don't think I'm going to do well on this project if I can't remove variables from the model and discuss why. Any help is appreciated.

The data set is attached and my code is below.

```
proc corr data=schools;
run;

proc discrim data=schools crossvalidate pool=yes;
class Region;
var
Public_Private
Mean_Math_SAT
Mean_Verbal_SAT
Mean_SAT
First_Quartile_Math_SAT
Third_Quartile_Math_SAT
First_Quartile_Verbal_SAT
Third_Quartile_Verbal_SAT
Applications_Received
Applicants_Accepted
New_Students_Enrolled
Percentage_of_students_accepted
Percent_accepted_stud_enrolled
Stud_top_ten_of_HS
Stud_top_twentyfive_percent_HS
Fulltime_Undergrads
Parttime_Undergrads
Percentage_Fulltime
InState_Tuition
OutofState_Tuition
RoomBoard_Cost
Additional_Fees
Estimated_Books_Cost
Estimated_Personal_Spending
Percentage_of_Faculty_with_PhD
Percent_Faculty_terminaldegree
StudentFaculty_Ratio
Percentage_of_alumni_who_donate
Instructional_ExpenditureStudent
Graduation_Rate
;
run;
```

2 REPLIES


You can find a better (lower error rate) set of discriminant variables with **proc stepdisc**:

```
proc stepdisc data=schools method=stepwise;
class Region;
var
Public_Private
Mean_Math_SAT
Mean_Verbal_SAT
Mean_SAT
First_Quartile_Math_SAT
Third_Quartile_Math_SAT
First_Quartile_Verbal_SAT
Third_Quartile_Verbal_SAT
Applications_Received
Applicants_Accepted
New_Students_Enrolled
Percentage_of_students_accepted
Percent_accepted_stud_enrolled
Stud_top_ten_of_HS
Stud_top_twentyfive_percent_HS
Fulltime_Undergrads
Parttime_Undergrads
Percentage_Fulltime
InState_Tuition
OutofState_Tuition
RoomBoard_Cost
Additional_Fees
Estimated_Books_Cost
Estimated_Personal_Spending
Percentage_of_Faculty_with_PhD
Percent_Faculty_terminaldegree
StudentFaculty_Ratio
Percentage_of_alumni_who_donate
Instructional_ExpenditureStudent
Graduation_Rate
;
run;
proc discrim data=schools crossvalidate pool=yes;
class Region;
var &_stdvar;
run;
```

PG


> I don't understand why this is happening, but the lowest prediction error rate I'm getting is from including ALL of the variables.

I think you need to be open to what the data is telling you. This is what the data is telling you.

HOWEVER

There is such a thing as overfitting. When a model is overfit, at least one variable (possibly more than one) is essentially being fit to the random noise in the data rather than to the signal. So if your model is overfit, you ought to remove terms from it, which will give you WORSE fit statistics but a more stable (less variable) model. In other words, avoiding overfitting gives you worse apparent fit but a better model by other measures.

How do you avoid overfitting in PROC DISCRIM? Use the CROSSVALIDATE option, which shows the classifications under cross-validation; if those are poor, remove terms from the model until the cross-validation statistics get closer to perfect classification (realizing that perfect classification isn't really possible).
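A minimal sketch of that pruning step (the reduced variable list shown is purely illustrative, not a recommendation):

```
/* Re-run with a reduced variable list and compare the
   resubstitution and cross-validation error rates; keep
   pruning until the two sets of rates are close together. */
proc discrim data=schools crossvalidate pool=yes;
class Region;
var Mean_SAT InState_Tuition Graduation_Rate StudentFaculty_Ratio;
run;
```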

There is an example in the PROC DISCRIM documentation where the cross-validation error rates are much higher than the error rates of the model, and this indicates the model has been overfit.

I have never been a fan of stepwise methods, and I avoid them like the plague. Google "problems with stepwise". What would I use? I would use PLS Discriminant Analysis (PLS-DA), which is PROC PLS with dummy Y variables indicating which region each observation is in.
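A rough sketch of the PLS-DA setup (the Region values and the short predictor list below are illustrative assumptions about the data set, not taken from it):

```
/* Build one 0/1 dummy response per region (PLS-DA setup).
   The Region values here are hypothetical examples. */
data schools_da;
set schools;
y_northeast = (Region = 'Northeast');
y_south     = (Region = 'South');
y_midwest   = (Region = 'Midwest');
y_west      = (Region = 'West');
run;

/* Fit PLS with the dummy responses; CV=ONE chooses the
   number of factors by leave-one-out cross-validation. */
proc pls data=schools_da cv=one;
model y_northeast y_south y_midwest y_west =
      Mean_SAT InState_Tuition Graduation_Rate
      StudentFaculty_Ratio Percentage_Fulltime;
run;
```

Each observation would then be classified to the region whose dummy response has the highest predicted value.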

--

Paige Miller
