- Select Most Important variables before a Linear Re...

01-26-2015 10:01 AM

Hi All,

I would like to build a linear regression model and I need to select the most important variables (highly correlated with my target). Does anyone know a great technique (not decision trees)? I am using Base SAS for data preparation and I have around 1000 variables to start with. So I want to reduce the number of variables and select the most important ones before I enter them into Proc Reg.

Your help would be much appreciated.

Many Thanks

01-26-2015 10:53 AM

You could use stepwise regression (I wonder what the stats experts come up with). For example,

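Data R_Input (Drop=i j);
  Array X{*} X1-X1000;
  Do j=1 To 120;                 /* 120 observations */
    X1=Ranuni(1);
    Y=X1*3+2+Ranuni(1)-0.5;      /* if SAS finds X1, it works :-) */
    Do i=2 To 1000;              /* X2-X1000 are pure noise */
      X{i}=Ranuni(1);
    End;
    Output;
  End;
Run;

Proc Reg Data=R_Input;
  /* X1-X1000 is a numbered range, independent of variable order in the data set */
  Model Y = X1-X1000 / Selection=Stepwise SLEntry=0.1 SLStay=0.15;
Run;
Quit;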

01-26-2015 10:58 AM

Thank you, but I didn't want to use Proc Reg at this stage, as processing 1000 variables will take a long time. Is there a quicker way?

01-26-2015 12:28 PM

As PaigeMiller noted above, there is a way to find "few" representations of a large set of explanatory variables, which I think is very common in Financial Econometrics. You've probably found it on the internet already, but a simple example would be (even though you can't see the full effect, because x1 and x2 lack correlating variables):

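Data R_Input (Drop=i j);
  Array X{*} X1-X1000;
  Do j=1 To 140;
    X1=Ranuni(1);
    X2=Ranuni(1);
    /* First 120 rows are used for fitting; the last 20 have a missing Y and are only predicted */
    If j le 120 Then Y=X1*3-X2*0.4+2+Ranuni(1)-0.5;
    Else Call Missing (y);
    Do i=3 To 1000;              /* X3-X1000 are pure noise */
      X{i}=Ranuni(1);
    End;
    Output;
  End;
Run;

Proc PLS Data=R_Input Outmodel=Estimation Method=PLS CV=Split;
  Model Y = X1-X1000;
  Output Out=Estimate Predicted=Y_Hat;   /* Y_Hat is also filled in for rows with missing Y */
Run;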

01-26-2015 02:18 PM

Actually, I don't know whether Financial Econometrics uses PLS regularly or not ... but it is used in lots of fields, including Sociology, Biology, Chemistry, Physics, Spectroscopy, Manufacturing, Food Science and probably a bunch of others.

01-26-2015 11:05 AM

[1] Examine the strength of the correlation coefficient of each variable with the dependent variable. Say, keep those with |r| > 0.5 or some other reasonable threshold.

[2] Suppose you have chosen X1, X2, X3, ... X10. Check the linear relationships among them. Keep in your model only those that are weakly correlated with each other (to avoid collinearity).

[3] Continue this way until you have a manageable number of independent variables for your final model.
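
The steps above could be sketched in SAS roughly as follows. This is only an illustration under stated assumptions: the data set Demo, the names CorrOut, Screen and KeepList, and the 0.5 cutoff are all made up for the example, and the simulated data simply mirrors the random-number setup used elsewhere in this thread.

```sas
/* Hypothetical demo data: Y depends on X1 only; X2-X1000 are noise. */
data Demo (drop=i j);
  array X{*} X1-X1000;
  do j = 1 to 120;
    do i = 1 to dim(X);
      X{i} = ranuni(1);
    end;
    Y = 3*X1 + 2 + ranuni(1) - 0.5;
    output;
  end;
run;

/* Step [1]: correlation of every candidate with Y, written to a data set. */
proc corr data=Demo noprint outp=CorrOut;
  var X1-X1000;
  with Y;
run;

/* Keep only candidates whose |r| exceeds the (arbitrary) 0.5 threshold. */
data Screen (keep=Variable r abs_r);
  set CorrOut (where=(_TYPE_ = 'CORR'));
  array C{*} X1-X1000;
  length Variable $32;
  do i = 1 to dim(C);
    Variable = vname(C{i});
    r = C{i};
    abs_r = abs(r);
    if abs_r > 0.5 then output;
  end;
run;

proc sort data=Screen;
  by descending abs_r;
run;

/* Step [2]: collect the survivors and inspect their mutual correlations. */
proc sql noprint;
  select Variable into :KeepList separated by ' '
  from Screen;
quit;

proc corr data=Demo;
  var &KeepList;
run;
```

If no variable passes the threshold, the macro variable KeepList is never created and the last step fails, so a real program would want to check for that case first.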

01-26-2015 11:24 AM

Kanyange wrote:

I would like to build a linear regression model and I need to select the most important variables (highly correlated with my target). Does anyone know a great technique (not decision trees)? I am using Base SAS for data preparation and I have around 1000 variables to start with. So I want to reduce the number of variables and select the most important ones before I enter them into Proc Reg.

A "great technique"??

Well, I offer a suggestion and I will let others decide if it is "great" or not.

Your situation is the exact situation that Partial Least Squares regression was designed for. PROC PLS does this.

However, your thought process needs to be adjusted. There really is no way to select the "most important" variables when they are all correlated with each other as well as with the response variable. This is logically impossible to do, and thus no statistical method can pick out the unambiguous "most important" variables in this situation. What PLS does is select linear combinations of your variables that are highly correlated with the response, and then it is up to you to use and/or interpret these linear combinations. Please note: this is not a "variable reduction" method, but it is the technique that fits your situation perfectly.
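
To make the "linear combinations" concrete, here is a minimal sketch that asks PROC PLS for two factors and writes the resulting X-scores to a data set. The data set names Demo and Scores, the prefix T, and the choice of two factors are assumptions made for illustration only; in practice the number of factors would be chosen by cross validation.

```sas
/* Hypothetical demo data: Y depends on X1 and X2; X3-X1000 are noise. */
data Demo (drop=i j);
  array X{*} X1-X1000;
  do j = 1 to 120;
    do i = 1 to dim(X);
      X{i} = ranuni(1);
    end;
    Y = 3*X1 - 0.4*X2 + 2 + ranuni(1) - 0.5;
    output;
  end;
run;

/* Extract two PLS factors; each one is a linear combination of X1-X1000. */
proc pls data=Demo method=pls nfac=2;
  model Y = X1-X1000;
  output out=Scores xscore=T predicted=Y_Hat;  /* creates T1 and T2 */
run;

/* The scores T1 and T2 are the new, much smaller set of predictors. */
proc print data=Scores (obs=5);
  var Y Y_Hat T1 T2;
run;
```

The scores T1 and T2 replace the 1000 original columns as predictors, which is why this counts as dimension reduction rather than variable selection: every original variable still contributes to each score.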

01-27-2015 05:16 AM

If you want to pick variables, check PROC GLMSELECT.
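
A minimal sketch of what that might look like, assuming a simulated data set named Demo (hypothetical, modelled on the examples earlier in the thread) and stepwise selection judged by the SBC criterion. The selected effects are then passed to Proc Reg through the _GLSIND macro variable that PROC GLMSELECT creates.

```sas
/* Hypothetical demo data: Y depends on X1 only; X2-X1000 are noise. */
data Demo (drop=i j);
  array X{*} X1-X1000;
  do j = 1 to 120;
    do i = 1 to dim(X);
      X{i} = ranuni(1);
    end;
    Y = 3*X1 + 2 + ranuni(1) - 0.5;
    output;
  end;
run;

/* Stepwise selection, with effects entering and leaving based on SBC. */
proc glmselect data=Demo;
  model Y = X1-X1000 / selection=stepwise(select=sbc);
run;

/* Refit the selected model in Proc Reg; &_GLSIND lists the chosen effects. */
proc reg data=Demo;
  model Y = &_GLSIND;
run;
quit;
```

With far more candidate variables than observations, any automatic selection can still pick up noise variables by chance, so the refitted model is best validated on held-out data.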