Kanyange
Fluorite | Level 6

Hi All,

I would like to build a linear regression model and I need to select the most important variables (highly correlated to my target). Does anyone know a great technique (not decision trees)? I am using Base SAS for data preparation and I have around 1000 variables to start with. So I want to reduce the number of variables and select the most important ones before I enter them into Proc Reg.

Your help would be much appreciated.

Many Thanks


user24feb
Barite | Level 11

You could use stepwise regression (I wonder what the stats experts come up with). For example,

Data R_Input (Drop=i j);
  Array X{*} X1-X1000;
  Do j=1 To 120;
    X1=Ranuni(1);
    Y=X1*3+2+Ranuni(1)-0.5; * if SAS finds X1, it works :-);
    Do i=2 To 1000;
      X{i}=Ranuni(1);
    End;
    Output;
  End;
Run;

Proc Reg Data=R_Input;
  Model Y = X1--X1000 / Selection=Stepwise SlEntry=0.1 SLStay=0.15;
Run;

Kanyange
Fluorite | Level 6

Thank you, but I didn't want to use Proc Reg at this stage, since processing 1000 variables will take a long time. Is there a quicker way?

user24feb
Barite | Level 11

As PaigeMiller noted above, there is a way to find "few" representations of a large set of explanatory variables, which I think is very common in Financial Econometrics. You've probably found it on the internet already, but a simple example would be (even though you can't see the full effect, because x1 and x2 lack correlating variables):

Data R_Input (Drop=i j);
  Array X{*} X1-X1000;
  Do j=1 To 140;
    X1=Ranuni(1);
    X2=Ranuni(1);
    If j le 120 Then Y=X1*3-X2*0.4+2+Ranuni(1)-0.5;
    Else Call Missing (y);
    Do i=3 To 1000;
      X{i}=Ranuni(1);
    End;
    Output;
  End;
Run;

Proc PLS Data=R_Input Outmodel=Estimation Method=PLS CV=Split;
  Model Y = X1-X1000;
  Output Out=Estimate Predicted=Y_Hat;
Run;

PaigeMiller
Diamond | Level 26

Actually I don't know whether Financial Econometrics uses PLS regularly or not ... but it is used in lots of fields, including Sociology, Biology, Chemistry, Physics, Spectroscopy, Manufacturing, Food Science and probably a bunch of others.

--
Paige Miller
KachiM
Rhodochrosite | Level 12

[1] Examine the strength of the correlation coefficient of each variable i with the dependent variable. Say, choose |r| > 0.5 or some other reasonable value.

[2] Suppose you have chosen X1, X2, X3, ... X10. Check the pairwise linear relationships among them. Where two of them are strongly correlated with each other, keep only one in your model (to avoid collinearity).

[3] Explore this way until you have a manageable number of independent variables for your final model.
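
Steps [1] and [2] could be sketched roughly as follows. This is only a minimal sketch, assuming the R_Input data set and the X1-X1000 / Y names from the examples earlier in this thread; the data set names Corr_Out and Screened and the macro variable Keep_Vars are made up for illustration, and 0.5 is just the cutoff mentioned in [1].

```sas
* [1] Correlation of every candidate with the target (one row, no printed output);
Proc Corr Data=R_Input Noprint Outp=Corr_Out;
  Var X1-X1000;
  With Y;
Run;

* Keep the names of the variables whose |r| exceeds the (arbitrary) cutoff;
Data Screened (Keep=Variable r);
  Set Corr_Out (Where=(_Type_='CORR'));
  Array Xs{*} X1-X1000;
  Length Variable $32;
  Do i=1 To Dim(Xs);
    Variable=VName(Xs{i});
    r=Xs{i};
    If Abs(r) > 0.5 Then Output;
  End;
Run;

* Put the surviving names into a macro variable (not created if Screened is empty);
Proc Sql Noprint;
  Select Variable Into :Keep_Vars Separated By ' '
  From Screened;
Quit;

* [2] Inspect the pairwise correlations among the survivors for collinearity;
Proc Corr Data=R_Input;
  Var &Keep_Vars;
Run;
```

Note that this only looks at one variable at a time against Y, so it says nothing about variables that matter only in combination.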

PaigeMiller
Diamond | Level 26

Kanyange wrote:

I would like to build a linear regression model and I need to select the most important variables (highly correlated to my target). Does anyone know a great technique (not decision trees)? I am using Base SAS for data preparation and I have around 1000 variables to start with. So I want to reduce the number of variables and select the most important ones before I enter them into Proc Reg.

A "great technique"??

Well, I offer a suggestion and I will let others decide if it is "great" or not.

Your situation is the exact situation that Partial Least Squares regression was designed for. PROC PLS does this.

However, your thought process needs to be adjusted. There really is no way to select the "most important" variables when they are all correlated with each other as well as with the response variable. This is logically impossible to do, and thus no statistical method can unambiguously pick out the "most important" variables in this situation. What PLS does is select linear combinations of your variables that are highly correlated with the response, and then it is up to you to use and/or interpret these linear combinations. Please note: this is not a "variable reduction" method, but it is the technique that fits your situation perfectly.

--
Paige Miller
Ksharp
Super User

If you want to pick variables, check PROC GLMSELECT.
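
A minimal sketch of what that might look like, assuming the R_Input data set and the X1-X1000 / Y names from the examples earlier in this thread. The choice of LASSO with 5-fold cross validation, the seed, and the cap of 20 selection steps are arbitrary illustrations, not a recommendation:

```sas
* LASSO selection, model chosen by 5-fold cross validation;
* Stop=20 ends the selection once 20 effects are in the model;
Proc GLMSelect Data=R_Input Seed=1;
  Model Y = X1-X1000 / Selection=Lasso(Choose=CV Stop=20)
                       CVMethod=Random(5);
Run;
```

Other Selection= methods (Forward, Backward, Stepwise, ...) are available in the same procedure.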
