Building models with SAS Enterprise Miner, SAS Factory Miner, SAS Visual Data Mining and Machine Learning or just with programming

Frequent Contributor
Posts: 95

Hi All,

I would like to build a linear regression model and I need to select the most important variables (those highly correlated with my target). Does anyone know a great technique (not decision trees)? I am using Base SAS for data preparation and I have around 1000 variables to start with, so I want to reduce the number of variables and select the most important ones before I enter them into Proc Reg.

Your help would be much appreciated.

Many Thanks

Super Contributor
Posts: 340

You could use stepwise regression (I wonder what the stats experts come up with). For example,

Data R_Input (Drop=i j);
Array X{*} X1-X1000;
Do j=1 To 120;
X1=Ranuni(1);
Y=X1*3+2+Ranuni(1)-0.5; * if SAS finds X1, it works :-);
Do i=2 To 1000;
X{i}=Ranuni(1);
End;
Output;
End;
Run;

Proc Reg Data=R_Input;
* Stepwise selection: enter at p<0.10, stay at p<0.15;
Model Y = X1-X1000 / Selection=Stepwise SlEntry=0.1 SLStay=0.15;
Run;
Quit;

Frequent Contributor
Posts: 95

Thank you, but I didn't want to use Proc Reg at this stage, as processing 1000 variables will take a long time. Is there a quicker way?

Super Contributor
Posts: 340

As PaigeMiller noted in this thread, there is a way to represent a large set of explanatory variables with just a few components, which I think is very common in financial econometrics. You've probably found it on the internet already, but here is a simple example (you can't see the full effect here, because X1 and X2 have no other variables correlated with them):

Data R_Input (Drop=i j);
Array X{*} X1-X1000;
Do j=1 To 140;
X1=Ranuni(1);
X2=Ranuni(1);
If j le 120 Then Y=X1*3-X2*0.4+2+Ranuni(1)-0.5;
Else Call Missing (y);
Do i=3 To 1000;
X{i}=Ranuni(1);
End;
Output;
End;
Run;

Proc PLS Data=R_Input Outmodel=Estimation Method=PLS CV=Split;
Model Y = X1-X1000;
Output Out=Estimate Predicted=Y_Hat;
Run;

Posts: 1,932

Actually, I don't know whether financial econometrics uses PLS regularly or not, but it is used in lots of fields, including sociology, biology, chemistry, physics, spectroscopy, manufacturing, food science and probably a bunch of others.

Super Contributor
Posts: 298

[1] Examine the correlation coefficient of each candidate variable with the dependent variable, and keep those above a cutoff, say |r| > 0.5 or some other reasonable value.

[2] Suppose you have chosen X1, X2, X3, ... X10. Check the pairwise correlations among them. Keep in your model only those that are weakly correlated with each other (to avoid collinearity).

[3] Repeat this until you have a manageable number of independent variables for your final model.
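
The screening steps above could be sketched in SAS roughly as follows. This assumes the simulated R_Input data set posted earlier in the thread (target Y, candidates X1-X1000). The 0.5 cutoff and the names Corr_Out, Screened and Keep_Vars are only illustrative, so treat it as a starting point rather than a finished program.

```sas
* Step 1: correlation of every candidate with the target Y;
Proc Corr Data=R_Input Outp=Corr_Out Noprint;
  Var X1-X1000;
  With Y;
Run;

* Keep candidates whose absolute correlation exceeds the (illustrative) cutoff;
Data Screened (Keep=Variable R);
  Set Corr_Out (Where=(_Type_='CORR'));
  Array C{*} X1-X1000;
  Length Variable $32;
  Do i=1 To Dim(C);
    If Abs(C{i}) > 0.5 Then Do;
      Variable=Vname(C{i});
      R=C{i};
      Output;
    End;
  End;
Run;

* Step 2: put the survivors in a macro variable and check them against each other;
* Note that Keep_Vars is not created if no variable passes the cutoff;
Proc Sql Noprint;
  Select Variable Into :Keep_Vars Separated By ' ' From Screened;
Quit;

Proc Corr Data=R_Input;
  Var &Keep_Vars;
Run;
```

From the second correlation matrix you would then drop one variable out of each strongly correlated pair and repeat, as in step [3].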

Posts: 1,932

Kanyange wrote:

"I would like to build a linear regression model and I need to select the most important variables (those highly correlated with my target). Does anyone know a great technique (not decision trees)? I am using Base SAS for data preparation and I have around 1000 variables to start with, so I want to reduce the number of variables and select the most important ones before I enter them into Proc Reg."

A "great technique"??

Well, I offer a suggestion and I will let others decide if it is "great" or not.

Your situation is the exact situation that Partial Least Squares regression was designed for. PROC PLS does this.

However, your thought process needs to be adjusted. There really is no way to select the "most important" variables when they are all correlated with each other as well as with the response variable. This is logically impossible, so no statistical method can pick out unambiguous "most important" variables in this situation. What PLS does is construct linear combinations of your variables that are highly correlated with the response, and it is then up to you to use and/or interpret these linear combinations. Please note: this is not a "variable reduction" method, but it is the technique that fits your situation perfectly.
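
A rough sketch of what that looks like in PROC PLS, assuming the simulated R_Input data set posted earlier in the thread. The NFAC=5 cap and the output names Pls_Scores, T and Y_Hat are arbitrary choices for illustration, not recommendations.

```sas
* Extract up to 5 PLS factors and let split-sample cross validation pick how many to keep;
Proc PLS Data=R_Input Method=PLS Nfac=5 CV=Split Details;
  Model Y = X1-X1000;
  * T1, T2, ... in Pls_Scores are the factor scores, i.e. the linear combinations of the Xs;
  Output Out=Pls_Scores XScore=T Predicted=Y_Hat;
Run;
```

The DETAILS option prints the weights and loadings for each factor, which is where you would look when trying to interpret the linear combinations.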

Super User
Posts: 10,044