I'm using proc reg to create a linear regression model and am wondering if there is a procedure that will create and apply the equation that is created by the model so that I do not have to. My problem is that I have about 1,000 variables, and the forward selection model spits out about 300 in the model. That's a lot to type! I also have about 300,000 records that I need to score with this model.
Any help would be greatly appreciated (even if it is "sorry, that doesn't exist").
Three items to discuss
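-1) Model diagnostics: if you want to score your modeling data set, see the OUTPUT statement in the PROC REG documentation. Its P= and R= options (for example, p=phat r=residuals) write the predicted values and residuals to an output data set.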
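For item 1, a minimal sketch of what that looks like, assuming the Fitness data set and variables from the PROC SCORE documentation example. The output data set name RegPred is hypothetical:

```sas
proc reg data=Fitness;
   model Oxygen = Age Weight RunTime RunPulse RestPulse;
   /* write predicted values (phat) and residuals to a new data set */
   output out=RegPred p=phat r=residuals;
run;
quit;
```

This only scores the observations that were passed to PROC REG, which is why it suits diagnostics rather than scoring a separate file.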
-2) Validation: Once your model is done, output the betas with the OUTEST= option in the PROC REG statement. Then use PROC SCORE to score your file. PROC SCORE can be tricky, so see the documentation for details and examples.
Since you have so many predictor variables (see #3 below on that), you will need to build the VAR statement in PROC SCORE dynamically.
I modified the example in the PROC SCORE documentation to do this:
proc reg data=Fitness outest=RegOut;
   OxyHat: model Oxygen=Age Weight RunTime RunPulse RestPulse;
run;
proc print data=RegOut;
   title2 'OUTEST= Data Set from PROC REG';
run;
proc score data=Fitness score=RegOut out=RScoreP type=parms;
   var Age Weight RunTime RunPulse RestPulse;
run;
proc print data=RScoreP;
   title2 'Predicted Scores for Regression';
run;
proc score data=Fitness score=RegOut out=RScoreR type=parms;
   var Oxygen Age Weight RunTime RunPulse RestPulse;
run;
proc print data=RScoreR;
   title2 'Negative Residual Scores for Regression';
run;
* to dynamically only use the variables that you want for scoring;
* modified from PROC SCORE example;
proc contents data=RegOut out=Betas noprint;
run;
data _null_;
   set Betas end=eof;
   where type=1 and upcase(name) not in('OXYGEN','INTERCEPT','_RMSE_');
   * type=1 means numeric variables;
   * we do not want the y, intercept, or RMSE to be used as the betas;
   if eof then call symput('numVars', strip(put(_n_,8.)));
run;
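One way to finish the dynamic approach is sketched here with PROC SQL instead of the CALL SYMPUT step above. This is a swapped-in technique, not necessarily what the original poster ran. It collects the predictor names from Betas into a single macro variable and hands that to the VAR statement. The macro variable name scoreVars and the output data set RScoreDyn are hypothetical:

```sas
proc sql noprint;
   /* numeric columns of RegOut, minus the response, intercept, and RMSE */
   select name into :scoreVars separated by ' '
      from Betas
      where type = 1
        and upcase(name) not in ('OXYGEN', 'INTERCEPT', '_RMSE_');
quit;

/* score with the generated variable list instead of typing it out */
proc score data=Fitness score=RegOut out=RScoreDyn type=parms;
   var &scoreVars;
run;
```

For a real model, replace 'OXYGEN' with the name of your own response variable and point DATA= at the file to be scored. If your OUTEST= data set carries other numeric columns that are not predictors, add them to the exclusion list.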
-3) Now for your biggest problem. Having 1,000 predictor variables is extreme. I'm not sure what you are modeling, but using that many variables in a model is likely to cause overfitting and instability. Each variable adds a dimension, and with 300 to 1,000 of them you will most likely run into the "curse of dimensionality".
One of the key features of Darryl's solution is the OUTEST= option of PROC REG. It names a SAS data set that holds the estimated parameters. That data set can then be used to automate applying the model, either through another procedure designed to read it (such as PROC SCORE) or within your own DATA step and/or macro.