Existence of model-applying procedure

05-15-2008 07:28 AM

I'm using proc reg to create a linear regression model and am wondering if there is a procedure that will create and apply the equation that is created by the model so that I do not have to. My problem is that I have about 1,000 variables, and the forward selection model spits out about 300 in the model. That's a lot to type! I also have about 300,000 records that I need to score with this model.

Any help would be greatly appreciated (even if it is "sorry, that doesn't exist").

Thanks!

Becky

Posted in reply to becky

05-15-2008 09:32 AM

Becky

Three items to discuss

-1) Model diagnostics: to score the data set you modeled on, use the OUTPUT statement in PROC REG with the P= and R= options (for example, output p=phat r=residuals); see the PROC REG documentation.

-2) Validation: once your model is done, write the betas to a data set with the OUTEST= option in the PROC REG statement, then use PROC SCORE to score your file. PROC SCORE can be tricky, so see the documentation for details and examples.

Since you have so many predictor variables (see #3 below), you will need to build the VAR statement in PROC SCORE dynamically.

I modified the example in the PROC SCORE documentation to do this:

data Fitness;
   input Age Weight Oxygen RunTime RestPulse RunPulse @@;
   datalines;
44 89.47 44.609 11.37 62 178 40 75.07 45.313 10.07 62 185
44 85.84 54.297 8.65 45 156 42 68.15 59.571 8.17 40 166
38 89.02 49.874 9.22 55 178 47 77.45 44.811 11.63 58 176
40 75.98 45.681 11.95 70 176 43 81.19 49.091 10.85 64 162
44 81.42 39.442 13.08 63 174 38 81.87 60.055 8.63 48 170
44 73.03 50.541 10.13 45 168 45 87.66 37.388 14.03 56 186
;
run;

* fit the model and save the parameter estimates (betas) in RegOut;
* the OUTPUT statement has no OUT= option, so SAS names that data set automatically;
proc reg data=Fitness outest=RegOut;
   OxyHat: model Oxygen=Age Weight RunTime RunPulse RestPulse;
   output p=phat;
   title 'REGRESSION SCORING EXAMPLE';
run;

proc print data=RegOut;
   title2 'OUTEST= Data Set from PROC REG';
run;

* including the dependent variable in the VAR list gives negative residuals;
proc score data=Fitness score=RegOut out=RScoreR type=parms;
   var Oxygen Age Weight RunTime RunPulse RestPulse;
run;

proc print data=RScoreR;
   title2 'Negative Residual Scores for Regression';
run;

* to dynamically only use the variables that you want for scoring;
* modified from PROC SCORE example;
%macro scoreit();
   %local i;

   /* list the variables stored in the OUTEST= data set */
   proc contents data=RegOut out=Betas noprint;
   run;

   data _null_;
      set Betas end=eof;
      /* type=1 means numeric variables */
      /* we do not want the y, intercept, or RMSE to be used as the betas */
      where type=1 and upcase(name) not in('OXYGEN','INTERCEPT','_RMSE_');
      /* one macro variable per predictor: var1, var2, ... */
      call symput('var'||strip(put(_n_,8.)),strip(name));
      if eof then call symput('numVars', strip(put(_n_,8.)));
   run;

   proc score data=Fitness score=RegOut out=RScoreP type=parms;
      var %do i=1 %to &numVars;
             &&var&i
          %end;;
   run;
%mend scoreit;

%scoreit;

* print the predictions only after the macro has created RScoreP;
proc print data=RScoreP;
   title2 'Predicted Scores for Regression';
run;

http://support.sas.com/onlinedoc/913/docMainpage.jsp

-3) Now for your biggest problem. Having 1,000 predictor variables is extreme. I'm not sure what you are modeling, but using that many variables in a model invites overfitting and instability. Each variable adds a dimension, and with 300 to 1,000 of them you will most likely run into the "curse of dimensionality".
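
Going back to item 2: as a possible alternative to the %scoreit macro loop, the VAR list can also be collected into a single macro variable with PROC SQL. This is only a sketch under the same assumptions as the example above (RegOut and Fitness live in the WORK library, Oxygen is the dependent variable); the names scoreVars and RScoreSql are made up for illustration. If your OUTEST= data set carries other numeric statistics columns, they would need to be added to the exclusion list.

```sas
/* collect the predictor names from the OUTEST= data set into one macro variable */
proc sql noprint;
   select name into :scoreVars separated by ' '
   from dictionary.columns
   where libname='WORK' and memname='REGOUT'   /* assumes RegOut is in WORK */
     and type='num'
     and upcase(name) not in ('OXYGEN','INTERCEPT','_RMSE_');
quit;

/* score with the collected list; RScoreSql is a hypothetical output name */
proc score data=Fitness score=RegOut out=RScoreSql type=parms;
   var &scoreVars;
run;
```

To score a different file, such as the 300,000 records mentioned in the question, you would point DATA= at that data set instead of Fitness.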

-Darryl

Posted in reply to darrylovia

05-15-2008 10:58 AM

Thanks, Darryl.

I was able to use Proc Score with the macro you spelled out above. I appreciate your help.

Becky

Posted in reply to becky

05-16-2008 08:41 AM

One of the key features of Darryl's solution is the OUTEST= option of PROC REG. It names a SAS data set that holds the results, that is, the estimated parameters. That data set can then be used to automate applying the model, either through another procedure designed to read it or within your own DATA step and/or macro.
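
For the DATA step route, a minimal sketch using the Fitness example from earlier in the thread might look like the following. It assumes RegOut holds a single row of estimates (one model); the output name RScoreD, the prediction variable phat, and the renamed coefficient variables (b0, bAge, and so on) are chosen here for illustration only. With hundreds of predictors you would want to generate the equation rather than type it, which is where PROC SCORE or a macro earns its keep.

```sas
data RScoreD;
   /* read the one row of estimates once; variables read by SET are retained */
   if _n_ = 1 then set RegOut(keep=Intercept Age Weight RunTime RunPulse RestPulse
                              rename=(Intercept=b0 Age=bAge Weight=bWeight
                                      RunTime=bRunTime RunPulse=bRunPulse
                                      RestPulse=bRestPulse));
   set Fitness;
   /* apply the fitted equation to each record */
   phat = b0 + bAge*Age + bWeight*Weight + bRunTime*RunTime
             + bRunPulse*RunPulse + bRestPulse*RestPulse;
   drop b0 bAge bWeight bRunTime bRunPulse bRestPulse;
run;
```

The coefficients are renamed on the way in because they share names with the predictors in the data being scored.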