koraskornel
Calcite | Level 5

Hello! I'm about as green as they get to programming, let alone for SAS, and I'm really struggling. So any help would be so very helpful.

I have 2 questions.

1) We had to split a data set into two (training and validation). In our training dataset we built our regression model. Now we need to test for sensitivity. There are two outliers. How do I do this?

 

2) Now that we have our model, we need to test it in our validation dataset. How do I do this? Below are the three ways I have tried and I get error messages that I don't understand. 

 

proc reg data=training;
model v18= v6 v8 v9 v13 v14 v15; /*Final Model*/
run;
 
/*Using model to predict outcome*/
proc reg data=validation;
output out = validation p=Ypredicted v6 v8 v9 v13 v14 v15;
run;
 
 
proc glm data=training;
model v18=v6 v8 v9 v13 v14 v15/solution;
code file="/folders/myfolders/output";
run;
data scored;
set test;
%include validation;
run;
 
proc stdize data=training reponly method=median out=training outstat=med;
var v6 v8 v9 v13 v14 v15;
run;
proc stdize data=validation out=validation reponly method=in(med);
var v6 v8 v9 v13 v14 v15;
proc stdize data=med out=test reponly method=in(med);
var v6 v8 v9 v13 v14 v15;
run;
proc score data=validation;
model ""= v6 v8 v9 v13 v14 v15;
run;
 
 
Reeza
Super User

Regarding #2, see the examples at the link below on how to score your data. Note that PROC SCORE is one method, but you never tell it which model data to use (the SCORE= option), and you never store the model output from PROC REG anywhere.

 

https://blogs.sas.com/content/iml/2014/02/19/scoring-a-regression-model-in-sas.html
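One way to store a fitted model and score new data with it is an item store plus PROC PLM. This is only a sketch, assuming the data sets are WORK.TRAINING and WORK.VALIDATION; the names v18model, val_scored and v18_pred are made up for illustration.

```sas
/* Fit on the training data and save the fitted model in an item store */
proc glm data=training;
   model v18 = v6 v8 v9 v13 v14 v15 / solution;
   store work.v18model;              /* hypothetical item-store name */
run;
quit;

/* Apply the stored model to the validation data */
proc plm restore=work.v18model;
   score data=validation out=val_scored predicted=v18_pred;
run;
```

The output data set val_scored should hold every validation row plus the predicted value, so the predictions can be compared with the observed v18.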

PaigeMiller
Diamond | Level 26

We had to split a data set into two (training and validation). In our training dataset we built our regression model. Now we need to test for sensitivity. There are two outliers. How do I do this?

 

PROC REG has several options in the MODEL statement to check for data points that might be considered extreme or high leverage. One is the R option, which produces a residual analysis including the Cook's D statistic, and the other is the INFLUENCE option. Outliers can also be identified with the R option: any observation with a large residual (either positive or negative) is a potential outlier.
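A rough sketch of how these options could be used for a sensitivity check, assuming the training data set is WORK.TRAINING. The data set train_diag, the variable names cooks_d, rstud and leverage, and the cutoff of 3 are illustrative choices, not fixed rules.

```sas
/* Request residual and influence diagnostics, and save them per observation */
proc reg data=training;
   model v18 = v6 v8 v9 v13 v14 v15 / r influence;
   output out=train_diag cookd=cooks_d rstudent=rstud h=leverage;
run;
quit;

/* Sensitivity check: refit without the flagged observations */
proc reg data=train_diag;
   where abs(rstud) <= 3;            /* hypothetical cutoff for an outlier */
   model v18 = v6 v8 v9 v13 v14 v15;
run;
quit;
```

If the parameter estimates change little between the two fits, the model is not very sensitive to the suspect points.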

 

There are also multivariate measures of being an outlier using all of v6 v8 v9 v13 v14 v15 simultaneously.

 

Now that we have our model, we need to test it in our validation dataset. How do I do this? Below are the three ways I have tried and I get error messages that I don't understand. 

 

proc reg data=training;
model v18=v6 v8 v9 v13 v14 v15;/*Final Model*/
run;
 
If you append the validation data set to the training data set, convert the Y values to missing for the validation data, and then re-run the above code on this adjusted data set, you will get predictions for the validation data. These can be compared to the actual Y for the validation data, which can then be used to summarize how well the model fits on the validation data set.
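A minimal sketch of that approach, assuming the two data sets are WORK.TRAINING and WORK.VALIDATION and contain the same variables. The names combined, role, v18_actual, scored and v18_pred are made up for illustration.

```sas
/* Stack the two data sets and blank out Y for the validation rows */
data combined;
   set training (in=in_train) validation;
   length role $10;
   if in_train then role = 'Training';
   else do;
      role = 'Validation';
      v18_actual = v18;              /* keep the observed Y for later comparison */
      v18 = .;                       /* missing Y: row is scored but not used in the fit */
   end;
run;

/* Rows with missing v18 are left out of the fit but still get a prediction */
proc reg data=combined;
   model v18 = v6 v8 v9 v13 v14 v15;
   output out=scored p=v18_pred;
run;
quit;
```

In scored, the validation rows have both v18_actual and v18_pred, which can then be compared.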
--
Paige Miller
koraskornel
Calcite | Level 5

How do I append one data set to another?

PaigeMiller
Diamond | Level 26

It would be better not to split them in the first place. Just include a variable in the data set that contains either "Training" or "Validation". For the validation samples, save the actual value of Y in another variable and then make Y missing.

 

 

 

 

--
Paige Miller
Reeza
Super User

@koraskornel the link I included also shows how to do this method as well.

koraskornel
Calcite | Level 5
Thanks, I'll try it!
koraskornel
Calcite | Level 5
Thanks for the speedy response! I totally understand, but this is for a class and they are requiring us to split it via even and odd ID numbers. The code below gives me the following error.
data validation; 
    if 0>=v18=>0 then "."; 
    run; 
proc append base=training data=validation; 
    run; 
  proc reg data=validation; 
       model v18=v6 v8 v9 v13 v14 v15; 
    run;
 
 data validation;
 74             if 0>=v18=>0 then ".";
                                  ___
                                  180
 ERROR 180-322: Statement is not valid or it is used out of proper order.
 
 NOTE: Appending WORK.VALIDATION to WORK.TRAINING.
 WARNING: Variable _MODEL_ was not found on BASE file. The variable will not be added to the BASE file.
 WARNING: Variable _TYPE_ was not found on BASE file. The variable will not be added to the BASE file.
 WARNING: Variable _DEPVAR_ was not found on BASE file. The variable will not be added to the BASE file.
 WARNING: Variable _RMSE_ was not found on BASE file. The variable will not be added to the BASE file.
 WARNING: Variable Intercept was not found on BASE file. The variable will not be added to the BASE file.
 WARNING: Variable V1 was not found on DATA file.
 WARNING: Variable V2 was not found on DATA file.
 WARNING: Variable V3 was not found on DATA file.
 WARNING: Variable V4 was not found on DATA file.
 WARNING: Variable V5 was not found on DATA file.
 WARNING: Variable V7 was not found on DATA file.
 WARNING: Variable V10 was not found on DATA file.
 WARNING: Variable V11 was not found on DATA file.
 WARNING: Variable V12 was not found on DATA file.
 WARNING: Variable V16 was not found on DATA file.
 WARNING: Variable V17 was not found on DATA file.
 WARNING: Variable obs was not found on DATA file.
 ERROR: No appending done because of anomalies listed above. Use FORCE option to append these files.
 NOTE: 0 observations added.
PaigeMiller
Diamond | Level 26

I think there are a number of problems here, such as you don't have a SET statement in your DATA step.

 

Also, the proper syntax of an IF statement is:

 

IF some condition THEN variablename=.;

 

 

--
Paige Miller
koraskornel
Calcite | Level 5

Got it. I'll fix those and report back soon. Thank you!

koraskornel
Calcite | Level 5

Hi all. Thank you all so much for the help! Through all the responses I am understanding the basics better and sussed out a solution. Here is what I used to apply the parameter estimates from one data set (training) to another (validation).

 

/*Outputting Parameter Estimates*/
proc reg data=senstraining4 outest=regout;
train: model v18=v6 v8 v9 v13 v14 v15;
title "Regression Parameters Out";
run;
proc print data=regout;
title2 "Outest data from regout";
run;
 
proc score data=senstraining4 score=regout type=parms predict out=RScoreP;
var v6 v8 v9 v13 v14 v15;
run;
proc print data=RScoreP;
run;
 
/*Including the response v18 in VAR makes the score predicted minus actual*/
proc score data=senstraining4 score=regout out=RScoreR type=parms;
var v18 v6 v8 v9 v13 v14 v15;
run;
proc print data=RScoreR;
title "Negative Residual Scores for Reg";
run;
 
/*Using output parameter estimates on new data set*/
data validation;
set newcounty;
if mod(_n_,2) eq 1; *_n_ is the number of observation;
obs=_n_;
run;

proc print data=validation;
title "Validation";
run;
proc score data=validation score=regout out=newpred type=parms nostd predict;
var v6 v8 v9 v13 v14 v15; /*'train' name carries model over*/
run;
proc print data=newpred;
title "Data Scored";
run;
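To summarize how well the model predicts on the validation data, the scored values can be compared with the observed v18. This is a sketch under two assumptions: that newpred still contains v18, and that PROC SCORE named the score variable train after the MODEL label. The names valid_resid, resid and sq_err are made up.

```sas
/* Prediction error for each validation row */
data valid_resid;
   set newpred;
   resid  = v18 - train;             /* assumes the score variable is named 'train' */
   sq_err = resid**2;
run;

/* The mean of sq_err is the validation mean squared error */
proc means data=valid_resid n mean;
   var sq_err;
run;
```

The square root of that mean can be set against the root MSE that PROC REG reported for the training fit.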
ballardw
Super User

@koraskornel wrote:
Thanks for the speedy response! I totally understand, but this is for a class and they are requiring us to split it via even and odd ID numbers. The code below gives me the following error.
data validation; 
    if 0>=v18=>0 then "."; 
    run; 

You don't show any source for the data. SAS expects either a SET statement (or a MERGE, UPDATE or MODIFY statement) naming an existing data set, or statements that read data from a file, so that the variable V18 exists.

 

Some other issues:

0>=v18=>0 can only be true when v18 equals 0. So you need to reconsider what the limits here are actually supposed to be.

 

An IF-THEN statement needs something to do after THEN: either an action such as OUTPUT or DELETE, or an assignment to a variable. To assign a value you must name the variable, for example v18 = .; (which assigns a missing value to a numeric variable).

The comparison you use for v18 implies that it is numeric, in which case you should not attempt to assign a character value. "." is a character value consisting of a period.
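Putting those points together, a corrected version of the step might look like the sketch below. It assumes the intent was to blank out v18 on every validation row; the names validation_masked and v18_actual are made up for illustration.

```sas
data validation_masked;              /* new name, so WORK.VALIDATION is left untouched */
   set validation;                   /* SET supplies the rows and the variable v18 */
   v18_actual = v18;                 /* keep a copy of the observed value */
   if not missing(v18) then v18 = .; /* numeric missing is a bare period, no quotes */
run;
```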

 

It is generally not a good idea to use the same data set as the source and result for a data step. It is not a syntax error but if you have a coding issue that does not result in an error that halts the data step you can corrupt your data and would have to go back to an earlier point in the code to recover the set.

 

Example: suppose you intended to recode a value of 3 to something else: if var = 3 then var = .;

But you accidentally type: if var >= 3 then var = .; which recodes every value of 3 or larger. You cannot recover the previous values of var that were accidentally set to missing. It is better practice to use

Data newdataset;

     set dataset;

<code>.

 

To go along with that, it is better to move all of your recoding or such into a single step than to create a bunch of data sets where you are modifying one variable or adding one variable at a time. Use a temporary data set to test. Then when it is working move it into your "main" recoding step.

 
