Ronein
Meteorite | Level 14

Hello

I want to run logistic regression to build a credit risk model.

The data set has 100,000 rows and an indicator variable that tells whether each observation belongs to the train data (in-sample) or the test data (out-of-sample).

To build the model we need to work only on the train data (in-sample).

My question: why are we using the whole data set (called panel in my code), which includes both train and test data?

How does SAS know to build the model only on the train data if the data set supplied is train+test?

proc genmod data=panel namelen=60 descending ;
ods output parameterestimates=want;
class  X W Z R;
model TARGET=X W Z R/ dist=binomial link=logit  type3 wald ;
output out=want
p=P_BAD xbeta=logit;
ODS SELECT ModelANOVA;  
run;

 

Accepted Solutions
Ronein
Meteorite | Level 14

I found the answer.

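The task is to estimate the model coefficients from the train data only and to calculate predicted values (Yhat) for both train and test data.

One way to do it is to create a binary variable (for example, called weight) that takes the value 1 for train rows and 0 for test rows, and then use it in the WEIGHT statement in PROC GENMOD, so that rows with weight 0 do not contribute to the fit.

The other way (a trick) is to create another response variable (for example, _Y_) that is missing for the test data. Rows with a missing response are left out of the fit, but they still appear in the OUTPUT data set with predicted values.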

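/* Simulated train data: 50 rows, Y=1 with probability 0.6 (independent of X) */
Data train(KEEP=X Y);
  Do i=1 to 50;
    x=ranuni(1234579);
    p = 0.6;
    u = rand("Uniform");
    if (u < p) then Y=1; else Y=0;
    output;
  end;
Run;
/* Simulated test data: 25 rows built the same way */
Data test(KEEP=X Y);
  Do i=1 to 25;
    x=ranuni(1234579);
    p = 0.6;
    u = rand("Uniform");
    if (u < p) then Y=1; else Y=0;
    output;
  end;
Run;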


/*** Way 1: estimate coefficients from train data only; calculate predictions for train+test ***/
/*** Uses the WEIGHT statement with a weight variable that identifies train/test rows ***/
data panel;
  set train(in=a) test(in=b);
  if a then weight=1; else weight=0; /* 1 for train data, 0 for test data */
Run;
proc genmod data=panel;
  ods output parameterestimates=tbl_coefficients; /* coefficients estimated from train data only */
  weight weight;
  model y=x / dist=binomial link=logit type3 wald;
  output out=preds_tbl pred=P_BAD XBETA=logit;
Run;



/*** Way 2: estimate coefficients from train data only; calculate predictions for train+test ***/
/*** Trick: set the response variable to missing for the test rows ***/
data panel_b;
  set train(in=a) test(in=b);
  if a then _Y_=Y; else _Y_=.;
Run;
proc genmod data=panel_b;
  ods output parameterestimates=tbl_coefficients; /* coefficients estimated from train data only */
  model _Y_=x / dist=binomial link=logit type3 wald;
  output out=preds_tbl pred=P_BAD XBETA=logit;
Run;
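
One way to check that the missing-response trick really fits on the train rows only is to compare its estimates with a fit restricted by a WHERE= data set option. The following is a minimal sketch that assumes the PANEL and PANEL_B data sets created above; the output names coef_where and coef_missing are made up for illustration.

```sas
/* Reference fit: train rows only, selected with a WHERE= data set option */
proc genmod data=panel(where=(weight=1));
  ods output ParameterEstimates=coef_where;   /* hypothetical data set name */
  model y = x / dist=binomial link=logit;
run;

/* Fit using the missing-response trick (Way 2) */
proc genmod data=panel_b;
  ods output ParameterEstimates=coef_missing; /* hypothetical data set name */
  model _Y_ = x / dist=binomial link=logit;
run;

/* Compare the estimates and standard errors from the two fits */
proc compare base=coef_where compare=coef_missing
             method=absolute criterion=1e-8;
  var Estimate StdErr;
run;
```

If the trick behaves as described, PROC COMPARE should report no unequal values for Estimate and StdErr. Swapping in the Way 1 fit as the COMPARE= data set would test the WEIGHT approach in the same way.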

 

 

 


10 REPLIES
PaigeMiller
Diamond | Level 26

You have to identify the training data in data set PANEL. For example (and there are many ways to do this), let's suppose the variable TRAIN has the value 1 for the training data and 0 otherwise. Then

 

proc logistic data=panel(where=(train=1)) namelen=60 descending;
--
Paige Miller
Ronein
Meteorite | Level 14

So if I run PROC GENMOD and I want to build the model only on train data, then I should add the condition (where=(train=1))?

(The TRAIN variable indicates whether an observation belongs to in-sample or out-of-sample.)

PaigeMiller
Diamond | Level 26

That's exactly what I said.

--
Paige Miller
Ronein
Meteorite | Level 14

The variable that indicates train or test data is called outsample.

These two programs produce the same coefficients.

Why?

 

proc genmod data=panel  namelen=60 descending ;
ods output parameterestimates=tbl_coef;
class  X W Z t;
model TARGET=X W Z t / dist=binomial link=logit  type3 wald ;
output out=tbl_Want1   
p=P_BAD xbeta=logit;
ODS SELECT ModelANOVA; 
run;


proc genmod data=panel(Where=(outsample=0))  namelen=60 descending ;
ods output parameterestimates=tbl_coef;
class  X W Z t;
model TARGET=X W Z t / dist=binomial link=logit  type3 wald ;
output out=tbl_Want1   
p=P_BAD xbeta=logit;
ODS SELECT ModelANOVA; 
run;
PaigeMiller
Diamond | Level 26

Using PROC LOGISTIC, see the example named "ROC analysis using separate training and validation data sets" here: https://support.sas.com/kb/39/724.html. So LOGISTIC does exactly what you want.

This method does not work in PROC GENMOD, and it's not clear to me how to do this with PROC GENMOD alone. You will probably need PROC GENMOD + PROC PLM.

--
Paige Miller
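
A rough sketch of the PROC GENMOD + PROC PLM combination mentioned above: fit on the train rows only, save the fitted model with the STORE statement, then score every row of PANEL with PROC PLM. It assumes outsample is numeric (0 = train, 1 = test), that TARGET is coded 0/1 with 1 as the event of interest, and that the class variables are those from the original question. The item store name credit_model and the output name scored_all are invented for illustration.

```sas
/* Fit on train rows only and save the fitted model in an item store */
proc genmod data=panel(where=(outsample=0)) namelen=60;
  class X W Z R;
  model TARGET(event='1') = X W Z R / dist=binomial link=logit type3 wald;
  store work.credit_model;            /* hypothetical item store name */
run;

/* Score all rows of PANEL (train + test) with the stored model */
proc plm restore=work.credit_model;
  /* ILINK returns predictions on the probability scale instead of the logit scale */
  score data=panel out=scored_all predicted=P_BAD / ilink;
run;
```

The advantage of this route is that the selection of the train rows is explicit in the WHERE= option, so the log shows how many observations were actually used in the fit.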
ballardw
Super User

Apparently the variable Outsample is 0 for ALL of the observations used.

What does the log show? Typically with a data set WHERE there will be a note about how many observations meet the condition. And how many total observations were used by the model.
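
Besides the log, a cross-tabulation can show which rows are able to enter the fit. PROC GENMOD leaves observations with a missing response out of the estimation, so if TARGET is missing on every test row, the fit with and without the WHERE= option would use the same observations. Whether that is the case in PANEL is an assumption to check, for example with this sketch:

```sas
/* Count rows by train/test flag and response value, including missing values */
proc freq data=panel;
  tables outsample*TARGET / missing norow nocol nopercent;
run;
```

If the row for outsample=1 shows counts only in the missing TARGET column, the test rows cannot influence the coefficients.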

StatDave
SAS Super FREQ

You can simultaneously fit the model to the training portion of your data and evaluate the fitted model on both the training and test portions using the PARTITION statement in PROC HPLOGISTIC. The following is a simplified version of the example titled "" in the HPLOGISTIC documentation in the SAS/STAT User's Guide (https://support.sas.com/en/software/sas-stat-support.html ). The ROLEVAR option lets you specify the variable in your data set that distinguishes the training and test portions. The output will show you fit statistics for both portions. Note also that instead of using the DESCENDING option, it is safer for you to always use the EVENT= option (either in the LOGISTIC, HPLOGISTIC, or GENMOD procedure) to be sure that you are modeling the level of the response variable that you consider the event level of interest. 

proc hplogistic data=Sashelp.JunkMail;
   model Class(event='1')=Make Address All _3d Our Over Remove Internet Order;
   partition rolevar=Test(train='0' test='1');
run;
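
Adapted to the data in this thread, the same idea might look like the sketch below. It assumes outsample is coded 0 for train and 1 for test, that TARGET is coded 0/1 with 1 as the event, and that the class variables are those from the original question. The output data set name scored_hp is invented, and the OUTPUT statement details should be checked against the HPLOGISTIC documentation for your release.

```sas
proc hplogistic data=panel;
  class X W Z R;
  model TARGET(event='1') = X W Z R;
  /* outsample distinguishes the train portion from the test portion */
  partition rolevar=outsample(train='0' test='1');
  /* COPYVARS= carries input variables into the output data set */
  output out=scored_hp copyvars=(outsample TARGET) pred=P_BAD;
run;
```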

Ronein
Meteorite | Level 14

I want to focus on PROC GENMOD, please.

Let's say that the variable outsample takes the values 1 or 0 (1 is test data, 0 is train data).

I want to calculate the model coefficients based on train data only.

I want to calculate P_BAD for the whole population (train+test data).

I was told at work that when I run the code below, the coefficients are calculated on train data only.

As you can see, nothing in the code refers to outsample=0.

Can you please tell me how the model is calculated in this code (based on train data only, or on train+test)?

I checked it and it is true: in the code below SAS calculates the model based on train data only.

My question is: how does SAS know to calculate it only on train data?

proc genmod data=panel namelen=60 descending ;
ods output parameterestimates=want;
class  X W Z R;
model TARGET=X W Z R/ dist=binomial link=logit  type3 wald ;
output out=want
p=P_BAD xbeta=logit;
ODS SELECT ModelANOVA;  
run;

 

Ksharp
Super User
"It was told me in my work that when I run the code below then the coefficients are calculated on train data only.
As you can see in the code I dont see anything related to outsample=0."
That is not true. Since data set "PANEL" contains all the data and there is no outsample=0 in your code, your code is just building a model based on ALL the data, not the TRAIN data.
You should ask your mentor this question.

I think your mentor ran this code just to get a CUTOFF value, or the BEST p-value for assigning Yhat=0 or Yhat=1.


"MY question is - How does SAS knows to calculate it only on train data??"
SAS doesn't know. Your code runs on ALL (train+test) data,
unless you are using SAS/EM and have assigned the variable 'outsample' the roles 'train' and 'test'.