Hello
I want to run a logistic regression to build a credit risk model.
The data set has 100,000 rows and an indicator variable that tells whether each observation belongs to the training data (in-sample) or the test data (out-of-sample).
To build the model, we should work only on the training data (in-sample).
My question: why does the code use the whole data set (called panel in my code), which includes both training and test data?
How does SAS know to build the model only on the training data if the data set it is given contains train + test data?
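proc genmod data=panel namelen=60 descending;
   /* coefficient table; named want_coef here (instead of want) so it does not collide with the OUTPUT data set below */
   ods output parameterestimates=want_coef;
   class X W Z R;
   model TARGET=X W Z R / dist=binomial link=logit type3 wald;
   output out=want p=P_BAD xbeta=logit;
   ods select ModelANOVA;
run;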
You have to identify the training data in data set PANEL. There are many ways to do this; for example, suppose the variable TRAIN is 1 for observations in the training data and 0 otherwise. Then:
proc logistic data=panel(where=(train=1)) namelen=60 descending;
So if I run PROC GENMOD and want to build the model only on the training data, I should add the condition (where=(Train=1))?
(The Train variable indicates whether an observation belongs to the in-sample or the out-of-sample data.)
That's exactly what I said.
The variable that indicates whether a row is training or test data is called outsample.
These two pieces of code give the same coefficients.
Why?
proc genmod data=panel namelen=60 descending;
   ods output parameterestimates=tbl_coef;
   class X W Z t;
   model TARGET=X W Z t / dist=binomial link=logit type3 wald;
   output out=tbl_Want1 p=P_BAD xbeta=logit;
   ods select ModelANOVA;
run;
proc genmod data=panel(where=(outsample=0)) namelen=60 descending;
   ods output parameterestimates=tbl_coef;
   class X W Z t;
   model TARGET=X W Z t / dist=binomial link=logit type3 wald;
   output out=tbl_Want1 p=P_BAD xbeta=logit;
   ods select ModelANOVA;
run;
For PROC LOGISTIC, see the example named "ROC analysis using separate training and validation data sets" here: https://support.sas.com/kb/39/724.html. So LOGISTIC does exactly what you want.
This method does not work in PROC GENMOD, and it's not clear to me how to do this with PROC GENMOD alone. You will probably need PROC GENMOD + PROC PLM.
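A rough sketch of the GENMOD + PLM route, in case it helps. It assumes the train/test indicator is named outsample (0 = train, 1 = test), as it is called elsewhere in this thread. The item store name work.credit_model and the output data set name preds_tbl are made up for illustration, and this has not been run against your data.

```sas
/* Step 1: fit on the training rows only and save the fitted model. */
proc genmod data=panel(where=(outsample=0)) namelen=60 descending;
   class X W Z R;
   model TARGET = X W Z R / dist=binomial link=logit type3 wald;
   store work.credit_model;   /* hypothetical item store name */
run;

/* Step 2: score every row (train + test) with the stored model. */
proc plm restore=work.credit_model;
   /* ILINK puts the prediction on the probability scale instead of the logit scale */
   score data=panel out=preds_tbl predicted=P_BAD / ilink;
run;
```

One thing to watch: if a CLASS variable has a level in the test rows that never occurs in the training rows, expect missing predictions for those rows.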
Apparently all of the observations used have Outsample=0.
What does the log show? Typically, with a WHERE= condition on the data set, there is a note about how many observations meet the condition, and another about how many observations in total were used by the model.
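Besides the log, a quick cross-tabulation along these lines would show whether TARGET is missing on the test rows. This is only a sketch, assuming the indicator is outsample and the response is TARGET, as in the code posted above. If TARGET is missing wherever outsample=1, GENMOD leaves those rows out of the fit on its own, and the WHERE= condition changes nothing.

```sas
/* Count TARGET values, including missing ones, within each outsample group. */
proc freq data=panel;
   tables outsample*TARGET / missing norow nocol nopercent;
run;
```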
You can simultaneously fit the model to the training portion of your data and evaluate the fitted model on both the training and test portions using the PARTITION statement in PROC HPLOGISTIC. The following is a simplified version of an example in the HPLOGISTIC documentation in the SAS/STAT User's Guide (https://support.sas.com/en/software/sas-stat-support.html ). The ROLEVAR= option lets you specify the variable in your data set that distinguishes the training and test portions. The output will show you fit statistics for both portions. Note also that instead of using the DESCENDING option, it is safer to always use the EVENT= option (in the LOGISTIC, HPLOGISTIC, or GENMOD procedure) to be sure that you are modeling the level of the response variable that you consider the event of interest.
proc hplogistic data=Sashelp.JunkMail;
model Class(event='1')=Make Address All _3d Our Over Remove Internet Order;
partition rolevar=Test(train='0' test='1');
run;
I want to focus on PROC GENMOD, please.
Let's say that the variable outsample takes the values 1 or 0 (1 is test data, 0 is training data).
I want to calculate the model coefficients based on the training data only.
I want to calculate P_BAD for the whole population (train + test data).
I was told at work that when I run the code below, the coefficients are calculated on the training data only.
As you can see, nothing in the code refers to outsample=0.
Can you please tell me how the model is calculated in this code? (Based on the training data only, or on train + test?)
I checked, and it is true: with the code below, SAS calculates the model based on the training data only.
My question is: how does SAS know to calculate it only on the training data?
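proc genmod data=panel namelen=60 descending;
   /* coefficient table; named want_coef here (instead of want) so it does not collide with the OUTPUT data set below */
   ods output parameterestimates=want_coef;
   class X W Z R;
   model TARGET=X W Z R / dist=binomial link=logit type3 wald;
   output out=want p=P_BAD xbeta=logit;
   ods select ModelANOVA;
run;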
I found the answer.
The task is to estimate the model coefficients from the training data only and to compute predicted values (Yhat) for both the training and the test data.
One way is to create a binary variable (for example, called weight) that is 1 for training rows and 0 for test rows, and then use it in the WEIGHT statement of PROC GENMOD, so that the test rows contribute nothing to the fit.
The other (trickier) way is to create another response variable (for example, _Y_) that is missing for the test data. Rows with a missing response are left out of the fit, but the OUTPUT data set still gets predicted values for them.
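/* Simulated training data: 50 rows, Y=1 with probability 0.6 */
data train(keep=X Y);
   call streaminit(1234579);   /* seed added so that RAND is reproducible */
   do i=1 to 50;
      x=ranuni(1234579);
      p=0.6;
      u=rand("Uniform");
      if (u < p) then Y=1; else Y=0;
      output;
   end;
run;
/* Simulated test data: 25 rows; seeds changed so X does not repeat the training values */
data test(keep=X Y);
   call streaminit(9754321);
   do i=1 to 25;
      x=ranuni(9754321);
      p=0.6;
      u=rand("Uniform");
      if (u < p) then Y=1; else Y=0;
      output;
   end;
run;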
/* Way 1: estimate the coefficients from the training data only; compute predictions for train + test. */
/* Uses a WEIGHT statement with a weight variable that identifies the train/test rows. */
data panel;
   set train(in=a) test(in=b);
   if a then weight=1; else weight=0;   /* 1 for training data, 0 for test data */
run;
proc genmod data=panel;
   ods output parameterestimates=tbl_coefficients;   /* coefficients estimated from the training data only */
   weight weight;
   model y=x / dist=binomial link=logit type3 wald;
   output out=preds_tbl pred=P_BAD xbeta=logit;
run;
/* Way 2: estimate the coefficients from the training data only; compute predictions for train + test. */
/* Trick: set the response variable to missing for the test rows. */
data panel_b;
   set train(in=a) test(in=b);
   if a then _Y_=Y; else _Y_=.;
run;
proc genmod data=panel_b;
   ods output parameterestimates=tbl_coefficients;   /* coefficients estimated from the training data only */
   model _Y_=x / dist=binomial link=logit type3 wald;
   output out=preds_tbl pred=P_BAD xbeta=logit;
run;