I have a simulated dataset for personal loans. It contains borrowers' financial history and their requested loans. I'm trying to build a logistic regression model to predict loan status: current (0) or default (1).
I have already split the dataset into 70% train and 30% test.
My code looks like this:
/*Logistic regression*/
ods graphics on;
proc logistic data=train outmodel=model.log plots=all;
class purpose term grade yearsemployment homeownership incomeVerified;
model bad_good (event='0') =purpose term grade yearsemployment homeownership incomeVerified
date
isJointApplication
loanAmount
interestRate
monthlyPayment
annualIncome
dtiRatio
lengthCreditHistory
numTotalCreditLines
numOpenCreditLines
numOpenCreditLines1Year
revolvingBalance
revolvingUtilizationRate
numDerogatoryRec
numDelinquency2Years
numChargeoff1year
numInquiries6Mon
/
selection=stepwise
details
lackfit;
score data= test out=score1;
store log_model;
run;
/*Score model*/
proc logistic inmodel=model.log;
score data=train out=score2 fitstat;
run;
proc logistic inmodel=model.log;
score data=test out=score3 fitstat;
run;
/*confusion matrix*/
proc freq data=score2;
tables f_bad_good*i_bad_good / nocol norow;
run;
proc freq data=score3;
tables f_bad_good*i_bad_good / nocol norow;
run;
My next step is to use this trained model to make predictions on new prod data, update that data set with the predictions, and store it. How would I do that?
I would also appreciate it if anyone could look at my code and point out anything I should improve. I'm new to SAS and statistics, so any help is much appreciated!
What do you mean by "new prod data"? New product? Or new data for the same product? In any event, if you have new data and you previously used this code:
proc logistic inmodel=model.log;
score data=test out=score3 fitstat;
run;
then you can score the data in a new data set this way:
proc logistic inmodel=model.log;
score data=new_data out=score4 fitstat;
run;
PROC PLM also works in this case.
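For the PROC PLM route, here is a minimal sketch. It assumes the item store log_model written by the STORE statement in the original PROC LOGISTIC step still exists (a one-level item store name lives in WORK, so that means the same SAS session) and that the new data set is WORK.PROD. The names prod_scored, prod_final, p_current and predicted_status, and the 0.5 cutoff, are illustrative assumptions and not part of the original code.

```sas
/* Score the new data from the item store created by: store log_model; */
proc plm restore=log_model;
   /* ILINK requests probabilities rather than the linear predictor (logit) */
   score data=work.prod out=work.prod_scored predicted=p_current / ilink;
run;

/* Turn the probability into a 0/1 status.
   The model was fit with event='0', so p_current is P(bad_good = 0),
   i.e. the probability that the loan is current. */
data work.prod_final;
   set work.prod_scored;
   if missing(p_current) then predicted_status = .;  /* row could not be scored */
   else predicted_status = (p_current < 0.5);        /* 1 = default, 0 = current; cutoff is an assumption */
run;
```

Rows with a missing value in any predictor that the final model uses will get a missing prediction, so it is worth counting missing values of p_current before trusting the result.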
Hi, thanks for answering my question!
Prod is a new data file that contains the same information as the data file that the model was trained on. The only difference between the two files is that the column 'loan status' in prod is empty.
I want to use the logistic regression model to predict the loan status (default or current) for the new prod data. Essentially, I expect the previously empty 'loan status' column to be filled with 0 or 1 in the final output prod data file.
However when I tried to use the score statement as described in your response, the model failed to predict anything.
I'm unsure if it is due to some problem with my data.
These are the columns of the data on which the model was trained:
PROC SQL;
CREATE TABLE WORK.query AS
SELECT loanId , memberId , 'date'n ,
purpose , isJointApplication , loanAmount ,
term , interestRate , monthlyPayment ,
grade , loanStatus , residentialState ,
yearsEmployment , homeOwnership , annualIncome ,
incomeVerified , dtiRatio , lengthCreditHistory ,
numTotalCreditLines , numOpenCreditLines ,
numOpenCreditLines1Year , revolvingBalance , revolvingUtilizationRate ,
numDerogatoryRec , numDelinquency2Years , numChargeoff1year ,
numInquiries6Mon , bad_good FROM WORK.MERGED_LABEL;
RUN;
QUIT;
and these are the columns in prod data
PROC SQL;
CREATE TABLE WORK.query AS
SELECT loanId , memberId , 'date'n ,
purpose , isJointApplication , loanAmount ,
term , interestRate , monthlyPayment ,
grade , loanStatus , residentialState ,
yearsEmployment , homeOwnership , annualIncome ,
incomeVerified , dtiRatio , lengthCreditHistory ,
numTotalCreditLines , numOpenCreditLines ,
numOpenCreditLines1Year , revolvingBalance , revolvingUtilizationRate ,
numDerogatoryRec , numDelinquency2Years , numChargeoff1year ,
numInquiries6Mon, loanStatus FROM WORK.PROD;
RUN;
QUIT;
Thank you for your help!
I'm not going to scan through your SQL code to figure out the difference. Just tell me, yes or no, are the columns the same in the two SQL calls, and if not, what is the difference? Why are there two SQL calls creating a data set with the same name (WORK.query)? What are you trying to say by showing us the two SQL calls?
"However when I tried to use the score statement as described in your response, the model failed to predict anything."
Never (that's NEVER, not even once more in the future) should you state something failed and then not explain and not provide evidence. There are two possibilities here:
1. If there is an error or problem in the log, show us the ENTIRE log for PROC LOGISTIC. Please copy the log as text and paste it into the window that appears when you click on the </> icon.
2. If the output is wrong, show us the incorrect output and explain what is wrong.
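On the question of whether the two tables have the same columns, one way to answer it without reading the SQL by eye is to compare the column metadata directly. This is only a sketch: it assumes the two tables are WORK.MERGED_LABEL and WORK.PROD, as in the SQL posted above, and the output column labels are made up for illustration.

```sas
/* List columns that exist in only one table, or whose type differs */
proc sql;
   select coalesce(t.name, p.name) as column_name,
          t.type as train_type,
          p.type as prod_type
   from (select upcase(name) as name, type
           from dictionary.columns
          where libname = 'WORK' and memname = 'MERGED_LABEL') as t
        full join
        (select upcase(name) as name, type
           from dictionary.columns
          where libname = 'WORK' and memname = 'PROD') as p
        on t.name = p.name
   where t.name is missing
      or p.name is missing
      or t.type ne p.type;
quit;
```

An empty result means both tables have the same column names and types. Any row returned is a column worth looking at before scoring, for example a predictor that is numeric in one table and character in the other.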