JulietteZ
Calcite | Level 5

I have a simulated dataset for personal loans that contains borrowers' financial history and their requested loans. I'm trying to fit a logistic regression model to assess loan status: current (0) or default (1).

 

I have already split the dataset into 70% train and 30% test.

My code looks like:

/*Logistic regression*/
ods graphics on;
proc logistic data=train outmodel=model.log plots=all;
	class purpose term grade yearsemployment homeownership incomeVerified;
	model bad_good (event='0') =purpose term grade yearsemployment homeownership incomeVerified
					date
					isJointApplication
					loanAmount
					interestRate
					monthlyPayment
					annualIncome
					dtiRatio
					lengthCreditHistory
					numTotalCreditLines
					numOpenCreditLines
					numOpenCreditLines1Year
					revolvingBalance
					revolvingUtilizationRate
					numDerogatoryRec
					numDelinquency2Years
					numChargeoff1year
					numInquiries6Mon					
/
	selection=stepwise
	details
	lackfit;
	score data= test out=score1;
	store log_model;
run;

/*Score model*/
proc logistic inmodel=model.log;
	score data=train out=score2 fitstat;
run;

proc logistic inmodel=model.log;
	score data=test out=score3 fitstat;
run;

/*confusion matrix*/
proc freq data=score2;
	tables f_bad_good*i_bad_good / nocol norow; 
run;

proc freq data=score3;
	tables f_bad_good*i_bad_good / nocol norow; 
run;

My next step is to use this trained model to make predictions to a new prod data, update that data and store it. How would I do that? 

Also I wonder if anyone could take a look at my code and see if there's anything I should improve on.  I'm new to SAS and statistics, any help is much appreciated!

Ksharp
Super User
"My next step is to use this trained model to make predictions to a new prod data, update that data and store it. "
You could use the SCORE statement to score your new dataset, as in your code, or try PROC PLM.
@Rick_SAS has written blog posts about this:
https://blogs.sas.com/content/iml/2019/02/11/proc-plm-regression-models-sas.html
https://blogs.sas.com/content/iml/2020/12/02/score-external-logistic-model.html
https://blogs.sas.com/content/iml/2014/02/19/scoring-a-regression-model-in-sas.html

" I wonder if anyone could take a look at my code and see if there's anything I should improve on."
If your data set is not large, I would use penalized maximum likelihood estimation via the FIRTH option in the MODEL statement. Like:
model y=x1 ................./ firth ..........
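
A minimal sketch of what that could look like with the variables from the question. The shortened predictor list is only an illustration, not a recommended model, and it keeps the event='0' coding from the original code:

```sas
/* Sketch only: a short predictor list stands in for the full model */
proc logistic data=train;
    class purpose term grade;
    /* FIRTH requests Firth's penalized maximum likelihood estimation */
    model bad_good(event='0') = purpose term grade
                                loanAmount interestRate dtiRatio
          / firth;
run;
```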
PaigeMiller
Diamond | Level 26

What do you mean by "new prod data"? New product? Or new data for the same product? In any event, if you have new data and you previously used this code:

 

proc logistic inmodel=model.log;
	score data=test out=score3 fitstat;
run;

 

then you can score the data in a new data set this way:

 

proc logistic inmodel=model.log;
	score data=new_data out=score4 fitstat;
run;
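
If the goal is a 0/1 column in the stored output, the scored data set can be post-processed in a DATA step. This is a sketch under assumptions: the data set name prod, the output names prod_scored and prod_final, the new variable pred_bad_good, and the 0.5 cutoff are all made up for illustration, and the model libref is assumed to be a permanent library as in outmodel=model.log.

```sas
/* Score the new data. FITSTAT is left out here because fit statistics
   need observed bad_good values, which new data would not have. */
proc logistic inmodel=model.log;
    score data=prod out=prod_scored;
run;

/* P_1 is the predicted probability of bad_good=1 written by SCORE.
   Rows with a missing prediction (for example, because a predictor
   is missing) are left missing rather than forced to 0. */
data model.prod_final;
    set prod_scored;
    if not missing(P_1) then pred_bad_good = (P_1 >= 0.5);
run;
```

The SCORE output also contains I_bad_good, the predicted response level, which can be used instead of applying a cutoff by hand.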

 

PROC PLM also works in this case.
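
A sketch of the PROC PLM route, assuming the item store log_model written by the STORE statement in the original code is still available (it is in WORK there, so it disappears at the end of the session unless stored with a two-level name). The names prod, prod_plm, and p_event are hypothetical:

```sas
/* Restore the item store created by STORE log_model; in the fitting step */
proc plm restore=log_model;
    /* ILINK returns predictions on the probability scale, not the logit scale.
       With event='0' in the fitted model, p_event is the probability of bad_good=0. */
    score data=prod out=prod_plm predicted=p_event / ilink;
run;
```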

--
Paige Miller
JulietteZ
Calcite | Level 5

Hi, thanks for answering my question!
Prod is a new data file that contains the same information as the data file that the model was trained on. The only difference between the two files is that the column 'loan status' in prod is empty.

I want to use the logistic regression model to predict the loan status (default or current) in the new prod data. Essentially, I expect the previously empty 'loan status' column to be filled with 0 or 1 in the final output prod data file.
However when I tried to use the score statement as described in your response, the model failed to predict anything.
I'm unsure if it is due to some problem with my data.

These are the columns of the data on which the model was trained:

PROC SQL;
CREATE TABLE WORK.query AS
SELECT loanId , memberId , 'date'n , 
		purpose , isJointApplication , loanAmount ,
 		term , interestRate , monthlyPayment , 
 		grade , loanStatus , residentialState , 
 		yearsEmployment , homeOwnership , annualIncome , 
 		incomeVerified , dtiRatio , lengthCreditHistory , 
 		numTotalCreditLines , numOpenCreditLines , 
 		numOpenCreditLines1Year , revolvingBalance , revolvingUtilizationRate , 
 		numDerogatoryRec , numDelinquency2Years , numChargeoff1year , 
 		numInquiries6Mon , bad_good FROM WORK.MERGED_LABEL;
QUIT;

and these are the columns in prod data

PROC SQL;
CREATE TABLE WORK.query AS
SELECT  loanId , memberId , 'date'n , 
		purpose , isJointApplication , loanAmount ,
 		term , interestRate , monthlyPayment , 
 		grade , loanStatus , residentialState , 
 		yearsEmployment , homeOwnership , annualIncome , 
 		incomeVerified , dtiRatio , lengthCreditHistory , 
 		numTotalCreditLines , numOpenCreditLines , 
 		numOpenCreditLines1Year , revolvingBalance , revolvingUtilizationRate , 
 		numDerogatoryRec , numDelinquency2Years , numChargeoff1year , 
 		numInquiries6Mon, loanStatus FROM WORK.PROD;
QUIT;

Thank you for your help!

PaigeMiller
Diamond | Level 26

I'm not going to scan through your SQL code to figure out the difference. Just tell me, yes or no, are the columns the same in the two SQL calls, and if not, what is the difference? Why do both SQL calls create a data set with the same name (WORK.query)? What are you trying to say by showing us the two SQL calls?
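
One way to answer the column question without reading the SQL by eye is to compare the metadata of the two source tables. A minimal sketch, using WORK.MERGED_LABEL and WORK.PROD from the queries above; the output names train_cols and prod_cols are hypothetical:

```sas
/* Write each table's variable list to a data set (type: 1=numeric, 2=character) */
proc contents data=work.merged_label noprint
              out=train_cols(keep=name type);
run;

proc contents data=work.prod noprint
              out=prod_cols(keep=name type);
run;

/* List variables that exist in only one table or differ in type */
proc sql;
    select coalesce(t.name, p.name) as name,
           t.type as train_type,
           p.type as prod_type
    from train_cols as t
         full join prod_cols as p
         on upcase(t.name) = upcase(p.name)
    where t.name is null or p.name is null or t.type ne p.type;
quit;
```

An empty result would mean both tables have the same variable names and types.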

 

However when I tried to use the score statement as described in your response, the model failed to predict anything.

 

Never (that's NEVER, not even once more in the future) should you state something failed and then not explain and not provide evidence. There are two possibilities here:

 

1. If there is an error or problem in the log, show us the ENTIRE log for PROC LOGISTIC. Please copy the log as text and paste it into the window that appears when you click on the </> icon.


 

2. If the output is wrong, show us the incorrect output and explain what is wrong.

--
Paige Miller
