10-29-2015 06:30 PM
I have a data set of around 80K customers. I have classified these customers as GOOD, BAD or INDETERMINATE based on their payment history for the last 12 months.
Each customer is assigned a classification of Good, Bad or Indeterminate. In the same file I have application scores for these customers (each customer has a score from 0 to 100). I want to test the reliability of these scores against my classification, i.e. to confirm that bads are concentrated at lower scores and goods at higher scores. Could somebody help me with code to produce a lift curve and/or K-S curve, Gini, ROC etc., or an analysis of cumulative goods vs. bads?
Application Score   Score Range   Classification
10                  0-10          Bad
30                  21-30         Bad
68                  61-70         Good
12                  11-20         Good
Also, is there a way to determine a cut-off on the application score that I could use to accept or reject customers (maybe a reverse cumulative distribution of the bads)?
I have searched a lot for SAS code to do this, without success. Please HELP!
10-29-2015 10:41 PM
You should try decision tree procedure HPSPLIT. Something like:
proc hpsplit data=test;
   target class;
   input score / level=int;
   output nodestats=want;
run;

option linesize=120;

proc print data=want label noobs;
   where depth=1;
   var leaf n predictedvalue insplitvar decision p_: ;
run;
You will get optimal cutting scores between your classes as well as classification rates.
10-30-2015 12:38 PM
Thanks for the response. I tried the code, but the SAS log returns this error message:
"ERROR: Procedure HPSPLIT not found."
The same thing happened with PROC RELIABILITY. Would you know why this is happening?
I have SAS 9.2 and no Enterprise Miner.
10-30-2015 12:49 PM
HPSPLIT is rather recent. The first mention of HPSPLIT in the documentation is for version 12.3 of SAS/STAT. If you have access to JMP you could do roughly the same thing with the partition platform.
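For a SAS 9.2 site without HPSPLIT, the cumulative goods-versus-bads comparison asked about in the original question can be sketched with Base SAS steps only. This is an untested sketch under assumptions: the data set is called test with a numeric score and a character class holding 'Good', 'Bad' or 'Indeterminate' (names borrowed from the HPSPLIT reply), and the bands follow the 0-10, 11-20, ... pattern of the sample rows. All other names (banded, bandcounts, ks_table) are made up for illustration.

```sas
/* Sketch only: TEST, SCORE and CLASS are assumed names; adjust to your data. */
data banded;
   set test;
   where class in ('Good', 'Bad');          /* leave Indeterminate out       */
   band = max(ceil(score / 10), 1);         /* 1 = 0-10, 2 = 11-20, ...      */
   bad  = (class = 'Bad');                  /* 1/0 indicator for bads        */
   good = (class = 'Good');                 /* 1/0 indicator for goods       */
run;

/* Count goods and bads per band; NWAY output comes back sorted by band */
proc summary data=banded nway;
   class band;
   var bad good;
   output out=bandcounts(drop=_type_ _freq_) sum=n_bad n_good;
run;

/* Overall totals, stored in macro variables */
proc sql noprint;
   select sum(n_bad), sum(n_good)
      into :tot_bad, :tot_good
      from bandcounts;
quit;

/* Cumulative shares from the lowest band upward, and the gap between them */
data ks_table;
   set bandcounts;
   cum_bad  + n_bad;                        /* running totals (retained)     */
   cum_good + n_good;
   pct_cum_bad  = cum_bad  / &tot_bad;
   pct_cum_good = cum_good / &tot_good;
   ks_gap       = pct_cum_bad - pct_cum_good;
run;

proc print data=ks_table noobs;
   format pct_cum_bad pct_cum_good ks_gap percent8.1;
run;
```

If the scores work as intended, pct_cum_bad should climb faster than pct_cum_good in the low bands. The largest ks_gap is a banded approximation of the K-S statistic, and the band where it occurs is one candidate for an accept/reject cut-off.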
10-30-2015 03:37 PM
Untested, but try these ideas:
Recode the response variable as Bad = -1, Indeterminate = 0, and Good = 1. You can fit the response by using the "score" as the explanatory variable for ordinal logistic regression.
The ROC statement in PROC LOGISTIC enables you to construct ROC curves for the response in terms of the scores. Note that ROC curves apply to a binary response, so for that step you would compare Good with Bad only.
Use the LINK=CLOGLOG option to fit the ordinal response.
The "Association of Predicted Probabilities and Observed Responses" table gives various concordance statistics such as Somers' D (equivalent to the Gini coefficient) and c (the area under the ROC curve).
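A minimal, untested sketch of the PROC LOGISTIC idea, restricted to Good versus Bad. It assumes a data set named test with a numeric score and a character class whose values are spelled exactly 'Good' and 'Bad'; the data set name rocpoints is made up for illustration.

```sas
/* Sketch only: TEST, SCORE and CLASS are assumed names; adjust to your data. */
ods graphics on;

proc logistic data=test plots(only)=roc;
   where class in ('Good', 'Bad');             /* binary response for ROC    */
   model class(event='Bad') = score / outroc=rocpoints;
run;

ods graphics off;

/* K-S statistic: largest gap between the cumulative bad rate (_SENSIT_)
   and the cumulative good rate (_1MSPEC_) over all cut-offs in OUTROC= */
proc sql;
   select max(_sensit_ - _1mspec_) as KS format=6.3
      from rocpoints;
quit;
```

In the printed output, c is the area under the ROC curve and Somers' D is the Gini coefficient (Gini = 2c - 1). The OUTROC= data set holds the curve's points, so it can also be plotted directly if ODS Graphics is not available on your installation.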
10-30-2015 04:03 PM
In last resort, you could try discriminant analysis, the non-parametric version:
/* Scores over the range of possible values */
data testvalues;
   do score = 0 to 100 by 0.1;
      output;
   end;
run;

/* Non-parametric discriminant analysis */
proc discrim data=test method=npar kernel=normal r=5
             testdata=testvalues testout=testScore;
   class class;
   var score;
run;

/* Get the predicted score range for each class */
proc sql;
   select _into_ as class, min(score) as fromScore, max(score) as toScore
      from testScore
      group by _into_
      order by fromScore;
quit;
10-31-2015 06:26 AM
@PGStats I thought about recommending PROC CANCORR, but discriminant analysis is more appropriate for nominal than ordinal categories. What is your reason for recommending the nonparametric discriminant analysis over the linear?
10-31-2015 10:13 PM
Hi @Rick_SAS, I suggested nonparametric discriminant analysis because I didn't want to make strong assumptions about the score distribution in each class. But more importantly, I thought that using a small kernel would yield sharper delineation of the classes, i.e. class border positions would be determined locally. I chose a normal kernel because of its infinite support.
11-03-2015 12:34 PM
Thanks to both of you for the quick turnaround. I really appreciate it.
I had another question on Cumulative Accuracy Profile which I will post shortly. Hope you guys can help.