Top_Katz
Quartz | Level 8

Hi @Ksharp !

 

"Pearson chi-square/DF =1 is testing if the data  is over-disperse ,has nothing to do with GOF."

Ha-ha, well, if you want to argue with Paul Allison, go ahead.  Anyway, I've invited you to apply any of your favorite GOF measures to the data.  You've already invested a lot of time in your argument, and haven't shown even a single bit of quantitative proof.  I've given you fact after fact.  So here's your chance!

 

 

"What I try to do is to make score more distinguish (-10 -5 V.S. -10 -8) ,and make better GOF ,although IV is not bigger that yours."

You haven't shown how or why making the score more distinguished is better, nor have you shown a better GOF.  And in my example, the event rates are already distinguished with 99% confidence, anyway.

 

 

"What I do with linearity of equally spaced bin is trying to not break assumption violation of GLM and get better GOF ."

I don't mean to sound harsh, but as I've already repeated several times, equal spaced linearity is COMPLETELY IRRELEVANT to GLM assumptions.  And you still haven't proved that it gives better GOF.

 

 

I'm waiting.  For proof.

Ksharp
Super User

Hi Top,

Sorry, I really have no time to test and compare them. I have already posted two kinds of GOF measures via URL (Rick's blog).

The attachment is the output of some practice I did for class with the German ScoreCard data. You can compare its GOF with yours and see what is different.

 

 

"You haven't shown how or why making the score more distinguished is better"

Equal-width bins could make the scores more distinguishable, I think.

Top_Katz
Quartz | Level 8

Hi @Ksharp !

 

I did an experiment with the German Credit data to illustrate my point about linearity of target and predictor versus visual linearity of bin WoE values with bin sequence numbers.  The attachment imports the full thousand record German Credit data set, sorted by increasing order of Credit_Amount (the original interval-valued predictor), with five added variables:

1.  idnum  is just a sequence number for the original order of the observations

2.  group_ks  gives bin sequence numbers for the binning in your paper (group_ks = 1 for the first 699 records, up to Credit_Amount = 3578, group_ks = 2 for the next 139 records, up to Credit_Amount = 5743, group_ks = 3 for the next 60 records, up to Credit_Amount = 7127, group_ks = 4 for the last 102 records, up to Credit_Amount = 18424)

3.  woe_ks  gives the corresponding WoE value for each of the bins from your paper (woe_ks = 0.167231311070365 when group_ks = 1, woe_ks = -0.0441497846129298 when group_ks = 2, woe_ks = -0.371874163672129 when group_ks = 3, woe_ks = -0.72951482473082 when group_ks = 4)

4.  group_sd  gives bin sequence numbers for the maximum information value binning with all bin event rate differences significant with 95% confidence and 5% minimum distribution requirement with group size at least 60 (group_sd = 1 for the first 739 records, up to Credit_Amount = 3905, group_sd = 2 for the next 183 records, up to Credit_Amount = 7758, group_sd = 3 for the last 78 records, up to Credit_Amount = 18424)

5.  woe_sd  gives the corresponding WoE value for each of the group_sd bins (woe_sd = 0.2208734028 when group_sd = 1, woe_sd = -0.345205917 when group_sd = 2, woe_sd = -1.00144854 when group_sd = 3)
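
For readers without the attachment, the five added variables can be rebuilt roughly as follows. This is only a sketch, not the uploaded import program: the input data set name work.german_credit is an assumption, and cutting on Credit_Amount reproduces the record counts listed above only if no Credit_Amount value is tied across a bin boundary.

```sas
/* Sketch only: work.german_credit is a hypothetical name for the raw data. */
data work.german_credit_groups;
    set work.german_credit;
    idnum = _n_;                        /* original record order */

    /* Bins from the paper: upper cut points 3578, 5743, 7127 */
    if      Credit_Amount <= 3578 then do; group_ks = 1; woe_ks =  0.167231311070365;  end;
    else if Credit_Amount <= 5743 then do; group_ks = 2; woe_ks = -0.0441497846129298; end;
    else if Credit_Amount <= 7127 then do; group_ks = 3; woe_ks = -0.371874163672129;  end;
    else                               do; group_ks = 4; woe_ks = -0.72951482473082;   end;

    /* Maximum information value bins: upper cut points 3905, 7758 */
    if      Credit_Amount <= 3905 then do; group_sd = 1; woe_sd =  0.2208734028; end;
    else if Credit_Amount <= 7758 then do; group_sd = 2; woe_sd = -0.345205917;  end;
    else                               do; group_sd = 3; woe_sd = -1.00144854;   end;
run;

/* Sort by the original interval-valued predictor, as described above */
proc sort data=work.german_credit_groups;
    by Credit_Amount;
run;

/* Check the bin sizes against the counts quoted in the list */
proc freq data=work.german_credit_groups;
    tables group_ks group_sd;
run;
```

If the cut points are applied this way, the PROC FREQ counts should match the 699 / 139 / 60 / 102 and 739 / 183 / 78 splits given in the list. Note that a missing Credit_Amount would fall into the first bin under these comparisons; the sketch assumes there are none.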

 

Then I ran logistic regressions of Creditability, the binary target, against, respectively, group_ks as a numeric variable, woe_ks as a numeric variable, group_ks as a class variable, group_sd as a numeric variable, woe_sd as a numeric variable, group_sd as a class variable.  In each case, I had PROC LOGISTIC run the Hosmer-Lemeshow test; it's only on development data because there is no extra test data.

 

%let	runid	=	01	;
%let	indata&runid.			=	work.german_credit_groups	;
%let	binary_target_&runid.	=	Creditability	;
%let	bin_predictor_&runid.	=	group_ks	;	


title2	"PROC	LOGISTIC	&runid.	&&binary_target_&runid.. by &&bin_predictor_&runid..	DATA	=	&&indata&runid.."	;
PROC	LOGISTIC	DATA	=	&&indata&runid..	DESCENDING	OUTEST	=	work.gc_logit_&&bin_predictor_&runid.._coefs_&runid.	
					OUTMODEL	=	work.gc_logit_&&bin_predictor_&runid.._outmodel_&runid.	;
	model	&&binary_target_&runid..	=	&&bin_predictor_&runid..	/	lackfit	rsquare	;
	store	work.gc_logit_&&bin_predictor_&runid.._store_&runid.	;
	output	out	=	work.gc_logit_&&bin_predictor_&runid.._output_&runid.		predicted	=	predicted&runid.	xbeta	=	xbeta&runid.
																				reschi	=	reschi&runid.	resdev	=	resdev&runid.	reslik	=	reslik&runid.	;
run	;
title2	;


%let	runid	=	02	;
%let	indata&runid.			=	work.german_credit_groups	;
%let	binary_target_&runid.	=	Creditability	;
%let	bin_predictor_&runid.	=	woe_ks	;	


title2	"PROC	LOGISTIC	&runid.	&&binary_target_&runid.. by &&bin_predictor_&runid..	DATA	=	&&indata&runid.."	;
PROC	LOGISTIC	DATA	=	&&indata&runid..	DESCENDING	OUTEST	=	work.gc_logit_&&bin_predictor_&runid.._coefs_&runid.	
					OUTMODEL	=	work.gc_logit_&&bin_predictor_&runid.._outmodel_&runid.	;
	model	&&binary_target_&runid..	=	&&bin_predictor_&runid..	/	lackfit	rsquare	;
	store	work.gc_logit_&&bin_predictor_&runid.._store_&runid.	;
	output	out	=	work.gc_logit_&&bin_predictor_&runid.._output_&runid.		predicted	=	predicted&runid.	xbeta	=	xbeta&runid.
																				reschi	=	reschi&runid.	resdev	=	resdev&runid.	reslik	=	reslik&runid.	;
run	;
title2	;


%let	runid	=	03	;
%let	indata&runid.			=	work.german_credit_groups	;
%let	binary_target_&runid.	=	Creditability	;
%let	bin_predictor_&runid.	=	group_ks	;	


title2	"PROC	LOGISTIC	&runid.	&&binary_target_&runid.. by &&bin_predictor_&runid.. (class)	DATA	=	&&indata&runid.."	;
PROC	LOGISTIC	DATA	=	&&indata&runid..	DESCENDING	OUTEST	=	work.gc_logit_&&bin_predictor_&runid.._coefs_&runid.	
					OUTMODEL	=	work.gc_logit_&&bin_predictor_&runid.._outmodel_&runid.	;
	class	&&bin_predictor_&runid..	;
	model	&&binary_target_&runid..	=	&&bin_predictor_&runid..	/	lackfit	rsquare	;
	store	work.gc_logit_&&bin_predictor_&runid.._store_&runid.	;
	output	out	=	work.gc_logit_&&bin_predictor_&runid.._output_&runid.		predicted	=	predicted&runid.	xbeta	=	xbeta&runid.
																				reschi	=	reschi&runid.	resdev	=	resdev&runid.	reslik	=	reslik&runid.	;
run	;
title2	;


%let	runid	=	04	;
%let	indata&runid.			=	work.german_credit_groups	;
%let	binary_target_&runid.	=	Creditability	;
%let	bin_predictor_&runid.	=	group_sd	;	


title2	"PROC	LOGISTIC	&runid.	&&binary_target_&runid.. by &&bin_predictor_&runid..	DATA	=	&&indata&runid.."	;
PROC	LOGISTIC	DATA	=	&&indata&runid..	DESCENDING	OUTEST	=	work.gc_logit_&&bin_predictor_&runid.._coefs_&runid.	
					OUTMODEL	=	work.gc_logit_&&bin_predictor_&runid.._outmodel_&runid.	;
	model	&&binary_target_&runid..	=	&&bin_predictor_&runid..	/	lackfit	rsquare	;
	store	work.gc_logit_&&bin_predictor_&runid.._store_&runid.	;
	output	out	=	work.gc_logit_&&bin_predictor_&runid.._output_&runid.		predicted	=	predicted&runid.	xbeta	=	xbeta&runid.
																				reschi	=	reschi&runid.	resdev	=	resdev&runid.	reslik	=	reslik&runid.	;
run	;
title2	;


%let	runid	=	05	;
%let	indata&runid.			=	work.german_credit_groups	;
%let	binary_target_&runid.	=	Creditability	;
%let	bin_predictor_&runid.	=	woe_sd	;	


title2	"PROC	LOGISTIC	&runid.	&&binary_target_&runid.. by &&bin_predictor_&runid..	DATA	=	&&indata&runid.."	;
PROC	LOGISTIC	DATA	=	&&indata&runid..	DESCENDING	OUTEST	=	work.gc_logit_&&bin_predictor_&runid.._coefs_&runid.	
					OUTMODEL	=	work.gc_logit_&&bin_predictor_&runid.._outmodel_&runid.	;
	model	&&binary_target_&runid..	=	&&bin_predictor_&runid..	/	lackfit	rsquare	;
	store	work.gc_logit_&&bin_predictor_&runid.._store_&runid.	;
	output	out	=	work.gc_logit_&&bin_predictor_&runid.._output_&runid.		predicted	=	predicted&runid.	xbeta	=	xbeta&runid.
																				reschi	=	reschi&runid.	resdev	=	resdev&runid.	reslik	=	reslik&runid.	;
run	;
title2	;


%let	runid	=	06	;
%let	indata&runid.			=	work.german_credit_groups	;
%let	binary_target_&runid.	=	Creditability	;
%let	bin_predictor_&runid.	=	group_sd	;	


title2	"PROC	LOGISTIC	&runid.	&&binary_target_&runid.. by &&bin_predictor_&runid.. (class)	DATA	=	&&indata&runid.."	;
PROC	LOGISTIC	DATA	=	&&indata&runid..	DESCENDING	OUTEST	=	work.gc_logit_&&bin_predictor_&runid.._coefs_&runid.	
					OUTMODEL	=	work.gc_logit_&&bin_predictor_&runid.._outmodel_&runid.	;
	class	&&bin_predictor_&runid..	;
	model	&&binary_target_&runid..	=	&&bin_predictor_&runid..	/	lackfit	rsquare	;
	store	work.gc_logit_&&bin_predictor_&runid.._store_&runid.	;
	output	out	=	work.gc_logit_&&bin_predictor_&runid.._output_&runid.		predicted	=	predicted&runid.	xbeta	=	xbeta&runid.
																				reschi	=	reschi&runid.	resdev	=	resdev&runid.	reslik	=	reslik&runid.	;
run	;
title2	;
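
The six runs differ only in the predictor and in whether it is declared on a CLASS statement, so they could also be driven by one small macro. This is a sketch, not the code I actually ran: the macro name %run_logit and its parameters are made up for illustration, and it leaves out the OUTEST, OUTMODEL, STORE and OUTPUT pieces to stay short.

```sas
/* Sketch only: %run_logit and its parameter names are hypothetical. */
%macro run_logit(runid=, predictor=, as_class=N,
                 indata=work.german_credit_groups, target=Creditability);
    title2 "PROC LOGISTIC &runid. &target. by &predictor. (class=&as_class.) DATA = &indata.";
    proc logistic data=&indata. descending;
        /* Treat the predictor as categorical only when requested */
        %if %upcase(&as_class.) = Y %then %do;
            class &predictor.;
        %end;
        /* LACKFIT requests the Hosmer-Lemeshow test */
        model &target. = &predictor. / lackfit rsquare;
    run;
    title2;
%mend run_logit;

%run_logit(runid=01, predictor=group_ks)
%run_logit(runid=02, predictor=woe_ks)
%run_logit(runid=03, predictor=group_ks, as_class=Y)
%run_logit(runid=04, predictor=group_sd)
%run_logit(runid=05, predictor=woe_sd)
%run_logit(runid=06, predictor=group_sd, as_class=Y)
```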

The first model uses your bin sequence numbers as a single variable, group_ks.  Because you tried to create linearity, the Hosmer-Lemeshow test shows pretty close agreement between observed and expected, chi-square = 0.2012 and p value = 0.9043.  But if you run the second model, using woe_ks, the WoE values for the same set of bins, as a single variable, you get perfect agreement between observed and expected in Hosmer-Lemeshow, chi-square = 0 and p value = 1.  Same perfect result for the third model, where you use group_ks as a class variable.

 

group_sd has similar results, although it does slightly better at model fit and classification / rank ordering than group_ks, and the only requirements for group_sd were that the event rate differences between neighboring groups should be significant (with event rates monotonically decreasing) and the minimum group size be 60 (with the 5% minimum distribution requirement too).  group_sd made no attempt to show visual linearity.  So, if you just use WoE values to represent the bins, or make the bins class variables, you get the best linear response between predictor and target, there's no need to try to obtain visual linearity.
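
A short derivation shows why the WoE-coded model reproduces the bin rates exactly. The notation is introduced here for illustration: $g_k$ and $b_k$ are the wones and wzero counts of bin $k$, $G$ and $B$ are their totals, and $\beta_0$, $\beta_1$ are the intercept and slope. It assumes the modeled level is the one counted in wones and that every bin contains both outcomes.

$$
\mathrm{WoE}_k = \ln\frac{g_k/G}{b_k/B}
\quad\Longrightarrow\quad
\ln\frac{g_k}{b_k} = \ln\frac{G}{B} + \mathrm{WoE}_k
$$

$$
\operatorname{logit}(p_k) = \beta_0 + \beta_1\,\mathrm{WoE}_k
\quad\text{with}\quad
\beta_0 = \ln\frac{G}{B},\;\; \beta_1 = 1
\quad\Longrightarrow\quad
p_k = \frac{g_k}{g_k + b_k}
$$

These coefficients give every bin its observed rate, which is the same fit the class-variable model produces. No assignment of one probability per bin can have a higher likelihood, so this is also the maximum likelihood fit of the two-parameter WoE model. For the paper's bins, $\ln(700/300) \approx 0.8473$, and for bin 1, $0.8473 + 0.1672 = 1.0145 \approx \ln(513/186)$. With only three or four distinct predicted values, the Hosmer-Lemeshow groups would be expected to coincide with the bins, so observed and expected counts agree and the chi-square is 0.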

 

Ksharp
Super User

Hi Top,

Honestly, I don't understand everything you said. Can you post your WOE bins for AMOUNT?

Here are mine from the paper.

 

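| Obs | i | j | wsize | wones | wzero | wrate | sos | iv | woe | ent | css | diffwrate | sigdifrate | diffwoe |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 631 | 699 | 513 | 186 | 0.7339 | 0.1365 | 0.0189 | 0.1672 | 0.6629 | 0.1237 | . | . | . |
| 2 | 632 | 764 | 139 | 96 | 43 | 0.6907 | 0.0297 | 0.0003 | -0.0442 | 0.1408 | 0.0292 | -0.0433 | 0.0835 | -0.2114 |
| 3 | 765 | 821 | 60 | 37 | 23 | 0.6167 | 0.0142 | 0.0089 | -0.3719 | 0.0654 | 0.0132 | -0.0740 | 0.1451 | -0.3277 |
| 4 | 822 | 923 | 102 | 54 | 48 | 0.5294 | 0.0254 | 0.0604 | -0.7295 | 0.1155 | 0.0254 | -0.0873 | 0.1566 | -0.3576 |
| Total | | | 1,000 | 700 | 300 | | 0.2058 | 0.0884 | | 0.9845 | 0.1915 | | | |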

Top_Katz
Quartz | Level 8

Hi @Ksharp !

 

Sorry for being unclear.  Yes, for the set of bins you just posted, which I copied from your paper, I created two variables:

1.  group_ks, which just assigns the corresponding "Obs" column number (1, 2, 3, 4) to each of the 1,000 data records, depending on which bin it falls into;

2.  woe_ks, which assigns the corresponding "woe" column value (0.1672, -0.0442, -0.3719, -0.7295) to each of the 1,000 data records, depending on which bin it falls into

 

Here is the other set of bins I used:

 

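| Obs | i | j | wsize | wones | wzero | wrate | sos | iv | woe | ent | css | diffwrate | sigdifrate | diffwoe |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 668 | 739 | 550 | 189 | 0.7443 | 0.1407 | 0.0344 | 0.2209 | 0.6878 | 0.1278 | . | . | . |
| 2 | 669 | 845 | 183 | 114 | 69 | 0.6230 | 0.0430 | 0.0232 | -0.3452 | 0.1985 | 0.0415 | -0.1213 | 0.0769 | -0.5661 |
| 3 | 846 | 923 | 78 | 36 | 42 | 0.4615 | 0.0194 | 0.0887 | -1.0015 | 0.0881 | 0.0194 | -0.1614 | 0.1310 | -0.6562 |
| Total | | | 1,000 | 700 | 300 | | 0.2030 | 0.1463 | | 0.9745 | 0.1887 | | | |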

 

 

I also created two variables for this set of bins:

3.  group_sd, which just assigns the corresponding "Obs" column number (1, 2, 3) to each of the 1,000 data records, depending on which bin it falls into;

4.  woe_sd, which assigns the corresponding "woe" column value (0.2209, -0.3452, -1.0015) to each of the 1,000 data records, depending on which bin it falls into

 

Then I ran separate regressions of the binary target variable, Creditability, against each of the four new variables, used as numeric variables, and also used as class variables.  If you regress against the group_ks or group_sd variable as a numeric variable, that's where the visual linearity by sequence number corresponds to linearity between target and predictor.  But if you use the binned predictor as a class variable, or if you use the woe_ks or woe_sd numeric version, visual linearity becomes irrelevant and the fit is better, too.  The "_import_WORK.GERMAN_CREDIT_GROUPS.sas" program I uploaded with my previous post reads in the German Credit data along with the variables I added, and then the code embedded in my post runs the series of regressions, including Hosmer-Lemeshow tests (from the "lackfit" option on the model statement in PROC LOGISTIC).  All of the H-L tests show good agreement between observed and expected, but the ones using the woe_ks, woe_sd, or class variable predictors show perfect agreement.

Ksharp
Super User

Hi Top,

Oh, you don't have to do that.

In a scorecard, we model the WOE, not the group number.

Here is my WOE for Amount.

[Image: x.PNG]

 

I need your min_amount, max_amount, and woe for testing GOF.

But using a single variable for the test does not look like a good idea.

Here is what I tried, using woe_amount to build a model.

Maybe we need more variables to test which one has better GOF.

 

/* Import the German Credit workbook */
proc import datafile='/folders/myfolders/1--German Credit.xlsx' dbms=xlsx out=have replace;
run;

/* Assign the WoE value of the paper's four Amount bins */
data temp;
    set have(keep=amount good_bad);
    if amount le 3578 then woe_amount = -0.16723;
    else if amount le 5743 then woe_amount = 0.04415;
    else if amount le 7127 then woe_amount = 0.371874;
    else woe_amount = 0.729515;
run;

/* Single-predictor model with goodness-of-fit output */
proc logistic data=temp;
    model good_bad = woe_amount / gof lackfit;
run;

 

OUTPUT:

[Image: x.PNG]

[Image: x.PNG]

 

Top_Katz
Quartz | Level 8

Hi @Ksharp !

 

The variable you call woe_amount is the negative of the variable I called woe_ks, and it looks like the variable you call good_bad is the same as the variable I called Creditability, so your logistic regression of good_bad against woe_amount is exactly equivalent to my second regression from above:

 

%let	runid	=	02	;
%let	indata&runid.			=	work.german_credit_groups	;
%let	binary_target_&runid.	=	Creditability	;
%let	bin_predictor_&runid.	=	woe_ks	;	


title2	"PROC	LOGISTIC	&runid.	&&binary_target_&runid.. by &&bin_predictor_&runid..	DATA	=	&&indata&runid.."	;
PROC	LOGISTIC	DATA	=	&&indata&runid..	DESCENDING	OUTEST	=	work.gc_logit_&&bin_predictor_&runid.._coefs_&runid.	
					OUTMODEL	=	work.gc_logit_&&bin_predictor_&runid.._outmodel_&runid.	;
	model	&&binary_target_&runid..	=	&&bin_predictor_&runid..	/	lackfit	rsquare	;
	store	work.gc_logit_&&bin_predictor_&runid.._store_&runid.	;
	output	out	=	work.gc_logit_&&bin_predictor_&runid.._output_&runid.		predicted	=	predicted&runid.	xbeta	=	xbeta&runid.
																				reschi	=	reschi&runid.	resdev	=	resdev&runid.	reslik	=	reslik&runid.	;
run	;
title2	;

And the fact that the Hosmer-Lemeshow test shows perfect agreement between observed and expected is not a coincidence.  Creating a bin transformation variable with values equal to the woe or the log odds will always do that, even if the woe is non-monotonic.  You can test it and see.  The program I have attached (_try_nonmono_woe_.sas) creates a thirty-six record data set with eighteen events and eighteen non-events and three bins, each with twelve records.  The first bin has six events and six non-events, woe=0.  The second bin has nine events and three non-events, woe=1.099, the third bin has three events and nine non-events, woe=-1.099.  Very non-monotonic woe in bin sequence.  Logistic regression against the bin sequence number is a failure, and the Hosmer-Lemeshow test shows significant disagreement between observed and expected.  But once again, for logistic regression against the woe transformation, the Hosmer-Lemeshow test shows perfect agreement between observed and expected, as asserted.
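
For anyone who cannot open the attachment, the thirty-six record example can be rebuilt from the description above. This is a sketch and not the attached _try_nonmono_woe_.sas itself: the data set name work.nonmono and the variable names bin, event and woe are made up for illustration.

```sas
/* Sketch only: names are hypothetical; counts come from the description. */
data work.nonmono;
    array n_event[3] _temporary_ (6 9 3);   /* events per bin, 12 records each */
    do bin = 1 to 3;
        /* WoE = log( (events in bin / 18) / (non-events in bin / 18) ) */
        woe = log( (n_event[bin] / 18) / ((12 - n_event[bin]) / 18) );
        do rec = 1 to 12;
            event = (rec <= n_event[bin]);  /* first n_event records are events */
            output;
        end;
    end;
    drop rec;
run;

/* Bin sequence number as a numeric predictor */
proc logistic data=work.nonmono descending;
    model event = bin / lackfit;
run;

/* WoE transformation of the same bins as the predictor */
proc logistic data=work.nonmono descending;
    model event = woe / lackfit;
run;
```

The data step gives woe = 0, log(3) and -log(3) for the three bins, matching the 0, 1.099 and -1.099 quoted above. The two PROC LOGISTIC steps are the comparison described in the paragraph: the first regresses on the bin sequence number, the second on the WoE transformation.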

Ksharp
Super User

Top,

Yeah, I know. Therefore, using one single variable to build a model and check GOF is NOT a good idea; we may need more variables to check GOF, eight or nine at least.

Top_Katz
Quartz | Level 8

Right you are @Ksharp !

 

Univariate models are rarely useful in practice, just suitable for demonstration purposes.  My point is that you can always bin a continuous variable, even one with a non-linear, non-monotonic relationship to the target, and get a transformed variable which is usable in a logistic regression model.  (And once again, users of binned variables as model predictors should be aware of the risks incurred.)

Ksharp
Super User

Top,

And I agree that you don't have to make the WoE visually linear. But the WoE must be monotonic; that is also mentioned by Siddiqi in his book, and WoE monotonicity is also one of the assumptions of GLM.

[Image: x.PNG]

You can see that the bad percent is monotonically increasing and so is the WoE. That is one of the assumptions of GLM.

As for the width of the bins, and whether they should all have the same width: my opinion is that they should, because that could make the WoE/score more distinguishable, although it would give you a lower IV.

Top_Katz
Quartz | Level 8

Hi @Ksharp !

 

As I demonstrated with my example program in response to your previous post, woe monotonicity is completely unnecessary for the original predictor variable.  You just need to transform it to a variable that has woe monotonicity.  When you create bins for a continuous predictor, just use the bin woe value as the transformed variable value, as I have shown you.  That transformed variable will be ready for use as a predictor in a logistic regression.
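
If you would rather compute the bin WoE values from the data than type them in, something along these lines should work. It is a sketch under assumptions: Creditability is taken to be coded 1/0 with the 1s counted as wones, every bin must contain both outcomes, and the output names work.woe_by_bin, work.german_credit_woe and woe_sd_check are made up for illustration.

```sas
/* Sketch only: output data set and column names are hypothetical. */
proc sql;
    /* One row per bin: counts of each outcome and the resulting WoE */
    create table work.woe_by_bin as
    select group_sd,
           sum(Creditability = 1) as wones,
           sum(Creditability = 0) as wzero,
           log( (sum(Creditability = 1) / (select sum(Creditability = 1) from work.german_credit_groups))
              / (sum(Creditability = 0) / (select sum(Creditability = 0) from work.german_credit_groups)) )
               as woe
    from work.german_credit_groups
    group by group_sd;

    /* Attach the computed WoE to every record as the transformed predictor */
    create table work.german_credit_woe as
    select a.*, b.woe as woe_sd_check
    from work.german_credit_groups as a
         left join work.woe_by_bin as b
         on a.group_sd = b.group_sd;
quit;
```

Under those assumptions, woe_sd_check should agree with the woe_sd values I posted (0.2209, -0.3452, -1.0015) up to rounding.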

Ksharp
Super User

Top,

" woe monotonicity is completely unnecessary for the original predictor variable. "

Agreed. Once the WoE is monotonic, then Y (the response variable) must be monotonic, as I showed you in the picture above.

Ksharp
Super User

Hi @Top_Katz  ,

Siddiqi just replied to my private message on LinkedIn.

I would like to share it with you.

But there is something I disagree with him about.

[Image: x.PNG]

 

 

Top_Katz
Quartz | Level 8

Hi @Ksharp !

 

Thank you for following up.  I think what Siddiqi said makes sense; it looks like he more or less agrees with my views about linearity, where I assume he's talking about visual linearity of equally spaced bin WoE values.  What is your disagreement with him?

Ksharp
Super User

Siddiqi said the WoE could look like a hockey stick, but my opinion is that the WoE must be monotonic.

