Hi @Ksharp !
"Pearson chi-square/DF =1 is testing if the data is over-disperse ,has nothing to do with GOF."
Ha-ha, well, if you want to argue with Paul Allison, go ahead. Anyway, I've invited you to apply any of your favorite GOF measures to the data. You've already invested a lot of time in your argument, and haven't shown even a single bit of quantitative proof. I've given you fact after fact. So here's your chance!
"What I try to do is to make score more distinguish (-10 -5 V.S. -10 -8) ,and make better GOF ,although IV is not bigger that yours."
You haven't shown how or why making the score more distinguished is better, nor have you shown a better GOF. And in my example, the event rates are already distinguished with 99% confidence, anyway.
"What I do with linearity of equally spaced bin is trying to not break assumption violation of GLM and get better GOF ."
I don't mean to sound harsh, but as I've already repeated several times, equally spaced linearity is COMPLETELY IRRELEVANT to GLM assumptions. And you still haven't proved that it gives better GOF.
I'm waiting. For proof.
Hi Top,
Sorry, I really have no time to test and compare them. I have already posted two kinds of GOF via the URL (Rick's blog).
The attachment is the output of some practice I did with the classic German scorecard data. You could compare its GOF with yours and see what is different.
"You haven't shown how or why making the score more distinguished is better"
Equal-width bins could make the score more distinguishable, I think.
Hi @Ksharp !
I did an experiment with the German Credit data to illustrate my point about linearity of target and predictor versus visual linearity of bin WoE values with bin sequence numbers. The attachment imports the full thousand-record German Credit data set, sorted in increasing order of Credit_Amount (the original interval-valued predictor), with five added variables (a sketch of how they could be constructed follows the list):
1. idnum is just a sequence number for the original order of the observations
2. group_ks gives bin sequence numbers for the binning in your paper (group_ks = 1 for the first 699 records, up to Credit_Amount = 3578, group_ks = 2 for the next 139 records, up to Credit_Amount = 5743, group_ks = 3 for the next 60 records, up to Credit_Amount = 7127, group_ks = 4 for the last 102 records, up to Credit_Amount = 18424)
3. woe_ks gives the corresponding WoE value for each of the bins from your paper (woe_ks = 0.167231311070365 when group_ks = 1, woe_ks = -0.0441497846129298 when group_ks = 2, woe_ks = -0.371874163672129 when group_ks = 3, woe_ks = -0.72951482473082 when group_ks = 4)
4. group_sd gives bin sequence numbers for the maximum information value binning, with all bin event rate differences significant at 95% confidence and a 5% minimum distribution requirement with group size at least 60 (group_sd = 1 for the first 739 records, up to Credit_Amount = 3905, group_sd = 2 for the next 183 records, up to Credit_Amount = 7758, group_sd = 3 for the last 78 records, up to Credit_Amount = 18424)
5. woe_sd gives the corresponding WoE value for each of the group_sd bins (woe_sd = 0.2208734028 when group_sd = 1, woe_sd = -0.345205917 when group_sd = 2, woe_sd = -1.00144854 when group_sd = 3)
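For readers who want to reproduce the attachment, here is a minimal sketch of how those five variables could be derived; the raw import name work.german_credit is an assumption of mine, and the attached _import program is authoritative:

/* Sketch only; the attached _import_WORK.GERMAN_CREDIT_GROUPS.sas is authoritative. */
/* Assumes the raw import is a data set named work.german_credit in its original order. */
data work.german_credit_groups ;
   set work.german_credit ;
   idnum = _n_ ; /* sequence number for the original order of the observations */
   /* bins from the paper, cut at Credit_Amount 3578 / 5743 / 7127 */
   if Credit_Amount <= 3578 then do ; group_ks = 1 ; woe_ks = 0.167231311070365 ; end ;
   else if Credit_Amount <= 5743 then do ; group_ks = 2 ; woe_ks = -0.0441497846129298 ; end ;
   else if Credit_Amount <= 7127 then do ; group_ks = 3 ; woe_ks = -0.371874163672129 ; end ;
   else do ; group_ks = 4 ; woe_ks = -0.72951482473082 ; end ;
   /* maximum information value bins, cut at Credit_Amount 3905 / 7758 */
   if Credit_Amount <= 3905 then do ; group_sd = 1 ; woe_sd = 0.2208734028 ; end ;
   else if Credit_Amount <= 7758 then do ; group_sd = 2 ; woe_sd = -0.345205917 ; end ;
   else do ; group_sd = 3 ; woe_sd = -1.00144854 ; end ;
run ;
/* sort to match the attachment's order */
proc sort data = work.german_credit_groups ;
   by Credit_Amount ;
run ;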
Then I ran logistic regressions of Creditability, the binary target, against, respectively, group_ks as a numeric variable, woe_ks as a numeric variable, group_ks as a class variable, group_sd as a numeric variable, woe_sd as a numeric variable, group_sd as a class variable. In each case, I had PROC LOGISTIC run the Hosmer-Lemeshow test; it's only on development data because there is no extra test data.
%let runid = 01 ;
%let indata&runid. = work.german_credit_groups ;
%let binary_target_&runid. = Creditability ;
%let bin_predictor_&runid. = group_ks ;
title2 "PROC LOGISTIC &runid. &&binary_target_&runid.. by &&bin_predictor_&runid.. DATA = &&indata&runid.." ;
PROC LOGISTIC DATA = &&indata&runid.. DESCENDING OUTEST = work.gc_logit_&&bin_predictor_&runid.._coefs_&runid.
OUTMODEL = work.gc_logit_&&bin_predictor_&runid.._outmodel_&runid. ;
model &&binary_target_&runid.. = &&bin_predictor_&runid.. / lackfit rsquare ;
store work.gc_logit_&&bin_predictor_&runid.._store_&runid. ;
output out = work.gc_logit_&&bin_predictor_&runid.._output_&runid. predicted = predicted&runid. xbeta = xbeta&runid.
reschi = reschi&runid. resdev = resdev&runid. reslik = reslik&runid. ;
run ;
title2 ;
%let runid = 02 ;
%let indata&runid. = work.german_credit_groups ;
%let binary_target_&runid. = Creditability ;
%let bin_predictor_&runid. = woe_ks ;
title2 "PROC LOGISTIC &runid. &&binary_target_&runid.. by &&bin_predictor_&runid.. DATA = &&indata&runid.." ;
PROC LOGISTIC DATA = &&indata&runid.. DESCENDING OUTEST = work.gc_logit_&&bin_predictor_&runid.._coefs_&runid.
OUTMODEL = work.gc_logit_&&bin_predictor_&runid.._outmodel_&runid. ;
model &&binary_target_&runid.. = &&bin_predictor_&runid.. / lackfit rsquare ;
store work.gc_logit_&&bin_predictor_&runid.._store_&runid. ;
output out = work.gc_logit_&&bin_predictor_&runid.._output_&runid. predicted = predicted&runid. xbeta = xbeta&runid.
reschi = reschi&runid. resdev = resdev&runid. reslik = reslik&runid. ;
run ;
title2 ;
%let runid = 03 ;
%let indata&runid. = work.german_credit_groups ;
%let binary_target_&runid. = Creditability ;
%let bin_predictor_&runid. = group_ks ;
title2 "PROC LOGISTIC &runid. &&binary_target_&runid.. by &&bin_predictor_&runid.. (class) DATA = &&indata&runid.." ;
PROC LOGISTIC DATA = &&indata&runid.. DESCENDING OUTEST = work.gc_logit_&&bin_predictor_&runid.._coefs_&runid.
OUTMODEL = work.gc_logit_&&bin_predictor_&runid.._outmodel_&runid. ;
class &&bin_predictor_&runid.. ;
model &&binary_target_&runid.. = &&bin_predictor_&runid.. / lackfit rsquare ;
store work.gc_logit_&&bin_predictor_&runid.._store_&runid. ;
output out = work.gc_logit_&&bin_predictor_&runid.._output_&runid. predicted = predicted&runid. xbeta = xbeta&runid.
reschi = reschi&runid. resdev = resdev&runid. reslik = reslik&runid. ;
run ;
title2 ;
%let runid = 04 ;
%let indata&runid. = work.german_credit_groups ;
%let binary_target_&runid. = Creditability ;
%let bin_predictor_&runid. = group_sd ;
title2 "PROC LOGISTIC &runid. &&binary_target_&runid.. by &&bin_predictor_&runid.. DATA = &&indata&runid.." ;
PROC LOGISTIC DATA = &&indata&runid.. DESCENDING OUTEST = work.gc_logit_&&bin_predictor_&runid.._coefs_&runid.
OUTMODEL = work.gc_logit_&&bin_predictor_&runid.._outmodel_&runid. ;
model &&binary_target_&runid.. = &&bin_predictor_&runid.. / lackfit rsquare ;
store work.gc_logit_&&bin_predictor_&runid.._store_&runid. ;
output out = work.gc_logit_&&bin_predictor_&runid.._output_&runid. predicted = predicted&runid. xbeta = xbeta&runid.
reschi = reschi&runid. resdev = resdev&runid. reslik = reslik&runid. ;
run ;
title2 ;
%let runid = 05 ;
%let indata&runid. = work.german_credit_groups ;
%let binary_target_&runid. = Creditability ;
%let bin_predictor_&runid. = woe_sd ;
title2 "PROC LOGISTIC &runid. &&binary_target_&runid.. by &&bin_predictor_&runid.. DATA = &&indata&runid.." ;
PROC LOGISTIC DATA = &&indata&runid.. DESCENDING OUTEST = work.gc_logit_&&bin_predictor_&runid.._coefs_&runid.
OUTMODEL = work.gc_logit_&&bin_predictor_&runid.._outmodel_&runid. ;
model &&binary_target_&runid.. = &&bin_predictor_&runid.. / lackfit rsquare ;
store work.gc_logit_&&bin_predictor_&runid.._store_&runid. ;
output out = work.gc_logit_&&bin_predictor_&runid.._output_&runid. predicted = predicted&runid. xbeta = xbeta&runid.
reschi = reschi&runid. resdev = resdev&runid. reslik = reslik&runid. ;
run ;
title2 ;
%let runid = 06 ;
%let indata&runid. = work.german_credit_groups ;
%let binary_target_&runid. = Creditability ;
%let bin_predictor_&runid. = group_sd ;
title2 "PROC LOGISTIC &runid. &&binary_target_&runid.. by &&bin_predictor_&runid.. (class) DATA = &&indata&runid.." ;
PROC LOGISTIC DATA = &&indata&runid.. DESCENDING OUTEST = work.gc_logit_&&bin_predictor_&runid.._coefs_&runid.
OUTMODEL = work.gc_logit_&&bin_predictor_&runid.._outmodel_&runid. ;
class &&bin_predictor_&runid.. ;
model &&binary_target_&runid.. = &&bin_predictor_&runid.. / lackfit rsquare ;
store work.gc_logit_&&bin_predictor_&runid.._store_&runid. ;
output out = work.gc_logit_&&bin_predictor_&runid.._output_&runid. predicted = predicted&runid. xbeta = xbeta&runid.
reschi = reschi&runid. resdev = resdev&runid. reslik = reslik&runid. ;
run ;
title2 ;
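Since the six runs differ only in the predictor and in whether it enters through a class statement, the whole pattern could also be wrapped in a macro. Here is a sketch; the macro name run_logit is just an illustrative choice, not part of the attachment:

%macro run_logit(runid=, predictor=, use_class=0,
                 indata=work.german_credit_groups, target=Creditability) ;
   %local class_tag ;
   %let class_tag = ;
   %if &use_class. %then %let class_tag = (class) ;
   title2 "PROC LOGISTIC &runid. &target. by &predictor. &class_tag. DATA = &indata." ;
   proc logistic data = &indata. descending outest = work.gc_logit_&predictor._coefs_&runid.
        outmodel = work.gc_logit_&predictor._outmodel_&runid. ;
      %if &use_class. %then %do ;
         class &predictor. ;
      %end ;
      model &target. = &predictor. / lackfit rsquare ;
      store work.gc_logit_&predictor._store_&runid. ;
      output out = work.gc_logit_&predictor._output_&runid. predicted = predicted&runid. xbeta = xbeta&runid.
             reschi = reschi&runid. resdev = resdev&runid. reslik = reslik&runid. ;
   run ;
   title2 ;
%mend run_logit ;

/* the six runs above, one call each */
%run_logit(runid=01, predictor=group_ks)
%run_logit(runid=02, predictor=woe_ks)
%run_logit(runid=03, predictor=group_ks, use_class=1)
%run_logit(runid=04, predictor=group_sd)
%run_logit(runid=05, predictor=woe_sd)
%run_logit(runid=06, predictor=group_sd, use_class=1)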
The first model uses your bin sequence numbers as a single variable, group_ks. Because you tried to create linearity, the Hosmer-Lemeshow test shows pretty close agreement between observed and expected: chi-square = 0.2012 and p-value = 0.9043. But if you run the second model, using woe_ks, the WoE values for the same set of bins, as a single variable, you get perfect agreement between observed and expected in Hosmer-Lemeshow: chi-square = 0 and p-value = 1. The third model, which uses group_ks as a class variable, gives the same perfect result.
group_sd gives similar results, although it does slightly better at model fit and classification / rank ordering than group_ks, and the only requirements for group_sd were that the event rate differences between neighboring groups be significant (with event rates monotonically decreasing) and that the minimum group size be 60 (along with the 5% minimum distribution requirement). group_sd made no attempt at visual linearity. So, if you just use WoE values to represent the bins, or make the bins class variables, you get the best linear response between predictor and target; there is no need to try to obtain visual linearity.
Hi Top,
Honestly, I don't understand everything you said. Can you post your WOE bins for AMOUNT?
Here is mine from the paper.
Obs | i | j | wsize | wones | wzero | wrate | sos | iv | woe | ent | css | diffwrate | sigdifrate | diffwoe |
1 | 1 | 631 | 699 | 513 | 186 | 0.7339 | 0.1365 | 0.0189 | 0.1672 | 0.6629 | 0.1237 | . | . | . |
2 | 632 | 764 | 139 | 96 | 43 | 0.6907 | 0.0297 | 0.0003 | -0.0442 | 0.1408 | 0.0292 | -0.0433 | 0.0835 | -0.2114 |
3 | 765 | 821 | 60 | 37 | 23 | 0.6167 | 0.0142 | 0.0089 | -0.3719 | 0.0654 | 0.0132 | -0.0740 | 0.1451 | -0.3277 |
4 | 822 | 923 | 102 | 54 | 48 | 0.5294 | 0.0254 | 0.0604 | -0.7295 | 0.1155 | 0.0254 | -0.0873 | 0.1566 | -0.3576 |
Total | | | 1,000 | 700 | 300 | | 0.2058 | 0.0884 | | 0.9845 | 0.1915 | | | |
|
Hi @Ksharp !
Sorry for being unclear. Yes, for the set of bins you just posted, which I copied from your paper, I created two variables:
1. group_ks, which just assigns the corresponding "Obs" column number (1, 2, 3, 4) to each of the 1,000 data records, depending on which bin it falls into;
2. woe_ks, which assigns the corresponding value from the "woe" column (0.1672, -0.0442, -0.3719, -0.7295) to each of the 1,000 data records, depending on which bin it falls into
Here is the other set of bins I used:
Obs | i | j | wsize | wones | wzero | wrate | sos | iv | woe | ent | css | diffwrate | sigdifrate | diffwoe |
1 | 1 | 668 | 739 | 550 | 189 | 0.7443 | 0.1407 | 0.0344 | 0.2209 | 0.6878 | 0.1278 | . | . | . |
2 | 669 | 845 | 183 | 114 | 69 | 0.6230 | 0.0430 | 0.0232 | -0.3452 | 0.1985 | 0.0415 | -0.1213 | 0.0769 | -0.5661 |
3 | 846 | 923 | 78 | 36 | 42 | 0.4615 | 0.0194 | 0.0887 | -1.0015 | 0.0881 | 0.0194 | -0.1614 | 0.1310 | -0.6562 |
Total | | | 1,000 | 700 | 300 | | 0.2030 | 0.1463 | | 0.9745 | 0.1887 | | | |
I also created two variables for this set of bins:
3. group_sd, which just assigns the corresponding "Obs" column number (1, 2, 3) to each of the 1,000 data records, depending on which bin it falls into;
4. woe_sd, which assigns the corresponding value from the "woe" column (0.2209, -0.3452, -1.0015) to each of the 1,000 data records, depending on which bin it falls into
Then I ran separate regressions of the binary target variable, Creditability, against each of the four new variables, used as numeric variables and also as class variables. If you regress against the group_ks or group_sd variable as a numeric variable, that is where visual linearity by sequence number corresponds to linearity between target and predictor. But if you use the binned predictor as a class variable, or if you use the woe_ks or woe_sd numeric version, visual linearity becomes irrelevant, and the fit is better, too. The "_import_WORK.GERMAN_CREDIT_GROUPS.sas" program I uploaded with my previous post reads in the German Credit data along with the variables I added, and then the code embedded in my post runs the series of regressions, including Hosmer-Lemeshow tests (from the "lackfit" option on the model statement in PROC LOGISTIC). All of the H-L tests show good agreement between observed and expected, but the ones using the woe_ks, woe_sd, or class variable predictors show perfect agreement.
Hi Top,
Oh, you don't have to do that.
In scorecard modeling, we model the WOE, not the group number.
Here is my WOE for Amount.
I need your min_amount, max_amount, and woe for testing GOF.
But using a single variable for the test doesn't look like a good idea.
Here is what I tried, using woe_amount to build a model.
Maybe we need more variables to test which one has better GOF.
proc import datafile='/folders/myfolders/1--German Credit.xlsx' dbms=xlsx out=have replace;
run;

data temp;
   set have(keep=amount good_bad);
   if amount le 3578 then woe_amount=-0.16723;
   else if amount le 5743 then woe_amount=0.04415;
   else if amount le 7127 then woe_amount=0.371874;
   else woe_amount=0.729515;
run;

proc logistic data=temp;
   model good_bad=woe_amount / gof lackfit;
run;
OUTPUT: (results attached as an image)
Hi @Ksharp !
The variable you call woe_amount is the negative of the variable I called woe_ks, and it looks like the variable you call good_bad is the same as the variable I called Creditability, so your logistic regression of good_bad against woe_amount is exactly equivalent to my second regression from above (negating the predictor just flips the sign of the slope estimate; the fitted probabilities, and hence the Hosmer-Lemeshow test, are identical):
%let runid = 02 ;
%let indata&runid. = work.german_credit_groups ;
%let binary_target_&runid. = Creditability ;
%let bin_predictor_&runid. = woe_ks ;
title2 "PROC LOGISTIC &runid. &&binary_target_&runid.. by &&bin_predictor_&runid.. DATA = &&indata&runid.." ;
PROC LOGISTIC DATA = &&indata&runid.. DESCENDING OUTEST = work.gc_logit_&&bin_predictor_&runid.._coefs_&runid.
OUTMODEL = work.gc_logit_&&bin_predictor_&runid.._outmodel_&runid. ;
model &&binary_target_&runid.. = &&bin_predictor_&runid.. / lackfit rsquare ;
store work.gc_logit_&&bin_predictor_&runid.._store_&runid. ;
output out = work.gc_logit_&&bin_predictor_&runid.._output_&runid. predicted = predicted&runid. xbeta = xbeta&runid.
reschi = reschi&runid. resdev = resdev&runid. reslik = reslik&runid. ;
run ;
title2 ;
And the fact that the Hosmer-Lemeshow test shows perfect agreement between observed and expected is not a coincidence. Creating a bin transformation variable with values equal to the WoE or the log odds will always do that, even if the WoE is non-monotonic: with a single WoE predictor, the maximum likelihood solution has slope 1 and intercept equal to the overall log odds, so the fitted probability in each bin exactly equals the observed bin event rate. You can test it and see. The program I have attached (_try_nonmono_woe_.sas) creates a thirty-six record data set with eighteen events and eighteen non-events and three bins, each with twelve records. The first bin has six events and six non-events, woe = 0. The second bin has nine events and three non-events, woe = 1.099; the third bin has three events and nine non-events, woe = -1.099. Very non-monotonic WoE in bin sequence. Logistic regression against the bin sequence number is a failure, and the Hosmer-Lemeshow test shows significant disagreement between observed and expected. But once again, for logistic regression against the WoE transformation, the Hosmer-Lemeshow test shows perfect agreement between observed and expected, as asserted.
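The attachment isn't reproduced inline, but a minimal sketch of an equivalent program, built only from the counts just described, would look like this (the data set and variable names work.nonmono, bin, and target are illustrative, not from the attachment):

/* Sketch equivalent to the attached _try_nonmono_woe_.sas: 36 records, */
/* 3 bins of 12, with 6/9/3 events per bin, so the bin WoE values are   */
/* 0, 1.099, -1.099 -- non-monotonic in bin sequence.                   */
data work.nonmono ;
   do bin = 1 to 3 ;
      n_event = choosen(bin, 6, 9, 3) ; /* events in this bin */
      n_nonevent = 12 - n_event ;       /* 12 records per bin */
      woe = log( (n_event / 18) / (n_nonevent / 18) ) ; /* bin WoE value */
      do i = 1 to n_event ; target = 1 ; output ; end ;
      do i = 1 to n_nonevent ; target = 0 ; output ; end ;
   end ;
   keep bin woe target ;
run ;

title2 "bin sequence number as predictor: H-L shows disagreement" ;
proc logistic data = work.nonmono descending ;
   model target = bin / lackfit ;
run ;

title2 "woe transformation as predictor: H-L shows perfect agreement" ;
proc logistic data = work.nonmono descending ;
   model target = woe / lackfit ;
run ;
title2 ;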
Top,
Yeah, I know. Therefore using one single variable to build a model and check GOF is NOT a good idea; maybe we need more variables to check GOF, like eight or nine variables at least.
Right you are @Ksharp !
Univariate models are rarely useful in practice, just suitable for demonstration purposes. My point is that you can always bin a continuous variable, even one with a non-linear, non-monotonic relationship to the target, and get a transformed variable which is usable in a logistic regression model. (And once again, users of binned variables as model predictors should be aware of the risks incurred.)
Top,
And I agree that you don't have to make the WOE visually linear. But the WOE must be monotonic; that is also mentioned by Siddiqi in his book, and WOE monotonicity is also one of the assumptions of GLM.
And you can see that the bad percentage increases monotonically, and so does the WOE; that is one of the assumptions of GLM.
And about bin width, whether bins should have the same width or not: my opinion is that they should, because that could make the WOE/score more distinguishable, although it would get you a lower IV.
Hi @Ksharp !
As I demonstrated with my example program in response to your previous post, woe monotonicity is completely unnecessary for the original predictor variable. You just need to transform it to a variable that has woe monotonicity. When you create bins for a continuous predictor, just use the bin woe value as the transformed variable value, as I have shown you. That transformed variable will be ready for use as a predictor in a logistic regression.
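For concreteness, here is a minimal sketch of that transformation in general form; the data set and variable names (work.binned, bin, target) are illustrative, not from any of the attachments:

/* Sketch: compute each bin's WoE as                                   */
/* log((bin events / total events) / (bin non-events / total           */
/* non-events)) and attach it to each record as the transformed        */
/* predictor. Assumes work.binned has variables bin and target (0/1).  */
proc sql ;
   /* overall totals of events and non-events */
   select sum(target = 1), sum(target = 0)
      into :tot_event, :tot_nonevent
      from work.binned ;
   /* per-bin WoE */
   create table work.bin_woe as
      select bin,
             log( (sum(target = 1) / &tot_event.) / (sum(target = 0) / &tot_nonevent.) ) as woe
      from work.binned
      group by bin ;
   /* merge the WoE back on as the transformed predictor */
   create table work.binned_woe as
      select a.*, b.woe
      from work.binned as a left join work.bin_woe as b
      on a.bin = b.bin ;
quit ;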
Top,
" woe monotonicity is completely unnecessary for the original predictor variable. "
Agree. Once woe is monotonic ,then Y(or response variable) must be monotonic, as I show you above in the picture.
Hi @Top_Katz,
Siddiqi just replied to my private message via LinkedIn.com.
I would like to share it with you.
But there is something I disagree with him about.
Hi @Ksharp !
Thank you for following up. I think what Siddiqi said makes sense; it looks like he more or less agrees with my views about linearity, where I assume he's talking about visual linearity of equally spaced bin WoE values. What is your disagreement with him?
Siddiqi said the WOE could be like a hockey stick, but my opinion is that the WOE must be monotonic.