I have a continuous variable such as INCOME or DURATION.
I need to split it into several groups using several cutpoints.
Two cutpoints yield three groups, three cutpoints yield four groups, and so on.
A test dataset (Excel file) is attached.
The data looks like this; I want to generate the GROUP variable.
Here I use one cutpoint, DURATION=12, which splits DURATION (a continuous variable) into TWO groups.
With TWO cutpoints you would get THREE groups, with THREE cutpoints FOUR groups, and so on.
good_bad  group  duration
good      1      2
bad       1      4
good      1      5
good      1      6
bad       1      8
good      1      10
good      2      18
good      2      28
bad       2      30
bad       2      32
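For illustration, the group column above can be reproduced from duration and a list of cutpoints. This is only a minimal Python sketch, not the GA code from this post; the function name `assign_groups` is made up for the example, and it assumes a value equal to a cutpoint belongs to the lower group (group 1 is duration <= 12).

```python
from bisect import bisect_left

def assign_groups(durations, cutpoints):
    """Return a 1-based group number for each duration.

    Assumption: a value equal to a cutpoint falls in the lower group,
    so with cutpoints=[12], group 1 is duration <= 12.
    """
    cuts = sorted(cutpoints)
    # bisect_left counts how many cutpoints are strictly below the value
    return [bisect_left(cuts, d) + 1 for d in durations]

durations = [2, 4, 5, 6, 8, 10, 18, 28, 30, 32]
print(assign_groups(durations, [12]))     # one cutpoint  -> groups 1 and 2
print(assign_groups(durations, [5, 12]))  # two cutpoints -> groups 1, 2 and 3
```

With k cutpoints the function returns group numbers from 1 to k+1, which matches the "two cutpoints, three groups" rule.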
total_n_bad=4 total_n_good=6
group=1
--------
n_bad=2 n_good=4
bad_dist=n_bad/total_n_bad=2/4=0.5
good_dist=n_good/total_n_good=4/6=0.667
woe=(Bad_Dist-Good_Dist)*log(Bad_Dist/Good_Dist)=(0.5-0.667)*log(0.5/0.667)=0.048
group=2
--------
n_bad=2 n_good=2
bad_dist=n_bad/total_n_bad=2/4=0.5
good_dist=n_good/total_n_good=2/6=0.333
woe=(Bad_Dist-Good_Dist)*log(Bad_Dist/Good_Dist)=(0.5-0.333)*log(0.5/0.333)=0.068
iv=0.048 + 0.068 = 0.116 <----- I want to maximize this IV.
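The hand calculation above can be checked with a short Python sketch. It uses the formula exactly as written in this post, with the natural log. Note that the per-group quantity labelled woe here is the group's contribution to the IV; the log ratio log(bad_dist/good_dist) on its own is what is usually called WoE. The names `rows` and `information_value` are made up for the example.

```python
from math import log

# (good_bad, group) pairs copied from the sample data
rows = [("good", 1), ("bad", 1), ("good", 1), ("good", 1), ("bad", 1),
        ("good", 1), ("good", 2), ("good", 2), ("bad", 2), ("bad", 2)]

def information_value(rows):
    """Return (per-group terms, total IV) using the formula from the post."""
    total_bad = sum(1 for gb, _ in rows if gb == "bad")
    total_good = sum(1 for gb, _ in rows if gb == "good")
    terms = {}
    for g in sorted({grp for _, grp in rows}):
        n_bad = sum(1 for gb, grp in rows if grp == g and gb == "bad")
        n_good = sum(1 for gb, grp in rows if grp == g and gb == "good")
        bad_dist = n_bad / total_bad
        good_dist = n_good / total_good
        # Natural log; a group with no goods or no bads would raise an error here
        terms[g] = (bad_dist - good_dist) * log(bad_dist / good_dist)
    return terms, sum(terms.values())

terms, iv = information_value(rows)
print({g: round(t, 3) for g, t in terms.items()}, round(iv, 3))
# {1: 0.048, 2: 0.068} 0.116
```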
And I also have THREE constraints:
group=1
--------
Bad_Dist>0.05 and Good_Dist>0.05
group=2
--------
Bad_Dist>0.05 and Good_Dist>0.05
This is to avoid "If n_good[g] = 0, then good_dist[g] = 0, yielding a division by zero".
In addition, woe must be monotonic across the groups, i.e. either
woe[1]<woe[2]<woe[3]<woe[4]...
or
woe[1]>woe[2]>woe[3]>woe[4]...
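The constraints could be checked for one candidate grouping roughly as below. This is a sketch under assumptions: the function name `satisfies_constraints` is hypothetical, the inputs are the per-group bad_dist and good_dist values in group order, and monotonicity is tested on log(bad_dist/good_dist), the usual WoE, rather than on the product form used in the hand calculation.

```python
from math import log

def satisfies_constraints(bad_dist, good_dist, min_dist=0.05):
    """Check the 0.05 floor and strict monotonicity for one candidate grouping.

    bad_dist[i] and good_dist[i] are the shares of group i+1, in group order.
    Assumption: monotonicity is tested on log(bad_dist / good_dist).
    """
    # The floor is checked first, so the log below never sees a zero share
    if any(b <= min_dist or g <= min_dist for b, g in zip(bad_dist, good_dist)):
        return False
    woe = [log(b / g) for b, g in zip(bad_dist, good_dist)]
    pairs = list(zip(woe, woe[1:]))
    increasing = all(a < b for a, b in pairs)
    decreasing = all(a > b for a, b in pairs)
    return increasing or decreasing

print(satisfies_constraints([0.5, 0.5], [4 / 6, 2 / 6]))         # sample data, cutpoint 12 -> True
print(satisfies_constraints([0.25, 0.5, 0.25], [0.5, 0.2, 0.3]))  # up, then down -> False
print(satisfies_constraints([1.0, 0.0], [0.5, 0.5]))              # no bads in group 2 -> False
```

Because the floor is tested before any logarithm is taken, an empty group is rejected instead of causing a division by zero.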
P.S.
The number of groups could be 3, 4, 5, 6, 7, 8, 9, 10, ...
and I want to pick the one with the max IV among them.
E.g. group=8 has the max IV when group in (2 3 4 5 6 7 8 9 10).
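For a dataset as small as the test data, the whole problem can be stated as an exhaustive search, which may help as a reference to compare against. The sketch below is plain brute force in Python, not a genetic algorithm and not the GA code from this post. The names `score` and `best_binning` are made up, values equal to a cutpoint go to the lower group, and monotonicity is assumed to be tested on log(bad_dist/good_dist).

```python
from bisect import bisect_left
from itertools import combinations
from math import log

data = [(2, "good"), (4, "bad"), (5, "good"), (6, "good"), (8, "bad"),
        (10, "good"), (18, "good"), (28, "good"), (30, "bad"), (32, "bad")]

def score(cuts, data, min_dist=0.05):
    """Return the IV for the given sorted cutpoints, or None if a constraint fails."""
    n_groups = len(cuts) + 1
    bad = [0] * n_groups
    good = [0] * n_groups
    for duration, label in data:
        idx = bisect_left(cuts, duration)  # duration <= cuts[0] -> index 0
        if label == "bad":
            bad[idx] += 1
        else:
            good[idx] += 1
    bad_dist = [n / sum(bad) for n in bad]
    good_dist = [n / sum(good) for n in good]
    # Constraint: both shares above the floor in every group
    if any(b <= min_dist or g <= min_dist for b, g in zip(bad_dist, good_dist)):
        return None
    # Constraint (assumed form): log ratio strictly increasing or decreasing
    woe = [log(b / g) for b, g in zip(bad_dist, good_dist)]
    steps = [later - earlier for earlier, later in zip(woe, woe[1:])]
    if not (all(s > 0 for s in steps) or all(s < 0 for s in steps)):
        return None
    return sum((b - g) * w for b, g, w in zip(bad_dist, good_dist, woe))

def best_binning(data, group_counts=range(2, 11)):
    """Try every cutpoint combination for each group count; keep the max IV."""
    # Cutting at the largest value would leave the last group empty, so drop it
    candidates = sorted({d for d, _ in data})[:-1]
    best = (None, None)  # (iv, cutpoints)
    for k in group_counts:
        for cuts in combinations(candidates, k - 1):
            iv = score(list(cuts), data)
            if iv is not None and (best[0] is None or iv > best[0]):
                best = (iv, cuts)
    return best

iv, cuts = best_binning(data)
print(cuts, round(iv, 3))
```

The number of combinations grows very quickly with the number of distinct values, so this only works for tiny inputs; on real data it is useful mainly for checking a heuristic search on a small sample.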
Here is an example used by my GA code for the test data (attachment):
