doudou66
Calcite | Level 5

Hello, All

I have a binary dependent variable, CREDIT_RATING, which takes a value of either 1 (bad) or 0 (good), and a continuous independent variable, INCOME. I want to fit a logistic regression:

MODEL  CREDIT_RATING = INCOME

I was told that I should NOT use INCOME directly in the model; rather, I should group INCOME into categories (such as $0-$20,000 as INCOME_1, $20,001-$50,000 as INCOME_2, etc.). While the suggestion makes sense intuitively, is there a statistical reason behind it? What statistical principle is being applied here? My other question is: is there any other way to transform INCOME to make it more suitable for the model?

Accepted Solutions
SteveDenham
Jade | Level 19

My opinion only:

The answer depends on the data set that you have. If INCOME follows the usual distribution with a long tail to the right, then high INCOME values will likely influence the fit more than others. Some have modeled it as a continuous variable with a log transformation, which is probably close, although income is really a mixture of several different distributions.
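
To illustrate the point about the long right tail, here is a minimal sketch in Python (not SAS) on made-up income values. It uses the ordinary least-squares leverage of the largest observation as a rough proxy for influence. Logistic regression weights leverage differently, so treat this as an analogy only; the data and the `leverage_of_max` helper are hypothetical.

```python
import math

# Hypothetical right-skewed incomes: nine moderate values and one very large one.
incomes = [18_000, 25_000, 32_000, 40_000, 47_000,
           55_000, 60_000, 75_000, 100_000, 300_000]

def leverage_of_max(values):
    """OLS leverage of the largest x in a one-predictor model:
    h = 1/n + (x_max - mean)^2 / sum((x_i - mean)^2)."""
    n = len(values)
    mean = sum(values) / n
    total_sq = sum((v - mean) ** 2 for v in values)
    return 1 / n + (max(values) - mean) ** 2 / total_sq

raw_leverage = leverage_of_max(incomes)
log_leverage = leverage_of_max([math.log(v) for v in incomes])

print(f"leverage of top income, raw scale: {raw_leverage:.3f}")
print(f"leverage of top income, log scale: {log_leverage:.3f}")
```

On this toy sample the log scale pulls the top income much closer to the rest, which is the effect the log transformation is meant to achieve.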

The grouping makes a lot of sense and avoids a lot of this influence of single or small groups of records, and may make interpretation easier, but...

It assumes homogeneity WITHIN the group, i.e., the probability of 0 (good) is exactly the same for all individuals within a group, no matter whether they are near the extremes of the group or not.

It assumes you have good reasons to set your cutpoints where you do.
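
The two caveats above can be made concrete with a small Python sketch (not SAS). It uses the cut points from the original question ($20,000 and $50,000); the third label `INCOME_3`, the records, and the helper `income_bin` are assumptions made up for illustration.

```python
import bisect
from collections import defaultdict

# Upper edges of the first two groups, taken from the question.
CUTPOINTS = [20_000, 50_000]
LABELS = ["INCOME_1", "INCOME_2", "INCOME_3"]  # INCOME_3 is hypothetical

def income_bin(income):
    """Return 0 for income <= 20,000, 1 for 20,001-50,000, 2 above 50,000."""
    return bisect.bisect_left(CUTPOINTS, income)

# Hypothetical (income, bad) records, where bad = 1 and good = 0.
records = [(12_000, 1), (18_000, 1), (20_000, 0),
           (20_001, 1), (35_000, 0), (50_000, 0),
           (50_001, 0), (80_000, 0), (120_000, 1)]

counts = defaultdict(lambda: [0, 0])  # bin index -> [number bad, number total]
for income, bad in records:
    b = income_bin(income)
    counts[b][0] += bad
    counts[b][1] += 1

# One bad rate per group: a grouped model treats everyone in a group alike.
rates = {b: n_bad / n for b, (n_bad, n) in counts.items()}
for b in sorted(rates):
    print(LABELS[b], round(rates[b], 3))
```

Note that incomes of 20,001 and 50,000 receive the same estimated bad rate, while 20,000 and 20,001 can receive very different ones. That is the within-group homogeneity assumption, and it shows why the choice of cut points matters.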

If you are working on developing a predictive equation with only a single predictor, take a good look at PROC TRANSREG.  This would enable you to model the dependent variable as a logit, and the independent variable in a variety of ways--class, optimal transforms, non-optimal transforms, nonlinear transforms (such as Box-Cox or penalized B-splines).

Good luck.

Steve Denham

doudou66
Calcite | Level 5

Thank you very much for your information! It is very very enlightening!

Rick_SAS
SAS Super FREQ

Besides the distributional considerations (long tailed --> undue influence), another statistical consideration is that incomes are often reported as rounded values. That is, if you draw a histogram of your income variable, you will likely see spikes at $40k, $60k, and $100k. Although in general I dislike converting a continuous variable to a discrete one, it seems to be a common practice for income, and "rounded incomes" are one reason. Transformations cannot rid your data of this phenomenon.
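
A quick way to check for this rounding before drawing a histogram is to count how many values fall exactly on round figures. The following Python sketch (not SAS) does that on made-up data; the $10,000 step and the helper `share_on_round_values` are assumptions for illustration.

```python
from collections import Counter

# Hypothetical reported incomes, several of them rounded.
incomes = [38_500, 40_000, 40_000, 41_200, 60_000,
           60_000, 57_300, 100_000, 72_450, 40_000]

def share_on_round_values(values, step=10_000):
    """Fraction of values that are exact multiples of `step`."""
    return sum(1 for v in values if v % step == 0) / len(values)

# How often each round value occurs: these are the histogram spikes.
spikes = Counter(v for v in incomes if v % 10_000 == 0)

print("share on round values:", share_on_round_values(incomes))
print("most common round values:", spikes.most_common(3))
```

A large share on round values suggests the variable is already partly discretized by the respondents, which no monotone transformation will undo.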

doudou66
Calcite | Level 5

Thank you very much for your suggestion.

Does anyone have any other ideas on this topic? Your input is highly appreciated.

Manivini123
Calcite | Level 5

I can share a traditional way of deciding how best to bin a continuous variable (using the best KS). First, determine the range of income and split it roughly into 8-10 bins. For example, if it ranges from 20,000 to 100,000 you can start with

- 20K - 30K
- 30K - 40K
- ...
- 90K and above

For each bin you know the actual good/bad distribution, so you can compute a KS value (the absolute difference between cumulative good % and cumulative bad % at that bin). The bin giving the highest KS is used as the first cut-off to split the variable into two bins. For example, if 60K has the best KS, your first split is <=60K and >60K. You can then repeat the best-KS method on the <=60K distribution to find another suitable cut-off, do the same for >60K, and so on. Keep doing this until you reach a reasonable level of KS while also maintaining the ranking. This is an iterative process and quite useful.
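
One step of the best-KS search described above can be sketched as follows, in Python rather than SAS. The records, the candidate cut-offs, and the function name `best_ks_cutoff` are all made up for illustration; the stopping rule for the iteration is left to the analyst.

```python
# Hypothetical (income, bad) records, where bad = 1 and good = 0.
records = [(22_000, 1), (25_000, 1), (28_000, 0), (33_000, 1),
           (37_000, 1), (41_000, 0), (45_000, 1), (52_000, 0),
           (58_000, 1), (63_000, 0), (67_000, 0), (72_000, 0),
           (78_000, 1), (85_000, 0), (91_000, 0), (97_000, 0)]

# Upper edges of the initial 10K-wide bins (the last bin is open-ended).
candidates = [30_000, 40_000, 50_000, 60_000, 70_000, 80_000, 90_000]

def best_ks_cutoff(records, candidates):
    """Return (cutoff, ks) maximizing |cum good % - cum bad %| at the cutoff."""
    total_bad = sum(bad for _, bad in records)
    total_good = len(records) - total_bad
    if total_bad == 0 or total_good == 0:
        return None  # KS is undefined when only one class is present
    best = None
    for cutoff in candidates:
        below = [bad for income, bad in records if income <= cutoff]
        cum_bad = sum(below) / total_bad
        cum_good = (len(below) - sum(below)) / total_good
        ks = abs(cum_good - cum_bad)
        if best is None or ks > best[1]:
            best = (cutoff, ks)
    return best

cutoff, ks = best_ks_cutoff(records, candidates)
print(f"first split at <= {cutoff}, KS = {ks:.4f}")
```

To continue the procedure, call the same function again on the records at or below the chosen cut-off and on the records above it, each time with the candidate cut-offs that fall inside that range.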

PGStats
Opal | Level 21

Classification (or decision) trees would give you optimal cutting points, i.e. the income level categories that make the most difference in credit rating. They are available as Partition analysis in JMP and as Decision Trees in SAS Enterprise Miner.
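
For intuition about how a tree picks a cutting point, here is a minimal single-split sketch in Python. It uses Gini impurity as the splitting criterion, which is an assumption: JMP and Enterprise Miner offer their own criteria and options, and this is not their implementation. The data and the `best_split` helper are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(records):
    """Return (threshold, weighted impurity) for the best single cut on income.
    Candidate thresholds are midpoints between adjacent distinct incomes.
    Returns None if there are fewer than two distinct incomes."""
    incomes = sorted({income for income, _ in records})
    best = None
    for lo, hi in zip(incomes, incomes[1:]):
        threshold = (lo + hi) / 2
        left = [bad for income, bad in records if income <= threshold]
        right = [bad for income, bad in records if income > threshold]
        impurity = (len(left) * gini(left)
                    + len(right) * gini(right)) / len(records)
        if best is None or impurity < best[1]:
            best = (threshold, impurity)
    return best

# Hypothetical (income, bad) records, where bad = 1 and good = 0.
records = [(22_000, 1), (25_000, 1), (28_000, 0), (33_000, 1),
           (37_000, 1), (41_000, 0), (45_000, 1), (52_000, 0),
           (58_000, 1), (63_000, 0), (67_000, 0), (72_000, 0),
           (78_000, 1), (85_000, 0), (91_000, 0), (97_000, 0)]

print(best_split(records))
```

A full tree applies the same search recursively to each side of the split, which yields several income categories rather than two.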

PG

doudou66
Calcite | Level 5

Thank you all very much for great help.
