Solved: Re: a question about variable transformation

doudou66 · Posted 05-02-2012 01:18 PM

Hello, All

I have a binary dependent variable CREDIT_RATING, which takes a value of either 1 (bad) or 0 (good), and a continuous independent variable INCOME. I want to fit a logistic regression:

MODEL CREDIT_RATING = INCOME

I was told that I should NOT use INCOME directly in the model; rather, I should group INCOME into categories (such as $0-$20,000 as INCOME_1, $20,001-$50,000 as INCOME_2, etc.). While the suggestion makes sense intuitively, is there any statistical consideration here? What statistical knowledge is being applied? My other question is: is there any other way to transform INCOME to make it more suitable for the model?

SteveDenham · Posted 05-02-2012 03:30 PM

My opinion only:

The answer depends on the data set that you have. If INCOME follows the usual distribution with a long tail to the right, then it is likely that high INCOME values will influence the fit more than others. Some have modeled it as a continuous variable with a log transformation, which is probably a close approximation, but the income distribution is really a mixture of several different distributions.
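To illustrate the long-tail point, here is a minimal Python sketch (not SAS, and not from the original posts) that compares the skewness of a few made-up income values before and after a log transformation. The numbers and the helper name `skewness` are assumptions chosen only for illustration.

```python
import math

def skewness(values):
    """Population skewness: the third standardized moment."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical incomes with a long right tail (made-up numbers).
incomes = [18_000, 22_000, 25_000, 31_000, 38_000,
           45_000, 52_000, 60_000, 95_000, 400_000]
log_incomes = [math.log(x) for x in incomes]

raw_skew = skewness(incomes)
log_skew = skewness(log_incomes)
print(f"skewness of INCOME:      {raw_skew:.2f}")
print(f"skewness of log(INCOME): {log_skew:.2f}")
```

On data like this the log pulls the single very large value much closer to the rest, so it has less leverage on the fit. It does not make the distribution symmetric, though.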

The grouping makes a lot of sense and avoids a lot of this influence of single or small groups of records, and may make interpretation easier, but...

It assumes homogeneity WITHIN the group, i.e., the probability of 0 (good) is exactly the same for all individuals within a group, no matter whether they are near the extremes of the group or not.

It assumes you have good reasons to set your cutpoints where you do.
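The two caveats above can be made concrete with a small Python sketch (not SAS). It uses the cutpoints from the original question ($20,000 and $50,000); the third label, `INCOME_3`, and the function name `income_group` are assumptions added for illustration.

```python
from bisect import bisect_left

# Upper bounds of INCOME_1 and INCOME_2, taken from the question.
CUTPOINTS = [20_000, 50_000]
# INCOME_3 (everything above $50,000) is an assumed third category.
LABELS = ["INCOME_1", "INCOME_2", "INCOME_3"]

def income_group(income):
    """Map an income to its category; upper bounds are inclusive."""
    return LABELS[bisect_left(CUTPOINTS, income)]

# Same group, so the model assigns both the same predicted probability,
# even though the incomes differ by almost $30,000.
print(income_group(20_001), income_group(50_000))

# Different groups, although the incomes differ by only $1.
print(income_group(20_000), income_group(20_001))
```

Once INCOME enters the model only through these labels, everything within a group is treated identically, and the placement of each cutpoint decides which near-identical customers get separated.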

If you are working on developing a predictive equation with only a single predictor, take a good look at PROC TRANSREG. This would enable you to model the dependent variable as a logit, and the independent variable in a variety of ways--class, optimal transforms, non-optimal transforms, nonlinear transforms (such as Box-Cox or penalized B-splines).
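As a small aside on the Box-Cox family mentioned above, the following Python sketch shows the standard one-parameter Box-Cox transform. It is only the textbook formula, not PROC TRANSREG syntax or output, and the function name `box_cox` is an assumption.

```python
import math

def box_cox(x, lam):
    """One-parameter Box-Cox transform for x > 0.

    lam == 0 gives log(x); otherwise (x**lam - 1) / lam.
    """
    if x <= 0:
        raise ValueError("Box-Cox requires x > 0")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1) / lam

# Hypothetical income value, transformed under a few choices of lambda.
income = 40_000
for lam in (1, 0.5, 0):
    print(lam, round(box_cox(income, lam), 3))
```

Setting lambda to 1 leaves the scale essentially unchanged (a shift by 1), lambda of 0 is the log transform, and values in between compress the right tail to a lesser degree.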

Good luck.

Steve Denham

doudou66 · Posted 05-02-2012 11:24 PM

Thank you very much for your information! It is very very enlightening!

Rick_SAS · Posted 05-03-2012 10:00 AM

Besides the distributional considerations (long tailed --> undue influence), another statistical consideration is that incomes are often reported as rounded values. That is, if you draw a histogram of your income variable, you will likely see spikes at $40k, $60k, and $100k. Although in general I dislike converting a continuous variable to a discrete one, it seems to be a common practice for income, and "rounded incomes" are one reason. Transformations cannot rid your data of this phenomenon.
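A quick way to check for this kind of rounding is to count how many values fall exactly on round numbers. The Python sketch below (not SAS) does that for a few made-up incomes; the data, the $10,000 step, and the name `heaping_share` are assumptions for illustration.

```python
import math
from collections import Counter

def heaping_share(incomes, step=10_000):
    """Fraction of values that are exact multiples of `step`."""
    return sum(1 for x in incomes if x % step == 0) / len(incomes)

# Hypothetical reported incomes (made-up numbers).
incomes = [38_500, 40_000, 40_000, 41_250, 60_000,
           60_000, 60_000, 72_300, 100_000, 100_000]

share = heaping_share(incomes)
spikes = Counter(x for x in incomes if x % 10_000 == 0).most_common()
print(f"share on multiples of 10,000: {share:.0%}")
print("spikes:", spikes)

# A monotone transform such as log keeps tied values tied,
# so the largest spike is just as tall after transforming.
largest_raw_spike = max(Counter(incomes).values())
largest_log_spike = max(Counter(math.log(x) for x in incomes).values())
print(largest_raw_spike, largest_log_spike)
```

The last two numbers match because a transformation only relabels the values; it cannot spread apart records that were reported as the same rounded amount.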

doudou66 · Posted 05-03-2012 11:35 AM

Thank you very much for your suggestion.

Does anyone have any other ideas on this topic? Your input is highly appreciated.

Manivini123 · Posted 05-03-2012 08:56 PM

I can share a traditional way of deciding how best to bin a continuous variable (using best KS). First, determine the range of income and split it roughly into 8-10 bins. For example, if it ranges from 20,000 to 100,000 you can start with

- 20K - 30K

- 30K - 40K

- ...

- 90K and above

For each bin you know the actual good/bad distribution, so you can compute a KS value (the absolute difference between cumulative good % and cumulative bad % at that bin). The bin giving the highest KS is used as the first cut-off to split the variable into two bins; e.g., if 60K has the best KS, your first split is <=60K and >60K. You can then repeat the best-KS method on the <=60K distribution to come up with another suitable cut-off, and similarly on >60K, and so on. Keep doing this until you reach a reasonable level of KS while also maintaining ranking. This is an iterative process and quite useful.
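The best-KS procedure described above can be sketched in Python (not SAS) roughly as follows. The data are made up, and the names `best_ks_cutoff` and `ks_splits` plus the stopping thresholds `min_ks` and `min_size` are assumptions for illustration. The sketch also does not check that bad rates stay rank-ordered across bins, which would still have to be verified separately.

```python
def best_ks_cutoff(records, candidates):
    """records: (income, bad) pairs with bad in {0, 1}.

    Returns (cutoff, ks) for the candidate with the largest
    |cumulative good % - cumulative bad %| among income <= cutoff.
    """
    total_bad = sum(bad for _, bad in records)
    total_good = len(records) - total_bad
    if total_bad == 0 or total_good == 0:
        return None, 0.0
    best_cut, best_ks = None, 0.0
    for cut in candidates:
        bad_below = sum(bad for inc, bad in records if inc <= cut)
        good_below = sum(1 - bad for inc, bad in records if inc <= cut)
        ks = abs(good_below / total_good - bad_below / total_bad)
        if ks > best_ks:
            best_cut, best_ks = cut, ks
    return best_cut, best_ks

def ks_splits(records, candidates, min_ks=0.2, min_size=4):
    """Recursively split at the best-KS cutoff; returns sorted cutoffs.

    Stops when KS is below min_ks or a side would have < min_size rows.
    """
    cut, ks = best_ks_cutoff(records, candidates)
    if cut is None or ks < min_ks:
        return []
    low = [r for r in records if r[0] <= cut]
    high = [r for r in records if r[0] > cut]
    if len(low) < min_size or len(high) < min_size:
        return []
    lower = ks_splits(low, [c for c in candidates if c < cut], min_ks, min_size)
    upper = ks_splits(high, [c for c in candidates if c > cut], min_ks, min_size)
    return lower + [cut] + upper

# Hypothetical (income, bad) records; bad = 1 means a bad credit rating.
records = [(25_000, 1), (28_000, 1), (32_000, 1), (35_000, 0),
           (42_000, 1), (45_000, 1), (52_000, 0), (58_000, 1),
           (63_000, 0), (67_000, 0), (72_000, 1), (76_000, 0),
           (83_000, 0), (88_000, 0), (93_000, 0), (98_000, 0)]
candidates = list(range(30_000, 100_000, 10_000))  # 30K, 40K, ..., 90K

first_cut, first_ks = best_ks_cutoff(records, candidates)
cutoffs = ks_splits(records, candidates)
print(first_cut, round(first_ks, 3))
print(cutoffs)
```

The minimum bin size is there so the recursion does not carve out tiny bins that happen to show a high KS by chance; the right thresholds depend on the size of the real data set.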

PGStats · Posted 05-03-2012 09:21 PM

Classification (or decision) trees would give you optimal cutting points, i.e., the income-level categories that make the most difference in credit rating. They are available as Partition analysis in JMP and as Decision Trees in SAS Enterprise Miner.

PG
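To show the idea behind a tree-based cutpoint, here is a minimal Python sketch (not SAS) of a single split chosen by Gini impurity, which is one common splitting criterion. It is an assumption that this matches what JMP or Enterprise Miner do internally; they may use other criteria. The data and function names are made up for illustration.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_gini_split(records):
    """records: (income, bad) pairs. Returns (threshold, impurity).

    Tries the midpoint between each pair of adjacent distinct incomes
    and keeps the one with the lowest size-weighted Gini impurity.
    """
    records = sorted(records)
    n = len(records)
    best_threshold, best_impurity = None, float("inf")
    for i in range(1, n):
        lo, hi = records[i - 1][0], records[i][0]
        if lo == hi:
            continue  # no threshold can separate equal incomes
        left = [bad for _, bad in records[:i]]
        right = [bad for _, bad in records[i:]]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if impurity < best_impurity:
            best_threshold, best_impurity = (lo + hi) / 2, impurity
    return best_threshold, best_impurity

# Hypothetical (income, bad) records; bad = 1 means a bad credit rating.
records = [(25_000, 1), (28_000, 1), (32_000, 1), (35_000, 0),
           (42_000, 1), (45_000, 1), (52_000, 0), (58_000, 1),
           (63_000, 0), (67_000, 0), (72_000, 1), (76_000, 0),
           (83_000, 0), (88_000, 0), (93_000, 0), (98_000, 0)]

threshold, impurity = best_gini_split(records)
print(threshold, impurity)
```

A full tree would apply the same search again within each side of the split, which is how it arrives at several income categories rather than just two.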

doudou66 · Posted 05-04-2012 11:44 AM

Thank you all very much for great help.