Re: Bin continuous variable wrt. distributional properties

deleted_user · Posted 08-30-2010 02:12 PM

Is there a binning method that takes the distributional properties within bands into account? I want the distributions within bands to be similar across the sample, with the lowest possible variation, while still limiting the number of bands created.

Please guide me 🙂

topkatz · Posted 09-13-2010 08:13 PM

Hi.

I have some interest in binning, and might have some suggestions for you, except I don't understand your question.

I think of binning as splitting the entire range of values of a variable into a finite set of disjoint partitions. For continuous-valued variables, these partitions are usually subintervals. For example, suppose my original variable takes values from 0 to 100. Then, I might have the following four partitions / bins / sub-intervals:
[0, 21.7), [21.7, 38.4), [38.4, 67.9), [67.9, 100].
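
To make the example concrete, here is a minimal Python sketch (not part of the original post) that assigns a value to one of those four bins. The function name `bin_index` and the sample values are made up for illustration; the cut points are the ones listed above.

```python
from bisect import bisect_right

# Interior cut points of the four example bins; the outer limits 0 and 100 are implied.
CUTS = [21.7, 38.4, 67.9]

def bin_index(x, cuts=CUTS, lo=0.0, hi=100.0):
    """Return the 0-based bin index of x.

    Bins are closed on the left and open on the right, except the last
    one, which also contains hi.
    """
    if not lo <= x <= hi:
        raise ValueError(f"{x} is outside [{lo}, {hi}]")
    # bisect_right counts the cut points that are <= x, which is the bin index.
    return bisect_right(cuts, x)

values = [0, 21.7, 38.39, 67.9, 100]
print([bin_index(v) for v in values])  # prints [0, 1, 1, 3, 3]
```

A value equal to a cut point lands in the bin to its right, which matches the half-open intervals written above.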

When you say "bands" are you referring to what I call bins or partitions? The values in any two different bins are completely disjoint, so what do you mean by having "the distributions within bands to be similar across the sample?"

Do you want the histograms of the values within each bin to have the same shape? Unless you have a uniform distribution a priori, I think that would be almost impossible to achieve.

Or perhaps you're saying that when you partition your data into training, test, and validation sets, you want the distribution of values in each bin to look similar in each of the training, test, and validation partitions? If you have a sufficient amount of data, uniform random sampling without replacement should accomplish that.
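
A rough way to check that claim is to split some data at random and compare the share of observations per bin in each part. The sketch below assumes synthetic, skewed data and a 60/20/20 split; the names `bin_shares` and `random_split` are hypothetical, and the cut points are reused from the example bins.

```python
import random
from bisect import bisect_right
from collections import Counter

CUTS = [21.7, 38.4, 67.9]  # interior cut points of the example bins on [0, 100]

def bin_shares(values, cuts=CUTS):
    """Fraction of the values that falls in each bin."""
    counts = Counter(bisect_right(cuts, v) for v in values)
    return [counts[b] / len(values) for b in range(len(cuts) + 1)]

def random_split(values, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle once, then slice into training, test and validation parts.

    Shuffling and slicing is uniform random sampling without replacement.
    """
    rng = random.Random(seed)
    shuffled = list(values)
    rng.shuffle(shuffled)
    n = len(shuffled)
    first = round(fractions[0] * n)
    second = first + round(fractions[1] * n)
    return shuffled[:first], shuffled[first:second], shuffled[second:]

# Synthetic, right-skewed data on [0, 100); purely illustrative.
rng = random.Random(42)
data = [100 * rng.random() ** 2 for _ in range(20000)]

for name, part in zip(("train", "test", "valid"), random_split(data)):
    print(name, [round(s, 3) for s in bin_shares(part)])
```

With this much data the three rows of bin shares should come out close to each other; with a small sample they can differ noticeably.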

You want a small number of bins with low variation? Does this mean you want the sums of squares around the means within each bin to be small? I could show you how to achieve this goal through integer programming, but you'd have to establish a minimum bin size or maximum number of bins, otherwise every point would be its own bin, and you'd have zero variance. But I have to say I don't understand why you'd need to do this. If this is the kind of thing you want, would you care to explain why?
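
For a single variable, the minimum sum-of-squares partition can also be found without an integer programming solver, because an optimal bin is always a contiguous run of the sorted values. The Python sketch below is a dynamic programming alternative, not the integer program mentioned above; the function name `optimal_bins` and the toy data are assumptions, and the number of bins `k` has to be fixed in advance, as noted.

```python
def optimal_bins(values, k):
    """Split the sorted values into k contiguous bins that minimise the
    total within-bin sum of squares. Returns (total_sse, bins)."""
    xs = sorted(values)
    n = len(xs)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of values")

    # Prefix sums give the sum of squares of any slice in constant time.
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, x in enumerate(xs):
        s1[i + 1] = s1[i] + x
        s2[i + 1] = s2[i] + x * x

    def sse(i, j):
        """Sum of squared deviations from the mean of xs[i:j], with j > i."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # cost[m][j]: best total SSE when the first j points form m bins.
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):  # the last bin is xs[i:j]
                c = cost[m - 1][i] + sse(i, j)
                if c < cost[m][j]:
                    cost[m][j] = c
                    cut[m][j] = i

    # Walk the stored cut positions backwards to recover the bins.
    bins, j = [], n
    for m in range(k, 0, -1):
        i = cut[m][j]
        bins.append(xs[i:j])
        j = i
    bins.reverse()
    return cost[k][n], bins

total, bins = optimal_bins([1, 2, 3, 10, 11, 12, 50, 51, 52], 3)
print(total, bins)  # prints 6.0 [[1, 2, 3], [10, 11, 12], [50, 51, 52]]
```

The triple loop is quadratic in the number of points for each bin count, so this is only practical for moderate sample sizes. Setting `k` equal to the number of points gives the zero-variance solution described above.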

I think the two best reasons to use binning are the following:

1. The bins have some meaning in the context of the analysis. For example, suppose you're looking at a group of students from ages 5 to 18. You might want to bin them as 5 - 11, 12 - 14, 15 - 18 to encapsulate elementary, junior high, and high school ages.

2. The bins help you extract predictive information. Suppose you're trying to predict a binary outcome. If the likelihood of success increases or decreases monotonically as the variable values increase, then there's no reason to bin. But if there are alternating pockets of higher and lower likelihood, binning can help focus the predictive power of the variable.

Anyway, that's my opinion. But some more explanation from you may help us give you useful suggestions.

Good luck!

-- TMK --
T O P K A T Z at M S N dot C O M

deleted_user · Posted 09-14-2010 04:42 PM

Thanks topkatz for your very detailed reply. It's very much appreciated.

With bands I mean bins.

I wasn't clear enough. I wanted a binning method that could find disjoint distributions across the sample. This should be possible for empirical data with several peaks. Is this called a mixture distribution? Is there an analytical approach to it?
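
Data with several peaks is commonly modeled as a finite mixture distribution, and such a model is usually fitted numerically with the EM algorithm rather than in closed form. The Python sketch below fits two normal components to one variable. It is only an illustration under stated assumptions: two components, normal shapes, synthetic data, and a crude starting point; the function names are made up.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std. dev. sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def fit_two_component_mixture(values, iterations=100, min_sigma=1e-3):
    """EM for a two-component 1-D normal mixture.

    Returns (weights, means, sigmas), each a list of length 2.
    """
    xs = sorted(values)
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two values")
    half = n // 2
    # Crude start: lower half versus upper half of the sorted data.
    means = [sum(xs[:half]) / half, sum(xs[half:]) / (n - half)]
    start_sigma = max((xs[-1] - xs[0]) / 4.0, min_sigma)
    sigmas = [start_sigma, start_sigma]
    weights = [0.5, 0.5]

    for _ in range(iterations):
        # E-step: probability that each point belongs to component 0.
        resp0 = []
        for x in xs:
            p0 = weights[0] * normal_pdf(x, means[0], sigmas[0])
            p1 = weights[1] * normal_pdf(x, means[1], sigmas[1])
            resp0.append(p0 / (p0 + p1) if p0 + p1 > 0.0 else 0.5)
        # M-step: weighted mean, std. dev. and mixing weight per component.
        for k, resp in enumerate((resp0, [1.0 - r for r in resp0])):
            mass = sum(resp)
            if mass == 0.0:
                continue  # component collapsed; keep its old parameters
            means[k] = sum(r * x for r, x in zip(resp, xs)) / mass
            var = sum(r * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / mass
            sigmas[k] = max(math.sqrt(var), min_sigma)  # floor avoids sigma = 0
            weights[k] = mass / n
    return weights, means, sigmas

# Synthetic two-peak sample: half around 0, half around 10.
rng = random.Random(0)
sample = [rng.gauss(0, 1) for _ in range(300)] + [rng.gauss(10, 1) for _ in range(300)]
weights, means, sigmas = fit_two_component_mixture(sample)
print("weights", weights, "means", means, "sigmas", sigmas)
```

A bin boundary could then be placed where the two weighted densities cross, so that each bin corresponds to one peak. Whether that is sensible depends on how well separated the peaks actually are.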

"You want a small number of bins with low variation? Does this mean you want the sums of squares around the means within each bin to be small?"

Correct. Then it becomes an optimization problem since I still want a small number of bins. Integer programming may be the tool I'm looking for to accomplish this?

I am going to use the variable for segmentation, and I'm curious whether it is possible to use a univariate approach, "to let the data speak for itself", instead of a binning technique that compares against the state of some other variable.

Good points there on "why to bin".

/dante

oloolo · Posted 09-17-2010 11:25 AM

You can try P-spline smoothing using PROC MIXED and PROC GLIMMIX.


goladin · Posted 11-22-2010 09:10 PM

Hi,

I think you can write a macro to achieve this. Depending on the approach, be it top-down or bottom-up, you can avoid integer programming by using a recursive process. However, I believe you will still need to define a maximum number of bins.

The reason is that, if the only goal is to reduce variation, the ideal number of bins for any distribution is exactly the number of unique observations. So you have to keep that in mind when you are coding this.
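
As a rough illustration of the top-down idea, here is a Python sketch rather than a SAS macro. It repeatedly splits whichever bin gives the largest drop in within-bin sum of squares and stops at `max_bins`. This is a greedy heuristic and is not guaranteed to find the best partition; the function names and toy data are assumptions.

```python
def sse(xs):
    """Sum of squared deviations from the mean."""
    if not xs:
        return 0.0
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs)

def best_split(xs):
    """Return (gain, index) for the cut of sorted xs that lowers SSE the most.

    index is None when the bin holds a single distinct value.
    """
    base = sse(xs)
    best_gain, best_idx = 0.0, None
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # never separate tied values
        gain = base - sse(xs[:i]) - sse(xs[i:])
        if gain > best_gain:
            best_gain, best_idx = gain, i
    return best_gain, best_idx

def top_down_bins(values, max_bins):
    """Greedy top-down binning: split the most rewarding bin until max_bins."""
    bins = [sorted(values)]
    while len(bins) < max_bins:
        candidates = [(best_split(b), pos) for pos, b in enumerate(bins)]
        (gain, idx), pos = max(candidates, key=lambda c: c[0][0])
        if idx is None:
            break  # every bin already holds one distinct value
        b = bins[pos]
        bins[pos:pos + 1] = [b[:idx], b[idx:]]
    return bins

print(top_down_bins([1, 2, 3, 10, 11, 12, 50, 51, 52], 3))
# prints [[1, 2, 3], [10, 11, 12], [50, 51, 52]]

# Without a meaningful cap, splitting only stops at one bin per distinct value.
print(top_down_bins([3, 1, 2, 1], 100))  # prints [[1, 1], [2], [3]]
```

The second call shows why a maximum is needed: with the cap set far above the data size, the procedure ends with as many bins as there are unique observations and zero within-bin variation.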

Let me see whether I have time to build this little algorithm for you.

Regards,
Murphy

deleted_user · Posted 11-23-2010 03:36 PM

I decided on a univariate K-means clustering approach, thus minimizing the sum of squares given a pre-defined number of bins, without distributional considerations.
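
For readers who want to try the same idea, below is a minimal Python sketch of K-means on one variable (Lloyd's algorithm). It is not the code used for this post; the quantile-based starting centers, the function name `kmeans_1d`, and the toy data are assumptions.

```python
def kmeans_1d(values, k, max_iter=100):
    """Lloyd's algorithm on one variable. Returns (centers, clusters)."""
    xs = sorted(values)
    n = len(xs)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of values")
    # Start from evenly spaced quantiles (a deterministic, arbitrary choice).
    centers = [xs[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each value goes to its nearest center.
        clusters = [[] for _ in range(k)]
        for x in xs:
            nearest = min(range(k), key=lambda c: abs(x - centers[c]))
            clusters[nearest].append(x)
        # Update step: each center moves to the mean of its cluster.
        # An empty cluster keeps its previous center.
        new_centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        converged = new_centers == centers
        centers = new_centers
        if converged:
            break
    return centers, clusters

centers, clusters = kmeans_1d([1, 2, 3, 10, 11, 12, 50, 51, 52], 3)
print(centers)  # prints [2.0, 11.0, 51.0]

# Bin boundaries are the midpoints between neighbouring centers.
cut_points = [(a + b) / 2 for a, b in zip(centers, centers[1:])]
print(cut_points)  # prints [6.5, 31.0]
```

K-means can settle on a local optimum that depends on the starting centers, so on real data it is worth comparing a few different starts.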

So I abandoned my initial plan, but I needed a solution. The number of bins was chosen mainly for practical reasons.
