N/A
Posts: 1

# how to split a bimodal distribution variable into two separate variables?

Hi,

I'm using EM4.3. I have a data set that contains a variable that is bimodal. The two components are very clearly delineated and do not seem to interfere or overlap with each other. I can separate them on a chart using a Distribution Explorer node, but how can I dump each hump into a new variable so I can investigate them individually?

deege

SAS Employee
Posts: 180

## Re: how to split a bimodal distribution variable into two separate variables?

> I have a data set that contains a variable that is bimodal. The two components are very clearly delineated and do not seem to interfere or overlap with each other. I can separate them on a chart using a Distribution Explorer node, but how can I dump each hump into a new variable so I can investigate them individually?

The question I have is: why would you want to split it into two separate variables? Suppose you had 10 observations and variable X has five large values and five small values. By creating two new variables (say X1 and X2) from X, you effectively create observations that each have a missing value in either X1 or X2. Methods such as regression and neural networks rely on complete data, so you would be forced to impute those missing values, which doesn't make sense in this case. Even methods that can use data with missing values would not benefit from this split.
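To make the ten-observation example concrete, here is a minimal sketch in plain Python (not Enterprise Miner code). The data values and the cut point `CUTOFF` are made up for illustration, and `None` stands in for a missing value.

```python
# Hypothetical data: ten observations of X, five small and five large.
x = [1.2, 0.8, 1.1, 0.9, 1.0, 9.8, 10.3, 10.1, 9.9, 10.2]

# Assumed cut point somewhere in the gap between the two humps.
CUTOFF = 5.0

# Split X into X1 (lower hump) and X2 (upper hump).
# Every observation falls in exactly one hump, so the other variable is missing.
x1 = [v if v < CUTOFF else None for v in x]
x2 = [v if v >= CUTOFF else None for v in x]

rows = list(zip(x, x1, x2))

# A "complete case" has a value for both X1 and X2.
complete_rows = [r for r in rows if r[1] is not None and r[2] is not None]

for r in rows:
    print(r)
print("complete cases:", len(complete_rows))
```

Every row ends up with one of X1 or X2 missing, so a method that needs complete cases has no usable rows unless the gaps are imputed.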

It is possible you are trying to make the distribution of your input variable closer to normal. Note, though, that with regression models the assumption is on the distribution of the error terms rather than on the distribution of the variable itself. So the model

Y  =  b*X + e

assumes that the error term e is normally distributed. Having said that, in most data mining scenarios you end up imputing missing values, so many of the classical statistical approaches that rely on accurate estimates of the error (e.g. confidence intervals, hypothesis tests) are no longer as meaningful: you have essentially made up some of your data, which gives you more degrees of freedom than are actually present in the original data. As a result, I don't believe that dividing the bimodal variable into two variables will help you.
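The point about the error term can be checked with a small simulation. This is only a sketch in plain Python under assumed values: the slope `B`, the hump centers (0 and 10), the sample size, and the "share near the center" measure are all chosen for illustration, and the measure is a rough indicator, not a formal normality test.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

B = 2.0  # assumed true slope in Y = b*X + e

# Bimodal input X: half the values near 0, half near 10 (assumed centers).
x = [random.gauss(0.0, 1.0) for _ in range(500)] + \
    [random.gauss(10.0, 1.0) for _ in range(500)]

# Response generated with a normally distributed error term e.
y = [B * xi + random.gauss(0.0, 1.0) for xi in x]

# Least-squares estimate of b for the no-intercept model Y = b*X + e.
b_hat = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

# Residuals are the estimated error terms.
residuals = [yi - b_hat * xi for xi, yi in zip(x, y)]


def share_near_center(values, width=2.0):
    """Fraction of values within `width` of their own mean."""
    m = statistics.mean(values)
    return sum(abs(v - m) <= width for v in values) / len(values)


print("estimated slope:", round(b_hat, 3))
print("share of X near its mean:", share_near_center(x))
print("share of residuals near their mean:", share_near_center(residuals))
```

In this setup almost none of X sits near its own mean (the gap between the humps), while most residuals cluster around zero. The input stays bimodal, yet the quantity the model makes an assumption about is single-peaked.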

If there is another benefit that I have not considered, please let me know and I will be happy to try and help.

I hope this helps!

Doug
