- how to split a bimodal distribution variable into ...

05-19-2012 07:10 PM

Hi,

I'm using EM4.3. I have a data set that contains a bimodal variable. The two components are very clearly delineated and do not seem to interfere or overlap with each other. I can separate them on a chart using a Distribution Explorer node, but how can I dump each hump into a new variable so I can investigate them individually?

deege


Tuesday

I have a data set that contains a variable that is bimodal. The two components are very clearly delineated and do not seem to interfere or overlap with each other. I can separate them on a chart using a Distribution Explorer node, but how can I dump each hump into a new variable so I can investigate them individually?

The question I have is: why would you want to split it into two separate variables? Suppose you had 10 observations and variable X has five large values and five small values. By creating two new variables (say X1 and X2) from X, you effectively create observations that each have a missing value in either X1 or X2. Methods such as regression and neural networks rely on complete data, so you would be forced to impute those missing values, which doesn't make sense in this case. Even methods that can use data with missing values would not benefit from this split.
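To make the 10-observation example concrete, here is a rough sketch in Python (not Enterprise Miner code). The data values and the cutoff of 5.0 are made up for illustration, and `None` stands in for a missing value.

```python
# Hypothetical data: ten observations of a bimodal variable X,
# five small values and five large values (all values are made up).
x = [1.2, 0.8, 1.1, 0.9, 1.0, 9.7, 10.3, 10.1, 9.9, 10.0]

# Assumed cut point sitting in the gap between the two humps.
CUTOFF = 5.0

# Split X into X1 (lower hump) and X2 (upper hump); None marks a missing value.
x1 = [v if v < CUTOFF else None for v in x]
x2 = [v if v >= CUTOFF else None for v in x]

missing_x1 = sum(v is None for v in x1)
missing_x2 = sum(v is None for v in x2)

# Rows where both X1 and X2 are present, i.e. usable by a complete-case method.
complete_rows = sum(a is not None and b is not None for a, b in zip(x1, x2))

print(missing_x1, missing_x2, complete_rows)  # 5 5 0
```

Every row is missing one of the two new variables, so a method that needs complete data has no usable rows until something is imputed.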

You may be trying to make the distribution of your input variable closer to normal. Note, however, that in regression models the normality assumption applies to the error terms, not to the input variable itself. So the model

Y = b*X + e

assumes that the error term e is normally distributed. Having said that, in most data mining scenarios you end up imputing missing values, so many of the classical statistical approaches that rely on accurate estimates of the error (e.g. confidence intervals, hypothesis tests) are no longer as meaningful: you have essentially made up some of your data, so the analysis assumes more degrees of freedom than are actually present in the original data. As a result, I don't believe that dividing the bimodal variable into two variables will help you.
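As a rough illustration of the point about error terms, the simulation below (plain Python, not Enterprise Miner) fits Y = b*X + e to a bimodal X. Everything in it is an assumption made for the sketch: the hump locations, the slope of 3.0, the sample size, and the helper `share_near_mean`, which is only a crude way to see whether values pile up near the mean.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

N = 2000
TRUE_B = 3.0  # assumed slope, used only to simulate data

# Bimodal input: half the values near 2, half near 10 (made-up humps).
x = [random.gauss(2.0, 0.5) if i % 2 == 0 else random.gauss(10.0, 0.5)
     for i in range(N)]

# Response generated with normally distributed errors e.
y = [TRUE_B * xi + random.gauss(0.0, 1.0) for xi in x]

# Least-squares slope for the no-intercept model Y = b*X + e.
b_hat = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
residuals = [yi - b_hat * xi for xi, yi in zip(x, y)]


def share_near_mean(values, width=0.5):
    """Fraction of values within `width` standard deviations of their mean."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(abs(v - m) <= width * s for v in values) / len(values)


# For a normal distribution this share is roughly 38%; for two well
# separated humps it is close to zero, because the mean falls in the gap.
share_x = share_near_mean(x)
share_res = share_near_mean(residuals)

print(f"b_hat={b_hat:.3f} share_x={share_x:.3f} share_res={share_res:.3f}")
```

In this setup the input stays clearly bimodal while the residuals look like a single bell-shaped hump, and the slope is still recovered without splitting X.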

If there is another benefit that I have not considered, please let me know and I will be happy to try and help.

I hope this helps!

Doug