deege
Calcite | Level 5

Hi,

I'm using EM4.3. I have a data set that contains a variable that is bimodal. The two components are very clearly delineated and do not seem to interfere or overlap with each other. I can separate them on a chart using a Distribution Explorer node, but how can I dump each hump into a new variable so I can investigate them individually?

deege

DougWielenga
SAS Employee


The question I have is: why would you want to split it into two separate variables? Suppose you had 10 observations and variable X has five large values and five small values. By creating two new variables (say X1 and X2) from X, you effectively create observations that each have a missing value in either X1 or X2. Methods such as regression and neural networks rely on complete data, so you would be forced to impute those missing values, which doesn't make sense in this case. Even methods that can use data with missing values would not benefit from this split.
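To make the ten-observation example concrete, here is a minimal Python sketch (not Enterprise Miner code). The values in `x` and the `CUTOFF` split point are made up for illustration; they do not come from the original data set.

```python
# Hypothetical data: five small and five large values of X.
x = [1.2, 0.8, 1.5, 1.1, 0.9, 9.7, 10.4, 10.1, 9.5, 10.3]
CUTOFF = 5.0  # assumed split point between the two humps

# X1 keeps the low hump and X2 keeps the high hump;
# whichever variable does not apply to a row is missing (None).
x1 = [v if v < CUTOFF else None for v in x]
x2 = [v if v >= CUTOFF else None for v in x]

missing_x1 = sum(v is None for v in x1)
missing_x2 = sum(v is None for v in x2)
# Rows where both X1 and X2 are present, i.e. usable without imputation.
complete_rows = sum(a is not None and b is not None for a, b in zip(x1, x2))

print(missing_x1, missing_x2, complete_rows)  # prints: 5 5 0

# One possible alternative (an assumption, not something proposed in this reply):
# a 0/1 flag records hump membership without creating any missing values.
hump = [int(v >= CUTOFF) for v in x]
print(hump)
```

Every row is missing either X1 or X2, so no row is complete. If the goal is only to examine each hump on its own, a grouping flag like `hump` may be enough.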

It is possible you are trying to make the distribution of your input variable closer to normal, but note that with regression models, the assumption is on the distribution of the error terms rather than on the distribution of the variable itself. So the model

 

Y  =  b*X + e  

 

assumes that the error term e is normally distributed. Having said that, in most data mining scenarios you end up imputing missing values, so many of the classical statistical approaches that rely on accurate estimates of the error (e.g. confidence intervals, hypothesis tests) are no longer as meaningful: you have essentially made up some of your data, resulting in more degrees of freedom than are actually present in the original data. As a result, I don't believe that dividing the bimodal variable into two variables will help you.
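The point about the error term can be checked with a small simulation. This is a rough Python sketch using only the standard library. The hump locations (2 and 10), the slope `B_TRUE`, the sample size, and the helper `excess_kurtosis` are all assumptions chosen for illustration. Excess kurtosis is near 0 for normal data and strongly negative for a well-separated bimodal variable.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
N = 2000
B_TRUE = 3.0  # assumed slope for the simulated model Y = b*X + e

# Bimodal X: half the values near 2, half near 10 (assumed hump locations).
x = [rng.gauss(2.0, 1.0) if i % 2 == 0 else rng.gauss(10.0, 1.0) for i in range(N)]
# Normally distributed error term e, independent of X.
y = [B_TRUE * xi + rng.gauss(0.0, 1.0) for xi in x]

# Least-squares slope for the no-intercept model Y = b*X + e.
b_hat = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
residuals = [yi - b_hat * xi for xi, yi in zip(x, y)]

def excess_kurtosis(values):
    """Fourth standardized central moment minus 3 (about 0 for normal data)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 * m2) - 3.0

kurt_x = excess_kurtosis(x)
kurt_resid = excess_kurtosis(residuals)

print("estimated b:", round(b_hat, 3))
print("excess kurtosis of X:", round(kurt_x, 2))
print("excess kurtosis of residuals:", round(kurt_resid, 2))
```

In this setup X is clearly non-normal, yet the residuals look approximately normal and the slope is recovered, which is consistent with the assumption applying to e rather than to X.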


If there is another benefit that I have not considered, please let me know and I will be happy to try and help.

 

I hope this helps!

Doug
