08-21-2014 12:45 PM
In my project, I have found that MBR seems to be quite good at identifying my rare event phenomena. From what I have read in the literature, MBR is a nearest-neighbor approach that, when dealing with rare events, needs boosting/oversampling of the training data. People also say that it prefers uncorrelated input variables, and that range scaling is needed. I assume the range scaling is so that variables with a wide range of values do not dominate the decisions by virtue of their size.
Can you tell me if MBR takes care of the range scaling, and whether it also removes or diminishes the influence of highly correlated variables? If it does not, then I will need to do extra work to identify these issues and rescale first, or remove variables via the Metadata node. Thanks for any information!
09-02-2014 10:48 AM
Your reasoning as to why range scaling is needed is correct - variables with larger ranges will dominate a nearest neighbor approach. The MBR node does not do any range scaling of your data, so you will need to handle this portion of the process external to the MBR node. You can range scale interval data by using the Transform Variables Node, and for the "Interval Inputs" property, select "Range."
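To make the scaling point concrete, here is a minimal Python sketch (not SAS code, and not what any Enterprise Miner node runs internally). It assumes range scaling means the usual min-max rescaling to [0, 1] and uses plain Euclidean distance; the income and age values are made up for illustration.

```python
def range_scale(column):
    """Min-max scale a list of numbers to [0, 1] (constant columns map to 0)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def nearest_to_first(rows):
    """Index of the row closest to rows[0], excluding row 0 itself."""
    return min(range(1, len(rows)), key=lambda i: euclidean(rows[0], rows[i]))


# Hypothetical data: income spans tens of thousands, age spans decades.
income = [30000.0, 32000.0, 45000.0, 90000.0]
age = [25.0, 60.0, 27.0, 40.0]

raw = list(zip(income, age))
scaled = list(zip(range_scale(income), range_scale(age)))

# Unscaled, income dominates, so the 35-year age gap to row 1 is ignored.
raw_neighbor = nearest_to_first(raw)
# After scaling, both inputs contribute on the same footing.
scaled_neighbor = nearest_to_first(scaled)
```

In this toy data the nearest neighbor of the first row changes once both inputs are put on the same scale, which is the effect you described.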
As for correlated variables, there are two answers... the MBR node can weight inputs based on their correlation with the target, but it will NOT handle correlation between input variables.
Each input variable is weighted by the absolute value of its correlation with the target variable. This only applies to interval targets or binary targets (nominal targets with multiple levels will NOT have a weighting based on correlation). The "Weighted" property on the MBR Node controls whether you want weighting or not. This is different from your question, which I think is asking about correlated input variables.
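As a rough illustration of weighting inputs by their absolute correlation with the target, here is a small Python sketch. How the weight enters the distance is my assumption (each squared difference is multiplied by the weight); I am not claiming this is the exact formula the MBR node uses. The function names and data are hypothetical.

```python
def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def target_weights(columns, target):
    """One weight per input: |correlation with the target|."""
    return [abs(pearson(col, target)) for col in columns]


def weighted_distance(a, b, weights):
    """Euclidean distance with each squared difference scaled by its weight (assumed form)."""
    return sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)) ** 0.5


# Hypothetical, already range-scaled inputs and a binary target.
x1 = [0.0, 0.2, 0.8, 1.0]   # tracks the target closely
x2 = [0.0, 1.0, 1.0, 0.0]   # unrelated to the target
target = [0.0, 0.0, 1.0, 1.0]

weights = target_weights([x1, x2], target)
d = weighted_distance((x1[0], x2[0]), (x1[1], x2[1]), weights)
```

Here x2 gets a weight of essentially zero, so the large gap in x2 between the first two rows barely affects their distance. Note that this does nothing about two inputs being correlated with each other.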
You are correct that highly correlated input variables will skew the nearest neighbor results: the underlying mechanism they share is effectively counted more than once in the distance, so the results favor it. This is most likely a problem, though, only when the correlation is strong. Highly correlated input variables affect more methods than just the MBR, so I would recommend that you always consider handling correlated variables.
You can use the StatExplore node in Enterprise Miner to determine correlations, and then use the Metadata Node to reject some of these correlated variables.
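If you want to sanity-check the same idea outside Enterprise Miner, the screening step can be sketched in Python as below. This is only an illustration of flagging input pairs by pairwise Pearson correlation; the 0.9 cutoff, the variable names, and the data are assumptions, not anything StatExplore prescribes.

```python
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def correlated_pairs(inputs, threshold=0.9):
    """Return (name_a, name_b, r) for every input pair with |r| >= threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(inputs.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged


# Hypothetical inputs: the two height columns carry the same information.
inputs = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [40.0, 22.0, 35.0, 58.0, 31.0],
}

flagged = correlated_pairs(inputs)
```

Each flagged pair is a candidate for rejecting one of the two variables; which one to keep is a judgment call based on your data.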
I hope this helps!