02-24-2016 09:26 AM - edited 02-24-2016 10:55 AM
I am trying gradient boosting on a dataset of 3.4 million rows with an event rate of 1.2%. I ran it with the following parameters:
But it fails to produce any results: the output is blank and there is no variable importance.
02-24-2016 12:42 PM - edited 02-24-2016 12:54 PM
Hi, can you please check that your flow is treating your target as binary, as opposed to interval? From your output, it looks like the target is interval, which doesn't seem to fit with your problem description.
The results you are seeing can be reproduced with randomly generated x and y variables, so most likely none of your predictors is related to the target.
02-26-2016 11:55 AM
Believe it or not, you are now a little closer.
If you build a decision tree, do you actually get a tree or just a root node?
Your event level is quite rare. This is very common in predictive modeling, but you will probably need to use sampling techniques like oversampling in order to get a good model.
Here's a really good article (see the section 'inadequate or excessive input data') that talks through sampling and adjusting your target profile in Enterprise Miner.
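To make the sampling idea concrete, here is a minimal Python sketch (not Enterprise Miner code). It assumes "oversampling" means keeping every event and randomly downsampling the non-events, and that predicted probabilities from the balanced sample are then mapped back to the true event rate. The function names, the `target` key, and the 50/50 default are all hypothetical choices made for illustration.

```python
import random

def oversample_events(rows, target_key="target", event_share=0.5, seed=0):
    """Keep all events; downsample non-events so events are ~event_share of the sample."""
    rng = random.Random(seed)
    events = [r for r in rows if r[target_key] == 1]
    non_events = [r for r in rows if r[target_key] == 0]
    # Number of non-events needed to reach the requested event share.
    n_non = int(round(len(events) * (1 - event_share) / event_share))
    n_non = min(n_non, len(non_events))
    sample = events + rng.sample(non_events, n_non)
    rng.shuffle(sample)
    return sample

def adjust_for_priors(p_sample, true_event_rate, sample_event_rate):
    """Map a probability estimated on the balanced sample back to the original population."""
    num = p_sample * true_event_rate / sample_event_rate
    den = num + (1 - p_sample) * (1 - true_event_rate) / (1 - sample_event_rate)
    return num / den

# Toy data with a 1.2% event rate (12 events in 1000 rows).
rows = [{"target": 1 if i < 12 else 0} for i in range(1000)]
sample = oversample_events(rows)
print(len(sample), sum(r["target"] for r in sample))
print(adjust_for_priors(0.5, true_event_rate=0.012, sample_event_rate=0.5))
```

A score of 0.5 on the balanced sample corresponds to the population base rate once the priors are put back, which is why the cutoff or the priors need adjusting after sampling.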
Also, since this topic comes up from time to time, take a look at some of the other Community posts on rare events.
I hope this helps,
02-26-2016 03:21 PM
Try adding a Decisions (Assess menu) node to your flow between your input data source and your boosting node.
Use the following property settings:
Apply Decisions = Yes
Decisions = Custom
In the custom editor, specify Yes in the Decisions tab.
Assuming your target is binary with 1 representing the event value, specify a matrix like this in the Decision Weights tab:
Then press OK and rerun your flow.
This will change the probability cutoff for predicting 0 and 1. You will know it worked if you see two rows for 'profit' in your boosting node results. If you don't, then the decision matrix wasn't actually picked up and you will need to check your Decisions settings.
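As a rough illustration of why a decision matrix moves the cutoff, the Python sketch below picks the decision with the highest expected profit. The weights here are hypothetical inverse-prior values chosen for the example (they are not the matrix from this post), and `best_decision` and `implied_cutoff` are made-up helper names, not Enterprise Miner functions.

```python
def best_decision(p_event, profit):
    """Return the decision (0 or 1) with the highest expected profit.

    profit[target][decision] is the payoff of making `decision` when the
    true outcome is `target`.
    """
    expected = {}
    for decision in (0, 1):
        expected[decision] = (p_event * profit[1][decision]
                              + (1 - p_event) * profit[0][decision])
    return max(expected, key=expected.get), expected

def implied_cutoff(profit):
    """Probability above which predicting 1 has higher expected profit than predicting 0."""
    gain_event = profit[1][1] - profit[1][0]      # benefit of catching an event
    gain_non_event = profit[0][0] - profit[0][1]  # benefit of leaving a non-event alone
    return gain_non_event / (gain_non_event + gain_event)

# Hypothetical weights: correct decisions are rewarded by the inverse of the
# class prior (1.2% events), wrong decisions earn nothing.
prior = 0.012
profit = {
    1: {1: 1 / prior, 0: 0.0},
    0: {1: 0.0, 0: 1 / (1 - prior)},
}

print(implied_cutoff(profit))            # close to the event rate, not 0.5
print(best_decision(0.05, profit)[0])    # a 5% score is classified as an event
print(best_decision(0.005, profit)[0])   # a 0.5% score is not
```

With equal weights on the diagonal the implied cutoff is 0.5, which on a 1.2% event rate means almost nothing is ever predicted as 1.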
Remember to also check out the article mentioned above for some background.
You can also tweak some of the GB properties. The 'traditional favorites' are N iterations and Shrinkage.
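For intuition about what those two properties control, here is a toy gradient boosting loop in plain Python. It is only a sketch of the general technique under simplifying assumptions (squared-error loss, one input, one-split stumps) and is not how the Enterprise Miner node is implemented. The names `fit_stump`, `boost`, `n_iterations` and `shrinkage` are illustrative.

```python
def fit_stump(xs, residuals):
    """Find the single split on x that minimises squared error of the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1], best[2], best[3]

def boost(xs, ys, n_iterations, shrinkage):
    """Fit stumps to the current residuals, adding each one scaled by `shrinkage`."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    for _ in range(n_iterations):
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        preds = [p + shrinkage * (left_mean if x <= threshold else right_mean)
                 for x, p in zip(xs, preds)]
    return preds

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

xs = list(range(10))
ys = [x * x for x in xs]
for n_iterations in (5, 50):
    preds = boost(xs, ys, n_iterations=n_iterations, shrinkage=0.1)
    print(n_iterations, round(mse(ys, preds), 3))
```

Each iteration only corrects a fraction (the shrinkage) of the remaining error, so a small shrinkage generally needs more iterations to reach the same training fit.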
02-29-2016 05:20 AM
Well, I added the Decisions node, but it doesn't seem to change the results. It shows the lift chart, but not much else. How do I access the model accuracy? It doesn't even show the variable importance, although I remember that variable importance and ROC values are available in the R and Python packages.
03-23-2016 10:29 PM - edited 03-23-2016 11:09 PM
03-24-2016 01:36 AM
Are you suggesting that I should try removing the surrogate rules? Well, that is what I tried initially.
Or are you suggesting that gradient boosting should not be used with high-dimensional data, which might have a lot of noise?
The only time it worked on my data was when I had around 30-40 variables and had oversampled, as my event rate was about 1.22%.
But some people say that boosting should work.
03-30-2016 10:02 PM
04-12-2016 05:36 AM