11-29-2016 11:53 AM
I have a data set with 24 000 training observations. I am using a decision tree in Enterprise Miner 14.1 to build a predictive model. If I set the Leaf Size under Node to 100 (default is 5), I expect the program not to create any leaf smaller than 100 observations. When I inspect the tree, it first splits on a binary variable, where 13 000 observations have a 0. These 13 000 observations are not split any further.
If I change back to the default value of 5, however, the tree splits these 13 000 into two groups of about 6 500, and continues splitting nodes.
The description of leaf size says: "Specifies the smallest number of training observations that a leaf can have."
In both cases, the node in question has far more than 100 observations, so why does the procedure not split the 13 000 observations even with a minimum leaf size of 100? Am I totally misinterpreting what this means?
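To make the expectation above concrete, here is a minimal Python sketch of the plain reading of the leaf size rule: a binary split is admissible only if both children keep at least the minimum number of observations. This is an illustration of that reading only, not Enterprise Miner's actual split search; the function name `admissible_binary_splits` and the 6 500 / 6 500 example node are assumptions based on the numbers in the post.

```python
def admissible_binary_splits(values, min_leaf):
    """List (threshold, n_left, n_right) for every binary split of `values`
    in which both children contain at least `min_leaf` observations.

    Hypothetical helper: it only encodes the documented meaning of a
    minimum leaf size, not any vendor-specific search logic.
    """
    ordered = sorted(values)
    n = len(ordered)
    splits = []
    for i in range(1, n):
        if ordered[i] == ordered[i - 1]:
            continue  # no cut point between two equal values
        n_left, n_right = i, n - i
        if n_left >= min_leaf and n_right >= min_leaf:
            # Observations with value < ordered[i] go left, the rest go right.
            splits.append((ordered[i], n_left, n_right))
    return splits


# Assumed example: a 13 000-observation node and a binary input
# that divides it into two groups of 6 500.
node = [0] * 6500 + [1] * 6500
print(admissible_binary_splits(node, min_leaf=100))
print(admissible_binary_splits(node, min_leaf=5))
```

Under this reading, the same 6 500 / 6 500 split is admissible with a minimum leaf size of 5 and of 100, which is why the unsplit node looks surprising.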
11-30-2016 10:04 AM
You understand it perfectly. The software appears to be confused. Something is confusing it.
There is a parameter to set the within-node sample size. (I do not remember its name.) Set it to something larger than 24,000.
If that does not cure it, I wonder whether there is a FREQ variable, and what the counts of the target classes are.
Let me know and we can go from there.
11-30-2016 10:51 AM
Tried adjusting the Node Sample under Split Search from 20 000 to 100 000, but nothing changes.
The training sample is 24 000, with 22% 1 and 78% 0 on the target variable.
I tried using the interactive tree to check the node of 13 000 that won't be split. But in the interactive tree, both the training and validation data are set to 10 000, even though the actual tree has 24 000 and 16 000.
11-30-2016 12:09 PM
I'm stumped. May I use your data to reproduce it? If so, you can either upload the data to the community, or upload it privately through SAS Technical Support. In that case tell Tech Support that "Padraic Neville wants the data in order to investigate the problem," and they will quickly let me know when it is available. Technical Support will want the site number that appears at the top of the SAS logs. In the Enterprise Miner, you can get the log by:
11-30-2016 05:19 PM
12-01-2016 05:04 AM
No, it doesn't remain at 2 branches; the tree splits into more than two leaves further down on the other branch. But I did try setting the maximum back to 2 branches, and in that case the tree splits the troublesome 13 000 several times. But as soon as I set the maximum branch to 3 or more, the tree collapses and won't split this group.
I have a dataset with over 200 000 cases, so I tried resampling the data with 30 000 cases, with the same result. I also tried 10 000; this time the tree won't split this group even with the default setting of a maximum of 2 branches.
Most of my input variables are binary, so building it with only continuous variables is not an option.
I suspect this must be due to the most significant variable, as you have alluded, so I have tried removing the most significant variable, which the tree uses for the first split.
After removing the variable, the tree seems to behave as expected, with no leaves smaller than 100 obs. But I still can't get my head around why setting the leaf size to 5 instead of 100 also works, even with the most significant variables included.
12-01-2016 05:39 AM - edited 12-01-2016 05:41 AM
I will look into the possibilities of sharing the data. This is enterprise data, so I wouldn't be surprised if the answer is no. I might have to do some work on removing any non-shareable information first if I get the thumbs up.
But to recap.
I have a data set with two binary input variables that are highly correlated with the binary target: having a credit card, and having a debit card. 3 percent of customers without a credit card have a 1 on the target variable, compared to 30 percent of customers who do have a credit card. For debit card, the numbers are 8 and 40. Put another way, 95 percent of the target = 1 group have a debit card, and 80 percent have a credit card. This seems to be at the heart of the issue.
Usually, debit card is the first split. If I set maximum branch to more than 2 and the minimum leaf size to more than about 50 observations, the tree stops growing on the branch containing customers without a credit card (where about 8 percent have target 1). This happens even though there are clearly several good splits to choose from, as is apparent in an interactive tree, or if I set the minimum leaf size to the default 5, or if I set the maximum branch to 2. Removing both these variables makes the tree behave as normal but, needless to say, reduces the model accuracy.
Somehow, the leaf size and/or the maximum branch setting influences whether or not the tree chooses to split the group without a credit or debit card.
12-01-2016 04:03 PM