Ullsokk
Pyrite | Level 9

I have a data set with 24 000 training observations. I am using a decision tree in Enterprise Miner 14.1 to build a predictive model. If I set Leaf Size under Node to 100 (the default is 5), I expect the program not to create any leaf smaller than 100 observations. When I inspect the tree, it first splits on a binary variable, where 13 000 observations have a 0. These 13 000 observations are not split any further.

If I change back to the default value of 5, however, the tree splits these 13 000 into two groups of about 6 500 and continues splitting nodes.

The description of Leaf Size says: "Specifies the smallest number of training observations that a leaf can have."

In both cases, the leaves in question are far larger than 100 observations, so why does the procedure not split the 13 000 observations with a minimum leaf size of 100? Am I totally misinterpreting what this means?
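To make the expectation concrete, here is a minimal Python sketch of the leaf-size rule as the description quoted above reads: a candidate split is admissible only if every resulting branch keeps at least the minimum number of training observations. The function name `split_allowed` and the exact 6 500 / 6 500 branch sizes are illustrative assumptions; this is not how Enterprise Miner implements its split search.

```python
def split_allowed(branch_sizes, min_leaf_size):
    """Return True if every branch keeps at least min_leaf_size observations."""
    return all(n >= min_leaf_size for n in branch_sizes)


# Illustrative sizes: the 13 000-observation node split into two halves.
candidate = [6500, 6500]

for leaf_size in (5, 100):
    print(f"leaf size {leaf_size}: split allowed = {split_allowed(candidate, leaf_size)}")
```

Under this reading, the split that appears with a leaf size of 5 is just as admissible with a leaf size of 100, which is why the observed behavior is surprising.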

7 REPLIES
PadraicGNeville
SAS Employee

You understand it perfectly.  The software appears to be confused.  Something is confusing it.

 

There is a parameter to set the within-node sample size. (I do not remember its name.) Set it to something larger than 24,000.

 

If that does not cure it, I wonder whether there is a FREQ variable, and what the counts of the target classes are.

Let me know and we can go from there. 

-Padraic

 

 

   

Ullsokk
Pyrite | Level 9

I tried increasing Node Sample under Split Search from 20 000 to 100 000, but nothing changed.

The training sample is 24 000 observations, with 22% 1 and 78% 0 on the target variable.

I tried using the interactive tree to examine the 13 000-observation node that won't split. But in the interactive tree, both the training and validation data are set to 10 000, even though the actual tree has 24 000 and 16 000.

PadraicGNeville
SAS Employee

I'm stumped. May I use your data to reproduce it? If so, you can either upload the data to the community or upload it privately through SAS Technical Support. In that case, tell Tech Support that "Padraic Neville wants the data in order to investigate the problem," and they will quickly let me know when it is available. Technical Support will want the site number that appears at the top of the SAS logs. In Enterprise Miner, you can get the log as follows:

  1. Launch the SAS Enterprise Miner client.
  2. Open any project and run your diagram flow.
  3. Right-click on Results and select View►SAS Results►Log.
  4. Search for Site at the top of the log to identify your site number.
JasonXin
SAS Employee
It seems your maximum branch remains at 2. For the sake of testing, could you relax it to, say, 5 or 8? Also run StatExplore (if you have not already) under Explore to profile the input variables. See whether you have any significant, strong, or dominating and highly categorical variables in the input set, like the most significant binary variable you mentioned. When I run into situations like this, I often hold out the strong categorical variables and build a decision tree with the continuous variables. Then I add the held-out variables back to check their impact on the model.
Ullsokk
Pyrite | Level 9

No, it doesn't remain at 2 branches; the tree splits into more than two leaves further down on the other branch. But I did try setting the maximum back to 2 branches, and in that case the tree splits the troublesome 13 000 several times. As soon as I set the maximum branch to 3 or more, the tree collapses and won't split this group.

I have a data set with over 200 000 cases, so I tried resampling the data with 30 000 cases, with the same result. I also tried 10 000; this time the tree won't split this group even with the default setting of a maximum of 2 branches.

Most of my input variables are binary, so building it with only continuous variables is not an option.

I suspect this must be due to the most significant variable, as you have alluded, so I tried removing the most significant variable, which the tree uses for the first split.

After removing the variable, the tree seems to behave as expected, with no leaves smaller than 100 observations. But I still can't get my head around why setting the leaf size to 5 instead of 100 also works, even with the most significant variables included.

 

 

Ullsokk
Pyrite | Level 9

I will look into the possibility of sharing the data. This is enterprise data, so I wouldn't be surprised if the answer is no. I might have to do some work removing any non-shareable information first if I get the thumbs up.

But to recap:

I have a data set with two binary input variables that are highly correlated with the binary target: having a credit card and having a debit card. 3 percent of customers without a credit card have a 1 on the target variable, compared to 30 percent of customers who do have a credit card. For debit card, the numbers are 8 and 40. Put another way, 95 percent of the target = 1 group have a debit card, and 80 percent have a credit card. This seems to be at the heart of the issue.
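As a rough cross-check on these figures, the share of customers holding each card can be backed out from the quoted target rates. This is only a sketch: it assumes the 22 percent overall target rate of the 24 000-observation training sample applies to the same customers, and the function names are made up for illustration.

```python
def share_with_card(overall_rate, rate_without, rate_with):
    """Solve overall = rate_with * p + rate_without * (1 - p) for p."""
    return (overall_rate - rate_without) / (rate_with - rate_without)


def card_share_among_targets(overall_rate, rate_without, rate_with):
    """Bayes' rule: P(card | target = 1) = P(target = 1 | card) * P(card) / P(target = 1)."""
    p = share_with_card(overall_rate, rate_without, rate_with)
    return rate_with * p / overall_rate


OVERALL = 0.22    # assumption: target rate of the training sample
N_TRAIN = 24_000  # training observations

# (name, target rate without the card, target rate with the card)
for name, without, with_card in [("debit", 0.08, 0.40), ("credit", 0.03, 0.30)]:
    p = share_with_card(OVERALL, without, with_card)
    among_targets = card_share_among_targets(OVERALL, without, with_card)
    print(f"{name}: {p:.1%} hold the card, "
          f"about {round((1 - p) * N_TRAIN)} training obs without it, "
          f"{among_targets:.1%} of target = 1 hold it")
```

Under that assumption, the branch without a debit card would hold about 13 500 of the 24 000 training observations, close to the 13 000-observation node that refuses to split. The implied shares of the target group holding a debit or credit card come out near 80 and 96 percent, so the rounded percentages above should be read as approximate.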

 

Usually, debit card is the first split. If I set the maximum branch to more than 2 and the minimum leaf size to more than about 50 observations, the tree stops growing on the branch containing customers without a debit card (where about 8 percent have target 1). This happens even though there are clearly several good splits to choose from, as is apparent in the interactive tree, or if I set the minimum leaf size to the default of 5, or if I set the maximum branch to 2. Removing both these variables makes the tree behave as normal but, needless to say, reduces the model accuracy.

 

Somehow, the leaf size and/or the maximum branch setting influences whether the tree chooses to split the group without a credit or debit card.

JasonXin
SAS Employee
I agree with your assessment about transmitting corporate data to SAS. It is doable; it just needs to go through some paperwork. The technology is ready, and SAS Technical Support has experience with it. But it depends on whether your company grants permission. Thanks.
