Zachary
Obsidian | Level 7

We are building a very basic Decision Tree - love EM too! We have the following four nodes:

Input Data,

Data Partition,

Consolidation Tree, then

Decision Tree.

The Consolidation Tree is actually a variation of a Decision Tree where we are taking some variables with many nominal categories (sometimes in the hundreds) and seeing whether we can collapse them into simplified groupings related to our main dependent/target variable. Below is a snapshot of part of our Consolidation Tree:

Capture.PNG

The origin node looks great, and we split our data 80/20 between training and validation. The first significant split is on NAC_CODE, which grouped the variable into two nice nodes. But the next level down, for one of those nodes, splits GOVERNING_CLASS into two nodes again, and the problem is that one of them is a node for Missing_Values_Only. Normally I would not be too concerned, as many of the variables in our dataset have missing values. But GOVERNING_CLASS has zero. I fully understand that EM automatically assigns missing values to a branch alongside other values of a variable, even when the present dataset has no missings, so that the tree can handle them at scoring time. But a branch that holds missing values by itself does not make sense at all.

Please help. I have some other questions coming after this one is resolved as well.

Thank you very much.

Zach Feinstein, Statistical Data Modeler

P (952) 838-4289 C (612) 590-4813  F (952) 838-2010

SFM Mutual Insurance Company

3500 American Blvd. W,
Suite 700, Bloomington, MN 55431

Accepted Solutions
WendyCzika
SAS Employee

I think what is happening here is that categories with fewer observations than the value specified for the Decision Tree node property Minimum Categorical Size are treated as missing, so that's why you are seeing that branch for GOVERNING_CLASS even though it has no missing values. So one option is to lower that value so that categories with extremely small counts are not treated as missing. The second thing you can change is the Missing Values property, to something other than Use in search. This will prevent a branch from ever having only missing values (both true missings and those defined by Minimum Categorical Size). Hope that helps!
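To make the mechanism above concrete, here is a minimal Python sketch of the idea, not Enterprise Miner's actual implementation. It counts how often each level of a nominal variable occurs and lists the levels that fall below a size threshold, which are the ones that would be folded into the missing group. The function name `rare_categories`, the class codes, and the threshold of 5 are all made up for illustration; substitute your own data and whatever value your Minimum Categorical Size property is set to.

```python
from collections import Counter


def rare_categories(values, min_categorical_size):
    """Return {category: count} for categories seen fewer than
    min_categorical_size times. None is treated as a true missing
    value and is not counted as a category."""
    counts = Counter(v for v in values if v is not None)
    return {cat: n for cat, n in counts.items() if n < min_categorical_size}


# Hypothetical GOVERNING_CLASS column: no true missings, but two sparse codes.
governing_class = ["8810"] * 40 + ["5403"] * 25 + ["9014"] * 3 + ["7219"] * 1

# Hypothetical threshold standing in for the Minimum Categorical Size property.
rare = rare_categories(governing_class, min_categorical_size=5)
print(rare)
```

If a check like this on the training partition turns up sparse levels for GOVERNING_CLASS, that would be consistent with the explanation above, and lowering the threshold should shrink or empty that list.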


AnnaBrown
Community Manager

Welcome to the community, Zach! I hope you find some good advice in this forum. Keep the questions coming!

Anna

