Lychee
Calcite | Level 5

When I build a market response model with logistic regression or decision trees, I often include nominal variables that have many levels. Typical examples are the 66 PRIZM segments or the more than 800 zip3 region segments. I understand that SAS, by default, sets a threshold of 20 levels for a class variable, so any class variable with more than 20 levels is excluded from further processing. I usually increase this threshold to 80 in order to include the PRIZM codes. The difficulty is that a model with such many-level variables is very hard to explain to business and management. Is there any technique to reduce the number of levels without losing specific information?

6 REPLIES
PaigeMiller
Diamond | Level 26

I understand that SAS, by default, sets a threshold for a class variable to 20, which means that any class variable with more than 20 levels will be excluded from the further process

I have never heard of this, and I don't think it's true. At least it wasn't for me using PROC GLM.

But the difficulty is it would be very difficult to explain these multiple level variable model to business and management. Is there any technique to reduce the number of level but not lose any specific information?

Without knowing a whole lot about your market response model, I could imagine that cluster analysis might be a way to combine some of your levels. But as soon as you do that, you do lose information. I could also imagine Partial Least Squares with categorical X variables (your PRIZM levels or zip3 region) might work. It's hard to say.

--
Paige Miller
huiping_fang
SAS Employee

I'd also use the Cluster node on PRIZM or ZIP codes. The Cluster node creates segment variables, and those segment variables can be used in the logistic regression or decision tree models. You can then use the Segment Profile node to come up with a meaningful description of each segment.

adjgiulio
Obsidian | Level 7

You can do a couple of things:

1) Group rare classes to fulfill EM's maximum number of classes requirement. That is usually OK in terms of loss of information, unless one of your rare classes is highly correlated with the target.

2) Replace the actual classes with their target mean and use this new continuous variable instead. Keep in mind that, if not done correctly, this will increase the risk of overfitting. In particular, your validation and test datasets should not be used to calculate the target mean. That is somewhat hard to do in EM, but very easy to handle in R or Python.
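As a rough illustration of both options, here is a minimal Python sketch that uses only the standard library. The function names, the "OTHER" label, the toy data, and the optional `smoothing` parameter are all assumptions made for this example; none of them come from Enterprise Miner. The point it tries to show is that the level counts and target means are learned from the training rows only and then applied unchanged to validation or test rows.

```python
from collections import Counter, defaultdict


def group_rare_levels(train_levels, min_count, other_label="OTHER"):
    """Map each training level to itself, or to other_label if it is rare."""
    counts = Counter(train_levels)
    return {lvl: (lvl if n >= min_count else other_label)
            for lvl, n in counts.items()}


def fit_target_means(train_levels, train_targets, smoothing=0.0):
    """Learn one target mean per level from TRAINING rows only.

    smoothing > 0 (an optional extra, not part of the original suggestion)
    shrinks each level mean toward the overall mean, which tames levels
    that have very few rows.
    """
    overall = sum(train_targets) / len(train_targets)
    sums = defaultdict(float)
    counts = Counter()
    for lvl, y in zip(train_levels, train_targets):
        sums[lvl] += y
        counts[lvl] += 1
    means = {lvl: (sums[lvl] + smoothing * overall) / (counts[lvl] + smoothing)
             for lvl in counts}
    return means, overall


def apply_target_means(levels, means, overall):
    """Encode any dataset; levels unseen in training get the overall mean."""
    return [means.get(lvl, overall) for lvl in levels]


# Toy data, made up for illustration (segment code, 0/1 response).
train_seg = ["01", "01", "01", "02", "02", "03"]
train_y = [1, 0, 1, 0, 0, 1]
valid_seg = ["01", "03", "66"]  # "66" never appears in training

grouping = group_rare_levels(train_seg, min_count=2)
means, overall = fit_target_means(train_seg, train_y, smoothing=2.0)
valid_encoded = apply_target_means(valid_seg, means, overall)

print(grouping)
print(valid_encoded)
```

Because `fit_target_means` never sees the validation rows, the encoded validation values cannot leak the validation target into the model.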

RalphAbbey
SAS Employee

Replacing the actual classes with the target mean to form a new continuous variable can be done easily inside a SAS Code Node in Enterprise Miner. However, if you're trying to use only nodes, and avoid writing SAS code, then using R or Python doesn't make much sense either. If for some reason you feel strongly about doing it in R, EM 13.2 has an Open Source Integration Node that lets you run R code from within Enterprise Miner.

AnnaBrown
Community Manager

And here's a tip from Shunping on Spectral Clustering in SAS® Enterprise Miner™ Using Open Source Integration Node.

RalphAbbey
SAS Employee

Since the segments are class variables, and not interval variables, it might be good to use the Decision Tree Node in Enterprise Miner. The Decision Tree Node can pass its leaf assignment, node_id, on to later nodes as an input: on the Decision Tree properties, under Score, set the property "Leaf Role" to "Input."

If you perform an initial modeling run with only the PRIZM variable and the target variable, the tree will be built only on the levels of the PRIZM variable. The output data set will now have a node_id variable, which is a binned version of the PRIZM variable and contains fewer levels. The binning is driven by how strongly the various class levels of the PRIZM variable relate to the target. You can also easily see which levels of the class variable are associated with a given node_id.

Next you can perform further modeling using all of your input variables, but instead of using the PRIZM variable you will use the node_id variable.
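To make the idea concrete outside of Enterprise Miner, here is a minimal Python sketch of tree-style binning for a single nominal input and a binary target. It is not the Decision Tree Node's algorithm: the Gini criterion, the greedy splitting, the function names, and the toy counts are all assumptions chosen for illustration. Levels are ordered by event rate, and the ordered list is repeatedly cut at the point that most reduces weighted Gini impurity until the requested number of bins is reached.

```python
def gini(events, n):
    """Gini impurity of a group with `events` responders out of `n` rows."""
    if n == 0:
        return 0.0
    p = events / n
    return 2.0 * p * (1.0 - p)


def best_split(levels, stats):
    """Best cut point in a rate-ordered list of levels: (gain, index)."""
    total_e = sum(stats[l][0] for l in levels)
    total_n = sum(stats[l][1] for l in levels)
    parent = total_n * gini(total_e, total_n)
    best_gain, best_idx = 0.0, None
    left_e = left_n = 0
    for i, lvl in enumerate(levels[:-1]):
        left_e += stats[lvl][0]
        left_n += stats[lvl][1]
        right_e, right_n = total_e - left_e, total_n - left_n
        children = left_n * gini(left_e, left_n) + right_n * gini(right_e, right_n)
        gain = parent - children
        if gain > best_gain:
            best_gain, best_idx = gain, i + 1
    return best_gain, best_idx


def bin_levels(stats, max_bins):
    """Return level -> node_id, using at most `max_bins` bins.

    stats maps each level to (events, n), computed on training data.
    """
    ordered = sorted(stats, key=lambda l: stats[l][0] / stats[l][1])
    bins = [ordered]
    while len(bins) < max_bins:
        candidates = [(best_split(b, stats), k) for k, b in enumerate(bins)]
        (gain, idx), k = max(candidates, key=lambda c: c[0][0])
        if idx is None:  # no split improves impurity any further
            break
        bins[k:k + 1] = [bins[k][:idx], bins[k][idx:]]
    return {lvl: node for node, b in enumerate(bins, start=1) for lvl in b}


# Toy training counts per segment: (responders, rows). Made up for illustration.
stats = {"A": (1, 100), "B": (2, 100), "C": (20, 100),
         "D": (22, 100), "E": (50, 100)}
node_id = bin_levels(stats, max_bins=3)
print(node_id)
```

On this toy data the five levels collapse into three bins of similar response rate, and the resulting `node_id` mapping plays the role of the binned variable that replaces the original class variable in later modeling.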

