Building models with SAS Enterprise Miner, SAS Factory Miner, SAS Visual Data Mining and Machine Learning or just with programming

Is this a high dimensionality issue? How to deal with it?

Occasional Contributor
Posts: 11

Is this a high dimensionality issue? How to deal with it?

To build a market response model, either with logistic regression or decision tree techniques, I often include nominal variables that contain multiple levels. Typical examples are the 66 PRIZM segments or over 800 zip3 region segments. I understand that SAS, by default, sets a threshold of 20 levels for class variables, meaning any class variable with more than 20 levels is excluded from further processing. So I usually increase this threshold to 80 in order to include the PRIZM codes. The difficulty is that a model with such multi-level variables is very hard to explain to business and management. Is there any technique to reduce the number of levels without losing any specific information?

Trusted Advisor
Posts: 1,630

Re: Is this a high dimensionality issue? How to deal with it?

I understand that SAS, by default, sets a threshold of 20 levels for class variables, meaning any class variable with more than 20 levels is excluded from further processing.

I've never heard of this, and I don't think it's true. At least it wasn't for me when using PROC GLM.

The difficulty is that a model with such multi-level variables is very hard to explain to business and management. Is there any technique to reduce the number of levels without losing any specific information?

Without knowing a whole lot about your market response model, I could imagine that cluster analysis might be a way to combine some of your levels. But as soon as you do that, you do lose information. I could also imagine Partial Least Squares with categorical X variables (your PRIZM levels or zip3 region) might work. It's hard to say.

SAS Employee
Posts: 2

Re: Is this a high dimensionality issue? How to deal with it?

I'd also use the Cluster node on PRIZM or the zip codes. The Cluster node creates segment variables, and those segment variables can be used in the logistic regression or decision tree models. You can then use the Segment Profile node to come up with a meaningful description of each segment.
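Outside Enterprise Miner, the same idea can be sketched in a few lines: summarize each level of the high-cardinality variable by a statistic (here the training-set response rate), cluster those summaries, and use the cluster id as the model input. This is a hypothetical stdlib illustration with made-up level names and rates, using a simple one-dimensional k-means since there is only one summary feature:

```python
# Cluster the LEVELS of a high-cardinality class variable on their
# per-level training response rates, then model on the cluster id.

def kmeans_1d(values, k, iters=50):
    """Tiny 1-D k-means: returns the k cluster centers."""
    vals = sorted(values)
    # initialize centers spread across the sorted range
    centers = [vals[int(i * (len(vals) - 1) / (k - 1))] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in vals:
            j = min(range(k), key=lambda c: abs(v - centers[c]))
            groups[j].append(v)
        new = [sum(g) / len(g) if g else centers[j]
               for j, g in enumerate(groups)]
        if new == centers:          # converged
            break
        centers = new
    return centers

def assign(v, centers):
    """Index of the nearest center."""
    return min(range(len(centers)), key=lambda c: abs(v - centers[c]))

# Per-level response rates computed from TRAINING data (made-up numbers).
rates = {"U1": 0.02, "U2": 0.03, "S1": 0.10, "S2": 0.11, "R1": 0.30}
centers = kmeans_1d(list(rates.values()), k=3)
segment = {lvl: assign(r, centers) for lvl, r in rates.items()}
# Levels with similar response rates now share a segment id, which is
# much easier to profile and explain than 66 raw PRIZM codes.
```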

Occasional Contributor
Posts: 17

Re: Is this a high dimensionality issue? How to deal with it?

You can do a couple of things:

1) Group rare classes to meet EM's maximum-number-of-classes requirement. That is usually OK in terms of loss of information, unless one of your rare classes is highly correlated with the target.

2) Replace the actual classes with their target mean and use this new continuous variable instead. But do keep in mind that this, if not done carefully, increases the risk of overfitting. In particular, your validation and test datasets must not be used to calculate the target mean. That is somewhat hard to enforce in EM, but very easy to handle in R or Python.
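Both techniques above can be sketched in plain Python. This is a minimal, hypothetical illustration (the column names, data, and smoothing scheme are made up for the example); the key point is that the grouping and the target means are learned from the training rows only and then merely applied to validation/test rows:

```python
from collections import defaultdict

def group_rare_levels(rows, level_key, min_count=30, other="_OTHER_"):
    """Technique 1: collapse levels seen fewer than min_count times in the
    TRAINING data into one catch-all level. Mutates rows in place."""
    counts = defaultdict(int)
    for r in rows:
        counts[r[level_key]] += 1
    keep = {lvl for lvl, c in counts.items() if c >= min_count}
    for r in rows:
        if r[level_key] not in keep:
            r[level_key] = other
    return keep

def fit_target_means(rows, level_key, target_key, prior_weight=20.0):
    """Technique 2: learn a level -> smoothed target mean from TRAINING rows
    only. Shrinking toward the overall rate protects small levels."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for r in rows:
        sums[r[level_key]] += r[target_key]
        counts[r[level_key]] += 1
    prior = sum(sums.values()) / len(rows)   # overall response rate
    mapping = {lvl: (sums[lvl] + prior_weight * prior)
                    / (counts[lvl] + prior_weight)
               for lvl in counts}
    return mapping, prior

def encode(rows, mapping, prior, level_key, new_key):
    """Apply the learned mapping; levels never seen in training
    fall back to the overall prior."""
    for r in rows:
        r[new_key] = mapping.get(r[level_key], prior)
    return rows

# Toy usage: fit on training rows, then apply to validation rows.
train = [{"prizm": "U1", "resp": 1}, {"prizm": "U1", "resp": 0},
         {"prizm": "S2", "resp": 0}, {"prizm": "S2", "resp": 0},
         {"prizm": "R9", "resp": 1}]
group_rare_levels(train, "prizm", min_count=2)        # R9 -> _OTHER_
mapping, prior = fit_target_means(train, "prizm", "resp", prior_weight=2.0)
valid = encode([{"prizm": "U1"}, {"prizm": "T7"}],    # T7 unseen in training
               mapping, prior, "prizm", "prizm_enc")
```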

SAS Employee
Posts: 15

Re: Is this a high dimensionality issue? How to deal with it?

Replacing the actual classes with the target mean to form a new continuous variable can be done easily inside a SAS Code node in Enterprise Miner. However, if you're trying to use only nodes and avoid writing SAS code, then switching to R or Python doesn't make much sense either. If for some reason you feel strongly about doing it in R, EM 13.2 has an Open Source Integration node that lets you run R code and interface it into Enterprise Miner.

SAS Employee
Posts: 15

Re: Is this a high dimensionality issue? How to deal with it?

Since the segments are class variables rather than interval variables, the Decision Tree node in Enterprise Miner is a good fit. On the Decision Tree properties, under Score, set the "Leaf Role" property to "Input"; the node will then export its leaf assignment (node_id) as a new variable.

If you perform an initial modeling pass with only the PRIZM variable and the target variable, the tree will be built only on the levels of the PRIZM variable. The output data set will then contain a node_id variable, which is a binned version of the PRIZM variable with far fewer levels. The binning reflects how strongly the various class levels of the PRIZM variable relate to the target, and you can easily see which levels of the class variable are associated with a given node_id.

Next you can perform further modeling using all of your input variables, but instead of using the PRIZM variable you will use the node_id variable.
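Outside Enterprise Miner, the effect of this leaf binning can be approximated in a few lines. The sketch below (hypothetical data; stdlib only) simply sorts levels by their training response rate and cuts them into a fixed number of bins; an actual decision tree would instead choose the groupings by an impurity criterion, but the resulting bin id plays the same role as node_id:

```python
from collections import defaultdict

def bin_levels_by_rate(pairs, n_bins=5):
    """pairs: iterable of (level, target) from TRAINING data.
    Returns a level -> bin id mapping (a crude 'leaf' analogue)."""
    hits = defaultdict(int)
    counts = defaultdict(int)
    for lvl, y in pairs:
        hits[lvl] += y
        counts[lvl] += 1
    # rank levels from lowest to highest response rate
    ranked = sorted(counts, key=lambda lvl: hits[lvl] / counts[lvl])
    per_bin = max(1, -(-len(ranked) // n_bins))   # ceiling division
    return {lvl: i // per_bin for i, lvl in enumerate(ranked)}

# Toy usage: three levels collapse into two bins by response rate.
train = [("A", 1), ("A", 1), ("B", 0), ("B", 1), ("C", 0), ("C", 0)]
bins = bin_levels_by_rate(train, n_bins=2)
# Downstream models would use bins[level] in place of the raw level.
```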
