Quartz | Level 8

Dear members,

I have a classification problem with 6 classes and 14,500 interval features. I need to reduce the dimensionality.

What is the optimal method to do that?

Decision tree, PCA, ...?


Barite | Level 11

Hi Hussein,

A great thing about Enterprise Miner is that you can try multiple techniques at once. Since two flows on a diagram run in parallel, you also make the most of your time.

Note that nodes like decision tree will do variable selection, and nodes like PCA will do dimension reduction.
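
To make that distinction concrete, here is a minimal Python sketch. It is not Enterprise Miner code: the toy data, the function names, and the use of power iteration are all assumptions made for illustration. Variable selection keeps a subset of the original columns, whereas PCA builds new columns that are weighted sums of all the inputs.

```python
import math
import random

def center(rows):
    """Subtract each column's mean so the covariance is computed about zero."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[r[j] - means[j] for j in range(p)] for r in rows]

def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, p = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]

def first_component(cov, iters=500):
    """Power iteration: approximates the leading eigenvector of cov."""
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical data: x1 and x2 carry nearly the same signal, x3 is unrelated noise.
rng = random.Random(0)
rows = []
for _ in range(200):
    s = rng.gauss(0, 1)
    rows.append([s + rng.gauss(0, 0.1), s + rng.gauss(0, 0.1), rng.gauss(0, 0.5)])

centered = center(rows)
loadings = first_component(covariance(centered))
# Each score is a weighted mix of all three inputs, not one original column.
scores = [sum(w * x for w, x in zip(loadings, r)) for r in centered]
print("loadings:", [round(w, 3) for w in loadings])
```

The loadings show how much each original input contributes to the new component, which is also the thing you would have to explain to a model user.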

Advanced Predictive Modeling Using SAS Enterprise Miner is a course that explains advanced topics very well, including unsupervised (PCA, variable clustering, etc.) and supervised (PLS, LARS, LASSO, etc.) dimension reduction techniques. Highly recommended!

A best practice is to try several techniques and select the one that suits your target, number of observations, and number of input variables. Also check this discussion ( ) where you can find a list of nodes that you can use for variable selection.

Also, what kind of classification problem are you trying to solve? Are you dealing with missing values in the input variables? And what is the distribution of those 6 classes? You might even want to use a two-step model.

I hope it helps,


Obsidian | Level 7

Hi Hussein,

I agree with Miguel, and I am even using the book from the course he recommends above in my day-to-day work.

From my perspective, PCA is a great way to reduce the dimensions. The problem with the technique is that it often becomes difficult to explain to a client or user of the model what exactly the PCA variables mean and how they relate to the parameters of the model.

I would maybe use a decision tree to select useful variables and go from there (there's a specific way to configure the decision tree node to select variables). You can also use regression to select variables (especially forward selection, which is good at detecting strong interactions). Just note that these nodes may run for a considerable amount of time on a dataset this large.

Also (something I use a lot), since your inputs are interval variables, you could cluster them into groups using Enterprise Miner's clustering node and then use one representative from each cluster for subsequent modelling. Enterprise Miner automatically exports the cluster representatives if you set the Variable Selection option in the clustering node to "Best Variables".

Something I also sometimes use (although it's not as successful as the other techniques mentioned earlier) is the Variable Selection node, with the minimum Chi-Square lower bound set much lower than Enterprise Miner's default so that it selects a larger number of important variables.
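
The cluster-then-keep-one-representative idea can be sketched roughly as follows. This is a simplified stand-in in Python, not what Enterprise Miner does internally: the greedy grouping rule, the 0.8 correlation threshold, the variable names, and the toy data are all assumptions for illustration.

```python
import math
import random

def pearson(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def cluster_variables(columns, threshold=0.8):
    """Greedy grouping: a variable joins the first cluster whose seed it correlates with."""
    clusters = []  # each cluster is a list of names; the first entry is the seed
    for name in columns:
        for members in clusters:
            if abs(pearson(columns[name], columns[members[0]])) >= threshold:
                members.append(name)
                break
        else:
            clusters.append([name])
    return clusters

def representative(members, columns):
    """Pick the member with the highest mean |correlation| to the rest of its cluster."""
    if len(members) == 1:
        return members[0]
    def avg_corr(name):
        others = [m for m in members if m != name]
        return sum(abs(pearson(columns[name], columns[o])) for o in others) / len(others)
    return max(members, key=avg_corr)

# Hypothetical data: three noisy copies of signal A, two noisy copies of signal B.
rng = random.Random(1)
base_a = [rng.gauss(0, 1) for _ in range(300)]
base_b = [rng.gauss(0, 1) for _ in range(300)]
noisy = lambda base: [x + rng.gauss(0, 0.1) for x in base]
columns = {"a1": noisy(base_a), "a2": noisy(base_a), "b1": noisy(base_b),
           "a3": noisy(base_a), "b2": noisy(base_b)}

clusters = cluster_variables(columns)
reps = [representative(members, columns) for members in clusters]
print("clusters:", clusters)
print("representatives:", reps)
```

Five inputs collapse to one representative per group, and each representative is still an original variable, so it stays easy to explain.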

Hope you succeed!


Obsidian | Level 7

One thing I'd add is that I wouldn't trust a single decision tree to do variable selection for me. Decision trees are highly unstable models, and small changes in the data (e.g. even a change in SEED) can produce vastly different variable selections, especially when many of your variables are correlated. That is why I'd rather use a Random Forest to get a feeling for variable importance. I'm not sure EM has random forests though; I haven't used EM in a while.
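
Here is a rough Python sketch of why resampling helps. It is only an illustration under assumptions of my own: a one-split "stump" with a median cut stands in for a full tree, the target is binary rather than 6 classes, and the data and names are made up. It is not a random forest implementation, just the bootstrap-and-count idea behind one.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity for a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)

def best_split_variable(rows, labels, names):
    """Return the variable whose median split gives the lowest weighted Gini impurity."""
    best_name, best_score = None, float("inf")
    for j, name in enumerate(names):
        values = sorted(r[j] for r in rows)
        cut = values[len(values) // 2]
        left = [y for r, y in zip(rows, labels) if r[j] < cut]
        right = [y for r, y in zip(rows, labels) if r[j] >= cut]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if score < best_score:
            best_name, best_score = name, score
    return best_name

# Hypothetical data: x1 and x2 are correlated and both predictive, x3 is pure noise.
rng = random.Random(42)
names = ["x1", "x2", "x3"]
rows, labels = [], []
for _ in range(300):
    s = rng.gauss(0, 1)
    rows.append([s + rng.gauss(0, 0.5), s + rng.gauss(0, 0.5), rng.gauss(0, 1)])
    labels.append(1 if s > 0 else 0)

# Refit the stump on 200 bootstrap resamples and count which variable wins each time.
counts = Counter()
for seed in range(200):
    boot = random.Random(seed)
    idx = [boot.randrange(len(rows)) for _ in rows]
    pick = best_split_variable([rows[i] for i in idx], [labels[i] for i in idx], names)
    counts[pick] += 1
print(counts)
```

A single fit reports only one winner, and with correlated inputs that winner may change from one resample to the next. The counts over many resamples give a steadier picture of which variables matter.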

SAS Employee

Yes, beginning in SAS Enterprise Miner 13.1, random forests can be used for variable selection with the HP Forest node.

SAS Employee

Generally speaking, the HPFOREST procedure has been available beginning with Enterprise Miner 7.1.

As Wendy has pointed out, the HP Forest node is capable of variable selection beginning with Enterprise Miner 13.1. Enterprise Miner 13.1 was released in December 2013.

Quartz | Level 8

Thanks for the discussion and information.

Quartz | Level 8

I have EM 6.2, so this is a problem for me.


