10-05-2014 01:13 PM
I have a classification problem with 6 classes and 14,500 interval features. I need to reduce the dimensions.
What is the optimal method to do that?
Decision tree, PCA, ...?
10-07-2014 02:16 PM
A great thing about Enterprise Miner is that you can try multiple techniques at once. Since two flows on a diagram can run in parallel, you also make the most of your time.
Note that nodes like the Decision Tree node perform variable selection, while nodes like PCA perform dimension reduction.
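To make that distinction concrete outside of Enterprise Miner, here is a minimal sketch in Python with scikit-learn (the dataset and parameters are made up for illustration; this is not EM's implementation): a decision tree keeps a *subset of the original columns*, while PCA *replaces* the columns with new components.

```python
# Illustrative sketch (scikit-learn, not Enterprise Miner):
# variable selection vs. dimension reduction on synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

# Synthetic 6-class problem with 100 interval inputs (made up for the example).
X, y = make_classification(n_samples=500, n_features=100,
                           n_informative=10, n_classes=6, random_state=0)

# Variable selection: the tree keeps a subset of the original columns.
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
selected = np.where(tree.feature_importances_ > 0)[0]

# Dimension reduction: PCA replaces the columns with new components.
pca = PCA(n_components=10).fit(X)
X_reduced = pca.transform(X)

print(len(selected), X_reduced.shape)
```

The selected columns keep their original meaning; the PCA components do not, which is exactly the interpretability trade-off discussed later in this thread.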
Advanced Predictive Modeling Using SAS Enterprise Miner is a course that explains advanced topics very well, including unsupervised (PCA, variable clustering, etc.) and supervised (PLS, LARS, LASSO, etc.) dimension reduction techniques. Highly recommended!
A best practice is to try several techniques and select the one that best suits your target, your number of observations, and your number of input variables. Also check this discussion ( ), where you can find a list of nodes you can use for variable selection.
Also, what kind of classification problem are you trying to solve? Are you dealing with missing values in the input variables? And what is the distribution of those 6 classes? You might even want to use a two-step model.
I hope it helps,
10-10-2014 09:43 AM
I agree with Miguel, and I'm even using the book from the course he recommends above in my day-to-day work.
From my perspective, PCA is a great way to reduce the dimensions, although the problem with the technique is that it often becomes difficult to explain to a client or user of the model what exactly the PCA variables mean and how they relate to the parameters of the model.
I would maybe use a decision tree to select useful variables and go from there (there's a specific way to configure the Decision Tree node to select variables). You can also use regression to select variables (especially forward selection, which is good at detecting strong interactions). Just note that these nodes may take a considerable amount of time to run on a dataset as large as yours.
Also (something I use a lot): since your inputs are interval variables, you could cluster them into groups using EMiner's Variable Clustering node and then use one representative from each cluster for subsequent modelling. EMiner automatically exports the cluster representatives if you set the Variable Selection property of the node to "Best Variables".
Something I also sometimes use (although it's not as successful as the other techniques mentioned above) is SAS's Variable Selection node, with the minimum Chi-Square lower bound set much lower than EMiner's default value, so that it selects a larger number of important variables.
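The variable-clustering idea above can be sketched outside of EM as well. Here is a minimal Python illustration (numpy/scipy, not EM's Variable Clustering node; the data, the 0.5 cut height, and the "closest to the cluster center" rule are assumptions made for the example): group correlated interval inputs by hierarchical clustering on 1 − |correlation|, then keep one representative per group.

```python
# Sketch of variable clustering (numpy/scipy, not EM's Variable Clustering
# node): group correlated inputs, keep one representative per group.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
base = rng.normal(size=(300, 5))                  # 5 latent signals (made up)
# 20 inputs: 4 noisy copies of each latent signal.
X = np.hstack([base + 0.1 * rng.normal(size=(300, 5)) for _ in range(4)])

corr = np.corrcoef(X, rowvar=False)
dist = 1 - np.abs(corr)                           # uncorrelated = far apart
np.fill_diagonal(dist, 0.0)
Z = linkage(squareform(dist, checks=False), method="average")
labels = fcluster(Z, t=0.5, criterion="distance") # cut height is an assumption

# Representative = the variable most correlated with its own cluster.
reps = []
for c in np.unique(labels):
    idx = np.where(labels == c)[0]
    mean_corr = np.abs(corr[np.ix_(idx, idx)]).mean(axis=1)
    reps.append(idx[np.argmax(mean_corr)])
print(sorted(reps))
```

On this toy data the 20 inputs collapse to 5 representatives, one per latent signal, which is the same idea as modelling on cluster representatives instead of all 14,500 inputs.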
Hope you succeed!
10-15-2014 12:32 PM
One thing I'd add is that I wouldn't trust a single decision tree to do variable selection for me. Decision trees are highly unstable models, and small changes in the data (even a change in the SEED) can produce vastly different variable selections, especially when many of your variables are correlated. That is why I'd rather use a random forest to get a feel for variable importance. Not sure EM has random forests though; I haven't used EM in a while.
10-15-2014 02:28 PM
Generally speaking, the HPFOREST procedure has been available beginning with Enterprise Miner 7.1.
As Wendy has pointed out, the HP Forest node has been capable of variable selection beginning with Enterprise Miner 13.1, which was released in December 2013.