NOTE: Updated to include questions, slides and links from the November 16, 2018 Ask the Expert session.
Did you miss the Ask the Expert session on Variable Selection in SAS Enterprise Guide and SAS Enterprise Miner? Not to worry, you can catch it on-demand at your leisure.
The session covers various variable selection options.
Here are some highlighted questions from the Q&A segment held at the end of the session for ease of reference.
Where is the data that’s used for these illustrations?
Download the zip file listed in the 2nd bullet (Example Data for Getting Started with SAS Enterprise Miner 15.1) and use the donor_raw and donor_score data sets.
Where can I find reference materials?
Click the links below for specific variable selection resources. Also see the slides for additional resources.
• PROC LOGISTIC selection methods
• PROC REG selection methods
• Variable Reduction in SAS by Using Weight of Evidence and Information Value
• Principal Component Analysis Chapter
• PROC HPBIN Documentation
What is the difference between Backward Selection and Fast Backward Selection?
Backward - This method starts with all effects in the model and deletes effects that do not meet the criterion level. Fast Backward - This method starts with all effects in the model and deletes effects without refitting the model.
Are SAS HP procedures such as PROC HPIMPUTE and PROC HPLOGISTIC part of Base SAS 9.4, or are they licensed separately?
SAS HP procedures are available as part of their respective products. A common set of procedures, which includes HPIMPUTE, is released with Base SAS. HPLOGISTIC and HPREG are part of SAS/STAT. See the chart below for the procedures available depending on which products you have licensed. Documentation is also available for:
Base High Performance procedures
High Performance data mining procedures
Say you have 200 binary variables (1,0). What techniques would you employ to pick best ones?
Decision Trees do a pretty good job with binary variable inputs. That would be my first choice, and then I would try Variable Clustering.
Just to confirm, most of the methods are good for use with numerical data, correct?
Yes, that is correct.
I have a data set with over 300 variables, and there is quite a bit of multicollinearity between the predictor variables. What is the best method to reduce these variables while keeping the model interpretable? I am interested in variable interpretability.
It depends. I would start with Variable Clustering. If you have access to SAS Enterprise Miner then the Variable Selection, LARS and HP Forest would be good methods to explore as well.
Can Enterprise Guide or Base SAS do all the functions of SAS Enterprise Miner?
No. SAS Enterprise Guide provides data management capabilities as well as statistical capabilities based upon your license. Enterprise Miner creates descriptive and predictive models. In addition, Enterprise Miner has additional analytical tools (PROCs) that are used specifically for data mining and predictive modeling. SAS Enterprise Miner is also designed to use large data and to utilize training and validation data sets. Algorithms like Neural Networks, Gradient Boosting, Support Vector Machines and Random Forest, along with the ability to compare models and automatically create Base SAS score code for the winning model, are unique to Enterprise Miner.
How would you choose a variable to represent a cluster when doing variable clustering in Enterprise Guide? I saw that Enterprise Miner would select one for you.
You can use the same criterion that Enterprise Miner uses, which is the minimum 1-R squared ratio. For example, in Cluster 1 in EG, LIFETIME_CARD_PROM has the lowest ratio at 0.1156, so it would represent that cluster.
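To make the criterion above concrete, here is a minimal Python sketch of picking one representative per cluster. It assumes the ratio in question is the 1-R squared ratio that variable clustering output reports, (1 - R squared with own cluster) / (1 - R squared with next closest cluster). The variable names and R squared values are made up for illustration and are not from the session data.

```python
# Hypothetical clustering output, one tuple per variable:
# (cluster, variable, R-squared with own cluster, R-squared with next closest cluster).
# All names and values are invented for illustration.
rows = [
    (1, "var_a", 0.90, 0.20),
    (1, "var_b", 0.75, 0.30),
    (2, "var_c", 0.60, 0.10),
    (2, "var_d", 0.85, 0.40),
]


def one_minus_r2_ratio(r2_own, r2_next):
    """(1 - R2 own cluster) / (1 - R2 next closest cluster); smaller is better."""
    return (1.0 - r2_own) / (1.0 - r2_next)


def pick_representatives(rows):
    """Return {cluster: (variable, ratio)} keeping the lowest ratio per cluster."""
    best = {}
    for cluster, name, r2_own, r2_next in rows:
        ratio = one_minus_r2_ratio(r2_own, r2_next)
        if cluster not in best or ratio < best[cluster][1]:
            best[cluster] = (name, ratio)
    return best


representatives = pick_representatives(rows)
for cluster, (name, ratio) in sorted(representatives.items()):
    print(f"Cluster {cluster}: {name} (1-R2 ratio = {ratio:.4f})")
```

A low ratio means the variable is strongly tied to its own cluster and weakly tied to the nearest other cluster, which is why it is a reasonable stand-in for the cluster.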
Is there an equivalent method to combining results for variable selection to the metadata node in SAS Enterprise Miner available in SAS Enterprise Guide?
No. The Metadata Node is unique to SAS Enterprise Miner. You could write code to allow you to combine the results from several different methods.
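As a rough idea of what such code could look like, here is a minimal Python sketch that combines the keep/reject decisions of several selection methods. It is not how the Metadata node is implemented. The rule names mirror the Combine options (Any, All, Majority), and their meaning here is an assumption: Any rejects a variable if any method rejected it, All rejects only if every method rejected it, and Majority rejects if more than half did. The method names and variable names are hypothetical.

```python
# Hypothetical results: for each selection method, the set of inputs it kept.
selected_by_method = {
    "stepwise": {"age", "income", "gifts"},
    "tree": {"income", "recency"},
    "clustering": {"income", "age", "recency"},
}
all_inputs = {"age", "income", "gifts", "recency", "region"}


def combine(selected_by_method, all_inputs, rule):
    """Return the inputs that survive the given combine rule.

    rule = "any":      reject if any method rejected the variable
    rule = "all":      reject only if all methods rejected it
    rule = "majority": reject if more than half of the methods rejected it
    """
    n_methods = len(selected_by_method)
    kept = set()
    for var in all_inputs:
        rejections = sum(var not in kept_set for kept_set in selected_by_method.values())
        if rule == "any":
            rejected = rejections >= 1
        elif rule == "all":
            rejected = rejections == n_methods
        elif rule == "majority":
            rejected = rejections > n_methods / 2
        else:
            raise ValueError(f"unknown rule: {rule}")
        if not rejected:
            kept.add(var)
    return kept


for rule in ("any", "all", "majority"):
    print(rule, sorted(combine(selected_by_method, all_inputs, rule)))
```

With many methods in play, the "any" rule shrinks the list quickly, while "all" keeps anything that at least one method liked.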
Can you do decision tree with Enterprise Guide?
There is not a task to do Decision Trees within Enterprise Guide. If you have SAS 9.4 there is PROC HPSPLIT which does High Performance Decision Trees.
What method can you use when you do NOT have binary value in the target?
You would not be able to use Logistic Regression but you could certainly use Linear Regression methods. You can also use Decision Trees, Correlation, Principal Components and Variable Clustering.
Would it be possible to use a structural equation modeling or hierarchical linear modeling in Enterprise Miner?
You would only be able to do this type of modeling within Enterprise Miner by using the SAS Code node with PROC CALIS. There is not a node available for structural equation modeling.
Is HP Forest the same as gradient boosting?
Gradient Boosting and HP Forest (Random Forest) are both algorithms that create multiple trees, but the algorithms differ from there. The simplest explanation is that HP Forest randomly samples both rows and variables for each tree, while Gradient Boosting fits the next tree to the residuals of the previous tree.
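The contrast can be sketched in a few lines of Python. This is a toy illustration only, not the HP Forest or Gradient Boosting node algorithms: the "trees" are single-split stumps, the forest samples rows with replacement (an assumption), and the data, function names and parameters are all made up.

```python
import random


def fit_stump(X, y, columns):
    """Least-squares single-split tree (stump) using only the given columns."""
    best = None
    for j in columns:
        for t in sorted({row[j] for row in X})[:-1]:
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    if best is None:  # no split possible: predict the mean
        mean = sum(y) / len(y)
        return lambda row: mean
    _, j, t, lm, rm = best
    return lambda row: lm if row[j] <= t else rm


def fit_forest(X, y, n_trees=25, n_cols=1, seed=0):
    """Each tree sees a random sample of rows and a random subset of variables."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows
        cols = rng.sample(range(p), n_cols)         # sample variables
        trees.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], cols))
    return lambda row: sum(t(row) for t in trees) / len(trees)


def fit_boosting(X, y, n_trees=25, rate=0.5):
    """Each tree is fit to the residuals left by the trees before it."""
    cols = range(len(X[0]))
    residuals = list(y)
    trees = []
    for _ in range(n_trees):
        tree = fit_stump(X, residuals, cols)
        trees.append(tree)
        residuals = [r - rate * tree(row) for row, r in zip(X, residuals)]
    return lambda row: sum(rate * t(row) for t in trees)


def mse(model, X, y):
    return sum((model(row) - yi) ** 2 for row, yi in zip(X, y)) / len(y)


# Made-up data: two inputs, one interval target.
X = [[1, 5], [2, 3], [3, 8], [4, 1], [5, 7], [6, 2], [7, 6], [8, 4]]
y = [1.0, 1.2, 2.9, 3.1, 5.0, 5.2, 6.8, 7.1]

forest = fit_forest(X, y)
boosted = fit_boosting(X, y)
print("forest training MSE:  ", round(mse(forest, X, y), 4))
print("boosting training MSE:", round(mse(boosted, X, y), 4))
```

The forest trees are independent of each other and are averaged; the boosting trees are sequential, and each one only makes sense in the context of the ones before it.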
When will lasso be available for binary outcomes?
LARS is available in SAS Enterprise Miner for binary outcomes. I do not see it on the plan for PROC LOGISTIC. If you would like it to be added, please send a suggestion to support@sas.com.
I signed up for but missed the event on June 23, Model Selection. Is there some way for me to watch a recorded version of that event?
Yes, you can watch the recording for the Model Selection session.
Why not use absolute correlation?
You can certainly use absolute correlation, and it may be a better way to look at the results. I kept both the negative and positive correlations in order to see the direction of the correlation as I was evaluating variables to use in the model.
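Both views can be had at once: rank by absolute correlation but keep the sign for interpretation. A minimal Python sketch follows, using Pearson correlation and made-up input names and values (none of this is from the session data).

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Made-up target and inputs for illustration.
target = [0, 0, 1, 1, 1, 0]
inputs = {
    "x1": [1, 2, 3, 4, 5, 2],
    "x2": [9, 8, 2, 1, 3, 7],
    "x3": [4, 5, 4, 5, 4, 5],
}

correlations = {name: pearson(values, target) for name, values in inputs.items()}

# Sort by strength (absolute value) but report the signed value for direction.
ranked = sorted(correlations.items(), key=lambda item: abs(item[1]), reverse=True)
for name, r in ranked:
    print(f"{name}: r = {r:+.3f}")
```

Here x2 ranks first even though its correlation is negative, which a sort on the signed value would have pushed to the bottom.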
Why do we get different results between EM and EG? Are the datasets fed into both the same?
The data in EM is split into training and validation data sets, while EG uses the entire data set. Also, the procedures being used are different (i.e., EG uses PROC LOGISTIC and EM uses PROC DMREG).
How would you cluster categorical variables using Enterprise Guide? (I think your example was for continuous variables only.)
The clustering algorithm in EG only accepts numerical variables. If you wanted to cluster your categorical variables you would need to recode them into indicator (0,1) variables and use the indicator variables in the clustering algorithm.
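The recoding step looks like this in a minimal Python sketch. The function name, the variable name and its levels are hypothetical; in practice you would do the equivalent in a DATA step or query before running the clustering task.

```python
def to_indicators(values, prefix):
    """Recode one categorical variable into one 0/1 indicator per level."""
    levels = sorted(set(values))
    return {
        f"{prefix}_{level}": [1 if v == level else 0 for v in values]
        for level in levels
    }


# Made-up categorical input.
region = ["east", "west", "south", "east", "west"]
indicators = to_indicators(region, "region")

for name, column in indicators.items():
    print(name, column)
```

This sketch creates an indicator for every level. Whether to drop one level as a reference depends on what you do with the indicators afterwards.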
What is the difference between SAS Enterprise Guide and SAS Enterprise Miner?
SAS Enterprise Guide is a point-and-click Windows interface for using SAS. SAS Enterprise Miner is a data mining product for implementing data mining projects.
I have about 80 variables. After stepwise selection, there are still about 30 left in the model. Is that normal? What potential problems might there be?
Yes, that is normal. It may or may not create problems. What is the objective of your model? If it is accuracy then you will want to watch the Model Selection Ask the Expert and select a criterion like misclassification to compare several models. You can choose from some of the other variable selection techniques and create additional models and then use the techniques for model selection to see which model is the best.
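The comparison step itself is simple. Here is a minimal Python sketch of choosing among candidate models by misclassification rate on validation data. The model names, targets and predictions are invented for illustration.

```python
def misclassification_rate(actual, predicted):
    """Share of cases where the predicted class differs from the actual class."""
    return sum(a != p for a, p in zip(actual, predicted)) / len(actual)


# Made-up validation target and predicted classes from three candidate models,
# each built on a different variable selection technique.
validation_target = [1, 0, 1, 1, 0, 0, 1, 0]
candidate_predictions = {
    "stepwise_30_inputs": [1, 0, 1, 0, 0, 0, 1, 1],
    "tree_selected_inputs": [1, 0, 1, 1, 0, 1, 1, 0],
    "clustered_inputs": [0, 0, 1, 0, 0, 0, 1, 1],
}

rates = {
    name: misclassification_rate(validation_target, predicted)
    for name, predicted in candidate_predictions.items()
}
best_model = min(rates, key=rates.get)

for name, rate in sorted(rates.items(), key=lambda item: item[1]):
    print(f"{name}: {rate:.3f}")
print("best:", best_model)
```

Comparing on validation data rather than training data is what keeps a 30-input model from winning just because it has more inputs to fit with.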
Want more tips? Be sure to subscribe to the Ask the Expert board to receive follow up Q/A, slides and recordings from other SAS Ask the Expert webinars. To subscribe, select Subscribe from the Options drop down button above the articles.
Hi Melodie,
Thanks for the tips!
I'm developing my model in SAS Miner and wanted to use your tip about combining different variable selection nodes in metadata node.
However when I try to change "Combine Rule" option I always get the same variables. I also noticed that they change according to the node selected in "Import Selection". What am I doing wrong?
Hope you can help me,
Magda
@MBRACH - Thank you for your comment. @MelodieRush was busy so she asked me to respond to your note. There is likely not a problem in this situation. The Metadata node is only able to display the metadata for one preceding node at a time, so you should not be looking at the Metadata node to see the differences in the models that are formed.
Instead, consider creating multiple Metadata nodes with different settings for the Combine property (Combine = Any, Combine = All, Combine = Majority) and then follow each of those Metadata nodes with a default Regression node where no additional variable selection is done. You can then connect all of the new Regression nodes to a new Model Comparison node and run the flow. You will then likely see differences in the Regression node results as well as in the Model Comparison node. Depending on the complexity of the relationships in your data, it is possible that not all of the fitted models will be different.
Also, given the large number of variable selection methods you are using, it is possible that using Combine=Any will result in no input variables being selected at all, since this option rejects a variable if any of the variable selection methods rejected that variable. You might also consider using a subset of the variable selection methods, rather than so many, for Combine=Any.
I often like to use Combine=All and do further variable selection in the final modeling node when possible. For example, you can use a selection method in a final Regression node so that it chooses only the variables that are necessary for that model. If your final modeling node does not have variable selection (e.g., a Neural Network node), then you might consider using different subsets of variables and passing them to different Neural Network nodes to see how they impact the overall fit.
Hope this helps!
Cordially,
Doug