NOTE: Updated to include questions, slides and links from the August 11, 2017 Ask the Expert session.
Did you miss the Ask the Expert session on Variable Selection in SAS Enterprise Guide and SAS Enterprise Miner? Not to worry, you can catch it on-demand at your leisure.
The session covers various variable selection options.
Here are some highlighted questions from the Q&A segment held at the end of the session for ease of reference.
Where is the data that’s used for these illustrations?
Download the zip file in the second bullet of Example Data for Getting Started with SAS Enterprise Miner 14.1 and use the donor_raw and donor_score data sets.
Where can I find reference materials?
Click the links below for specific variable selection resources. Also see the slides for additional resources.
• PROC LOGISTIC selection methods
• PROC REG selection methods
• Variable Reduction in SAS by Using Weight of Evidence and Information Value
• Principal Component Analysis Chapter
• PROC HPBIN Documentation
What is the difference between Backward Selection and Fast Backward Selection?
Backward: This method starts with all effects in the model and deletes effects that do not meet the criterion level, refitting the model after each deletion. Fast Backward: This method also starts with all effects in the model but deletes effects without refitting the model.
Are SAS HP procedures such as PROC HPIMPUTE and PROC HPLOGISTIC part of Base SAS 9.4, or are they licensed separately?
SAS HP procedures are available as part of their respective products. A common set of procedures, including HPIMPUTE, is released with Base SAS. HPLOGISTIC and HPREG are part of SAS/STAT. See the chart below for the procedures available depending on which products you have licensed. Also click here for documentation on
Say you have 200 binary variables (1,0). What techniques would you employ to pick the best ones?
Decision Trees do a pretty good job with binary variable inputs. That would be my first choice, and then I would try Variable Clustering.
Just to confirm, most of the methods are good for use with numerical data, correct?
Yes, that is correct.
I have a data set with over 300 variables, and there is quite a bit of multicollinearity among the predictor variables. What is the best method to reduce these variables while keeping the model interpretable? I am interested in variable interpretability.
It depends. I would start with Variable Clustering. If you have access to SAS Enterprise Miner, then the Variable Selection, LARS and HP Forest nodes would be good methods to explore as well.
Can Enterprise Guide or Base SAS do all the functions of SAS Enterprise Miner?
No. SAS Enterprise Guide provides data management capabilities as well as statistical capabilities based upon your license. Enterprise Miner creates descriptive and predictive models. In addition, Enterprise Miner has additional analytical tools (PROCs) that are used specifically for data mining and predictive modeling. SAS Enterprise Miner is also designed to work with large data and to use training and validation data sets. Algorithms like Neural Networks, Gradient Boosting, Support Vector Machines and Random Forest, along with the ability to compare models and automatically create Base SAS score code for the winning model, are unique to Enterprise Miner.
How would you choose a variable to represent a cluster when doing variable clustering in Enterprise Guide? I saw that Enterprise Miner would select one for you.
You can use the same criterion that Enterprise Miner uses, which is the minimum 1-R squared ratio (the "1-R**2 Ratio" column in the PROC VARCLUS output). So for Cluster 1 in EG, LIFETIME_CARD_PROM would be chosen because it has the lowest ratio at 0.1156.
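The selection rule in this answer can be sketched in a few lines. This is only an illustration in Python, not SAS code, and it assumes the ratio referred to above is the 1-R squared ratio reported by PROC VARCLUS: (1 - R squared with the variable's own cluster) divided by (1 - R squared with the next closest cluster). The variable names and R squared values below are hypothetical placeholders, not results from the session.

```python
def one_minus_r2_ratio(r2_own, r2_nearest):
    """(1 - R^2 with own cluster) / (1 - R^2 with next closest cluster)."""
    return (1.0 - r2_own) / (1.0 - r2_nearest)


def pick_representative(cluster_rows):
    """Return the name of the variable with the smallest 1-R^2 ratio.

    cluster_rows holds (name, r2_own_cluster, r2_next_closest) tuples.
    """
    best = min(cluster_rows, key=lambda row: one_minus_r2_ratio(row[1], row[2]))
    return best[0]


# Hypothetical values for one cluster (placeholders, not session output).
cluster_1 = [
    ("VAR_A", 0.90, 0.20),
    ("VAR_B", 0.75, 0.30),
    ("VAR_C", 0.60, 0.10),
]

representative = pick_representative(cluster_1)
print(representative)
```

A small ratio means the variable is well explained by its own cluster and poorly explained by the nearest other cluster, which is why the minimum is a reasonable representative.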
Is there an equivalent method to combining results for variable selection to the metadata node in SAS Enterprise Miner available in SAS Enterprise Guide?
No. The Metadata Node is unique to SAS Enterprise Miner. You could write code to allow you to combine the results from several different methods.
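One simple way to combine results by hand is to count how many methods selected each variable and keep those chosen by at least some minimum number of methods. The sketch below shows that logic in Python purely as an illustration; it is not the Metadata node's behavior and not SAS code. The method labels, variable names and the two-vote threshold are all assumptions.

```python
from collections import Counter

# Hypothetical selections: method labels and variable names are placeholders.
selected_by_method = {
    "stepwise_regression": {"INPUT_A", "INPUT_B", "INPUT_C"},
    "variable_clustering": {"INPUT_A", "INPUT_C", "INPUT_D"},
    "decision_tree": {"INPUT_A", "INPUT_D", "INPUT_E"},
}


def combine_by_votes(selections, min_votes):
    """Count how many methods picked each variable; keep those with enough votes."""
    votes = Counter(name for chosen in selections.values() for name in chosen)
    kept = sorted(name for name, count in votes.items() if count >= min_votes)
    return votes, kept


votes, kept = combine_by_votes(selected_by_method, min_votes=2)
print(kept)
```

Raising min_votes gives a stricter, smaller input list; setting it to 1 gives the union of all methods.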
Can you do a decision tree with Enterprise Guide?
There is no task for Decision Trees within Enterprise Guide. If you have SAS 9.4, there is PROC HPSPLIT, which builds high-performance decision trees.
What method can you use when you do NOT have binary value in the target?
You would not be able to use Logistic Regression but you could certainly use Linear Regression methods. You can also use Decision Trees, Correlation, Principal Components and Variable Clustering.
Would it be possible to use a structural equation modeling or hierarchical linear modeling in Enterprise Miner?
You would only be able to do this type of modeling within Enterprise Miner by running PROC CALIS from the SAS Code node. There is no node available for structural equation modeling.
Is HP Forest the same as gradient boosting?
Gradient Boosting and HP Forest (Random Forest) are both algorithms that create multiple trees, but they differ from there. The simplest explanation is that HP Forest randomly samples both rows and variables for each tree, while Gradient Boosting fits the next tree to the residuals of the previous tree.
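The difference can be seen in a toy sketch. The Python below is an illustration only, not how HP Forest or the Gradient Boosting node is implemented: it uses one-split regression stumps as the "trees", squared-error loss, made-up data, and hypothetical function names. The forest draws a bootstrap sample of rows and a random subset of variables for each tree and averages the trees; the boosting loop fits each new tree to the residuals left by the trees before it.

```python
import random


def fit_stump(rows, y, features):
    """Best single-split regression stump, searching only the allowed features."""
    best = None
    for f in features:
        for threshold in sorted({r[f] for r in rows}):
            left = [t for r, t in zip(rows, y) if r[f] <= threshold]
            right = [t for r, t in zip(rows, y) if r[f] > threshold]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
            if best is None or sse < best[0]:
                best = (sse, f, threshold, lm, rm)
    if best is None:  # no usable split: fall back to predicting the mean
        mean = sum(y) / len(y)
        return lambda r: mean
    _, f, threshold, lm, rm = best
    return lambda r: lm if r[f] <= threshold else rm


def fit_forest(rows, y, n_trees, n_features, rng):
    """Each tree sees a bootstrap sample of rows and a random subset of variables."""
    all_features = list(range(len(rows[0])))
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]   # sample rows with replacement
        feats = rng.sample(all_features, n_features)     # sample variables
        trees.append(fit_stump([rows[i] for i in idx], [y[i] for i in idx], feats))
    return lambda r: sum(tree(r) for tree in trees) / len(trees)


def fit_boosting(rows, y, n_trees, learning_rate):
    """Each tree is fit to the residuals of the ensemble built so far."""
    base = sum(y) / len(y)
    features = list(range(len(rows[0])))
    pred = [base] * len(y)
    trees = []
    for _ in range(n_trees):
        residuals = [t - p for t, p in zip(y, pred)]
        tree = fit_stump(rows, residuals, features)
        trees.append(tree)
        pred = [p + learning_rate * tree(r) for p, r in zip(pred, rows)]
    return lambda r: base + learning_rate * sum(tree(r) for tree in trees)


# Made-up data: two inputs, target depends on both.
rows = [(float(i), float(i % 3)) for i in range(12)]
y = [2.0 * (i >= 6) + (i % 3) for i in range(12)]


def sse(model):
    return sum((t - model(r)) ** 2 for r, t in zip(rows, y))


forest = fit_forest(rows, y, n_trees=25, n_features=1, rng=random.Random(0))
boost_1 = fit_boosting(rows, y, n_trees=1, learning_rate=0.5)
boost_20 = fit_boosting(rows, y, n_trees=20, learning_rate=0.5)
print(sse(forest), sse(boost_1), sse(boost_20))
```

In the forest, the trees are independent of one another and could be built in any order. In boosting, the order matters because each tree depends on the errors of the ones before it.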
When will lasso be available for binary outcomes?
LARS is available in SAS Enterprise Miner for binary outcomes. I do not see it on the plan for PROC LOGISTIC. If you would like it to be added, please send a suggestion to email@example.com.
I signed up for but missed the event on June 23, Model Selection. Is there some way for me to watch a recorded version of that event?
Yes, you can watch the recording for the Model Selection session.
Why not use absolute correlation?
You can certainly use absolute correlation, and it may be a better way to look at the results. I kept both the negative and positive correlations in order to see the direction of the correlation as I was evaluating variables to use in the model.
Want more tips? Be sure to subscribe to the Data Mining Library to receive follow-up Q&A, slides and other related resources from the webinar. From the Data Mining Library, just click Subscribe in the orange bar underneath the list of recent articles.
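The two views are easy to get side by side: rank the inputs by the absolute value of the correlation, but keep the signed value next to each one so the direction is still visible. The Python sketch below is only an illustration with made-up data and hypothetical variable names; it is not the code used in the session.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Made-up target and inputs (placeholders, not the donor data).
target = [1.0, 2.0, 3.0, 4.0, 5.0]
inputs = {
    "x_pos": [2.0, 4.0, 6.0, 8.0, 10.0],
    "x_neg": [5.0, 4.0, 3.0, 2.0, 1.5],
    "x_weak": [1.0, 3.0, 2.0, 3.0, 1.0],
}

corrs = {name: pearson(values, target) for name, values in inputs.items()}

# Rank by strength (absolute value) but report the signed correlation too.
ranked = sorted(corrs.items(), key=lambda item: abs(item[1]), reverse=True)
for name, r in ranked:
    print(f"{name}: r = {r:+.3f}, |r| = {abs(r):.3f}")
```

Sorting on the signed value alone would push a strongly negative input to the bottom of the list, below inputs that have almost no relationship with the target.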