UPDATED for the 2/8/19 session
Did you miss the Ask the Expert session on SAS Modeling Best Practices? Not to worry, you can catch it on-demand at your leisure.
Watch the webinar
This session provides general guidelines for assessing and determining the best modeling methodology or methodologies for a given business issue. With demonstrations in SAS® Enterprise Miner™, learn how to:
Identify and outline your business issue, question or desire, determining whether you want to predict or describe something.
Determine the type of problem or question at hand.
Identify the best modeling algorithm or algorithms for the business issue or requirement.
Measure the effectiveness, performance or accuracy of your model.
Apply these concepts in SAS Enterprise Miner.
Here are some highlighted questions from the Q&A segment held at the end of the session for ease of reference.
You briefly mentioned there are ways to handle missing values. What are some that you would suggest?
For numeric variables, the most common way to handle missing values is to replace them with the mean or median of the variable in question. There are additional methods such as mid-range, Tukey’s biweight, Huber, and Andrew’s Wave. For categorical variables, the most common method is “count,” wherein you fill the missing values with the most common level of the categorical variable. Two further methods are distribution, wherein the replacement values are calculated based on random percentiles of the variable’s distribution, and tree imputation, wherein the replacement values are estimated by treating each input as the target, with the remaining input and rejected variables serving as the predictors. These two methods also apply to numeric variables.
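As a rough illustration of the simplest options above (mean or median for numeric variables, most common level for categorical variables), here is a minimal Python sketch. It is not how SAS Enterprise Miner implements imputation; the function names and sample data are hypothetical, and missing values are assumed to be represented as None.

```python
from collections import Counter
from statistics import mean, median


def impute_numeric(values, method="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if method == "mean" else median(observed)
    return [fill if v is None else v for v in values]


def impute_categorical(values):
    """Replace None entries with the most frequent observed level ("count")."""
    observed = [v for v in values if v is not None]
    fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


# Hypothetical sample data with missing values stored as None
ages = [25, None, 40, 31, None]
regions = ["East", "West", None, "East", "East"]

ages_mean = impute_numeric(ages, "mean")
ages_median = impute_numeric(ages, "median")
regions_filled = impute_categorical(regions)
```

The median is usually the safer choice when the variable is skewed or has outliers, because a few extreme values can pull the mean away from the bulk of the data.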
When would you use PCA vs. variable clustering for dimension reduction?
Both methodologies work well for dimension reduction. They both remove multicollinearity and decrease variable redundancy. However, the results of variable clustering are easier to interpret than the results of PCA. Additionally, both become computationally expensive as the size of the data set grows, and neither handles categorical variables well. In the case of categorical variables, both methods convert them to dummy variables, wherein each level of the categorical variable serves as its own variable. This can slow down the process because the number of dimensions grows with this conversion. Another option for categorical variables is to apply weight of evidence encoding.
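To make the dummy-variable point concrete, the following Python sketch shows how a single categorical variable expands into one column per level. The helper name and the sample data are hypothetical and only illustrate the idea, not any SAS procedure.

```python
def one_hot(values):
    """Return the sorted levels and one 0/1 indicator column per level."""
    levels = sorted(set(values))
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


# Hypothetical categorical variable with three levels
colors = ["red", "green", "blue", "green"]
levels, encoded = one_hot(colors)

# One original variable becomes len(levels) dummy variables
n_dummy_columns = len(levels)
```

A variable with hundreds of levels (for example, a postal code) would add hundreds of columns in the same way, which is where the slowdown comes from.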
When would you use random forests instead of decision trees?
This depends upon what your priorities are. If you are more concerned with building a model quickly and keeping it interpretable, you would most likely want to apply the decision tree algorithm, as it handles large data sets well, is fast to compute, and is easier to understand and explain. Decision trees, however, are prone to overfitting the training data and are therefore sensitive to any changes made to that data. Random forests limit overfitting without substantially increasing the error due to bias. Therefore, you’re more likely to get a better performing model with random forests. However, do keep in mind that they are computationally expensive and harder to interpret. So ultimately it depends upon how much time you have, what level of transparency you need to provide, and how well you need your model to perform.
What evaluation metrics would you suggest using, outside of the ones mentioned?
I mentioned ASE for regression models and misclassification (or accuracy) for classification models. However, there are several other options for either type of model that you should most likely consider along with the metrics already mentioned. When it comes to classification, you’ll want to investigate the confusion matrix in addition to obtaining the misclassification and accuracy rates. If there is information on profit or loss (benefit or cost), it is multiplied by the expected rates of the confusion matrix to obtain the total profit or loss for a specific model. Additionally, you can use the precision and recall measurements to get an idea of the impact of false positives and false negatives, respectively. The F1-score combines precision and recall (it is their harmonic mean) and may tell a more complete story of your classifier’s predictive performance. Log loss is an additional metric for classification that measures how closely the model’s predicted probabilities match the actual outcomes; the lower the log loss, the better the predictions. For regression models, the RMSE (root mean squared error) is one of the most common measurements (along with the ASE or MSE). However, the explained variance score (proportion of variance in the data explained by the model) and the R² score (proportion of variance in the dependent variable that is predictable from the independent variables) are also used to measure the goodness of fit.
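The metrics discussed above can be computed by hand, which helps when checking what a tool reports. The following is a minimal Python sketch using the standard definitions. The function names and sample data are hypothetical, the target is assumed to be coded as 0/1, and the classification helper assumes at least one predicted and one actual positive (otherwise precision or recall is undefined).

```python
import math


def classification_metrics(actual, predicted):
    """Accuracy, precision, recall and F1 from 0/1 actuals and predictions."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    precision = tp / (tp + fp)   # penalized by false positives
    recall = tp / (tp + fn)      # penalized by false negatives
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / len(pairs)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def log_loss(actual, probs, eps=1e-15):
    """Average negative log-likelihood; lower values are better."""
    total = 0.0
    for a, p in zip(actual, probs):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total += a * math.log(p) + (1 - a) * math.log(1 - p)
    return -total / len(actual)


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted))
                     / len(actual))


def r_squared(actual, predicted):
    """1 - SS_residual / SS_total."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical classification example
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)

# Two hypothetical sets of predicted probabilities for the same targets
confident = [0.9, 0.1, 0.8, 0.9, 0.2, 0.1, 0.8, 0.1]
hedged = [0.6, 0.4, 0.6, 0.6, 0.4, 0.4, 0.6, 0.4]
loss_confident = log_loss(y_true, confident)
loss_hedged = log_loss(y_true, hedged)

# Hypothetical regression example
reg_rmse = rmse([3, 5, 7], [2, 5, 9])
reg_r2 = r_squared([3, 5, 7], [2, 5, 9])
```

In this example the confident, correct probabilities produce a lower log loss than the hedged ones, which is the behavior to look for when comparing models on this metric.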
Is there a minimum response rate (1s in the target variable) that is necessary to build a classification model? Is 5% too low to build a model?
There is not a minimum, but you may find that either oversampling the rare event and then fitting the oversampled data set, or using techniques that are designed for rare events, will give you a better predictive model. The Rule Induction node was created to help with rare target variables, so you may want to try it out.
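One way to picture oversampling is the Python sketch below, which resamples the rare events with replacement until they reach a chosen share of the data, and then rescales a predicted probability back to the original event rate using the standard prior-correction formula. This is an illustrative assumption, not the procedure SAS Enterprise Miner uses; the function names, the "target" key, and the sample data are hypothetical.

```python
import random


def oversample_events(rows, target_key, event_rate, seed=0):
    """Resample events with replacement until they form `event_rate` of the data."""
    rng = random.Random(seed)
    events = [r for r in rows if r[target_key] == 1]
    non_events = [r for r in rows if r[target_key] == 0]
    # Events needed so that events / (events + non_events) == event_rate
    needed = round(event_rate * len(non_events) / (1 - event_rate))
    extra = [rng.choice(events) for _ in range(max(0, needed - len(events)))]
    return non_events + events + extra


def adjust_for_prior(p, true_rate, sample_rate):
    """Rescale a probability from the oversampled data to the true event rate."""
    num = p * true_rate / sample_rate
    den = num + (1 - p) * (1 - true_rate) / (1 - sample_rate)
    return num / den


# Hypothetical data set with a 5% response rate
rows = [{"id": i, "target": 1 if i < 5 else 0} for i in range(100)]
balanced = oversample_events(rows, "target", event_rate=0.5)
new_rate = sum(r["target"] for r in balanced) / len(balanced)

# A score of 0.5 on the balanced data corresponds to the 5% base rate
adjusted = adjust_for_prior(0.5, true_rate=0.05, sample_rate=0.5)
```

The adjustment step matters because a model fit on oversampled data overstates the probability of the event; without it, the scores cannot be read as real-world response rates.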
Where can I find more information about how to assign the importance rate to the target levels, as well as about the cost matrix and the expected value?
The Getting Started with SAS Enterprise Miner Book has an example of using the cost matrix and decisioning. You can find it here. Go to the tab for your current version.
Can you show a demo of scoring a test set?
Here is a video to help you with scoring new data within SAS Enterprise Miner.
What is a good practice for variable selection when starting the model development process?
There are many theories around variable selection. The great news is that with tools like SAS Enterprise Miner, you can try several and compare the results. We have an Ask the Expert session on Variable Selection that talks about several of these methods.
Can modeling be done without using Enterprise Miner? Can modeling be done using Base SAS & SAS EG?
Yes, the analytics lifecycle presented in this session is applicable to any tool you use: SAS Enterprise Guide, SAS/STAT and Base SAS, SAS Studio, SAS Visual Statistics, SAS Visual Data Mining and Machine Learning, and any combination of these tools or others.
Does SAS Rapid Predictive Modeling (RPM) transform variables automatically if they are not normally distributed?
Yes, it will depend on which flow you choose (basic, intermediate or advanced) as to the method that is used. Here is a link where you can learn more about what RPM is doing behind the scenes.
What do you think about variable information value and weight of evidence in combination with R-Square to select variables?
Variable information value and weight of evidence (WOE) are very useful for variable selection. If you have the Credit Scoring add-on to SAS Enterprise Miner, you can do this directly in EM. Otherwise, you can use PROC HPBIN to calculate these statistics. These options are covered in another Ask the Expert session on Variable Selection.
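For readers who want to see what these statistics measure, here is a minimal Python sketch of weight of evidence and information value for one categorical input. It assumes a 0/1 target, defines WOE as the log of the non-event share over the event share (some references use the opposite sign), and assumes every level contains at least one event and one non-event. The function name and data are hypothetical, and this is not the PROC HPBIN implementation.

```python
import math
from collections import defaultdict


def woe_iv(levels, targets):
    """Return ({level: WOE}, information value) for a 0/1 target."""
    counts = defaultdict(lambda: [0, 0])  # level -> [non_events, events]
    for level, t in zip(levels, targets):
        counts[level][t] += 1
    total_non = sum(c[0] for c in counts.values())
    total_evt = sum(c[1] for c in counts.values())
    woe = {}
    iv = 0.0
    for level, (non, evt) in counts.items():
        p_non = non / total_non   # share of all non-events in this level
        p_evt = evt / total_evt   # share of all events in this level
        woe[level] = math.log(p_non / p_evt)
        iv += (p_non - p_evt) * woe[level]
    return woe, iv


# Hypothetical input with two levels and a binary target
levels = ["A"] * 4 + ["B"] * 4
targets = [0, 0, 0, 1, 0, 1, 1, 1]
woe, iv = woe_iv(levels, targets)
```

The information value is the same under either sign convention for WOE, so it can be compared across variables regardless of which definition a tool uses.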
Are data transformations applied to Train, Validation, and Test sets? Or only to Train set?
The data transformations are applied to all the data sets you have created. So, if you have created Training, Validation and Test data sets, each will include the transformed variables. The transformations are also included in the SAS score code created in the Score node for the best model.
Recommended Resources
Learn SAS Enterprise Miner
SAS Enterprise Miner Documentation
Want more tips? Be sure to subscribe to the Ask the Expert board to receive follow up Q/A, slides and recordings from other SAS Ask the Expert webinars. To subscribe, select Subscribe from the Options drop down button above the articles.