About M_Maldonado

M_Maldonado · ‎07-22-2014

Omer, Just confirming that you got what you need from this other thread. Thanks, Miguel

M_Maldonado · ‎07-22-2014

Pruthvi, Take a look at the section Decision Thresholds and Profit Charts in the documentation. It is in the Predictive Modeling section. You can access it by pressing F1 if you are in Enterprise Miner, or you can download the SAS EM reference help from this link: SAS Enterprise Miner. I hope it helps, Miguel

M_Maldonado · ‎07-18-2014

You can also use proc export in a SAS Code node. Paste the code below into the code editor as is, or replace &EM_EXPORT_TRAIN with the macro variable of the data set that you are interested in.

proc export data=&EM_EXPORT_TRAIN
    outfile='c:\myfiles\mymodel.csv'
    dbms=csv replace;
run;

M_Maldonado · ‎07-09-2014

You use a decision tree because you need a model, not just groupings. In our example, you are using a decision tree to model the predicted probability p_good_bad using all your inputs, to determine which inputs were the most important in modeling the target good_bad.

The decision tree algorithm (proc arbor) running behind the scenes of the Reporter node does three things at once:
- It builds a model based on the predicted probability of the event from your modeling node (the p_<target> calculated by the model node you used right before the Score node and the Reporter node).
- It calculates the variable importance and the binning of the input variables.
- It saves you from building separate decision tree models.

The way it works, proc arbor has an option to output the splitting rules into a file. In this file, the selected split is labeled as "primary", and all other candidate splits are ranked as "competitors". The file therefore contains all the information about both the split that was selected and the splits that could have been selected. This means that in just one proc call, you get all the information you need to create bins for your variables.

I hope this helps, Miguel
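As a rough illustration of how a single rules file can yield bins, the sketch below collects every threshold reported for an interval input, from primary and competitor splits alike, and uses those thresholds as bin boundaries. The record layout (`variable`, `role`, `threshold`) and the sample values are hypothetical and are not the actual proc arbor output format. The sketch is written in Python only to keep it self-contained.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical split rules; the real proc arbor rules file has its own layout.
rules = [
    {"variable": "duration", "role": "primary", "threshold": 22.5},
    {"variable": "amount", "role": "competitor", "threshold": 3900.0},
    {"variable": "duration", "role": "competitor", "threshold": 11.5},
    {"variable": "age", "role": "competitor", "threshold": 25.5},
]


def cut_points(rules):
    """Collect the sorted, de-duplicated thresholds reported for each variable."""
    cuts = defaultdict(set)
    for rule in rules:
        cuts[rule["variable"]].add(rule["threshold"])
    return {var: sorted(values) for var, values in cuts.items()}


def assign_bin(value, thresholds):
    """Return a 1-based bin index; a value equal to a threshold goes to the upper bin."""
    return bisect_right(thresholds, value) + 1


cuts = cut_points(rules)
duration_bins = [assign_bin(v, cuts["duration"]) for v in (6, 12, 24, 48)]
print(cuts["duration"], duration_bins)
```

The point of the sketch is only that one pass over the rules is enough to bin every input that appears in them, whether or not the input won the primary split.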

M_Maldonado · ‎07-09-2014

Dear G C J, Thanks for your comment! Your path may vary depending on your data and on the RPM task you select (basic, intermediate, or advanced). These results (the variable importance graph and the cross tabulations scorecard) are actually generated by the Reporter node at the end of your diagram flow, whether or not you have a node that does binning or grouping in your diagram.

As a follow-up tip, you can use the Reporter node to generate these results for any model node. To do that:
1. Connect a Score node and a Reporter node to any model node.
2. Set the Nodes property of your Reporter node to Summary.
3. Run your diagram flow and open the results. Click on View to open the PDF from the Reporter node.

You will find the same report that RPM generates. Notice that a decision tree ran behind the scenes both to calculate the variable importance of your model inputs with respect to the predicted probability of the event, and to bin the inputs of your model for the cross tabulations scorecard.

Does this answer your questions? Best regards, Miguel

M_Maldonado · ‎07-09-2014

Hey Omer, Here is a great resource that summarizes statistical tests and how to code them in SAS: Choosing the Correct Statistical Test in SAS, Stata and SPSS. I hope it helps, Miguel

M_Maldonado · ‎07-09-2014

Fabio, You can find more about SAS Enterprise Miner system requirements at http://support.sas.com/documentation/onlinedoc/miner/index.html. Look for the Administration and Configuration documentation, and try different values for memory size. SAS Enterprise Miner 13.1 also has the HPCluster node, with proc hpclus running behind the scenes, if you want to try that too. I hope it helps, Miguel

M_Maldonado · ‎07-03-2014

I highly recommend that you take the course Advanced Analytics Using SAS Enterprise Miner to build a solid foundation in most Enterprise Miner analytics tasks. In the meantime, you can read the Getting Started with SAS Enterprise Miner section in the reference help (Help->Contents menu, or press F1), and other sections as you need them.

There are several options to eliminate your non-numerical inputs. One of them is to click on the Variables ellipsis (...) in the properties of your MBR node. A menu will open. Specify "Rejected" as the role of every variable that is neither your binary target nor one of your interval inputs.

You use the Filter node to filter outliers and observations that can throw off your model. It is up to you whether to use it in your MBR flow. You can find more information about the Filter node in the reference help. Good luck!

M_Maldonado · ‎07-03-2014

Hi Teodoro, To find a suitable number of nearest neighbors, I would run several MBR nodes with different numbers of neighbors, and then use a Model Comparison node to compare their fit statistics and their score distributions. This is just my preference; I am not sure if there is a more theoretical way to do it.

There are two options to see the classification matrix:
1. For any node in the Model tab, you can see the classification matrix in your results. Go to View->Assessment->Classification Chart. If you want to see the numbers, click on the fourth icon (table button).
2. Alternatively, you can connect your MBR node to a Model Comparison node. You will see the classification matrix in the results of your Model Comparison node.

There are several options to eliminate your non-numerical inputs. One of them is to click on the Variables ellipsis (...) in the properties of your MBR node. A menu will open. Specify "Rejected" as the role of every variable that is neither your binary target nor one of your interval inputs.

You use the Filter node to filter outliers and observations that can throw off your model. It is up to you whether to use it in your MBR flow. You can find more information about the Filter node in the reference help.

I hope it helps, Miguel
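The comparison idea above (fit the same nearest-neighbor model with several values of k, then compare a fit statistic and the classification matrix on held-out data) can be sketched outside Enterprise Miner as follows. This is a minimal Python illustration on made-up data. It is not how the MBR or Model Comparison nodes are implemented, and all function names are hypothetical.

```python
from collections import Counter


def knn_predict(train, point, k):
    """Majority vote among the k training rows closest to `point` (squared Euclidean distance)."""
    nearest = sorted(
        train, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], point))
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def classification_matrix(train, valid, k):
    """Count (actual, predicted) pairs over the validation rows."""
    matrix = Counter()
    for features, actual in valid:
        matrix[(actual, knn_predict(train, features, k))] += 1
    return matrix


def misclassification_rate(matrix):
    """Share of validation rows whose predicted class differs from the actual class."""
    total = sum(matrix.values())
    wrong = sum(n for (actual, predicted), n in matrix.items() if actual != predicted)
    return wrong / total


# Made-up interval inputs with a binary target (0 or 1).
train = [((1.0, 1.0), 0), ((1.2, 0.8), 0), ((0.8, 1.1), 0),
         ((3.0, 3.0), 1), ((3.2, 2.9), 1), ((2.9, 3.3), 1), ((2.0, 2.0), 1)]
valid = [((1.1, 0.9), 0), ((3.1, 3.1), 1), ((1.9, 1.8), 0), ((2.8, 2.7), 1)]

# Odd values of k avoid tied votes with a binary target.
rates = {k: misclassification_rate(classification_matrix(train, valid, k)) for k in (1, 3, 5)}
best_k = min(rates, key=rates.get)  # the first k with the lowest validation error
print(rates, best_k)
```

Picking the smallest k among ties is just one convention; with real data you would also look at the score distributions, as suggested above.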

M_Maldonado · ‎06-26-2014

Rapid Predictive Modeler is really useful. In just a few clicks, you can create a model that serves as a great starting point for any data mining project. You can then tweak this model into a better predictive model that suits your needs. Plus, the report it generates helps you understand the main drivers of your RPM model.

In this article, you’ll learn how to interpret the results of your RPM model. If you’re looking for a guide on how to import your RPM model into SAS Enterprise or just want to learn more about RPM, check out 's article.

Data

For this example, imagine that you need to create a model using RPM to identify customers who have a high probability of defaulting on their credit payments. The data set you will use is the German Credit table found in the library sampsio. This sample data set contains inputs to model a binary target called good_bad, which flags all customers that defaulted on their credit payments.

Rapid Predictive Modeler Results

Once you have run your rapid predictive model, open the PDF report that it generates. You will find several summary tables and graphs for your model. The tables and graphs in your report help you understand the model that was selected. This means that different settings of RPM on the same data set will give you different tables and graphs as your report depends on the final model that RPM selected, not on your data set.

Selected Variable Importance and Cross Tabulations Scorecard

Two of the most useful summary results are the Selected Variable Importance and the Cross Tabulations Scorecard. The Selected Variable Importance graphic shows you which variables contribute the most to your chosen model whereas the Cross Tabulations Scorecard helps you determine which values within those significant variables have the most effect. These two reports used together provide you with the whole picture about which variables, and which values within that variable drive your RPM model.
How does Rapid Predictive Modeler Calculate Selected Variable Importance?

The variable importance in this plot is calculated using a decision tree algorithm to explain the predicted outcome variable of your RPM model. By default, the predicted outcome I_<target> (i_good_bad in this example) is used as the target, and all variables flagged as significant in the selected model are used as inputs. When you specify a decision matrix, the decision outcome chosen by the model (d_good_bad in this example) is used instead. This plot is particularly useful for explaining black box models like neural networks or support vector machines. RPM’s tree-based variable importance will not necessarily match the variable importance of methods that already calculate their own, like regression, random forest, gradient boosting, etc.

Selected Variable Importance Interpretation

Take a look at the Selected Variable Importance of this example. It indicates that there is one variable that is very significant in your model. The variable checking (years with checking account) is a very strong driver in your model to predict payment default. This information is very useful; you can now use it to set adequate delinquency policies and create new strategies for cross-sale and customer retention. Checking isn't the only important variable here. Several other variables contribute to your predictions and you’ll want to take a closer look at them, especially history, duration, amount and savings.

Figure 1. RPM Selected Variable Importance graph

How does Rapid Predictive Modeler create the Cross Tabulations Scorecard?

The variables are first binned by the same decision tree algorithm used to determine the Selected Variable Importance above. Next, the binned variables are used as inputs to a regression model that uses the predicted outcome as the target. By default the regression uses i_<target> as the dependent variable, or d_<target> if you specified a decision matrix.
In our example, the regression uses the binned inputs of the German Credit data set to explain the variable i_good_bad calculated by the selected RPM model. The scorecard points are calculated through a scaling that starts by identifying the lowest parameter estimate of the regression within a binned variable. This value is assigned a score of 0 and will be used as a reference value. The scorecard points for all other binned levels are scaled based on the difference between the parameter estimate of that bin and the parameter estimate of the reference level. The scorecard points values range from 0 to 1000 and increase as the difference between the parameter estimate and the reference parameter estimate increases. The more similar the parameter estimates across all binned levels within a variable, the closer they will be to 0. On the other extreme, if there is a binned level that explains most of a variable, and has a very high parameter estimate compared to the other binned levels, the associated score will be closer to 1000.

Cross Tabulation Scorecard Interpretation

When interpreting the scorecard for Rapid Predictive Modeling results, you’ll have to make a clear distinction between a scorecard generated through SAS Credit Scoring for SAS Enterprise Miner and one from RPM. These two scorecards are not the same. The scorecard produced by the Scorecard node is a true scorecard in the sense that points are generated in terms of certain scaling properties, and are comparable across variables. Since the scorecard from Rapid Predictive Modeler is based on a different scaling algorithm, the points from one variable are not comparable to the points of another variable. However, Rapid Predictive Modeler Cross Tabulation Scorecard gives you a clear notion of the relationship of certain values of your inputs with the event you are modeling.
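The point scaling described earlier in this article (lowest parameter estimate within a binned variable as the 0-point reference, other bins scored by their difference from it, on a 0 to 1000 range) can be sketched as follows. The article does not state the scale factor, so this Python sketch assumes that the largest difference found in any variable maps to 1000. That assumption, the function name, and the parameter estimates are all made up for illustration, and the actual RPM scaling may differ.

```python
def scorecard_points(estimates, max_points=1000):
    """Scale per-bin regression estimates to scorecard points.

    `estimates` maps variable -> {bin label -> parameter estimate}.
    Within each variable the lowest estimate is the reference (0 points).
    ASSUMPTION: the largest difference found in any variable maps to
    `max_points`; the article does not give the real scale factor.
    """
    diffs = {
        var: {b: est - min(bins.values()) for b, est in bins.items()}
        for var, bins in estimates.items()
    }
    largest = max(d for bins in diffs.values() for d in bins.values())
    if largest == 0:
        # Every bin equals its reference level, so every score is 0.
        return {var: {b: 0 for b in bins} for var, bins in diffs.items()}
    return {
        var: {b: round(max_points * d / largest) for b, d in bins.items()}
        for var, bins in diffs.items()
    }


# Made-up parameter estimates for two binned inputs.
estimates = {
    "checking": {"bin 1": 2.0, "bin 2": 1.0, "bin 3": 0.0},
    "savings": {"bin 1": 0.5, "bin 2": 0.25},
}
points = scorecard_points(estimates)
print(points)
```

Under this assumption, a variable whose bins have similar estimates (savings here) gets points close to 0, while a variable with one dominant bin (checking) reaches the top of the range, which matches the qualitative behavior described in the article.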
Once you have identified significant variables from the Selected Variable Importance graphic above, you’re now able to dive deeper and determine what specific values, or range of values, are controlling the significance of the variable. In our example, you learned that the variable checking (years with checking account) was the most important variable in your RPM model. You can now look to the Cross Tabulation Scorecard and use the scorecard points to notice that the higher the number of checking accounts, the lower the chances of a customer being bad. You can also notice this relationship by comparing the 43.54% bad rate for bin 1 to the 20.57% bad rate of bin 3. Notice as well that there are twice as many customers in bin 3, compared to bin 1, which is an indicator of a good quality portfolio.

Figure 2. Partial screen of RPM Cross Tabulation Scorecard

Conclusion

This example helps you understand better the algorithms and the logic behind two of the most useful results generated by the SAS Rapid Predictive Modeler. You should also be more familiar with the advantages and limitations of these results, and how to use them together to gain better insights about your selected model.

If you find this tip helpful, have any questions, or simply want to share your thoughts, please comment below.

M_Maldonado · ‎06-26-2014

Hi Fred, In a strict statistical sense, to be absolutely valid, your subset test population needs to be as close as possible to a random sample. If you had to go with that model anyway, you can get a rough idea of what to expect by comparing the distribution of your model's predicted probabilities between your subset and your training population. If you are introducing bias, you can test whether moving the cutoff of your predicted probabilities helps.

Now that you have a new model, it sounds like you already have the answer to your original question. If your models are different, it would have been really bad to use your original model on that subset. In the future, assessing the predicted probabilities for your training population and for your subset can give you a better idea of whether your subset is a good candidate to test a model. Good luck!
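One simple way to compare the predicted probability distributions mentioned above is the largest gap between the two empirical cumulative distributions (a two-sample Kolmogorov-Smirnov statistic), together with the share of cases flagged as events at a given cutoff. The Python sketch below uses made-up probabilities and hypothetical function names. It is only one possible check, not a procedure prescribed by Enterprise Miner.

```python
def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical cumulative distribution functions."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in sorted(set(a) | set(b)):
        cdf_a = sum(v <= x for v in a) / len(a)
        cdf_b = sum(v <= x for v in b) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def event_rate_at_cutoff(probabilities, cutoff):
    """Share of cases classified as events when the predicted probability >= cutoff."""
    return sum(p >= cutoff for p in probabilities) / len(probabilities)


# Made-up predicted probabilities for the training population and a subset.
training = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
subset = [0.5, 0.6, 0.7, 0.8]

print(ks_statistic(training, subset))
print(event_rate_at_cutoff(training, 0.5), event_rate_at_cutoff(subset, 0.5))
print(event_rate_at_cutoff(subset, 0.7))
```

A large gap suggests the subset is scored very differently from the training population, and trying a few cutoffs shows how much a shifted cutoff changes the share of predicted events in the subset.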

M_Maldonado · ‎06-25-2014

Try to post your question in the Data Management forum. You might have better chances there. Good luck!

M_Maldonado · ‎06-23-2014

Hey Fred, If I understand correctly, you trained a model using your active population, but you will only test the model on a subset of the active population: your customers with online banking. Although you most likely expect different results, you can still use your model to find good insights about your active customers and the incentives that drive their expected purchase.

Let's go through an example. Suppose you are building a model that predicts who in your active portfolio transfers a balance from other credit cards using a promotion with a special low rate. Your model finds that, historically, customers that take this promotion are men in their early 30s who pay a yearly average interest greater than or equal to 850 USD. Instead of sending this promotion via snail mail like every year, this time you are only displaying a banner on the welcome page of the ATM screens when a customer withdraws cash. Suppose also that customers in this segment are used to paying for everything with their credit cards, and rarely use cash or your ATMs. Although you expect fewer customers to take the promotion because some of them will not see the banner, you can use your model to estimate the response of your customers in their 20s and in their 40s that do visit your ATMs.

It seems to me that you have all the pieces of information to evaluate how differently your model could perform on a subset of your population. For instance, you can compare how your assessment metrics change between your active population and the subset that regularly uses online banking. Assess differences in the distribution of the predicted events or non-events, ROC curves, and fit statistics for the population and your subset.

I hope it helps, Miguel
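The comparison suggested above (the same assessment metric computed on the whole active population and on the subset) can be sketched with the area under the ROC curve. This is a minimal Python illustration with made-up scores. The `online` flag and the function name are hypothetical, and the AUC is computed directly from pairwise comparisons rather than with any Enterprise Miner node.

```python
def auc(labels, scores):
    """Probability that a random event outranks a random non-event (ties count half)."""
    events = [s for y, s in zip(labels, scores) if y == 1]
    non_events = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((e > n) + 0.5 * (e == n) for e in events for n in non_events)
    return wins / (len(events) * len(non_events))


# Made-up rows: (actual target, predicted probability, uses online banking).
rows = [
    (1, 0.90, True), (1, 0.80, False), (0, 0.70, True), (1, 0.60, True),
    (0, 0.40, False), (0, 0.30, True), (1, 0.35, False), (0, 0.20, False),
]

auc_all = auc([y for y, _, _ in rows], [s for _, s, _ in rows])
online = [(y, s) for y, s, is_online in rows if is_online]
auc_online = auc([y for y, _ in online], [s for _, s in online])
print(auc_all, auc_online)
```

A noticeable drop in the subset's AUC relative to the population is one sign that the model ranks the online banking customers less well than the customers it was trained on.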

M_Maldonado · ‎06-19-2014

What about the precision recall cutoff curve in your Cutoff node results, what does it look like?

M_Maldonado · ‎06-16-2014

Hi Aditya, Boosting procedures calculate the residuals as the derivatives of a loss function. The Gradient Boosting node, with proc treeboost behind the scenes, uses a stochastic gradient function as a loss function. The boosting from Start Groups node uses SAS code to calculate a cumulative loss function. In other words, the main difference is that Gradient Boosting node has a more modern loss function. What do you mean when you say that ensembles let you skip the usage of a cutoff node? It seems to me that you use a Cutoff node if you are interested in a better assessment of your predicted probabilities. But this would be true for any model node, regardless of whether it is an ensemble. Am I missing something? Please briefly explain if you have a chance. Thanks, Miguel
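To make the statement about residuals and loss derivatives concrete, the sketch below shows the pseudo-residuals (negative derivatives of the loss with respect to the current prediction) for two common losses, plus a toy booster that adds only a constant at each round. This is the generic gradient boosting idea written in Python with hypothetical function names. It does not describe the internals of proc treeboost or of the Start Groups node.

```python
import math


def squared_error_residual(y, f):
    """Negative derivative of 0.5 * (y - f) ** 2 with respect to f."""
    return y - f


def logistic_residual(y, f):
    """Negative derivative of the log loss for y in {0, 1}, with f on the log-odds scale."""
    return y - 1.0 / (1.0 + math.exp(-f))


def boost_constant_steps(y_values, residual, rounds=50, learning_rate=0.5):
    """Toy booster: each round adds the mean pseudo-residual, shrunk by the learning rate."""
    f = 0.0
    for _ in range(rounds):
        step = sum(residual(y, f) for y in y_values) / len(y_values)
        f += learning_rate * step
    return f


# With squared error the prediction approaches the mean of the targets.
f_sq = boost_constant_steps([1.0, 2.0, 3.0], squared_error_residual)

# With log loss the prediction approaches the log-odds of the event rate.
f_log = boost_constant_steps([1, 1, 1, 0], logistic_residual, rounds=200)
print(f_sq, 1.0 / (1.0 + math.exp(-f_log)))
```

A real boosting procedure fits a small tree to the pseudo-residuals at each round instead of a constant, but the role of the loss derivative is the same.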


Re: Unbalanced data - miner

Re: SAS EM only: How to use parameter estimates in the next node?

Re: StatExplore Node

Re: How many leaves and nodes should a tree

Re: Export scoring code for Cross Validation in SAS Enterprise Miner

Re: Export scoring code for Cross Validation in SAS Enterprise Miner

Re: run time error ensemble model

Re: run time error ensemble model

Re: help with hash table

Re: help with hash table

Re: StatExplore Node

Re: How to access Variable importance in neural network in EM?

Re: Grouping variables to create new variables SAS Enterprise Miner

Re: Error when running market basket node in SAS EM

Re: Seed Initialization Method for Hierarchical Clustering

Re: How can I use the Tobit's Model in SAS?

Re: Using cross-validation in Enterprise Miner;

Re: How come no Segment Profile after I set "Cluster Variable Role" = ...

Re: Confusion matrix in Enterprise Miner

Re: How can we export dataset from enterprise Miner as a csv file or t...

Credit Scoring by Example in SAS® Enterprise Miner™

Tip: How to model a rare target using an oversample approach in SAS® ...

Tip: How to interpret your SAS® Rapid Predictive Modeler results

Tip: Use the Cutoff Node in SAS® Enterprise Miner™ to Consume the Post...

Tip: How to build a scorecard using Credit Scoring for SAS® Enterprise...

Re: Correlation measure between binary variables

Re: Doubt on soliciting and calculation of average profit

Re: How can we export dataset from enterprise Miner as a csv file or t...

Re: Tip: How to interpret your SAS® Rapid Predictive Modeler results

Re: Tip: How to interpret your SAS® Rapid Predictive Modeler results

Re: Correlation between interval variables and binary variables

Re: System requirements for proc fastclus

Re: Classification: K nearest neighbors (MBR)

Re: Classification: K nearest neighbors (MBR)

Tip: How to interpret your SAS® Rapid Predictive Modeler results

Re: Is this bias and will validation return useless results?

Re: Will greyed out code be ignored in 'User Written Body' in SAS DIS

Re: Is this bias and will validation return useless results?

Re: Is cut off node should still be used when boosting/ensemble models...

Re: Is cut off node should still be used when boosting/ensemble models...