Hi everyone,
I have been looking online for an explanation to this and came here as a last resort.
- Can anyone tell me the difference between the gains chart and the % response chart in SAS EM, and their applications?
- The gains chart plots positive predicted value (or gains) vs. depth, and therefore I believe it can be used to optimise response rates for target marketing projects. However, the % response chart plots % response against depth but generates different results from the gains chart. Can someone please clarify?
Thanks
The short answer is that Gains are calculated relative to the baseline (overall average) rate, while Response rates are computed without adjusting for the overall rate. The SAS Enterprise Miner help utility is available by opening the application and then clicking Help --> Contents to display the help topics. You can click the magnifying glass near the top, type "Gains", and hit Enter. In the panel on the left, click Model Comparison node, then scroll down in the panel on the right until you see the Data Mining Measures section, under which you will see something like the following (some information omitted here):
The following terminology is used for the Gains and Lift Chart:
By way of example, suppose the response rate (aka % Response) is 12% in a particular bin, but the overall average response rate for the whole population is 4%. The % Response for the bin = 12%, but Gain = ((12%/4%)-1) = (3-1) = 2 (or 200%), since the increase from 4% to 12% is a change of 8 percentage points, which is twice as large as the baseline value of 4%. In this way, Gain might be negative but % Response cannot ever be negative. In SAS Enterprise Miner, the axis for Gain was marked using a % scale, so my example above would have displayed the Gain as 200 for that decile.
In practice, the Gain plot in SAS Enterprise Miner represents the cumulative gain at a particular bin and produces exactly the same plot as the Cumulative % Response graph but with different axis markings on the vertical axis. Suppose you are looking at the 3rd decile. In this example, the % Response plot only shows the percent of respondents in the 3rd decile while the Cumulative % Response plot shows the percent response for the top 3 deciles.
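To make the per-bin vs. cumulative distinction concrete, here is a minimal Python sketch (not SAS code, and not how Enterprise Miner computes its tables internally). The decile counts are made up for illustration; they were chosen so the top decile responds at 12% and the overall rate is 4%, matching the example above. The function name `response_table` and the dictionary keys are hypothetical.

```python
# Hypothetical data: 10 deciles of 1,000 scored cases each, sorted best-first.
# The responder counts are invented for illustration only.
bin_sizes = [1000] * 10
responders = [120, 80, 60, 40, 30, 25, 20, 15, 6, 4]


def response_table(bin_sizes, responders):
    """Return per-bin and cumulative response statistics for ranked bins."""
    total_n = sum(bin_sizes)
    baseline = sum(responders) / total_n  # overall response rate
    rows = []
    cum_n = 0
    cum_r = 0
    for depth, (n, r) in enumerate(zip(bin_sizes, responders), start=1):
        cum_n += n
        cum_r += r
        cum_rate = cum_r / cum_n
        rows.append({
            "bin": depth,
            # Response rate inside this bin only.
            "pct_response": 100 * r / n,
            # Response rate over this bin and every better-ranked bin.
            "cum_pct_response": 100 * cum_rate,
            # Gain as described in this thread: (rate / baseline - 1), in %.
            "cum_gain_pct": 100 * (cum_rate / baseline - 1),
        })
    return rows


rows = response_table(bin_sizes, responders)
for row in rows:
    print(f"{row['bin']:>3}  {row['pct_response']:6.2f}  "
          f"{row['cum_pct_response']:6.2f}  {row['cum_gain_pct']:7.2f}")
```

With these invented counts, the 3rd decile has a % Response of 6% on its own, while the Cumulative % Response over the top three deciles is about 8.67%. The cumulative gain is just a rescaling of the cumulative response rate by the fixed baseline, which is why the two curves have the same shape with different vertical axis markings.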
Hope this helps!
Doug
Isn't the 200% the lift you have calculated and not the gain?
Lift can be computed by dividing the group response rate by the overall response rate. In my previous example, if the group response rate is 12% and the overall response rate is 4%, Lift = 12% / 4% = 3 (no units on Lift), which is the same as 300%. Had the group only had a 4% response rate, Lift = 4% / 4% = 1 (or 100%), but since the group does no better than the overall population, the Gain is 0 (essentially, Lift - 1). A lift of 4 would mean the group has a response rate which is 4 times the overall response rate, but that is only 3 times better than the background rate, so the gain would be 3 (or 300%).
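The lift and gain arithmetic above can be checked with a few lines of Python. This is only a sketch of the definitions as stated in this thread (Gain = Lift - 1); the function names `lift` and `gain` are hypothetical, and the 2% case is an added illustration of a negative gain.

```python
def lift(group_rate_pct, overall_rate_pct):
    """Group response rate divided by the overall response rate (unitless)."""
    return group_rate_pct / overall_rate_pct


def gain(group_rate_pct, overall_rate_pct):
    """Improvement over the baseline, as described in this thread: Lift - 1."""
    return lift(group_rate_pct, overall_rate_pct) - 1


overall = 4.0  # overall response rate, in percent
for group in (12.0, 4.0, 16.0, 2.0):
    print(f"group={group:5.1f}%  lift={lift(group, overall):.2f}  "
          f"gain={gain(group, overall):+.2f}")
```

A group that responds at half the overall rate (2% vs. 4%) has a lift of 0.5 and a gain of -0.5, which is the case where Gain goes negative while % Response stays positive.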
Hope this helps!
Cordially,
Doug 
Hi, @DougWielenga, this is very helpful.
When doing a logistic regression, you can compute the Gains, Lift, %response, but also the ROC Curve and associated statistics of Gini and AUC. Do you have any advice on when to use each of these statistics to evaluate the quality of the model fit?
(Please note what follows is my personal opinion based on the modeling scenarios I have encountered, but opinions might vary greatly among modelers based on their own experience and objectives)
There are several things to consider when fitting a model, and how you want to use the model is critical to informing how you should select the model itself.
* Do you have specific objectives in how the model will be used? For example, if my goal is to make a business decision on certain observations (e.g. choose a specific strategy for each observation/account), I might be more concerned about how effective my strategies are on select portions of the population than on how the model fits overall. If I am dealing with a rare event, I am typically most interested in how the model performs on the (likely) small percent of the population on which I take action. Lift/Gain are calculated at specific percentages of the population, allowing me to evaluate strategy effectiveness at any given depth, while Gini and AUC (area under the curve) assess overall model performance across the entire population. In some cases, however, interpretation is critical, which raises another question.
* Do you need the model to be interpretable? Should you have a need to interpret your fitted model, you will find yourself being forced to choose from a subset of modeling approaches in order to obtain this interpretation. You should still consider more flexible modeling strategies, however, since the performance on these other models can give you an idea how much performance you are sacrificing for interpretability. Your choice in this case is more challenging since decision trees are often considered when interpretability is desired, but trees don't lend themselves to smoothly changing metrics. Trees have a relatively small number of distinct predicted values from their terminal nodes, and every observation in the same node has the same predicted value. At times, some have chosen to apply secondary models to try and better sort the observations within a node to overcome this. If you have a Regression model, however, you can choose any distinct predicted value in the data as your cutoff value and there are typically not large blocks of observations with exactly the same score among the training observations. Suppose the highest response rate occurred in terminal tree nodes with 12.3%, 6.4% and 4.8% of the data. Lift charts are computed by evaluating bins (e.g. every 5% grouping) but this means the terminal nodes from the tree do not fit nicely into the bins.
* Do you have profit/cost information that should be considered? You can specify decision weights for a categorical target variable that allow you to evaluate the most profitable outcome. In this case, many of the metrics you mentioned might be less critical than the projected revenue/profit from making certain decisions, even if those decisions are more likely to be wrong. Suppose you had data with a 1% fraud rate. A person 10 times as likely to be fraudulent still only has a 10% chance of being fraudulent, which leaves a 90% chance of not being fraudulent. If the cost of fraud is in the tens of thousands and the cost of investigation is relatively small, you might consider investigating this person who is 90% likely to be non-fraudulent. These types of considerations are not captured by Lift/Gain or Gini/AUC.
It is best to think about how these different metrics can be balanced against the business objectives. Are you more concerned about paying out huge fraud costs, or are you more concerned about angering your non-fraudulent clientele and potentially losing business? As you can see, choosing the best criterion can be difficult at times.
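The fraud reasoning in the last bullet can be sketched as a simple expected-cost comparison in Python. This is not how Enterprise Miner applies decision weights; it is a simplified illustration that assumes an investigation always prevents the loss. The dollar amounts (a 20,000 fraud loss and a 500 investigation cost) and the function name `should_investigate` are hypothetical.

```python
def should_investigate(p_fraud, fraud_loss, investigation_cost):
    """Investigate when the expected loss from ignoring the case exceeds
    the cost of investigating it.

    Simplifying assumption: an investigation always prevents the loss.
    """
    expected_loss_if_ignored = p_fraud * fraud_loss
    return expected_loss_if_ignored > investigation_cost


FRAUD_LOSS = 20_000        # hypothetical cost of one undetected fraud
INVESTIGATION_COST = 500   # hypothetical cost of one investigation

# A case scored at 10% fraud probability (10x the 1% base rate).
print(should_investigate(0.10, FRAUD_LOSS, INVESTIGATION_COST))
# A case scored at the 1% base rate.
print(should_investigate(0.01, FRAUD_LOSS, INVESTIGATION_COST))
```

Under these assumed costs, the 10% case is worth investigating even though it is 90% likely to be non-fraudulent, while a case at the 1% base rate is not. Changing the cost figures moves the break-even probability, which is information that Lift/Gain and Gini/AUC do not use.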
Hope this helps!
Cordially,
Doug
Yes, that's very helpful. Thank you.
Hi,
First of all, I apologize for bumping this post, but it is probably the closest one to the issues I am having trouble understanding. I feel like I am about to grasp all these concepts, but there are still blurry parts concerning definitions.
- What's the difference between the "true responders" and the "responders" you mentioned in the definitions? Or is it simply loose wording?
- For the % Response, I don't understand what the "proportion of true responders" is. Is it the proportion of responders in each decile calculated with respect to the total number of responders (on the whole data set), or the proportion of responders in each decile with respect to the size of the decile (number of participants in each decile)?
To clear things up, say the whole data set contains data about 20,000 individuals, among which 5,000 gave a positive response (say, a donation, associated with a primary event). With a certain predictive model, I get the "best 5%" of my population (which gives me in total 1,000 individuals, i.e. the ones most likely to donate provided that the predictive model is decent), "next best 5%", and so on, which gives me all my deciles. Now imagine that I have a great model and, among these 1,000 individuals, it appears that 760 of them donated. Then, what will my % Response be? Will it be 760/5000 (=15.2%) or rather 760/1000 (=76%)? My guess goes towards the first choice (15.2%), but the definition I gave might not correspond to % Response but rather to % Captured Response.
- I don't understand SAS EM's notion of "cumulative". To me, cumulative should add up to 100%, however it is never the case here (except for Cumulative % Captured Response). Therefore, I don't understand the notion of Cumulative % Response.
- I don't understand the notion of % Captured Response and of its Cumulative counterpart. What's the difference between % Response and % Captured Response?
I think that, with these aspects cleared up, I'll get a better understanding of the notion of lift.
Thanks in advance if anybody answers this message,
Yann.
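For reference, the two candidate ratios in the example above can be computed side by side. The sketch below uses the usual definitions of these measures, which are assumed here rather than quoted from the SAS documentation: % Response divides by the size of the bin, and % Captured Response divides by the total number of responders. The variable names are hypothetical.

```python
population = 20_000        # individuals in the whole data set
total_responders = 5_000   # individuals who donated
bin_size = 1_000           # the "best 5%" of the population
bin_responders = 760       # donors found inside that bin

# Share of the bin that responded (assumed definition of % Response).
pct_response = 100 * bin_responders / bin_size

# Share of all responders that fall in the bin
# (assumed definition of % Captured Response).
pct_captured_response = 100 * bin_responders / total_responders

# Overall response rate, and the lift of this bin relative to it.
baseline_pct = 100 * total_responders / population
bin_lift = pct_response / baseline_pct

print(f"% Response:          {pct_response:.1f}")
print(f"% Captured Response: {pct_captured_response:.1f}")
print(f"Baseline rate:       {baseline_pct:.1f}")
print(f"Lift:                {bin_lift:.2f}")
```

Under these assumed definitions, 760/1000 = 76% would be the % Response and 760/5000 = 15.2% would be the % Captured Response. It would also explain the "cumulative" behaviour: Cumulative % Captured Response is a running sum of shares and ends at 100%, whereas Cumulative % Response is a running average that ends at the overall rate (25% here).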
