☑ This topic is solved.
ycenycute
Obsidian | Level 7

In SAS EM: decision tree, how to interpret lift / gain plot in score rankings?

 

I am working on a problem where Y is a binary variable. In the lift / gain chart, the X axis is the depth of the tree. The gain / lift declines as the depth grows. What does this mean? And what can we tell from the gain / lift chart?


9 REPLIES
sbxkoenk
SAS Super FREQ

It's really very simple (once you know it 🙄🙄 ).

 

It's a bit tedious and a lot of work to start explaining it from a blank page.

 

First try this :

Learn SAS with Jeff Thompson
SAS Tutorial | Lift and Response Charts in SAS

https://www.youtube.com/watch?v=M5nwm94Q7xc

 

This topic is also covered all over the internet.
Here's just one link from a quick keyword search.

Cumulative Gains and Lift Charts : https://www3.nd.edu/~busiforc/Lift_chart.html

 

Good luck,

Koen

ycenycute
Obsidian | Level 7

This is super helpful, thanks. But I am wondering: when I run a decision tree on a dataset without splitting it into train, validation, or test, what is the meaning of the baseline cumulative lift or the best cumulative lift? How are these two curves obtained?

 

[Attached screenshot: ycenycute_0-1665715244387.png]

 

ycenycute
Obsidian | Level 7

When I compare different models based on test data, how can I determine which model is better? Do I also use something like the area under the curve, as with ROC?

 

 

[Attached screenshot: ycenycute_0-1665719834784.png]

 

sbxkoenk
SAS Super FREQ

Hello @ycenycute ,

 

You have to choose a metric to compare models.

Area under the ROC curve is a good one, but the F1-score and the Kolmogorov-Smirnov distance are good choices too.
It depends a bit on what you want to achieve.

 

But do not choose a model based on test data.

You have to choose a model based on validation data.

 

The test data is not "allowed" to be seen by the model (neither during model training nor during model selection).
You get an honest measure of the generalization error of the final model by looking at the test data (assuming stationary conditions over time). If you choose the final model based on the test data, that generalization error is no longer honestly measured!
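
If it helps to see the mechanics outside SAS EM, here is a minimal sketch in Python / scikit-learn (the data, the two candidate models, and the 55 / 25 / 20 proportions are just placeholders I made up; in SAS EM the Data Partition node does the splitting for you): candidates are compared on the VALIDATION partition only, and the TEST partition is scored exactly once, by the chosen model.

```python
# Sketch of the train / validation / test discipline described above.
# Data, models and split proportions are made-up placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.9], random_state=1)

# 55% train, 25% validation, 20% test -- one possible partition
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.45, random_state=1)
X_valid, X_test, y_valid, y_test = train_test_split(X_rest, y_rest, test_size=20 / 45, random_state=1)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "gradient boosting": GradientBoostingClassifier(random_state=1),
}

# model selection happens on VALIDATION data only
valid_auc = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    valid_auc[name] = roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1])
best_name = max(valid_auc, key=valid_auc.get)

# the TEST data is touched exactly once, by the chosen model,
# to get an honest estimate of the generalization error
best_model = candidates[best_name]
test_auc = roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1])
print(f"chosen model: {best_name}, test AUC: {test_auc:.3f}")
```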

 

Koen

 
ycenycute
Obsidian | Level 7
Interesting. Is this specific to SAS? I usually just split the data into training and testing, and evaluate model performance on the test data.

As far as I understand, validation data comes up in cross-validation, where we need to select optimal hyperparameters, so we further split the training set into training and validation. But we still evaluate model performance on the test data. It is precisely because observations in the test data never enter the training process that the test data can be used to evaluate model performance.

In the end, we judge whether a prediction is good on new data, not on old or historical data. So I am indeed confused about the SAS convention for training, validation, and test data....
sbxkoenk
SAS Super FREQ

Hello @ycenycute ,

 

This TRAIN / VALIDATION / TEST data splitting, and which set to use for what, is nothing SAS-specific.

It is widely accepted as I state it.

 

If you use the TEST data to choose the final model, you can no longer say that the TEST data was never seen by the final model, because you have used it for model selection (not for model building, granted, but definitely for model selection after the comparison).

Hence, the generalization error (which tells how well the model performs on observations it has never seen) is no longer honest!

 

All data mining and machine learning models use Training AND Validation data for model building. Do not use Training data only (your models will be largely overfit, which means you went beyond pattern recognition and you included noise in the model)!!

 

Cross-Validation is a different thing. If you do not have enough data for data splitting (data partitioning), you can use k-fold cross-validation (instead of the 2 separate sets TRAIN and VALID).

If you have Training AND Validation data, you have two separate sets to do the model building.

With k-fold cross-validation, you only have one set. That set is split into k parts (folds), and every fold is used once for validation and k-1 times for training.
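
As a small illustration of that rotation (a Python sketch with made-up data; the 5 folds and the tree settings are arbitrary choices, not anything SAS-specific):

```python
# k-fold cross-validation: one data set, k rotations of the validation role
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, random_state=2)

aucs = []
for train_idx, valid_idx in KFold(n_splits=5, shuffle=True, random_state=2).split(X):
    model = DecisionTreeClassifier(max_depth=4, random_state=2).fit(X[train_idx], y[train_idx])
    # each fold is used exactly once for validation and k-1 times for training
    aucs.append(roc_auc_score(y[valid_idx], model.predict_proba(X[valid_idx])[:, 1]))

print("AUC per fold:", np.round(aucs, 3), "mean:", round(float(np.mean(aucs)), 3))
```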

 

And to make it complicated ... 😁

Cross-validation and a validation set are not necessarily mutually exclusive.

In a single decision tree for example, the tree is grown based on the training data until you reach the so-called maximal tree. Then the maximal tree is pruned using the validation data. But during growing, to find the best splits, you could use cross-validation within the training set.
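
As a rough sketch of that grow-then-prune workflow (again Python with made-up data; scikit-learn prunes via a cost-complexity parameter rather than SAS EM's exact algorithm, so take it only as an analogy): the large tree is grown on the training data, and the validation data decides how much of it to keep.

```python
# Grow a large tree on TRAIN, then let VALIDATION decide how far to prune it.
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=3000, random_state=3)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.35, random_state=3)

# the "maximal" tree is grown on training data only;
# the pruning path lists candidate cost-complexity levels
path = DecisionTreeClassifier(random_state=3).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_auc = None, -1.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=3, ccp_alpha=alpha).fit(X_train, y_train)
    auc = roc_auc_score(y_valid, tree.predict_proba(X_valid)[:, 1])
    if auc > best_auc:                      # validation data chooses the pruning level
        best_alpha, best_auc = alpha, auc

print(f"chosen ccp_alpha={best_alpha:.5f}, validation AUC={best_auc:.3f}")
```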

 

Cheers and good luck,

Koen

ycenycute
Obsidian | Level 7
Thanks for the detailed explanation. If we train a model using the training data and select a model using the validation data, then what is the purpose of setting aside test data?
Also, back to the cumulative lift question: is cumulative lift not a metric for evaluating different models? Should we instead use ROC?
sbxkoenk
SAS Super FREQ

Hey,

 

Just a correction: the validation set is not only used to select a model. The validation set is also used to shape each of the models, more specifically to make those models more "robust" (to eliminate overfitting). The training set obviously carries more weight in shaping each model, but the validation set also plays an important role.

 

Again, the test set only serves to provide a fair / honest estimate of the generalization power of a model. Since that is a limited purpose, the test set is often not created at all (when data are limited), so that more observations can go into the training and validation sets.


E.g. I often do 55%, 25%, 20% (train, valid, test respectively)
But even more often I split like this : 65%, 35%, 0% (train, valid, test respectively)

 

As for the selection metrics:
Cumulative lift is not used that often, unless you are really interested in results down to a certain depth (and no deeper).
ROC (area under the curve) is widely used, but so are the F1-score and the Kolmogorov-Smirnov distance.
It depends on what you are after: good sensitivity, good precision, a good balance between the two, or something else?
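
For what it is worth, all three metrics can be computed from the same scored data set. A small Python sketch (the target and the scores below are made up; KS is taken here as the maximum gap between the ROC curve's true-positive and false-positive rates):

```python
# AUC, F1 and Kolmogorov-Smirnov distance from one set of scored observations
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score, roc_curve

rng = np.random.default_rng(4)
y_true = rng.integers(0, 2, size=1000)                                   # made-up target
y_prob = np.clip(0.3 * y_true + rng.uniform(0, 0.7, size=1000), 0, 1)    # made-up scores

auc = roc_auc_score(y_true, y_prob)
f1 = f1_score(y_true, (y_prob >= 0.5).astype(int))    # F1 needs a classification cutoff
fpr, tpr, _ = roc_curve(y_true, y_prob)
ks = np.max(tpr - fpr)                                # KS = max separation on the ROC curve

print(f"AUC={auc:.3f}  F1={f1:.3f}  KS={ks:.3f}")
```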

 

Good luck,
Koen

sbxkoenk
SAS Super FREQ

Hello @ycenycute ,

 

Baseline cumulative lift: the lift you get by randomly shuffling your observations. So, at depth = 10% on the horizontal axis, you have 10% completely random observations (the same holds for every other quantile), and the baseline cumulative lift is therefore 1 everywhere.
It is the same result as giving exactly the same event probability to every observation, for example: posterior probability for the event = prior probability for the event!

 

Best cumulative lift: the lift you get with a perfect prediction, i.e. 100% event (posterior) probability when the observation is a real event and 0% event (posterior) probability when it is a real non-event.

 

Remember that observations are sorted by descending event probability on the horizontal axis.
Depth = 20% ( = the 20% of observations with highest event probability).
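
If it helps to see the arithmetic behind the three curves, here is a small Python sketch with made-up data (SAS EM computes the same quantities for you in the Score Rankings plot):

```python
# Cumulative lift at a given depth = event rate among the top d% highest-scored
# observations, divided by the overall event rate.
# Baseline = 1 everywhere (random ordering); best = events ranked perfectly first.
import numpy as np

def cumulative_lift(y, score, depth):
    order = np.argsort(-score)                        # sort by descending event probability
    top = y[order][: int(np.ceil(depth * len(y)))]    # the top "depth" fraction
    return top.mean() / y.mean()

rng = np.random.default_rng(5)
y = (rng.uniform(size=10_000) < 0.2).astype(int)                      # made-up target, ~20% events
score = np.clip(0.4 * y + rng.uniform(0, 0.6, size=y.size), 0, 1)     # made-up model scores

for depth in (0.1, 0.2, 0.5, 1.0):
    model = cumulative_lift(y, score, depth)
    best = cumulative_lift(y, y.astype(float), depth)   # perfect prediction
    base = 1.0                                          # random ordering / constant probability
    print(f"depth={depth:.0%}  model lift={model:.2f}  best={best:.2f}  baseline={base:.2f}")
```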

 

Koen

 
