harsh0404
Fluorite | Level 6

I have a response rate of 2%. I oversampled to 50/50 and built a model. (I took the original data set and oversampled it using the Sample node in SAS Enterprise Miner.)

Now I have to score new observations, but before that I need to add a Decisions node and change the decision weights (to adjust the probabilities, since they are inflated by the oversampling).

I am getting stuck on exactly what changes to make in the Decisions node. I don't have a profit/loss matrix.

I am following this -

https://communities.sas.com/t5/SAS-Communities-Library/Tip-How-to-model-a-rare-target-using-an-overs...

 

In my Decisions node, what should go in the lower right, upper right, lower left, and upper left cells of the Decision Weights tab? (I didn't follow how they got 1.0526 in that thread.)

 

In that thread they have a 5% response rate and oversampled to 50/50.

 

 

DougWielenga
SAS Employee


There are several things to consider in this situation.

 

Oversampling to 50/50:   This popular approach seems to originate from the fact that, for a fixed sample size, the greatest power for detecting a binary outcome occurs when the sample is balanced.  When you are oversampling, however, you are no longer working with a fixed sample size.  In data mining, it is common for one event to be far more rare than the other.  In this situation, oversampling to 50/50 (especially when the response rate is only 2%) risks producing a non-representative sample of the non-events.  Classic measures of model performance are also likely to look very different in highly unbalanced situations, as I discussed in

 

https://communities.sas.com/t5/SAS-Data-Mining-and-Machine/Help-with-over-under-sampling-of-the-rare...

 

Using Inverse Priors:  In many situations, you can address both the issues inherent in heavy oversampling and the rareness of the event of interest by setting up decision weights, as discussed in SAS Note 47965, available at

 

http://support.sas.com/kb/47/965.html

 

In situations where you don't have specific costs/profits (and even in situations where you do!), this is a reasonable approach to identifying useful models that you might otherwise be unable to find without oversampling heavily.
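
To make that concrete, here is a rough DATA step sketch of the decision rule that inverse-prior weights imply, assuming a hypothetical scored data set named SCORED with a predicted event probability P_HAT and a 2% population event rate:

/* Sketch only (hypothetical data set SCORED and variable P_HAT).  With
   inverse-prior weights 1/prior on the diagonal and 0 off the diagonal,
   choosing the decision with the larger expected weight reduces to the
   rule p_hat > prior1. */
data decided;
   set scored;
   prior1 = 0.02;                             /* population event rate     */
   prior0 = 0.98;                             /* population non-event rate */
   ew_event    = p_hat * (1 / prior1);        /* expected weight of deciding "event"     */
   ew_nonevent = (1 - p_hat) * (1 / prior0);  /* expected weight of deciding "non-event" */
   decision = (ew_event > ew_nonevent);       /* equivalent to p_hat > 0.02              */
run;

In other words, with inverse-prior weights the model ends up flagging any observation whose predicted probability exceeds the prior, rather than using a 0.5 cutoff that rarely fires for a rare event.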

 

Readjusting the probabilities for oversampling:  If you set up prior probabilities using Decision Processing (click on the ... to the right of Decisions for your Input Data node) and click on the Default with Inverse Prior Weights button on the Decisions tab inside the Decisions Processing dialog, your model will reflect the original population even if you oversample.  
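
For reference, the adjustment that maps an oversampled model's probability back to the original population is the standard prior correction. A rough sketch, again assuming a hypothetical data set SCORED with predicted probability P_HAT from a model trained on a 50/50 sample of a 2% population (Enterprise Miner performs the equivalent adjustment for you when priors are specified, as described above):

/* Standard prior correction (sketch only; variable and data set names are
   illustrative). */
data scored_adjusted;
   set scored;
   prior1  = 0.02;   sample1 = 0.50;   /* event rate: population vs. training sample     */
   prior0  = 0.98;   sample0 = 0.50;   /* non-event rate: population vs. training sample */
   num   = p_hat * (prior1 / sample1);
   den   = num + (1 - p_hat) * (prior0 / sample0);
   p_adj = num / den;                  /* probability on the original population scale   */
run;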

 

An important question to ask:  Do you really need the probabilities in terms of the original population?  Adjustments made using the Decisions node attempt to estimate the probabilities that a representative sample from the population would have produced.  In reality, the adjustment changes the predicted probabilities but does not change the sort order of the observations.  Whether the data is oversampled and then adjusted or the raw data is used, the probabilities are still only approximations.  In practice, performance on holdout data is likely optimistic for many reasons, including

   * the holdout data is often used to choose the final model 

   * the holdout data is typically separated in time, and other factors that influence the outcome might have changed

   * the target may be a surrogate for the actual target of interest (e.g., modeling response to a past campaign in order to predict response to a future campaign is a surrogate-target scenario)

 

In practice, it might be more useful to look at the distribution of the predicted outcomes in spite of the oversampling.  Should you wish to adjust them, however, you can specify the priors in the Decision Processing dialog of the Decisions node in the same way you can in the Input Data node; just be sure to do it in only one place.  The dialog is the same in both places, but in the Decisions node you access it by clicking on the ... to the right of Custom Editor.  In general, I prefer doing this in the Input Data node.  Please note that changing the priors can also impact the associated decisions; in general, it makes sense to use decision weights based on the priors.

 

The weights you saw in the example came from a data set where the rare event had 5% and the common event had 95%.  The inverse prior weights then became 20.00 (1 / 5%) and  1.0526 ( 1 / 95%).    
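
Applying the same calculation to a 2% response rate, the inverse prior weights would be 50 (1 / 2%) and roughly 1.0204 (1 / 98%). Assuming the default layout of the Decision Weights tab (target levels in rows, decisions in columns), those values go on the diagonal where the decision matches the actual level, and the off-diagonal cells stay at 0:

                        DECISION1 (predict 1)    DECISION2 (predict 0)
   LEVEL 1 (event)           1 / 0.02 = 50                 0
   LEVEL 0 (non-event)              0             1 / 0.98 = 1.0204 (approx.)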

 

Hope this helps!

Doug

 

 
