nismail1976
Fluorite | Level 6

Hello everyone,

I am using SAS Enterprise Miner to create a model for a categorical response variable (0, 1).

Since my event rate is about 2% and my non-event rate is 98%, I have oversampled so that I have the following proportions: 30% events, 70% non-events.

These are the results from oversampling:

Data=TRAIN

Variable   Value   Count   Percent
Resp       0       3035    70%
Resp       1       1301    30%
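(A note for anyone reproducing this outside Enterprise Miner: the poster did the oversampling with EM nodes, but the same physical oversampling can be sketched in a Base SAS DATA step. This is only an illustration; the data set names raw and oversampled and the seed are assumptions. Keeping every event and about 1/21 of the non-events turns a 2%/98% split into roughly 30%/70%, since (0.02 * 0.70) / (0.98 * 0.30) ≈ 1/21.)

/* Illustrative sketch only: physically oversample a ~2% event rate to ~30%. */
/* Assumed input RAW has the binary target Resp (1 = event, 0 = non-event).  */
data oversampled;
    set raw;
    if Resp = 1 then output;                       /* keep every event          */
    else if ranuni(20160101) < 1/21 then output;   /* keep ~4.76% of non-events */
run;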

 

At this point I correct for the bias in the sample by adding a decision node to adjust the priors, placed right before the decision tree node. Here is the process flow:

[Image: Process Flow.PNG]

 

My question is as follows: since the oversampled data contains 1301 event instances, why do I get only 86.72 event instances in the root node?

[Image: root.PNG]

 

Just to be clear: when I oversampled, I got 1301 events and 3035 non-events. When I add the decision node, I get 86.72 events and 4249 non-events. Why is that?

Thank you in advance; your help is greatly appreciated.


13 REPLIES
JasonXin
SAS Employee
Hi,
What is the purpose of adding the decision node after you have already oversampled? Could you share the details inside the decision node? Apparently it flips the data back to the pre-oversampling ratio. Hope this helps. Jason Xin
nismail1976
Fluorite | Level 6

Hi,

Thank you for responding.

Shouldn't I add a decision node after I oversample, or should I add it beforehand? This is how I corrected for the oversampling in the decision node:

[Image: decision.PNG]

 

If you could tell me what I am doing wrong (is my process flow wrong?), I would really appreciate it. I am really stuck here.

Thanks for your help.

 

JasonXin
SAS Employee
No problem. Let me try. There are two things here. One is physically re-sampling your input file before model building. The other is logically re-sampling, meaning you don't physically change the input data set but instead tell the modeling node to treat the sample as if it had been physically re-sampled.

Your case is obviously to change the initial response rate from 2% to 30%, for whatever reason. You did it the physical way (you probably ran it through Base SAS code), which is fine. As your screen shot indicates, you have already accomplished this, since the input to the decision node shows 30% for Resp = 1. You don't really need to add the decision node; if you do, as you did, and place 0.02 there, it flips the rate back to 2%.

More often we don't go back to Base SAS to re-code the data physically. A 'better' practice is to carry the raw data set and apply the decision node, where you can reset the ratio. I would encourage you to click through all four tabs at the top (Targets, Prior Probabilities, Decisions, and Decision Weights) to get a fuller understanding of what each one means. As you will see, the decision node is very flexible. Hope this helps. Jason Xin
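(A hedged Base SAS analogue of the "logical" reweighting Jason describes: attach a per-class weight of prior / sample proportion. The data set names and the weight variable w are assumptions for illustration; the decision node does this internally rather than creating a variable. Note that 1301 * (0.02 / 0.30) ≈ 86.7, which is essentially the adjusted root-node count discussed later in this thread.)

/* Illustrative sketch: logical reweighting via weight = prior / sample proportion. */
data train_w;
    set train;                          /* assumed: the oversampled training set */
    if Resp = 1 then w = 0.02 / 0.30;   /* events: true prior over sample rate   */
    else w = 0.98 / 0.70;               /* non-events                            */
run;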
nismail1976
Fluorite | Level 6

Hello,

Thank you for responding!

If I understand you correctly, I will need to place the decision node before oversampling, right?

Thank you,

nabil

nismail1976
Fluorite | Level 6

Would this process flow be more appropriate, and will it adjust my posterior probabilities?

[Image: flow.PNG]

Thanks again, you have been awesome!

JasonXin
SAS Employee
If you put the decision node right after the data set node to effect the ratio change (logical reweighting, that is), the job you initially wanted is essentially done. As for the Sample node, the only legitimate purpose I can imagine that would not alter the reweighting you just did is to proportionately sample the data set down; I am not sure why you put the Sample node in the flow. Typically, the decision node is used when you want to change the ratio logically, and the Sample node is used when you want a physically different data set (one that reflects the new ratio). So the two nodes are either/or, not both (unless the data set is big and you would like a subset to represent it); see the sketch after this reply.

The Data Partition node is different: it creates the training, validation, and test sets. If that is your goal, it stays in the flow. Hope this helps. Jason Xin
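(For completeness, a hedged sketch of the one legitimate Sample node use Jason mentions: proportional down-sampling that preserves the class mix. PROC SURVEYSELECT with a STRATA statement samples within each class; the data set names and the 50% rate are assumptions.)

/* Illustrative sketch: stratified down-sampling that keeps the 30/70 ratio. */
proc surveyselect data=train out=train_small method=srs samprate=0.5
                  seed=20160101;
    strata Resp;   /* sample within each class so the class mix is preserved */
run;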
nismail1976
Fluorite | Level 6

Hi Jason,

Thank you for responding.

I don't think I was clear from the beginning, so let me walk you through the steps I have taken.

I have an original dataset that I oversampled, partitioned, ran through a decisions node to adjust my posterior probabilities, and lastly modeled with the decision tree. (I have taken all these steps in SAS Enterprise Miner only; I haven't used Base SAS.) Here is the view:

 

[Image: workflow.PNG]

 

Now, in the original dataset the event rate is 2% and the non-event rate is 98%; when I oversample, the event rate becomes 30% and the non-event rate 70%.

In the data partition node, my training dataset contains 3035 non-events and 1301 events, for a total of 4336 observations.

In the decision node, I adjust the priors to 2% event and 98% non-event, as shown below:

[Image: decision.PNG]

 

 

Now, on to the decision tree.

If I don't use the decision node to adjust the priors, I get these proportions (30% event, 70% non-event) and counts (1301 events, 3035 non-events) at the root node:

[Image: oversample.PNG]

 

This is correct, given that I didn't adjust for priors.

 

Now, when I use the decision node to adjust the priors, I get these proportions (2% event, 98% non-event) and counts (86.72 events, 4249 non-events) at the root node:

[Image: root.PNG]

What I am trying to understand is: does SAS Enterprise Miner think that I have only 86.72 events instead of 1301, or what is going on here? I am really confused about this. (I know the total number of observations is correct: 4336.)

 

 

Also, when I build a logistic regression on the same oversampled dataset, open the results, and go to View -> SAS Code, I get the updated probabilities as follows:

 

*** Update Posterior Probabilities;
*** (each class is scaled by prior / sample proportion: 0.02/0.2997 and 0.98/0.7003);
_P0 = _P0 * 0.02 / 0.2997;
_P1 = _P1 * 0.98 / 0.7003;
*** renormalize so the adjusted posteriors sum to 1;
drop _sum; _sum = _P0 + _P1 ;
if _sum > 4.135903E-25 then do;
_P0 = _P0 / _sum;
_P1 = _P1 / _sum;
end;

That's how I know that SAS adjusted my posterior probabilities. On the other hand, when I use a decision tree, I don't get this code.

 

 

I hope I explained myself better this time.

 

Thank you so much, Jason, for your help.

PadraicGNeville
SAS Employee

Hi.

No, SAS EM does not think you only have 86.72 events.  The display is adjusting the counts to reflect the adjusted priors. There might be a display setting that turns the adjustment off (I don't know).  In any case, the computational code knows about all the observations. 
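To see where the displayed numbers come from: they are simply the training total multiplied by each prior,

4336 * 0.02 = 86.72 (displayed event count)
4336 * 0.98 = 4249.28 (displayed non-event count)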

 

The adjustments can change the computations in three places:

1. Depending on user properties, the split search will act as if there are 86.72 events or as if there are 1301 events.
2. Depending on user properties, the tree can be retrospectively pruned based on either the adjusted or the unadjusted numbers.
3. The posterior probabilities will be adjusted.

I am guessing that the default behaviour for 1 and 2 is not to incorporate the adjustments. Why? Because the adjustment typically makes the event look more rare, and rare events typically fool trees into being too small.

 

As you point out, the DATA step code coming out of the logistic node includes code at the end to adjust the posterior probabilities. The decision tree node does not output corresponding code because the posterior probabilities it outputs are already adjusted: the decision tree computes the adjustments before outputting the DATA step code.

 

Let us know if you still have questions.

-Padraic

 

nismail1976
Fluorite | Level 6
Thank you so much, you are a life saver!
nismail1976
Fluorite | Level 6

Do decision trees compute the adjusted posteriors the same way as logistic regression does?

Thanks again!

PadraicGNeville
SAS Employee

Yes.

 

P(class j) = scale * unadjusted_P(j) * prior(j) / proportion_in_data(j),

where the scale is chosen so that the sum over j of P(j) = 1.
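(A minimal DATA step sketch of that formula for the binary case in this thread, using the priors 0.02/0.98 and sample proportions 0.2997/0.7003 quoted earlier; the data set names scored and adjusted and the variables p_event and p_nonevent are hypothetical.)

/* Illustrative sketch: adjust posteriors by prior/proportion, then rescale to sum to 1. */
data adjusted;
    set scored;                               /* assumed: unadjusted posteriors          */
    p_event    = p_event    * 0.02 / 0.2997;  /* unadjusted_P(j) * prior(j) / prop(j)    */
    p_nonevent = p_nonevent * 0.98 / 0.7003;
    _sum = p_event + p_nonevent;              /* scale chosen so the posteriors sum to 1 */
    if _sum > 0 then do;
        p_event    = p_event / _sum;
        p_nonevent = p_nonevent / _sum;
    end;
    drop _sum;
run;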

nismail1976
Fluorite | Level 6

Thank you very much for your help, I really appreciate it!

