Babloonew
Calcite | Level 5

This is about the OFFSET option in PROC LOGISTIC.

The SAS support note (http://support.sas.com/kb/22/601.html) says the option is to be used when an event (event=1) has been oversampled, but the example code undersamples the majority class rather than oversampling the minority event.

Technically, oversampling means creating similar or duplicate observations, which is not being done here.

I think the description in the support note is wrong.

OFFSET would just adjust the intercept using the prior probabilities. It would not matter whether we reach a balanced data set by oversampling the minority or undersampling the majority.

1 ACCEPTED SOLUTION
StatDave
SAS Super FREQ

The method discussed at that link, which unfortunately is also called oversampling, is different from what is discussed in Note 22601. The oversampling method discussed in the note, and the adjustments for it, including use of the OFFSET= option, are often employed in rare-event scenarios. This method is also mentioned in "Logistic Regression Using SAS: Theory and Application, Second Edition" (Allison, P., SAS Institute, 2012). The method only involves sampling the nonevents at a much lower rate than the events and then adjusting for the effect this has on the intercept in the logistic model. If you are looking to implement the method discussed in the link you provided, Note 22601 does not address it.
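To see why only the intercept needs adjusting, here is a short derivation sketch. The notation is an assumption introduced for illustration and does not come from the note: S = 1 marks an observation that ends up in the reduced sample, τ1 and τ0 are the sampling rates for events and nonevents, and π(x) = P(Y = 1 | x). It also assumes that selection depends only on the response, not on the predictors.

```latex
\begin{align*}
\operatorname{logit}\,\pi(x) &= \beta_0 + \beta^\top x
  && \text{(model in the full population)} \\
P(Y=1 \mid x,\, S=1) &= \frac{\tau_1\,\pi(x)}{\tau_1\,\pi(x) + \tau_0\,\bigl(1-\pi(x)\bigr)}
  && \text{(Bayes' rule, selection depends only on } Y\text{)} \\
\operatorname{logit}\,P(Y=1 \mid x,\, S=1) &= \log\frac{\tau_1}{\tau_0} + \beta_0 + \beta^\top x
  && \text{(slopes unchanged, intercept shifted)}
\end{align*}
```

Under these assumptions the slopes are the same in the sample and in the population, and the intercept differs by the constant log(τ1/τ0), which is the quantity an offset can remove.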


9 REPLIES 9
StatDave
SAS Super FREQ

The typical situation is that the event of interest is rare, so a very large data set might be collected in order to get some small number of events, which then represent a tiny proportion of the data. That is the case in Note 22601: in the FULL data set, only 10% of the observations are events and the remaining 90% are nonevents. Of course, it is not unusual for the event proportion to be even smaller, but this is done for illustration. If a simple random sample were drawn from FULL, say a 20% sample, it would contain about 20% of the events and 20% of the nonevents. However, the common practice called "oversampling the event" keeps all of the events (essentially, sampling them at 100%) and samples only the nonevents. That is how data set SUB is created: all events are retained and a 1/9th sample of the nonevents is taken. The Note goes on to show various ways to adjust the results for this sampling scheme. The use of the OFFSET= option is just one way, and the result shows that the true intercept and slope are well estimated.
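As a rough numeric illustration of the intercept adjustment, the sketch below computes one common form of the prior-correction offset from the counts quoted in this thread (1000 observations in the full data, 102 events and 100 nonevents in the reduced sample). It is written in Python rather than SAS, and the function names, the variable names, and the assumption that the full data held exactly 102 events are all illustrative; this is not the code from the Note.

```python
import math


def prior_correction_offset(p1: float, r1: float) -> float:
    """Log-odds shift between the sample event proportion r1
    and the population event proportion p1."""
    return math.log((r1 * (1 - p1)) / ((1 - r1) * p1))


def offset_from_rates(rate_events: float, rate_nonevents: float) -> float:
    """Same shift, written as the log ratio of the two sampling rates."""
    return math.log(rate_events / rate_nonevents)


def corrected_intercept(sample_intercept: float, offset: float) -> float:
    """Remove the sampling shift from an intercept fitted on the reduced sample."""
    return sample_intercept - offset


# Counts quoted in the thread.
n_full = 1000
events_kept = 102
nonevents_kept = 100

# Assumption: every event was retained, so the full data held 102 events.
events_full = events_kept
nonevents_full = n_full - events_full

p1 = events_full / n_full                          # event proportion, full data
r1 = events_kept / (events_kept + nonevents_kept)  # event proportion, reduced sample

offset_from_proportions = prior_correction_offset(p1, r1)
offset_from_sampling = offset_from_rates(
    events_kept / events_full,        # events sampled at 100%
    nonevents_kept / nonevents_full,  # nonevents sampled at roughly 1/9
)

print(offset_from_proportions, offset_from_sampling)
```

Both routes give the same number, because the offset depends only on the ratio of the two sampling rates. Supplying that constant as an offset variable when fitting on the reduced sample, or subtracting it from the fitted intercept afterwards, are two ways of applying the same correction.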

Babloonew
Calcite | Level 5
Thanks for your response.
My question was: when we use the term "oversampling", doesn't it mean creating new observations (synthetic or duplicate)?

In the example, there wasn't any oversampling of the event. A balanced set was reached by undersampling the majority.

Hence I thought the use of the phrase "oversampling of the event" was incorrect.

Offset is just to adjust the intercept by inputting the prior probability.
Rick_SAS
SAS Super FREQ

I think this is just semantics. The original data has 1000 obs. The reduced sample has 202 obs. The proportion of events in the reduced sample is greater than the proportion of events in the full data.  The SAS Note calls this oversampling the event. You call it undersampling the non-event. Both are correct.

Babloonew
Calcite | Level 5
Rick

How can oversampling the event and undersampling the nonevent be the same? They would result in two different training sets.

Rick_SAS
SAS Super FREQ

The reduced sample has 202 observations. If it were a simple random sample from the full data, you would expect 20.2 events and 181.8 nonevents.

In reality, the reduced sample has 102 events, so there are more events than expected. In addition, the reduced sample has 100 nonevents, so there are fewer nonevents than expected.

I call the first situation oversampling and the second undersampling. If you have different definitions, then perhaps that is the source of the confusion.
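The arithmetic behind those expected counts can be checked with a few lines. This is a minimal sketch in Python that assumes the 10% event share quoted earlier in the thread; the variable names are made up for illustration.

```python
# Figures quoted in the thread.
event_share_full = 0.10  # assumed share of events in the full data
n_reduced = 202          # size of the reduced sample
observed_events = 102
observed_nonevents = 100

# What a simple random sample of 202 observations would contain on average.
expected_events = n_reduced * event_share_full
expected_nonevents = n_reduced * (1 - event_share_full)

# Ratio above 1: more than expected. Ratio below 1: fewer than expected.
event_ratio = observed_events / expected_events
nonevent_ratio = observed_nonevents / expected_nonevents

print(f"events:    expected {expected_events:.1f}, observed {observed_events}")
print(f"nonevents: expected {expected_nonevents:.1f}, observed {observed_nonevents}")
```

Relative to a simple random sample of the same size, events are over-represented and nonevents are under-represented in the same reduced data set, which is the sense in which both labels describe it.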

 

Babloonew
Calcite | Level 5
Thanks.

I will try one more time.

Look up the terms oversampling and undersampling, and then relate them to what you have answered.

Good luck with your research

Babloonew
Calcite | Level 5
Rick,
This might help you understand what oversampling and undersampling mean in data science.
