This is in reference to the OFFSET= option in PROC LOGISTIC.
The SAS support note mentioned here (http://support.sas.com/kb/22/601.html) justifies it as being used in cases of oversampling of an event (event=1), but the example code undersamples the majority class rather than oversampling the minority event.
Technically, oversampling creates similar or duplicate observations, which is not being done here.
I think the description on the support link is wrong.
OFFSET= would just adjust the intercept using prior probabilities. It would not matter whether we reach a balanced data set by oversampling the minority or undersampling the majority.
The method discussed at that link, and which unfortunately is also called oversampling, is different from what is discussed in note 22601. The oversampling method discussed in the note, and the adjustments for them including use of the OFFSET= option, are often employed in rare event scenarios. This method is also mentioned in "Logistic Regression Using SAS: Theory and Application, Second Edition," (Allison, P., SAS Institute, 2012). The method only involves sampling the nonevents at a much lower rate than the events and then adjusting for the effect this has on the intercept in the logistic model. If you are looking to implement the method discussed in the link you provided, note 22601 does not address it.
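To make the sampling scheme concrete, here is a minimal Python sketch (not SAS, and not the note's code) that keeps every event and draws a random fraction of the nonevents. The function name `oversample_events`, the toy data of exactly 100 events and 900 nonevents, and the seed are all assumptions made for illustration.

```python
import random

def oversample_events(labels, nonevent_fraction, seed=0):
    """Return sorted row indices: every event (1) plus a random
    fraction of the nonevents (0). No observations are duplicated."""
    rng = random.Random(seed)
    events = [i for i, y in enumerate(labels) if y == 1]
    nonevents = [i for i, y in enumerate(labels) if y == 0]
    n_keep = round(len(nonevents) * nonevent_fraction)
    kept = rng.sample(nonevents, n_keep)  # sampling without replacement
    return sorted(events + kept)

# Hypothetical toy data: 10% events, as in the note's illustration.
labels = [1] * 100 + [0] * 900
sub_idx = oversample_events(labels, 1 / 9)
n_events = sum(labels[i] for i in sub_idx)
print(len(sub_idx), n_events)
```

The events end up making up about half of the subsample even though no event row was copied, which is the sense in which the events are "oversampled" here.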
The typical situation is that the event of interest is rare, so a very large data set might be collected in order to get some small number of events, which then represent a tiny proportion of the data. That is the case in Note 22601: in the FULL data set, only 10% of the observations are events and the remaining 90% are nonevents. Of course, it is not unusual for the event proportion to be even smaller, but this is done for illustration. If the FULL data set were sampled at, say, 20%, the sample would contain 20% of the events and 20% of the nonevents. However, the common practice called "oversampling the event" keeps all of the events (essentially sampling them at 100%) and samples only the nonevents. That is how data set SUB is created: all events are retained and a 1/9 sample of the nonevents is taken. The Note goes on to show various ways to adjust the results for this sampling scheme. The use of the OFFSET= option is just one way, and the result shows that the true intercept and slope are well estimated.
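The size of the intercept adjustment can be worked out from the two event proportions alone. The sketch below is in Python rather than SAS and uses the standard prior-correction formula, log((r1*(1-p1))/((1-r1)*p1)), where p1 is the event proportion in the full data and r1 the event proportion in the subsample. The function names and the rounded proportions (0.10 and 0.50) are assumptions for illustration; I have not copied the note's code.

```python
import math

def sampling_offset(pop_event_rate, sample_event_rate):
    """Log-odds shift in the intercept caused by sampling events and
    nonevents at different rates."""
    p1, r1 = pop_event_rate, sample_event_rate
    return math.log((r1 * (1 - p1)) / ((1 - r1) * p1))

def corrected_probability(sample_logit, offset):
    """Remove the offset from a logit fitted on the subsample and
    convert the result to a population-scale probability."""
    return 1 / (1 + math.exp(-(sample_logit - offset)))

# Assumed proportions: 10% events in the full data, 50% in the subsample.
offset = sampling_offset(0.10, 0.50)
p_pop = corrected_probability(0.0, offset)
print(round(offset, 4), round(p_pop, 4))
```

With these proportions the offset equals log(9), the log of the ratio of the two sampling rates (100% for events, 1/9 for nonevents). Only the proportions enter the formula, so it does not care how the imbalance was produced.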
I think this is just semantics. The original data has 1000 obs. The reduced sample has 202 obs. The proportion of events in the reduced sample is greater than the proportion of events in the full data. The SAS Note calls this oversampling the event. You call it undersampling the non-event. Both are correct.
The reduced sample has 202 observations. If it were a random sample from the full data, you would expect 20.2 events and 181.8 nonevents.
In reality, the reduced sample has 102 events, so there are more events than expected.
In addition, the reduced sample has 100 nonevents, so there are fewer nonevents than expected.
I call the first situation oversampling and the second undersampling. If you have different definitions, then perhaps that is the source of the confusion.
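The arithmetic behind those expected counts, as a small Python check. It assumes the 10% event rate stated for the full data; the function name `expected_counts` is just for illustration.

```python
def expected_counts(sample_size, event_rate):
    """Expected events and nonevents in a simple random sample."""
    events = sample_size * event_rate
    return events, sample_size - events

exp_events, exp_nonevents = expected_counts(202, 0.10)
obs_events, obs_nonevents = 102, 100  # counts in the reduced sample

# Positive difference: more than expected; negative: fewer than expected.
event_diff = obs_events - exp_events
nonevent_diff = obs_nonevents - exp_nonevents
print(round(exp_events, 1), round(exp_nonevents, 1))
print(round(event_diff, 1), round(nonevent_diff, 1))
```

The two differences have the same size and opposite signs, which is why "oversampling the event" and "undersampling the nonevent" describe the same subsample.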