Statistical Procedures

Babloonew · Posted 07-30-2019 01:50 AM

This is about the OFFSET= option in PROC LOGISTIC.

The SAS support note (http://support.sas.com/kb/22/601.html) justifies it as being used in cases of oversampling of an event (event=1), but the example code undersamples the majority class rather than oversampling the minority event.

Technically, oversampling means creating similar or duplicate observations, which is not being done here.

I think the description on the support link is wrong.

OFFSET would just adjust the intercept using prior probabilities. It would not matter whether we reach a balanced data set by oversampling the minority or undersampling the majority.

StatDave · Posted 07-30-2019 10:38 AM

The typical situation is that the event of interest is rare, so a very large data set might be collected in order to get some small number of events, which then represent a tiny proportion of the data. That is the case in Note 22601: in the FULL data set, only 10% of the observations are events and the remaining 90% are nonevents. Of course, it is not unusual for the event proportion to be even smaller, but this is done for illustration. If a simple random sample, say 20%, were drawn from the FULL data set, it would contain 20% of the events and 20% of the nonevents. However, the common practice called "oversampling the event" keeps all of the events (essentially, sampling them at 100%) and samples only the nonevents. That is how data set SUB is created: all events are retained and a 1/9th sample of the nonevents is taken. The Note goes on to show various ways to adjust the results for this sampling scheme. The use of the OFFSET= option is just one way, and the result shows that the true intercept and slope are well estimated.
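As a rough illustration of where an offset value for this sampling scheme can come from, here is a minimal Python sketch. It assumes the offset is the log of the ratio of the sample odds to the population odds of the event, which is the usual intercept correction when events and nonevents are sampled at different rates. The function name `sampling_offset` is hypothetical, and the inputs are the figures quoted in this thread, not values taken from the Note's code.

```python
import math

def sampling_offset(pop_event_rate: float, sample_event_rate: float) -> float:
    """Log of (sample odds of the event / population odds of the event)."""
    sample_odds = sample_event_rate / (1.0 - sample_event_rate)
    pop_odds = pop_event_rate / (1.0 - pop_event_rate)
    return math.log(sample_odds / pop_odds)

# Figures quoted in this thread: about 10% events in FULL, and
# 102 events among the 202 observations in SUB.
offset = sampling_offset(pop_event_rate=0.10, sample_event_rate=102 / 202)
print(f"offset = {offset:.4f}")
```

In SAS, a value like this would be stored in a variable on the subsample and named in the OFFSET= option of the MODEL statement, so that the fitted intercept is on the scale of the full data.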

Babloonew · Posted 07-30-2019 05:33 PM

Thanks for your response.
My question was: when we use the term "oversampling", doesn't it mean creating new observations (synthetic or duplicate)?

In the example, there wasn't any oversampling of the event. A balanced set was reached by undersampling the majority.

Hence I thought the use of the phrase "oversampling of the event" was incorrect.

OFFSET is just there to adjust the intercept by supplying the prior probability.

Rick_SAS · Posted 07-31-2019 08:26 AM

I think this is just semantics. The original data has 1000 obs. The reduced sample has 202 obs. The proportion of events in the reduced sample is greater than the proportion of events in the full data. The SAS Note calls this oversampling the event. You call it undersampling the non-event. Both are correct.

Babloonew · Posted 07-31-2019 08:46 AM

Rick,

How can oversampling the event and undersampling the nonevent be the same? They would result in two different training sets.

Rick_SAS · Posted 07-31-2019 09:13 AM

The reduced sample has 202 observations. If it were a random sample from the full data, you would expect

20.2 events and 181.8 nonevents.

In reality, the reduced sample has 102 events, so there are more events than expected.

In addition, the reduced sample has 100 nonevents, so there are fewer nonevents than expected.

I call the first situation an oversampling and the second an undersampling. If you have different definitions, then perhaps that is the source of the confusion.
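The arithmetic in this post can be checked with a few lines of Python. This is only a sketch of the comparison described above, using the numbers quoted in the thread (202 observations, a 10% event rate in the full data, 102 events and 100 nonevents in the reduced sample); the variable names are made up for illustration.

```python
n_sample = 202
event_rate_full = 0.10  # event proportion in the full data, as stated in the thread

# Counts expected if the 202 observations were a simple random sample.
expected_events = n_sample * event_rate_full
expected_nonevents = n_sample * (1 - event_rate_full)

# Counts actually present in the reduced sample.
observed_events, observed_nonevents = 102, 100

print(f"expected: {expected_events:.1f} events, {expected_nonevents:.1f} nonevents")
print(f"observed: {observed_events} events, {observed_nonevents} nonevents")
print(f"events vs. expected:    x{observed_events / expected_events:.2f}")
print(f"nonevents vs. expected: x{observed_nonevents / expected_nonevents:.2f}")
```

Relative to a random sample of the same size, events are over-represented and nonevents are under-represented, which is the sense in which both descriptions apply to the same reduced sample.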

Babloonew · Posted 07-31-2019 05:04 PM

Thanks.

I will try one more time.

Look up the terms oversampling and undersampling and then relate them to what you have answered.

Good luck with your research

Babloonew · Posted 07-31-2019 05:16 PM

https://towardsdatascience.com/having-an-imbalanced-dataset-here-is-how-you-can-solve-it-1640568947e...

StatDave · Posted 08-01-2019 02:42 PM

The method discussed at that link, which unfortunately is also called oversampling, is different from what is discussed in Note 22601. The oversampling method discussed in the Note, and the adjustments for it, including use of the OFFSET= option, are often employed in rare event scenarios. This method is also mentioned in "Logistic Regression Using SAS: Theory and Application, Second Edition" (Allison, P., SAS Institute, 2012). The method only involves sampling the nonevents at a much lower rate than the events and then adjusting for the effect this has on the intercept in the logistic model. If you are looking to implement the method discussed in the link you provided, Note 22601 does not address it.
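To make the effect on the intercept concrete, here is a small Python sketch of the same adjustment applied after the fact to predicted probabilities. It assumes a model was fitted to the subsample without any offset, so its predicted odds are inflated by the ratio of the two sampling rates. The function name and the default rates (all events kept, 1/9 of nonevents kept, as described earlier in the thread) are illustrative assumptions, not code from the Note or from Allison's book.

```python
def to_population_probability(p_sample: float,
                              event_keep_rate: float = 1.0,
                              nonevent_keep_rate: float = 1.0 / 9.0) -> float:
    """Rescale a probability predicted on the subsample to the full-data scale.

    Sampling nonevents at a lower rate than events multiplies the odds of
    the event by event_keep_rate / nonevent_keep_rate, which shifts only
    the intercept on the logit scale. Dividing the odds by that factor
    undoes the shift.
    """
    odds_sample = p_sample / (1.0 - p_sample)
    odds_population = odds_sample * nonevent_keep_rate / event_keep_rate
    return odds_population / (1.0 + odds_population)

# A prediction of 0.5 in the balanced subsample maps back to roughly the
# 10% event rate of the full data.
print(to_population_probability(0.5))
```

Because only the odds are rescaled by a constant, the slope estimates are unaffected; this is why the correction can be expressed as a single offset on the intercept.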

Babloonew · Posted 07-31-2019 05:16 PM

Rick,
This might help you understand, first of all, what oversampling and undersampling mean in data science.
