Re: oversampling in Binary Logistic

sxking2 · Posted 03-16-2024 02:12 AM

Hi,

How can I adjust for oversampling in my Binary Logistic analysis?

This is where I left off:

ods noproctitle;
ods graphics / imagemap=on;

proc logistic data=WORK.BANK;
	class combinedages 'married or not'n 'binned job'n 'binned education'n housing 
		loan contact poutcome 'contacted since by # months'n / param=glm;
	model purchased(event='1')=combinedages 'married or not'n 'binned job'n 
		'binned education'n housing loan contact poutcome 
		'contacted since by # months'n balance duration 'calls this campaing'n / 
		ctable link=logit technique=fisher;
	output out=work.bankstast2 predicted=pred_;
	score out=work.bankscores2;
run;

ballardw · Posted 03-16-2024 03:04 AM

You should describe how the sample and over-sample were done.

You would assign a weight for each observation which is typically the inverse of the probability of selecting that subject for the sample.

Also, if the sampling methodology is complex you would use PROC SURVEYLOGISTIC (and the other survey procedures) for the analysis, where you provide additional information about the sample, such as stratification and cluster variables and the type of sample: simple random, proportional to size, sequential or others.
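As a minimal sketch of the weighting idea above, assuming the data carried a selection probability and a stratum identifier: the variables samp_prob and stratum_id and the data set work.bank_w are hypothetical and do not exist in WORK.BANK as posted, and only a few of the original predictors are shown to keep the example short.

```sas
/* Hypothetical inputs: samp_prob = probability that the row was selected   */
/* into the sample; stratum_id = sampling stratum. Neither variable is in   */
/* the posted data; they stand in for whatever the sample design provides.  */
data work.bank_w;
	set work.bank;
	samp_wt = 1 / samp_prob;   /* inverse-probability sampling weight */
run;

proc surveylogistic data=work.bank_w;
	strata stratum_id;         /* describes the stratified design */
	class housing loan contact poutcome / param=glm;
	model purchased(event='1') = housing loan contact poutcome
		balance duration;
	weight samp_wt;            /* applies the sampling weight */
run;
```

Without knowing how the sample was actually drawn, the STRATA and WEIGHT inputs cannot be filled in, which is why the sampling description matters.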

sxking2 · Posted 03-16-2024 11:13 AM

Hi,
I do not know how they were done, and my project is a binary analysis. I
watched a YouTube video on what you suggested and did not find it helpful.
I have 45000 observations and about 11% were events.

I saw one video in which it looked like the minority class (I am thinking
the event) data was just duplicated enough times to raise the events' share
of the total observations. My dataset is small enough
that I could do that manually.

Is that a correct assumption?

ballardw · Posted 03-16-2024 12:32 PM

Oversample typically means that some rule for selecting the sample was different for some part of the population. Without knowing those differences in rules and the effect they would have on participation, it is extremely difficult to guess what may be needed to handle your particular "oversample".

sxking2 · Posted 03-16-2024 12:37 PM

oh,,, I thought it was when you had too few events per number of
observations.

ballardw · Posted 03-16-2024 11:49 PM

@sxking2 wrote:
oh,,, I thought it was when you had too few events per number of
observations.

Oversampling is one of the techniques for increasing the number of events you have to work with.

Suppose that in the general population of the country one person in 1,000,000 has a characteristic.

But you have information that left-handed red-headed people under 5 ft tall have the occurrence at 1 in 1000 (or some other more accessible rate). Then you include more left-handed red-headed short people in the sample than would occur with a simple random selection of people from the population, in hopes of getting more "events". The difference in selection probabilities allows you to weight the data so the result is more generally useful. This tends to make for a complex sample in some cases.

Ksharp · Posted 03-17-2024 10:03 PM

Check the SCORE statement's PRIOR= and PRIOREVENT= options, and put the real (population) event probability in them.
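A minimal sketch of that suggestion, assuming the true population event rate is known: the value 0.05 below is purely a placeholder, not a figure from this thread, and the output data set name work.bankscores_adj is made up for illustration. Only a few of the original predictors are shown.

```sas
/* Placeholder only: replace with the known event rate in the population. */
%let true_rate = 0.05;

proc logistic data=WORK.BANK;
	class housing loan contact poutcome / param=glm;
	model purchased(event='1') = housing loan contact poutcome
		balance duration;
	/* PRIOREVENT= adjusts the scored posterior probabilities to the   */
	/* stated prior event probability instead of the sample proportion. */
	score data=WORK.BANK out=work.bankscores_adj priorevent=&true_rate;
run;
```

The adjustment applies to the predicted probabilities in the scored output; the fitted parameter estimates themselves are not changed by this option.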