sxking2
Fluorite | Level 6

Hi,

How can I adjust for oversampling in my Binary Logistic analysis?

This is where I left off:

 

ods noproctitle;
ods graphics / imagemap=on;

proc logistic data=WORK.BANK;
	class combinedages 'married or not'n 'binned job'n 'binned education'n housing 
		loan contact poutcome 'contacted since by # months'n / param=glm;
	model purchased(event='1')=combinedages 'married or not'n 'binned job'n 
		'binned education'n housing loan contact poutcome 
		'contacted since by # months'n balance duration 'calls this campaing'n / 
		ctable link=logit technique=fisher;
	output out=work.bankstast2 predicted=pred_;
	score out=work.bankscores2;
run;

 

6 REPLIES
ballardw
Super User

You should describe how the sample and over-sample were done.

 

You would assign a weight for each observation which is typically the inverse of the probability of selecting that subject for the sample.
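As a rough illustration of that weighting idea, the sketch below weights events and non-events by the ratio of their population share to their sample share. It assumes the only thing that differed in selection was the event status, that purchased is a numeric 0/1 variable, and that the population event rate is known; pop_rate, sampwt and work.bank_w are hypothetical names and values, not something from the original post.

```sas
/* Hypothetical value: replace with the known population event rate. */
%let pop_rate = 0.02;

/* Event rate actually observed in the (oversampled) data.        */
/* Assumes purchased is numeric and coded 0/1.                     */
proc sql noprint;
    select mean(purchased = 1) format=best32.
        into :samp_rate trimmed
    from work.bank;
quit;

/* Weight = population share / sample share, computed separately  */
/* for events and non-events. sampwt is a hypothetical name.       */
data work.bank_w;
    set work.bank;
    if purchased = 1 then sampwt = &pop_rate / &samp_rate;
    else sampwt = (1 - &pop_rate) / (1 - &samp_rate);
run;

/* Refit with the WEIGHT statement; only two of the original       */
/* predictors are kept to keep the sketch short.                    */
proc logistic data=work.bank_w;
    weight sampwt;
    model purchased(event='1') = balance duration;
run;
```

With weights built this way the weights sum to the number of observations, so the effective sample size is not inflated.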

 

Also, if the sampling methodology is complex, you would use PROC SURVEYLOGISTIC (and the other SURVEY procedures) for the analysis, supplying additional information about the sample such as stratification and cluster variables and the type of sample (simple random, probability proportional to size, sequential, or others).
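A minimal sketch of what such a call might look like is shown here. The design variables stratum, psu and sampwt are hypothetical; they would have to come from the documentation of how the sample was drawn, and the model is cut down to a few of the original predictors.

```sas
/* Sketch only: stratum, psu and sampwt are hypothetical design    */
/* variables that are not present in the data set from the post.    */
proc surveylogistic data=work.bank;
    strata stratum;      /* stratification variable(s)              */
    cluster psu;         /* primary sampling unit / cluster         */
    weight sampwt;       /* inverse of the selection probability    */
    class housing loan / param=glm;
    model purchased(event='1') = housing loan balance duration;
run;
```

If the sample has no strata or clusters, those statements can simply be left out and only the WEIGHT statement kept.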

sxking2
Fluorite | Level 6
Hi,
I do not know how they were done, and my project is a binary analysis. I
watched a YouTube video on what you suggested and did not find it helpful.
I have 45000 observations and about 11% were events.

I saw one video in which it looked like the minority class (I am thinking
the event) data was just duplicated enough times so that it was the same %
of total events as the number of observations. My dataset is small enough
that I could do that manually.

Is that a correct assumption?

ballardw
Super User

Oversampling typically means that the rule for selecting the sample was different for some part of the population. Without knowing those differences in the rules and the effect they had on participation, it is extremely difficult to guess what may be needed to handle your particular "oversample".

sxking2
Fluorite | Level 6
Oh, I thought it was when you had too few events per number of
observations.
ballardw
Super User

@sxking2 wrote:
Oh, I thought it was when you had too few events per number of
observations.

Oversampling is one of the techniques for increasing the number of events you have to work with.

 

Suppose that in the general population of the country one person in 1,000,000 has a characteristic.

But suppose you have information that left-handed, red-headed people under 5 ft tall have an occurrence of 1 in 1000 (or some other more accessible rate). Then you include more left-handed, red-headed, short people in the sample than would occur with a simple random selection from the population, in hopes of getting more "events". The difference in selection probabilities allows you to weight the data so that the result is more generally useful. In some cases this tends to produce a complex sample.

Ksharp
Super User

Check the SCORE statement's PRIOR= and PRIOREVENT= options, and put the real (population) event probability there.
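For example, a cut-down version of the original step might look like the sketch below. The value 0.02 is a hypothetical population event rate, not a figure from this thread, and only a few of the original predictors are kept.

```sas
proc logistic data=work.bank;
    class housing loan / param=glm;
    model purchased(event='1') = housing loan balance duration;
    /* PRIOREVENT= takes the event probability in the population.  */
    /* 0.02 is a hypothetical value; substitute the real rate.      */
    score data=work.bank out=work.bankscores2 priorevent=0.02;
run;
```

This adjusts the posterior probabilities written to the scored data set; the parameter estimates reported for the model are not changed by it.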
