02-07-2013 02:19 AM
I am trying to perform oversampling on my dataset (~200,000 observations), which contains a flag variable with value 1 or 0. I want 100 samples, each containing all the observations with flag=1 (~100 of them in total) plus ~750 randomly selected observations with flag=0. However, I am having difficulty getting what I want: I end up with ~10,000 observations in total for each sample, and sometimes my code takes forever to run. Can someone advise me on what is wrong?
My code is as follows:
do sample=1 to 100;
do i = 1 to _N_;
if flag= 1 or (flag=0 and ranuni(sample+7320) < 0.003) then output;
Thanks for any advice.
02-07-2013 06:27 AM
Why don't you try PROC SURVEYSELECT? It picks random samples.
proc surveyselect data=data_set
method=srs n=100 out=data_random;
run;
02-07-2013 06:41 AM
My output is now completely random. I want it to include all observations with flag=1 and then randomly select ~800 observations with flag=0. Is it possible to do that with PROC SURVEYSELECT?
Thanks for the advice
02-07-2013 08:11 AM
Use the STRATA statement and set the sampling rate for the "rare event category" to 100%. See http://www.nesug.org/proceedings/nesug07/sa/sa02.pdf, beginning at the bottom of page 2.
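A minimal sketch of the stratified approach, offered as one possible way to do it rather than the method from the linked paper. The data set names mydata, mydata_sorted and samples are hypothetical, and the seed is arbitrary. Instead of a 100% rate, this version gives the flag=1 stratum a sample size larger than the stratum itself and relies on SELECTALL to take every unit in it; REPS=100 requests the 100 samples.

```sas
/* Hypothetical data set names: mydata (input), samples (output). */
/* The STRATA statement expects the input sorted by the strata variable. */
proc sort data=mydata out=mydata_sorted;
  by flag;
run;

proc surveyselect data=mydata_sorted out=samples
    method=srs             /* simple random sampling without replacement */
    sampsize=(750 200000)  /* stratum order: flag=0 first, then flag=1 */
    selectall              /* take all units when sampsize exceeds the stratum size */
    reps=100               /* 100 independent samples */
    seed=7320;             /* arbitrary seed, for reproducibility */
  strata flag;
run;
```

The output data set should contain a Replicate variable (1 to 100) identifying each sample, so a later step can process the samples with BY Replicate.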