Sampling using RANUNI()

New Contributor
Posts: 2

 

Starting with

The FREQ Procedure

 

                                     Cumulative    Cumulative
qual_flag    Frequency    Percent     Frequency       Percent
C71_80           13183      37.72         13183         37.72
C81_90           11362      32.51         24545         70.23
C91_95            4506      12.89         29051         83.13
C96_10            3980      11.39         33031         94.51
MIGRATE           1917       5.49         34948        100.00

 
 
In SAS 9.1
 
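data sampledown;
   length sampfactor xyz 8;
   set finalsample;

   /* uniform(0,1) random number; the fixed seed 35 makes the stream repeatable */
   xyz = ranuni(35);

   /* factor = number of records in each qual_flag level of finalsample */
   select(qual_flag);
      when('C71_80')  factor = 13183;
      when('C81_90')  factor = 11362;
      when('C91_95')  factor = 4506;
      when('C96_10')  factor = 3980;
      when('MIGRATE') factor = 1917;
      otherwise put 'ERROR: unexpected value ' qual_flag=;
   end;

   /* keep each record with probability 1950/factor, so the expected count
      per level is 1950 (all 1917 MIGRATE records are kept) */
   sampfactor = 1950 / factor;
   if xyz le sampfactor then output;
run;

proc freq data=sampledown;
   table qual_flag;
run;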
 
This would always give output with a frequency very close to 1,950 in every level of qual_flag. Now, in SAS 9.4, I'm getting a distribution like this:
 
The FREQ Procedure

 

                                     Cumulative    Cumulative
qual_flag    Frequency    Percent     Frequency       Percent
C71_80            1937      20.15          1937         20.15
C81_90            1885      19.61          3822         39.76
C91_95            1968      20.47          5790         60.24
C96_10            1905      19.82          7695         80.06
MIGRATE           1917      19.94          9612        100.00
 
Any idea why?

  

 

 

Super User
Posts: 10,497

Re: Sampling using RANUNI()


 

 

The two example PROC FREQ output tables you show have VERY different frequencies, to the point that I'm not sure the second is what you meant. Since you don't show any raw input, and it looks like the code you actually used differed from what you posted, it is a bit hard to determine what happened.

 

If you want to specify a sample size, you'd save yourself a lot of work by switching to PROC SURVEYSELECT with a strata variable. Then you could specify either a sample size or a sampling rate for each stratum (your qual_flag variable, it looks like).

Something like:

 

proc surveyselect data=have out=want
        /* One value for each stratum, i.e. each level of the strata variable.
           If you have additional strata you are not interested in, subset the
           input data (have) with the WHERE= data set option. */
        sampsize=(1950 1950 1950 1950 1950)
        seed=1234   /* the seed works like the one in RANUNI: it makes the sequence repeatable */
;
   strata qual_flag;
run;

 

One advantage: you'll get the actual probability of selection and a sample weight variable, if needed. If each stratum contains at least the number of records requested, that is how many you will get in the output. The HAVE data set needs to be sorted by the strata variable.
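A minimal sketch of the complete workflow, under a few assumptions: the data set names (have, have_sorted, want) are placeholders, a single SAMPSIZE= value is applied to every stratum, and the SELECTALL option is available in your release. SELECTALL matters here because the MIGRATE level has only 1,917 records, fewer than the 1,950 requested.

```sas
/* Hypothetical data set names (have, have_sorted, want), for illustration only. */
proc sort data=have out=have_sorted;
   by qual_flag;            /* the STRATA statement expects input sorted by the strata variable */
run;

proc surveyselect data=have_sorted out=want
        method=srs          /* simple random sampling without replacement */
        sampsize=1950       /* same target size in every stratum */
        selectall           /* take all units when a stratum has fewer than 1950 */
        seed=1234;          /* fixed seed so the selection can be repeated */
   strata qual_flag;
run;

/* Check the resulting distribution, as in the original DATA step approach. */
proc freq data=want;
   tables qual_flag;
run;
```

Unlike the RANUNI() filter, which keeps each record independently and so yields counts that vary around 1,950, this design fixes the number of selected records per stratum.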

New Contributor
Posts: 2

Re: Sampling using RANUNI()

That's great.  Thank you very much ballardw!

Trusted Advisor
Posts: 1,115

Re: Sampling using RANUNI()

Hi @NedKaufman,

 

I agree with @ballardw that using PROC SURVEYSELECT is more convenient for selecting random samples.

 

Still, I would like to encourage you to review the results produced by this procedure (or any other software for that matter) in the same way as you did with your data step approach. (I recently came across a seemingly surprising result of PROC SURVEYSELECT, but haven't investigated the cause yet.)

 

Sometimes it's a procedure option that you forgot to specify correctly; in rare cases it may even be a bug that distorts the results. (In 2001 I reported a bug in the CDF function of SAS 6.12 TS050 which led to grossly incorrect results for certain arguments.)

 

As to your question, I'm not aware of a change between SAS versions 9.1 and 9.4 which could explain what you describe. Moreover, the results in your final PROC FREQ output seem plausible to me.

 

You are surprised that the frequencies (for the first four categories; the fifth frequency necessarily equals 1917) are not closer to the expected value 1950? This would mean that you suspect that one or more of the selection probabilities are different from 1950/13183, 1950/11362, 1950/4506 and 1950/3980, respectively, right?
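As a rough plausibility check before any formal test, here is a sketch that assumes each record is kept independently with probability 1950/N, which is what the posted DATA step does. Under that assumption the selected count X in a level with N records is binomial:

```latex
X \sim \mathrm{Bin}(N, p), \qquad p = \frac{1950}{N}, \qquad
\mathrm{E}[X] = Np = 1950, \qquad
\mathrm{SD}(X) = \sqrt{Np(1-p)} = \sqrt{1950\left(1 - \frac{1950}{N}\right)}
```

For C81_90 (N = 11362) this gives a standard deviation of about 40.2, so the observed 1885 lies roughly 1.6 standard deviations below 1950. The other three levels deviate by less than 1.5 standard deviations, so counts like these are not unusual for this kind of sampling.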

 

So, let's perform exact two-sided binomial tests for the four selections (which I think can be regarded as independent), including an adjustment for multiple testing:

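/* Stratum size (t) and number of selected records (sel), taken from the PROC FREQ outputs */
data have;
   input c $ t sel;
   cards;
C71_80 13183 1937
C81_90 11362 1885
C91_95  4506 1968
C96_10  3980 1905
;

/* Two rows per stratum (s=1: selected, s=0: not selected) with counts n,
   plus one generated PROC FREQ step per stratum testing H0: p = 1950/t */
data test;
   set have;
   s=1; n=sel;
   output;
   s=0; n=t-sel;
   output;
   p0=1950/t;
   call execute(cats('proc freq data=test; where c="', c, '"; weight n;',
                     'exact binomial; tables s / binomial(level="1" p=', p0,
                     '); output out=stats',_n_, '(keep=xp2_bin) binomial; run;'));
run;

/* Combine the four exact two-sided p-values and adjust for multiple testing */
data stats;
   set stats1-stats4 nobs=k;
   p_adj=1-(1-xp2_bin)**k; /* Šidák adjusted p-value */
run;

proc print data=stats;
run;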

Result:

Obs    XP2_BIN     p_adj

 1     0.76109    0.99674
 2     0.10757    0.36570
 3     0.59843    0.97400
 4     0.15820    0.49784

The adjusted p-values (and even the unadjusted ones) are all >0.05, so there is no evidence against the null hypotheses: the observed frequencies are consistent with the intended selection probabilities.
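For reference, the Šidák adjustment used in the DATA step above can be written out as follows, with k = 4 tests and the second row of the output as a worked example:

```latex
p_{\mathrm{adj}} = 1 - (1 - p)^{k}, \qquad
\text{e.g. } 1 - (1 - 0.10757)^{4} = 1 - 0.89243^{4} \approx 0.36570
```

The adjustment treats the four tests as independent, which matches the assumption stated earlier that the four selections can be regarded as independent.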
