# Sampling using RANUNI()

Starting with

The FREQ Procedure

| qual_flag | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| C71_80 | 13183 | 37.72 | 13183 | 37.72 |
| C81_90 | 11362 | 32.51 | 24545 | 70.23 |
| C91_95 | 4506 | 12.89 | 29051 | 83.13 |
| C96_10 | 3980 | 11.39 | 33031 | 94.51 |
| MIGRATE | 1917 | 5.49 | 34948 | 100 |

In SAS 9.1

data sampledown;
length sampfactor xyz 8;
set finalsample;

xyz = ranuni(35);

select(qual_flag);
when('C71_80') factor = 13183;
when('C81_90') factor = 11362;
when('C91_95') factor = 4506;
when('C96_10') factor = 3980 ;
when('MIGRATE') factor = 1917;
otherwise put 'ERROROROROROR';
end;

sampfactor = 1950 / factor;
if xyz le sampfactor then output;
run;

proc freq data=sampledown;
table qual_flag;
run;

would always give output with the frequency of qual_flag very close to 1,950 in every level. Now, in SAS 9.4, I'm getting a distribution like this:

The FREQ Procedure

| qual_flag | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| C71_80 | 1937 | 20.15 | 1937 | 20.15 |
| C81_90 | 1885 | 19.61 | 3822 | 39.76 |
| C91_95 | 1968 | 20.47 | 5790 | 60.24 |
| C96_10 | 1905 | 19.82 | 7695 | 80.06 |
| MIGRATE | 1917 | 19.94 | 9612 | 100 |

Any idea why?

## Re: Sampling using RANUNI()

The two example PROC FREQ output tables you show have VERY different frequencies, to the point that I'm not sure the second is what you meant. Since you don't show any raw input, and the code you actually ran likely differed from what you posted, it is a bit hard to determine the cause.

If you want to specify a sample size, you'd save yourself a lot of work by switching to PROC SURVEYSELECT with a STRATA variable. Then you could specify either a sample size or a sampling rate for each stratum (your qual_flag variable, it looks like).

Something like:

proc surveyselect data=have out=want
    sampsize=(1950 1950 1950 1950 1950)  /* one value for each stratum, i.e. each level of the strata variable;
                                            if you have additional strata you are not interested in, subset the
                                            input data (have) with the WHERE= data set option */
    selectall   /* take every record from a stratum that is smaller than its sampsize (MIGRATE has only 1917) */
    seed=1234   /* the seed works like the one in RANUNI: it makes the selection repeatable */
;
strata qual_flag;
run;

One advantage: the output includes the actual probability of selection and a sampling weight variable, in case you need them. If each stratum has at least the requested number of records, that is how many you will get in the output. The HAVE data set needs to be sorted by the strata variable.

## Re: Sampling using RANUNI()

That's great.  Thank you very much ballardw!

## Re: Sampling using RANUNI()

Hi @NedKaufman,

I agree with @ballardw that using PROC SURVEYSELECT is more convenient for selecting random samples.

Still, I would like to encourage you to review the results produced by this procedure (or any other software for that matter) in the same way as you did with your data step approach. (I recently came across a seemingly surprising result of PROC SURVEYSELECT, but haven't investigated the cause yet.)

Sometimes it's a procedure option that you forgot to specify correctly; in rare cases it may even be a bug that distorts the results. (In 2001 I reported a bug in the CDF function of SAS 6.12 TS050 which led to grossly incorrect results for certain arguments.)

As to your question, I'm not aware of a change between SAS versions 9.1 and 9.4 which could explain what you describe. Moreover, the results in your final PROC FREQ output seem plausible to me.

You are surprised that the frequencies (for the first four categories; the fifth frequency necessarily equals 1917) are not closer to the expected value 1950? This would mean that you suspect that one or more of the selection probabilities are different from 1950/13183, 1950/11362, 1950/4506 and 1950/3980, respectively, right?

So, let's perform exact two-sided binomial tests for the four selections (which I think can be regarded as independent), including an adjustment for multiple testing:

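```
/* Stratum sizes (t) and selected counts (sel) from the two PROC FREQ tables */
data have;
input c $ t sel;
cards;
C71_80 13183 1937
C81_90 11362 1885
C91_95  4506 1968
C96_10  3980 1905
;

/* Two weighted rows per stratum (selected / not selected), plus one
   generated PROC FREQ step per stratum testing H0: p = 1950/t */
data test;
set have;
s=1; n=sel;
output;
s=0; n=t-sel;
output;
p0=1950/t;
call execute(cats('proc freq data=test; where c="', c, '"; weight n;',
'exact binomial; tables s / binomial(level="1" p=', p0,
'); output out=stats',_n_, '(keep=xp2_bin) binomial; run;'));
run;

/* Combine the exact two-sided p-values and adjust them for the k=4 tests;
   p_adj = 1-(1-p)**k reproduces the p_adj column in the result */
data stats;
set stats1-stats4 nobs=k;
p_adj=1-(1-xp2_bin)**k;
proc print;
run;
```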

Result:

```
Obs    XP2_BIN     p_adj

1     0.76109    0.99674
2     0.10757    0.36570
3     0.59843    0.97400
4     0.15820    0.49784
```
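
The `p_adj` values agree with the Šidák adjustment for k = 4 independent tests (this is inferred from the numbers shown), as a quick check of the second row illustrates:

$$
p_{\text{adj}} = 1 - (1 - p)^{k}, \qquad 1 - (1 - 0.10757)^{4} \approx 0.36570
$$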

The adjusted p-values (and even the unadjusted ones) are all >0.05, so there is no evidence against the null hypotheses, i.e. no evidence that the selection probabilities differ from the intended ones.
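
As a rough cross-check of the exact tests (a sketch using the normal approximation, which is an assumption added here and not part of the analysis above): if each of the $t$ records in a level is selected independently with probability $p_0 = 1950/t$, the selected count is binomial with mean 1950 and standard deviation

$$
\sigma = \sqrt{t\,p_0\,(1-p_0)} = \sqrt{1950\left(1 - \frac{1950}{t}\right)}
$$

| level | t | selected | σ (approx.) | z = (selected - 1950)/σ (approx.) |
|---|---|---|---|---|
| C71_80 | 13183 | 1937 | 40.8 | -0.32 |
| C81_90 | 11362 | 1885 | 40.2 | -1.62 |
| C91_95 | 4506 | 1968 | 33.3 | 0.54 |
| C96_10 | 3980 | 1905 | 31.5 | -1.43 |

All four |z| values are below 1.96, so deviations of a few dozen records from 1950 are what this kind of random selection is expected to produce.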
