03-14-2016 10:29 AM
03-14-2016 11:12 AM - edited 03-14-2016 11:13 AM
The two example PROC FREQ output tables you show have VERY different frequencies, to the point that I'm not sure the second is what you meant. Since you don't show any raw input, and the code you actually ran likely differed from what you posted, it is a bit hard to determine the cause.
If you want to specify a sample size, you'd save yourself a lot of work by switching to PROC SURVEYSELECT with a strata variable. Then you could specify either a sample size or a sample rate for each stratum (your Qual_flag variable, it looks like).
proc surveyselect data=have out=want
   sampsize=(1950 1950 1950 1950 1950) /* one value for each stratum, i.e. each level of the strata variable.
      If you have additional strata you are not interested in, subset the data with the
      WHERE= data set option on the input data (have) */
   seed=1234; /* the seed works like the one in RANUNI: it makes the selection repeatable */
   strata Qual_flag; /* strata variable; the name is taken from your description */
run;
One advantage: you'll get the actual probability of selection and a sampling weight variable, if needed. If each stratum contains at least the number of records requested, that is how many you will get in the output. The Have data set needs to be sorted by the strata variable.
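Since sample rates are mentioned above as an alternative to sample sizes, here is a minimal sketch of that variant, followed by a quick check of the result. The data set names (have, want), the variable name Qual_flag and the five rates are assumptions for illustration only; the rates are matched to the strata in their sorted order. The STATS option is included so that the selection probability and sampling weight variables are written to the output data set.

```sas
/* Sketch only: data set names, Qual_flag and the rates are assumed */
proc sort data=have;
   by Qual_flag;                      /* STRATA requires sorted input */
run;

proc surveyselect data=have out=want method=srs
      samprate=(0.15 0.17 0.43 0.49 1) /* one rate per stratum; 1 means 100% */
      seed=1234
      stats;                          /* keep SelectionProb and SamplingWeight */
   strata Qual_flag;
run;

/* How many records were selected in each stratum? */
proc freq data=want;
   tables Qual_flag;
run;

/* Selection probability and weight should be constant within a stratum */
proc means data=want n min max;
   class Qual_flag;
   var SelectionProb SamplingWeight;
run;
```

Reviewing the PROC FREQ counts against the stratum sizes is the same kind of sanity check you applied to your data step approach.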
03-16-2016 03:34 PM
I agree with @ballardw that using PROC SURVEYSELECT is more convenient for selecting random samples.
Still, I would like to encourage you to review the results produced by this procedure (or any other software for that matter) in the same way as you did with your data step approach. (I recently came across a seemingly surprising result of PROC SURVEYSELECT, but haven't investigated the cause yet.)
Sometimes it's a procedure option that you forgot to specify correctly; in rare cases it may even be a bug that distorts the results. (In 2001 I reported a bug in the CDF function of SAS 6.12 TS050 which led to grossly incorrect results for certain arguments.)
As to your question, I'm not aware of a change between SAS versions 9.1 and 9.4 which could explain what you describe. Moreover, the results in your final PROC FREQ output seem plausible to me.
You are surprised that the frequencies (for the first four categories; the fifth frequency necessarily equals 1917) are not closer to the expected value 1950? This would mean that you suspect that one or more of the selection probabilities are different from 1950/13183, 1950/11362, 1950/4506 and 1950/3980, respectively, right?
So, let's perform exact two-sided binomial tests for the four selections (which I think can be regarded as independent), including an adjustment for multiple testing:
data have;
input c $ t sel;
cards;
C71_80 13183 1937
C81_90 11362 1885
C91_95 4506 1968
C96_10 3980 1905
;

data test;
set have;
s=1; n=sel; output;
s=0; n=t-sel; output;
p0=1950/t;
call execute(cats('proc freq data=test; where c="', c, '"; weight n;',
                  'exact binomial; tables s / binomial(level="1" p=', p0,
                  '); output out=stats',_n_, '(keep=xp2_bin) binomial; run;'));
run;

data stats;
set stats1-stats4 nobs=k;
p_adj=1-(1-xp2_bin)**k; /* Šidák adjusted p-value */
proc print;
run;
Obs    XP2_BIN     p_adj
  1    0.76109    0.99674
  2    0.10757    0.36570
  3    0.59843    0.97400
  4    0.15820    0.49784
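The adjusted p-values (and even the unadjusted ones) are all >0.05, so there is no significant evidence against the null hypotheses: the observed frequencies are compatible with selection probabilities of 1950 divided by the respective category total.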
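As a rough cross-check of the exact tests, one could also look at how far each observed count lies from 1950 in units of the binomial standard deviation. The following is only a sketch under the same assumption as above (each count is binomial with n = t and p = 1950/t); the data set name approx and the variables sd, z and p_approx are made up for this illustration, and the normal approximation will not reproduce the exact p-values precisely.

```sas
/* Sketch: normal approximation to the binomial tests (names are assumed) */
data approx;
   input c $ t sel;
   p0 = 1950 / t;                         /* hypothesized selection probability */
   sd = sqrt(t * p0 * (1 - p0));          /* binomial standard deviation of the count */
   z  = (sel - 1950) / sd;                /* standardized deviation from 1950 */
   p_approx = 2 * (1 - probnorm(abs(z))); /* approximate two-sided p-value */
   cards;
C71_80 13183 1937
C81_90 11362 1885
C91_95 4506 1968
C96_10 3980 1905
;

proc print data=approx;
run;
```

Deviations of a few dozen records from 1950 are well within what such a selection mechanism produces by chance, which is what the z column makes visible.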