Sampling using RANUNI()


03-14-2016 10:29 AM

Starting with

The FREQ Procedure

| qual_flag | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| C71_80 | 13183 | 37.72 | 13183 | 37.72 |
| C81_90 | 11362 | 32.51 | 24545 | 70.23 |
| C91_95 | 4506 | 12.89 | 29051 | 83.13 |
| C96_10 | 3980 | 11.39 | 33031 | 94.51 |
| MIGRATE | 1917 | 5.49 | 34948 | 100.00 |

In SAS 9.1

```
data sampledown;
  length sampfactor xyz 8;
  set finalsample;
  xyz = ranuni(35);                   /* uniform(0,1) draw with a fixed seed */
  select (qual_flag);                 /* stratum sizes taken from the PROC FREQ output */
    when ('C71_80')  factor = 13183;
    when ('C81_90')  factor = 11362;
    when ('C91_95')  factor = 4506;
    when ('C96_10')  factor = 3980;
    when ('MIGRATE') factor = 1917;
    otherwise put 'ERROROROROROR';
  end;
  sampfactor = 1950 / factor;         /* selection probability for this stratum */
  if xyz le sampfactor then output;   /* keep the record with that probability */
run;

proc freq data=sampledown;
  table qual_flag;
run;
```

would always give output with qual_flag frequencies very close to 1,950 at every level. Now, in SAS 9.4, I'm getting a distribution like this:

The FREQ Procedure

| qual_flag | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| C71_80 | 1937 | 20.15 | 1937 | 20.15 |
| C81_90 | 1885 | 19.61 | 3822 | 39.76 |
| C91_95 | 1968 | 20.47 | 5790 | 60.24 |
| C96_10 | 1905 | 19.82 | 7695 | 80.06 |
| MIGRATE | 1917 | 19.94 | 9612 | 100.00 |

Any idea why?


Posted in reply to NedKaufman

03-14-2016 11:12 AM - edited 03-14-2016 11:13 AM

The two example PROC FREQ output tables you show have VERY different frequencies, to the point that I'm not sure the second is what you meant. Since you don't show any raw input, and it looks like the code you actually ran differed from what you posted, it is a bit hard to determine what happened.

If you want to specify a sample size, you'd save yourself a lot of work by switching to PROC SURVEYSELECT with a strata variable. Then you could specify either a sample size or a sampling rate for each stratum (your qual_flag variable, it looks like).

Something like:

```
proc surveyselect data=have out=want
    sampsize=(1950 1950 1950 1950 1950) /* one value for each stratum, i.e. each level of the
                                           strata variable; if you have additional strata you
                                           are not interested in, subset the data with the
                                           WHERE= data set option on the input data (have) */
    selectall /* take every record from a stratum with fewer than 1950 records
                 (MIGRATE has only 1917); without it the step stops with an error */
    seed=1234 /* plays the same role as the seed in ranuni: it repeats the sequence */
    ;
  strata qual_flag;
run;
```

One advantage: the output data set includes the actual probability of selection and a sampling weight variable, in case they are needed. If each stratum has at least as many records as requested, that is exactly how many you will get in the output. The Have data set needs to be sorted by the strata variable.
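As a rough end-to-end sketch of the steps described above (sort by the strata variable, draw the stratified sample, then check the counts), something along these lines might work. The data set name `have_sorted` is only an illustrative assumption. A single `sampsize=` value is applied to every stratum, `selectall` is there to cope with a stratum that has fewer records than requested, and `stats` asks for the selection probability and sampling weight to be written to the output.

```sas
/* Sketch only: have_sorted is a hypothetical name for the sorted copy. */
proc sort data=have out=have_sorted;
  by qual_flag;                       /* STRATA requires sorted input */
run;

proc surveyselect data=have_sorted out=want
    method=srs                        /* simple random sampling without replacement */
    sampsize=1950                     /* same sample size in every stratum */
    selectall                         /* take all records if a stratum has fewer than 1950 */
    stats                             /* add SelectionProb and SamplingWeight to OUT= */
    seed=1234;
  strata qual_flag;
run;

/* Check the resulting counts per stratum. */
proc freq data=want;
  tables qual_flag;
run;
```

Unlike the data step approach, this selects a fixed number of records per stratum, so the frequencies do not vary from run to run.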


Posted in reply to ballardw

03-14-2016 11:25 AM

That's great. Thank you very much ballardw!


Posted in reply to NedKaufman

03-16-2016 03:34 PM

Hi @NedKaufman,

I agree with @ballardw that using PROC SURVEYSELECT is more convenient for selecting random samples.

Still, I would like to encourage you to review the results produced by this procedure (or any other software for that matter) in the same way as you did with your data step approach. (I recently came across a seemingly surprising result of PROC SURVEYSELECT, but haven't investigated the cause yet.)

Sometimes it's a procedure option that you forgot to specify correctly; in rare cases it may even be a bug that distorts the results. (In 2001 I reported a bug in the CDF function of SAS 6.12 TS050 which led to grossly incorrect results for certain arguments.)

As to your question, I'm not aware of a change between SAS versions 9.1 and 9.4 which could explain what you describe. Moreover, the results in your final PROC FREQ output seem plausible to me.

You are surprised that the frequencies (for the first four categories; the fifth frequency necessarily equals 1917) are not closer to the expected value 1950? This would mean that you suspect that one or more of the selection probabilities are different from 1950/13183, 1950/11362, 1950/4506 and 1950/3980, respectively, right?
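To get a feel for how close to 1950 the counts can be expected to land, here is a back-of-the-envelope calculation. It assumes that each of the $N_i$ records in category $i$ is kept independently with probability $p_i = 1950/N_i$, which is how I read the data step, so that the selected count $X_i$ is binomial:

$$
\mathrm{E}[X_i] = N_i\,p_i = 1950, \qquad
\mathrm{SD}(X_i) = \sqrt{N_i\,p_i\,(1-p_i)} = \sqrt{1950\left(1-\frac{1950}{N_i}\right)}
$$

For C81_90, for example, $N_i = 11362$ gives $\mathrm{SD}(X_i) \approx \sqrt{1950 \times 0.828} \approx 40$, so the observed 1885 lies roughly 1.6 standard deviations below the expected value.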

So, let's perform exact two-sided binomial tests for the four selections (which I think can be regarded as independent), including an adjustment for multiple testing:

```
data have;
input c $ t sel;
cards;
C71_80 13183 1937
C81_90 11362 1885
C91_95 4506 1968
C96_10 3980 1905
;
data test;
set have;
s=1; n=sel;
output;
s=0; n=t-sel;
output;
p0=1950/t;
call execute(cats('proc freq data=test; where c="', c, '"; weight n;',
'exact binomial; tables s / binomial(level="1" p=', p0,
'); output out=stats',_n_, '(keep=xp2_bin) binomial; run;'));
run;
data stats;
set stats1-stats4 nobs=k;
p_adj=1-(1-xp2_bin)**k; /* Šidák adjusted p-value */
proc print;
run;
```

Result:

| Obs | XP2_BIN | p_adj |
|---|---|---|
| 1 | 0.76109 | 0.99674 |
| 2 | 0.10757 | 0.36570 |
| 3 | 0.59843 | 0.97400 |
| 4 | 0.15820 | 0.49784 |

The adjusted p-values (and even the unadjusted ones) are all >0.05, so there is no evidence against the null hypotheses, i.e. no sign that the selection probabilities differ from the intended ones.
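If you would like to see the run-to-run spread directly, a small simulation is one option. The sketch below is not part of the analysis above: it assumes the selected count in each category is binomial with probability 1950/t, and the names `sim`, `rep` and `n_sel`, the seed and the 1000 replicates are arbitrary choices for illustration.

```sas
/* Sketch: simulate the number of selected records per category. */
data sim;
  if _n_ = 1 then call streaminit(35);   /* seed for the RAND function */
  input c $ t;
  p = 1950 / t;                          /* intended selection probability */
  do rep = 1 to 1000;                    /* 1000 simulated sampling runs */
    n_sel = rand('binomial', p, t);      /* number of records kept out of t */
    output;
  end;
  cards;
C71_80 13183
C81_90 11362
C91_95 4506
C96_10 3980
;
run;

/* Summarize the simulated counts for each category. */
proc means data=sim mean std min max maxdec=1;
  class c;
  var n_sel;
run;
```

Comparing the observed frequencies with the simulated minimum and maximum gives a quick impression of whether they are unusual.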