02-09-2015 04:54 PM
METHOD=BERNOULLI requests Bernoulli sampling, which consists of N independent selection trials, each with constant inclusion probability π, where N is the total number of sampling units in the stratum or data set. The sample size is not fixed but is a random variable. For more information, see the section Bernoulli Sampling.
When you specify this method, you must provide the sampling rate (inclusion probability π) in the SAMPRATE= option. For stratified sampling (which you request with the STRATA statement), you can specify the same sampling rate for each stratum in the SAMPRATE=value option. Or you can specify different sampling rates for different strata in the SAMPRATE=(values) or SAMPRATE=SAS-data-set option.
Because Bernoulli sampling is based on a specified inclusion probability instead of a fixed sample size, METHOD=BERNOULLI does not use the SAMPSIZE= option. Also, the ALLOC= option in the STRATA statement (which allocates the total sample size among strata) is not available with METHOD=BERNOULLI.
02-09-2015 06:39 PM
Run a data set through SURVEYSELECT a few times with METHOD=BERNOULLI and without specifying a seed. In the output (especially if you request STATS on the PROC statement), you will see an expected sample size that reflects the specified sampling rate and an actual sample size that is usually close to the expected value but probably not the same from one run to the next. Also note the presence of an adjusted sampling weight. The sizes differ because each of the N units is a separate trial that succeeds with probability P (the SAMPRATE).
Then do the same with METHOD=SRS. You'll see that the generated sample size doesn't change.
The difference you may be thinking of with "number of successes is the random variable" is tied to the binomial distribution, not the sampling method.
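The comparison described above can also be sketched outside SAS. The following is a minimal Python simulation, not SURVEYSELECT itself: the function names, the 428-unit population, the 0.05 rate, and the seed are illustrative assumptions, and Python's `random` module stands in for whatever generator SAS uses internally.

```python
import random


def bernoulli_sample(n_units, rate, rng):
    """Select each unit independently with probability `rate`.

    The number of selected units varies from call to call.
    """
    return [i for i in range(n_units) if rng.random() < rate]


def srs_sample(n_units, size, rng):
    """Select exactly `size` distinct units, every subset equally likely."""
    return sorted(rng.sample(range(n_units), size))


def sample_sizes(n_runs=5, n_units=428, rate=0.05, seed=2015):
    """Return the sample sizes from repeated Bernoulli and SRS draws.

    All default values are assumptions chosen for illustration.
    """
    rng = random.Random(seed)
    fixed_size = round(n_units * rate)  # fixed SRS size near the expected size
    bern = [len(bernoulli_sample(n_units, rate, rng)) for _ in range(n_runs)]
    srs = [len(srs_sample(n_units, fixed_size, rng)) for _ in range(n_runs)]
    return bern, srs


if __name__ == "__main__":
    bern, srs = sample_sizes()
    print("Bernoulli sizes:", bern)  # varies from run to run
    print("SRS sizes:      ", srs)   # always the same fixed size
```

The Bernoulli sizes scatter around the expected value N × rate, while every SRS draw returns the same fixed count, which is the behavior the two SURVEYSELECT methods are described as showing.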
02-09-2015 08:19 PM
"Because Bernoulli sampling is based on a specified inclusion probability instead of a fixed sample size, METHOD=BERNOULLI does not use the SAMPSIZE= option." This sentence is what confused me.
I did run SURVEYSELECT on the Cars data set twice and drew samples of 21 and 29 records. The two SAMPLE files had 428 records each (as did the input file), but the "Selected" columns of the two had "1"s in different rows and different numbers of "1"s. So a twenty-sided die is thrown 428 times for each run. In contrast, SRS draws exactly SAMPSIZE records, so the number selected is the same in every run.
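As a rough check on those two counts, the arithmetic can be worked out under the assumption that the sampling rate was 0.05 (the rate implied by the twenty-sided die; the thread does not state the SAMPRATE= value directly) and that N = 428. Here X denotes the number of selected records:

```latex
X \sim \mathrm{Binomial}(N, \pi), \qquad N = 428, \quad \pi = 0.05

E[X] = N\pi = 428 \times 0.05 = 21.4

\mathrm{SD}(X) = \sqrt{N\pi(1-\pi)} = \sqrt{428 \times 0.05 \times 0.95} = \sqrt{20.33} \approx 4.51
```

Under those assumptions, a sample of 21 sits almost exactly at the expected size, and 29 is about 1.7 standard deviations above it, so both results are unremarkable for Bernoulli sampling.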
I'm comfortable now, thanks.