Hi @halladje,
If I understand your requirements correctly, you want to modify one existing dataset (by setting a number of variables to missing). So, your probabilities (0.2782, 0.3497, etc.) are actually expected relative frequencies in that dataset (after the modification).
The main issue is: Most of the probabilities you've specified are marginal probabilities, but constraints such as "a pre-specified number to have all items deleted (9.17% missing all items)" or "it is not possible for all 5 columns to =1" imply that the Bernoulli random variables you're trying to simulate are statistically dependent. This means you can't simply use independent calls such as RAND('bern',0.2782), RAND('bern',0.3497), etc. (or RAND('bern',0.1865), RAND('bern',0.258), etc. for that matter).
Maybe there is an additional issue: The relative frequencies would most likely differ from the specified probabilities due to random fluctuations. For example, if you select from 1000 individuals using independent RAND('bern',0.3497) values, more than one out of ten such selections will, on average, contain more than 368 individuals, although only about 350 are expected. Given the precision of the specified probabilities, you might not be happy with the results.
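If you want to check the "one out of ten" figure yourself, a minimal sketch (assuming the number of selected individuals follows a binomial distribution with n=1000 and p=0.3497) would be:

```sas
data _null_;
   /* P(X > 368) for X ~ Binomial(n=1000, p=0.3497),      */
   /* i.e. 1 minus the CDF evaluated at 368               */
   p_exceed = 1 - cdf('BINOMIAL', 368, 0.3497, 1000);
   put p_exceed= 6.4;
run;
```

The value written to the log should be slightly above 0.1 if the statement above holds.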
Here's an outline of how you could avoid both of these issues:
1. There are 2**5 - 2 = 30 different combinations of five zeros and ones after excluding "00000" (="no item missing") and "11111" (="all items missing"). Denote the relative frequencies to be determined for the 30 combinations "00001" (="only item 1 missing"), ..., "11110" (="only item 1 nonmissing") with x1, ..., x30.
2. Write down the constraints for the xi (besides xi>=0). These are linear equations. Examples: The constraint that 9.17% of the observations are to have all items missing translates to x1+x2+...+x30=0.9083 (=1-0.0917). The constraint that 23.72% of the observations are to have item 5 missing, but not all items missing, translates to x16+x17+...+x30=0.2372 (see first digit of 16, 17, ..., 30 in the binary system).
3. Solve the resulting system of linear equations (SAS/IML?). There will be many free parameters in the solution. Think of reasonable values for these parameters (or specify more constraints in step 2).
4. Compute the corresponding absolute frequencies from the solution obtained in step 3: If your dataset contains N individuals, determine n1, ..., n30 by ni=floor(xi*N) or ni=ceil(xi*N) and similarly n31=floor(0.0917*N) or n31=ceil(0.0917*N) so that n1+...+n31=N.
5. Use PROC SURVEYSELECT with the GROUPS=(n1 ... n31) option to assign the individuals randomly to the 31 groups (numbered 1, ..., 31).
6. In a DATA step, use the 1st, ..., 5th digit of the respective individual's group number in BINARY5. format (i.e. "00001", ..., "11111") to determine which of the items 5, 4, 3, 2, 1 need to be set to missing and perform this operation in a DO loop (1 to 5).
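The last two steps (PROC SURVEYSELECT and the DATA step) could look roughly like the sketch below. Please treat it as an illustration only: the dataset HAVE, the variable names id and item1-item5, the seeds and the group sizes are all made up. In particular, the 31 group sizes are placeholders that merely sum to N=1000 and do not solve your constraint system, so you would replace them with your own n1, ..., n31.

```sas
/* Hypothetical input: 1000 individuals with five items */
data have;
   call streaminit(2024);
   array item[5] item1-item5;
   do id = 1 to 1000;
      do j = 1 to 5;
         item[j] = ceil(5*rand('uniform'));
      end;
      output;
   end;
   drop j;
run;

/* Placeholder group sizes n1, ..., n31 (NOT derived from the constraints): */
/* 8 groups of 31, 22 groups of 30 and n31=92 for "11111"; total = 1000     */
data _null_;
   length sizes $200;
   do g = 1 to 31;
      if g <= 8 then n = 31;
      else if g <= 30 then n = 30;
      else n = 92;
      sizes = catx(' ', sizes, n);
   end;
   call symputx('sizes', sizes);
run;

/* Randomly assign the individuals to groups 1, ..., 31 (variable GroupID) */
proc surveyselect data=have groups=(&sizes) seed=12345 out=grouped;
run;

/* Set items to missing according to the binary digits of the group number */
data want;
   set grouped;
   array item[5] item1-item5;
   pattern = put(GroupID, binary5.);
   do k = 1 to 5;
      /* the k-th digit from the left refers to item 6-k */
      if substr(pattern, k, 1) = '1' then call missing(item[6-k]);
   end;
   drop k;
run;

/* Check: each pattern should occur exactly as often as specified */
proc freq data=want;
   tables pattern;
run;
```

Because the group sizes are fixed in advance, the resulting relative frequencies are exact (up to rounding in the step that computes n1, ..., n31) rather than subject to random fluctuation; only the assignment of individuals to patterns is random.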