halladje
Fluorite | Level 6

Hi there, 

 

I am trying to generate missingness for a summative scale. This involves: (1) randomly selecting individuals to be missing a summative score and (2) deleting individual items within the scale for those identified as being missing. I am struggling with #2, as all individuals need at least 1 item to be deleted and a pre-specified number to have all items deleted (9.17% missing all items). Individuals can have 1 to 5 missing items (within a 5-item scale).

 

Probability of missing:

item1=0.2782

item2=0.3497

item3=0.3035

item4=0.3207

item5=0.3289

 

All items=0.0917

*remaining probability of each item after accounting for all being missing*

item1=0.1865
item2=0.258
item3=0.2118
item4=0.229
item5=0.2372

 

Essentially, I want to delete all items for 9.17% of the identified sample for missingness - likely based on a Bernoulli distribution as follows...

if js_Sel=. then sel_items=rand('BERNOULLI', 0.0917); else sel_items=0;

 

...and then, conditional on the full scale being missing (i.e. js_Sel=.) and not having all items missing (i.e. sel_items=0), using the remaining probabilities to delete individual items. However, if I do this with separate Bernoulli random variables, about 25% end up with no missing items at all (when every identified observation needs at least one item missing) and an extra 10% end up with all items missing.

 

Is there a way to create an array of Bernoulli random variables, based on the remaining probabilities, where at least 1 column needs to be =1 and it is not possible for all 5 columns to =1? 

 

Thanks in advance!

Jillian 

 

 

 

5 REPLIES
Reeza
Super User
Yes, it's possible, but it would be much easier if you showed some sample data.
Because you have defined probabilities, use RAND() with the TABLE distribution first and then use the Bernoulli distribution to create the 1.
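
For reference, a minimal sketch of how RAND('TABLE', ...) behaves: it returns the index of one category, drawn with the probabilities listed. Treating each category as a whole missingness pattern (rather than a single item) is only one possible reading of the suggestion and is an assumption here; the three patterns and their probabilities are made up for illustration.

```sas
data _null_;
  call streaminit(12345);
  /* hypothetical pattern probabilities; they sum to 1 */
  array p[3] p1-p3 (0.50 0.40 0.10);
  /* hypothetical patterns: a '1' marks an item to be set to missing */
  array pattern[3] $5 _temporary_ ('00001' '00011' '11111');
  do i = 1 to 5;
    k = rand('TABLE', of p[*]);   /* returns 1, 2 or 3 */
    bits = pattern[k];
    put i= k= bits=;
  end;
run;
```

Because one draw selects an entire pattern, several items can be deleted for the same individual even though the table probabilities sum to 1.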
halladje
Fluorite | Level 6

Hi there, 

 

I am unable to post sample data - my apologies for the inconvenience. 

 

For the table option, this would only allow one variable to be selected though, correct? Several observations have multiple items deleted, so the probabilities summed across items are >1. When using the table function, don't the probabilities need to sum to 1 since only one variable is selected? 

 

Thanks for your thoughts

 

 

halladje
Fluorite | Level 6

I am not quite sure how to do that. Any general thoughts on how I could test the table random function on my data (or on other alternatives)? 

 

Thanks and again my apologies, Jillian 

FreelanceReinh
Jade | Level 19

Hi @halladje,

 

If I understand your requirements correctly, you want to modify one existing dataset (by setting a number of variables to missing). So, your probabilities (0.2782, 0.3497, etc.) are actually expected relative frequencies in that dataset (after the modification).

 

The main issue is: Most of the probabilities you've specified are marginal probabilities, but constraints such as "a pre-specified number to have all items deleted (9.17% missing all items)" or "it is not possible for all 5 columns to =1" imply that the Bernoulli random variables you're trying to simulate are statistically dependent. This means you can't simply use RAND('bern',0.2782), RAND('bern',0.3497), etc. (or RAND('bern',0.1865), RAND('bern',0.258), etc. for that matter).
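
To illustrate the point, here is a minimal sketch of what five independent Bernoulli draws with the "remaining" probabilities would imply for the share of individuals with no item missing. Independence is assumed purely for the sake of the calculation, and the variable names are made up.

```sas
data _null_;
  /* "remaining" marginal probabilities for item1-item5 */
  array p[5] _temporary_ (0.1865 0.258 0.2118 0.229 0.2372);
  p_none = 1;
  do j = 1 to 5;
    /* under independence, P(no item missing) is the product of (1 - p) */
    p_none = p_none * (1 - p[j]);
  end;
  put p_none= 6.4;
run;
```

The product works out to roughly 0.28, which is in the same region as the "about 25% with no missing at all" reported above, so that result is what independent draws would be expected to produce.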

 

Maybe there is an additional issue: The relative frequencies would most likely differ from the specified probabilities due to random fluctuations. For example, on average, more than one out of ten selections from 1000 individuals using independent RAND('bern',0.3497) values will contain more than 368 individuals (against an expected 349.7). Given the precision of the specified probabilities, you might not be happy with the results.
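
A quick way to check such a tail probability yourself is the binomial CDF. This is a small sketch for the example above (1000 individuals, p=0.3497, threshold 368); the variable names are hypothetical.

```sas
data _null_;
  n = 1000;
  p = 0.3497;
  expected = n * p;                            /* expected number selected */
  p_above  = 1 - cdf('BINOMIAL', 368, p, n);   /* P(more than 368 selected) */
  put expected= p_above=;
run;
```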

 

Here's an outline of how you could avoid both of these issues:

  1. There are 2**5 - 2 = 30 different combinations of five zeros and ones after excluding "00000" (="no item missing")  and "11111" (="all items missing"). Denote the relative frequencies to be determined for the 30 combinations "00001" (="only item 1 missing"), ..., "11110" (="only item 1 nonmissing") with x1, ..., x30.
  2. Write down the constraints for the xi (besides xi>=0). These are linear equations. Examples: The constraint that 9.17% of the observations are to have all items missing translates to x1+x2+...+x30=0.9083 (=1-0.0917). The constraint that 23.72% of the observations are to have item 5 missing, but not all items missing, translates to x16+x17+...+x30=0.2372 (see first digit of 16, 17, ..., 30 in the binary system).
  3. Solve the resulting system of linear equations (SAS/IML?). There will be many free parameters in the solution. Think of reasonable values for these parameters (or specify more constraints in step 2).
  4. Compute the corresponding absolute frequencies from the solution obtained in step 3: If your dataset contains N individuals, determine n1, ..., n30 by ni=floor(xi*N) or ni=ceil(xi*N) and similarly n31=floor(0.0917*N) or n31=ceil(0.0917*N) so that n1+...+n31=N.
  5. Use PROC SURVEYSELECT with the GROUPS=(n1 ... n31) option to assign the individuals randomly to the 31 groups (numbered 1, ..., 31).
  6. In a DATA step, use the 1st, ..., 5th digit of the respective individual's group number in BINARY5. format (i.e. "00001", ..., "11111") to determine which of the items 5, 4, 3, 2, 1 need to be set to missing and perform this operation in a DO loop (1 to 5).
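
The outline above can be sketched as follows. This is an illustration under stated assumptions, not a finished solution: the dataset HAVE with 10,000 selected individuals is simulated, and the solution x is one hypothetical choice of the free parameters in which only single items and three pairs of items (1+2, 3+4, 2+5) receive any mass, so it does not cover 3 or 4 missing items. Since most of the 31 groups are empty under that choice, only the nonempty ones are passed to GROUPS= and mapped back to their pattern numbers. The first part needs SAS/IML, and GROUPS= needs a reasonably recent SAS/STAT release.

```sas
/* Steps 1-3: build the constraints and check one hypothetical solution */
proc iml;
  idx = 1:30;               /* pattern numbers; item 1 = rightmost binary digit */
  A = j(6, 30, 1);          /* row 6 stays all ones: x1+...+x30 = 0.9083 */
  do k = 1 to 5;
    A[k, ] = mod(floor(idx / 2**(k-1)), 2);   /* 1 if item k is missing */
  end;
  b = {0.1865, 0.258, 0.2118, 0.229, 0.2372, 0.9083};
  x = j(30, 1, 0);
  x[{1, 2, 4, 8, 16}] = {0.1723, 0.1438, 0.1118, 0.1290, 0.1372}; /* one item */
  x[{3, 12, 18}]      = {0.0142, 0.1000, 0.1000};                 /* two items */
  maxdev = max(abs(A*x - b));   /* should be zero up to rounding error */
  print maxdev;
quit;

/* Hypothetical input: 10,000 selected individuals with complete items */
data have;
  call streaminit(2024);
  array item[5] item1-item5;
  do id = 1 to 10000;
    do k = 1 to 5;
      item[k] = ceil(5 * rand('UNIFORM'));
    end;
    output;
  end;
  drop k;
run;

/* Steps 4-5: counts ni = xi*N for the nonempty patterns
   1, 2, 3, 4, 8, 12, 16, 18 and 31 (= all items missing); they sum to N */
proc surveyselect data=have out=grouped seed=12345
     groups=(1723 1438 142 1118 1290 1000 1372 1000 917);
run;

/* Step 6: decode the pattern number and set the flagged items to missing */
data want;
  set grouped;
  array item[5] item1-item5;
  array pat[9] _temporary_ (1 2 3 4 8 12 16 18 31);  /* GroupID -> pattern */
  length bits $5;
  bits = put(pat[GroupID], binary5.);
  do k = 1 to 5;
    /* k-th digit from the left belongs to item 6-k */
    if substr(bits, k, 1) = '1' then call missing(item[6-k]);
  end;
  drop k bits;
run;

proc means data=want n nmiss;
  var item1-item5;
run;
```

If the assignment works as intended, PROC MEANS should report 2782, 3497, 3035, 3207 and 3289 missing values for item1 to item5, i.e. exactly the requested proportions for N=10000, with no individual left complete.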

 
