BookmarkSubscribeRSS Feed
halladje
Fluorite | Level 6

Hello, 

 

I am trying to generate missingness in a dataset that resembles the pattern of missing in the complete sample. 

 

I have determined the number of observations that have 2 missing variables and the probability of each variable to be missing. I currently have created code to select 1 variable to be missing randomly, using probabilities of missingness using the following code:

 

data complete_miss3; set complete_miss2;
pmale=0.103; pagyrs=0.061; ped=0.620;
pint=0.342; pext=0.324; pov=0.218; psel=0.230;
nmiss_flag1a=rand("Table", pmale, pagyrs,
ped, pint, pext, pov, psel);
nmiss_flag1b=rand("Table", pmale, pagyrs,
ped, pint, pext, pov, psel);
run;
proc freq data=complete_miss3;
tables nmiss_flag1a nmiss_flag1b ;
run;

 

The issue with this, is that often nmiss_flag1a and nmiss_flag1b are both the same variable. I am wondering if there is a way to select 2 or 3 variables from the list to be missing. i.e. a multivariate Bernoulli random list where you set 2 or 3 to be missing?

 

I am also wondering if the probabilities have to add up to 1 across all variables in a table? These probabilities of the number of observations missing 1 or 2 variables (after deleting observations with particular common patterns). After accounting for common patterns of missing, less than 1% of the sample was only missing 1 item so I have combined the remaining into missing 2 item for the simulation (i.e. the probabilities of missingness are based on the sample missing 1 or 2 variables and do not add up to 1). Do I need to scale these probabilities so they=1 or will they automatically be scaled (the resulting distributions of random numbers from this code is accurate). 

 

Thanks so much, 

Jillian 

7 REPLIES 7
ballardw
Super User

I think you are missing some details. For instance:

The issue with this, is that often nmiss_flag1a and nmiss_flag1b are both the same variable. 

Cannot be true from the example code you posted. So something is missing.

Such

 I am wondering if there is a way to select 2 or 3 variables from the list to be missing. i.e. a multivariate Bernoulli random list where you set 2 or 3 to be missing?

what list? I don't see any list of variables.

 

You could assign different probabilities of whether a specific variable is to be set missing with something like:

 

If rand('uniform') le .25 then call missing(var);

 

would set the var to missing roughly 25 percent of the time. I used call missing as it doesn't care whether a variable is character or numeric.

If you have multiple variables that would use the same probability then a simple loop over an array:

 

array v   <list the variable names>; These would all have to be the same type, character or numeric variables

 

do I=1 to dim(v);

   If rand('uniform') le .25 then call missing(v[I]);

end;

 

would set each variable missing with a probability of .25.

If you want each variable to have a different value then an array of the probabilities could be used.

 

array v var1 var2 var3;

array p {3} _temporary_ ( 0.1, 0.25, 0.001);

do I=1 to dim(v);

   If rand('uniform') le p[I] then call missing(v[I]);

end;

would use 0.1 as probability for setting var1 missing, 0.25 for var2 and 0.001 for var3.

 

The number of elements in the P array would have to match the number of elements in the V array.

 

If your assigned table values do not total 1 then a last category is created with the remainder.

Note that in the following example you will get mostly 3 as results even though only 2 values are used in the table.

data junk;
   do i=1 to 10;
      x=rand('table',.1,.1);
      output;
   end;
run;
FreelanceReinh
Jade | Level 19

Hello @halladje,


@halladje wrote:

I am wondering if there is a way to select 2 or 3 variables from the list to be missing. i.e. a multivariate Bernoulli random list where you set 2 or 3 to be missing?


I think this is exactly what you'd actually need: A random vector consisting of 7 components (one for each variable) with Bernoulli distributions. Then, for a realization of, say, (0,1,0,0,1,0,0) you would set the second and fifth variable to missing.

 

Ideally, you would need to specify the multivariate distribution of that random vector (i.e. the joint distribution of the 7 components) in order to create "realistic" patterns of missingness. This distribution has 2**7-1=127 parameters, though. What you currently have seem to be only the marginal distributions (7 parameters). You may want to use the relative frequencies of the 128 possible missing patterns in the complete sample as estimates for the parameters. (The 128th relative frequency is, of course, 1 minus the sum of the others, hence only 127 [independent] parameters.) [Edit: You could then use the "Table" distribution with 127 parameters to select one of the 128 possible patterns of missingness.]

 

Alternatively, you could start with the marginal distribution (Bernoulli) of missingness for the first variable (which you have already), then determine the relative frequencies estimating the two conditional distributions (again, Bernoulli) for the second, given that the first is 1 ("missing") or 0 ("non-missing"), respectively, then continue with the four conditional distributions for the third variable, given the four possible patterns of the first and second variable (00, 01, 10, 11) and so on. (Again 127 parameters: 1+2+4+8+16+32+64=127.)

 

However, if you're fine with the assumption of 7 statistically independent Bernoulli distributions (do the relative frequencies support this assumption?), then your existing 7 parameters are sufficient. As ballardw has mentioned, you have used the "Table" distribution inappropriately. Instead, you should use the Bernoulli distribution seven times with the individual parameters (0.103, 0.061, etc.) for each variable.

 

halladje
Fluorite | Level 6

Hello there, 

 

Thank you for your comments and suggestions. 

 

How would I use all of the missing data patterns I did determine the missing data patterns and then I took any missing data pattern that was >=1% of the sample and dummy coded that pattern. 

 

So far, I have used the ran('table') function to randomly select the missing data patterns. If a certain pattern of missing is randomly selected, then those variables are deleted. So far the patterns include all patterns >1% of the sample. These include:
1) complete case p0=0.687;
2) missing parental education only p1=0.1636;
3) missing student reported SEL only p2=0.0215
4) missing internalizing symptoms p3=0.0197
5) missing externalizing symptoms p4=0.0184
6) missing internalizing, externalizing, and parental education only p5=0.0153
7) missing overall grade p6=0.0111

 

I have also additionally created:
😎 missing 1 or 2 items (but not in one of the most common patterns) p7=0.0466
9) missing 3+ items (but not in one of the most common patterns) p8=0.0169

 

My code is: 

 

*STEP 1: A explore missing data patterns in full sample*;

proc means data=is_agg nmiss n;
var male s_ageyrs j_pared2 J_int j_ext ov js_Sel ;
run;

proc mi data=is_agg nimpute=0 displaypattern=nomeans;
var male s_ageyrs j_pared2 J_int j_ext ov js_Sel ;
ods output MissPattern=Pattern;
run;

 

*code pattern as a variable*;
%let vars = male s_ageyrs j_pared2 J_int j_ext ov js_Sel;

data CodeMissing;
set is_agg;
array V[*] &vars;
MissInd = subpad('0', 1, dim(V));
do i = 1 to dim(V);
substr(MissInd, i, 1) = put(cmiss(V[i]), 1.);
end;
run;

 

data CodeMissing2;
set CodeMissing;
if MissInd=0010000 then patterns=1;
else if MissInd=0000001 then patterns=2;
else if MissInd=0000100 then patterns=3;
else if MissInd=0001000 then patterns=4;
else if MissInd=0011100 then patterns=5;
else if MissInd=0000010 then patterns=6;
else if MissInd=0000000 then patterns=0;
else patterns=7;
run;

*1=missing parended only 16.36, 2=missing jssel only 2.15%, 3=missing ext only 1.97,4=missing int only 1.84,
5=missing ext int pared 1.53% 6=missing ov only 1.11, 7=missing something but not a pattern>1%;

 

*add missing count variable*;

 

data CodeMissing3;
set CodeMissing2;
if patterns=7 then nmiss2=cmiss(of male s_ageyrs j_pared2 J_int j_ext ov js_Sel);
run;
proc freq data=codemissing3;
tables nmiss2;
run;

 

data codemissing3;
set codemissing3;
if nmiss2=1 then patterns=7;
else if nmiss2=2 then patterns=7;
else if nmiss2=3 then patterns=8;
else if nmiss2=4 then patterns=8;
else if nmiss2=5 then patterns=8;
else if nmiss2=6 then patterns=8;
else if nmiss2=7 then patterns=8;
else patterns=patterns;
run;

*collapsed number missing to be at least 1 % of the same.i.e. missing 1 collapesed with 2 missing 2 (pattern=7), and missing 3+ collapsed (patterns=8);

 

**generate missing in complete case sample**;

 

data complete_miss1; set complete_aux;
p0=0.687; p1=0.1636; p2=0.0215; p3=0.0197;
p4=0.0184; p5=0.0153; p6=0.0111; p7=0.0466; p8=0.0169;
pattern_flag=rand("Table", p0, p1, p2, p3, p4, p5, p6, p7, p8);
run;
proc freq data=complete_miss1;
tables pattern_flag;
run;
data complete_miss2; set complete_miss1;
if pattern_flag=2 then j_pared2=.; else j_pared2=j_pared2;
if pattern_flag=3 then js_Sel=.; else js_sel=js_sel;
if pattern_flag=4 then j_ext=.; else j_ext=j_ext;
if pattern_flag=5 then j_int=.; else j_int=j_int;
if pattern_flag=6 then j_int=.; else j_int=j_int;
if pattern_Flag=6 then j_ext=.; else j_ext=j_ext;
if pattern_flag=6 then j_pared2=.; else j_pared2=j_pared2;
if pattern_flag=7 then ov=.; else ov=ov;
run;

 

My post was initially asking about how to deal with the subsequent patterns 8 (missing 1 or 2 variables but not in a pattern >=1% of the sample) and 9 (missing 3+ variables but not in a pattern). Is there a different more efficient way to do this? 

 

FreelanceReinh
Jade | Level 19

Thanks for your detailed reply. I haven't checked all the details, but it seems that you're already selecting 0-1-patterns of missingness and then set variables to missing accordingly -- which is good. (And, yes, the probabilities for all possible patterns must add to 1.) A few things could be improved, e.g., variable MissInd is character, so should be compared to character literals such as '0010000', not numbers without quotes (see notes about automatic type conversion in the log). Statements like "else j_pared2=j_pared2;" are redundant.


@halladje wrote:

My post was initially asking about how to deal with the subsequent patterns 8 (missing 1 or 2 variables but not in a pattern >=1% of the sample) and 9 (missing 3+ variables but not in a pattern). Is there a different more efficient way to do this? 

 


At the end of the day, you are effectively simulating the random vector I mentioned in my earlier post. Maybe the most transparent approach would be:

  1. Specify the 127(+1) probabilities determining the multivariate distribution of the random vector. Many of them will be 0 if you don't want to get (most of the) patterns with more than two or three missings -- which simplifies things.
  2. Use the "Table" distribution with the non-zero values (or all but one) among those 128 probabilities to generate the sequence number of a pattern.
  3. Set variables to missing according to the randomly selected pattern.

As to step 1 above, you seem to have the probabilities already for the most frequent individual patterns. So, what remains is to define probabilities for the "collapsed" patterns you've mentioned. For example, if there are n distinct patterns (outside the set of the "most frequent" patterns) containing exactly 2 missings and their total probability should be, say, 0.04567, then you have to decide on the distribution: Would you deem 0.04567/n appropriate for the probability of each of the n patterns? Or rather the observed individual relative frequencies (assuming their sum is 0.04567)? Or perhaps the (scaled) expected relative frequencies assuming independence (using the marginal probabilities in the calculation)? Or something else?

koyelghosh
Lapis Lazuli | Level 10

@halladjeYou have choosen RAND('Table',x1,x2,x3,x4,x5,x6,x7). My understanding about RAND('Table',p1,p2,...pn) is as follows. My interpretation might be wrong but I have followed the official documentation as here. You are passing a list of N probabilities (in your case 7). Based on what is the probability at each index, that index will get selected and the output will be that index number itself. Please see the documentation link shared above. In other words if x1 has higher probability than x2 then the output will be 1 more times than 2.

 

For example, I created a demo data table with 100 rows where each row output was coming from a RAND('Table', 0.05, 0.1, 0.15, 0.22, 0.23, 0.2499) function. Here Index 1 has the least probability, index 2 has more probability, index 3 has still higher probability while index 6 has maximum probability. Also please note that they add up to very close to 1. So I expected 1,2,3,4,5 and 6 to be the output with probability coming from the function. The output is as below followed by the code used to generate the data and the output image.

 

Output percentage of Number matches the probability tableOutput percentage of Number matches the probability table

 

DATA DEMO;
	INPUT OutputNum 2. @@;
	OutputNum = RAND('Table',0.05,0.1,0.15,0.22,0.23,0.2499);
	DATALINES;
	0 0 0 0 0 0 0 0 0 0
	0 0 0 0 0 0 0 0 0 0
	0 0 0 0 0 0 0 0 0 0
	0 0 0 0 0 0 0 0 0 0
	0 0 0 0 0 0 0 0 0 0
	0 0 0 0 0 0 0 0 0 0
	0 0 0 0 0 0 0 0 0 0
	0 0 0 0 0 0 0 0 0 0
	0 0 0 0 0 0 0 0 0 0
	0 0 0 0 0 0 0 0 0 0
	;
RUN;

PROC TABULATE DATA=DEMO;
	CLASS OutputNum;
	TABLE OutputNum*PCTN;
RUN;

I am sorry if you knew all this. I just used the previous paragraphs as stepping stones to answer your questions, as below.

 

I am also wondering if the probabilities have to add up to 1 across all variables in a table? 

The documentation, shared above, clearly says that sum of probabilities can be different from 1 but then accordingly the integers that will be output will be different.

 

Do I need to scale these probabilities so they=1 or will they automatically be scaled (the resulting distributions of random numbers from this code is accurate). 

I don't think that will give you what you want but that is only my opinion.

 

I am wondering if there is a way to select 2 or 3 variables from the list to be missing. i.e. a multivariate Bernoulli random list where you set 2 or 3 to be missing?

Since for both the missing flags (that you are trying to simulate) you are essentially using the same probabilities, it is quite natural to assume that there will be instances when the number flags will be same. Actually the sum of 0.103,0.061,0.620,0.342,0.324,0.218,0.230 is definitely more than one and according to the rules described in the documentation the output will only take one of the four values (1,2,3 and 4). If you really want to produce entirely non-overlapping numbers then ballardw has already given you a very nice advise. Please follow that.

 

I am sorry if you knew all this already and I underestimated your problem or misread the requirement completely. I thought I should let you know about this so that you not only take advantage of the advise, already given, but also know a little more about what is going on with RAND('Table'). As I said, I am sorry if you already knew about this.

halladje
Fluorite | Level 6

Hi There, 

 

Thank you for this information - I am relatively new to SAS and I have a steep learning curve for this project so I appreciate any advice. 

 

It is my understanding that I can use ran('Table') if I want to select a mutually exclusive variable? For example, if you see my previous code I posted in a previous response. I initially used table to select a missing data pattern (mutually exclusive, as if you are pattern1 you cannot be pattern2). 

 

Is this correct from your point of view? 

 

I see your point, that in this case, my variables are not mutually exclusive - i.e. you can be missing more than one of the list so table is not appropriate. As per above, a vector of Bernoulli random variables is more appropriate. Although - by using Bernoulli random variables, I will have a chance of having >2 missing variables in the vector. I only want 2 of the 7 variables to be missing, based on different probabilities of missingness for each variable. Any ideas? 

 

Thanks so much for your time and insight!

ballardw
Super User

@halladje wrote:

Hi There, 

 

Thank you for this information - I am relatively new to SAS and I have a steep learning curve for this project so I appreciate any advice. 

 

It is my understanding that I can use ran('Table') if I want to select a mutually exclusive variable? For example, if you see my previous code I posted in a previous response. I initially used table to select a missing data pattern (mutually exclusive, as if you are pattern1 you cannot be pattern2). 

 

Is this correct from your point of view? 

 

I see your point, that in this case, my variables are not mutually exclusive - i.e. you can be missing more than one of the list so table is not appropriate. As per above, a vector of Bernoulli random variables is more appropriate. Although - by using Bernoulli random variables, I will have a chance of having >2 missing variables in the vector. I only want 2 of the 7 variables to be missing, based on different probabilities of missingness for each variable. Any ideas? 

 

Thanks so much for your time and insight!


The type of loop I showed with arrays could be modified to count the number of times a missing value has been selected and if you get 2 then either 1) Leave the loop (as in the LEAVE instruction or 2) don't execute the test and missing assignment.

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 7 replies
  • 1619 views
  • 6 likes
  • 4 in conversation