Solved: Determining missing values for the final analytic sample

Heejeong · Posted 05-27-2022 04:41 PM

Hello,

I am running a PROC GLM and have a total of 1255 participants in my sample. After conducting my GLM below, however, the final number of observations is 521 due to missing values across many covariates, predictor and the dependent variable.

proc glm data=JH.Final3 ;
model Norepinephrine = cage ClinicSex marriedmidus work race_orig  cedu cHHtotalIncome EverSmokeReg Exercise20mins CNSmeds cBMI cCESD cNeuroticism  cChronCondNumb  cAnyStressWide_sum pa_mlm2 cPAreactpost;
;run;

I am trying to write my Methods section and show how I reached my final analytic sample (e.g., "XX participants were excluded for missing data on income, XX for missing data on education....fill this out for all variables in the model....leaving a final sample size of 521." Since I have so many covariates, is there a syntax that can help me determine these numbers? I used the below syntax to see the sample size without missing variables for all variables, but it doesn't tell me how many missing vars exist for each covariate. But the sample size never goes down below 921 so I can't quite tell how I am left with 521 participants in the final sample. Thanks for any thoughts anyone might have on this!

data JH.Final3_nomiss;
   set JH.Final3;
   where not missing(cAnyStressWide_sum);
run;

*921 valid csrna_mlm2
*921 valid csrpa_mlm2
*1011 valid na_mlm2
*1011 valid pa_mlm2
*1243 valid Norepinephrine
*1054 valid cage
*1054 valid ClinicSex
*1054 valid marriedmidus
*1054 valid work
*1051 valid race_orig
*1051 valid cedu
*1032 valid cHHtotalIncome
*767 valid EverSmokeReg
*1255 valid Exercise20mins
*1137  valid CNSmeds
*1254 valid cBMI
*1246 valid cCESD
*1050  valid cNeuroticism
*1255  valid cChronCondNumb
*1011 valid  cAnyStressWide_sum

Reeza · Posted 05-27-2022 04:58 PM

If you have 100 observations, but 5 missing values for Variable 1 and 5 missing values for Variable 2 - in different rows then you have a total of up to 10 excluded records. You could have as low as 5. It depends on how they overlap within the rows, 5 means they overlap entirely, 10 means they do not overlap at all.

This blog posts illustrates how to see the pattern of missing data and then links to an interesting method to visualize it.

https://blogs.sas.com/content/iml/2016/04/18/patterns-of-missing-data-in-sas.html

View solution in original post

Reeza · Posted 05-27-2022 04:58 PM

If you have 100 observations, but 5 missing values for Variable 1 and 5 missing values for Variable 2 - in different rows then you have a total of up to 10 excluded records. You could have as low as 5. It depends on how they overlap within the rows, 5 means they overlap entirely, 10 means they do not overlap at all.

This blog posts illustrates how to see the pattern of missing data and then links to an interesting method to visualize it.

https://blogs.sas.com/content/iml/2016/04/18/patterns-of-missing-data-in-sas.html

ballardw · Posted 05-27-2022 08:29 PM

You really need to provide 1) example data and 2) what you expect to see as result for your example.

Your data step makes no sense in terms of what you are asking as single variables are not what you want.

Try this:

data JH.Final3_missinfo;
   set JH.Final3;
   nummissing =cmiss(Norepinephrine,cage,ClinicSex,marriedmidus,work,race_orig,cedu,cHHtotalIncome,EverSmokeReg,Exercise20mins,CNSmeds,cBMI,cCESD,cNeuroticism,cChronCondNumb,cAnyStressWide_sum,pa_mlm2,cPAreactpost);

run;

If Nummissing is 1 or more then that record was excluded from your model. So you could filter on the Nummissing variable to select or exclude records that are missing. The value tells you how many variable for any record were missing.

mkeintz · Posted 05-27-2022 10:37 PM

This is a simple program to determine the frequency of each observed pattern of missing values (untested)::

data missing_patterns;
  set jh.final3;
  array vars Norepinephrine      cage      ClinicSex       marriedmidus   work
             race_orig           cedu      cHHtotalIncome  EverSmokeReg   Exercise20mins 
             CNSmeds             cBMI      cCESD           cNeuroticism   cChronCondNumb  
             cAnyStressWide_sum  pa_mlm2   cPAreactpost ;
  do over vars;
    if missing(vars) then vars=1;
    else vars=0;
  end;
run;
proc summary data=missing_patterns  nway n noprint;
  class Norepinephrine cage ClinicSex marriedmidus work race_orig  cedu 
             cHHtotalIncome EverSmokeReg Exercise20mins CNSmeds cBMI cCESD cNeuroticism  
             cChronCondNumb  cAnyStressWide_sum pa_mlm2 cPAreactpost ;
  output out=pattern_frequency (drop=_type_);
run;
proc print data=pattern_frequency;
run;

You should have a frequency of 521 for the all-zero pattern.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Heejeong · Posted 05-31-2022 02:54 PM

Thank you all very much for your helpful comments, I tried all of them and have a few comments and follow-up questions:

1) I first wanted to clarify my question. As I write up my manuscript, I want to be able to provide my readers with how I reached the final analytic sample. And below is a simpler version (copy-pasted from a published manuscript) of the type of information I want to deliver in the manuscript, how our sample size was reduced down to the final sample of 521 participants.

"Affective reactivity was calculated for all participants who had both stressor days (i.e., days on which a stressor occurred) and nonstressor days; 43 participants were excluded because they experienced stressors every day, and 70 were excluded because they experienced no stressors during the study. An additional 16 participants were excluded for missing data on income, leaving a final sample size of 872 for the primary analyses. Procedures were approved by Institutional Review Boards at participating sites, and all participants provided informed consent."

2) Attached is the output I get when I run the syntax below (the original sample size is 2,266 participants). These "patterns of missing data" are super helpful and insightful but with so many patterns, I am having a hard time figuring out how I can write up the above section. Would the best method be summing all values in the "freq" column for each variable for each row that has a missing value? For example, for the variable "cPAreactpost," the total number of missing participants would be calculated by adding all frequency values in the yellow highlighted cells, which would add up to 482 participants who have missing values on this variable?

ods select MissPattern;
proc mi data=JH.Final3  nimpute=0;
var cage ClinicSex marriedmidus work race_orig  cedu cHHtotalIncome EverSmokeReg Exercise20mins CNSmeds cBMI cCESD cNeuroticism  cChronCondNumb  cAnyStressWide_sum pa_mlm2 cPAreactpost;
run;

Thank you again for all your help! I apologize for not being to upload the data because it's too big even in a zip file.

Thank you again!

Reeza · Posted 05-31-2022 03:01 PM

It's rare to be able to say it as simply as that.

For you it will be something like:

82 observations had missing values for Var1, Var2, Var3,

83 observations had missing values for Var2, Var4.

You can go through the output to get that information.

Heejeong · Posted 05-31-2022 04:59 PM

Hi @Reeza,

Thank you so much for confirming that it's often not as simple as the example I showed you and for writing out an example of how I should think about reporting the missing values!

Heejeong · Posted 05-31-2022 05:44 PM

@Reeza , I did wanted to post one quick follow-up question! I wanted to ensure that the way I am thinking of the following output and the missing data patterns are correct!

The last four variables in the excel sheet (csrna_mlm2, csrpa_mlm2, cAnyStressWide_sum, Norepinephrine) are my main IVs and DV. So I'm just going to add up a total number of participants missing data for these main variables of interest and write something like the below in my manuscript. I calculated 343 by adding up the people highlighted in orange in the excel sheet below and wanted to make sure that I am understanding this output correctly. Thank you SO much for all your helpful advice again!

"From the original sample of 1,255 participants, a total of 343 participants were excluded for missing data on the main variables of interest (negative affect reactivity, positive affect reactivity, total number of stressors, and levels of norepinephrine). Further, an additional 391 participants were excluded for missing data on covariates, leaving a final sample size of 521for the final analyses."

Reeza · Posted 05-31-2022 06:07 PM

That seem statistically correct. As a reviewer, I'd be curious as to why over half your sample was excluded from the analysis? Were variables missing at random or not missing at random?
https://www.ncbi.nlm.nih.gov/books/NBK493614/

Heejeong · Posted 06-01-2022 12:28 PM

Thank you for this additional comment about the missing data pattern!

I will think about this more as this is an important question.

Determining missing values for the final analytic sample

Re: Determining missing values for the final analytic sample

Re: Determining missing values for the final analytic sample

Re: Determining missing values for the final analytic sample

Re: Determining missing values for the final analytic sample

Re: Determining missing values for the final analytic sample

Re: Determining missing values for the final analytic sample

Re: Determining missing values for the final analytic sample

Re: Determining missing values for the final analytic sample

Re: Determining missing values for the final analytic sample

Re: Determining missing values for the final analytic sample

Registration is open