Hello,
I am running a PROC GLM and have a total of 1255 participants in my sample. After conducting my GLM below, however, the final number of observations is 521 due to missing values across many covariates, predictor and the dependent variable.
proc glm data=JH.Final3 ;
model Norepinephrine = cage ClinicSex marriedmidus work race_orig cedu cHHtotalIncome EverSmokeReg Exercise20mins CNSmeds cBMI cCESD cNeuroticism cChronCondNumb cAnyStressWide_sum pa_mlm2 cPAreactpost;
;run;
I am trying to write my Methods section and show how I reached my final analytic sample (e.g., "XX participants were excluded for missing data on income, XX for missing data on education....fill this out for all variables in the model....leaving a final sample size of 521." Since I have so many covariates, is there a syntax that can help me determine these numbers? I used the below syntax to see the sample size without missing variables for all variables, but it doesn't tell me how many missing vars exist for each covariate. But the sample size never goes down below 921 so I can't quite tell how I am left with 521 participants in the final sample. Thanks for any thoughts anyone might have on this!
data JH.Final3_nomiss;
set JH.Final3;
where not missing(cAnyStressWide_sum);
run;
*921 valid csrna_mlm2
*921 valid csrpa_mlm2
*1011 valid na_mlm2
*1011 valid pa_mlm2
*1243 valid Norepinephrine
*1054 valid cage
*1054 valid ClinicSex
*1054 valid marriedmidus
*1054 valid work
*1051 valid race_orig
*1051 valid cedu
*1032 valid cHHtotalIncome
*767 valid EverSmokeReg
*1255 valid Exercise20mins
*1137 valid CNSmeds
*1254 valid cBMI
*1246 valid cCESD
*1050 valid cNeuroticism
*1255 valid cChronCondNumb
*1011 valid cAnyStressWide_sum
If you have 100 observations, but 5 missing values for Variable 1 and 5 missing values for Variable 2 - in different rows then you have a total of up to 10 excluded records. You could have as low as 5. It depends on how they overlap within the rows, 5 means they overlap entirely, 10 means they do not overlap at all.
This blog posts illustrates how to see the pattern of missing data and then links to an interesting method to visualize it.
https://blogs.sas.com/content/iml/2016/04/18/patterns-of-missing-data-in-sas.html
If you have 100 observations, but 5 missing values for Variable 1 and 5 missing values for Variable 2 - in different rows then you have a total of up to 10 excluded records. You could have as low as 5. It depends on how they overlap within the rows, 5 means they overlap entirely, 10 means they do not overlap at all.
This blog posts illustrates how to see the pattern of missing data and then links to an interesting method to visualize it.
https://blogs.sas.com/content/iml/2016/04/18/patterns-of-missing-data-in-sas.html
You really need to provide 1) example data and 2) what you expect to see as result for your example.
Your data step makes no sense in terms of what you are asking as single variables are not what you want.
Try this:
data JH.Final3_missinfo; set JH.Final3; nummissing =cmiss(Norepinephrine,cage,ClinicSex,marriedmidus,work,race_orig,cedu,cHHtotalIncome,EverSmokeReg,Exercise20mins,CNSmeds,cBMI,cCESD,cNeuroticism,cChronCondNumb,cAnyStressWide_sum,pa_mlm2,cPAreactpost); run;
If Nummissing is 1 or more then that record was excluded from your model. So you could filter on the Nummissing variable to select or exclude records that are missing. The value tells you how many variable for any record were missing.
This is a simple program to determine the frequency of each observed pattern of missing values (untested)::
data missing_patterns;
set jh.final3;
array vars Norepinephrine cage ClinicSex marriedmidus work
race_orig cedu cHHtotalIncome EverSmokeReg Exercise20mins
CNSmeds cBMI cCESD cNeuroticism cChronCondNumb
cAnyStressWide_sum pa_mlm2 cPAreactpost ;
do over vars;
if missing(vars) then vars=1;
else vars=0;
end;
run;
proc summary data=missing_patterns nway n noprint;
class Norepinephrine cage ClinicSex marriedmidus work race_orig cedu
cHHtotalIncome EverSmokeReg Exercise20mins CNSmeds cBMI cCESD cNeuroticism
cChronCondNumb cAnyStressWide_sum pa_mlm2 cPAreactpost ;
output out=pattern_frequency (drop=_type_);
run;
proc print data=pattern_frequency;
run;
You should have a frequency of 521 for the all-zero pattern.
Thank you all very much for your helpful comments, I tried all of them and have a few comments and follow-up questions:
1) I first wanted to clarify my question. As I write up my manuscript, I want to be able to provide my readers with how I reached the final analytic sample. And below is a simpler version (copy-pasted from a published manuscript) of the type of information I want to deliver in the manuscript, how our sample size was reduced down to the final sample of 521 participants.
"Affective reactivity was calculated for all participants who had both stressor days (i.e., days on which a stressor occurred) and nonstressor days; 43 participants were excluded because they experienced stressors every day, and 70 were excluded because they experienced no stressors during the study. An additional 16 participants were excluded for missing data on income, leaving a final sample size of 872 for the primary analyses. Procedures were approved by Institutional Review Boards at participating sites, and all participants provided informed consent."
2) Attached is the output I get when I run the syntax below (the original sample size is 2,266 participants). These "patterns of missing data" are super helpful and insightful but with so many patterns, I am having a hard time figuring out how I can write up the above section. Would the best method be summing all values in the "freq" column for each variable for each row that has a missing value? For example, for the variable "cPAreactpost," the total number of missing participants would be calculated by adding all frequency values in the yellow highlighted cells, which would add up to 482 participants who have missing values on this variable?
ods select MissPattern;
proc mi data=JH.Final3 nimpute=0;
var cage ClinicSex marriedmidus work race_orig cedu cHHtotalIncome EverSmokeReg Exercise20mins CNSmeds cBMI cCESD cNeuroticism cChronCondNumb cAnyStressWide_sum pa_mlm2 cPAreactpost;
run;
Thank you again for all your help! I apologize for not being to upload the data because it's too big even in a zip file.
Thank you again!
It's rare to be able to say it as simply as that.
For you it will be something like:
82 observations had missing values for Var1, Var2, Var3,
83 observations had missing values for Var2, Var4.
You can go through the output to get that information.
Hi @Reeza,
Thank you so much for confirming that it's often not as simple as the example I showed you and for writing out an example of how I should think about reporting the missing values!
@Reeza , I did wanted to post one quick follow-up question! I wanted to ensure that the way I am thinking of the following output and the missing data patterns are correct!
The last four variables in the excel sheet (csrna_mlm2, csrpa_mlm2, cAnyStressWide_sum, Norepinephrine) are my main IVs and DV. So I'm just going to add up a total number of participants missing data for these main variables of interest and write something like the below in my manuscript. I calculated 343 by adding up the people highlighted in orange in the excel sheet below and wanted to make sure that I am understanding this output correctly. Thank you SO much for all your helpful advice again!
"From the original sample of 1,255 participants, a total of 343 participants were excluded for missing data on the main variables of interest (negative affect reactivity, positive affect reactivity, total number of stressors, and levels of norepinephrine). Further, an additional 391 participants were excluded for missing data on covariates, leaving a final sample size of 521for the final analyses."
Thank you for this additional comment about the missing data pattern!
I will think about this more as this is an important question.
Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.
Register today!ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.
Find more tutorials on the SAS Users YouTube channel.