Hi everyone,
Currently, I'm trying to perform multiple imputation on my original dataset of 1000 observations with missing data. Missing values were not coded as 99 or 999, but as " " or ".". I created 20 imputed datasets using proc mi and used proc genmod to compute parameter estimates. Then I used proc mianalyze to pool these estimates. So I followed all the steps of multiple imputation.
However, I now have one giant imputed dataset with 20 x 1000 = 20,000 observations. If I run my analyses on this giant dataset, everything suddenly becomes significant due to the large sample size. How do I get the single most optimal imputed dataset with 1000 observations (just like the original) from these 20 imputed datasets?
I would appreciate it if someone could help me out!
Here is my syntax. If you have any feedback on it or tips for me, please let me know.
proc mi data=cko.blazib nimpute=20 seed=54321 out=cko.mi1_blazib;
  class nac gesl ses_cbs stage histo ps cci;
  var nac gesl leeft ses_cbs stage histo ps cci ckd_epi bmi;
  fcs logistic (ps = nac gesl leeft ses_cbs stage histo cci ckd_epi bmi / link=logit) nbiter=200;
  fcs logistic (cci = nac gesl leeft ses_cbs stage histo ps ckd_epi bmi / link=logit) nbiter=200;
  fcs logistic (ses_cbs = nac gesl leeft stage histo ps cci ckd_epi bmi / link=logit) nbiter=200;
  fcs regpmm (ckd_epi = nac gesl leeft ses_cbs stage histo ps cci bmi) nbiter=200;
  fcs regpmm (bmi = nac gesl leeft ses_cbs stage histo ps cci ckd_epi) nbiter=200;
  fcs plots=trace(mean std);
run;

proc genmod data=cko.mi1_blazib;
  class nac gesl ses_cbs stage histo ps cci;
  model nac(event="1") = gesl leeft ses_cbs bmi stage histo ps cci ckd_epi bmi / dist=binomial link=logit;
  by _imputation_;
  ods output ParameterEstimates=cko.gm_fcs;
run;

proc mianalyze parms(classvar=level)=cko.gm_fcs;
  class gesl ses_cbs stage histo ps cci;
  modeleffects INTERCEPT gesl leeft ses_cbs stage histo ps cci ckd_epi bmi;
run;
I think what you are missing is the EDF= option in the PROC MIANALYZE statement. Sometimes, when there is only a modest proportion of missing data, the computed degrees of freedom can be much larger than the complete-data DF, which overstates significance (p-values that are too small). Setting the EDF= option to the complete-data DF invokes an adjustment that is explained in the documentation.
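As a rough sketch of what that could look like with the syntax posted above: the complete-data DF is taken here as the number of observations (1000) minus the number of estimated parameters. The parameter count of 20 is a made-up placeholder, not something taken from this model, so replace it with the actual number of rows per imputation in the ParameterEstimates table.

```sas
/* Hypothetical value: number of estimated parameters in the GENMOD model
   (intercept plus one per continuous effect and per non-reference class level).
   20 is a placeholder only; use the real count from your own output. */
%let n_params = 20;

/* EDF= supplies the complete-data degrees of freedom (assumed here: n - p),
   which PROC MIANALYZE uses to compute adjusted degrees of freedom. */
proc mianalyze parms(classvar=level)=cko.gm_fcs
               edf=%eval(1000 - &n_params);
  class gesl ses_cbs stage histo ps cci;
  modeleffects INTERCEPT gesl leeft ses_cbs stage histo ps cci ckd_epi bmi;
run;
```

Everything else in the step is unchanged from the original post; only the EDF= option is added.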
Thanks for your reply! Can you please elaborate? I don't think the EDF option will yield 1 optimal imputed dataset based on the 20 imputed datasets. I need a procedure that combines these 20 into 1 optimal dataset, and I just can't seem to find the procedure for that.
Sorry for the confusion; I read your question as pertaining to the output from MIANALYZE.
You are right: no single optimal data set is produced, because that is not the purpose of multiple imputation. The purpose of MI is to enable you to obtain valid statistical inferences that properly reflect the uncertainty due to missing values; for example, valid confidence intervals for parameter estimates.
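For reference, the standard combining rules (Rubin's rules) show why the 20 data sets are analyzed separately and only the estimates are pooled. The notation is introduced here for illustration: m is the number of imputations (20 in this thread), \hat{Q}_i is a parameter estimate from imputation i, and U_i is its estimated variance.

```latex
\begin{align*}
\bar{Q} &= \frac{1}{m}\sum_{i=1}^{m}\hat{Q}_i
  && \text{(pooled estimate)} \\
\bar{U} &= \frac{1}{m}\sum_{i=1}^{m}U_i
  && \text{(within-imputation variance)} \\
B &= \frac{1}{m-1}\sum_{i=1}^{m}\bigl(\hat{Q}_i-\bar{Q}\bigr)^2
  && \text{(between-imputation variance)} \\
T &= \bar{U} + \Bigl(1+\frac{1}{m}\Bigr)B
  && \text{(total variance)}
\end{align*}
```

Each U_i comes from a fit on 1000 observations, so the pooled standard error reflects a sample size of 1000 plus the extra uncertainty B from the imputation. Fitting one model to the stacked 20,000 rows ignores both points, which is why everything looks significant there.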
If you are looking for a single imputed data set, then you should not use multiple imputation but a single imputation method instead.
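One possible single imputation approach, shown only as a minimal sketch and not as a recommendation: PROC STDIZE with the REPONLY option replaces missing values with a location measure without standardizing the data. It applies to numeric variables only, so the example covers just ckd_epi and bmi; the categorical variables would remain missing. The output data set name cko.single_blazib is hypothetical.

```sas
/* Single (median) imputation for the two continuous variables only.
   REPONLY replaces missing values and leaves observed values untouched.
   cko.single_blazib is a hypothetical output data set name. */
proc stdize data=cko.blazib out=cko.single_blazib
            reponly method=median;
  var ckd_epi bmi;
run;
```

Keep in mind that analyses on a singly imputed data set treat the filled-in values as if they were observed, so they do not reflect the uncertainty due to missing values described above.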
I don't know if this is relevant, but in the GENMOD step you specified BMI twice. I try to make sure that the effects in my MODEL statement (PROC GENMOD) and MODELEFFECTS statement (PROC MIANALYZE) are in the same order. It might not make any difference.
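A sketch of how the two steps could look with BMI listed once and the effects in the same order in both statements. All data set and variable names are taken from the original post; the only change is dropping the first bmi from the MODEL statement.

```sas
/* Fit the model separately within each imputed data set. */
proc genmod data=cko.mi1_blazib;
  class nac gesl ses_cbs stage histo ps cci;
  /* bmi appears once; the order matches MODELEFFECTS below */
  model nac(event="1") = gesl leeft ses_cbs stage histo ps cci ckd_epi bmi
        / dist=binomial link=logit;
  by _imputation_;
  ods output ParameterEstimates=cko.gm_fcs;
run;

/* Pool the 20 sets of estimates. */
proc mianalyze parms(classvar=level)=cko.gm_fcs;
  class gesl ses_cbs stage histo ps cci;
  modeleffects INTERCEPT gesl leeft ses_cbs stage histo ps cci ckd_epi bmi;
run;
```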