David_M
Obsidian | Level 7

I'm a new SAS user and I finally managed to impute a dataset of 110 mixed-type variables with PROC MI, using the code below:

 

proc mi data=F0_Data NIMPUTE=50 out=F0_Imputed_Data seed=54321;

    class Q4_diseaseF0 Q5_MSD_SymptomsF0 Q6a1F0 Q6b1F0 Q6c1F0 Q6d1F0 Q6e1F0 Q6f1F0 Q7a1F0 Q7b1F0 Q7c1F0 Q7d1F0 Q7e1F0 Q7f1F0 Q7g1F0 ... ;

    var Q3_BMI_CategoryF0 Q4_diseaseF0 Q5_MSD_SymptomsF0 Q6a1F0 Q6b1F0 Q6c1F0 Q6d1F0 Q6e1F0 Q6f1F0 Q7a1F0 Q7b1F0 Q7c1F0 Q7d1F0 Q7e1F0 Q7f1F0 Q7g1F0 ... ; /* up to 110 variables */

    /* Use FCS for mixed types */
    fcs
        /* Use linear regression for continuous variables */
        reg(Q3_BMI_CategoryF0 Q10_raw_sleep_scoreF0 Q11_13_sleep_deficitF0 Q16bF0 Q16cF0 Q19_CESDF0 Q20_WRFQF0 ...)
        /* Use logistic regression for binary variables */
        logistic(Q4_diseaseF0 Q7a1F0 Q7b1F0 Q7c1F0 Q7d1F0 Q7e1F0 Q7f1F0 Q7g1F0 ... /details LIKELIHOOD=AUGMENT link=logit descending)
        /* Use logistic regression for ordinal variables */
        logistic(Q5_MSD_SymptomsF0 Q6a1F0 Q6b1F0 Q6c1F0 Q6d1F0 Q6e1F0 ... /details LIKELIHOOD=AUGMENT link=logit descending)
        /* Use discriminant analysis for nominal variables */
        discrim(Q16aF0 Q16dF0 Q24_Job_title_codeF0 Q26_SHIFTCAT_F0 ... /CLASSEFFECTS=INCLUDE);
run;

1) I need to pick the best (optimal) imputed dataset from the 50 generated, using PROC MIANALYZE (and other procedures?), but I have no clue how to use it properly in my mixed-variable case. As explained earlier, I have 110 mixed variables from over 400 survey respondents, with no missing values at this point. The PROC MIANALYZE examples in the SAS documentation are not helpful.

 

2) Secondly, many of the imputed values for non-continuous variables are floating point, which doesn't make sense. How do I round them up or down while leaving the continuous values as is? It seems the ROUND= option (round=0.1, for example) applies to all variables, which is not desirable.

 

Thanks,

David

 

Accepted Solutions
SAS_Rob
SAS Employee

Regarding your first question, there is no optimal data set. The purpose of using Proc MI is to create several data sets so as to avoid the problems of bias that are typically associated with single imputation. Using only one of those data sets would still leave you with most of those issues. It might be helpful to review the Overview section of the MI documentation to familiarize yourself with its purpose.

SAS Help Center: Overview: MI Procedure

Multiple imputation inference involves three distinct phases:

  1. The missing data are filled in m times to generate m complete data sets.

  2. The m complete data sets are analyzed by using standard procedures.

  3. The results from the m complete data sets are combined for the inference.

MIANALYZE would be used in the 3rd step.  Proc MI is used as the 1st step.  The 2nd step would be to run whatever analytical procedure you are planning to run (such as for a regression you would use Proc REG).
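
As a rough sketch of steps 2 and 3, assuming the analysis is a linear regression: the outcome and predictors below are picked from the continuous variables in the question purely for illustration, and reg_est is a hypothetical data set name. Proc MI adds the _Imputation_ index variable to its OUT= data set, so the model can be fitted once per imputation with a BY statement and the estimates pooled afterwards.

```sas
/* Step 2: fit the same model separately to each imputed data set.   */
/* OUTEST= with COVOUT saves the parameter estimates and their       */
/* covariance matrices, one set per value of _Imputation_.           */
/* The model itself is an assumption made for this example.          */
proc reg data=F0_Imputed_Data outest=reg_est covout noprint;
    model Q19_CESDF0 = Q10_raw_sleep_scoreF0 Q20_WRFQF0;
    by _Imputation_;
run;

/* Step 3: combine the per-imputation results into one set of        */
/* estimates, standard errors, and confidence limits.                */
proc mianalyze data=reg_est;
    modeleffects Intercept Q10_raw_sleep_scoreF0 Q20_WRFQF0;
run;
```

The output of Proc MIANALYZE is a single table of pooled estimates; none of the 50 data sets is singled out along the way.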

 

Regarding your second question, if the data are gathered as discrete measurements, then you may need to use the CLASS statement as already suggested. If instead you want to round the measurements to a certain precision, then you could use the ROUND= option, but you will need to be explicit in mapping it to the variables.

So for example, if you want to round variables A and C to the nearest tenth and not round B and D, you would use the following:

proc mi data=yourdata out=outdata round=.1 . .1 .;
    var a b c d;
run;

Note how the ordering of the values on the ROUND= option matches the ordering of the variables on the VAR statement.  
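
With a long VAR list, a positional ROUND= list becomes hard to maintain. One possible alternative, sketched here under the assumption that the variables to be rounded are numeric and were imputed as continuous, is to round a named subset in a DATA step after imputation. The three variables in the array and the output name F0_Imputed_Rounded are placeholders for illustration, not a recommendation of which variables to round.

```sas
/* Post-imputation rounding of selected variables only.              */
/* Variables not listed in the array keep their imputed values.      */
data F0_Imputed_Rounded;                /* hypothetical output name   */
    set F0_Imputed_Data;
    array to_round {*} Q5_MSD_SymptomsF0 Q6a1F0 Q6b1F0;  /* placeholder list */
    do i = 1 to dim(to_round);
        to_round{i} = round(to_round{i}, 1);   /* nearest whole unit  */
    end;
    drop i;
run;
```

Because the DATA step keeps _Imputation_, the rounded data set can be analyzed by imputation in the same way as the original output.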


REPLIES
Ksharp
Super User
1)
If you "need to pick the best /most optimum imputed dataset", that has nothing to do with PROC MIANALYZE.
PROC MIANALYZE just combines the results from these 50 imputed datasets into ONE set of model estimates.
If you want to pick the best imputed dataset, why not run your model (e.g. PROC MIXED) on each of the 50 datasets and find the one with the smallest AIC or BIC?

2) For this scenario, you need to put the variable in the CLASS statement to let PROC MI know it is a categorical variable, not a continuous one.
David_M
Obsidian | Level 7

Thank you for this important clarification, @Ksharp! Now that I think of it, picking the dataset that gives the smallest AIC, BIC, etc. is a form of selection bias or p-hacking that cherry-picks the results that best support my hypothesis. Doesn't choosing one best dataset ignore the uncertainty in the imputation process that is captured in the other 49 datasets? My concern is that this would lead to standard errors that are too small and confidence intervals that are too narrow, which could result in me incorrectly finding a statistically significant result (Type I error).

 

Would it be possible to perform a root-mean-square (RMS) type operation on all 50 imputed values of a particular cell and use the result as the optimal value for the missing cell? For example, take all 50 imputed values for empty cell 5 of variable BMI_F0, compute their root mean square, and replace the empty cell with that value. This would be done for every cell with a missing value.
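
The concern about understated standard errors can be made concrete with Rubin's combining rules, which Proc MIANALYZE applies when pooling. As a sketch, write $\hat{Q}_i$ for the estimate of a parameter from the $i$-th of $m$ imputed data sets and $U_i$ for its estimated variance (this notation is assumed here, not taken from the thread):

```latex
\bar{Q} = \frac{1}{m}\sum_{i=1}^{m}\hat{Q}_i, \qquad
\bar{U} = \frac{1}{m}\sum_{i=1}^{m}U_i, \qquad
B = \frac{1}{m-1}\sum_{i=1}^{m}\bigl(\hat{Q}_i-\bar{Q}\bigr)^2, \qquad
T = \bar{U} + \Bigl(1+\frac{1}{m}\Bigr)B
```

Here $\bar{U}$ is the within-imputation variance, $B$ is the between-imputation variance, and $T$ is the total variance used for the pooled standard error. Collapsing the $m$ imputed values of a cell into one number, whether by a mean or a root mean square, leaves a single completed data set, so $B$ cannot be estimated and drops out of $T$.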

Ksharp
Super User
Sorry, I am not an expert in multiple imputation. Maybe others can help.
David_M
Obsidian | Level 7

So, if you have lots of variables in the VAR statement like I do (110), a mixture of continuous, binary, ordinal and nominal variables, the ROUND= list would have to be 110 values long? I only want to round the latter three types to 1 unit, leaving the continuous ones alone.
