Solved: Re: How to use proc mianalyze after proc mi in a mixed variables case?

David_M · Posted 09-05-2025 06:25 PM

I'm a new SAS user and I finally managed to impute a dataset of 110 mixed-type variables with PROC MI, using the code below:

proc mi data=F0_Data NIMPUTE=50 out=F0_Imputed_Data seed=54321;
    
     class Q4_diseaseF0 Q5_MSD_SymptomsF0 Q6a1F0 Q6b1F0 Q6c1F0 Q6d1F0 Q6e1F0 Q6f1F0 Q7a1F0 Q7b1F0 Q7c1F0 Q7d1F0 Q7e1F0 Q7f1F0 Q7g1F0 ... ;

     var Q3_BMI_CategoryF0 Q4_diseaseF0 Q5_MSD_SymptomsF0 Q6a1F0 Q6b1F0 Q6c1F0 Q6d1F0 Q6e1F0 Q6f1F0 Q7a1F0 Q7b1F0 Q7c1F0 Q7d1F0 Q7e1F0 Q7f1F0 Q7g1F0 ... ; /* up to 110 variables */
     /* Use FCS for mixed types */

     /* Use linear regression for continuous variables */
     fcs reg(Q3_BMI_CategoryF0 Q10_raw_sleep_scoreF0 Q11_13_sleep_deficitF0 Q16bF0 Q16cF0 Q19_CESDF0 Q20_WRFQF0 ...)

     /* Use logistic regression for binary variables */
     logistic(Q4_diseaseF0 Q7a1F0 Q7b1F0 Q7c1F0 Q7d1F0 Q7e1F0 Q7f1F0 Q7g1F0 ... / details LIKELIHOOD=AUGMENT link=logit descending)

     /* Use logistic regression for ordinal variables */
     logistic(Q5_MSD_SymptomsF0 Q6a1F0 Q6b1F0 Q6c1F0 Q6d1F0 Q6e1F0 ... / details LIKELIHOOD=AUGMENT link=logit descending)

     /* Use discriminant analysis for nominal variables */
     discrim(Q16aF0 Q16dF0 Q24_Job_title_codeF0 Q26_SHIFTCAT_F0 ... / CLASSEFFECTS=INCLUDE);
run;

1) I need to pick the best /most optimum imputed dataset from the 50 generated, using PROC MIANALYZE (and other procedures?), but I have no clue how to use it properly in my mixed-variables case. BTW, as explained earlier, I have 110 mixed variables from over 400 survey respondents, with no missing values at this point. The PROC MIANALYZE examples in the SAS documentation have not helped.

2) Secondly, many of the imputed values for non-continuous variables are floating point, which doesn't make sense. How do I round them up or down while leaving the continuous values as is? It seems that the "round=0.1" option, for example, applies to all variables, which is not desirable.

Thanks,

David

SAS_Rob · Posted 09-08-2025 10:56 AM

Regarding your first question, there is no optimal data set. The purpose of using Proc MI is to develop several data sets so as to avoid the problems of bias that are typically associated with single imputation. To use only one of those data sets would still leave you with most of those issues. It might be helpful to review the Overview section of the MI documentation to familiarize yourself with its purpose.

SAS Help Center: Overview: MI Procedure

Multiple imputation inference involves three distinct phases:

1) The missing data are filled in m times to generate m complete data sets.
2) The m complete data sets are analyzed by using standard procedures.
3) The results from the m complete data sets are combined for the inference.

MIANALYZE would be used in the 3rd step. Proc MI is used as the 1st step. The 2nd step would be to run whatever analytical procedure you are planning to run (such as for a regression you would use Proc REG).
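
The three phases can be sketched roughly as follows. This is a minimal illustration, not a recommended analysis: the regression model (Q19_CESDF0 on Q10_raw_sleep_scoreF0 and Q20_WRFQF0) and the data set name F0_RegEst are assumptions made up for the example, and F0_Imputed_Data is the OUT= data set from the Proc MI step in the question.

```sas
/* Phase 2: fit the same model separately to each of the 50 completed data sets. */
/* Proc MI adds the index variable _Imputation_ to its OUT= data set.            */
/* The model below is hypothetical; substitute your own analysis.                */
proc reg data=F0_Imputed_Data outest=F0_RegEst covout noprint;
    model Q19_CESDF0 = Q10_raw_sleep_scoreF0 Q20_WRFQF0;
    by _Imputation_;
run;

/* Phase 3: pool the 50 sets of parameter estimates and covariance matrices. */
proc mianalyze data=F0_RegEst;
    modeleffects Intercept Q10_raw_sleep_scoreF0 Q20_WRFQF0;
run;
```

The pooled standard errors reported by MIANALYZE combine the within-imputation and between-imputation variance, which is why all 50 data sets are carried through to the end. For procedures with CLASS effects, the estimates are typically saved with ODS OUTPUT and read through the PARMS= option instead; the details depend on the procedure.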

Regarding your second question, if the data is gathered as discrete measurements, then it may be that you need to use the CLASS statement as already suggested. If instead you want to round the measurements to a certain precision, then you could use the ROUND= option, but you will need to be explicit in mapping it to the variables.

So for example, if you want to round variables A and C to the nearest tenth and not round B and D, you would use the following:

proc mi data=yourdata out=outdata round=.1 . .1 .;
    var a b c d;
run;

Note how the ordering of the values on the ROUND= option matches the ordering of the variables on the VAR statement.

Ksharp · Posted 09-06-2025 03:48 AM

1)
If you "need to pick the best /most optimum imputed dataset", that has nothing to do with PROC MIANALYZE.
PROC MIANALYZE just combines the results from these 50 imputed datasets into ONE model.
If you want to pick the best imputed dataset, why not run your model (e.g. PROC MIXED) on these 50 datasets and find the one with the smallest AIC or BIC?

2) For this scenario, you need to put the variable in the CLASS statement to let PROC MI know it is a categorical variable, not a continuous one.

David_M · Posted 09-06-2025 07:52 AM

Thank you for this important clarification, @Ksharp! Now, come to think of it, picking the dataset that gives the smallest AIC, BIC, etc. is a form of selection bias or p-hacking that cherry-picks the results that best support my hypothesis. Doesn't choosing one best dataset ignore the uncertainty in the imputation process that is captured in the other 49 datasets? My concern is that this would lead to standard errors that are too small and confidence intervals that are too narrow, which could result in incorrectly finding a statistically significant result (Type I error).

Would it be possible to perform an RMSE-type operation on all 50 imputed values of a particular cell and use the result as the optimum value for the missing cell? For example, take all 50 imputed values for empty cell 5 of variable BMI_F0, compute their root mean square, and replace the empty cell with that value. This would be done for all cells with missing values.

Ksharp · Posted 09-07-2025 07:25 AM

Sorry, I am not an expert on multiple imputation methods. Maybe others can give you some help.

David_M · Posted 09-08-2025 12:58 PM

So, if you have lots of variables in the VAR statement like I do (110), a mixture of continuous, binary, ordinal and nominal, the ROUND= option would need 110 values? I only want to round the latter three types to 1 unit, leaving the continuous ones alone.

SAS_Rob · Posted 09-09-2025 08:44 AM

Yes, when you have a mix of variables that you want to round and not round, you will need to enumerate them according to the order on the VAR statement.

I should mention as well that in the overwhelming majority of cases with a nominal variable it is considered more proper to use the CLASS statement than to try and use the ROUND= option.

David_M · Posted 09-09-2025 04:39 PM

Thanks @SAS_Rob, but all my nominal, ordinal and binary variables are in the CLASS statement. So will their imputed values be rounded per the ROUND= option, or rounded up/down to the nearest integer? The imputed continuous values should be rounded per the ROUND= option, I expect.

SAS_Rob · Posted 09-10-2025 09:29 AM

If the variables are already on the CLASS statement then there is no need to use the ROUND= option for those variables as well. It will ignore any attempt to round them. The ROUND= option only has an effect for variables that are on the VAR statement but not on the CLASS statement.
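
As a rough sketch of how this plays out, the step below uses a small, illustrative subset of the variables from the original question (the real VAR statement has about 110). The round-off units matter only for the two continuous variables; the "." entries that line up with the CLASS variables are placeholders. The data set name F0_Imputed_Check is hypothetical, and the FCS options are trimmed for brevity.

```sas
/* ROUND= values follow the order of the VAR statement:                   */
/* 0.1 for each continuous variable, "." for each CLASS variable.         */
proc mi data=F0_Data nimpute=50 out=F0_Imputed_Check seed=54321
        round=0.1 0.1 . .;
    class Q4_diseaseF0 Q5_MSD_SymptomsF0;
    var Q10_raw_sleep_scoreF0 Q19_CESDF0 Q4_diseaseF0 Q5_MSD_SymptomsF0;
    fcs reg(Q10_raw_sleep_scoreF0 Q19_CESDF0)
        logistic(Q4_diseaseF0 Q5_MSD_SymptomsF0 / likelihood=augment);
run;

/* Quick check: the CLASS variables should show only their observed levels. */
proc freq data=F0_Imputed_Check;
    tables Q4_diseaseF0 Q5_MSD_SymptomsF0 / missing;
run;
```

If the PROC FREQ tables list only the original category codes, the imputed values for the CLASS variables need no further rounding.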

David_M · Posted 09-10-2025 09:58 AM

Excellent. Thank you!
