Multiply imputed categorical data: How to get frequency and percent es...

EricVanceMartin · Posted 06-30-2020 03:27 PM

I have a dataset with only a few continuous variables and a large number of ordinal categorical variables.

I have successfully run PROC MI with predictive mean matching for continuous variables and discriminant functions for ordinal categorical variables.

Is there a native way to get frequency and percent estimates for the imputed data for the categorical variables without turning them into binary dummies?

I have seen multiple questions for this online, but no answers.

Bonus question: You can use trace plots to look at convergence for continuous variables. But is there any output you can use to examine convergence for the categorical variables (again, unless they are converted to dummies, I guess)?

Thanks.

SteveDenham · Posted 07-01-2020 09:39 AM

I think there is an in-between step here where you analyze the imputed data one imputation at a time. What procedure are you using for this analysis? That may inform how MIANALYZE gets results. From there, you may have to do some post-processing to get percent estimates.

SteveDenham

EricVanceMartin · Posted 07-01-2020 10:30 AM

Thanks so much for your reply, Steve.

Yes, I think I'm doing this in the standard way.

1. Run MI

2. Run some PROC by imputation

3. Combine imputations with MIANALYZE.

The question is: What PROC goes in Step 2, and what output of this PROC is fed to MIANALYZE? I have read that PROC FREQ does not work for this, though I know PROC MEANS does. What I am doing now is converting each ordinal to binary dummies, running PROC MEANS by imputation, then combining the means with MIANALYZE, giving me the proportion of each value and a standard error.... I think 🤔
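For reference, the workaround described above could be sketched roughly as follows. This is a minimal sketch only: the data set name `mi_out` and the ordinal item `q1` (coded 1 to 3) are hypothetical names chosen for illustration, not names from my actual code.

```sas
/* Hypothetical names (not from the original post):
   mi_out = OUT= data set from PROC MI, sorted by _Imputation_
   q1     = an ordinal item coded 1, 2, 3 */

/* Step 2a: build one 0/1 indicator per level of q1 */
data mi_dummies;
   set mi_out;
   if not missing(q1) then do;
      q1_1 = (q1 = 1);
      q1_2 = (q1 = 2);
      q1_3 = (q1 = 3);
   end;
run;

/* Step 2b: the mean of each indicator is the proportion at that level;
   save the mean and its standard error for every imputation */
proc means data=mi_dummies noprint;
   by _Imputation_;
   var q1_1 q1_2 q1_3;
   output out=dummy_means mean=p1 p2 p3 stderr=se1 se2 se3;
run;

/* Step 3: combine the per-imputation estimates */
proc mianalyze data=dummy_means;
   modeleffects p1 p2 p3;
   stderr se1 se2 se3;
run;
```

The MODELEFFECTS and STDERR statements pair up positionally, so the order of the proportion variables and the standard-error variables has to match.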

I would love to know your assessment of this approach, and a more direct way--leaving the variables in their ordinal form--if it exists.

Also, any thoughts about judging the convergence when using the imputation methods for categorical data?

Thank you!

SteveDenham · Posted 07-01-2020 12:45 PM

Using PROC MEANS is a good approach; if you had only two categorical variables, another option would be PROC UNIVARIATE. You can then post-process the output to get percentages based on the counts. So then the question becomes "Could you use the CLASS statement in PROC MEANS and avoid the need to code up a lot of binary variables?" I think you should try, as it ought to reduce the amount of post-processing.

Now, as far as judging convergence goes, the best I can come up with is to look at the relative efficiencies. If any of them is less than 0.99, you should probably look at a different method, but that doesn't say anything about convergence. There might be a way to use the OUTITER=<dsn> option to look at various values across iterations. The documentation says that the data set type is COV, but I am not sure what that implies in this case. If it is a square matrix, you could look at whether the eigenvalues stabilize from iteration to iteration. What would probably be better is to use SGPLOT to plot those values so you can look for trends graphically.
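For context on the relative efficiency: SAS reports it using Rubin's formula RE = 1 / (1 + lambda/m), where lambda is the fraction of missing information and m is the number of imputations. A small sketch, using illustrative lambda and m values that are assumptions rather than anything from this thread, shows that it depends only on those two quantities, which is why it cannot speak to convergence:

```sas
/* Illustrative values only: lambda = fraction of missing information,
   m = number of imputations */
data re_demo;
   do lambda = 0.1, 0.3, 0.5;
      do m = 5, 20, 50;
         re = 1 / (1 + lambda / m);   /* Rubin's relative efficiency */
         output;
      end;
   end;
run;

proc print data=re_demo noobs;
   format re 6.4;
run;
```

Raising m pushes the relative efficiency toward 1 for any lambda, so a low value may simply call for more imputations.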

SteveDenham

EricVanceMartin · Posted 07-01-2020 01:12 PM

Thanks so much, Steve! It's good to think I chose reasonable workarounds.

@SteveDenham wrote:

So then the question becomes "Could you use the CLASS statement in PROC MEANS and avoid the need to code up a lot of binary variables?" I think you should try, as it ought to reduce the amount of post-processing.

Oops. I just wrote and am executing 5,000 lines of code. 😄

I appreciate your ideas about judging the MI results. I'll experiment with them. Thanks again.

SAS_Rob · Posted 07-06-2020 03:28 PM

It shouldn't be necessary to make the conversion and use Proc MEANS. You could just use Proc SURVEYFREQ instead, which gives standard errors for both the percentages and the frequencies. You could do something similar to the example below.

/* Getting Started Example
Generate Data */

proc format;
   value ResponseCode 1 = 'Very Unsatisfied'
                      2 = 'Unsatisfied'
                      3 = 'Neutral'
                      4 = 'Satisfied'
                      5 = 'Very Satisfied';
run;

proc format;
   value UserCode 1 = 'New Customer'
                  0 = 'Renewal Customer';
run;

proc format;
   value SchoolCode 1 = 'Middle School'
                    2 = 'High School';
run;

proc format;
   value DeptCode 0 = 'Faculty'
                  1 = 'Admin/Guidance';
run;

data SIS_Survey;
format Response ResponseCode.;
format NewUser UserCode.;
format SchoolType SchoolCode.;
format Department DeptCode.;
do _imputation_=1 to 2;
drop j;
retain seed1 111;
retain seed2 222;
retain seed3 333;

State = 'GA';

NewUser = 1;
do School=1 to 71;