I'm conducting a common (principal axis) factor analysis, but I have missing data, so I decided to use multiple imputation. A response by Rob suggested the following:
1. Run Multiple Imputation
2. Develop factor scores for each of the M imputations
3. Run M logit models, using the respective factor scores for each data set as predictors.
4. Combine the estimates from the logit models using Proc MIANALYZE
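For reference, the four steps above might be sketched in SAS roughly as follows. This is only an illustration under assumptions: the data set name have, the items x1-x10, the binary outcome y, the two-factor solution, and the seed are all hypothetical placeholders, not anything taken from Rob's original response.

```sas
/* Step 1: multiple imputation (5 imputations; names and seed are hypothetical) */
proc mi data=have out=mi_out nimpute=5 seed=20240101;
   var x1-x10;
run;

/* Step 2: principal factor analysis within each imputation.
   OUT= adds the factor scores Factor1 and Factor2 to each record. */
proc factor data=mi_out method=principal priors=smc nfactors=2
            out=scores noprint;
   by _Imputation_;
   var x1-x10;
run;

/* Step 3: one logit model per imputed data set */
proc logistic data=scores;
   by _Imputation_;
   model y(event='1') = Factor1 Factor2;
   ods output ParameterEstimates=parms;
run;

/* Step 4: combine the M sets of estimates */
proc mianalyze parms=parms;
   modeleffects Intercept Factor1 Factor2;
run;
```

One thing worth checking in a setup like this is that the factors line up across imputations, since the sign or order of factors can differ from one imputed data set to the next.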
However, since I intend to use the factor scores as an index of industrial differentiation to assign industries into sectors, I'm unsure if I must run an mlogit model before combining the factor scores.
Thank you for your advice,
Kim
Hi Kim,
Don't worry about taking some time to reply - we all have other things that demand our attention.
I have thought about this some more, and spent some time reading the PLS examples and documentation. One thing I found out is that PLS takes the standard SAS approach to regression for missing response (= left hand side) variables. If you include them in the analysis dataset, the output will have the imputed/predicted values included in the results from an OUTPUT statement. This would reduce the effort to a single analysis. In that analysis, I would definitely use the missing=em(maxiter=n) option in the PROC PLS statement, rather than the default NONE or the alternative AVG.
And if you do that, then you have an imputed dataset on which you can do a confirmatory factor analysis. At that point, I would follow your professor's advice and use only one imputed dataset. If you use the random cross-validation method (making sure to specify a fixed seed), I believe you will have imputed the missing values as well as possible, and you avoid having to model average the variance-covariance matrix across several randomly selected datasets.
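A minimal sketch of what such a single PROC PLS run might look like, assuming hypothetical names throughout: y1 and y2 stand for the variables that have missing values, x1-x8 for the fully observed ones, and the maxiter and seed values are arbitrary.

```sas
/* EM handling of missing values plus random cross validation with a fixed seed.
   All data set and variable names here are placeholders. */
proc pls data=have missing=em(maxiter=100) cv=random(seed=12345);
   model y1 y2 = x1-x8;
   /* predicted values for the responses are written to out_dsn */
   output out=out_dsn predicted=p_y1 p_y2;
run;
```

Fixing the seed makes the random cross validation, and therefore the chosen number of factors, reproducible from run to run.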
Good luck. If you get stuck using PROC PLS, post your whole log and look for a response from @PaigeMiller who has a lot of experience with PLS.
SteveDenham
Steps 3 and 4 were unique to what the other person who posted the question asked and would not apply in your case.
The problem you are facing specifically is that you want a single set of factor scores, and there is really no way to get that from multiply imputed data, since factor scores do not have a standard error associated with them. Unfortunately I have not seen this dealt with anywhere in the literature and don't have a solid suggestion.
Yes, neither have I. I've searched for a solution/lit for a while. If using imputed data to get factor scores is impossible, is there another analysis besides FA where I would get the same effect? Or should I find another way to deal with my missing data that isn't MI? I do not want to delete missing cases.
Thank you Rob for responding.
This is purely spitballing. To start off - what is the partition like for complete cases vs cases with at least one missing variable? If the missing case subsample is not a large part of the data, you might try using PROC PLS in a two stage manner. The first step would be to use the complete cases. In this step, I would set the left hand side to the variables that are missing in the missing case subsample but present in the complete case subsample and get out the factors etc. In the second step, I would add in the cases with at least one missing variable and get the predicted values for the missing data. The Getting Started section of the documentation for PROC PLS shows an example. Once you have the missing values estimated by PLS, you can complete the data set and then move on to the factor analysis. This is a "single imputation". There might be a way to bootstrap this to get several imputations (N), and from those several imputed factor matrices. Combining those doesn't look like an easy task, but averaging the variances and covariances over the N matrices might be possible.
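To answer the partition question, something like the following could be used. It is a sketch only; have and x1-x10 are hypothetical names for the data set and the numeric analysis variables.

```sas
/* Flag records with at least one missing analysis variable */
data flag;
   set have;
   any_missing = (nmiss(of x1-x10) > 0);
run;

/* Counts and percentages of complete vs incomplete records */
proc freq data=flag;
   tables any_missing;
run;
```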
SteveDenham
@SteveDenham I think you may have a good idea, and I'm interested to see how this works. However, there is one step that I am not able to grasp from your explanation.
Specifically, in factor analysis there is no such thing as predictor variables or response variables; while in PLS there are y-variables (response) and x-variables (predictors). So, I'm missing the connection between factor analysis and PLS.
@PaigeMiller , I think we are stuck back at my first idea. There I wanted to make any of the variables that had at least one missing value the response variables, and all of the rest predictor variables, then run PLS on only the complete records to get the loadings needed to fill in the missing values. This step could be looped with different subsets of complete records to get several matrixes of complete records. And then these would be combined to get a single matrix to use as input to PROC FACTOR.
Now it turns out that the MISSING= option gives some flexibility there, but the point is to impute the missing values based on all of the predictors and the relationships between the various response variables. By using OUTPUT out=out_dsn predicted=pred, SAS will produce a data set with no missing variables. One just has to combine the predicted values in the data set with the already existing values in the original data set, so that there are complete values for every record. That dataset could then be used for a confirmatory factor analysis.
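The combining step could look something like this. It is a sketch that assumes out_dsn came from a PROC PLS OUTPUT statement in which the responses y1 and y2 were given the predicted names p_y1 and p_y2; all of these names, and the two-factor choice in PROC FACTOR, are hypothetical.

```sas
/* Keep the observed value where there is one, otherwise use the PLS prediction */
data complete;
   set out_dsn;
   y1 = coalesce(y1, p_y1);
   y2 = coalesce(y2, p_y2);
   drop p_y1 p_y2;
run;

/* Principal factor analysis on the completed data, with factor scores in fscores */
proc factor data=complete method=principal priors=smc nfactors=2 out=fscores;
   var y1 y2 x1-x8;
run;
```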
Of course, this all depends on there being enough records with no missing variables to get a good initial fit.
SteveDenham
Thanks, @SteveDenham I think I understand now.
There I wanted to make any of the variables that had at least one missing value the response variables, and all of the rest predictor variables, then run PLS on only the complete records to get the loadings needed to fill in the missing values. This step could be looped with different subsets of complete records to get several matrixes of complete records. And then these would be combined to get a single matrix to use as input to PROC FACTOR.
I actually did this a long time ago in MATLAB and in a PLS context, where there was a specific response variable which was never missing, only some of the predictor variables were missing. There's no reason why it wouldn't work for factor analysis, but it will take a fair bit of programming.
@PaigeMiller & @SteveDenham, instead of dividing the variables into 2 groups like this, why not specify all of the variables both as predictors and responses? When X=Y, PLS on complete data reduces to PCA in the sense that you get the same scores and loadings from each matrix and they match PC scores and loadings. So I have been using the missing=EM syntax, with incomplete data tables and a high number of iterations, to obtain EM-PCA single imputations. You don't even need to make copies of your variables as PROC PLS does not complain when the same variable names appear on the left and right sides of the model statement.
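A sketch of that X=Y setup, with hypothetical names (have, x1-x4) and arbitrary choices for nfac and maxiter. The number of factors is assumed to be smaller than the number of variables here, since a full-rank fit would simply reproduce whatever values were filled in.

```sas
/* Same variables on both sides of the model statement, EM handling of missing values */
proc pls data=have method=pls nfac=2 missing=em(maxiter=500);
   model x1 x2 x3 x4 = x1 x2 x3 x4;
   output out=em_pca predicted=p_x1 p_x2 p_x3 p_x4;
run;

/* Single imputation: fill each missing value with its predicted value */
data imputed;
   set em_pca;
   x1 = coalesce(x1, p_x1);
   x2 = coalesce(x2, p_x2);
   x3 = coalesce(x3, p_x3);
   x4 = coalesce(x4, p_x4);
   drop p_x1 p_x2 p_x3 p_x4;
run;
```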
I don't do much PCA these days, but I like your suggestion!