Hello all,
I am conducting principal components analysis (PCA) on data collected using a fairly lengthy questionnaire. I was able to run regression models with PCA scores as predictors using complete case analysis without an issue. However, the initial questionnaire data had a number of missing responses, and I would like to incorporate multiple imputation into my analysis. I believe it would be best to impute the raw questionnaire values, but if I do imputation first, I will have a number of imputed datasets and I am unclear on how to conduct PCA with more than one dataset.
Has anyone had experience conducting PCA following multiple imputation? Can you provide an example or direction?
How are you planning to do the multiple imputation? There are many possibilities and it's really impossible to advise you unless you give us more details.
Are you planning to do multiple imputation using PROC MI, which would result in multiple data sets? Something else? You would combine the data sets for the next analysis, but of course, there are a number of issues that need to be addressed. The first one that comes to mind is: does the imputation change the correlations between the x-variables, which would then change the principal components model?
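For what it's worth, here is a rough sketch of what the PROC MI route might look like. It is only an outline under assumptions, not a recommendation: the data set name work.ffq, the item names q1-q50, the response y, the 20 imputations and the 3 retained components are all hypothetical placeholders.

```sas
/* Hypothetical names: work.ffq, questionnaire items q1-q50, response y. */

/* Step 1: impute the raw questionnaire items, producing 20 completed
   copies stacked in one data set and indexed by _Imputation_. */
proc mi data=work.ffq nimpute=20 seed=12345 out=work.ffq_mi;
   var y q1-q50;
run;

/* Step 2: run the PCA separately within each completed copy and
   keep the first three component scores (Prin1-Prin3). */
proc princomp data=work.ffq_mi out=work.pc_scores n=3 noprint;
   by _Imputation_;
   var q1-q50;
run;

/* Step 3: fit the regression on the component scores within each copy,
   saving the estimates and their covariance matrices. */
proc reg data=work.pc_scores outest=work.est covout noprint;
   by _Imputation_;
   model y = Prin1 Prin2 Prin3;
run;

/* Step 4: pool the per-imputation estimates. */
proc mianalyze data=work.est;
   modeleffects Intercept Prin1 Prin2 Prin3;
run;
```

Note that the pooling in Step 4 only makes sense if Prin1, Prin2 and Prin3 mean the same thing in every imputed data set. Because each PCA is computed from a slightly different correlation matrix, the components can change sign or order from one imputation to the next, so the loadings should be compared across imputations before the pooled results are trusted.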
Also, the spelling is "principal components".
So I'm going to take a detour here, and describe how I would do this analysis, and it does not use Principal Components.
If you are trying to predict the Y variable response, Principal Components doesn't fit, in my opinion. Why? Because Principal Components finds dimensions (or new variables, whatever you want to call them) that are found regardless of the Y-variable. In other words, it does not try to find dimensions that are predictive, and so there's no guarantee you get the most predictive variables. The alternative is Partial Least Squares (PLS), which DOES find dimensions that are predictive of Y. Seems to me that's better than what Principal Components gives you. Now the two methods MAY find the same dimensions or variables, but (more than likely) they MAY NOT find the same dimensions or variables, in which case PLS produces superior predictions.
Beyond the above, PROC PLS in SAS also has built-in missing value handling. If you specify MISSING=EM in the PROC PLS statement, then PROC PLS will do the imputation of the missing values and find dimensions or variables in a way that is appropriate to the PLS algorithm.
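As a minimal sketch of what that call might look like (the data set work.ffq, the predictors q1-q50, the response y and the choice of 3 factors are hypothetical placeholders, not anything from your data):

```sas
/* Hypothetical names: work.ffq, predictors q1-q50, response y. */
/* MISSING=EM asks PROC PLS to fill in missing values with an EM-type
   algorithm while fitting, instead of dropping incomplete observations. */
proc pls data=work.ffq method=pls missing=em nfac=3;
   model y = q1-q50;
run;
```

The number of factors here is fixed at 3 purely for illustration; in practice you would want to choose it with some care, for example by cross validation.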
So it seems to me that Partial Least Squares, and not Principal Components, gets you to the end result you want (predicting the Y variable) in a superior fashion, with most likely better fit, and built-in and appropriate handling of missing values.
Thank you, this is extremely informative! I decided on PCA because I wanted to see if there were any distinguishable “patterns” in the data - it is a food frequency questionnaire, which I realize I did not explain earlier.
@CDB1 wrote:
Thank you, this is extremely informative! I decided on PCA because I wanted to see if there were any distinguishable “patterns” in the data - it is a food frequency questionnaire, which I realize I did not explain earlier.
Partial Least Squares does this as well. The same types of graphics in PCA to search for patterns (plots of scores, for example) can be performed with PLS.
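A score plot from PLS could be produced along these lines. This is only a sketch: work.ffq, q1-q50, y, the output data set work.pls_out and the score prefix xscr are hypothetical names chosen for illustration.

```sas
/* Hypothetical names: work.ffq, predictors q1-q50, response y. */
/* Fit the PLS model and save the X-scores (xscr1, xscr2, ...) for each
   observation to an output data set. */
proc pls data=work.ffq method=pls missing=em nfac=3;
   model y = q1-q50;
   output out=work.pls_out xscore=xscr;
run;

/* Plot the first two X-scores against each other to look for clusters
   or other patterns among respondents. */
proc sgplot data=work.pls_out;
   scatter x=xscr1 y=xscr2;
run;
```

This plays the same role as plotting Prin1 against Prin2 after PROC PRINCOMP.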