Multiple Imputation followed by Principal Components Analysis

CDB1 · Posted 01-21-2019 07:17 PM

Hello all,

I am conducting principal components analysis (PCA) on data collected using a fairly lengthy questionnaire. I was able to run regression models with PCA scores as predictors using complete case analysis without issue. However, the initial questionnaire data had a number of missing responses, and I would like to incorporate multiple imputation into my analysis. I believe it would be best to impute the raw questionnaire values, but if I do the imputation first, I will have a number of imputed datasets, and I am unclear on how to conduct PCA with more than one dataset.

Has anyone had experience conducting PCA following multiple imputation? Can you provide an example or direction?

PaigeMiller · Posted 01-21-2019 07:57 PM

How are you planning to do the multiple imputation? There are many possibilities and it's really impossible to advise you unless you give us more details.

Are you planning to do multiple imputation using PROC MI, which would result in multiple data sets? Something else? You would combine the data sets for the next analysis, but of course, there are a number of issues that need to be addressed. The first one that comes to mind is: does the imputation change the correlations between the x-variables, which would then change the principal components model?

Also, the spelling is "principal components".

--
Paige Miller

PaigeMiller · Posted 01-22-2019 07:00 AM

So I'm going to take a detour here, and describe how I would do this analysis, and it does not use Principal Components.

If you are trying to predict the Y variable response, Principal Components doesn't fit, in my opinion. Why? Because Principal Components finds dimensions (or new variables, whatever you want to call them) that are found regardless of the Y-variable. In other words, it does not try to find dimensions that are predictive, and so there's no guarantee you get the most predictive variables. The alternative is Partial Least Squares (PLS), which DOES find dimensions that are predictive of Y. Seems to me that's better than what Principal Components gives you. Now the two methods MAY find the same dimensions or variables, but (more than likely) they MAY NOT find the same dimensions or variables, in which case PLS produces superior predictions.
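To make the contrast concrete, here is a minimal sketch in plain Python (not SAS, and not what PROC PLS does internally). It assumes a small synthetic data set invented for illustration: `x1` has a large variance but is unrelated to `y`, while `x2` has a small variance and determines `y`. The helper names are hypothetical. The variables are deliberately left on their raw scales so the variance difference stays visible; with standardized variables the PCA result would differ. The first PCA direction is computed from the x-variables alone, and the first PLS weight vector is taken proportional to X'y, which is the standard first step for a single response.

```python
import math

def center(cols):
    """Subtract each column's mean (columns are given as lists)."""
    return [[v - sum(c) / len(c) for v in c] for c in cols]

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def first_pca_direction(cols, iters=200):
    """Leading eigenvector of X'X by power iteration; y is never used."""
    p = len(cols)
    xtx = [[sum(a * b for a, b in zip(cols[i], cols[j])) for j in range(p)]
           for i in range(p)]
    w = unit([1.0] * p)
    for _ in range(iters):
        w = unit([sum(xtx[i][j] * w[j] for j in range(p)) for i in range(p)])
    return w

def first_pls_weight(cols, y):
    """First PLS weight vector for one response: proportional to X'y."""
    return unit([sum(a * b for a, b in zip(c, y)) for c in cols])

# Synthetic data (an assumption for illustration, not real questionnaire data).
x1 = [-3, -3, -3, 0, 0, 0, 3, 3, 3]   # high variance, unrelated to y
x2 = [-1, 0, 1, -1, 0, 1, -1, 0, 1]   # low variance, drives y
y = [float(v) for v in x2]

X = center([x1, x2])
yc = center([y])[0]

pca_w = first_pca_direction(X)
pls_w = first_pls_weight(X, yc)

print("first PCA direction:", [round(v, 3) for v in pca_w])
print("first PLS weight:   ", [round(v, 3) for v in pls_w])
```

In this toy case the first PCA direction loads on `x1` (the high-variance variable) and the first PLS weight loads on `x2` (the predictive one). Real data will rarely separate this cleanly.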

The benefit of using PROC PLS in SAS is not only the above, but it has built-in missing value handling. If you specify MISSING=EM, then PROC PLS will do the imputation of the missing values and find dimensions or variables in a way that is appropriate to the PLS algorithm.

So it seems to me that Partial Least Squares, and not Principal Components, gets you to the end result you want (predicting the Y variable) in a superior fashion, with most likely better fit, and built-in and appropriate handling of missing values.

--
Paige Miller

CDB1 · Posted 01-21-2019 09:00 PM

Thank you for your quick response and correction! I will be utilizing PROC MI, which will result in 10 datasets. I am thinking I could combine the datasets and check the correlations.
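One way to check the correlations is to compute them separately within each imputed dataset and compare them with the complete-case value. The sketch below is plain Python rather than SAS and rests on loudly stated assumptions: the two questionnaire items, their values, and the three sets of filled-in values are all made up for illustration (a real run would have 10 imputations), and the function names are hypothetical. The rows are stacked with an imputation number in the first position, mimicking the `_Imputation_` index variable that PROC MI writes to its output data set.

```python
import math
from collections import defaultdict

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def correlation_by_imputation(rows):
    """rows: (imputation, a, b) tuples; returns {imputation: r}."""
    groups = defaultdict(lambda: ([], []))
    for imp, a, b in rows:
        groups[imp][0].append(a)
        groups[imp][1].append(b)
    return {imp: pearson(a, b) for imp, (a, b) in sorted(groups.items())}

# Hypothetical data: item_b is missing for the last two respondents.
item_a = [1, 2, 3, 4, 5, 6]
item_b_observed = [2, 1, 4, 3]
# Hypothetical imputed values for the two missing item_b responses.
hypothetical_fills = {1: [6, 5], 2: [4, 6], 3: [5, 7]}

# Stack the completed datasets, one block of rows per imputation.
stacked = [(imp, a, b)
           for imp, fill in hypothetical_fills.items()
           for a, b in zip(item_a, item_b_observed + fill)]

complete_case_r = pearson(item_a[:4], item_b_observed)
per_imputation_r = correlation_by_imputation(stacked)
spread = max(per_imputation_r.values()) - min(per_imputation_r.values())

print("complete-case r:", round(complete_case_r, 3))
for imp, r in per_imputation_r.items():
    print("imputation", imp, "r:", round(r, 3))
print("range across imputations:", round(spread, 3))
```

A large range across imputations, or a large shift from the complete-case value, would suggest that the principal components could differ noticeably from one imputed dataset to the next.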

CDB1 · Posted 01-21-2019 09:06 PM

I was curious about the other issues that you mentioned, if you wouldn’t mind discussing them.

Thank you again!

CDB1 · Posted 01-22-2019 07:24 AM

Thank you, this is extremely informative! I decided on PCA because I wanted to see if there were any distinguishable “patterns” in the data. It is a food frequency questionnaire, which I realize I did not explain earlier.

PaigeMiller · Posted 01-22-2019 08:13 AM

@CDB1 wrote:
Thank you, this is extremely informative! I decided on PCA because I wanted to see if there were any distinguishable “patterns” in the data. It is a food frequency questionnaire, which I realize I did not explain earlier.

Partial Least Squares does this as well. The same types of graphics in PCA to search for patterns (plots of scores, for example) can be performed with PLS.

--
Paige Miller

CDB1 · Posted 01-22-2019 10:01 AM

Okay, great to know. I appreciate your help and insight.
