CDB1
Fluorite | Level 6

Hello all,

 

I am conducting principal components analysis (PCA) on data collected using a pretty lengthy questionnaire. I was able to run regression models with PCA scores as predictors using complete case analysis without an issue. However, the initial questionnaire data had a number of missing responses, and I would like to incorporate multiple imputation into my analysis. I believe it would be best to impute the raw questionnaire values, but if I do the imputation first, I will have a number of imputed datasets, and I am unclear about how to conduct PCA with more than one dataset.

 

Has anyone had experience conducting PCA following multiple imputation? Can you provide an example or direction? 

7 REPLIES 7
PaigeMiller
Diamond | Level 26

How are you planning to do the multiple imputation? There are many possibilities and it's really impossible to advise you unless you give us more details.

 

Are you planning to do multiple imputation using PROC MI, which would result in multiple data sets? Something else? You would combine the data sets for the next analysis, but of course, there are a number of issues that need to be addressed. The first one that comes to mind is: does the imputation change the correlations between the x-variables, which would then change the principal components model?
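For readers who want to see what this workflow might look like, here is a minimal sketch, not a definitive recipe. It assumes a hypothetical data set named ffq with questionnaire items item1-item20 and an outcome y; none of those names come from this thread, and NIMPUTE=10 and N=3 are illustrative values only. The sketch imputes with PROC MI, checks the correlations within each imputed data set, runs the PCA and the regression separately by _Imputation_, and pools the regression estimates with PROC MIANALYZE.

```sas
/* Hypothetical input: data set FFQ with items item1-item20 and outcome y. */
/* Step 1: create 10 imputed copies, stacked in one data set and indexed   */
/* by the automatic variable _Imputation_. Including y in the VAR list is  */
/* an assumption of this sketch.                                           */
proc mi data=ffq nimpute=10 seed=12345 out=ffq_mi;
   var item1-item20 y;
run;

/* Step 2: inspect the item correlations within each imputed data set. */
proc corr data=ffq_mi nosimple;
   by _Imputation_;
   var item1-item20;
run;

/* Step 3: run PCA separately per imputation, keeping 3 components. */
proc princomp data=ffq_mi out=pc_scores n=3 noprint;
   by _Imputation_;
   var item1-item20;
run;

/* Step 4: regress y on the component scores within each imputation and */
/* save the estimates and their covariance matrices.                     */
proc reg data=pc_scores outest=reg_est covout noprint;
   by _Imputation_;
   model y = Prin1-Prin3;
run;
quit;

/* Step 5: pool the per-imputation regression estimates. */
proc mianalyze data=reg_est;
   modeleffects Intercept Prin1 Prin2 Prin3;
run;
```

One caveat with this approach: the sign and even the order of the components can differ from one imputed data set to the next, so Prin1 in one imputation is not guaranteed to mean the same thing as Prin1 in another. It is worth comparing the loadings across imputations before trusting the pooled estimates.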

 

Also, the spelling is "principal components".

--
Paige Miller
PaigeMiller
Diamond | Level 26

So I'm going to take a detour here, and describe how I would do this analysis, and it does not use Principal Components.

 

If you are trying to predict the Y variable response, Principal Components doesn't fit, in my opinion. Why? Because Principal Components finds dimensions (or new variables, whatever you want to call them) without regard to the Y-variable. In other words, it does not try to find dimensions that are predictive, so there's no guarantee you get the most predictive variables. The alternative is Partial Least Squares (PLS), which DOES find dimensions that are predictive of Y. Seems to me that's better than what Principal Components gives you. Now the two methods MAY find the same dimensions or variables, but more than likely they will not, in which case PLS should produce better predictions.

 

The benefit of using PROC PLS in SAS is not only the above, but also that it has built-in missing value handling. If you specify MISSING=EM, then PROC PLS will do the imputation of the missing values and find dimensions or variables in a way that is appropriate to the PLS algorithm.
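As a rough illustration of the MISSING=EM option, a call might look like the sketch below. The data set name ffq, the variables item1-item20 and y, and the choice of NFAC=3 are all hypothetical and would need to be replaced with your own.

```sas
/* Hypothetical input: data set FFQ with predictors item1-item20 and outcome y. */
/* MISSING=EM asks PROC PLS to fill in missing values with an EM algorithm      */
/* before extracting factors; NFAC=3 is an arbitrary illustrative choice.       */
proc pls data=ffq method=pls missing=em nfac=3;
   model y = item1-item20;
run;
```

Note that this fills in each missing value once, so it is a single imputation done inside the procedure and not multiple imputation in the PROC MI sense.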

 

So it seems to me that Partial Least Squares, and not Principal Components, gets you to the end result you want (predicting the Y variable) in a superior fashion, with most likely better fit, and built-in and appropriate handling of missing values.

--
Paige Miller
CDB1
Fluorite | Level 6
Thank you for your quick response and correction! I will be utilizing PROC MI, which will result in 10 datasets. I am thinking I could combine the datasets and check the correlations.
CDB1
Fluorite | Level 6
I was curious about the other issues that you mentioned, if you wouldn’t mind discussing them.

Thank you again!
CDB1
Fluorite | Level 6

Thank you, this is extremely informative! I decided on PCA because I wanted to see if there were any distinguishable “patterns” in the data - it is a food frequency questionnaire, which I realize I did not explain earlier.

PaigeMiller
Diamond | Level 26

@CDB1 wrote:
Thank you, this is extremely informative! I decided on PCA because I wanted to see if there were any distinguishable “patterns” in the data - it is a food frequency questionnaire, which I realize I did not explain earlier.

Partial Least Squares does this as well. The same types of graphics used in PCA to search for patterns (plots of scores, for example) can be produced with PLS.
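A score plot of this kind could be sketched as follows. Again, the data set ffq, the variables item1-item20 and y, the output data set pls_out, and the prefix xscr are hypothetical names chosen for illustration.

```sas
/* Hypothetical input: data set FFQ with predictors item1-item20 and outcome y. */
/* XSCORE=xscr writes the extracted X-scores to PLS_OUT as xscr1, xscr2.        */
proc pls data=ffq method=pls missing=em nfac=2;
   model y = item1-item20;
   output out=pls_out xscore=xscr;
run;

/* Plot the first two X-scores against each other to look for groupings. */
proc sgplot data=pls_out;
   scatter x=xscr1 y=xscr2;
   xaxis label="PLS X-score 1";
   yaxis label="PLS X-score 2";
run;
```

Clusters or outliers in this plot can be read much like those in a PCA score plot, keeping in mind that the PLS dimensions are chosen with Y in view.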

--
Paige Miller
CDB1
Fluorite | Level 6
Okay, great to know. I appreciate your help and insight.
