☑ This topic is solved.
kim_trevino1
Fluorite | Level 6

I'm conducting a common (principal axis) factor analysis, but I have missing data, so I decided to use multiple imputation. A response by Rob to another poster's question suggested the following steps:

1. Run Multiple Imputation
2. Develop factor scores for each of the M imputations
3. Run M logit models, using the respective factor scores for each data set as predictors.
4. Combine the estimates from the logit models using Proc MIANALYZE
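
For reference, the four steps above might be sketched in SAS roughly as follows. This is only an illustration on simulated data: the data set and variable names (have, x1-x6, y), the number of imputations, and the choice of a single factor are all assumptions, not details from the original post.

```sas
/* Simulated stand-in data (assumption): six indicators, a binary outcome y, */
/* and scattered missing values in x5 and x6.                                */
data have;
   call streaminit(2024);
   array x{6} x1-x6;
   do id = 1 to 176;
      f = rand('normal');
      do j = 1 to 6;
         x{j} = f + rand('normal');
      end;
      y = (f + rand('normal') > 0);
      if rand('uniform') < 0.1 then x5 = .;
      if rand('uniform') < 0.1 then x6 = .;
      output;
   end;
   drop f j;
run;

/* Step 1: multiple imputation; OUT= stacks the M data sets by _Imputation_ */
proc mi data=have nimpute=5 seed=12345 out=mi_out;
   var x1-x6;
run;

/* Step 2: factor scores for each imputation (principal factor, SMC priors) */
proc factor data=mi_out method=principal priors=smc nfactors=1
            score out=fscores noprint;
   by _Imputation_;
   var x1-x6;
run;

/* Step 3: one logit model per imputation, saving the estimates */
proc logistic data=fscores;
   by _Imputation_;
   model y(event='1') = Factor1;
   ods output ParameterEstimates=parms;
run;

/* Step 4: combine the M sets of estimates */
proc mianalyze parms=parms;
   modeleffects Intercept Factor1;
run;
```

PROC MIANALYZE pools estimates that come with standard errors, which is why the pooling happens at the model stage rather than on the factor scores themselves. The sketch also does not address the possibility that factor scores differ in sign or orientation from one imputation to the next.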

 

However, since I intend to use the factor scores as an index of industrial differentiation to assign industries into sectors, I'm unsure whether I must run an mlogit model before combining the factor scores.

Thank you for your advice,

Kim

1 ACCEPTED SOLUTION

Accepted Solutions
SteveDenham
Jade | Level 19


11 REPLIES 11
SAS_Rob
SAS Employee

Steps 3 and 4 were unique to what the other person who posted the question asked and would not apply in your case. 

The problem that you are facing specifically is that you are looking for a single set of factor scores, and there is really no way to get that from multiply imputed data, since factor scores do not have a standard error associated with them. Unfortunately, I have not seen this dealt with anywhere in the literature and don't have a solid suggestion.

kim_trevino1
Fluorite | Level 6

Yes, neither have I. I've searched for a solution or literature on this for a while. If using imputed data to get factor scores is impossible, is there another analysis besides FA that would give me the same effect? Or should I find another way to deal with my missing data that isn't MI? I do not want to delete missing cases.

 

Thank you Rob for responding. 

SteveDenham
Jade | Level 19

This is purely spitballing. To start off: what is the partition like for complete cases versus cases with at least one missing variable? If the missing-case subsample is not a large part of the data, you might try using PROC PLS in a two-stage manner. The first step would use the complete cases. In this step, I would set the left-hand side to the variables that are missing in the missing-case subsample but present in the complete-case subsample, and get out the factors etc. In the second step, I would add in the cases with at least one missing variable and get the predicted values for the missing data. The Getting Started section of the documentation for PROC PLS shows an example. Once you have the missing values estimated by PLS, you can complete the data set and then move on to the factor analysis. This is a "single imputation". There might be a way to bootstrap this to get several imputations (N), and from them several imputed factor matrices. Combining those doesn't look like an easy task, but an average variance and covariance over the N matrices might be possible.
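
As a rough sketch of the idea above, PROC PLS can be given the incomplete variables as responses and the complete ones as predictors. Under the default handling of missing values, rows with a missing response should be left out of the fit but still receive predicted values in the OUTPUT data set, which is the pattern the Getting Started prediction example relies on. The data set, the variable names, and NFAC=2 are assumptions for illustration only.

```sas
/* Simulated stand-in data (assumption): x1-x4 complete, x5-x6 partly missing */
data have;
   call streaminit(2024);
   array x{6} x1-x6;
   do id = 1 to 176;
      f = rand('normal');
      do j = 1 to 6;
         x{j} = f + rand('normal');
      end;
      if rand('uniform') < 0.1 then x5 = .;
      if rand('uniform') < 0.1 then x6 = .;
      output;
   end;
   drop f j;
run;

/* Incomplete variables on the left, complete variables on the right.   */
/* p_x5 and p_x6 hold the predicted values for every row whose          */
/* predictors are present, including rows where x5 or x6 is missing.    */
proc pls data=have method=pls nfac=2;
   model x5 x6 = x1-x4;
   output out=pls_out predicted=p_x5 p_x6;
run;
```

Because the fit and the prediction happen in one call, the two stages described above do not need separate data sets in this version.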

 

SteveDenham

kim_trevino1
Fluorite | Level 6
Hello Steve,
I'm sorry for my delayed response. At the moment I live 6-9 hours ahead of the USA.

I have 176 cases total and 80 percent of those cases are complete. I read the Proc PLS getting started example Spectrometric Calibration.
I want to make sure I'm understanding. I have watched tutorials on PLS but I haven't used the analysis before.

So, I will create two data subsets, one with the missing data set to the left (SAS Sample 24663: Shift non-missing values left in each observation of a data set).
Using my complete subsample, I would go through the process of extracting factors as illustrated in The Getting Started.

Then I would take the predicted PLS model and use it on subsample with the missing data to get predicted values. This is fine because the procedure is not calculating the PLS.
Then I will look into bootstrapping and combining those. (I have not looked into this yet.)
Am I correct in my understanding of the process?

I also saw that PROC PLS has a MISSING= option, MISSING=EM(MAXITER=n) or AVG. Since you didn't mention this step, should I understand that I shouldn't use this option?

I also spoke with a professor that suggested that I do a single imputation, then run FA. Will I have a standard error with a single imputation?

Thank you,

Kim
SteveDenham
Jade | Level 19

Hi Kim,

Don't worry about taking some time to reply - we all have other things that demand our attention.

 

I have thought about this some more, and spent some time reading the PLS examples and documentation.  One thing I found out is that PLS takes the standard SAS approach to regression for missing response (= left hand side) variables.  If you include them in the analysis dataset, the output will have the imputed/predicted values included in the results from an OUTPUT statement.  This would reduce the effort to a single analysis. In that analysis, I would definitely use the missing=em(maxiter=n) option in the PROC PLS statement, as opposed to NONE (the default) or AVG.

 

And if you do that, then you have an imputed dataset on which you can do a confirmatory factor analysis. At that point, I would follow your professor's advice and use only one imputed dataset. If you use the random cross-validation method (making sure to specify a fixed seed), I believe you will have imputed the missing values as well as possible, and you avoid having to model average the variance-covariance matrix across several randomly selected datasets.
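
A minimal sketch of the PROC PLS call described here, run on simulated data. The data set and variable names and the MAXITER= and SEED= values are placeholders, and whether EM imputation combined with random cross-validation suits a particular data set would still need checking.

```sas
/* Simulated stand-in data (assumption): x1-x4 complete, x5-x6 partly missing */
data have;
   call streaminit(2024);
   array x{6} x1-x6;
   do id = 1 to 176;
      f = rand('normal');
      do j = 1 to 6;
         x{j} = f + rand('normal');
      end;
      if rand('uniform') < 0.1 then x5 = .;
      if rand('uniform') < 0.1 then x6 = .;
      output;
   end;
   drop f j;
run;

/* MISSING=EM imputes missing values iteratively instead of dropping rows. */
/* CV=RANDOM with a fixed SEED= makes the choice of factors repeatable.    */
proc pls data=have method=pls
         missing=em(maxiter=100)
         cv=random(seed=12345);
   model x5 x6 = x1-x4;
   output out=pls_out predicted=p_x5 p_x6;
run;
```

With CV= specified and no NFAC=, the number of factors is left to the cross-validation; rerunning with the same seed should give the same predicted values.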

 

Good luck.  If you get stuck using PROC PLS, post your whole log and look for a response from @PaigeMiller who has a lot of experience with PLS.

 

SteveDenham

kim_trevino1
Fluorite | Level 6
Thank you so much. This really helps!
Have a great rest of your week.
Kim
PaigeMiller
Diamond | Level 26

@SteveDenham I think you may have a good idea, and I'm interested to see how this works. However, there is one step that I am not able to grasp from your explanation.


Specifically, in factor analysis there is no such thing as predictor variables or response variables; while in PLS there are y-variables (response) and x-variables (predictors). So, I'm missing the connection between factor analysis and PLS. 

--
Paige Miller
SteveDenham
Jade | Level 19

@PaigeMiller , I think we are stuck back at my first idea.  There I wanted to make any of the variables that had at least one missing value the response variables, and all of the rest predictor variables, then run PLS on only the complete records to get the loadings needed to fill in the missing values. This step could be looped with different subsets of complete records to get several matrixes of complete records.  And then these would be combined to get a single matrix to use as input to PROC FACTOR.

 

Now it turns out that the MISSING= option gives some flexibility there, but the point is to impute the missing values based on all of the predictors and the relationships between the various response variables.  With an OUTPUT out=out_dsn predicted=pred statement, SAS will produce a data set in which the predicted values are not missing.  One just has to combine the predicted values in that data set with the already existing values in the original data set so that there are complete values for every record.  That dataset could then be used for a confirmatory factor analysis.
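
The combining step can be a short DATA step. The sketch below uses a tiny hand-typed stand-in for the PROC PLS output; the names pls_out, x3, x4, p_x3 and p_x4 are hypothetical.

```sas
/* Hypothetical PROC PLS output: observed x3, x4 and their predictions */
data pls_out;
   input id x3 x4 p_x3 p_x4;
   datalines;
1 2.1 0.7 1.9 0.9
2 .   1.4 0.3 1.2
3 0.5 .   0.6 -0.2
;
run;

/* Keep the observed value where it exists, otherwise take the prediction */
data completed;
   set pls_out;
   x3 = coalesce(x3, p_x3);
   x4 = coalesce(x4, p_x4);
   drop p_x3 p_x4;
run;

proc print data=completed noobs;
run;
```

In this toy input, observation 2 ends up with x3 = 0.3 and observation 3 with x4 = -0.2, while all observed values are left untouched.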

 

Of course, this all depends on there being enough records with no missing variables to get a good initial fit.

 

SteveDenham

PaigeMiller
Diamond | Level 26

Thanks, @SteveDenham I think I understand now.

 

"There I wanted to make any of the variables that had at least one missing value the response variables, and all of the rest predictor variables, then run PLS on only the complete records to get the loadings needed to fill in the missing values. This step could be looped with different subsets of complete records to get several matrixes of complete records.  And then these would be combined to get a single matrix to use as input to PROC FACTOR."


I actually did this a long time ago in MATLAB and in a PLS context, where there was a specific response variable which was never missing, only some of the predictor variables were missing. There's no reason why it wouldn't work for factor analysis, but it will take a fair bit of programming.

 

 

--
Paige Miller
IanWakeling
Barite | Level 11

@PaigeMiller & @SteveDenham , Instead of dividing the variables into 2 groups like this, why not specify all of the variables both as predictors and responses? When X=Y, PLS on complete data reduces to PCA in the sense that you get the same scores and loadings from each matrix and they match PC scores and loadings. So I have been using the missing=EM syntax, with incomplete data tables and a high number of iterations, to obtain EM-PCA single imputations. You don't even need to make copies of your variables as PROC PLS does not complain when the same variable names appear on the left and right sides of the model statement.
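
The X=Y idea might look like the following sketch, which assumes (as the post reports) that PROC PLS accepts the same variables on both sides of the MODEL statement. The simulated data, NFAC=2, MAXITER=500, and all data set and variable names are illustrative assumptions.

```sas
/* Simulated stand-in data (assumption): missing values scattered in x1-x4 */
data have;
   call streaminit(2024);
   array x{4} x1-x4;
   do id = 1 to 176;
      f = rand('normal');
      do j = 1 to 4;
         x{j} = f + rand('normal');
         if rand('uniform') < 0.05 then x{j} = .;
      end;
      output;
   end;
   drop f j;
run;

/* Same variables as responses and predictors, with EM handling of missings */
proc pls data=have method=pls missing=em(maxiter=500) nfac=2;
   model x1-x4 = x1-x4;
   output out=pls_out predicted=p_x1 p_x2 p_x3 p_x4;
run;

/* Single imputation: fill only the missing cells with the predictions */
data completed;
   set pls_out;
   array x{4} x1-x4;
   array p{4} p_x1-p_x4;
   do j = 1 to 4;
      if missing(x{j}) then x{j} = p{j};
   end;
   drop j p_x1-p_x4;
run;

/* Factor analysis and factor scores on the completed data */
proc factor data=completed method=principal priors=smc nfactors=1
            score out=scores;
   var x1-x4;
run;
```

Using fewer factors than variables matters here: with as many factors as variables, the predictions would simply reproduce the inputs and carry no information for the missing cells.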

PaigeMiller
Diamond | Level 26

I don't do much PCA these days, but I like your suggestion!

--
Paige Miller

