In this post I will talk about the conceptual basis for exploratory factor analysis (EFA). I’ll be describing regression models related to EFA using matrix algebra. You’d benefit from a working knowledge of matrix algebra, but you won’t necessarily need to follow the mathematics to understand the concepts.
In my last post I talked about the relationship between Pearson correlation coefficients and simple linear regression slope coefficients. Specifically, I explained and demonstrated how correlation coefficients are identical to simple regression coefficients that result from first standardizing both the Y and the X variables. If you missed that discussion, it might be helpful to read it before continuing here. And, if you are interested in learning about exploratory factor analysis in detail, you can look at our course, Multivariate Statistics for Understanding Complex Data, on sas.com.
At first glance, factor analysis might seem to be unrelated to linear regression. However, a simple EFA example using PROC FACTOR will illustrate how factor analysis can be thought of as a series of linear regressions. I’ll use the baseball data set from the SASHELP library. You can find documentation for PROC FACTOR at SAS Help Center: The FACTOR Procedure.
PROC FACTOR is the primary tool in SAS for performing exploratory factor analysis. I'll start here with some basic options with a single factor.
Note: You can start exploratory factor analysis with a correlation matrix instead of the data matrix of observations by variables.
/* Capture output tables as data sets and select the displayed output */
ods output factorpattern=Loadings corr=XCorrs ResCorrUniqueDiag=U;
ods select factorpattern pathdiagram corr ResCorrUniqueDiag;

proc factor data=sashelp.baseball
   n=1         /* extract a single factor */
   method=ml   /* maximum likelihood estimation */
   corr        /* display the correlation matrix */
   residuals   /* display residual correlations, with uniquenesses on the diagonal */
   ;
   var c:;     /* all variables whose names begin with "c" (the career statistics) */
   pathdiagram decp=4;  /* display path-diagram estimates to 4 decimal places */
run;
Let me jump to the path diagram in the program output.
[Path diagram: Baseball statistics factor structure]
The equation most people recognize as linear regression is y_{i} = β_{0} + β_{1}x_{i1} + ... + β_{k}x_{ik} + ε_{i}. In the equation, y_{i} is the response variable value for individual i, x_{i1} through x_{ik} are the explanatory variable values for individual i, β_{0} is the y-intercept, β_{1} through β_{k} are the regression coefficients, and ε_{i} is the error associated with individual i. The path diagram displays six regression equations, one for each manifest variable.
"Where are the y-intercepts?" you might ask. The equations are written for standardized variables. By default, PROC FACTOR standardizes the manifest variables, and we presume the factor to be normally distributed with a mean of zero and a variance of one. If you read my previous post, you will see that when all variables in a regression are standardized, the y-intercept is zero.
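This is easy to verify numerically. The following Python sketch (simulated data, not the baseball set) standardizes both variables and fits an ordinary least squares line; the intercept vanishes and the slope equals the Pearson correlation:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate two correlated variables (hypothetical data, not SASHELP.BASEBALL)
x = rng.normal(size=10_000)
y = 2.0 + 1.5 * x + rng.normal(size=10_000)

# Standardize both variables: mean 0, standard deviation 1
zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()

# Ordinary least squares on the standardized variables
# (np.polyfit returns [slope, intercept] for degree 1)
slope, intercept = np.polyfit(zx, zy, 1)

# Pearson correlation of the original variables
r = np.corrcoef(x, y)[0, 1]

print(round(intercept, 10))   # effectively zero
print(round(slope - r, 10))   # slope equals the correlation
```

The intercept is zero to machine precision, and the standardized slope reproduces the correlation coefficient exactly, which is why the path diagram can omit intercepts entirely.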
At this point, you might be wondering what Factor1 is. A factor in factor analysis represents a latent construct. We use the term latent variable for a variable which, while not itself directly measured, can be indirectly inferred by the presence of a group of manifest variables that it causes. The classic example is the concept of “intelligence”. We don’t measure intelligence directly. We might indirectly measure it using any number of intelligence tests. Each test might be composed of items that can be objectively scored. Because we use manifest variables to indicate the presence of latent variables, factor analysts often refer to those manifest variables as indicator variables.
Because we don't actually measure a latent variable, we can set its mean and variance to anything we want. A common practice is to set the mean of a factor to zero and its variance to one. We then assume a standard normal distribution.
Another common practice is to name the factor something that represents its apparent latent construct. The naming is usually based on looking at the concept shared by the manifest variables that load highly onto it. In this case, all of the manifest variables have high loadings on Factor1. They all seem to measure career batting statistics. Therefore, I might call this factor "Career Batting".
Whenever I think of latent variables and how manifest variables indirectly indicate their presence, I think about the 1990 movie, "Ghost", in which main character, Sam, has been killed. Sam's ghost is trying to communicate with Sam's girlfriend, Molly. Let's think of Sam's ghost as a latent variable. Molly cannot see him or sense him in any way. However, Sam's ghost has learned some tricks, including sliding a penny from the floor up a door. I think of the behavior of the penny as a manifest variable. Eventually, the physical evidence, which Molly can see, convinces her that the ghost does, in fact, exist. The latent variable (the ghost) is the causal agent of the physical behavior (the manifest variables). However, we must start with the manifest variables in order to infer the presence of the latent variable.
So, I guess this is how you ruin a perfectly good classic romantic movie. You interpret it as an exercise in machine learning. I apologize to so many people - my wife among them.
Correlations

| | | CrAtBat | CrHits | CrHome | CrRuns | CrRbi | CrBB |
|---|---|---|---|---|---|---|---|
| CrAtBat | Career Times at Bat | 1.00000 | 0.99489 | 0.79222 | 0.98069 | 0.94741 | 0.90035 |
| CrHits | Career Hits | 0.99489 | 1.00000 | 0.77573 | 0.98208 | 0.94254 | 0.88452 |
| CrHome | Career Home Runs | 0.79222 | 0.77573 | 1.00000 | 0.82093 | 0.92799 | 0.80619 |
| CrRuns | Career Runs | 0.98069 | 0.98208 | 0.82093 | 1.00000 | 0.94314 | 0.92677 |
| CrRbi | Career RBIs | 0.94741 | 0.94254 | 0.92799 | 0.94314 | 1.00000 | 0.88500 |
| CrBB | Career Walks | 0.90035 | 0.88452 | 0.80619 | 0.92677 | 0.88500 | 1.00000 |
There is a high degree of correlation among these variables. Without at least modest correlations among your variables, exploratory factor analysis would be as fruitless as my backyard peach tree after the birds and squirrels get their beaks and claws on the fruit.
The next table is the Factor Pattern matrix.
Factor Pattern

| | | Factor1 |
|---|---|---|
| CrAtBat | Career Times at Bat | 0.99782 |
| CrHits | Career Hits | 0.99655 |
| CrHome | Career Home Runs | 0.79667 |
| CrRuns | Career Runs | 0.98476 |
| CrRbi | Career RBIs | 0.95008 |
| CrBB | Career Walks | 0.90125 |
The factor pattern matrix in this case also serves as the factor structure matrix. A factor structure matrix contains the simple correlations between the manifest variables and the factors. They are the same as the regression coefficients we saw on the path diagram. I explained how this could be true in a previous post. Correlations are the same as regression coefficients when all variables are on a standard normal scale, with a mean of zero and a variance of one.
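To see why a loading equals a correlation under this model, here is a small Python simulation (hypothetical loadings, not the baseball fit): each standardized indicator is generated from a single standard normal factor, and the correlation of each indicator with the factor recovers its loading.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical one-factor model: each standardized indicator is
#   y_j = l_j * F + sqrt(1 - l_j**2) * e_j
# with F and e_j independent standard normal, so Var(y_j) = 1.
loadings = np.array([0.9, 0.8, 0.6])
F = rng.normal(size=n)
E = rng.normal(size=(n, 3))
Y = F[:, None] * loadings + E * np.sqrt(1 - loadings**2)

# Correlation of each indicator with the factor recovers the loading
est = np.array([np.corrcoef(Y[:, j], F)[0, 1] for j in range(3)])
print(np.round(est, 2))
```

Because every variable (including the factor) has unit variance, the regression coefficient, the loading, and the correlation are all the same number.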
The final table produced by this program is the Residual Correlations table with uniquenesses on the diagonal.
Residual Correlations With Uniqueness on the Diagonal

| | | CrAtBat | CrHits | CrHome | CrRuns | CrRbi | CrBB |
|---|---|---|---|---|---|---|---|
| CrAtBat | Career Times at Bat | 0.00435 | 0.00051 | -0.00271 | -0.00193 | -0.00059 | 0.00106 |
| CrHits | Career Hits | 0.00051 | 0.00688 | -0.01819 | 0.00072 | -0.00426 | -0.01362 |
| CrHome | Career Home Runs | -0.00271 | -0.01819 | 0.36532 | 0.03640 | 0.17109 | 0.08820 |
| CrRuns | Career Runs | -0.00193 | 0.00072 | 0.03640 | 0.03024 | 0.00754 | 0.03926 |
| CrRbi | Career RBIs | -0.00059 | -0.00426 | 0.17109 | 0.00754 | 0.09735 | 0.02875 |
| CrBB | Career Walks | 0.00106 | -0.01362 | 0.08820 | 0.03926 | 0.02875 | 0.18775 |
If you look closely, the diagonal values on this table match the uniquenesses (numbers on the double-headed arrows) reported in the path diagram.
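We can check this relationship with the rounded numbers from the tables above. In a one-factor model, a variable's uniqueness is 1 minus its squared loading, and each off-diagonal residual is the observed correlation minus the product of the two loadings. A quick Python check (not SAS, and using the reported rounded values):

```python
import numpy as np

# Loadings from the Factor Pattern table, in variable order:
# CrAtBat, CrHits, CrHome, CrRuns, CrRbi, CrBB
l = np.array([0.99782, 0.99655, 0.79667, 0.98476, 0.95008, 0.90125])

# Uniqueness = 1 - loading**2: the variance left unexplained by the factor
uniq = 1 - l**2
print(np.round(uniq, 5))  # matches the residual table diagonal to within rounding

# Off-diagonal residual = observed correlation - model-implied correlation,
# e.g. CrAtBat vs. CrHits: 0.99489 - (0.99782)(0.99655)
resid = 0.99489 - l[0] * l[1]
print(round(resid, 5))    # 0.00051, matching the table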
In both linear regression and EFA, the matrix form of the regression equations is Y = XB + E. Let’s look at what each element of the equation represents in EFA.
In EFA, the Y matrix contains all the measured variables used for factoring (the manifest variables). The B matrix contains all the regression coefficients (the factor loadings). The E matrix contains the errors (the uniquenesses). The X matrix contains the factors (the latent variables), which are said to “cause” the manifest variables. This is just a preview of the matrix algebra that I will explain in the next post.
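As a preview of those shapes, here is a minimal NumPy sketch of a hypothetical one-factor model in the form Y = XB + E (simulated data; the dimensions mirror the baseball example, with six manifest variables):

```python
import numpy as np

rng = np.random.default_rng(7)

n, k, m = 500, 6, 1  # n observations, k manifest variables, m factors

# X: factor scores (latent), assumed standard normal, shape (n, m)
X = rng.normal(size=(n, m))

# B: factor loadings (the regression coefficients), shape (m, k)
B = rng.uniform(0.6, 0.99, size=(m, k))

# E: unique factors (the errors), scaled so each Var(y_j) is about 1
E = rng.normal(size=(n, k)) * np.sqrt(1 - B**2)

# Y: manifest variables, shape (n, k)
Y = X @ B + E

print(Y.shape)  # (500, 6)
```

The point is only the bookkeeping: one column of B per manifest variable, one column of X per factor, and E carrying whatever the factor leaves unexplained.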
In this post, I have introduced the fundamental concepts of exploratory factor analysis with the aid of linear regression and correlation. I showed a simple example using one factor and described how we can interpret the basic output from PROC FACTOR as a series of regression equations. I have yet to fully explain how the coefficients are estimated. For that, I will need to go into a little more detail about the matrices that I introduced at the end of this post. For those interested in moving to the next level of understanding of exploratory factor analysis, look out for my next post.