In this post, I will explain the matrices and matrix operations used to detect and estimate factors in exploratory factor analysis (EFA). It’s not necessary, but it might be helpful to read my first two blog entries, in sequence. The first two blogs were How Correlation Relates to Linear Regression and Factor Analysis and The Relationship between Factor Analysis and Regression. My goal here is to help you through the fog of matrices and variables that can’t be measured (latent variables, measured as factors) that I encountered when first learning factor analysis.
This is the last of the trilogy of factor analysis articles that I have posted this year. My hope is that you will think of it alongside the Lord of the Rings trilogy and not the Beverly Hills Chihuahua trilogy, or even worse, the Jeepers Creepers trilogy. I’ve been using the expression “Jeepers Creepers” since I was 6 and that movie franchise has led me to shift to “Jeez Louise”. I hope there are no horror franchises ruining that one.
Let’s get back to our discussion about factor analysis and linear regression. In my first post, I described and illustrated how in simple regression (with a single dependent and a single independent variable), if you z-score standardize both variables, the regression coefficient will also be the correlation between the variables. In my second blog post, I introduced a structural diagram of an exploratory factor analysis system with one factor. I described how the diagram illustrated graphically each of the linear regression equations. There are always k linear regressions, where k is the number of “manifest” (measured) variables in your data set.
This is where my audience will diverge (oh, yeah, I forgot about the Divergent Trilogy, didn’t I?). One segment will enter the realm of the Lord of the Rings - if you consider a bit of matrix algebra adventurous. Another segment will shrug this off as a mildly annoying exercise not unlike a tiny dog barking and biting their heels. The third segment will consider this all a ridiculous exercise in gratuitous horror.
Okay, here it is. This is the equation for factor analysis, as I have previously introduced it:
Y=XB+E
The Y matrix contains all the measured variables used for factoring (the manifest variables). The B matrix contains all the regression coefficients (the factor loadings). The E matrix contains the errors (the uniquenesses). The X matrix contains the factors (the latent variables), which are said to “cause” the manifest variables.
Of all these matrices, the only one whose values are known at the start is the Y matrix. So, for those of you who are worried about having a single equation with one known quantity and three unknowns: congratulations. You are right. However, we will place some constraints on some of the estimates to make this workable.
How do we go about solving for B? First, we will not be working with raw data, but rather correlation matrices, which we have now learned are simply standardized versions of variance-covariance matrices.
Here is the key to the solution. Factor analysis assumes that manifest variables exhibit correlations simply because they are correlated with the set of common factors. So, if A has a 0.5 correlation with Factor1 and B has a 0.8 correlation with Factor1 and there are no other factors, then the correlation between A and B should be 0.5 * 0.8 = 0.4.
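To make that arithmetic concrete, here is a tiny Python sketch (Python rather than SAS, purely for illustration) using the hypothetical loadings from the example above:

```python
# Under a one-factor model, the model-implied correlation between two
# manifest variables is the product of their loadings on the common factor.
loading_A = 0.5  # hypothetical correlation of variable A with Factor1
loading_B = 0.8  # hypothetical correlation of variable B with Factor1

implied_corr = loading_A * loading_B
print(implied_corr)  # 0.4
```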
Let’s look at a structural diagram of an EFA run in PROC FACTOR.
ods output factorpattern=BPrime;
proc factor data=sashelp.baseball corr method=ml;
var nRbi nRuns nHits;
pathdiagram designheight=400 nodelabel=varname;
run;
Factor Pattern

| Variable | Label | Factor1 |
|----------|--------------|---------|
| nRBI | RBIs in 1986 | 0.82403 |
| nRuns | Runs in 1986 | 0.94721 |
| nHits | Hits in 1986 | 0.96248 |
Note: the ordering of the manifest variables in the path diagram in SAS is alphabetical and that ordering cannot be modified, except through renaming or use of variable labels.
The coefficients for predicting nHits, nRBI, and nRuns from Factor1 are 0.96, 0.82, and 0.95, respectively. They are displayed as regression coefficients in the path diagram and as elements of the factor pattern matrix. These are also the correlations between those variables and Factor1. Why is this the case? Well, Factor1 has no inherent measurement scale, so we can arbitrarily set one. It is customary to set the mean to zero and the variance to one, just like a z-standardized variable. With all of the manifest variables and the factor on a standardized scale, those regression coefficients are also correlations.
Note: The factor pattern matrix is the B matrix of our factor analysis equations. That will soon 'B' important.
Given this, we would expect the correlation between nRBI and nRuns to be about 0.82 * 0.95 = 0.7790, within rounding error, between nRBI and nHits to be about 0.82 * 0.96 = 0.7872, and between nRuns and nHits to be about 0.95 * 0.96 = 0.9120.
If I want to calculate all of those expected inter-manifest-variable correlations in one matrix operation, I multiply the transpose of the B matrix (B'), the 3 x 1 column vector

[0.82]
[0.95]
[0.96]

by the B matrix, the 1 x 3 row vector [0.82 0.95 0.96].
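For readers who would like to check this outside of SAS, a short Python/NumPy sketch (my own illustration, not part of the SAS workflow) computes the same B'B as an outer product of the loading vector with itself:

```python
import numpy as np

# Loadings from the factor pattern table, in the order nRBI, nRuns, nHits
b = np.array([0.82403, 0.94721, 0.96248])

# B' is the 3 x 1 column of loadings and B is the 1 x 3 row,
# so B'B is simply the outer product of the loading vector with itself.
BtB = np.outer(b, b)
print(np.round(BtB, 5))
```

The off-diagonal elements match the expected correlations calculated above.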
In the SAS code above I saved the B’ matrix (vector because there is only one factor) as a SAS dataset named BPrime. Then I transposed the matrix to obtain B.
proc print data=BPrime;
run;
| Obs | Variable | Label | Factor1 |
|-----|----------|--------------|---------|
| 1 | nRBI | RBIs in 1986 | 0.82403 |
| 2 | nRuns | Runs in 1986 | 0.94721 |
| 3 | nHits | Hits in 1986 | 0.96248 |
proc transpose data=BPrime out=B;
id Variable;
run;
proc print data=B;
run;
| Obs | _NAME_ | nRBI | nRuns | nHits |
|-----|---------|---------|---------|---------|
| 1 | Factor1 | 0.82403 | 0.94721 | 0.96248 |
If you have SAS/IML, you can do the matrix algebra more directly, but in this case, I perform the matrix multiplication using a DATA step to produce the B'B matrix.
data BprimeB;
if _n_=1 then set B;  /* read the 1 x 3 row of loadings (B) once; values are retained across iterations */
set BPrime;           /* one row per manifest variable: the B' column of loadings */
_nRBI=Factor1*nRBI;   /* this row's loading times nRBI's loading: one element of B'B */
_nRuns=Factor1*nRuns;
_nHits=Factor1*nHits;
run;
proc print data=BPrimeB;
var Variable _nRBI _nRuns _nHits;
run;
| Obs | Variable | _nRBI | _nRuns | _nHits |
|-----|----------|---------|---------|---------|
| 1 | nRBI | 0.67903 | 0.78053 | 0.79311 |
| 2 | nRuns | 0.78053 | 0.89720 | 0.91167 |
| 3 | nHits | 0.79311 | 0.91167 | 0.92636 |
The off-diagonal elements of the table are the expected correlations that we calculated above. Let’s compare this table with the correlation matrix obtained using the CORR option in PROC FACTOR.
Correlations

| Variable | Label | nRBI | nRuns | nHits |
|----------|--------------|---------|---------|---------|
| nRBI | RBIs in 1986 | 1.00000 | 0.78053 | 0.79311 |
| nRuns | Runs in 1986 | 0.78053 | 1.00000 | 0.91167 |
| nHits | Hits in 1986 | 0.79311 | 0.91167 | 1.00000 |
Voila! As promised, the expected correlations between the manifest variables are approximated by the appropriate elements of B’B.
We have one thing left to do: interpret the elements on the positive diagonal of B'B. Well, if the ones on the diagonal of the correlation matrix are just the variances of the standardized manifest variables, then the diagonal elements of the B'B matrix represent the parts of those variances that are shared with the factor. What's left over? The error, or (in factor analysis terms) the uniquenesses. The uniquenesses are displayed on the path diagram, just above the boxes for the manifest variables.
The uniquenesses are interpreted as the parts of the variances of the manifest variables that are not shared with the factor. They have their own variance-covariance matrix. That matrix contains the uniquenesses (unique variances) on the diagonal and unique covariances off the diagonal. In factor analysis, we assume there is no covariance among the uniquenesses, so the off-diagonal elements are zero.
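In the one-factor case, each uniqueness is simply one minus the variable's squared loading (its communality). A quick Python sketch, again only for illustration, using the loadings from the factor pattern table:

```python
# One-factor case: communality = loading**2, uniqueness = 1 - communality.
# Loadings taken from the factor pattern table above.
loadings = {"nRBI": 0.82403, "nRuns": 0.94721, "nHits": 0.96248}
uniquenesses = {var: 1 - b**2 for var, b in loadings.items()}
for var, u in uniquenesses.items():
    print(var, round(u, 5))  # matches the path diagram values within rounding
```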
Using this logic and factor analysis assumptions, we end up with this equation: R = B’B + U, where R is the covariance matrix of the manifest variables. In practice we typically use the standardized covariance matrix – the correlation matrix.
So, for our example, here are the matrices, along with the matrix equation R = B’B + U:
Correlations

| Variable | Label | nRBI | nRuns | nHits |
|----------|--------------|---------|---------|---------|
| nRBI | RBIs in 1986 | 1.00000 | 0.78053 | 0.79311 |
| nRuns | Runs in 1986 | 0.78053 | 1.00000 | 0.91167 |
| nHits | Hits in 1986 | 0.79311 | 0.91167 | 1.00000 |
=
| Variable | _nRBI | _nRuns | _nHits |
|----------|---------|---------|---------|
| nRBI | 0.67903 | 0.78053 | 0.79311 |
| nRuns | 0.78053 | 0.89720 | 0.91167 |
| nHits | 0.79311 | 0.91167 | 0.92636 |
+
| Variable | _nRBI | _nRuns | _nHits |
|----------|---------|---------|---------|
| nRBI | 0.32097 | 0.00000 | 0.00000 |
| nRuns | 0.00000 | 0.10280 | 0.00000 |
| nHits | 0.00000 | 0.00000 | 0.07364 |
In exploratory factor analysis, we start with R, a correlation matrix of the manifest variables. Then we make guesses about the U matrix values. We subtract U from R to obtain an approximation for the reduced covariance matrix, B’B. Then we select from among several factor extraction methods to find the B matrix that best reproduces the off-diagonal elements using B’B.
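That last step can be sketched in Python/NumPy (again, an illustration rather than what PROC FACTOR does internally, and it only works this neatly in this rank-one example): subtract the guessed U from R, then extract the loadings from the leading eigenpair of the reduced matrix.

```python
import numpy as np

# Correlation matrix R for nRBI, nRuns, nHits (values from the CORR output)
R = np.array([[1.0,     0.78053, 0.79311],
              [0.78053, 1.0,     0.91167],
              [0.79311, 0.91167, 1.0    ]])

# "Guessed" uniquenesses; here I simply reuse the estimates from the post
U = np.diag([0.32097, 0.10280, 0.07364])

reduced = R - U                       # approximates B'B (rank one here)
vals, vecs = np.linalg.eigh(reduced)  # eigenvalues in ascending order
lam, v = vals[-1], vecs[:, -1]        # leading eigenvalue and eigenvector
loadings = np.sqrt(lam) * v           # one-factor loading estimates
if loadings.sum() < 0:                # eigenvectors are sign-indeterminate
    loadings = -loadings
print(np.round(loadings, 5))          # close to 0.82403, 0.94721, 0.96248
```

In a real analysis, the guesses for U would be updated iteratively until B'B best reproduces the off-diagonal elements of R.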
Note: In this example, B’B + U = R, exactly. When we only have 3 manifest variables for one factor and we use maximum likelihood, that will always happen. You can try a fourth variable and see that B’B + U does not exactly equal R.
In this post, I have walked through the matrix algebra involved in a basic exploratory factor analysis with one latent factor and three manifest variables. I have explained why the specific matrix operations are performed by relating the process to my earlier blog topics about correlation and regression.
My goal in this trilogy of posts on exploratory factor analysis has been to offer a more intuitive explanation of not only what a factor represents, but how factors are built. I have not attempted to explain the factor extraction methodologies, including principal, maximum likelihood, and non-parametric. Nor have I attempted to explain EFA with more than one factor. When there is more than one factor, you will likely need to rotate the B matrix to obtain a Factor Pattern matrix that can be more easily interpreted.
If you REALLY want to learn more, please look at the Course: Multivariate Statistics for Understanding Complex Data (sas.com) .
Find more articles from SAS Global Enablement and Learning here.