Exploratory Factor Analysis: From Correlations to Latent Factors, Part 1

In today’s data-driven world, we often collect a large number of variables through surveys, research studies, and behavioral data. While having more data is useful, it can quickly become overwhelming to analyze and interpret. This is where Common Factor Analysis becomes helpful. It helps reduce this complexity by identifying a small number of hidden patterns, called factors, that explain relationships among variables.

Because these factors are usually unknown in advance, the method is typically applied in an exploratory way and is known as Exploratory Factor Analysis (EFA). In psychology and behavioral sciences, the term factor analysis is often used as a shorthand for common factor analysis.

What Is Exploratory Factor Analysis?

Exploratory Factor Analysis is used when you don’t have a predefined idea of how variables should group together. It explores the data to determine:

How many underlying factors exist.
Which variables load on (or contribute to) which factors.
How strongly each variable is associated with a factor.

For example, imagine you conducted a customer satisfaction survey with 20 different questions. Instead of analyzing each question separately, EFA might reveal that these questions actually reflect three underlying dimensions such as Service Quality, Pricing Perception, and Brand Trust.

What Are Factors?

When multiple variables are measured or respondents are asked several related questions, it is often observed that some variables exhibit systematic relationships with one another. These interrelationships can be organized into a correlation matrix containing the pairwise correlation coefficients among all variables in the dataset.

Think of the correlation matrix as a table showing how strongly each variable is related to every other variable. The diagonal elements are all 1s, since every variable correlates perfectly with itself. The off-diagonal elements, however, are where the real story lies: they show the correlation coefficients between pairs of variables or questions.
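As a minimal sketch of how such a matrix can be computed, the Python snippet below uses NumPy on synthetic data. The three variables (q1, q2, q3) and the way they are generated are purely hypothetical and are not taken from any real survey; two of them are built to share a common influence so that a cluster of high correlation shows up.

```python
import numpy as np

# Hypothetical synthetic data: 200 observations of three made-up variables.
rng = np.random.default_rng(0)
n = 200
shared = rng.normal(size=n)               # an unobserved shared influence
q1 = shared + 0.5 * rng.normal(size=n)    # driven mostly by the shared influence
q2 = shared + 0.5 * rng.normal(size=n)    # driven mostly by the shared influence
q3 = rng.normal(size=n)                   # unrelated to the other two

X = np.column_stack([q1, q2, q3])

# rowvar=False tells NumPy that the columns (not the rows) are the variables.
R = np.corrcoef(X, rowvar=False)

# Diagonal entries are 1; off-diagonal entries are the pairwise correlations.
print(np.round(R, 2))
```

In the printed matrix, q1 and q2 should correlate strongly with each other and only weakly with q3, which is the kind of pattern that hints at a shared underlying factor.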

Now, if you notice clusters of high correlations among certain variables, that’s a hint they may be tapping into the same underlying theme or construct. These hidden patterns are what we call factors or latent variables. A factor cannot be measured directly but is inferred from patterns of association among observed variables. It represents a hypothetical construct that is assumed to influence multiple observable (manifest) variables, which collectively provide evidence for the factor’s existence.

By identifying these factors, factor analysis helps us simplify complex data. Instead of dealing with dozens of interrelated variables, we can summarize them into a few meaningful factors. This process, known as achieving parsimony, allows us to explain the maximum amount of shared variance in the data using the smallest number of explanatory constructs. In other words, factor analysis finds the simplest way to explain what’s really going on beneath the surface of your data. In applied research, these factors often correspond to theoretical constructs, such as personality traits, consumer attitudes, or cognitive abilities, that are not directly observable but are inferred through patterns in the observed variables. By revealing these latent structures, EFA provides a rigorous foundation for subsequent modeling, measurement, or theory development.

Let’s bring these ideas to life with a practical example. Suppose we want to understand the underlying factors that characterize the development and quality of different regions. One way to explore this is by analyzing observable indicators that describe demographic size, economic activity, and living conditions. We capture these aspects through variables such as Population, Employment, Services, School, and HouseValue. Each of these variables represents a measurable characteristic of a region, but individually they do not tell the full story. For example, Population and Employment reflect the scale and economic activity of an area, while Services captures the availability of infrastructure that supports day-to-day life. On the other hand, School quality and HouseValue are indicators of social infrastructure and overall living standards. Together, these observed variables are designed to tap into broader, unobserved dimensions such as economic vitality and quality of life that influence how developed or attractive a region may be.

Once we collect the data, the next step is to calculate the correlation coefficients between each pair of variables and organize them into a correlation matrix. This matrix allows us to see how strongly each variable relates to the others, and it’s often the first step in uncovering the latent patterns. Any significant correlations are highlighted in bold, making it easier to spot patterns.

When we examine the matrix, two clear clusters of interrelated variables emerge. In the first cluster, Population and Employment show a very strong association, indicating a shared underlying dimension related to economic scale and activity. In the second, School and HouseValue are highly correlated, pointing to a separate dimension reflecting social infrastructure and living standards. The variable Services exhibits moderate to strong correlations with both groups, suggesting that it plays a bridging role between economic activity and quality of life.

You can also obtain an initial factor pattern plot to illustrate how each observed variable relates to the underlying factors identified through exploratory factor analysis. In the graph below, Factor 1 is strongly associated with School, HouseValue, and Services, while Factor 2 is driven mainly by Population and Employment. The Services variable loads primarily on Factor 1 but also shows a weak association with Factor 2, indicating that it acts as a bridge between the two underlying dimensions.


The Common Factor Model

A common factor is an unobserved, or latent, variable that influences more than one observed variable. It explains the shared variation seen among those variables. When we use the term factor without any qualifier, we usually mean a common factor. These common factors help explain why certain variables are correlated with each other. In contrast, a unique factor represents influences that are specific to a single observed variable. It captures variation that is not shared with other variables, including measurement error or characteristics unique to that variable. In the common factor model, each observed variable is assumed to have its own unique factor in addition to the common factors. Together, this model separates what variables have in common from what is unique to each, making it easier to understand the underlying structure of the data.

Observed Variable = Common Factor(s) + Unique Factor

An observed variable like ‘V₁’ can be expressed mathematically as a linear combination (weighted sum) of the underlying latent factors plus a unique component.

V₁ = b₁₁F₁ + b₁₂F₂ + … + b₁ₘFₘ + e₁, where

b₁ⱼ: the factor loading showing how strongly variable V₁ is related to factor Fⱼ.

Fⱼ: the common factors (j = 1, …, m) that represent shared variance across variables.

e₁: the unique factor (specific variance + measurement error) that is not explained by the common factors.
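To make the equation concrete, here is a small Python sketch of the correlations that a two-factor model would imply. The loadings are hypothetical numbers chosen only to mimic the regional example; they are not estimates from real data. The sketch also assumes standardized variables, uncorrelated common factors, and unique factors that are uncorrelated with everything else, in which case the implied correlation matrix is the loading matrix times its transpose plus the unique variances on the diagonal.

```python
import numpy as np

names = ["Population", "Employment", "Services", "School", "HouseValue"]

# Hypothetical loadings b_ij (rows: variables, columns: Factor 1, Factor 2).
B = np.array([
    [0.1, 0.9],   # Population
    [0.2, 0.9],   # Employment
    [0.7, 0.4],   # Services
    [0.9, 0.1],   # School
    [0.9, 0.0],   # HouseValue
])

# Variance left over for each unique factor so that every variable
# has a total variance of 1.
unique_var = 1.0 - (B ** 2).sum(axis=1)

# Implied correlations under the stated assumptions:
# R = B B' + diag(unique variances)
R_implied = B @ B.T + np.diag(unique_var)

for name, row in zip(names, np.round(R_implied, 2)):
    print(f"{name:<11}", row)
```

Two variables end up highly correlated only when they load strongly on the same factor, which is how the model accounts for the clusters seen in a correlation matrix.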

Communality

Communality indicates how much a variable overlaps with the common factors. In other words, it represents the proportion of a variable’s variance that is accounted for by the common factors. The difference between the correlations predicted by the common factor model and the actual observed correlations is known as the residual correlation. Examining these residual correlations provides a useful way to assess how well the common factor model fits the data. Smaller residuals indicate that the factors explain the relationships among variables effectively.
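The following Python sketch illustrates both ideas under the assumption of uncorrelated factors, where a variable's communality is the sum of its squared loadings. The correlation matrix and the loadings are hypothetical values constructed for illustration, and residual_correlations is a helper name introduced here, not a library function.

```python
import numpy as np

# Hypothetical observed correlations (order: Population, Employment,
# Services, School, HouseValue).
R = np.array([
    [1.00, 0.83, 0.43, 0.18, 0.09],
    [0.83, 1.00, 0.50, 0.27, 0.18],
    [0.43, 0.50, 1.00, 0.67, 0.63],
    [0.18, 0.27, 0.67, 1.00, 0.81],
    [0.09, 0.18, 0.63, 0.81, 1.00],
])

# Hypothetical two-factor loadings (columns: Factor 1, Factor 2).
B = np.array([
    [0.1, 0.9],
    [0.2, 0.9],
    [0.7, 0.4],
    [0.9, 0.1],
    [0.9, 0.0],
])

# Communality: sum of squared loadings across the (uncorrelated) factors.
communality = (B ** 2).sum(axis=1)


def residual_correlations(R, loadings):
    """Observed minus model-predicted correlations (off-diagonal only)."""
    resid = R - loadings @ loadings.T
    np.fill_diagonal(resid, 0.0)
    return resid


resid_two = residual_correlations(R, B)          # both factors
resid_one = residual_correlations(R, B[:, [0]])  # Factor 1 only

print("Communalities:", np.round(communality, 2))
print("Largest residual, two factors:", np.round(np.abs(resid_two).max(), 3))
print("Largest residual, one factor: ", np.round(np.abs(resid_one).max(), 3))
```

Because the matrix was built to fit two factors, the two-factor residuals are essentially zero, while dropping the second factor leaves the Population and Employment correlation largely unexplained.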

An important assumption of common factor analysis is that the common factors are not simply linear combinations of the observed variables. Even if data are available for the entire population, the true values of the common factors cannot be calculated directly. Instead, they remain latent constructs. Although common factor scores cannot be computed exactly, they can be estimated using various statistical methods. These estimated factor scores provide practical representations of the latent factors and are often used for further analysis and interpretation.

Steps in Performing Exploratory Factor Analysis

When performing exploratory factor analysis using a sample covariance or correlation matrix, the process typically involves four closely related steps:

Determining the number of factors: In exploratory factor analysis, the first step is usually to decide how many factors are needed to adequately explain the correlations in the multivariate data. If you already have strong domain knowledge or a theoretical understanding of the underlying structure, you may choose to fix the number of factors in advance and proceed with the remaining steps of the analysis. When such prior knowledge is not available, the number of factors is typically determined by examining the eigenvalues of either the sample correlation matrix or a reduced correlation matrix. The sample correlation matrix has ones along its diagonal, representing the total variance of each variable. In contrast, a reduced correlation matrix replaces these diagonal values with estimates of communalities, reflecting only the shared variance explained by common factors. A widely used approach in exploratory factor analysis is to estimate initial communalities using squared multiple correlations (SMCs). Since SMCs are usually less than one, they provide more realistic estimates of shared variance than the original diagonal values. The eigenvalues computed from this reduced correlation matrix are then used to guide the selection of the number of factors. Overall, the various methods for determining the number of common factors, such as examining eigenvalues or their cumulative contributions, are all based on assessing how much shared variance is explained by successive factors.
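A rough Python sketch of this step is shown here. It assumes a hypothetical correlation matrix for the five regional variables, computes each SMC as one minus the reciprocal of the corresponding diagonal element of the inverse correlation matrix, and then inspects the eigenvalues of the reduced matrix. Counting the positive eigenvalues is used only as one simple heuristic; scree plots and proportion-of-variance rules are common alternatives.

```python
import numpy as np

# Hypothetical correlation matrix (order: Population, Employment,
# Services, School, HouseValue).
R = np.array([
    [1.00, 0.83, 0.43, 0.18, 0.09],
    [0.83, 1.00, 0.50, 0.27, 0.18],
    [0.43, 0.50, 1.00, 0.67, 0.63],
    [0.18, 0.27, 0.67, 1.00, 0.81],
    [0.09, 0.18, 0.63, 0.81, 1.00],
])

# Squared multiple correlation of each variable with all the others:
# SMC_i = 1 - 1 / (R^-1)_ii
smc = 1.0 - 1.0 / np.diag(np.linalg.inv(R))

# Reduced correlation matrix: SMCs replace the ones on the diagonal.
R_reduced = R.copy()
np.fill_diagonal(R_reduced, smc)

# eigvalsh handles symmetric matrices and returns ascending order,
# so reverse it to list the largest eigenvalue first.
eigvals = np.linalg.eigvalsh(R_reduced)[::-1]

# Cumulative share of the estimated common variance (sum of the SMCs).
cumulative = np.cumsum(eigvals) / smc.sum()

# One simple heuristic (an assumption of this sketch): keep the factors
# whose eigenvalues are positive.
n_factors = int((eigvals > 0).sum())

print("SMCs:        ", np.round(smc, 3))
print("Eigenvalues: ", np.round(eigvals, 3))
print("Cumulative:  ", np.round(cumulative, 3))
print("Factors kept:", n_factors)
```

Note that a reduced correlation matrix can have negative eigenvalues, so the cumulative proportion can exceed 1 before falling back to 1 at the end.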

Extracting the factors: Once the number of factors has been determined, the next step is to extract the common factors. This step involves estimating the factor pattern matrix, which contains the factor loadings showing how strongly each observed variable is associated with each factor. Various methods can be used for factor extraction, including both iterative and non-iterative approaches. At this initial stage of estimation, the extracted common factors are typically assumed to be uncorrelated with one another. After extraction, the focus shifts to interpreting the factors. Interpretation involves assigning meaningful labels to the factors based on which observed variables have the largest loadings on them. In other words, factors are named according to the variables they most strongly influence. Because factor interpretation relies on judgment, it is inherently subjective. To make interpretation clearer and less subjective, factors are often rotated. Rotation applies a mathematical transformation to the factor pattern matrix that redistributes the loadings, helping each variable load more clearly on one factor and improving the overall interpretability of the solution.
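As an illustration of a non-iterative extraction, the Python sketch below applies a simple principal factor approach: it takes the eigenvectors of the reduced correlation matrix and scales them by the square roots of their eigenvalues to obtain unrotated loadings. The correlation matrix is hypothetical, the number of factors is assumed to have been fixed at two, and this is only one of several possible extraction methods.

```python
import numpy as np

# Hypothetical correlation matrix (order: Population, Employment,
# Services, School, HouseValue).
R = np.array([
    [1.00, 0.83, 0.43, 0.18, 0.09],
    [0.83, 1.00, 0.50, 0.27, 0.18],
    [0.43, 0.50, 1.00, 0.67, 0.63],
    [0.18, 0.27, 0.67, 1.00, 0.81],
    [0.09, 0.18, 0.63, 0.81, 1.00],
])
n_factors = 2  # assumed to have been chosen in the previous step

# Reduced correlation matrix with SMCs as initial communality estimates.
smc = 1.0 - 1.0 / np.diag(np.linalg.inv(R))
R_reduced = R.copy()
np.fill_diagonal(R_reduced, smc)

# eigh returns ascending eigenvalues; pick the n_factors largest.
eigvals, eigvecs = np.linalg.eigh(R_reduced)
order = np.argsort(eigvals)[::-1][:n_factors]

# Unrotated loadings: eigenvectors scaled by the square roots of their
# eigenvalues. The sign of each column is arbitrary.
loadings = eigvecs[:, order] * np.sqrt(eigvals[order])

# Off-diagonal residuals: observed minus reproduced correlations.
residuals = R - loadings @ loadings.T
np.fill_diagonal(residuals, 0.0)

print("Unrotated loadings:")
print(np.round(loadings, 2))
print("Largest absolute residual:", np.round(np.abs(residuals).max(), 3))
```

The columns of the loading matrix are orthogonal by construction, which matches the assumption that the initially extracted factors are uncorrelated.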

Rotating the factors: Rotating factors can be thought of as choosing a more convenient coordinate system for viewing and interpreting the extracted factors. Rotation does not alter the underlying solution; instead, it aims to simplify interpretation by producing a factor pattern matrix in which variables have loadings that are either close to zero or close to ±1. Most rotation methods achieve this by optimizing a simplicity criterion that encourages a clear separation of variables across factors. With orthogonal rotation, the factors remain uncorrelated, whereas oblique rotation allows factors to be correlated and often yields more realistic and interpretable results, especially in social and behavioral research. When factors are correlated, interpretation becomes more complex, as no single measure fully captures a factor’s contribution to a variable. In such cases, interpretation requires examining the pattern matrix along with the factor structure and reference structure. Importantly, rotation does not affect the statistical explanatory power of the factors. All rotations are statistically equivalent, so the choice of rotation should be guided by interpretability and theoretical considerations rather than statistical criteria.
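To show what an orthogonal rotation does in practice, here is a compact Python sketch of varimax using a commonly published SVD-based iteration. The varimax function below is written for this illustration rather than taken from a library, and the unrotated loadings are hypothetical: they are produced by deliberately turning a clean loading pattern by 30 degrees so that there is something to rotate back.

```python
import numpy as np


def varimax(loadings, max_iter=100, tol=1e-8):
    """Orthogonal varimax rotation via a common SVD-based iteration.

    Returns the rotated loadings and the orthogonal rotation matrix T.
    """
    p, k = loadings.shape
    T = np.eye(k)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ T
        col_ss = (rotated ** 2).sum(axis=0)           # column sums of squares
        target = rotated ** 3 - rotated * (col_ss / p)
        U, s, Vt = np.linalg.svd(loadings.T @ target)
        T = U @ Vt                                    # nearest orthogonal matrix
        previous, criterion = criterion, s.sum()
        if previous != 0 and criterion / previous < 1 + tol:
            break
    return loadings @ T, T


# Hypothetical loadings with a clean structure (columns: Factor 1, Factor 2).
simple = np.array([
    [0.1, 0.9],
    [0.2, 0.9],
    [0.7, 0.4],
    [0.9, 0.1],
    [0.9, 0.0],
])

# Blur the structure with a 30-degree turn to mimic an unrotated solution.
angle = np.deg2rad(30)
mix = np.array([[np.cos(angle), -np.sin(angle)],
                [np.sin(angle),  np.cos(angle)]])
unrotated = simple @ mix

rotated, T = varimax(unrotated)

print("Unrotated loadings:")
print(np.round(unrotated, 2))
print("Varimax-rotated loadings:")
print(np.round(rotated, 2))
```

Each variable's communality (the row sum of squared loadings) and the reproduced correlations are identical before and after the rotation, which is what it means for rotations to be statistically equivalent. The order and signs of the rotated columns are arbitrary.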

Estimating factor scores: In factor analysis, factor scores are not directly observed because factors themselves are latent variables. However, once a factor solution has been obtained, factor scores can be estimated using the observed data along with the estimated model parameters, such as factor loadings. These estimated factor scores provide a numerical value for each observation on each factor. They are often used in subsequent analyses, for example as predictors in regression models or as inputs for clustering and classification. It is important to note that factor scores are estimates, not exact values, and different estimation methods can yield slightly different results.
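One widely used estimator is the regression method, sketched below in Python for uncorrelated factors. To make the point that scores are estimates, the sketch simulates data from a known two-factor model, so the true factor values are available for comparison, which is never the case with real data. The loadings, sample size, and random seed are all hypothetical, and for simplicity the known loadings stand in for estimated ones.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Hypothetical loadings (columns: Factor 1, Factor 2).
B = np.array([
    [0.1, 0.9],
    [0.2, 0.9],
    [0.7, 0.4],
    [0.9, 0.1],
    [0.9, 0.0],
])
unique_var = 1.0 - (B ** 2).sum(axis=1)

# Simulate from the common factor model: X = F B' + E.
F_true = rng.normal(size=(n, 2))                    # latent factor values
E = rng.normal(size=(n, 5)) * np.sqrt(unique_var)   # unique factors
X = F_true @ B.T + E

# Standardize the observed variables.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Regression-method scoring weights: W = R^-1 B, where R is the sample
# correlation matrix of the observed variables.
R = np.corrcoef(X, rowvar=False)
W = np.linalg.solve(R, B)
scores = Z @ W

# Agreement between the estimated scores and the true factor values.
agreement = [np.corrcoef(scores[:, j], F_true[:, j])[0, 1] for j in range(2)]
print("Correlation of estimated with true factors:", np.round(agreement, 3))
```

The correlations are high but stay below 1: the estimated scores track the latent factors closely without reproducing them exactly.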

Exploratory Factor Analysis versus Principal Component Analysis

Although Exploratory Factor Analysis (EFA) and Principal Component Analysis (PCA) are often used together and can look similar, they serve different purposes.

Aspect	Exploratory Factor Analysis (EFA)	Principal Component Analysis (PCA)
What it is	A latent variable model: the observed variables are modeled as linear combinations of underlying factors, and the factors are not linear combinations of the observed variables	A set of components, each a linear combination of the observed variables with weights chosen to maximize variance
Primary goal	Identify underlying latent factors	Reduce dimensionality of data
What it explains	Common (shared) variance among variables	Total variance in the data
Nature of output	Factors are unobserved constructs	Components are mathematical combinations
Key question answered	What hidden concepts drive correlations?	How can we summarize the data efficiently?
Typical use cases	Theory building, behavioral & social sciences	Data compression, preprocessing, visualization
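The "What it explains" row can be illustrated numerically. In the Python sketch below, which uses a hypothetical correlation matrix, PCA is represented by the eigenvalues of the full correlation matrix, and common factor analysis by the eigenvalues of the reduced matrix with SMCs on the diagonal. The choice of SMCs as communality estimates is an assumption of this sketch.

```python
import numpy as np

# Hypothetical correlation matrix (order: Population, Employment,
# Services, School, HouseValue).
R = np.array([
    [1.00, 0.83, 0.43, 0.18, 0.09],
    [0.83, 1.00, 0.50, 0.27, 0.18],
    [0.43, 0.50, 1.00, 0.67, 0.63],
    [0.18, 0.27, 0.67, 1.00, 0.81],
    [0.09, 0.18, 0.63, 0.81, 1.00],
])

# PCA: eigenvalues of the full correlation matrix (ones on the diagonal),
# so the total variance of every variable is analyzed.
pca_eigvals = np.linalg.eigvalsh(R)[::-1]

# Common factor analysis: the diagonal is replaced with communality
# estimates (SMCs here), so only shared variance is analyzed.
smc = 1.0 - 1.0 / np.diag(np.linalg.inv(R))
R_reduced = R.copy()
np.fill_diagonal(R_reduced, smc)
efa_eigvals = np.linalg.eigvalsh(R_reduced)[::-1]

print("PCA eigenvalues:", np.round(pca_eigvals, 3))
print("EFA eigenvalues:", np.round(efa_eigvals, 3))
print("Variance analyzed by PCA:", np.round(pca_eigvals.sum(), 2))
print("Variance analyzed by EFA:", np.round(efa_eigvals.sum(), 2))
```

The PCA eigenvalues add up to the number of variables, while the EFA eigenvalues add up to the smaller total of the communality estimates, because the unique variance has been set aside.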

How Factor Analysis Is Similar to Principal Component Analysis

Both factor analysis and principal component analysis can be used as dimension reduction techniques, meaning they help condense a large set of variables into a smaller, more manageable number. This is why both methods are widely used when analyzing data from multi-item questionnaires in the social sciences. In practice, they allow researchers to reduce a large number of survey questions into a smaller set of meaningful scales or dimensions.

At this stage, you should have a clear picture of what factor analysis is, why it is used, and how it differs methodologically and conceptually from PCA.

With the theoretical foundation and comparative understanding now established, we are ready to move into the practical phase. In the next part of this series, we will walk through a hands-on demonstration using the EFA procedure to perform Exploratory Factor Analysis.

References:

SAS Documentation
O’Rourke, Norm, and Larry Hatcher. A Step-by-Step Approach to Using SAS® for Factor Analysis and Structural Equation Modeling, Second Edition.
