Principal component analysis (PCA) is a widely used technique for summarizing and reducing the dimension of multivariate data. The principal components (PCs) are new variables constructed as weighted linear combinations of the original variables. These components, particularly the first few, can summarize the multivariate correlation structure of the original data, usually with far fewer components than the number of original variables. Researchers may be interested in determining whether the data structure summarized by the PCs changes across groups or treatments. This can be assessed by comparing the direction of components as described by the variable loadings. In this post, I’ll demonstrate how to create bootstrap confidence intervals on PC loadings to test hypotheses about changes in data structure across groups.
Data summarization using PCA
Principal components can summarize the data structure and the correlations among many quantitative variables. The first component explains the most variation in the original data set. It represents the direction in the data where the most variation exists, i.e., it has the largest variance (eigenvalue) of any possible linear combination of the original variables. Because the first few PCs tend to explain most of the variation, the remaining PCs can often be discarded with minimal loss of information. The assumption is that early components summarize meaningful information, while variation due to noise, experimental error, or measurement inaccuracies is contained in later components. For a brief introduction to PCA, please see my previous post How many principal components should I keep? Part 1: common approaches.
If we assume that the first PC summarizes the most meaningful correlation structure in the data, researchers may be interested in determining whether the first component changes across groups of interest. A component can change in size or in direction. The size of a component is measured by the magnitude of its eigenvalue. The direction of a component can be described by its vector of variable coefficients, also called loadings. This post focuses on assessing changes in the direction of PC1, but the methodology could easily be adapted to assessing changes in eigenvalues as well.
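To make the eigenvalue/loading distinction concrete, here is a minimal Python sketch (the post's own code is SAS; the data below are hypothetical synthetic values, not the iris measurements) that extracts both from a correlation matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical synthetic data: 50 observations of 4 correlated variables,
# standing in for a set of morphological measurements
X = rng.multivariate_normal(
    mean=np.zeros(4),
    cov=[[1.0, 0.6, 0.5, 0.4],
         [0.6, 1.0, 0.5, 0.4],
         [0.5, 0.5, 1.0, 0.4],
         [0.4, 0.4, 0.4, 1.0]],
    size=50,
)

# PCA via eigendecomposition of the correlation matrix:
# eigenvalues measure component size, eigenvectors (loadings) give direction
corr = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

pc1_size = eigvals[0]                      # size of PC1 (its eigenvalue)
pc1_loadings = eigvecs[:, 0]               # direction of PC1 (its loadings)
```

The loadings form a unit-length vector, and the eigenvalues sum to the number of variables when PCA is based on correlations.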
There is no commonly used formal test for changes in PC direction, but one can be constructed by putting confidence intervals (CIs) on the PC loadings. The idea here is that we want to distinguish meaningful differences in the direction of a PC across groups from differences due to random chance. If the PC1 loadings for group A fall outside the confidence intervals for the PC1 loadings of group B, the pattern of covariance has changed significantly across these groups. These confidence intervals can be constructed through bootstrapping.
Bootstrap confidence intervals
Bootstrapping is a resampling technique often used to construct confidence intervals on statistics of interest. Unlike traditional confidence intervals, bootstrap CIs require no assumptions about the underlying distribution of the data, such as normality. Bootstrap data sets are replicate random samples drawn with replacement from the original data. Some observations will show up several times in a bootstrap data set, others not at all. Resampling is typically done many times, creating say 100-1000 replicate data sets, and the statistic of interest is calculated on each replicate. This produces an empirical distribution of the statistic, and a 95% confidence interval can be constructed by finding the 2.5th and 97.5th percentiles of this distribution.
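The percentile bootstrap is language-agnostic; as a minimal Python sketch (using a hypothetical sample and the sample mean as the statistic of interest), the recipe above looks like this:

```python
import numpy as np

rng = np.random.default_rng(12538)
data = rng.normal(loc=10, scale=2, size=50)   # hypothetical sample, n=50

n_reps = 1000
boot_stats = np.empty(n_reps)
for r in range(n_reps):
    # draw n observations WITH replacement: some rows repeat, others are absent
    sample = rng.choice(data, size=data.size, replace=True)
    boot_stats[r] = sample.mean()             # statistic of interest

# 95% percentile CI: 2.5th and 97.5th percentiles of the bootstrap distribution
lower, upper = np.percentile(boot_stats, [2.5, 97.5])
```

The same loop works for any statistic; the rest of this post simply replaces the mean with PC1 loadings.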
To illustrate creating bootstrap CIs on eigenvector loadings, I used the SAS 9 procedure PROC PRINCOMP to analyze Fisher’s famous iris data set (sashelp.iris). These data contain 4 variables measuring floral morphology for 3 species of iris: sepal length, sepal width, petal length, and petal width. Fifty flowers each of Iris setosa, Iris virginica, and Iris versicolor were measured, for a total of n=150.
Research on iris evolution suggests that I. virginica and I. versicolor are more recently derived species compared to the more ancestral I. setosa. I thought it would be interesting to see if the structure of PC1 has changed in Virginica and Versicolor relative to Setosa in these data. To assess this, I computed bootstrap confidence intervals on the Setosa floral trait PC1 loadings and compared them with the directions of PC1 in the other two species. If the PC1 loadings of Virginica and Versicolor are not within the 95% CIs of Setosa, we can say that the structure of the floral trait correlations differs significantly between the species.
First, I subset the data to the Setosa species. Next, the SURVEYSELECT procedure was used to create 1000 bootstrap samples. The SURVEYSELECT documentation refers to bootstrapping as “unrestricted random sampling,” and bootstrapping is achieved through the METHOD=URS option. While many of the options have obvious functions, the OUTHITS option is necessary for getting the correct sample size in each replicate (N=50). Omitting OUTHITS results in one row in the output data per unique row sampled from the input data (i.e., the duplicate rows sampled with replacement are left out).
Eigenvectors were calculated for the 1000 bootstrap replicates by PROC PRINCOMP. PROC UNIVARIATE was used to find the lower and upper 95% confidence interval limits, which were estimated as the 2.5th and 97.5th percentiles of the distributions for each component. Here is the code:
proc princomp data=sashelp.iris;
   by species;
   var PetalLength PetalWidth SepalLength SepalWidth;
   ods select Eigenvectors;
run;

data setosa versicolor virginica;
   /* only used Setosa, but can create CI for any of these species */
   set sashelp.iris;
   if species="Setosa" then output setosa;
   else if species="Versicolor" then output versicolor;
   else output virginica;
run;

%let reps=1000;

proc surveyselect data=setosa method=urs reps=&reps seed=12538
                  out=setosa_reps sampsize=50 outhits;
   /* without OUTHITS option, you can get smaller samples than you want */
run;

ods select none;
proc princomp data=setosa_reps;
   by replicate;
   var PetalLength PetalWidth SepalLength SepalWidth;
   ods output Eigenvectors=ev_setosa;
run;

proc sort data=ev_setosa out=ev_setosa_sort;
   by variable;
run;

proc univariate data=ev_setosa_sort;
   by variable;
   var prin1 prin2 prin3 prin4;
   output out=SetosaPCAbootCI pctlgroup=byvar pctlpts=2.5 97.5
          pctlpre=prin1_ prin2_ prin3_ prin4_ pctlname=P025 P975;
run;
ods select all;

proc print data=SetosaPCAbootCI;
   title 'Setosa bootstrap 95% CI on loadings';
   var variable prin1_P025 prin1_P975;
run;
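For readers who want a non-SAS reference point, the same resample-recompute-percentile workflow can be sketched in Python. The 50 x 4 data set below is synthetic and merely stands in for the Setosa measurements; the sign-alignment step guards against the eigenvector reversals discussed later in this post:

```python
import numpy as np

rng = np.random.default_rng(12538)
# Hypothetical 50 x 4 data set standing in for the Setosa floral traits
X = rng.multivariate_normal(
    np.zeros(4),
    [[1.0, 0.7, 0.3, 0.3],
     [0.7, 1.0, 0.3, 0.3],
     [0.3, 0.3, 1.0, 0.6],
     [0.3, 0.3, 0.6, 1.0]],
    size=50)

def pc1_loadings(data):
    """First eigenvector of the correlation matrix (largest eigenvalue)."""
    eigvals, eigvecs = np.linalg.eigh(np.corrcoef(data, rowvar=False))
    return eigvecs[:, np.argmax(eigvals)]

ref = pc1_loadings(X)                            # PC1 of the original sample
boot = np.empty((1000, 4))
for r in range(1000):
    idx = rng.integers(0, len(X), size=len(X))   # resample rows with replacement
    v = pc1_loadings(X[idx])
    # align sign with the original PC1 so reversed replicates don't distort the CI
    boot[r] = v if v @ ref >= 0 else -v

# 95% CI limits for each of the four loadings
lower, upper = np.percentile(boot, [2.5, 97.5], axis=0)
```

Each column of `boot` holds the bootstrap distribution for one variable's PC1 loading, mirroring what PROC UNIVARIATE summarizes per `variable` BY group above.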
Below are the PC1 loadings for the three iris species. Keep in mind that 4 PCs are constructed for each species, but we are only viewing the first eigenvector for each.
And here are the lower and upper confidence interval limits for the variable loadings of the Setosa PC1:
What we see above is that PC1 points in a different direction in both Versicolor and Virginica than it does in Setosa. The loadings for petal length in Versicolor (0.53) and Virginica (0.55) are each outside the 95% confidence interval for Setosa (0.02, 0.51). Versicolor differs significantly from Setosa in all of the loadings, while Virginica differs in 2 of the 4 loadings.
Another thing we can see from the Setosa confidence intervals is that none of the intervals contain zero. This means that each variable contributes significantly to the first principal component. If all confidence intervals contained zero, that is, if there were no significant loadings, then we could conclude that this component was entirely due to noise and the axis is not meaningful. If only one Setosa loading were significant, the PC axis would not be summarizing correlation structure among variables and again would not be meaningful. In such a situation, the component would depend on only a single variable. This can happen when the PCA is based on covariances instead of correlations and the original variables have very different scales.
Use caution with this approach to generating confidence intervals for eigenvector loadings. Two situations can complicate interpretation of the confidence intervals. First, if two components (e.g., PC1 and PC2) explain a similar amount of variability, they can swap positions in bootstrap replicates. The confidence intervals on the PC1 loadings would then be based on some PC2 vectors mixed in among the replicates. I imagine this could be more likely if the PCs are summarizing more noise than signal, so make sure to address which PC axes are significant before testing for changes in direction. For an approach to determining which component axes are significant, please see my previous post How many principal components should I keep? Part 2: randomization-based significance tests.
Second, it is possible for a component to reverse direction in replicates, essentially having nearly the same loadings in magnitude but multiplied by negative one. Reversals increase the chance that the loading CIs include zero, despite the eigenvector pointing along the same axis. If this is a concern, searching for negative correlations between bootstrap replicate components and the original PC can reveal whether this problem exists.
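One simple check for reversals, sketched in Python with hypothetical loading vectors: a negative dot product (or correlation) between a replicate eigenvector and the original PC1 flags a reversed replicate, which can be flipped before the percentiles are taken.

```python
import numpy as np

original = np.array([0.50, 0.50, 0.50, 0.50])       # hypothetical full-sample PC1
replicate = np.array([-0.48, -0.52, -0.49, -0.51])  # a reversed bootstrap PC1

# Negative dot product => the replicate points the opposite way along the same axis
reversed_flag = bool(replicate @ original < 0)

# Fix: multiply the replicate by -1 so all replicates share one orientation
aligned = -replicate if reversed_flag else replicate
```

Applying this alignment to every replicate keeps the bootstrap distribution of each loading on one side of zero when the underlying axis is stable.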
It is not difficult to assess changes in PCA loadings through SAS programming. The bootstrapped CIs created here are not limited to use in PCA; they can be used for nearly any statistic without making distributional assumptions. Hopefully this approach will be a useful tool to have in your statistical toolkit.
Further reading
My previous posts on determining how many PCs to retain for analyses:
How many principal components should I keep? Part 1: common approaches
How many principal components should I keep? Part 2: randomization-based significance tests
Rick Wicklin has a great series of SAS blogs on bootstrapping here: The essential guide to bootstrapping in SAS - The DO Loop
For more information on PCA and other multivariate techniques, try the SAS course Multivariate Statistics for Understanding Complex Data. You can access this course as part of a SAS Learning Subscription.
Find more articles from SAS Global Enablement and Learning here.