Principal component analysis (PCA) is an analytical technique for summarizing the information in many quantitative variables. It is often used for dimension reduction: all but the first few principal component axes are discarded, because those first few usually explain most of the variability in the original variables. How do we decide how many principal component axes to keep for further analysis? In a previous post (How many principal components should I keep? Part 1: common approaches), I described some common approaches. Those approaches are all subjective, and they cannot distinguish components that summarize meaningful correlations among variables from components that capture only sampling error. In this post, I will demonstrate how to construct a significance test for the number of PC axes to retain that avoids these limitations.
How many components to retain from PCA?
In a previous post (How many principal components should I keep? Part 1: common approaches), I described some of the common approaches to determining the number of PC axes to retain. These included keeping components that account for a set percentage of the variability, retaining components with eigenvalues greater than 1, and keeping components up to the "elbow" of a Scree plot. These approaches are easy to implement and may be adequate for many research goals. But each of them is somewhat subjective, and, more importantly, none can distinguish principal components that represent true correlations among variables from components that capture only sampling error. For example, if we create data made of pure noise with no true correlation among variables, PCA will still produce eigenvalues and eigenvectors.
To illustrate this, I ran PCA on randomly generated data with no structure. The code below creates a data set randnumbers (n=100) with four variables, X1 through X4, each taking integer values between 1 and 100. The variables are generated independently, with no underlying structure or associations among them; any correlation among them is due purely to sampling error (random chance). I then used PROC PRINCOMP to carry out PCA on the randnumbers data.
data randnumbers;
   call streaminit(99);              /* set the random number seed */
   do i=1 to 100;                    /* generate 100 observations */
      x1=rand("integer", 1, 100);    /* independent random integers 1-100 */
      x2=rand("integer", 1, 100);
      x3=rand("integer", 1, 100);
      x4=rand("integer", 1, 100);
      output;
   end;
run;

proc princomp data=randnumbers;
   var x1 x2 x3 x4;
run;
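As a quick check that the simulated variables really are unrelated, you can run PROC CORR; every pairwise correlation should be small and attributable purely to sampling error:

/* optional sanity check: pairwise correlations should be near zero */
proc corr data=randnumbers;
   var x1 x2 x3 x4;
run;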
The PCA on the random data produced the following eigenvalues and Scree plot:
Retaining PCs that explain at least 75% of the variation would mean keeping PC1 through PC3, which collectively explain almost 80% of the variability in the original variables. Retaining components with eigenvalues greater than 1 would treat only PC1 as meaningful. The Scree plot "elbow" method suggests retaining the first 2 PCs. But we know that none of these components represents true structure in the underlying data.
Randomization tests to determine the significant number of principal components
An alternative approach is to use a randomization test to determine the number of principal components to keep. The idea is to distinguish meaningful PCs from noise: we want to know whether each PC explains more variability than would be expected from comparable data with no real structure and no real correlations. To do this, the values of each variable are randomly shuffled independently of the other variables, and the PCA is recalculated for the new, structureless data set. These permuted data sets retain all the original values, but the values of each variable are randomized relative to the values of the others. This randomization followed by PCA is repeated many times (say, 100 to 1000 times), and the eigenvalues from each randomly permuted data set are saved.
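To make the shuffling step concrete, here is a minimal sketch of one way to permute a single variable independently of the others, using a random sort key on the randnumbers data from earlier (the data set names keyed, shuffled_x1, and permuted are just illustrative; the iris implementation below uses PROC PLAN instead):

/* attach a random sort key to x1 only */
data keyed;
   set randnumbers(keep=x1);
   u=rand("uniform");
run;

/* sorting by the key puts the x1 values in random order */
proc sort data=keyed out=shuffled_x1(drop=u);
   by u;
run;

/* recombine: x1 is now shuffled relative to x2-x4 */
data permuted;
   merge randnumbers(drop=x1) shuffled_x1;
run;

Repeating this for each variable and rerunning PROC PRINCOMP on the result gives one draw from the null distribution of eigenvalues.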
This produces a null distribution of eigenvalues that can be used to test the hypothesis that the original data have no structure, that is, that the correlations summarized by the components are due entirely to sampling error. To carry out the test using alpha=0.05, the first eigenvalue of the real-data PCA is compared to the 95th percentile of the null distribution of first eigenvalues. If it is smaller than that 95th percentile, the first PC is consistent with sampling error rather than true structure in the original data, and none of the PC axes should be retained. If the first eigenvalue is greater than 95% of the null distribution, the first PC represents significantly more correlation among the variables than you would expect from random chance. If the first PC is significant, proceed to testing PC2 the same way, and so on, stopping at the first non-significant component. The test can be constructed with other alphas, such as alpha=0.01, by comparing to the 99th percentile of the null distribution.
Here is how I conducted a randomization test with the PCA on Fisher's iris data. These data include 4 variables measuring floral morphology (sepal length, sepal width, petal length, and petal width) for 3 species of iris. Fifty flowers of each species were measured, for a total of n=150. The data are included with SAS as sashelp.iris.
First, I added a variable called sort to the data, which takes the consecutive values 1 through 150, and then used PROC PRINCOMP to carry out PCA. The results were shown in my previous post (How many principal components should I keep? Part 1: common approaches).
In the macro program irisRand, I used PROC PLAN to randomly shuffle the integer values of sort while keeping the original order of the rest of the variables. The output data were sorted by sort, and only sort and one floral variable were retained. This was repeated for each of the 4 floral variables, and the randomly shuffled variables were merged by the macro program null_dist_PCA. PROC PRINCOMP was run on the randomized data, and the eigenvalues were saved, transposed, and appended to a data set null_EVs. This was repeated 1000 times, and PROC UNIVARIATE was used to find the 99th and 95th percentiles of the null distribution of the eigenvalue for each of the 4 principal components.
Here is the SAS code:
/* create variable sort for randomization */
data iris;
   set sashelp.iris;
   sort=_n_;
run;

/* find eigenvalues for original iris data */
proc princomp data=iris;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;
/* randomize the order of the sort variable */
%macro irisRand (trait=, traitno=);
   proc plan;
      factors sort=150/noprint;
      output data=iris out=randomized;
   run;
   /* sort by sort number to randomize the rows; drop everything but one trait */
   proc sort data=randomized out=random&traitno (keep=&trait sort);
      by sort;
   run;
%mend irisRand;
/* randomize values of 4 traits and merge the data sets,
   carry out PCA and save eigenvalues, transpose and append
   eigenvalues, repeat &iter=1000 times */
%macro null_dist_PCA (iter=);
   %local j;
   %do j=1 %to &iter;
      %irisRand (trait=sepallength, traitno=1);
      %irisRand (trait=sepalwidth, traitno=2);
      %irisRand (trait=petallength, traitno=3);
      %irisRand (trait=petalwidth, traitno=4);

      /* merge the independently shuffled traits into one data set */
      data randomized_iris;
         merge random1-random4;
         drop sort;
      run;

      /* PCA on the structureless data; save only the eigenvalues */
      ods select none;
      proc princomp data=randomized_iris;
         var SepalLength SepalWidth PetalLength PetalWidth;
         ods output Eigenvalues=ev(keep=eigenvalue);
      run;
      ods select all;

      /* one row per iteration: eigen1-eigen4 */
      proc transpose data=ev out=ev_transposed (drop=_name_) prefix=eigen;
      run;
      proc append base=null_EVs data=ev_transposed;
      run;
   %end;
%mend null_dist_PCA;
/* redirect the log to a file to prevent overflow of the log window */
proc printto log='randomization_log.txt' new;
run;

%null_dist_PCA (iter=1000);

/* reset the log to its original destination */
proc printto log=log;
run;

/* 99th and 95th percentiles of the null distribution of each eigenvalue */
proc univariate data=null_EVs noprint;
   var eigen1-eigen4;
   output out=crit pctlgroup=byvar pctlpts=99 95
          pctlpre=eigenvalue1_ eigenvalue2_ eigenvalue3_ eigenvalue4_
          pctlname=P99 P95;
run;

proc print data=crit;
run;
Here are the eigenvalues from the original iris data:
And here are the 99th and 95th percentiles of the null distribution for each of the four eigenvalues:
So, using the 99th percentile (or the 95th) as the cutoff for statistical significance, only the first eigenvalue is significant. That is, we cannot reject the null hypothesis that the correlations summarized by PC2 through PC4 are due to random noise. This is a different result than we would get by retaining PCs that explain at least 75% of the variability in the original variables. And importantly, it is less subjective than several other approaches (although the choice of alpha is itself subjective). Of course, the decision about which method to use will depend on the specifics of your research question. Regardless of your favorite method, it can be useful to have this tool in your statistical toolkit.
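If you would rather automate the comparison than read the percentile table by eye, here is a minimal sketch of the sequential test. It assumes the observed eigenvalues were saved by adding ods output Eigenvalues=obs_ev; to the original PROC PRINCOMP step (obs_ev, obs_t, and compare are illustrative names, not part of the code above), and it uses the percentile variable names created by the PROC UNIVARIATE step (eigenvalue1_P95, and so on):

/* put the 4 observed eigenvalues in one row: obs1-obs4 */
proc transpose data=obs_ev out=obs_t(drop=_name_) prefix=obs;
   var eigenvalue;
run;

/* compare each observed eigenvalue to its null 95th percentile,
   stopping at the first non-significant component */
data compare;
   merge obs_t crit;
   array obs{4} obs1-obs4;
   array p95{4} eigenvalue1_P95 eigenvalue2_P95
                eigenvalue3_P95 eigenvalue4_P95;
   do k=1 to 4;
      if obs{k} > p95{k} then
         put "PC" k "is significant at alpha=0.05";
      else do;
         put "PC" k "and beyond are not significant";
         leave;
      end;
   end;
run;

The messages are written to the log; you could equally well keep the flags as variables in the compare data set.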
For more information on PCA and other multivariate techniques, try the SAS course Multivariate Statistics for Understanding Complex Data. You can access this course as part of a SAS learning subscription (linked below).
See you at the next SAS class!
Links:
Course: Multivariate Statistics for Understanding Complex Data (sas.com)
SAS Learning Subscription | SAS
Find more articles from SAS Global Enablement and Learning here.