
How many principal components should I keep? Part 1: common approaches


Principal component analysis (PCA) is an analytical technique for summarizing the information in many quantitative variables. Often it is used for dimension reduction, by discarding all but the first few principal component axes.  The first few components usually explain most of the variability in the original variables.  How can we determine the number of principal component axes to keep?  There are several commonly used approaches, which I will describe in this post.  The number of axes retained often depends on the percentage of variability explained, on which component axes have eigenvalues greater than 1, or on visual inspection of a Scree plot.   In a follow-up post, I will describe how to construct a significance test for the number of principal component axes to retain for further analysis.

 

 

What is PCA?

 

PCA is a widely used summarization and dimension reduction technique for multivariate data. It starts with a set of original variables to be summarized.  New derived variables are constructed as weighted linear combinations of these original variables. These derived variables are called principal components (PCs), and the number of PCs created is the same as the number of original variables.

 

For example, let’s look at a scatter plot of the heights and weights of 19 students from the sashelp.class data set:

 

[Figure: scatter plot of height vs. weight for the 19 students in sashelp.class]


 

Height and weight are positively correlated.  A principal component analysis of these data would result in 2 principal component axes, the first going through the widest spread of the scatterplot and the second axis perpendicular to the first:

 

[Figure: the height-weight scatter plot with the two principal component axes drawn in]

 

These new variables, the PC axes, are functions of the original variables, in this case PC1 = 0.71*height + 0.71*weight and PC2 = 0.71*height - 0.71*weight.  When the PCs are functions of only a few variables, as in this case, they can be interpreted.  The first PC, which is based on positive weights, or loadings, on the height and weight variables, could be interpreted as a measure of overall size or stature. In one sense, PC1 could be considered the “real” variable, and height and weight could be considered two ways of measuring overall size.  The second PC could potentially describe how slim vs. stocky the students are for their given size, but we can’t see that directly from the data. When the PCs are functions of more than a few variables, they generally become uninterpretable.
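As a minimal sketch (not from the original post), these scores can be computed by hand by standardizing height and weight and applying the loadings quoted above. The data set names class_std and class_scores are my own; PROC PRINCOMP's OUT= option produces the same scores automatically (up to rounding of the loadings).

/* Standardize height and weight; correlation-based PCA works with standardized variables */
proc standard data=sashelp.class mean=0 std=1 out=class_std;
   var height weight;
run;

/* Apply the loadings quoted above to get the two PC scores */
data class_scores;
   set class_std;
   pc1 = 0.71*height + 0.71*weight;  /* overall size */
   pc2 = 0.71*height - 0.71*weight;  /* slim vs. stocky for a given size */
run;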

 

With a third variable such as age included in the data, a third PC would be produced. This third PC axis would be perpendicular to the first two.  When PCs are used for further analysis, usually only the first few PCs are retained, as they explain most of the variability in the original predictors. Running PCA on height, weight, and age shows that the first two components explain over 96% of the variation in the three original variables.  This is the basis for using PCA for dimension reduction: there is minimal loss of information when PC3 is not used in future analyses.
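The output shown below can be reproduced with a call along these lines (a sketch analyzing the three numeric variables from sashelp.class):

/* PCA of height, weight, and age from sashelp.class */
proc princomp data=sashelp.class;
   var height weight age;
run;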

 

[Figure: PROC PRINCOMP eigenvalue output for height, weight, and age from sashelp.class]

 

Dimension reduction through PCA can be an early step in developing machine learning models.  For data scientists with many predictors, PCA can be used to replace many of the original inputs with relatively few PCs.  Additionally, using PCs in place of the original inputs has the added benefit of removing any potential collinearity problems.  Collinearity describes strong correlations among sets of predictors, and it can lead to unstable parameter estimates with large standard errors. This is not a problem with PCs because they are constructed in a way that ensures all PCs are uncorrelated.
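As a hedged sketch of this workflow (not from the original post; the data set mydata, the inputs x1-x20, and the response y are hypothetical names), the OUT= option of PROC PRINCOMP writes the component scores to a data set that can then feed a downstream model:

/* Keep only the first two components and write their scores (Prin1, Prin2) to pc_scores */
proc princomp data=mydata n=2 out=pc_scores;
   var x1-x20;
run;

/* Use the PC scores in place of the original inputs */
proc reg data=pc_scores;
   model y = Prin1 Prin2;
run;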

 

 

How principal components are constructed

 

The first principal component, PC1, is constructed such that it has the greatest variance of any possible linear combination of the original variables.  Further PCs are constructed such that PC2 accounts for the second greatest proportion of the variability among the original variables, PC3 accounts for the third greatest proportion, and so on.  The variances of the PCs are called eigenvalues.  The PCs have the property that the correlation between any two principal components is exactly zero.
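These properties are easy to verify empirically. As an illustrative sketch (not from the original post; the data set name class_pc is my own), the covariance matrix of the scores that PROC PRINCOMP writes out should show the eigenvalues on its diagonal and correlations of zero between components:

/* Write the component scores (Prin1-Prin3) for height, weight, and age to a data set */
proc princomp data=sashelp.class out=class_pc;
   var height weight age;
run;

/* The variances of the scores equal the eigenvalues, and the correlations between scores are zero */
proc corr data=class_pc cov;
   var Prin1-Prin3;
run;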

 

PCs can be constructed based on correlations among variables or based on covariances.  Using correlations is a much more common approach.  PCA based on covariances is highly sensitive to the scaling of variables.  The original variable with the greatest variance will typically be associated with the first PC axis to the exclusion of all other variables.  To remove the effect of scale, correlations are preferred unless the original variables are already on a similar scale.  Principal component analysis can be carried out programmatically using the SAS 9 procedure PROC PRINCOMP or the SAS Viya procedure PROC PCA.
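As a syntax note, PROC PRINCOMP uses the correlation matrix by default; the COV option requests a covariance-based analysis. A minimal sketch using sashelp.class, where weight has much the largest variance and so would dominate the first covariance-based component:

/* Covariance-based PCA; omit the COV option to use correlations (the default) */
proc princomp data=sashelp.class cov;
   var height weight age;
run;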

 

Because the first few PCs tend to explain the majority of variation, often the remaining PCs can be discarded with minimal loss of information.  The assumption is that the early components summarize meaningful information, while variation due to noise, experimental error, or measurement inaccuracies is contained in the later components.  When PCs are discarded in this manner, the number of dimensions (variables) can be greatly reduced. And unlike the original variables, the PCs cannot have collinearity problems, since they are perfectly uncorrelated.  But how many of the PCs should be kept for further analysis? If too few components are retained, future analyses using the components will suffer from a loss of relevant information.  If too many components are retained, the additional noise can distort the underlying pattern of correlations among variables that the components are supposed to summarize.

 

 

How many PCs to retain?

 

One method to decide how many PCs to retain for analysis is to pick a proportion of variance to explain that sounds reasonable.  Would you be content to keep PCs until 75% of the variability in the original variables is accounted for?  How about at least 90%? You may not even need to decide in advance; you can simply look at your PCA results and pick the number of PC axes that preserve most of the original variability. While arbitrary, this may be reasonable for some research goals.

 

Another approach is to retain all PCs that have eigenvalues greater than 1.  When PCA is based on correlations, the sum of the variances of the PCs (the eigenvalues) equals the number of predictors being summarized.  This means the average of the eigenvalues equals 1. So, using eigenvalue > 1 as a threshold for considering a PC to be meaningful amounts to keeping all PCs that account for more variability than the average.
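Both of these rules can also be applied programmatically. As a sketch (not from the original post; the data set name eig is my own, and I am assuming the Eigenvalue and Cumulative column names of PROC PRINCOMP's Eigenvalues ODS table), the eigenvalue table can be captured with ODS OUTPUT and the criteria counted in a DATA step:

/* Capture the eigenvalue table from PROC PRINCOMP */
ods output Eigenvalues=eig;
proc princomp data=sashelp.class;
   var height weight age;
run;

/* Count components by the 75%-of-variance rule and the eigenvalue>1 rule */
data _null_;
   set eig end=last;
   retain n75;
   if n75 = . and Cumulative >= 0.75 then n75 = _n_;   /* first PC reaching 75% cumulative variance */
   if Eigenvalue > 1 then ngt1 + 1;                     /* eigenvalue-greater-than-1 rule */
   if last then put "PCs needed for >=75% of variance: " n75
                    "  PCs with eigenvalue > 1: " ngt1;
run;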

 

A third approach involves examining a plot of the variance of each PC vs. the PC number, called a Scree plot. “Scree” means an accumulation of rocks at the base of a mountain or cliff.  If you picture the profile of a cliff bending to meet the ground, this is the typical shape of a Scree plot and where it gets its name (thanks, Wikipedia).   Typically, the curve connecting the components will have a bend, making an “elbow” shape.  The point at which the curve flattens out indicates the maximum number of PCs to retain, and those beyond the elbow are discarded.  What’s the justification for this? Recall that the eigenvalues decrease sequentially from the first component to the last.   So, the Scree plot typically shows a steep decline that asymptotically approaches zero. Where the graph straightens out, the components are explaining a relatively small proportion of the variance.
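In PROC PRINCOMP, the PLOTS=SCREE option requests the Scree plot explicitly (a minimal sketch; the iris example below uses PLOTS=ALL, which includes it):

/* Request the Scree (eigenvalue) plot */
proc princomp data=sashelp.class plots=scree;
   var height weight age;
run;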

 

To illustrate these approaches, I used the SAS 9 procedure PROC PRINCOMP to analyze Fisher’s famous iris data set.  These data contain 4 variables measuring floral morphology for 3 species of iris.  The variables are sepal length, sepal width, petal length, and petal width.  Fifty flowers of each of 3 iris species were measured for a total of n=150. These data are included with SAS in sashelp.iris.

 

/* PCA of the four iris measurements; PLOTS=ALL requests the Scree plot and other graphs */
proc princomp data=sashelp.iris plots=all;
   var _numeric_;
run;

 

The eigenvalues table shows the variances (eigenvalues) of the four principal components, along with their differences, their proportion of the total variance, and the cumulative variance explained:

 

[Figure: eigenvalues table from PROC PRINCOMP for the iris data]

 

If we want to keep components that represent at least 75% of the variability, the first two PCs would be retained. These account for over 95% of the variability in the original 4 predictors.  If instead only PCs with eigenvalues greater than 1 were retained, only PC1, with an eigenvalue of 2.9, would be kept.  All remaining PCs have eigenvalues below 1.

 

The Scree plot shows a steep decline in eigenvalue from PC1 to PC2 and a flattening out of the curve between PC3 and PC4:

 

[Figure: Scree plot for the iris principal components]

 

Depending on where one sees the elbow, examining this Scree plot suggests keeping 3 or possibly only 2 PCs for further analysis. Note that some authors suggest discarding the elbow point as well and only retaining the PCs before the bend.

 

There are several other graphs that can be produced to help interpret the principal components.  To learn more generally about the graphical output of PROC PRINCOMP, see Rick Wicklin’s blog How to interpret graphs in a principal component analysis - The DO Loop (sas.com).

 

An advantage of these approaches is that they are all easy to implement.  If the goal is dimension reduction, any of these approaches may be reasonable.  But when the focus of the PCA is summarizing the data with meaningful axes of variation, the choices presented here might not be satisfactory.  They rely on subjective, somewhat arbitrary choices.  An additional downside of these methods is that none of them consider that the correlation captured by the principal components could be entirely due to sampling error.   In my next post, I’ll demonstrate an approach for constructing a test for significant principal components that can overcome these limitations.

 

For more information on PCA and other multivariate techniques, try the SAS course Multivariate Statistics for Understanding Complex Data.  You can access this course as part of a SAS learning subscription (linked below).

 

See you at the next SAS class!

 

 

Links:

 

How to interpret graphs in a principal component analysis - The DO Loop (sas.com).

 

Course: Multivariate Statistics for Understanding Complex Data (sas.com)

 

SAS Learning Subscription | SAS

 

 

Find more articles from SAS Global Enablement and Learning here.
