# The Principal Components of Principal Component Analysis

Posted on 05-05-2017 02:09 PM


I have been intrigued and fascinated by Principal Component Analysis for some time, but haven’t had the need to really learn it (or so I thought).  I have recently been moved into another position at work, and lo and behold my boss and I were talking and she said that the data we’re looking at might benefit from PCA.  So, I decided it was time I buckle down and get a handle on this topic.

First, a brief definition, coming from the SAS Documentation on PCA:

[PCA]…is appropriate when you have obtained measures on a number of observed variables and wish to develop a smaller number of artificial variables (called principal components) that will account for most of the variance in the observed variables…

OK, so far so good!  The next section explains how PCA reduces redundancy among the variables, where redundant means some of the variables are correlated with one another, possibly because they are measuring the same thing.  The SAS documentation illustrates variable redundancy with a 7-item job satisfaction questionnaire; if you’ve ever taken one, you know the types of questions (“My pay is fair”, “My supervisor treats me with consideration”, etc.).

On closer inspection, you realise that Questions 1 – 4 all deal with satisfaction with the supervisor, and 5 – 7 deal with the employee’s feelings about their pay.  If we run a correlation, we may see a very clear split between the two groups of questions, indicating variable redundancy.

So, let’s look at an example using some open data I found.

## Get the Data

The data covers victims of crime, taken from the FBI database, and can be found here.

## How to go about getting SAS University Edition

If you don’t already have University Edition, get it here and follow the instructions in the PDF carefully. If you need help with almost any aspect of using University Edition, check out these video tutorials. Additional resources are available in this article.

I had to do some cleanup to the data (removing footnotes, unmerging cells, deleting the rows / column with totals, and moving the “Crimes Against Property” and “Crimes Against Person” to a new column).   As usual, I used the Data Import task to bring the data into SAS University Edition.

## The Results

The code itself is fairly straightforward:

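```sas
/* PCA on all numeric variables in WORK.IMPORT; request score plots
   with a prediction ellipse for the first three components */
proc princomp data=work.import plots=score(ellipse ncomp=3);
  id type;  /* label the plotted points by crime type */
run;
```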

When I run it, I get a bunch of tables and graphs and I’m going to go through what I think appear to be the key ones.  The first is the Correlation Matrix:

This shows us the correlations between the original variables.  Before conducting a PCA, you want to check the correlations between the variables.  If any of the correlations are too high (greater than 0.9), you may need to remove one of the variables from the analysis, as the two variables seem to be measuring the same thing.  If the correlations are too low, say below 0.1, then one or more of the variables might be its own Principal Component, which does not help us reduce the number of variables.  In our example, we should probably remove the “Unknown” age category, as its extremely high correlation with Adult will cause us problems.
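The check described above can also be run on its own before the PCA. This is only a sketch: the variable names `Adult`, `Juvenile`, and `Unknown` are assumptions (the article does not list the exact column names), so substitute whatever the Data Import task created in `work.import`.

```sas
/* Sketch only: Adult, Juvenile and Unknown are assumed variable names */
proc corr data=work.import nosimple;
  var Adult Juvenile Unknown;
run;

/* If Unknown and Adult correlate above roughly 0.9,
   one option is to rerun the PCA without Unknown */
proc princomp data=work.import plots=score(ellipse ncomp=2);
  var Adult Juvenile;
  id type;
run;
```

With only two variables left there are at most two components, which is why `ncomp=2` is used in the second step.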

The next table has what are called Eigenvalues for the Correlation Matrix; the first time I heard this, I was stumped – never having come across the term before, I had no idea what it was.  Basically, it’s a way to show variance – the first component in the list will always have the highest variance, followed by the second highest, etc.

I have to admit I get the concept of what an eigenvalue is, but I am not comfortable enough with these concepts to offer a solid explanation.  I am hoping someone can provide some assistance in the comments, because I would really like to better understand this concept.  I am also struggling to figure out what Component 1 is – based on the proportion column, it explains 87% of the variance, but I can’t figure out how to know if it’s Adult / Unknown, Adult / Child, etc.

I’m going to turn my attention to one of the graphs that PROC PRINCOMP created. It is a scatter plot with a 95% ellipse, comparing Components 1 and 2 (the percentages are the proportions of variance we saw in the eigenvalue table above).

Very clearly, Larceny/Theft and Assault are “extreme” outliers, and Sex Offenses is an outlier (within the 95% but clearly outside the cluster).  When I sort the data by Adult victims and then by Juvenile victims, it’s very apparent these three have the highest volume.

I'm sure I'm barely scratching the surface of PCA, and I look forward to seeing what comments and additional information people can post!

Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.

## Need data for learning?

The SAS Communities Library has a growing supply of free data sources that you can use in your training to become a data scientist. The easiest way to find articles about data sources is to type "Data for learning" in the communities site search field like so:

We publish all articles about free data sources under the Analytics U label in the SAS Communities Library. Want email notifications when we add new content? Subscribe to the Analytics U label by clicking "Find A Community" in the right nav and selecting SAS Communities Library at the bottom of the list. In the Labels box in the right nav, click Analytics U:

Click Analytics U, then select "Subscribe" from the Options menu.

Happy Learning!

by
on ‎05-05-2017 02:43 PM

DarthPathos wrote:

The next table has what are called Eigenvalues for the Correlation Matrix; the first time I heard this, I was stumped – never having come across the term before, I had no idea what it was.  Basically, it’s a way to show variance – the first component in the list will always have the highest variance, followed by the second highest, etc.

I have to admit I get the concept of what an eigenvalue is, but I am not comfortable enough with these concepts to offer a solid explanation.  I am hoping someone can provide some assistance in the comments, because I would really like to better understand this concept.  I am also struggling to figure out what Component 1 is – based on the proportion column, it explains 87% of the variance, but I can’t figure out how to know if it’s Adult / Unknown, Adult / Child, etc.

Your original three variables have a certain amount of variability. When you look at component 1, it captures 87% of the total variability of the original three variables. (Or another way of saying this is that if you tried to model all three of your original variables with just component 1, 87% of the variability is explained and the other 13% is the residual, unexplained by component 1). This is the same idea that any other modeling technique uses ... explaining variability.

Component 2 explains another 12.48% of the variability of the original three variables, and Components 1 and 2 together explain 99.58% of the variability of the original three variables.
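The arithmetic behind those percentages can be sketched as follows, assuming the analysis ran on the correlation matrix of $p = 3$ variables (the table title "Eigenvalues for the Correlation Matrix" suggests this). Each standardized variable has variance 1, so the eigenvalues $\lambda_i$ sum to $p$, and each proportion is an eigenvalue divided by that total:

$$
\sum_{i=1}^{p} \lambda_i = p = 3, \qquad \text{proportion}_i = \frac{\lambda_i}{p}
$$

The eigenvalues below are back-calculated from the percentages quoted in this thread, not read from the output, so treat them as approximate:

$$
\lambda_1 \approx 0.8710 \times 3 \approx 2.613, \qquad
\lambda_2 \approx 0.1248 \times 3 \approx 0.374, \qquad
\lambda_3 \approx 0.0042 \times 3 \approx 0.013
$$

Here 87.10% is the cumulative 99.58% minus the 12.48% for Component 2, and 0.42% is what remains for Component 3.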

by
on ‎05-07-2017 08:13 PM

Your uncertainty is exactly what I am confused about: I don't know how to align the "Component" with the variable.  I've done a bit of reading, including the SAS Documentation and the article by Leslie Smith, which is a great introduction to the math behind PCA, but I get lost in the formulae and math (I love math but am not a mathematician or statistician, and never took anything in University beyond Introduction to Stats).

Have a great day!

Chris

by
on ‎05-08-2017 07:57 AM

Sometimes, the language of PCA itself can be a barrier to learning, and that might be the problem here.

I don't know how to align the "Component" with the variable

I don't know what you mean by "align", but the relationship is what I described before, where the component is a linear combination of all three original variables, with the loadings/eigenvectors being the weights in the linear combinations.
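One way to look at those weights is the Eigenvectors table that PROC PRINCOMP prints by default; another is to save them to a data set. The following is a minimal sketch, not a definitive recipe: the variable names `Adult`, `Juvenile`, and `Unknown` and the output name `work.pca_stats` are assumptions made for illustration.

```sas
/* Sketch only: Adult, Juvenile and Unknown are assumed variable names,
   and work.pca_stats is an arbitrary output data set name */
proc princomp data=work.import outstat=work.pca_stats noprint;
  var Adult Juvenile Unknown;
run;

/* In the OUTSTAT= data set, _TYPE_='EIGENVAL' holds the eigenvalues and
   each _TYPE_='SCORE' row holds one eigenvector (_NAME_ = Prin1, Prin2, ...) */
proc print data=work.pca_stats noobs;
  where _TYPE_ in ('EIGENVAL', 'SCORE');
run;
```

Each `SCORE` row gives the weights applied to the standardized original variables to form that component. If the `Prin1` row showed weights of similar size on all three variables, that would suggest Component 1 reflects overall volume rather than any single age group, though that is something to confirm against the actual output.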

by
on ‎05-08-2017 09:24 PM

Hey Paige - Gotcha - this makes more sense now and I'll go back and re-read everything with this newfound understanding :-)  Thanks for clarifying and I hope you have a great evening

Chris
