I have been intrigued and fascinated by Principal Component Analysis for some time, but haven’t had the need to really learn it (or so I thought). I was recently moved into another position at work, and lo and behold, my boss and I were talking and she said that the data we’re looking at might benefit from PCA. So, I decided it was time to buckle down and get a handle on this topic.
First, a brief definition, coming from the SAS Documentation on PCA:
[PCA]…is appropriate when you have obtained measures on a number of observed variables and wish to develop a smaller number of artificial variables (called principal components) that will account for most of the variance in the observed variables…
OK, so far so good! The next section talks about how PCA is effective at reducing redundancy in the variables, where redundant means some of the variables are correlated with one another, possibly because they are measuring the same thing. The SAS documentation illustrates variable redundancy with a 7-item job satisfaction questionnaire; if you’ve ever taken one, you know the types of questions (“My pay is fair”, “My supervisor treats me with consideration”, etc.).
On closer inspection, you realise that Questions 1 – 4 all deal with satisfaction with the supervisor, and 5 – 7 deal with the employee’s feelings about their pay. If we run a correlation, we may see a very clear split between the two groups of questions, indicating variable redundancy.
So, let’s look at an example using some open data I found.
Get the Data
The data is looking at victims of crimes from the FBI Database, and can be found here.
How to go about getting SAS University Edition
If you don’t already have University Edition, get it here and follow the instructions in the PDF carefully. If you need help with almost any aspect of using University Edition, check out these video tutorials. Additional resources are available in this article.
Getting the data ready
I had to do some cleanup to the data (removing footnotes, unmerging cells, deleting the rows / column with totals, and moving the “Crimes Against Property” and “Crimes Against Person” to a new column). As usual, I used the Data Import task to bring the data into SAS University Edition.
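For reference, the Data Import task generates a PROC IMPORT step behind the scenes. A rough sketch of what that step might look like is below. The file path and file name are hypothetical, and it assumes the cleaned file was saved as an Excel workbook in the University Edition shared folder.

```sas
/* Hypothetical path and file name: adjust to wherever the cleaned file was saved */
proc import datafile="/folders/myfolders/victims.xlsx"
        dbms=xlsx
        out=work.import
        replace;
    getnames=yes;   /* use the first row as variable names */
run;

/* List the imported variables and their types before analysing them */
proc contents data=work.import;
run;
```

Checking the PROC CONTENTS output first is a quick way to confirm that the victim counts came in as numeric columns, since PROC PRINCOMP only analyses numeric variables.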
The code itself is fairly straightforward:
proc princomp data=work.import plots=score(ellipse ncomp=3);
    id type;
run;
When I run it, I get a bunch of tables and graphs and I’m going to go through what I think appear to be the key ones. The first is the Correlation Matrix:
This is showing us the correlations between the original variables. Before conducting a PCA, you want to check the correlations between the variables. If any of the correlations are too high (greater than 0.9), you may need to remove one of the variables from the analysis, as the two variables seem to be measuring the same thing. If the correlations are too low, say below 0.1, then one or more of the variables might be its own Principal Component, which does not help us reduce the number of variables. In our example, we should probably remove the “Unknown” age category, as its extremely high correlation with Adult will cause us problems.
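One way to act on this check is sketched below. It assumes the imported age columns are named Adult, Juvenile and Unknown; the actual names depend on how the file was cleaned up, so treat them as placeholders. PROC CORR shows the correlations on their own, and the DROP= data set option leaves the Unknown column out of a second PCA run without changing the imported data.

```sas
/* Assumption: the age columns are named Adult, Juvenile and Unknown */
proc corr data=work.import nosimple;
    var Adult Juvenile Unknown;
run;

/* Rerun the PCA without the Unknown column */
proc princomp data=work.import(drop=Unknown) plots=score(ellipse);
    id type;
run;
```

With only two variables left, the second run can produce at most two components, so it is mainly useful for seeing how much the results change once the near-duplicate column is gone.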
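The next table has what are called Eigenvalues of the Correlation Matrix. The first time I heard the term, I was stumped; never having come across it before, I had no idea what it was. Basically, each eigenvalue is the amount of variance accounted for by one component. The components are listed in order, so the first always has the highest variance, followed by the second highest, and so on. Because the analysis is run on the correlation matrix, the eigenvalues add up to the number of variables, and the Proportion column is each eigenvalue divided by that total.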
I have to admit I get the concept of what an eigenvalue is, but I am not comfortable enough with these concepts to offer a solid explanation. I am hoping someone can provide some assistance in the comments, because I would really like to better understand this concept. I am also struggling to figure out what Component 1 is – based on the proportion column, it explains 87% of the variance, but I can’t figure out how to know if it’s Adult / Unknown, Adult / Child, etc.
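One place to look for an answer is the Eigenvectors table that PROC PRINCOMP prints alongside the eigenvalues. Each column of that table lists the weights that a component gives to each original variable, so a component is a weighted blend of all the variables rather than a pairing such as Adult / Unknown. A minimal sketch that reruns the analysis and prints only those two tables:

```sas
/* Keep only the eigenvalue and eigenvector tables in the output */
ods select Eigenvalues Eigenvectors;

proc princomp data=work.import plots=none;
run;
```

If the weights in the first column are all of similar size and sign, Component 1 is probably acting as an overall "volume of victims" measure across the age categories, though that reading should be checked against the actual table.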
I’m going to turn my attention to one of the graphs that PROC PRINCOMP created. It is a scatter plot with a 95% ellipse, comparing Components 1 and 2 (the percentages on the axes are the proportions of variance from the eigenvalue table above).
Very clearly, Larceny/Theft and Assault are “extreme” outliers, and Sex Offenses is an outlier (within the 95% ellipse but clearly outside the cluster). When I sort the data by Adult victims and then by Juvenile victims, it’s very apparent these three have the highest volume.
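The sort described above can be done with PROC SORT. This sketch assumes the columns are named Adult and Juvenile, and it writes to a new data set (work.sorted, a name chosen here for illustration) so the imported data stays untouched.

```sas
/* Assumption: the victim-count columns are named Adult and Juvenile */
proc sort data=work.import out=work.sorted;
    by descending Adult descending Juvenile;
run;

/* Show the five crime types with the most adult victims */
proc print data=work.sorted(obs=5);
    var type Adult Juvenile;
run;
```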
I'm sure I'm barely scratching the surface of PCA, and I look forward to seeing what comments and additional information people can post!
Now it’s your turn!
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
Need data for learning?
The SAS Communities Library has a growing supply of free data sources that you can use in your training to become a data scientist. The easiest way to find articles about data sources is to type "Data for learning" in the communities site search field like so:
We publish all articles about free data sources under the Analytics U label in the SAS Communities Library. Want email notifications when we add new content? Subscribe to the Analytics U label by clicking "Find A Community" in the right nav and selecting SAS Communities Library at the bottom of the list. In the Labels box in the right nav, click Analytics U:
Click Analytics U, then select "Subscribe" from the Options menu.