I am trying to do dimension reduction using Principal Component Analysis. The dataset has 25 variables and 300K observations. The data will be used for segmentation with two-stage clustering (K-means clustering, then linkage clustering).
What are good practices for deciding the number of retained factors? Is criterion #1 good enough?
Criterion #1: eigenvalue > 1
-> 5 factors with 54% of the variation explained. Is the variation explained too low? Should I use criterion #2 instead?
Criterion #2: eigenvalue > 0.7 and variation explained > 0.7
-> 10 factors with 78% of the variation explained
ods graphics on;
proc factor data=myData preplot plots=(scree initloadings preloadings loadings) method=principal rotate=varimax
scree score;
var _numeric_;
run;
ods graphics off;
# | Eigenvalue | Difference | Proportion | Cumulative |
1 | 4.59175884 | 3.04985582 | 0.2551 | 0.2551 |
2 | 1.54190302 | 0.12646714 | 0.0857 | 0.3408 |
3 | 1.41543588 | 0.23647927 | 0.0786 | 0.4194 |
4 | 1.17895661 | 0.09521203 | 0.0655 | 0.4849 |
5 | 1.08374458 | 0.16097769 | 0.0602 | 0.5451 |
6 | 0.92276689 | 0.04595209 | 0.0513 | 0.5964 |
7 | 0.8768148 | 0.00522994 | 0.0487 | 0.6451 |
8 | 0.87158485 | 0.06006623 | 0.0484 | 0.6935 |
9 | 0.81151862 | 0.05330799 | 0.0451 | 0.7386 |
10 | 0.75821063 | 0.06929076 | 0.0421 | 0.7807 |
11 | 0.68891987 | 0.05741897 | 0.0383 | 0.819 |
12 | 0.6315009 | 0.02247774 | 0.0351 | 0.8541 |
13 | 0.60902316 | 0.01814238 | 0.0338 | 0.8879 |
14 | 0.59088079 | 0.03822865 | 0.0328 | 0.9207 |
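For reference, the two criteria above correspond to retention options on the PROC FACTOR statement. This is a minimal sketch, assuming the same myData data set as in the code above. MINEIGEN= and PROPORTION= are standard PROC FACTOR options, but how they combine when several are given at once is worth checking in the documentation, so each is shown in its own run here.

```sas
/* Criterion #1: retain factors using an eigenvalue cutoff of 1 */
proc factor data=myData method=principal mineigen=1;
   var _numeric_;
run;

/* Eigenvalue part of criterion #2: lower the cutoff to 0.7 */
proc factor data=myData method=principal mineigen=0.7;
   var _numeric_;
run;

/* Variance part of criterion #2: retain factors until 70% is explained */
proc factor data=myData method=principal proportion=0.7;
   var _numeric_;
run;
```

In the table above, the Cumulative column first passes 0.70 at the ninth eigenvalue (0.7386), and the tenth eigenvalue (0.758) is the last one above 0.7.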
Honestly, I think the answer is totally subjective here. I don't believe that there is a universally accepted answer. The scree plot might indicate 7 factors.
However, I would say that if you (for example) choose the 5 factor solution, but find that factor 6 has a clear interpretation that makes sense in your application, that's a (again subjective) reason to include factor 6.
As for whether 54% of the variability explained is enough: again, there is no universal answer, since every situation is different. In some fields of application 54% might be fantastic, while in others it might be poor.
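One way to act on the interpretability argument is to refit with an explicit factor count and read the rotated loadings side by side. The sketch below assumes the myData data set from the question. The counts (5 and 6) and the 0.3 display threshold are illustrative choices, not recommendations.

```sas
/* Five-factor solution */
proc factor data=myData method=principal rotate=varimax
            nfactors=5      /* retain exactly 5 factors */
            fuzz=0.3        /* print blanks for loadings below 0.3 in absolute value */
            reorder;        /* group variables by the factor they load on most */
   var _numeric_;
run;

/* Six-factor solution: does the sixth factor have a clear meaning? */
proc factor data=myData method=principal rotate=varimax
            nfactors=6 fuzz=0.3 reorder;
   var _numeric_;
run;
```

If the sixth rotated factor groups variables that make sense together in the application, that supports keeping it. If it only picks up one stray variable, the five-factor solution may be easier to defend.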
Useful advice is available here:
Hey,
an additional criterion would be parallel analysis by Horn (1965; https://link.springer.com/article/10.1007%2FBF02289447).
Bye,
Daniel
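Parallel analysis compares the observed eigenvalues with eigenvalues obtained from random, uncorrelated data of the same size, and keeps only the components whose eigenvalue exceeds the random benchmark. The sketch below is one possible way to build that benchmark with a DATA step and PROC PRINCOMP. It is not taken from Horn's paper. The data set names (sim, simEigen, paThreshold), the seed, the number of simulations, and the use of the 95th percentile are all assumptions made for illustration.

```sas
%let nvars = 25;      /* per the question; set to the number of variables actually analyzed */
%let nobs  = 300000;  /* number of observations, per the question */
%let nsims = 20;      /* number of simulated data sets (arbitrary choice) */

/* Simulate independent standard normal variables, one block per simulation */
data sim;
   call streaminit(12345);            /* arbitrary seed for reproducibility */
   array x{&nvars} x1-x&nvars;
   do sim = 1 to &nsims;
      do obs = 1 to &nobs;
         do j = 1 to &nvars;
            x{j} = rand('normal');
         end;
         output;
      end;
   end;
   drop j obs;
run;

/* Eigenvalues of the correlation matrix for each simulated data set */
ods exclude all;                      /* suppress printed output for every BY group */
ods output Eigenvalues=simEigen;
proc princomp data=sim;
   by sim;
   var x1-x&nvars;
run;
ods exclude none;

/* Benchmark per component position: mean and 95th percentile across simulations */
proc means data=simEigen noprint nway;
   class Number;
   var Eigenvalue;
   output out=paThreshold mean=mean_eigen p95=p95_eigen;
run;

proc print data=paThreshold noobs;
   var Number mean_eigen p95_eigen;
run;
```

Retain the leading components whose observed eigenvalue is larger than the benchmark at the same position. With 300K observations and only a couple of dozen variables, the simulated eigenvalues should all sit very close to 1, so the result will probably be close to the eigenvalue > 1 rule.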