Hello,
I am rather new to SAS and am trying to run a cluster analysis on my data.
I have respondents (SAMPLEID), the items (NSS__) that they eat on a regular basis, and the amounts of each. Each item is assigned a code, which is why you see NSS11, NSS25, etc.
It looks like this:
SAMPLEID | NSS11 | NSS13 | NSS14 | NSS15 | NSS16 | NSS20 | NSS25 | NSS30 | NSS31 | NSS35 | NSS41 | NSS42 | NSS46 | NSS47 | NSS48 | NSS50 | NSS51 | NSS120 | NSS125 | NSS132 | NSS135
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
111111 | 5 | 88 | 64 | 0 | 0 | 0 | 0 | 356 | 0 | 25 | 670 | 0 | 0 | 0 | 0 | 270 | 168 | 369 | 0 | 0 | 0
111112 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 860 | 0 | 0 | 300 | 35 | 67 | 800 | 0 | 0 | 250 | 0 | 150 | 45 | 0
111113 | 0 | 100 | 70 | 0 | 0 | 0 | 0 | 450 | 0 | 45 | 805 | 0 | 0 | 0 | 0 | 300 | 280 | 450 | 0 | 0 | 0
111114 | 45 | 0 | 0 | 570 | 64 | 79 | 31 | 0 | 680 | 0 | 0 | 73 | 790 | 64 | 82 | 0 | 0 | 0 | 97 | 53 | 68
111115 | 50 | 2 | 0 | 400 | 38 | 120 | 52 | 0 | 500 | 0 | 0 | 50 | 600 | 53 | 75 | 0 | 0 | 0 | 50 | 2 | 90
My actual dataset is much larger, with many more variables. I would like to create clusters based on similarities in eating patterns. Here, for example, 111111 and 111113 would fall into cluster 1, while 111114 and 111115 would fall into cluster 2, and so on, based on similarities in their diets.
I tried the FASTCLUS procedure with the number of clusters set from 3 to 10, and each time a single cluster contains the majority of respondents while the others contain only a few. Given the size of the dataset, I would expect at least a few dominant dietary patterns (e.g. vegetarian or omnivore).
I also tried the ACECLUS procedure, but after some time it gave me the error "Eigenvector computation failed".
I am following McCarthy's paper "Methodological approach to performing cluster analysis with SAS", but its example clusters a number of countries by similarity using only 3 variables.
I wonder whether I am using the right procedures.
I would greatly appreciate any suggestions or ideas!
Thank you very much,
Anastasia
The size of your data set can affect your cluster calculations: having too many variables or too few observations are both common issues. Depending on your variables, you may find that combining some of them works.
There is also PROC CLUSTER that you can look into.
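To illustrate what combining variables could look like, here is a minimal sketch. It is written in Python rather than SAS purely so that it is self-contained, and it uses the five sample rows from the question. The grouping rule (`group_of`, which lumps item codes together by their leading digits) is a hypothetical stand-in: a real grouping would come from the food code book. The names `combine`, `distance`, and `nearest` are likewise only illustrative.

```python
import math

# Item columns and the five sample rows from the question.
COLUMNS = ["NSS11", "NSS13", "NSS14", "NSS15", "NSS16", "NSS20", "NSS25",
           "NSS30", "NSS31", "NSS35", "NSS41", "NSS42", "NSS46", "NSS47",
           "NSS48", "NSS50", "NSS51", "NSS120", "NSS125", "NSS132", "NSS135"]
ROWS = {
    "111111": [5, 88, 64, 0, 0, 0, 0, 356, 0, 25, 670, 0, 0, 0, 0,
               270, 168, 369, 0, 0, 0],
    "111112": [0, 0, 0, 0, 0, 0, 0, 860, 0, 0, 300, 35, 67, 800, 0,
               0, 250, 0, 150, 45, 0],
    "111113": [0, 100, 70, 0, 0, 0, 0, 450, 0, 45, 805, 0, 0, 0, 0,
               300, 280, 450, 0, 0, 0],
    "111114": [45, 0, 0, 570, 64, 79, 31, 0, 680, 0, 0, 73, 790, 64, 82,
               0, 0, 0, 97, 53, 68],
    "111115": [50, 2, 0, 400, 38, 120, 52, 0, 500, 0, 0, 50, 600, 53, 75,
               0, 0, 0, 50, 2, 90],
}


def group_of(column):
    """HYPOTHETICAL rule: group item codes by their leading digits.

    NSS11..NSS16 -> "NSS1x", NSS120..NSS135 -> "NSS1xx", and so on.
    Replace this with the real food-group mapping for your codes.
    """
    code = int(column[3:])
    return "NSS1xx" if code >= 100 else f"NSS{code // 10}x"


def combine(amounts):
    """Sum the item amounts of one respondent within each group."""
    totals = {}
    for column, amount in zip(COLUMNS, amounts):
        key = group_of(column)
        totals[key] = totals.get(key, 0) + amount
    return totals


def distance(a, b):
    """Euclidean distance between two combined profiles."""
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in a))


# 21 item variables collapse into 6 combined variables per respondent.
combined = {sid: combine(vals) for sid, vals in ROWS.items()}

# For each respondent, find the most similar other respondent.
nearest = {
    sid: min((other for other in combined if other != sid),
             key=lambda other: distance(combined[sid], combined[other]))
    for sid in combined
}

for sid, other in nearest.items():
    print(sid, "is closest to", other)
```

The same idea carries over to SAS: sum the related item variables into a smaller set of group variables in a data step, then cluster on those instead of on the original items.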
Hello Reeza,
Thank you for the prompt response.
There are around 11,000 respondents and around 2,000 variables. Do you think this could be an issue?
The paper that I mentioned in my post says to use ACECLUS before the actual PROC CLUSTER, so I was unsure whether I can do this with my data.