Cluster analysis - questions

a2veeram · Posted 01-21-2015 06:10 PM

Hello,

I am rather new to SAS and trying to do a cluster analysis with my data.

I have respondents (SAMPLEID), the items (NSS__) that they eat on a regular basis, and the amounts. Each item is assigned a code, which is why you see NSS11, NSS25, etc.

It looks like this:

SAMPLEID	NSS11	NSS13	NSS14	NSS15	NSS16	NSS20	NSS25	NSS30	NSS31	NSS35	NSS41	NSS42	NSS46	NSS47	NSS48	NSS50	NSS51	NSS120	NSS125	NSS132	NSS135
111111	5	88	64	0	0	0	0	356	0	25	670	0	0	0	0	270	168	369	0	0	0
111112	0	0	0	0	0	0	0	860	0	0	300	35	67	800	0	0	250	0	150	45	0
111113	0	100	70	0	0	0	0	450	0	45	805	0	0	0	0	300	280	450	0	0	0
111114	45	0	0	570	64	79	31	0	680	0	0	73	790	64	82	0	0	0	97	53	68
111115	50	2	0	400	38	120	52	0	500	0	0	50	600	53	75	0	0	0	50	2	90

I have a much larger dataset with many more variables. I would like to create clusters based on similarities in eating patterns. Here, for example, 111111 and 111113 would be in cluster 1, while 111114 and 111115 would be in cluster 2, and so on, based on similarities in their diet.

I tried the FASTCLUS procedure with the number of clusters set from 3 to 10, and in each case one cluster contains the majority of respondents while the others contain only a few. Given the size of the dataset, I would expect at least a few dominant dietary patterns (e.g. vegetarian or omnivore).

I tried the ACECLUS procedure, but after some time it gave me the error "Eigenvector computation failed".

I am following McCarthy's paper, "Methodological approach to performing cluster analysis with SAS", but its example clusters a number of countries by similarity using only 3 variables.

I wonder whether I am using the right procedures.

I would greatly appreciate any suggestions or ideas!

Thank you very much,

Anastasia

Reeza · Posted 01-21-2015 06:34 PM

Common issues that can affect your cluster calculations are the size of your data set, having too many variables, and having too few observations. Depending on your variables, combining some of them may be an approach that works.
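
As a rough illustration of combining variables, here is a minimal sketch that sums item-level amounts into broader group totals in a DATA step. The data set names (diet, diet_grouped), the group names (group_a, group_b), and the assignment of item codes to groups are all hypothetical; the real grouping would have to come from what each NSS code actually means.

```sas
/* A few rows and columns copied from the sample in the original post. */
data diet;
   input sampleid nss11 nss13 nss14 nss15 nss16 nss20;
   datalines;
111111 5 88 64 0 0 0
111112 0 0 0 0 0 0
111113 0 100 70 0 0 0
111114 45 0 0 570 64 79
111115 50 2 0 400 38 120
;
run;

/* Hypothetical grouping: which codes belong together is an assumption. */
data diet_grouped;
   set diet;
   group_a = sum(nss11, nss13, nss14);  /* SUM() ignores missing values */
   group_b = sum(nss15, nss16, nss20);
   keep sampleid group_a group_b;
run;
```

Clustering would then run on the group totals instead of the individual item variables, which reduces the number of variables the procedure has to handle.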

There is also PROC CLUSTER that you can look into.
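
A minimal sketch of what a PROC CLUSTER run could look like, using a few rows from the sample in the original post. The data set names (diet, tree, clusters), the choice of Ward's method, the STD option, and cutting the tree at two clusters are assumptions made for illustration, not recommendations for the full data.

```sas
/* A few rows and columns copied from the sample in the original post. */
data diet;
   input sampleid nss11 nss13 nss14 nss15 nss16 nss20;
   datalines;
111111 5 88 64 0 0 0
111112 0 0 0 0 0 0
111113 0 100 70 0 0 0
111114 45 0 0 570 64 79
111115 50 2 0 400 38 120
;
run;

/* Hierarchical clustering; STD standardizes the variables first
   so that items with large amounts do not dominate the distances. */
proc cluster data=diet method=ward std outtree=tree noprint;
   var nss:;          /* all variables whose names start with NSS */
   id sampleid;
run;

/* Cut the tree at a chosen number of clusters (2 is an arbitrary choice). */
proc tree data=tree nclusters=2 out=clusters noprint;
run;
```

The OUT= data set from PROC TREE contains a CLUSTER variable with the cluster assignment for each respondent.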

a2veeram · Posted 01-21-2015 06:40 PM

Hello Reeza,

Thank you for the prompt response.

The number of respondents is around 11,000 and the number of variables is around 2,000. Do you think this could be an issue?

The paper that I mentioned in my post says to run ACECLUS before the actual PROC CLUSTER, so I was unsure whether I can do this with my data.
