Solved: Re: Best way to aggregate by cluster?

tebteb · Posted 08-10-2015 11:47 AM

I have a large data set with about 10000 clusters (each with about 5-10 data points). There are about 30 variables in the dataset. I need to aggregate by cluster. Variables will aggregate differently (mostly count or mean). I do not want to retain any duplicate, non-aggregated data -- just one datapoint for each cluster.

What is the simplest way to do this? I know I could do it with PROC MEANS and create all new variables, but I thought there might be a better way.

ballardw · Posted 08-10-2015 01:47 PM

Do you currently have a cluster identification variable, or combinations of variables that uniquely identify the cluster? If not, that would likely be the first step.

I'm not sure what your issue with PROC MEANS/SUMMARY would be. If each of the variables to be aggregated is only going to have one aggregate (i.e., Var1 Var2 Var3 are counts with N and Var4 Var5 Var6 are means), then you would only get one output variable per input variable if it is requested correctly:

proc summary data=have nway;
   class ClusterID;
   var <your variables>;
   output out=want n(<variables for count>)= mean(<variables for mean>)=;
run;

The output count or mean will have the name of the input variable. You will have to specify names, or use AUTONAME if you want multiple statistics for the same variable.
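
As a concrete illustration of the pattern above, here is a minimal sketch. The sample data and the variable names (visits, age, score) are made up for the example and are not from the original question. The first step names the output variables explicitly, which allows two statistics on the same variable; the second relies on AUTONAME instead.

```sas
/* Hypothetical sample data: two clusters, illustrative variable names */
data have;
    input ClusterID visits age score;
    datalines;
1 3 40 7.5
1 . 50 8.5
2 1 30 6.0
2 2 34 7.0
2 4 38 .
;
run;

/* Explicit names: needed here because score gets two statistics */
proc summary data=have nway;
    class ClusterID;
    var visits age score;
    output out=want
        n(visits)=n_visits
        mean(age score)=mean_age mean_score
        max(score)=max_score;
run;

/* Alternative: AUTONAME appends the statistic to each variable name */
proc summary data=have nway;
    class ClusterID;
    var visits age score;
    output out=want_auto n(visits)= mean(age score)= max(score)= / autoname;
run;
```

Note that N counts non-missing values, so a cluster with missing values in a variable gets a smaller count for that variable than its number of rows.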


PhilC · Posted 08-10-2015 01:34 PM

"Clustering"? Perhaps you should try posting under SAS Communities > SAS Statistical Procedures. Unless you are writing your own algorithm or using a non-conventional distance metric, I wouldn't use PROC MEANS.

tebteb · Posted 08-10-2015 09:16 PM

Thanks! I have a cluster ID variable, just knew there was a way to do it more elegantly than what I was coming up with (which was long and messy).

tebteb · Posted 08-12-2015 02:00 PM

Ok -- thought that was the solution but it is not what I want to do. I need an entirely new data set with just the aggregated variables. PROC MEANS, etc, does not get me this, unless I am missing something.
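
For reference, the OUT= option on the OUTPUT statement does write a separate data set with one row per cluster when NWAY is used. Here is a minimal sketch, assuming a data set named have that already contains ClusterID plus the placeholder variables Var1 (to be counted) and Var4 (to be averaged) from the earlier reply. The DROP= data set option removes the automatic _TYPE_ and _FREQ_ columns, so only the cluster ID and the aggregated variables remain.

```sas
/* Assumes HAVE exists with ClusterID, Var1 and Var4 (placeholder names) */
proc summary data=have nway;
    class ClusterID;
    var Var1 Var4;
    /* WANT is a new data set; HAVE itself is not modified */
    output out=want (drop=_type_ _freq_) n(Var1)= mean(Var4)=;
run;

/* Inspect the result: one row per ClusterID */
proc print data=want;
run;
```

If this is still not the shape you need, posting a few rows of the input and the desired output would help narrow it down.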
