Help using Base SAS procedures

Best way to aggregate by cluster?

New Contributor
Posts: 3

I have a large data set with about 10000 clusters (each with about 5-10 data points).  There are about 30 variables in the dataset.  I need to aggregate by cluster.  Variables will aggregate differently (mostly count or mean). I do not want to retain any duplicate, non-aggregated data -- just one datapoint for each cluster.

What is the simplest way to do this? I know I could do it with PROC MEANS and by creating all new variables, but I thought there might be a better way.


Accepted Solutions
Solution
08-10-2015 01:47 PM
Super User
Posts: 10,460

Re: Best way to aggregate by cluster?

Do you currently have a cluster identification variable, or combinations of variables that uniquely identify the cluster? If not, that would likely be the first step.

I'm not sure what your issue with PROC MEANS / PROC SUMMARY would be. If each variable to be aggregated needs only one statistic (e.g. Var1 Var2 Var3 are counts with N and Var4 Var5 Var6 are means), then you get only one output variable per input variable if you ask correctly:

proc summary data=have nway;
     class ClusterID;          /* one output row per cluster */
     var <your variables>;
     output out=want
          n(<variables for count>)=
          mean(<variables for mean>)=;
run;

With the bare = syntax, each output count or mean takes the name of its input variable. If you want multiple statistics for the same variable, you will have to specify output names or use the AUTONAME option.
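To make the naming behaviour concrete, here is a minimal sketch on made-up data. The data set names (have, want, want_multi) and the variables Var1 through Var4 are assumptions for illustration only, not taken from the original poster's data.

```sas
/* Hypothetical sample data: 2 clusters, 4 numeric variables */
data have;
    input ClusterID Var1 Var2 Var3 Var4;
    datalines;
1 10 5 2.0 100
1 12 . 4.0 200
2 7 3 1.0 50
2 9 4 3.0 70
2 8 6 5.0 90
;
run;

/* One statistic per variable: each output variable keeps its input name */
proc summary data=have nway;
    class ClusterID;
    var Var1 Var2 Var3 Var4;
    output out=want n(Var1 Var2)= mean(Var3 Var4)=;
run;

/* Two statistics for the same variable: AUTONAME builds the names
   Var3_N and Var3_Mean so they do not collide */
proc summary data=have nway;
    class ClusterID;
    var Var3;
    output out=want_multi n(Var3)= mean(Var3)= / autoname;
run;
```

In want, Var1 and Var2 hold the non-missing counts and Var3 and Var4 hold the means, one row per ClusterID. N counts only non-missing values, so the missing Var2 in cluster 1 is not counted.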


All Replies
Regular Contributor
Posts: 156

Re: Best way to aggregate by cluster?

Since this is about clustering, perhaps you should try posting under SAS Communities > SAS Statistical Procedures? Unless you are writing your own algorithm or using a non-conventional distance metric, I wouldn't use PROC MEANS.

New Contributor
Posts: 3

Re: Best way to aggregate by cluster?

Thanks!  I have a cluster ID variable, just knew there was a way to do it more elegantly than what I was coming up with (which was long and messy).

New Contributor
Posts: 3

Re: Best way to aggregate by cluster?

Ok -- thought that was the solution but it is not what I want to do.  I need an entirely new data set with just the aggregated variables.  PROC MEANS, etc, does not get me this, unless I am missing something.
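A note on this follow-up: the OUT= option on the OUTPUT statement writes a separate data set with one row per cluster, so the input data is left untouched. The sketch below uses made-up data and hypothetical names (have, want, Var1, Var2). It adds NOPRINT to suppress the printed report from PROC MEANS and drops the automatic _TYPE_ and _FREQ_ variables, so only the cluster ID and the aggregated variables remain.

```sas
/* Hypothetical sample data */
data have;
    input ClusterID Var1 Var2;
    datalines;
1 10 2.0
1 12 4.0
2 7 1.0
2 . 3.0
2 8 5.0
;
run;

/* NWAY keeps only the per-cluster rows; OUT= creates the new data set */
proc means data=have nway noprint;
    class ClusterID;
    var Var1 Var2;
    output out=want (drop=_type_ _freq_) n(Var1)= mean(Var2)=;
run;

/* Inspect the result: one row per ClusterID */
proc print data=want noobs;
run;
```

If the overall row count per cluster is also needed, keeping _FREQ_ instead of dropping it is one option.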
