I have a large data set with about 10000 clusters (each with about 5-10 data points). There are about 30 variables in the dataset. I need to aggregate by cluster. Variables will aggregate differently (mostly count or mean). I do not want to retain any duplicate, non-aggregated data -- just one datapoint for each cluster.
What is the simplest way to do this? I know I could do this with PROC MEANS and create all new variables, but I thought there might be a better way.
Do you currently have a cluster identification variable, or combinations of variables that uniquely identify the cluster? If not, that would likely be the first step.
I'm not sure what your issue with PROC MEANS/SUMMARY would be. If each of the variables to be aggregated is only going to have one aggregate done (i.e. Var1 Var2 Var3 are counts with N and Var4 Var5 Var6 are means), then you would get only one output variable per input variable if asked correctly:
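proc summary data=have nway;   /* NWAY: keep only the rows for each ClusterID value */
  class ClusterID;
  var <your variables>;
  /* an empty right-hand side after = reuses the input variable names */
  output out=want n(<variables for count>)= mean(<variables for mean>)=;
run;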
The output count or mean will have the name of the input variable. You will have to specify names, or use AUTONAME if you want multiple statistics for the same variable.
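As a rough illustration of the AUTONAME case mentioned above, here is a minimal sketch. The data set names (have, want) and the variables (ClusterID, Var1-Var6) are placeholders taken from this thread, not from the poster's real data, and the choice of STD as a second statistic is only an assumption for the example.

```sas
/* Sketch only: have, want, ClusterID and Var1-Var6 are placeholder names. */
proc summary data=have nway;
  class ClusterID;
  var Var1 Var2 Var3 Var4 Var5 Var6;
  output out=want
         n(Var1 Var2 Var3)=      /* counts of non-missing values */
         mean(Var4 Var5 Var6)=   /* means */
         std(Var4 Var5 Var6)=    /* second statistic on the same variables */
         / autoname;             /* lets SAS build distinct output names */
run;
```

With AUTONAME the statistic keyword is appended to each input name (for example Var4_Mean), so the two statistics requested for Var4-Var6 should not collide. If you want full control over the names, list them explicitly after each = instead.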
Since this is about "clustering", perhaps you should try posting under SAS Communities > SAS Statistical Procedures? Unless you are writing your own algorithm or using a non-conventional distance metric, I wouldn't use PROC MEANS.
Thanks! I have a cluster ID variable, just knew there was a way to do it more elegantly than what I was coming up with (which was long and messy).
Ok -- thought that was the solution but it is not what I want to do. I need an entirely new data set with just the aggregated variables. PROC MEANS, etc, does not get me this, unless I am missing something.
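For reference, the OUT= option on the OUTPUT statement writes a separate data set and leaves the input untouched. A minimal sketch with toy data follows; the data set names, variable names, and values are all made up for illustration, and dropping _TYPE_ and _FREQ_ is optional.

```sas
/* Toy input: hypothetical names and values, for illustration only. */
data have;
  input ClusterID Var1 Var4;
  datalines;
1 10 2.5
1 . 3.5
2 7 1.0
2 8 2.0
2 9 3.0
;
run;

/* WANT is a new data set; HAVE is not modified. */
proc summary data=have nway;
  class ClusterID;
  var Var1 Var4;
  output out=want(drop=_type_ _freq_)   /* drop the automatic bookkeeping variables */
         n(Var1)=Var1_count             /* count of non-missing Var1 per cluster */
         mean(Var4)=Var4_mean;          /* mean of Var4 per cluster */
run;

/* Inspect the result: one row per ClusterID is expected. */
proc print data=want noobs;
run;
```

Here WANT should contain only ClusterID plus the two aggregated variables, with no detail rows carried over from HAVE.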