tebteb
Calcite | Level 5

I have a large data set with about 10000 clusters (each with about 5-10 data points). There are about 30 variables in the data set. I need to aggregate by cluster. Variables will aggregate differently (mostly count or mean). I do not want to retain any duplicate, non-aggregated data -- just one data point for each cluster.

What is the simplest way to do this? I know I could do this with PROC MEANS by creating all new variables, but I thought there might be a better way.

1 ACCEPTED SOLUTION

ballardw
Super User

Do you currently have a cluster identification variable, or combinations of variables that uniquely identify the cluster? If not, that would likely be the first step.

I'm not sure what your issue with PROC MEANS / PROC SUMMARY would be. If each variable to be aggregated gets only one statistic (i.e., Var1 Var2 Var3 are counts with N and Var4 Var5 Var6 are means), then you get only one output variable per input variable, provided you ask for it correctly:

proc summary data=have nway;
     class ClusterID;
     var <your variables>;
     output out=want n(<variables for count>)= mean(<variables for mean>)=;
run;

Each output count or mean will have the name of its input variable. If you want multiple statistics for the same variable, you will have to specify output names or use the AUTONAME option.
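For illustration, here is a minimal sketch of that approach on made-up data. The data set `have`, its values, and the variable names Var1-Var4 are assumptions for the example, not the poster's actual data. The OUT= data set (`want`) is a new data set with one observation per ClusterID; the DROP= data set option removes the automatic _TYPE_ and _FREQ_ variables if they are not needed.

```sas
/* Hypothetical input: one row per data point, ClusterID identifies the cluster */
data have;
   input ClusterID Var1 Var2 Var3 Var4;
   datalines;
1 10 5 2.0 3.5
1 12 . 4.0 4.5
2 7 3 1.0 2.0
2 9 4 3.0 6.0
2 11 8 5.0 1.0
;
run;

/* One statistic per variable: output variables keep the input names */
proc summary data=have nway;
   class ClusterID;
   var Var1 Var2 Var3 Var4;
   output out=want (drop=_type_ _freq_)
          n(Var1 Var2)=       /* non-missing counts */
          mean(Var3 Var4)=;   /* means */
run;

/* Several statistics for the same variable: AUTONAME appends the */
/* statistic name, giving Var1_N and Var1_Mean                    */
proc summary data=have nway;
   class ClusterID;
   var Var1;
   output out=want_multi (drop=_type_ _freq_)
          n(Var1)= mean(Var1)= / autoname;
run;

proc print data=want noobs;
run;
```

NWAY restricts the output to the rows for the full CLASS combination, so no overall summary row is written alongside the per-cluster rows.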

4 REPLIES
PhilC
Rhodochrosite | Level 12

Since this is about "clustering", perhaps you should try posting under SAS Communities > SAS Statistical Procedures? Unless you are writing your own algorithm or using a non-conventional distance metric, I wouldn't use PROC MEANS.

tebteb
Calcite | Level 5

Thanks!  I have a cluster ID variable, just knew there was a way to do it more elegantly than what I was coming up with (which was long and messy).

tebteb
Calcite | Level 5

Ok -- thought that was the solution but it is not what I want to do.  I need an entirely new data set with just the aggregated variables.  PROC MEANS, etc, does not get me this, unless I am missing something.
