BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
tahos
Fluorite | Level 6

Hi, I am currently running into problem with proc means and big datasets. I am using proc means to calculate sums by group like this: 

proc means data=x7 noprint;
	 	class id;
	 	var amount1-amount&end.;
		where id ne "" or id ne ".";
	 	output out=out1 sum=;
	run;

My dataset x7 is approximately 70 million rows and the output should be around 3 million rows (the number of different id's). The procedure works correctly with a smaller dataset (around half million rows). However, the output with the bigger dataset is only 2 rows (sum row and some random id). Is there a more efficient way to do this? Or what is the problem here? I can not use proc sql as I want to calculate sums for a number of variables with the number not being fixed. 

1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User

Sort the dataset by id first, then use

by id;

instead of the CLASS statement. With CLASS, SAS needs to build a structure of 3 million * (&end * 8 + (defined length of id)) in memory, while the BY needs to keep only the variables for the current group.

Rule of thumb: CLASS is for category variables of low cardinality.

 

With a long dataset structure, the problem of not being able to use SQL without macro coding (or CALL EXECUTE) would go away. See Maxims 33 & 19.

View solution in original post

6 REPLIES 6
ballardw
Super User

Provide the log from running the code. Copy the log and paste into a code box opened with the </>.

 

Large numbers of CLASS values may be an issue. So it may be worth trying a SORT by ID and then use BY ID instead of the CLASS statement.

 

If you only want the output for ID you should use the NWAY option on the Proc statement. Otherwise the procedure will also provide an overall summary, _type_=0, record in the output when using CLASS.

 

 

Kurt_Bremser
Super User

Sort the dataset by id first, then use

by id;

instead of the CLASS statement. With CLASS, SAS needs to build a structure of 3 million * (&end * 8 + (defined length of id)) in memory, while the BY needs to keep only the variables for the current group.

Rule of thumb: CLASS is for category variables of low cardinality.

 

With a long dataset structure, the problem of not being able to use SQL without macro coding (or CALL EXECUTE) would go away. See Maxims 33 & 19.

Astounding
PROC Star

First, clean up the basic PROC MEANS errors in your program.

 

This statement includes all values for ID:

where id ne "" or id ne ".";

Instead, switch to:

where id not in (" ",  ".");

Second (and as others have mentioned), you will get many levels of summarization in your output data set.  Add the NWAY option:

proc means data=x7 noprint nway;

In the long run, it will be necessary for you to understand what it does, so I leave it to you to look it up. 

JimLoughlin
Quartz | Level 8

Since you are not using the Missing option in the Proc Means statement, any ID where values for all variables in the VAR statement are missing will not appear in the output data set.  Try adding the Missing option to see if you get the expected number of records.

 

I have used Proc Means with a class statement on data sets as large as yours without any issues.

 

 

Kurt_Bremser
Super User

@JimLoughlin wrote:

 

I have used Proc Means with a class statement on data sets as large as yours without any issues.

Depends on the dataset structure; if &end was equal to 1000, you'd need a little more than 20 GB of RAM, and you won't get that in a typical workspace server.

(8 bytes * 1000 * 3,000,000 = 24,000,000,000)

tahos
Fluorite | Level 6
Hi, thank you for your solutions. Somehow this information was hard to find by googling. Also thank you for pointing out the useless where clause, I did not catch that myself!

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 6 replies
  • 1210 views
  • 4 likes
  • 5 in conversation