Hi, I am currently running into problem with proc means and big datasets. I am using proc means to calculate sums by group like this:
proc means data=x7 noprint; class id; var amount1-amount&end.; where id ne "" or id ne "."; output out=out1 sum=; run;
My dataset x7 is approximately 70 million rows and the output should be around 3 million rows (the number of different id's). The procedure works correctly with a smaller dataset (around half million rows). However, the output with the bigger dataset is only 2 rows (sum row and some random id). Is there a more efficient way to do this? Or what is the problem here? I can not use proc sql as I want to calculate sums for a number of variables with the number not being fixed.
Sort the dataset by id first, then use
by id;
instead of the CLASS statement. With CLASS, SAS needs to build a structure of 3 million * (&end * 8 + (defined length of id)) in memory, while the BY needs to keep only the variables for the current group.
Rule of thumb: CLASS is for category variables of low cardinality.
With a long dataset structure, the problem of not being able to use SQL without macro coding (or CALL EXECUTE) would go away. See Maxims 33 & 19.
Provide the log from running the code. Copy the log and paste into a code box opened with the </>.
Large numbers of CLASS values may be an issue. So it may be worth trying a SORT by ID and then use BY ID instead of the CLASS statement.
If you only want the output for ID you should use the NWAY option on the Proc statement. Otherwise the procedure will also provide an overall summary, _type_=0, record in the output when using CLASS.
Sort the dataset by id first, then use
by id;
instead of the CLASS statement. With CLASS, SAS needs to build a structure of 3 million * (&end * 8 + (defined length of id)) in memory, while the BY needs to keep only the variables for the current group.
Rule of thumb: CLASS is for category variables of low cardinality.
With a long dataset structure, the problem of not being able to use SQL without macro coding (or CALL EXECUTE) would go away. See Maxims 33 & 19.
First, clean up the basic PROC MEANS errors in your program.
This statement includes all values for ID:
where id ne "" or id ne ".";
Instead, switch to:
where id not in (" ", ".");
Second (and as others have mentioned), you will get many levels of summarization in your output data set. Add the NWAY option:
proc means data=x7 noprint nway;
In the long run, it will be necessary for you to understand what it does, so I leave it to you to look it up.
Since you are not using the Missing option in the Proc Means statement, any ID where values for all variables in the VAR statement are missing will not appear in the output data set. Try adding the Missing option to see if you get the expected number of records.
I have used Proc Means with a class statement on data sets as large as yours without any issues.
@JimLoughlin wrote:
I have used Proc Means with a class statement on data sets as large as yours without any issues.
Depends on the dataset structure; if &end was equal to 1000, you'd need a little more than 20 GB of RAM, and you won't get that in a typical workspace server.
(8 bytes * 1000 * 3,000,000 = 24,000,000,000)
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.
Find more tutorials on the SAS Users YouTube channel.