Solved: Re: Proc means with huge datasets

tahos · Posted 07-27-2020 04:53 AM

Hi, I am currently running into a problem with proc means and big datasets. I am using proc means to calculate sums by group like this:

proc means data=x7 noprint;
    class id;
    var amount1-amount&end.;
    where id ne "" or id ne ".";
    output out=out1 sum=;
run;

My dataset x7 has approximately 70 million rows, and the output should have around 3 million rows (the number of distinct IDs). The procedure works correctly with a smaller dataset (around half a million rows). However, with the bigger dataset the output has only 2 rows (the sum row and some random ID). Is there a more efficient way to do this? Or what is the problem here? I cannot use proc sql because I want to calculate sums for a number of variables, and that number is not fixed.

Kurt_Bremser · Posted 07-27-2020 05:14 AM

Sort the dataset by id first, then use

by id;

instead of the CLASS statement. With CLASS, SAS needs to build a structure of roughly 3 million * (&end * 8 + (defined length of id)) bytes in memory, while BY needs to keep only the variables for the current group.
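Put together, the suggestion could look like the sketch below. It reuses x7, id, the amount variables and &end from the original post; x7_sorted is a hypothetical name for the sorted copy, and the WHERE filter from the original code is left out here for brevity.

```sas
/* Sort once by the grouping variable so that BY-group processing works. */
/* x7_sorted is a hypothetical name for the sorted copy.                 */
proc sort data=x7 out=x7_sorted;
    by id;
run;

/* BY handles one id group at a time, so the 3 million groups */
/* are not all held in memory at once as they are with CLASS. */
proc means data=x7_sorted noprint;
    by id;
    var amount1-amount&end.;
    output out=out1 sum=;
run;
```

With BY there is no overall summary row; out1 gets one row per id.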

Rule of thumb: CLASS is for category variables of low cardinality.

With a long dataset structure, the problem of not being able to use SQL without macro coding (or CALL EXECUTE) would go away. See Maxims 33 & 19.
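As a rough illustration of the long layout, assume a table with one row per id and amount index. The names x7_long, amount_no, amount, out_long and amount_sum are hypothetical and do not appear in the original post.

```sas
/* Assumed long layout: one row per id and amount index.       */
/* x7_long, amount_no and amount are hypothetical names.       */
proc sql;
    create table out_long as
    select id,
           amount_no,
           sum(amount) as amount_sum
    from x7_long
    group by id, amount_no;
quit;
```

The query no longer depends on how many amount variables exist, because each one is a value of amount_no rather than a separate column.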

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

ballardw · Posted 07-27-2020 05:03 AM

Provide the log from running the code. Copy the log and paste it into a code box opened with the </> button.

Large numbers of CLASS values may be an issue, so it may be worth sorting by ID and then using BY ID instead of the CLASS statement.

If you only want the output per ID, use the NWAY option on the PROC MEANS statement. Otherwise, when using CLASS, the procedure also writes an overall summary record (_TYPE_=0) to the output.
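A minimal sketch of the original step with NWAY added, assuming the rest of the code stays as posted. Dropping the automatic _TYPE_ and _FREQ_ variables is optional and only shown for illustration.

```sas
/* NWAY keeps only the rows for the full CLASS combination, */
/* here one row per id, and omits the _TYPE_=0 total row.   */
proc means data=x7 noprint nway;
    class id;
    var amount1-amount&end.;
    output out=out1(drop=_type_ _freq_) sum=;
run;
```

NWAY only changes which rows are written; it does not reduce the memory that CLASS needs for a high-cardinality variable.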

Astounding · Posted 07-27-2020 10:22 AM

First, clean up the basic PROC MEANS errors in your program.

This statement keeps every value of ID, because any value differs from at least one of the two constants:

where id ne "" or id ne ".";

Instead, switch to:

where id not in (" ",  ".");
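The difference between the two conditions can be checked on a tiny test table. The dataset names ids, kept_or and kept_notin are hypothetical and exist only for this demonstration.

```sas
/* Three test rows: a real id, a lone period, and a blank. */
data ids;
    length id $8;
    id = "A001"; output;
    id = ".";    output;
    id = " ";    output;
run;

/* OR version: every value differs from at least one constant, */
/* so all three rows pass the filter.                          */
data kept_or;
    set ids;
    where id ne "" or id ne ".";
run;

/* NOT IN version: only the row with id = "A001" passes. */
data kept_notin;
    set ids;
    where id not in (" ", ".");
run;
```

The observation counts reported in the log for kept_or and kept_notin show the effect directly.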

Second (and as others have mentioned), you will get more than one level of summarization in your output data set: the overall total as well as the per-ID sums. Add the NWAY option:

proc means data=x7 noprint nway;

In the long run, it will be necessary for you to understand what it does, so I leave it to you to look it up.

JimLoughlin · Posted 07-27-2020 04:46 PM

Since you are not using the MISSING option in the PROC MEANS statement, observations with a missing value for the CLASS variable ID are excluded and will not appear in the output data set. Try adding the MISSING option to see if you get the expected number of records.

I have used Proc Means with a class statement on data sets as large as yours without any issues.

Kurt_Bremser · Posted 07-28-2020 02:43 AM

@JimLoughlin wrote:

I have used Proc Means with a class statement on data sets as large as yours without any issues.

Depends on the dataset structure; if &end was equal to 1000, you'd need a little more than 20 GB of RAM, and you won't get that in a typical workspace server.

(8 bytes * 1000 * 3,000,000 = 24,000,000,000)
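The same estimate can be reproduced in a short DATA step and compared with the memory limit of the session. The value 1000 for &end is the hypothetical figure from the post above, and the variable names are chosen for illustration only.

```sas
/* Rough size of the statistics alone, ignoring the length of id. */
data _null_;
    n_ids           = 3000000;  /* distinct ids from the original post     */
    n_vars          = 1000;     /* hypothetical value of &end              */
    bytes_per_value = 8;        /* a numeric value is stored in 8 bytes    */
    bytes = n_ids * n_vars * bytes_per_value;
    gb    = bytes / 1e9;
    put bytes= comma20. gb=;
run;

/* Show the MEMSIZE setting of the current SAS session in the log. */
proc options option=memsize;
run;
```

If the estimate is far above the MEMSIZE value, the CLASS approach is unlikely to fit in memory.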

tahos · Posted 07-28-2020 03:36 AM

Hi, thank you for your solutions. Somehow this information was hard to find by googling. Thank you also for pointing out the useless where clause; I did not catch that myself!
