Solved: Proc Sql performance issue

Kirtid · Posted 06-17-2021 06:01 PM

I have a dataset- B which as 50000 records and around 1600 columns. I try to summerize this dataset B in such a way that the minimum value of all numeric field should be displayed in the output for every group. This dataset has only two datatypes - Number and Character, for all numeric field MIN function is applied and all Character fields will be grouped here.. This aggregation is done on all 1600 columns. With Proc sql, the performance is bit slow, could you please assist if there is an alternate approach that will have a better performance than using proc sql step here?.

Ex: out of 1600 fields- 1000 fields are numeric and 600 fields are character, then grouping is done on 600 fields and for every numeric field following in that group, the minimum value should be output in that group.

ballardw · Posted 06-17-2021 08:27 PM

@Kirtid wrote:
Thanks again for the response. I tried to frame a sample test data as below. There can be a case where date fields can be also used for grouping, as these are dynamic variables.

Input - Dataset b;
Input id name$ sale1 sale2 sale3 date1;
Cards;
1 aa 10 20 30
1 aa 5 7 10
2 bb 28 60 6
2 cc 10 5 5
2 cc 6 6 6
;
Run;

Field to which aggregate function needs to be mapped comes from another table as below.From this table the macro variables &aggfunc and &grouplist are created.

Field Func
ID group
Name group
Sale1 min
Sale2 min
Sale3 sum

%put &aggfunc;
ID,Name,Min(sale1) as sale1, min(sale2) as sale2, min(sale3) as sale3

%put &grouplist;
ID,Name

Proc sql;
Create table as sum_out as
Select &aggfunc., &grouplist. From B
Group by &grouplist;
Quit;

Output dataset result - sum_out
1 aa 5 7 40
2 bb 28 60 6
2 cc 6 5 11

This bit: Min(sale1) as sale1, min(sale2) as sale2, min(sale3) as sale3 is exactly what the Proc Summary code I showed earlier would do.

Either with

var _numeric_;

output out=summary min= ;

for all the numeric variables

or if all the variable names start with Sale.

var Sale: ; <= note that colon, that says use all variables whose names start with Sale.

output out=summary min= ;

or if you have a number of iterated list variable names:

var Sale1-sale200 received1-received200 delivered1-delivered200;

output out=summary min=;

would have 600 variables containing the min values of the 600 variables represented on that list.

If your date is accidentally included on a CLASS (or BY) statement and a VAR statement the worst that will happen is a message like this in the log:

WARNING: Variable Date already exists on file <output data set name goes here>

But there are lots of ways to make lists of variables.

DATA B;
Input id name$ sale1 sale2 sale3;
Cards;
1 aa 10 20 30
1 aa 5 7 10
2 bb 28 60 6
2 cc 10 5 5
2 cc 6 6 6
;

proc summary data=b nway;
   class id name;
   var sale: ;
   output out=sum_out (drop=_:) min=;
run;

Your sort of data step doesn't run well as you read a date not provided, so I dropped that.

The NWAY on the proc summary statement makes only the "all variables" combination of the Class or BY varaibles. You can, with a single pass, without nway get the statistics for all the variables across all the data, with each level of id regardless of name, each level of name regardless of Id and the Id,name combinations. There is a variable _type_ that is dropped above that indicates the specific combination for any record.

Caveat with this specific code: large numbers of Class variables with many levels may run into memory issues. In which case you want to use BY instead of class but that requires sorting prior to the Proc Summary. By default summary want to create an output data set so you need the OUTPUT statement. Proc means will generate output to the open ODS destination unless surpressed with the NOPRINT option.

View solution in original post

Reeza · Posted 06-17-2021 06:05 PM

Have you tried a proc means + merge instead? One benefit is the automatic variable lists.

ods select none;
proc means data=have stackods MIN;
class _character_;
var _numeric_;
ods output summary = summary_chars;
run;
ods select all;

Then merge the output with the main data. 

proc sql;
create table want as 
select *
from table1 natural join summary_char;
quit;

Kirtid · Posted 06-17-2021 06:28 PM

Hello Reeza,
Thanks for your reply. Sorry I have not tried it with proc means.. these 1600 fields are dynamic and I get them as macro variables so my current proc sql will look like as below -

Prod sql ;
Create table sum_out as
Select &aggfunction, &grouplist from
B group by &grouplist;
Quit;

&grouplist - the fields on which the grouping should be done
&aggfunction - fields on which aggregation happens, falling in a group. Here the aggregate function can be MIN, Avg, Max, sum etc

ballardw · Posted 06-17-2021 07:20 PM

Group by would be CLASS or BY variables in Proc Means/summary

Show what the aggfunction text actually looks like. Not for 1600 variables but maybe 5 or so we can see what you are doing.

And maybe a bit of what the output looks is supposed to look like.

In @Reeza's example the key word _character_ says to create groups based on the combinations of all of the character variables. _numeric_ says to use all of the numeric variables. The MIN on the Proc statement is the statistic.

Or a slight modification:

proc summary data=have nway;
class _character_;
var _numeric_;
ods output summary = summary_chars  min= ;
run;

Creates a differently structured data set than Reeza. The above would have the output min valued variables with the same name that have going into the procedure. Or use: min= /autoname ; and the statistic would be appended to the variable names.

If your macro variable &grouplist is actually not all of the character variables you could use &grouplist. instead of _character_. I would say that you could do similar with the aggfunction but the code you show means that you have all of the Min(var) as somename in there and all that would be needed were the starting variables. But if indeed you are doing all the numeric variables then the list variable _numeric_ is ever so much more dynamic that coding something to create SQL function calls for 1000+ variables (in fact you may save more clock cycles if you don't even have to create those macro variables).

Experience indicates that Proc Means/ Summary can be quite a bit faster on summarizing data as it doesn't drag in any of the potential involved with joins and other features of SQL.

Kirtid · Posted 06-17-2021 07:58 PM

Thanks again for the response. I tried to frame a sample test data as below. There can be a case where date fields can be also used for grouping, as these are dynamic variables.

Input - Dataset b;
Input id name$ sale1 sale2 sale3 date1;
Cards;
1 aa 10 20 30
1 aa 5 7 10
2 bb 28 60 6
2 cc 10 5 5
2 cc 6 6 6
;
Run;

Field to which aggregate function needs to be mapped comes from another table as below.From this table the macro variables &aggfunc and &grouplist are created.

Field Func
ID group
Name group
Sale1 min
Sale2 min
Sale3 sum

%put &aggfunc;
ID,Name,Min(sale1) as sale1, min(sale2) as sale2, min(sale3) as sale3

%put &grouplist;
ID,Name

Proc sql;
Create table as sum_out as
Select &aggfunc., &grouplist. From B
Group by &grouplist;
Quit;

Output dataset result - sum_out
1 aa 5 7 40
2 bb 28 60 6
2 cc 6 5 11

ballardw · Posted 06-17-2021 08:27 PM

@Kirtid wrote:
Thanks again for the response. I tried to frame a sample test data as below. There can be a case where date fields can be also used for grouping, as these are dynamic variables.

Input - Dataset b;
Input id name$ sale1 sale2 sale3 date1;
Cards;
1 aa 10 20 30
1 aa 5 7 10
2 bb 28 60 6
2 cc 10 5 5
2 cc 6 6 6
;
Run;

Field to which aggregate function needs to be mapped comes from another table as below.From this table the macro variables &aggfunc and &grouplist are created.

Field Func
ID group
Name group
Sale1 min
Sale2 min
Sale3 sum

%put &aggfunc;
ID,Name,Min(sale1) as sale1, min(sale2) as sale2, min(sale3) as sale3

%put &grouplist;
ID,Name

Proc sql;
Create table as sum_out as
Select &aggfunc., &grouplist. From B
Group by &grouplist;
Quit;

Output dataset result - sum_out
1 aa 5 7 40
2 bb 28 60 6
2 cc 6 5 11

This bit: Min(sale1) as sale1, min(sale2) as sale2, min(sale3) as sale3 is exactly what the Proc Summary code I showed earlier would do.

Either with

var _numeric_;

output out=summary min= ;

for all the numeric variables

or if all the variable names start with Sale.

var Sale: ; <= note that colon, that says use all variables whose names start with Sale.

output out=summary min= ;

or if you have a number of iterated list variable names:

var Sale1-sale200 received1-received200 delivered1-delivered200;

output out=summary min=;

would have 600 variables containing the min values of the 600 variables represented on that list.

If your date is accidentally included on a CLASS (or BY) statement and a VAR statement the worst that will happen is a message like this in the log:

WARNING: Variable Date already exists on file <output data set name goes here>

But there are lots of ways to make lists of variables.

DATA B;
Input id name$ sale1 sale2 sale3;
Cards;
1 aa 10 20 30
1 aa 5 7 10
2 bb 28 60 6
2 cc 10 5 5
2 cc 6 6 6
;

proc summary data=b nway;
   class id name;
   var sale: ;
   output out=sum_out (drop=_:) min=;
run;

Your sort of data step doesn't run well as you read a date not provided, so I dropped that.

The NWAY on the proc summary statement makes only the "all variables" combination of the Class or BY varaibles. You can, with a single pass, without nway get the statistics for all the variables across all the data, with each level of id regardless of name, each level of name regardless of Id and the Id,name combinations. There is a variable _type_ that is dropped above that indicates the specific combination for any record.

Caveat with this specific code: large numbers of Class variables with many levels may run into memory issues. In which case you want to use BY instead of class but that requires sorting prior to the Proc Summary. By default summary want to create an output data set so you need the OUTPUT statement. Proc means will generate output to the open ODS destination unless surpressed with the NOPRINT option.

ChrisNZ · Posted 06-17-2021 07:26 PM

> then grouping is done on 600 fields

Just a note:

Either most variables are redundant in that list, or the number of groups is stupidly high and most groups are empty.

Even if the cardinality for each variable is two, 2**600 is a crazy number of groups.

High-Performance SAS Coding - Third Edition

LinusH · Posted 06-18-2021 05:01 AM

Yeah, I never seen a fact/analytical base table that could justify more than 20-30 group by columns.

So my guess that much can be gained by reviewing/altering the data model.

Data never sleeps

Proc Sql performance issue

Re: Proc Sql performance issue

Re: Proc Sql performance issue

Re: Proc Sql performance issue

Re: Proc Sql performance issue

Re: Proc Sql performance issue

Re: Proc Sql performance issue

Re: Proc Sql performance issue

Re: Proc Sql performance issue

SAS Innovate 2025: Register Now

SAS Training: Just a Click Away