Hello
I have a data set with 25 million rows and 15 columns.
I am using proc sql to calculate aggregation statistics .
The query is running very slowly (It takes long time to complete run the query).
My question- May anyone suggest another code that run quickly?
proc sql;
create table want as
select Z1,Z2,YYYYMM,
count(distinct customer_ID) as nr_customers,
count(*) as nr_rows,
sum(amount)/1000000 as amount_in_Millions
from have
group by Z1,Z2,YYYYMM
;
quit;
When dealing with optimization, a string point is to add
options msglevel=i stimer;
And provide the log.
To your code, to better understand what the potential bottlenecks.
When ti comes to you code, it's not much you can about it.
You could try PROC SUMMARY instead, it's quicker in some instances.
Assuming your data is in Base, state if this is not the case.
Make sure you can use as much RAM as possible, and potentially increasing MEMSIZE snd SORTSIZE.
Try PROC SUMMARY. I would strongly advise you abandon the idea of doing lots of statistics across lots of groups using PROC SQL, especially for HUGE data sets. PROC SUMMARY ought to run faster than PROC SQL in this task. Just because you can do it in PROC SQL does not mean you should do it in PROC SQL.
proc summary data=have nway;
class z1 z2 yyyymm customer_id;
var amount;
output out=sums sum= n=n;
run;
proc summary data=sums nway;
class z1 z2 yyyymm;
var amount n;
output out=want sum(amount n)=amount nr_rows;
run;
In the output data set WANT, the value of _FREQ_ is the number of unique customers for each grouping of Z1 Z2 YYYYMM.
Try a DATA step method:
proc sort
data=have (keep=z1 z2 yyyymm customer_id amount)
out=int
;
by z1 z2 yyyymm customer_id;
run;
data want;
set int;
by z1 z2 yyyymm customer_id;
retain nr_rows amount_in_millions;
if first.yyyymm
then do;
nr_rows = 0;
amount_in_millions = 0;
end;
if first.customer_id then nr_rows + 1;
amount_in_millions + amount;
if last.yyyymm
then do;
amount_in_millions = amount_in_millions / 1000000;
output;
end;
drop amount customer_id;
run;
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.