I have a dataset like several million obs, and I am doing 10000 simulations. The following is part of the code, it takes really long time to finish this computation. Is there any way to speed up the proc sql steps? Thanks a lot as always!
%macro sim_table;
%do i=1 %to 10000;
proc sql noprint;
create table sim_&i as
select *,
(select count(var1) from data_&i where id=a.id and year=a.year and month=a.month) as count1,
(select sum (var2) from data_&i where id=a.id and year=a.year and month=a.month) as total2
from data_&i as a;
quit;
proc append base = simulation data = sim_&i; run;
%end;
%mend;
%sim_table;
Splitting your data into 10000 datasets is not usually an efficient strategy. But if you want to go that route, this query would be more efficient:
%macro sim_table;
%do i=1 %to 10000;
proc sql noprint;
create table sim_&i as
select
*,
count(var1) as count1,
sum(var2) as total2
from data_&i
group by id, year, month;
quit;
proc append base = simulation data = sim_&i; run;
%end;
%mend;
%sim_table;
Splitting your data into 10000 datasets is not usually an efficient strategy. But if you want to go that route, this query would be more efficient:
%macro sim_table;
%do i=1 %to 10000;
proc sql noprint;
create table sim_&i as
select
*,
count(var1) as count1,
sum(var2) as total2
from data_&i
group by id, year, month;
quit;
proc append base = simulation data = sim_&i; run;
%end;
%mend;
%sim_table;
Thank you PG! I will try this one.
You are right, split data is never a good practice. Since I have to connect to public computer for the computation, and I have very limited storage quota, so I have to split original data into small ones, run simulation, then only combine outputs into single file. Otherwise, the simulation will result in huge dataset that exceed my quota 🙂
And why not cut your input/output possibly in half?
Don't "create table", Do "create view".
Regards,
Mark
Another way to speed this up is to create a data set view of, say 100 data sets, each distinguished by a SIM_NUM. The do a proc sql create view on those 100 and append the results all at once. The only substantial disk space required will be your SIMULATION data set.
To start, I'd suggest running it with
%let end=4
and
%let grpsize=2
and
options mprint;
to see the code it generates.
%let start=1;
%let end=10000;
%let grpsize=100;
%macro sim_by_grps;
%do grp=%eval(&start-1) %to %eval(&end-&grpsize) %by &grpsize;
data vt&grp /view=vt&grp;
set
%do I=%eval(&grp+1) %to %eval(&grp+&grpsize);
data_&I
%end;
open=defer indsname=indsn ;
sim_num=input(scan(indsn,-1,'_'),5.);
run;
proc sql noprint;
create view vtappend&grp as
select *, n(var1) as count1, sum(var2) as total2
from vt&grp
group by sim_num,id,year,month;
quit;
proc append base=simulation data=vtappend&grp;
run;
%end;
%mend sim_by_grps;
%sim_by_grps;
Thank you mkeintz for your inputs! I will spend some time to digest your advice.
Excellent.
And if the performance is still insufficient, then it's time to put the SQL "create view" aside (because it doesn't take advantage of the fact that the incoming dataset is sorted by SIM_NUM). You would then either do further DATA step programming, or run a "PROC MEAN ... NOPRINT NWAY;" with a "BY SIM_NUM;" and "CLASS ID YEAR MONTH:" among the related statements. The result can be easily merged with the original data to be appended.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.