BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Jonate_H
Quartz | Level 8

I have a dataset like several million obs, and I am doing 10000 simulations. The following is part of the code, it takes really long time to finish this computation. Is there any way to speed up the proc sql steps? Thanks a lot as always!

 

%macro sim_table;

%do i=1 %to 10000;

proc sql noprint;
create table sim_&i as 
select *,
(select count(var1) from data_&i where id=a.id and year=a.year and month=a.month) as count1,
(select sum  (var2) from data_&i where id=a.id and year=a.year and month=a.month) as total2
from data_&i as a;
quit;

proc append base = simulation data = sim_&i; run;

%end;
%mend;
%sim_table;
1 ACCEPTED SOLUTION

Accepted Solutions
PGStats
Opal | Level 21

Splitting your data into 10000 datasets is not usually an efficient strategy. But if you want to go that route, this query would be more efficient:

 

%macro sim_table;

%do i=1 %to 10000;

	proc sql noprint;
	create table sim_&i as 
	select 
		*,
		count(var1) as count1,
		sum(var2) as total2
	from data_&i 
	group by  id, year, month;
	quit;

	proc append base = simulation data = sim_&i; run;

	%end;
%mend;

%sim_table; 

 

PG

View solution in original post

6 REPLIES 6
PGStats
Opal | Level 21

Splitting your data into 10000 datasets is not usually an efficient strategy. But if you want to go that route, this query would be more efficient:

 

%macro sim_table;

%do i=1 %to 10000;

	proc sql noprint;
	create table sim_&i as 
	select 
		*,
		count(var1) as count1,
		sum(var2) as total2
	from data_&i 
	group by  id, year, month;
	quit;

	proc append base = simulation data = sim_&i; run;

	%end;
%mend;

%sim_table; 

 

PG
Jonate_H
Quartz | Level 8

Thank you PG! I will try this one.

You are right, split data is never a good practice. Since I have to connect to public computer for the computation, and I have very limited storage quota, so I have to split original data into small ones, run simulation, then only combine outputs into single file. Otherwise, the simulation will result in huge dataset that exceed my quota 🙂

mkeintz
PROC Star

And why not cut your input/output possibly in half?

 

Don't "create table",   Do "create view".

 

 

Regards,

Mark

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
mkeintz
PROC Star

Another way to speed this up is to create a data set view of, say 100 data sets, each distinguished by a SIM_NUM.  The do a proc sql create view on those 100 and append the results all at once.  The only substantial disk space required will be your SIMULATION data set.

 

To start, I'd suggest running it with

    %let end=4

and

    %let grpsize=2

and

    options mprint;

to see the code it generates.

 

%let start=1;
%let end=10000;
%let grpsize=100;


%macro sim_by_grps;
  %do grp=%eval(&start-1) %to %eval(&end-&grpsize) %by &grpsize;
    data vt&grp /view=vt&grp;
      set 
        %do I=%eval(&grp+1) %to %eval(&grp+&grpsize);
        data_&I
        %end;
        open=defer indsname=indsn ;
      sim_num=input(scan(indsn,-1,'_'),5.);
    run;
    proc sql noprint;
      create view vtappend&grp as
        select *, n(var1) as count1, sum(var2) as total2 
        from vt&grp 
        group by sim_num,id,year,month;
    quit;
    proc append base=simulation data=vtappend&grp;
    run;
  %end;
%mend sim_by_grps;

%sim_by_grps;

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Jonate_H
Quartz | Level 8

Thank you mkeintz for your inputs! I will spend some time to digest your advice.

mkeintz
PROC Star

Excellent.

 

And if the performance is still insufficient, then it's time to put the SQL "create view" aside (because it doesn't take advantage of the fact that the incoming dataset is sorted by SIM_NUM).  You would then either do further DATA step programming, or run a "PROC MEAN ... NOPRINT NWAY;" with a "BY SIM_NUM;" and "CLASS ID YEAR MONTH:" among the related statements.  The result  can be easily merged with the original data to be appended.

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 3597 views
  • 1 like
  • 3 in conversation