DATA Step, Macro, Functions and more

How to speed up the proc sql

Accepted Solution Solved
Reply
Frequent Contributor
Posts: 130
Accepted Solution

How to speed up the proc sql

[ Edited ]

I have a dataset like several million obs, and I am doing 10000 simulations. The following is part of the code, it takes really long time to finish this computation. Is there any way to speed up the proc sql steps? Thanks a lot as always!

 

%macro sim_table;

%do i=1 %to 10000;

proc sql noprint;
create table sim_&i as 
select *,
(select count(var1) from data_&i where id=a.id and year=a.year and month=a.month) as count1,
(select sum  (var2) from data_&i where id=a.id and year=a.year and month=a.month) as total2
from data_&i as a;
quit;

proc append base = simulation data = sim_&i; run;

%end;
%mend;
%sim_table;

Accepted Solutions
Solution
‎12-01-2016 03:04 AM
Respected Advisor
Posts: 4,659

Re: How to peed up the proc sql

[ Edited ]

Splitting your data into 10000 datasets is not usually an efficient strategy. But if you want to go that route, this query would be more efficient:

 

%macro sim_table;

%do i=1 %to 10000;

	proc sql noprint;
	create table sim_&i as 
	select 
		*,
		count(var1) as count1,
		sum(var2) as total2
	from data_&i 
	group by  id, year, month;
	quit;

	proc append base = simulation data = sim_&i; run;

	%end;
%mend;

%sim_table; 

 

PG

View solution in original post


All Replies
Solution
‎12-01-2016 03:04 AM
Respected Advisor
Posts: 4,659

Re: How to peed up the proc sql

[ Edited ]

Splitting your data into 10000 datasets is not usually an efficient strategy. But if you want to go that route, this query would be more efficient:

 

%macro sim_table;

%do i=1 %to 10000;

	proc sql noprint;
	create table sim_&i as 
	select 
		*,
		count(var1) as count1,
		sum(var2) as total2
	from data_&i 
	group by  id, year, month;
	quit;

	proc append base = simulation data = sim_&i; run;

	%end;
%mend;

%sim_table; 

 

PG
Frequent Contributor
Posts: 130

Re: How to peed up the proc sql

Thank you PG! I will try this one.

You are right, split data is never a good practice. Since I have to connect to public computer for the computation, and I have very limited storage quota, so I have to split original data into small ones, run simulation, then only combine outputs into single file. Otherwise, the simulation will result in huge dataset that exceed my quota Smiley Happy

Valued Guide
Posts: 797

Re: How to peed up the proc sql

And why not cut your input/output possibly in half?

 

Don't "create table",   Do "create view".

 

 

Regards,

Mark

 

 

Valued Guide
Posts: 797

Re: How to peed up the proc sql

Another way to speed this up is to create a data set view of, say 100 data sets, each distinguished by a SIM_NUM.  The do a proc sql create view on those 100 and append the results all at once.  The only substantial disk space required will be your SIMULATION data set.

 

To start, I'd suggest running it with

    %let end=4

and

    %let grpsize=2

and

    options mprint;

to see the code it generates.

 

%let start=1;
%let end=10000;
%let grpsize=100;


%macro sim_by_grps;
  %do grp=%eval(&start-1) %to %eval(&end-&grpsize) %by &grpsize;
    data vt&grp /view=vt&grp;
      set 
        %do I=%eval(&grp+1) %to %eval(&grp+&grpsize);
        data_&I
        %end;
        open=defer indsname=indsn ;
      sim_num=input(scan(indsn,-1,'_'),5.);
    run;
    proc sql noprint;
      create view vtappend&grp as
        select *, n(var1) as count1, sum(var2) as total2 
        from vt&grp 
        group by sim_num,id,year,month;
    quit;
    proc append base=simulation data=vtappend&grp;
    run;
  %end;
%mend sim_by_grps;

%sim_by_grps;

 

Frequent Contributor
Posts: 130

Re: How to peed up the proc sql

Thank you mkeintz for your inputs! I will spend some time to digest your advice.

Valued Guide
Posts: 797

Re: How to peed up the proc sql

Excellent.

 

And if the performance is still insufficient, then it's time to put the SQL "create view" aside (because it doesn't take advantage of the fact that the incoming dataset is sorted by SIM_NUM).  You would then either do further DATA step programming, or run a "PROC MEAN ... NOPRINT NWAY;" with a "BY SIM_NUM;" and "CLASS ID YEAR MONTH:" among the related statements.  The result  can be easily merged with the original data to be appended.

 

 

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 6 replies
  • 393 views
  • 1 like
  • 3 in conversation