This got me thinking, so I ran a few more tests.
SAS gets very inefficient when processing hundreds of tables concurrently with a BY statement: the elapsed (and CPU) time grows much faster than linearly as more tables are added.
It is much better to split the processing as suggested in the first post.
This is even truer for the SET statement than for the MERGE statement.
%*********** C R E A T E S A M P L E **********;
%let nbobs=2000;
%let nbtab=1000;
%* Macro generating the list of output tables HAVE1-HAVE&nbtab.,
   each with renamed variables and a WHERE= filter so the tables differ in size;
%macro loop;
  %local i;
  %do i=1 %to &nbtab.;
    HAVE&i.(rename=(J=J&i. K=K&i. L=L&i. M=M&i.) compress=no where=(I<3 or I>&i.))
  %end;
%mend loop;

%* Create sample tables;
data %loop;
  do I=1 to &nbobs.;
    J=I; K=I; L=I; M=I;
    output;
  end;
run;
%************ T E S T M E R G E B Y *************;
%* merge 100 tables ==> 0.7 seconds;
data WANT;
merge HAVE1-HAVE100;
by I;
run;
%* merge 200 tables ==> 2 seconds;
data WANT;
merge HAVE1-HAVE200;
by I;
run;
%* merge 500 tables ==> 15 seconds;
data WANT;
merge HAVE1-HAVE500;
by I;
run;
%* merge 1000 tables ==> 65 seconds;
data WANT;
merge HAVE1-HAVE1000;
by I;
run;
%* Break the merge into 200-table blocks ==> total time 10 seconds;
data T1; merge HAVE1 - HAVE200; by I; run;
data T2; merge HAVE201 - HAVE400; by I; run;
data T3; merge HAVE401 - HAVE600; by I; run;
data T4; merge HAVE601 - HAVE800; by I; run;
data T5; merge HAVE801 - HAVE1000; by I; run;
data T6; merge T1-T5; by I; run;
%************** T E S T S E T B Y **************;
%* set 50 tables ==> 4 seconds;
data WANT;
set HAVE1-HAVE50;
by I;
run;
%* set 100 tables ==> 26 seconds;
data WANT;
set HAVE1-HAVE100;
by I;
run;
%* set 200 tables ==> 180 seconds;
data WANT;
set HAVE1-HAVE200;
by I;
run;
%* Break the set into 40-table blocks ==> total time 9 seconds;
data T1; set HAVE1 - HAVE40; by I; run;
data T2; set HAVE41 - HAVE80; by I; run;
data T3; set HAVE81 - HAVE120; by I; run;
data T4; set HAVE121 - HAVE160; by I; run;
data T5; set HAVE161 - HAVE200; by I; run;
data T6; set T1-T5; by I; run;
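To avoid hard-coding the block boundaries, the splitting can be wrapped in a small macro. This is only a sketch under the same assumptions as the tests above (tables named HAVE1-HAVEn, all sorted by I); the macro name and parameters are mine, and the same pattern works for MERGE by swapping the SET statements.
%macro split_set(ntab=200, blocksize=40);
  %local b lo hi nblocks;
  %* Number of blocks, rounding up;
  %let nblocks=%sysevalf(&ntab./&blocksize., ceil);
  %do b=1 %to &nblocks.;
    %let lo=%eval((&b.-1)*&blocksize.+1);
    %let hi=%eval(&b.*&blocksize.);
    %if &hi. > &ntab. %then %let hi=&ntab.;
    %* Interleave one block of tables;
    data T&b.; set HAVE&lo.-HAVE&hi.; by I; run;
  %end;
  %* Interleave the intermediate block tables;
  data WANT; set T1-T&nblocks.; by I; run;
%mend split_set;
%split_set(ntab=200, blocksize=40);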
@lingcx You have to test with your own data in your own environment, but you should find something similar.
If you go the UPDATE route, splitting should speed up the prior SET step. Multiple MERGEs may still be faster though.
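For reference, a minimal sketch of what I mean by that; the data set names MASTER and TRANS1-TRANS50 are hypothetical (they are not the HAVE tables above), and everything is assumed sorted by I.
%* Stack the transaction tables: this is the prior SET step that splitting speeds up;
data TRANS;
  set TRANS1-TRANS50;
  by I;
run;
%* Apply the stacked transactions to the master (I must be unique in MASTER);
data WANT;
  update MASTER TRANS;
  by I;
run;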
You could also run the tasks in parallel and see whether that improves performance. Here is general-purpose code for launching parallel batch jobs; try it in your application. The workload may turn out to be I/O bound, in which case this won't help much, but it is worth a shot.
options noxwait noxsync;
%let sasexe_options = -nosplash ;

%* Write a .cmd file for each batch SAS job to be launched;
data _null_ ;
  file "c:\temp\test1.cmd" ;
  put "sas -sysin 'c:\temp\test1.sas' &sasexe_options. -log 'c:\temp\test1.log'";
  file "c:\temp\test2.cmd" ;
  put "sas -sysin 'c:\temp\test2.sas' &sasexe_options. -log 'c:\temp\test2.log'";
run;

%* Clear any leftover tasks with these names, then launch both jobs asynchronously;
systask kill test1 test2;
systask command "c:\temp\test1.cmd" taskname=test1;
systask command "c:\temp\test2.cmd" taskname=test2;

%* Wait until both background jobs have finished;
waitfor _all_ test1 test2;
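For this thread's example, each child program would build part of the result. Note that the child sessions cannot see this session's WORK library, so the HAVE tables would have to sit in a permanent library both sessions can reach. A sketch of what c:\temp\test1.sas could contain (the SHARED libref, its path, and the split point are all assumptions):
%* c:\temp\test1.sas (hypothetical content): merge the first half of the tables;
libname shared "c:\temp\shared";   %* assumed shared location of the HAVE tables;
data shared.T1;
  merge shared.HAVE1-shared.HAVE500;
  by I;
run;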