topic Re: How can I do a loop to faster my data step in SAS Programming

How can I do a loop to faster my data step

maxjiang6999 — Wed, 28 Aug 2019 22:18:17 GMT

Hi there,

I have 9 datasets which are quite big and I just want to keep some variables from those datasets. At last, I want to append/merge those 9 files to create a master file. What I have done so far is:

DATA TEST1; SET DSN1; KEEP &outcome_vars ;RUN;
DATA TEST2; SET DSN2; KEEP &outcome_vars ;RUN;
DATA TEST3; SET DSN3; KEEP &outcome_vars ;RUN;
DATA TEST4; SET DSN4; KEEP &outcome_vars ;RUN;

.....

As you can tell, it's very tedious. Is there a more efficient way?

Regarding append/merge, I'm thinking using:

PROC APPEND BASE = TEST1 DATA = TEST2 OUT = TEST;
RUN;

PROC APPEND BASE = TEST1 DATA = TEST3 OUT = TEST;
RUN;

...

data master;

merge test1 test2 test3 test4;

run;

(I'm not sure if I can merge >2 files at once.)

Thanks,

Re: How can I do a loop to faster my data step

ChrisNZ — Wed, 28 Aug 2019 22:30:26 GMT

Have you tried this:

data MASTER;
  set TEST1 (keep=&vars) 
      TEST2 (keep=&vars) 
      TEST3 (keep=&vars) 
      TEST4 (keep=&vars) ;
run;

Re: How can I do a loop to faster my data step

ChrisNZ — Wed, 28 Aug 2019 22:31:58 GMT

Maybe you don't need to create the large table though, maybe a view is enough.

data MASTER_V/view=MASTER_V;
  set TEST1 (keep=&vars) 
      TEST2 (keep=&vars) 
      TEST3 (keep=&vars) 
      TEST4 (keep=&vars) ;
run;

Re: How can I do a loop to faster my data step

maxjiang6999 — Wed, 28 Aug 2019 22:34:08 GMT

Hi,

thanks for the quick reply.

Before merging data, I'm thinking to make my DATA step more efficient. I don't want to manually write 9 lines of code in the form of:

DATA TEST1; SET DSN1; KEEP &outcome_vars ;RUN;
DATA TEST2; SET DSN2; KEEP &outcome_vars ;RUN;
DATA TEST3; SET DSN3; KEEP &outcome_vars ;RUN;
DATA TEST4; SET DSN4; KEEP &outcome_vars ;RUN;

Re: How can I do a loop to faster my data step

maxjiang6999 — Wed, 28 Aug 2019 22:38:30 GMT

It takes longer for me to run DATA instead of PROC APPEND. These 9 files are pretty big. So if there's an option, I'd avoid going through each obs in the dataset.

Re: How can I do a loop to faster my data step

Patrick — Wed, 28 Aug 2019 23:09:16 GMT

@maxjiang6999

The question is why you've got these 9 source tables in first place. If that's something you create then consider if a change upstream could allow you to process all the data in 1 go.

It's true that an append is faster than a data step set BUT you can only append 1 table at a time so you would have to write 1 data step which creates the table structure (using your first table) and then write 8 separate Proc Append statements to append the other 8 tables - or you need to generate it using SAS Macro language.

So... if performance is not that critical then I'd go for what @ChrisNZ suggested: A single data step where you just "set" all the tables.

If these tables follow a naming convention as in your code sample then it could even be as simple as:

data MASTER_V;
  set TEST1 - test9;
  keep=&vars; 
run;

...and also the proposed data step view is a viable option if the only purpose of this master is to have a way to read all the tables at once for downstream processing.

Re: How can I do a loop to faster my data step

ChrisNZ — Thu, 29 Aug 2019 00:04:20 GMT

This then?

proc append base=MASTER_V data=TEST1 (keep=&vars) ;
proc append base=MASTER_V data=TEST2 (keep=&vars) ;
proc append base=MASTER_V data=TEST3 (keep=&vars) ;
proc append base=MASTER_V data=TEST4 (keep=&vars) ;
run;

Re: How can I do a loop to faster my data step

ChrisNZ — Thu, 29 Aug 2019 00:08:02 GMT

>It takes longer for me to run DATA instead of PROC APPEND. These 9 files are pretty big. So if there's an option, I'd avoid going through each obs in the dataset.

Fair point and good on you for considering efficiency.

If you keep data steps, consider increasing the BUFNO and BUFSIZE for faster I/O.

Re: How can I do a loop to faster my data step

hashman — Thu, 29 Aug 2019 01:43:52 GMT

@maxjiang6999:

The main direction of your intended method is right since APPEND is the fastest tool for the job. It's faster than concatenating the files in the DATA step or UNIONizing them via SQL, because APPEND stacks the DATA= file atop the BASE= file as a block, thus avoiding extruding every record through the PDV. As to MERGE, it's a wrong tool for the job - you want to stack the files vertically, while MERGE does it, roughly speaking, horizontally.

However, your APPEND syntax is wrong: It has no OUT= option but merely appends DATA= to BASE=. Furthermore, there's no need to create the interim files TEST1, TEST2, etc. Hence, your process should look schematically as follows:

proc append base = master data = dsn1 (keep = &outcome_vars) ; run ;
proc append base = master data = dsn2 (keep = &outcome_vars) ; run ;
proc append base = master data = dsn3 (keep = &outcome_vars) ; run ;
proc append base = master data = dsn4 (keep = &outcome_vars) ; run ;
proc append base = master data = dsn5 (keep = &outcome_vars) ; run ;
proc append base = master data = dsn6 (keep = &outcome_vars) ; run ;
proc append base = master data = dsn7 (keep = &outcome_vars) ; run ;
proc append base = master data = dsn8 (keep = &outcome_vars) ; run ;
proc append base = master data = dsn9 (keep = &outcome_vars) ; run ;

Frankly, with only 9 files to append, not quite tedious. But supposing that in a different situation you can have more, you may want to generate code like above given the number of files N as a parameter. There're many ways to do this. Using a macro usually comes to mind first, especially since in this case it's very simple:

%macro stack (base=, prefix=, keep=, N=) ;                    
  %do i = 1 %to &N ;                                          
    proc append base = &base data = &prefix&i (keep = &keep) ;
    run ;                                                     
  %end ;                                                      
%mend ;                                                       
                                                              
%stack (base=master, prefix=dsn, keep=&outcome_vars, N=9)

Some people like CALL EXECUTE better:

data _null_ ;                                                                                                   
  retain base "master" prefix "dsn" keep "&outcome_vars" N 9 ;                                                  
  do i = 1 to N ;                                                                                               
    call execute (catx (" ", "proc append base=", base, "data=", cats (prefix, N), "(keep=", keep, "); run;")) ;
  end ;                                                                                                         
run ;

Suum cuique.

Kind regards

Paul D.