BookmarkSubscribeRSS Feed
maxjiang6999
Calcite | Level 5

Hi there, 

I have 9 datasets which are quite big and I just want to keep some variables from those datasets. At last, I want to append/merge those 9 files to create a master file. What I have done so far is:

 

DATA TEST1; SET DSN1; KEEP &outcome_vars ;RUN;
DATA TEST2; SET DSN2; KEEP &outcome_vars ;RUN;
DATA TEST3; SET DSN3; KEEP &outcome_vars ;RUN;
DATA TEST4; SET DSN4; KEEP &outcome_vars ;RUN;

.....

 

As you can tell, it's very tedious. Is there a more efficient way?

 

Regarding append/merge, I'm thinking using:

PROC APPEND BASE = TEST1 DATA = TEST2 OUT = TEST;
RUN;

PROC APPEND BASE = TEST1 DATA = TEST3 OUT = TEST;
RUN;

...

 

OR

data master;

merge test1 test2 test3 test4;

run;

(I'm not sure if I can merge >2 files at once.)

 

 

Thanks,

8 REPLIES 8
ChrisNZ
Tourmaline | Level 20

Have you tried this:

data MASTER;
  set TEST1 (keep=&vars) 
      TEST2 (keep=&vars) 
      TEST3 (keep=&vars) 
      TEST4 (keep=&vars) ;
run; 
ChrisNZ
Tourmaline | Level 20

Maybe you don't need to create the large table though, maybe a view is enough.

data MASTER_V/view=MASTER_V;
  set TEST1 (keep=&vars) 
      TEST2 (keep=&vars) 
      TEST3 (keep=&vars) 
      TEST4 (keep=&vars) ;
run; 

 

maxjiang6999
Calcite | Level 5
It takes longer for me to run DATA instead of PROC APPEND. These 9 files are pretty big. So if there's an option, I'd avoid going through each obs in the dataset.
Patrick
Opal | Level 21

@maxjiang6999 

The question is why you've got these 9 source tables in first place. If that's something you create then consider if a change upstream could allow you to process all the data in 1 go.

It's true that an append is faster than a data  step set BUT you can only append 1 table at a time so you would have to write 1 data step which creates the table structure (using your first table) and then write 8 separate Proc Append statements to append the other 8 tables - or you need to generate it using SAS Macro language.

So... if performance is not that critical then I'd go for what @ChrisNZ suggested: A single data step where you just "set" all the tables.

If these tables follow a naming convention as in your code sample then it could even be as simple as:

data MASTER_V;
  set TEST1 - test9;
  keep=&vars; 
run;

...and also the proposed data step view is a viable option if the only purpose of this master is to have a way to read all the tables at once for downstream processing.

 

ChrisNZ
Tourmaline | Level 20

>It takes longer for me to run DATA instead of PROC APPEND. These 9 files are pretty big. So if there's an option, I'd avoid going through each obs in the dataset.

 

Fair point and good on you for considering efficiency.

 

If you keep data steps, consider increasing the BUFNO and BUFSIZE for faster I/O.

 

maxjiang6999
Calcite | Level 5

Hi,

thanks for the quick reply. 

Before merging data, I'm thinking to make my DATA step more efficient. I don't want to manually write 9 lines of code in the form of:


DATA TEST1; SET DSN1; KEEP &outcome_vars ;RUN;
DATA TEST2; SET DSN2; KEEP &outcome_vars ;RUN;
DATA TEST3; SET DSN3; KEEP &outcome_vars ;RUN;
DATA TEST4; SET DSN4; KEEP &outcome_vars ;RUN;

ChrisNZ
Tourmaline | Level 20

This then?

proc append base=MASTER_V data=TEST1 (keep=&vars) ;
proc append base=MASTER_V data=TEST2 (keep=&vars) ;
proc append base=MASTER_V data=TEST3 (keep=&vars) ;
proc append base=MASTER_V data=TEST4 (keep=&vars) ;
run; 
hashman
Ammonite | Level 13

@maxjiang6999:

The main direction of your intended method is right since APPEND is the fastest tool for the job. It's faster than concatenating the files in the DATA step or UNIONizing them via SQL, because APPEND stacks the DATA= file atop the BASE= file as a block, thus avoiding extruding every record through the PDV. As to MERGE, it's a wrong tool for the job - you want to stack the files vertically, while MERGE does it, roughly speaking, horizontally.

 

However, your APPEND syntax is wrong: It has no OUT= option but merely appends DATA= to BASE=. Furthermore, there's no need to create the interim files TEST1, TEST2, etc. Hence, your process should look schematically as follows:

proc append base = master data = dsn1 (keep = &outcome_vars) ; run ;
proc append base = master data = dsn2 (keep = &outcome_vars) ; run ;
proc append base = master data = dsn3 (keep = &outcome_vars) ; run ;
proc append base = master data = dsn4 (keep = &outcome_vars) ; run ;
proc append base = master data = dsn5 (keep = &outcome_vars) ; run ;
proc append base = master data = dsn6 (keep = &outcome_vars) ; run ;
proc append base = master data = dsn7 (keep = &outcome_vars) ; run ;
proc append base = master data = dsn8 (keep = &outcome_vars) ; run ;
proc append base = master data = dsn9 (keep = &outcome_vars) ; run ;

Frankly, with only 9 files to append, not quite tedious. But supposing that in a different situation you can have more, you may want to generate code like above given the number of files N as a parameter. There're many ways to do this. Using a macro usually comes to mind first, especially since in this case it's very simple:

%macro stack (base=, prefix=, keep=, N=) ;                    
  %do i = 1 %to &N ;                                          
    proc append base = &base data = &prefix&i (keep = &keep) ;
    run ;                                                     
  %end ;                                                      
%mend ;                                                       
                                                              
%stack (base=master, prefix=dsn, keep=&outcome_vars, N=9)     

Some people like CALL EXECUTE better:

data _null_ ;                                                                                                   
  retain base "master" prefix "dsn" keep "&outcome_vars" N 9 ;                                                  
  do i = 1 to N ;                                                                                               
    call execute (catx (" ", "proc append base=", base, "data=", cats (prefix, N), "(keep=", keep, "); run;")) ;
  end ;                                                                                                         
run ;                                                                                                           

Suum cuique.

 

Kind regards

Paul D.

 

   

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 1122 views
  • 0 likes
  • 4 in conversation