I have a job that concatenates many large files into two files and then merges them. I'm running out of disk space. Can I concatenate and merge in the same data step to eliminate some of the work files? Would PROC SQL help, or will it also create large working files? The pseudo-code below describes the current structure.
data P;
  set P1 P2 ... Pn;
run;

data H;
  set H1 H2 ... Hn;
run;

data M;
  merge P H;
  by IDvars;
run;

[output based on M]
Let's talk about the IDVARS. Does the same combination appear once or more than once? If it's just once among all the P data sets, and just once among all the H data sets, you can eliminate the middleman:
data want;
  merge p1 p2 p3 ... pn
        h1 h2 h3 ... hn;
  by idvars;
run;
Do you really need all of the data? Can you eliminate observations or variables in any of the steps? Getting rid of data you don't need as early as you can will save the most space and time.
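For instance, dataset options on the SET statement discard unneeded rows and variables before anything is written to WORK, so the intermediate files never grow to full size. A sketch (the kept variables and the WHERE condition here are hypothetical; substitute whatever your output step actually uses):

```sas
data P;
  /* hypothetical: keep only the ID variables and the analysis
     variables, and drop out-of-scope rows as early as possible */
  set P1(keep=file id income age where=(age >= 18))
      P2(keep=file id income age where=(age >= 18));
run;
```

The same KEEP= and WHERE= options work on the MERGE statement, so they can be applied even if you move to a single-step merge.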
Are you using the COMPRESS option? Either the system option or the dataset option? Does it reduce the size of the datasets? If you have long character variables, this can reduce disk usage by a lot. But if the datasets have only a few short variables it might not help at all.
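A quick way to check whether compression pays off is to apply it as a dataset option on one of the large files and read the log, which reports the percentage reduction (COMPRESS=CHAR targets character data; COMPRESS=BINARY can do better when there are many numeric variables):

```sas
/* system-wide: every dataset created after this point is compressed */
options compress=yes;

/* or per dataset; the log notes how much the size decreased */
data P(compress=char);
  set P1 P2 ... Pn;
run;
```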
Did you try making the first two steps views? (Note that this might still require large utility space.)
data P / view=P;
  set P1 P2 ... Pn;
run;

data H / view=H;
  set H1 H2 ... Hn;
run;
Or even the third step?
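If the step that produces the output reads M only once, M can be a view as well, so the merged file is never materialized on disk. A sketch based on the pseudo-code above (the consuming PROC is just a placeholder for your actual output step):

```sas
data M / view=M;
  merge P H;
  by IDvars;
run;

/* the merge executes here, at the moment M is read */
proc means data=M;
run;
```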
Before trying different joining strategies, I suggest you add this at the start of your program and just rerun as is to see how much further it gets you:
options compress = yes;
Let's talk about the IDVARS. Does the same combination appear once or more than once? If it's just once among all the P data sets, and just once among all the H data sets, you can eliminate the middleman:
data want;
  merge p1 p2 p3 ... pn
        h1 h2 h3 ... hn;
  by idvars;
run;
I think this might work. (I've already tried COMPRESS, lengths, and program reorganisation.)
The IDvars are not all unique, but I can see how it would work. The IDvars are the file number (file=1 for P1 and H1, file=2 for P2 and H2, etc.) and the case number within each file (appearing once in the H (household) file and multiple times in the P (person) file). The desired output is one record for each P record, with the H variables attached.
Here is a test program using the single data step. The output seems correct, but I get lots of informational messages about data being overwritten. What's going on here?
%macro hfile(filen);
  data h&filen;
    retain file &filen;
    do id = 1 to 5;
      v1 = 1000*file + id;
      output;
    end;
  run;
%mend;

%macro pfile(filen);
  data p&filen;
    retain file &filen;
    do id = 1 to 5;
      do person = 1 to 2;
        v2 = 100000*file + 100*id + person;
        output;
      end;
    end;
  run;
%mend;
%hfile(1);
%pfile(1);
%hfile(2);
%pfile(2);
data merged;
  merge h1 h2 p1 p2;
  by file id;
run;
Because you have V1 (and V2), which are not BY variables but appear in more than one dataset.
If you are positive that the same values of the BY variables will never appear in both H1 and H2, then you can ignore the messages. If you set the MSGLEVEL option to N instead of I, those messages will not be written.
Or you could rename those variables.
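For example, renaming on the way in gives each source its own variable, so nothing gets overwritten (the new names are illustrative; PERSON and V2 on p1/p2 could be handled the same way):

```sas
data merged;
  merge h1(rename=(v1=v1_h1)) h2(rename=(v1=v1_h2))
        p1 p2;
  by file id;
run;
```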
If you are sure that every H record has at least one P record (and every P record has a matching H record), then you could re-create the merge using interleaving SET statements instead. Try something like this:
options msglevel=i;

data test1;
  merge h1 h2 p1 p2;
  by file id;
run;

data test2;
  set h1(in=inh1 keep=file id) h2(in=inh2 keep=file id)
      p1(keep=file id) p2(keep=file id);
  by file id;
  if max(inh1,inh2) then do;
    set h1 h2;
    by file id;
    delete;
  end;
  set p1 p2;
  by file id;
run;

proc compare data=test1 compare=test2;
run;
Log
163  options msglevel=i;
164  data test1;
165    merge h1 h2 p1 p2;
166    by file id;
167  run;

INFO: The variable v1 on data set WORK.H1 will be overwritten by data set WORK.H2.
INFO: The variable person on data set WORK.P1 will be overwritten by data set WORK.P2.
INFO: The variable v2 on data set WORK.P1 will be overwritten by data set WORK.P2.
NOTE: There were 5 observations read from the data set WORK.H1.
NOTE: There were 5 observations read from the data set WORK.H2.
NOTE: There were 10 observations read from the data set WORK.P1.
NOTE: There were 10 observations read from the data set WORK.P2.
NOTE: The data set WORK.TEST1 has 20 observations and 5 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.00 seconds

168
169  data test2;
170    set h1(in=inh1 keep=file id) h2(in=inh2 keep=file id) p1(keep=file id) p2(keep=file id);
171    by file id;
172    if max(inh1,inh2) then do;
173      set h1 h2;
174      by file id;
175      delete;
176    end;
177    set p1 p2;
178    by file id;
179  run;

NOTE: There were 5 observations read from the data set WORK.H1.
NOTE: There were 5 observations read from the data set WORK.H2.
NOTE: There were 10 observations read from the data set WORK.P1.
NOTE: There were 10 observations read from the data set WORK.P2.
NOTE: There were 5 observations read from the data set WORK.H1.
NOTE: There were 5 observations read from the data set WORK.H2.
NOTE: There were 10 observations read from the data set WORK.P1.
NOTE: There were 10 observations read from the data set WORK.P2.
NOTE: The data set WORK.TEST2 has 20 observations and 5 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds