BruceBrad
Lapis Lazuli | Level 10

I have a job which concatenates many large files into two files and then merges them. I'm running out of disk space. Can I concatenate and merge in the same data step to remove some of the work files? Would proc sql help, or will it also create large working files? The pseudo code below describes the current structure.

data P;
set P1 P2 ... Pn;
run;
data H;
set H1 H2 ... Hn;
run;
data M;
merge P H;
by IDvars;
run;
[output based on M]

8 REPLIES
Tom
Super User

Do you really need all of the data?  Can you eliminate observations or variables in any of the steps?  Getting rid of data you don't need as early as you can will save the most space and time.
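For example, KEEP= and WHERE= dataset options can trim each input as it is concatenated rather than afterwards. A sketch only; the variable names and the condition are placeholders for whatever your data actually needs:

data P;
  set P1 (keep=IDvars year needvar where=(year >= 2015))
      P2 (keep=IDvars year needvar where=(year >= 2015))
      /* ... and so on through Pn ... */ ;
run;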

 

Are you using the COMPRESS option? Either the system option or the dataset option?  Does that reduce the size of the datasets?  If you have long character variables this can reduce disk usage by a lot.  But if the datasets have only a few short variables it might not help at all.
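As a sketch, the two forms look like this (member names follow the pseudo code above):

/* System option: every data set created after this point is compressed */
options compress=yes;

/* Dataset option: compress only this one output data set */
data P (compress=yes);
  set P1 P2;  /* ... through Pn ... */
run;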

 

Did you try making the first two steps views? (Note that this might still require large utility space.)

data P / view=P ;
set P1 P2 ... Pn;
run;
data H / view=H;
set H1 H2 ... Hn;
run;

Or even the third step?
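A sketch of what that might look like, assuming P and H are the views defined above, the inputs are already in BY order, and the merged result is only read once by the step that follows:

data M / view=M;
merge P H;
by IDvars;
run;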

 

 

SASKiwi
PROC Star

Before trying different joining strategies, I suggest you add this at the start of your program and just rerun as is to see how much further it gets you:

options compress = yes;
Astounding
PROC Star
(Accepted Solution)

Let's talk about the IDVARS.  Does the same combination appear once or more than once?  If it's just once among all the P data sets, and just once among all the H data sets, you can eliminate the middleman:

 

data want;
   merge p1 p2 p3 ... pn
         h1 h2 h3 ... hn;
   by idvars;
run;
BruceBrad
Lapis Lazuli | Level 10

I think this might work. (I've already tried compress, length and program reorganisation).

 

The IDvars are not all unique, but I think I can see how it would work. The IDvars are the file number (file=1 for P1 and H1, file=2 for P2 and H2, etc.) and the case number within each file; each case number appears once in the H (household) file and multiple times in the P (person) file. The desired output is one record for each P record, with the H vars attached.

BruceBrad
Lapis Lazuli | Level 10

Here is a test program using the single data step. The output seems correct, but I get lots of information messages about data being overwritten. What's going on here?

 

%macro hfile(filen);
data h&filen;
retain file &filen;
do id = 1 to 5;
  v1 = 1000*file + id;
  output;
end;
run;
%mend;
%macro pfile(filen);
data p&filen;
retain file &filen;
do id = 1 to 5;
  do person = 1 to 2;
    v2 = 100000*file + 100*id + person;
    output;
  end;
end;
run;
%mend;
%hfile(1);
%pfile(1);
%hfile(2);
%pfile(2);

data merged;
merge h1 h2 p1 p2;
by file id;
run;
Tom
Super User

Because you have V1 (and V2), which are not among the BY variables but appear on more than one of the datasets being merged.

 

If you are positive that the same values of the BY variables will never appear in both H1 and H2 then you can ignore the warning.  If you set the MSGLEVEL option to N instead of I then those messages will not be written.

 

Or you could rename those variables.
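For example, suppressing the messages is a one-line option change (it only affects what the log prints, not the result):

options msglevel=n;  /* the default level; the INFO overwrite messages are no longer printed */

data merged;
merge h1 h2 p1 p2;
by file id;
run;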

Tom
Super User

If you are sure that every H record has at least one P record (and every P record has a matching H record) then you could re-create the merge using interleaving SET statements instead.  Try something like this:

 

options msglevel=i;
data test1;
  merge h1 h2 p1 p2;
  by file id;
run;

data test2;
  set h1(in=inh1 keep=file id) h2(in=inh2 keep=file id) p1(keep=file id) p2(keep=file id);
  by file id;
  if max(inh1,inh2) then do;
    set h1 h2;
    by file id;
    delete;
  end;
  set p1 p2;
  by file id;
run;

proc compare data=test1 compare=test2;
run;

Log 

163  options msglevel=i;
164  data test1;
165    merge h1 h2 p1 p2;
166    by file id;
167  run;

INFO: The variable v1 on data set WORK.H1 will be overwritten by data set WORK.H2.
INFO: The variable person on data set WORK.P1 will be overwritten by data set WORK.P2.
INFO: The variable v2 on data set WORK.P1 will be overwritten by data set WORK.P2.
NOTE: There were 5 observations read from the data set WORK.H1.
NOTE: There were 5 observations read from the data set WORK.H2.
NOTE: There were 10 observations read from the data set WORK.P1.
NOTE: There were 10 observations read from the data set WORK.P2.
NOTE: The data set WORK.TEST1 has 20 observations and 5 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.00 seconds


168
169  data test2;
170    set h1(in=inh1 keep=file id) h2(in=inh2 keep=file id) p1(keep=file id) p2(keep=file id);
171    by file id;
172    if max(inh1,inh2) then do;
173      set h1 h2;
174      by file id;
175      delete;
176    end;
177    set p1 p2;
178    by file id;
179  run;

NOTE: There were 5 observations read from the data set WORK.H1.
NOTE: There were 5 observations read from the data set WORK.H2.
NOTE: There were 10 observations read from the data set WORK.P1.
NOTE: There were 10 observations read from the data set WORK.P2.
NOTE: There were 5 observations read from the data set WORK.H1.
NOTE: There were 5 observations read from the data set WORK.H2.
NOTE: There were 10 observations read from the data set WORK.P1.
NOTE: There were 10 observations read from the data set WORK.P2.
NOTE: The data set WORK.TEST2 has 20 observations and 5 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds

[Attached image: Tom_0-1721362048977.png]

 

BruceBrad
Lapis Lazuli | Level 10
Given my data structure, doing the single merge (and suppressing the warnings) seems simplest. Thanks all.
