Re: merging sequence

HeatherNewton · Posted 11-20-2023 10:13 PM

example 1
data application;
merge application static dynamic;
by app_ref_no
run;

example 2
data application;
merge application static;
by app_ref_no
run;

data application;
merge application dynamic;
by app_ref_no;
run;

is example 1 and example 2 going to give same results?

if yes does it apply to more than one variable used for merge

eg replace in the merge process' by app_ref_no' with 'by app_ref_no org_code;

etc

Tom · Posted 11-20-2023 11:06 PM

I cannot think of a situation where it would produce different results.

Why don't you try it yourself and see?

But I would create different outputs instead of overwriting one of the inputs.

data test1;
  merge application static dynamic;
  by app_ref_no;
run;

data test2_a;
  merge application static;
  by app_ref_no;
run;

data test2;
  merge test2_a dynamic;
  by app_ref_no;
run;

Note that the number of BY variables used to define the BY groups does not matter.

But if the BY variables need to identify the groupings is different between the datasets then it might make a difference in how you merge them.

Say you have two key variables app_ref_no and org_code. And some of the datasets have information at the app_ref_no level and others have it at the org_code level. Then you might want to first merge the datasets at the org_code level first and then merge in the app_ref_no datasets so you can use different BY statements for the two merges.

FreelanceReinh · Posted 11-21-2023 12:25 PM

@Tom wrote:

I cannot think of a situation where it would produce different results.

I couldn't think of one either, but creating generic datasets enlightened me: The results are different if

There is a variable in dataset DYNAMIC which occurs in either APPLICATION or STATIC, but not both.
The dataset not containing that variable has a BY group with more observations than in DYNAMIC, where the corresponding BY group, in turn, has at least as many observations as in the third dataset, but not zero.

Let n_a, n_s and n_d denote the numbers of observations in the relevant BY groups in APPLICATION, STATIC and DYNAMIC, respectively. Basically, there are four different cases satisfying the above criteria:

Variable not contained in APPLICATION, n_a> n_d ≥ n_s > 0
Variable not contained in APPLICATION, n_a> n_d > n_s = 0
Variable not contained in STATIC, n_s> n_d ≥ n_a > 0
Variable not contained in STATIC, n_s> n_d > n_a = 0

The app_ref_no BY groups 1, 2, 3, 4 in the sample datasets below represent small examples of these four cases.

data application;
input app_ref_no x :$3.;
cards;
1 ax1
1 ax2
2 ax3
2 ax4
3 ax5
;

data static;
input app_ref_no y :$3.;
cards;
1 sy1
3 sy2
3 sy3
4 sy4
4 sy5
;

data dynamic;
input app_ref_no x :$3. y :$3.;
cards;
1 dx1 dy1
2 dx2 dy2
3 dx3 dy3
4 dx4 dy4
;

Results:

TEST1 ("triple merge"):

Obs    ref_no     x      y

 1        1      dx1    dy1
 2        1      ax2    dy1
 3        2      dx2    dy2
 4        2      ax4    dy2
 5        3      dx3    dy3
 6        3      dx3    sy3
 7        4      dx4    dy4
 8        4      dx4    sy5

TEST2 ("two-stage merge"):

Obs    ref_no     x      y

 1        1      dx1    dy1
 2        1      ax2    sy1
 3        2      dx2    dy2
 4        2      ax4
 5        3      dx3    dy3
 6        3      ax5    sy3
 7        4      dx4    dy4
 8        4             sy5

In the "triple merge" the retained value (of the variable in question) from dataset DYNAMIC prevails in the last observation of each BY group because the other two datasets don't contribute any value. In the first step of the "two-stage merge", however, a dataset containing more values (missing or non-missing) than DYNAMIC is created and the surplus values prevail in the second merge step.

merging sequence

Re: merging sequence

Re: merging sequence

Registration is open

SAS Training: Just a Click Away