DATA Step, Macro, Functions and more

Merge VS join

Reply
Contributor SJN
Contributor
Posts: 21

Merge VS join

data a1;
input no name$;
cards;
1 a
1 b
1 c
1 d
;
run;

data a2;
input no name$;
cards;
1 e
1 f
;
run;

data DatastepOP;
merge a1(in=a) a2(in=b);
by no;
if a;
run;

 

proc sql;
create table sqlopt
as
select a1.*,a2.*
from a1 left join a2
on a1.no=a2.no;
quit;

Output:

Data Step Output     Proc Sql Output
no name   no name
1 e   1 c
1 f   1 a
1 c   1 b
1 d   1 d
      1 c
      1 a
      1 b
      1 d

 

In both we are doing left join why there is difference in record count?

Super User
Posts: 7,760

Re: Merge VS join

In many-to-many situations, a data step merge behaves different that a SQL join. SQL matches every record from a1 with every record from a2, creating m*n records. Merge works through both datasets sequentially side-by-side, and once there are no new records in one of the datasets, the last one read is kept for merging; so you end up with max(m,n) records. You could better illustrate the behaviour if you named "name" differently in the second dataset a2.

 

In your example, the data step

- reads first record from a1, sets name to "a"

- reads first record from a2, sets name to "e"

- outputs and proceeds to next iteration

- reads second record from a1, sets name to "b"

- reads second record from a2, sets name to "f"

- outputs and proceeds to next iteration

- reads third record from a1, sets name to "c"

- finds no further record from a2, so does not overwrite name

- outputs and proceeds to next iteration

- reads fourth record from a1, sets name to "d"

- finds no further record from a2, so does not overwrite name

- outputs and proceeds to next iteration

- finds no new records, stops

 

SQL, OTOH, creates all combinations, but since name is present in both datasets (you will find a nastygram about this in the log), SQL takes only the first occurence; that's why you never get a value from a2.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Contributor SJN
Contributor
Posts: 21

Re: Merge VS join

Posted in reply to KurtBremser

Thank you for the reply. I see both the methods are correct but which is the best method to use?

Super User
Posts: 7,760

Re: Merge VS join

First of all, I usually prevent a m:n situation when using a datastep merge, as the result is not really predictable; the NOTE in the log about multiple records in both datasets usually means (in my codes) that something went wrong when setting up for the merge.

Even if I have situation where I have 4 records in dataset a and 2 records in dataset b with identical data (so all resulting records would have the same values from b), I consider that sloppy programming. I'd rather do a sort nodupkey on b first, as that makes my intentions clear in the code.

 

If you have a real m:n situation (meaning none of the contributing datasets have redundant records), the cartesian join that SQL does will usually be correct and desired, as all possible combinations are iterated on.

 

 

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Super User
Posts: 5,424

Re: Merge VS join

One of the oldest SAS questions.

Data step works row wise, where SQL works on columns, and performs Cartesian products, and filter by the join criteria.

I suggest to go through the SAS Language concepts doc, and the SAS programming courses for better understanding.

Data never sleeps
Ask a Question
Discussion stats
  • 4 replies
  • 397 views
  • 0 likes
  • 3 in conversation