Re: Merge VS join

SJN · Posted 09-29-2016 02:54 AM

data a1;
input no name$;
cards;
1 a
1 b
1 c
1 d
;
run;

data a2;
input no name$;
cards;
1 e
1 f
;
run;

data DatastepOP;
merge a1(in=a) a2(in=b);
by no;
if a;
run;

proc sql;
create table sqlopt
as
select a1.*,a2.*
from a1 left join a2
on a1.no=a2.no;
quit;

Output:

Data Step Output		Proc Sql Output
no	name	no	name
1	e	1	c
1	f	1	a
1	c	1	b
1	d	1	d
		1	c
		1	a
		1	b
		1	d

In both we are doing left join why there is difference in record count?

Kurt_Bremser · Posted 09-29-2016 03:18 AM

In many-to-many situations, a data step merge behaves different that a SQL join. SQL matches every record from a1 with every record from a2, creating m*n records. Merge works through both datasets sequentially side-by-side, and once there are no new records in one of the datasets, the last one read is kept for merging; so you end up with max(m,n) records. You could better illustrate the behaviour if you named "name" differently in the second dataset a2.

In your example, the data step

- reads first record from a1, sets name to "a"

- reads first record from a2, sets name to "e"

- outputs and proceeds to next iteration

- reads second record from a1, sets name to "b"

- reads second record from a2, sets name to "f"

- outputs and proceeds to next iteration

- reads third record from a1, sets name to "c"

- finds no further record from a2, so does not overwrite name

- outputs and proceeds to next iteration

- reads fourth record from a1, sets name to "d"

- finds no further record from a2, so does not overwrite name

- outputs and proceeds to next iteration

- finds no new records, stops

SQL, OTOH, creates all combinations, but since name is present in both datasets (you will find a nastygram about this in the log), SQL takes only the first occurence; that's why you never get a value from a2.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

SJN · Posted 09-29-2016 03:26 AM

Thank you for the reply. I see both the methods are correct but which is the best method to use?

Kurt_Bremser · Posted 09-29-2016 03:36 AM

First of all, I usually prevent a m:n situation when using a datastep merge, as the result is not really predictable; the NOTE in the log about multiple records in both datasets usually means (in my codes) that something went wrong when setting up for the merge.

Even if I have situation where I have 4 records in dataset a and 2 records in dataset b with identical data (so all resulting records would have the same values from b), I consider that sloppy programming. I'd rather do a sort nodupkey on b first, as that makes my intentions clear in the code.

If you have a real m:n situation (meaning none of the contributing datasets have redundant records), the cartesian join that SQL does will usually be correct and desired, as all possible combinations are iterated on.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX