data a1;
input no name$;
cards;
1 a
1 b
1 c
1 d
;
run;
data a2;
input no name$;
cards;
1 e
1 f
;
run;
data DatastepOP;
merge a1(in=a) a2(in=b);
by no;
if a;
run;
proc sql;
create table sqlopt
as
select a1.*,a2.*
from a1 left join a2
on a1.no=a2.no;
quit;
Output:
Data Step Output | Proc Sql Output | |||
no | name | no | name | |
1 | e | 1 | c | |
1 | f | 1 | a | |
1 | c | 1 | b | |
1 | d | 1 | d | |
1 | c | |||
1 | a | |||
1 | b | |||
1 | d |
In both we are doing left join why there is difference in record count?
In many-to-many situations, a data step merge behaves different that a SQL join. SQL matches every record from a1 with every record from a2, creating m*n records. Merge works through both datasets sequentially side-by-side, and once there are no new records in one of the datasets, the last one read is kept for merging; so you end up with max(m,n) records. You could better illustrate the behaviour if you named "name" differently in the second dataset a2.
In your example, the data step
- reads first record from a1, sets name to "a"
- reads first record from a2, sets name to "e"
- outputs and proceeds to next iteration
- reads second record from a1, sets name to "b"
- reads second record from a2, sets name to "f"
- outputs and proceeds to next iteration
- reads third record from a1, sets name to "c"
- finds no further record from a2, so does not overwrite name
- outputs and proceeds to next iteration
- reads fourth record from a1, sets name to "d"
- finds no further record from a2, so does not overwrite name
- outputs and proceeds to next iteration
- finds no new records, stops
SQL, OTOH, creates all combinations, but since name is present in both datasets (you will find a nastygram about this in the log), SQL takes only the first occurence; that's why you never get a value from a2.
Thank you for the reply. I see both the methods are correct but which is the best method to use?
First of all, I usually prevent a m:n situation when using a datastep merge, as the result is not really predictable; the NOTE in the log about multiple records in both datasets usually means (in my codes) that something went wrong when setting up for the merge.
Even if I have situation where I have 4 records in dataset a and 2 records in dataset b with identical data (so all resulting records would have the same values from b), I consider that sloppy programming. I'd rather do a sort nodupkey on b first, as that makes my intentions clear in the code.
If you have a real m:n situation (meaning none of the contributing datasets have redundant records), the cartesian join that SQL does will usually be correct and desired, as all possible combinations are iterated on.
One of the oldest SAS questions.
Data step works row wise, where SQL works on columns, and performs Cartesian products, and filter by the join criteria.
I suggest to go through the SAS Language concepts doc, and the SAS programming courses for better understanding.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.