BookmarkSubscribeRSS Feed
SJN
Fluorite | Level 6 SJN
Fluorite | Level 6

data a1;
input no name$;
cards;
1 a
1 b
1 c
1 d
;
run;

data a2;
input no name$;
cards;
1 e
1 f
;
run;

data DatastepOP;
merge a1(in=a) a2(in=b);
by no;
if a;
run;

 

proc sql;
create table sqlopt
as
select a1.*,a2.*
from a1 left join a2
on a1.no=a2.no;
quit;

Output:

Data Step Output     Proc Sql Output
no name   no name
1 e   1 c
1 f   1 a
1 c   1 b
1 d   1 d
      1 c
      1 a
      1 b
      1 d

 

In both we are doing left join why there is difference in record count?

4 REPLIES 4
Kurt_Bremser
Super User

In many-to-many situations, a data step merge behaves different that a SQL join. SQL matches every record from a1 with every record from a2, creating m*n records. Merge works through both datasets sequentially side-by-side, and once there are no new records in one of the datasets, the last one read is kept for merging; so you end up with max(m,n) records. You could better illustrate the behaviour if you named "name" differently in the second dataset a2.

 

In your example, the data step

- reads first record from a1, sets name to "a"

- reads first record from a2, sets name to "e"

- outputs and proceeds to next iteration

- reads second record from a1, sets name to "b"

- reads second record from a2, sets name to "f"

- outputs and proceeds to next iteration

- reads third record from a1, sets name to "c"

- finds no further record from a2, so does not overwrite name

- outputs and proceeds to next iteration

- reads fourth record from a1, sets name to "d"

- finds no further record from a2, so does not overwrite name

- outputs and proceeds to next iteration

- finds no new records, stops

 

SQL, OTOH, creates all combinations, but since name is present in both datasets (you will find a nastygram about this in the log), SQL takes only the first occurence; that's why you never get a value from a2.

SJN
Fluorite | Level 6 SJN
Fluorite | Level 6

Thank you for the reply. I see both the methods are correct but which is the best method to use?

Kurt_Bremser
Super User

First of all, I usually prevent a m:n situation when using a datastep merge, as the result is not really predictable; the NOTE in the log about multiple records in both datasets usually means (in my codes) that something went wrong when setting up for the merge.

Even if I have situation where I have 4 records in dataset a and 2 records in dataset b with identical data (so all resulting records would have the same values from b), I consider that sloppy programming. I'd rather do a sort nodupkey on b first, as that makes my intentions clear in the code.

 

If you have a real m:n situation (meaning none of the contributing datasets have redundant records), the cartesian join that SQL does will usually be correct and desired, as all possible combinations are iterated on.

 

 

LinusH
Tourmaline | Level 20

One of the oldest SAS questions.

Data step works row wise, where SQL works on columns, and performs Cartesian products, and filter by the join criteria.

I suggest to go through the SAS Language concepts doc, and the SAS programming courses for better understanding.

Data never sleeps

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 4 replies
  • 4692 views
  • 0 likes
  • 3 in conversation