One to many merge on SAS

cermank · Posted 09-14-2019 01:10 PM

I have two datasets.

Dataset A has Account IDs and Balance. Dataset B has Accounts IDs and other variables. Account IDs in dataset B are duplicated.

data a; 
input acct_id Bal$
21 111
20 222
24 213
26 290
29 300

data b;
input acct_id y$
21 a
21 b
21 c
21 d
21 e

I want to merge these datasets so that i get the following dataset

21 a 111
21 b 111
21 c 111
21 d 111
21 d 111
21 e 111

I tried the following patch

data c
merge b(in=inb) a;
by acct_id;
if inb;
run;

But the results are not what i expected. The final dataset has almost double the observations than the b dataset. I'm not sure why that might be happening. What am i doing wrong?

novinosrin · Posted 09-14-2019 01:19 PM

Hi @cermank Your code works fine with a SORT step of both datasets by ACCT_ID before the merge


data a; 
input acct_id Bal$;
cards;
21 111
20 222
24 213
26 290
29 300
;

data b;
input acct_id y$;
cards;
21 a
21 b
21 c
21 d
21 e
;
proc sort data=a;
by acct_id;
run;
proc sort data=b;
by acct_id;
run;
data c;
merge b(in=inb) a;
by acct_id;
if inb;
run;
proc print noobs;run;

acct_id	y	Bal
21	a	111
21	b	111
21	c	111
21	d	111
21	e	111

cermank · Posted 09-14-2019 01:21 PM

Okay, just a question though. Shouldn't proc sort on acct_id on dataset b delete duplicate ID's? How does it work here?

novinosrin · Posted 09-14-2019 01:27 PM

Hi @cermank SORT by name in itself is meant to rearrange records in an order. It could be ascending or descending. Please get your understanding that this is all SORT actually means or does by default. However, yes, you could apply some options to remove duplicates or unique depending upon your requirement. That being said, it is an option within the procedure SORT. Basically, if its sole objective was to remove duplicates, we could rather have a PROC REMOVE_DUPLICATE? 🙂

The nomenclature in SAS is in general self explanatory like proc sort, proc means(descriptive stats), proc summary(summary stats), proc freq(frequency) etc. I hope this helps. Btw Welcome to SAS community.

@cermank wrote:

Okay, just a question though. Shouldn't proc sort on acct_id on dataset b delete duplicate ID's? How does it work here?

novinosrin · Posted 09-14-2019 01:31 PM

I would even suggest to go for a SQL join to avoid explicit sortwhich happens to be a requirement for datastep merge until you have a strong understanding of match-merge in a datastep based on BY GROUPS.

So the same objective is achieved by a simple LEFT JOIN:

proc sql;
create table want as
select b.*, bal
from b left join a
on b.acct_id=a.acct_id
order by b.acct_id;
quit;

Tom · Posted 09-15-2019 10:07 AM

@cermank wrote:
...

I tried the following patch
data c;
merge b(in=inb) a;
by acct_id;
if inb;
run;
But the results are not what i expected. The final dataset has almost double the observations than the b dataset. I'm not sure why that might be happening. What am i doing wrong?

The output C will only have more observations than B when one or more values of ACCT_ID appear more than once in A. If you want to keep only one observation per value of ACCT_ID then you can augment your subsetting IF statement to also test for the flag variable FIRST.ACCT_ID that will be generated because of the BY statement.

One to many merge on SAS

Re: One to many merge on SAS

Re: One to many merge on SAS

Re: One to many merge on SAS

Re: One to many merge on SAS

Re: One to many merge on SAS

One to many merge on SAS

Re: One to many merge on SAS

Re: One to many merge on SAS

Re: One to many merge on SAS

Re: One to many merge on SAS

Re: One to many merge on SAS

Click image to register for webinar

Classroom Training Available!