I have two datasets.
Dataset A has Account IDs and Balance. Dataset B has Accounts IDs and other variables. Account IDs in dataset B are duplicated.
data a;
input acct_id Bal$
21 111
20 222
24 213
26 290
29 300
data b;
input acct_id y$
21 a
21 b
21 c
21 d
21 e
I want to merge these datasets so that i get the following dataset
21 a 111
21 b 111
21 c 111
21 d 111
21 d 111
21 e 111
I tried the following patch
data c
merge b(in=inb) a;
by acct_id;
if inb;
run;
But the results are not what i expected. The final dataset has almost double the observations than the b dataset. I'm not sure why that might be happening. What am i doing wrong?
Hi @cermank Your code works fine with a SORT step of both datasets by ACCT_ID before the merge
data a;
input acct_id Bal$;
cards;
21 111
20 222
24 213
26 290
29 300
;
data b;
input acct_id y$;
cards;
21 a
21 b
21 c
21 d
21 e
;
proc sort data=a;
by acct_id;
run;
proc sort data=b;
by acct_id;
run;
data c;
merge b(in=inb) a;
by acct_id;
if inb;
run;
proc print noobs;run;
acct_id | y | Bal |
---|---|---|
21 | a | 111 |
21 | b | 111 |
21 | c | 111 |
21 | d | 111 |
21 | e | 111 |
Okay, just a question though. Shouldn't proc sort on acct_id on dataset b delete duplicate ID's? How does it work here?
Hi @cermank SORT by name in itself is meant to rearrange records in an order. It could be ascending or descending. Please get your understanding that this is all SORT actually means or does by default. However, yes, you could apply some options to remove duplicates or unique depending upon your requirement. That being said, it is an option within the procedure SORT. Basically, if its sole objective was to remove duplicates, we could rather have a PROC REMOVE_DUPLICATE? 🙂
The nomenclature in SAS is in general self explanatory like proc sort, proc means(descriptive stats), proc summary(summary stats), proc freq(frequency) etc. I hope this helps. Btw Welcome to SAS community.
@cermank wrote:
Okay, just a question though. Shouldn't proc sort on acct_id on dataset b delete duplicate ID's? How does it work here?
I would even suggest to go for a SQL join to avoid explicit sortwhich happens to be a requirement for datastep merge until you have a strong understanding of match-merge in a datastep based on BY GROUPS.
So the same objective is achieved by a simple LEFT JOIN:
proc sql;
create table want as
select b.*, bal
from b left join a
on b.acct_id=a.acct_id
order by b.acct_id;
quit;
@cermank wrote:
...
I tried the following patch
data c; merge b(in=inb) a; by acct_id; if inb; run;
But the results are not what i expected. The final dataset has almost double the observations than the b dataset. I'm not sure why that might be happening. What am i doing wrong?
The output C will only have more observations than B when one or more values of ACCT_ID appear more than once in A. If you want to keep only one observation per value of ACCT_ID then you can augment your subsetting IF statement to also test for the flag variable FIRST.ACCT_ID that will be generated because of the BY statement.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.