BookmarkSubscribeRSS Feed
cermank
Calcite | Level 5

I have two datasets.

Dataset A has Account IDs and Balance. Dataset B has Accounts IDs and other variables. Account IDs in dataset B are duplicated.

data a; 
input acct_id Bal$
21 111
20 222
24 213
26 290
29 300

data b;
input acct_id y$
21 a
21 b
21 c
21 d
21 e

I want to merge these datasets so that i get the following dataset

21 a 111
21 b 111
21 c 111
21 d 111
21 d 111
21 e 111

I tried the following patch

data c
merge b(in=inb) a;
by acct_id;
if inb;
run;

But the results are not what i expected. The final dataset has almost double the observations than the b dataset. I'm not sure why that might be happening. What am i doing wrong?

5 REPLIES 5
novinosrin
Tourmaline | Level 20

Hi @cermank  Your code works fine with a SORT step of both datasets by ACCT_ID before the merge

 


data a; 
input acct_id Bal$;
cards;
21 111
20 222
24 213
26 290
29 300
;

data b;
input acct_id y$;
cards;
21 a
21 b
21 c
21 d
21 e
;
proc sort data=a;
by acct_id;
run;
proc sort data=b;
by acct_id;
run;
data c;
merge b(in=inb) a;
by acct_id;
if inb;
run;
proc print noobs;run;
acct_id y Bal
21 a 111
21 b 111
21 c 111
21 d 111
21 e 111
cermank
Calcite | Level 5

Okay, just a question though. Shouldn't proc sort on acct_id on dataset b delete duplicate ID's? How does it work here?

novinosrin
Tourmaline | Level 20

Hi @cermank   SORT by name in itself is meant to rearrange records in an order. It could be ascending or descending.  Please get your understanding that this is all SORT actually means or does by default. However, yes, you could apply some options to remove duplicates or unique depending upon your requirement. That being said, it is an option within the procedure SORT. Basically, if its sole objective was to remove duplicates, we could rather have a PROC REMOVE_DUPLICATE? 🙂 

 

The nomenclature in SAS is in general self explanatory like proc sort, proc means(descriptive stats), proc summary(summary stats), proc freq(frequency) etc. I hope this helps. Btw Welcome to SAS community.

 

 

 


@cermank wrote:

Okay, just a question though. Shouldn't proc sort on acct_id on dataset b delete duplicate ID's? How does it work here?


 

novinosrin
Tourmaline | Level 20

I would even suggest to go for a SQL join to avoid explicit sortwhich happens to be a requirement for datastep merge until you have a strong understanding of match-merge in a datastep based on BY GROUPS. 

 So the same objective is achieved by a simple LEFT JOIN:

 

proc sql;
create table want as
select b.*, bal
from b left join a
on b.acct_id=a.acct_id
order by b.acct_id;
quit;
Tom
Super User Tom
Super User

@cermank wrote:
...

I tried the following patch

data c;
merge b(in=inb) a;
by acct_id;
if inb;
run;

But the results are not what i expected. The final dataset has almost double the observations than the b dataset. I'm not sure why that might be happening. What am i doing wrong?


The output C will only have more observations than B when one or more values of ACCT_ID appear more than once in A.  If you want to keep only one observation per value of ACCT_ID then you can augment your subsetting IF statement to also test for the flag variable FIRST.ACCT_ID that will be generated because of the BY statement.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

SAS Enterprise Guide vs. SAS Studio

What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 557 views
  • 0 likes
  • 3 in conversation