LarissaW
Obsidian | Level 7

I was trying to merge two datasets, and I ran PROC COMPARE to check them before merging. Below is the result from PROC COMPARE:

 

Screenshot 2023-04-25 202057.png

The code I used to merge the datasets is:

proc sort data=LC.FAM;
by pid;
run;

proc sort data=hepa;
by PID;
run;

data combined1;
    merge LC.FAM  hepa ;
    by PID;
run;

I got a new dataset with 50273 rows, which is more rows than hepa has (50032 obs). Does anyone know why this happens?

4 REPLIES
Quentin
Super User

This can happen if there are duplicate values of pid in the data, or if there are mismatches (e.g. values of pid in lc.fam that are not in hepa).

 

Running PROC COMPARE is an interesting idea.  You usually use PROC COMPARE to compare variables, but it will also tell you if you have duplicate values, or if you have mismatches.  What do you get if you run:

 

proc compare base=hepa compare=lc.fam ;
  id pid ; *use an ID statement here, not a BY statement;
run ;
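
As a complementary check, duplicate PID values can also be counted directly. This is only a sketch that assumes the dataset and variable names from the original post (hepa, lc.fam, pid); the column aliases dup_pids_in_hepa and dup_pids_in_fam are made up for illustration.

```sas
/* Count the PID values that occur more than once in each dataset. */
/* A result of 0 means PID is unique in that dataset.              */
proc sql;
  select count(*) as dup_pids_in_hepa
    from (select pid from hepa group by pid having count(*) > 1);
  select count(*) as dup_pids_in_fam
    from (select pid from lc.fam group by pid having count(*) > 1);
quit;
```

If both counts are 0, duplicates are ruled out and the extra rows would have to come from PID values that exist in only one of the two datasets.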
Tom
Super User

There is no way to know in advance how many observations the merge will generate unless you know the values of PID in both datasets.

 

If HEPA has 100 observations, each with a distinct value of PID, and FAM has 10 observations, each with a distinct value of PID, then merging them can result in anywhere from 100 observations (all of the values of PID in FAM already exist in HEPA) to 110 observations (none of the values of PID in FAM exist in HEPA).
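
The 100-versus-10 scenario can be tried out on toy data. The sketch below is illustrative only: hepa_toy, fam_toy, merged_toy and the source variable are made-up names, and the choice of 5 shared PIDs is an assumption, not something taken from the real data.

```sas
/* Toy data: hepa_toy has PIDs 1-100, fam_toy has PIDs 96-105, */
/* so 5 PIDs are shared and 5 exist only in fam_toy.           */
data hepa_toy;
  do pid = 1 to 100; output; end;
run;

data fam_toy;
  do pid = 96 to 105; output; end;
run;

/* IN= flags record which dataset contributed to each merged row. */
data merged_toy;
  merge fam_toy(in=in_fam) hepa_toy(in=in_hepa);
  by pid;
  length source $9;
  if in_fam and in_hepa then source = 'both';
  else if in_fam then source = 'fam only';
  else source = 'hepa only';
run;

/* Tabulate where the merged rows came from. */
proc freq data=merged_toy;
  tables source;
run;
```

Here merged_toy ends up with 105 observations: 95 from hepa_toy only, 5 matched, and 5 from fam_toy only. Running the same IN= tabulation on the real LC.FAM and hepa would show whether the 241 extra rows (50273 minus 50032) are PIDs found only in LC.FAM.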

 

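And if either dataset has multiple observations for the same PID, then even stranger things can happen: for each PID, the merge writes as many observations as the dataset with more observations for that PID has, pairing them row by row.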
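
A small made-up example shows what repeated BY values do in a MERGE. The datasets a and b and the variables x and y are hypothetical and exist only for this sketch.

```sas
/* PID 1 appears three times in a and twice in b. */
data a;
  input pid x $;
  datalines;
1 a1
1 a2
1 a3
2 a4
;
run;

data b;
  input pid y $;
  datalines;
1 b1
1 b2
2 b3
;
run;

/* Within a BY group, rows are paired in order. Once the shorter   */
/* dataset runs out, its last values are carried onto later rows.  */
data ab;
  merge a b;
  by pid;
run;

proc print data=ab;
run;
```

The result has 4 rows: (1, a1, b1), (1, a2, b2), (1, a3, b2) and (2, a4, b3). SAS also writes a note to the log that the MERGE statement has more than one dataset with repeats of BY values, which is worth searching for in the log of the real merge.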

A_Kh
Lapis Lazuli | Level 10

In addition to the comments above, if you need the result to follow the PIDs in HEPA, use a conditional merge or an SQL join. Provided LC.FAM has no more observations per PID than HEPA does, you then get the 50032 obs.

eg:

data combined1;
    merge LC.FAM (in=a)  hepa (in=b);
    by PID;
    if b;
run;
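
The SQL join mentioned above could look roughly like this. It is a sketch rather than a drop-in replacement: combined_sql and the aliases h and f are made-up names, and the join only returns exactly one row per hepa observation if LC.FAM has at most one row per PID.

```sas
/* Keep every row of hepa and attach matching LC.FAM columns.        */
/* PID values that exist only in LC.FAM are dropped.                 */
/* Because both tables contain pid (and possibly other shared        */
/* columns), selecting h.* and f.* makes SAS warn that the variable  */
/* already exists; the copy from hepa is the one that is kept.       */
proc sql;
  create table combined_sql as
  select h.*, f.*
  from hepa as h
  left join lc.fam as f
    on h.pid = f.pid;
quit;
```

Unlike the DATA step merge, a join multiplies rows when a PID is repeated on both sides, so it is worth checking for duplicates before relying on the row count.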

 
