m1986MM
Obsidian | Level 7

Hello everyone,

I tried to merge two datasets: 

     have1 with two common variables: var11 and var12

     have2 with two common variables: var21 and var22

So I used the following code:

proc sql;
  /* Full outer join: keep matched and unmatched rows from both tables */
  create table want as
  select L.*, R.*
  from have1 as L full outer join have2 as R
    on L.var11=R.var21 and L.var12=R.var22;
quit;

 

Then I tried to do the same thing in Stata. I renamed the common variables in both datasets to var1 and var2 and ran the following code:

use file1
merge m:m var1 var2 using file2

 

The two approaches give conflicting results: the first code produces 10M observations, but the second produces only 300K observations.

I read the manuals, but I couldn't figure out why these two methods give different results.

I would appreciate it if you could help.

 

1 ACCEPTED SOLUTION

FreelanceReinh
Jade | Level 19

According to the Stata documentation for the merge command (http://www.stata.com/manuals13/dmerge.pdf), "m:m specifies a many-to-many merge and is a bad idea."

 

What they describe is exactly what happens in a SAS data step if you merge two datasets by a key with repeats of BY values in both datasets (i.e. one-to-one within the BY group, as pointed out by @Astounding, and repeating the last observation of the shorter BY group, if any).

 

In contrast, when PROC SQL joins two tables, it conceptually creates a Cartesian product of the two tables (i.e. each row of the first combined with all rows of the second) as the internal "input to the rest of the query" (PROC SQL documentation, section "Joining Tables"); the ON condition then keeps only the combinations whose keys match. Please see @Astounding's example or the example "Full Outer Join" in the linked documentation.

 

So, if a BY group in table L has m observations and there is a matching BY group in table R with n observations, the Stata m:m merge of the two tables would result in max(m,n) observations for that BY group, whereas the full outer join of PROC SQL would produce m·n observations. Please note that m·n > max(m,n) if and only if m>1 and n>1. Moreover, non-matching BY groups would be copied to the output dataset both in the Stata merge and in the full outer join, so they cannot explain the difference. That is, the difference of 10M vs. 300K observations that you observed implies that your datasets L and R have matching "BY groups" with non-unique keys in both datasets (probably many of them).
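
The row-count argument above can be checked with a small Python sketch. It is only an illustration under stated assumptions: the function names and the toy key lists are made up for this example, and it merely counts rows per key group rather than calling SAS or Stata.

```python
from collections import Counter

def full_outer_join_rows(left_keys, right_keys):
    """Row count of a full outer join on the key: m*n per matching group."""
    left, right = Counter(left_keys), Counter(right_keys)
    total = 0
    for key in left.keys() | right.keys():
        m, n = left[key], right[key]
        # Matching group: every left row pairs with every right row.
        # Non-matching group: its rows are copied through unchanged.
        total += m * n if m and n else m + n
    return total

def sequential_merge_rows(left_keys, right_keys):
    """Row count of a one-to-one-within-group merge: max(m, n) per group."""
    left, right = Counter(left_keys), Counter(right_keys)
    return sum(max(left[k], right[k]) for k in left.keys() | right.keys())

# Hypothetical keys: "a" repeats in both tables, "b" is unique in both,
# "c" and "d" have no match on the other side.
left_keys = ["a", "a", "a", "b", "c"]
right_keys = ["a", "a", "b", "d"]

print(full_outer_join_rows(left_keys, right_keys))   # 3*2 + 1*1 + 1 + 1 = 9
print(sequential_merge_rows(left_keys, right_keys))  # max(3,2) + 1 + 1 + 1 = 6
```

Only the group "a", which repeats on both sides, contributes to the gap between the two counts.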

 

Here's what the Stata documentation linked above says about such merges:

  • "Because m:m merges are such a bad idea, we are not going to show you an example."
  • "First, if you think you need to perform an m:m merge, then we suspect you are wrong."


6 REPLIES 6
Astounding
PROC Star

The second combination using MERGE doesn't look like a real SAS program, but let's assume that you successfully completed a MERGE.  

 

The number of observations would be different if there are multiple observations in both data sets with the same values for VAR1 and VAR2.  SQL gives you all combinations, but MERGE matches observation by observation.  Let's take a simpler case where you are just joining by 1 variable.

 

Have1:

VAR1   tracker1
1      A
1      B
1      C

Have2:

VAR1   tracker2
1      X
1      Y
1      Z

 

If you were to MERGE these two data sets, you would get observations matched 1 by 1:

 

VAR1   tracker1   tracker2
1      A          X
1      B          Y
1      C          Z

 

You would also get a note in the log about more than one data set containing repeated values for the BY variable.
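
As a rough illustration of the one-to-one matching described above, here is a minimal Python sketch for a single BY group. The function name `merge_by_group` is hypothetical, the sketch assumes both groups are non-empty, and the rule that the shorter group repeats its last row follows the description given elsewhere in this thread rather than any SAS or Stata source code.

```python
def merge_by_group(left_rows, right_rows):
    """Pair the rows of one BY group by position.

    Assumes both lists are non-empty. When one side runs out of rows,
    its last row is reused for the remaining rows of the longer side.
    """
    paired = []
    for i in range(max(len(left_rows), len(right_rows))):
        left = left_rows[min(i, len(left_rows) - 1)]
        right = right_rows[min(i, len(right_rows) - 1)]
        paired.append((left, right))
    return paired

tracker1 = ["A", "B", "C"]
tracker2 = ["X", "Y", "Z"]

print(merge_by_group(tracker1, tracker2))
# [('A', 'X'), ('B', 'Y'), ('C', 'Z')]

# With a shorter right-hand group, its last row ("Y") is repeated.
print(merge_by_group(tracker1, ["X", "Y"]))
# [('A', 'X'), ('B', 'Y'), ('C', 'Y')]
```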

 

If you were to join with SQL, you would get all combinations:

 

VAR1   tracker1   tracker2
1      A          X
1      A          Y
1      A          Z
1      B          X
1      B          Y
1      B          Z
1      C          X
1      C          Y
1      C          Z
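
The "all combinations" result in the table above is a Cartesian product within the key group. A short Python sketch, using the same toy values and nothing specific to SAS, reproduces the nine rows:

```python
from itertools import product

tracker1 = ["A", "B", "C"]
tracker2 = ["X", "Y", "Z"]

# Every row with VAR1 = 1 on the left pairs with every such row on the right.
combos = list(product(tracker1, tracker2))

for t1, t2 in combos:
    print(1, t1, t2)

print(len(combos))  # 3 * 3 = 9
```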

 

That's just how the tools work.

 

Good luck.

m1986MM
Obsidian | Level 7

Thanks for your response, but as you said, it is not SAS code; it's Stata. I want to compare these two commands: the first one in SAS and the second one in Stata.

Astounding
PROC Star

Oops ... missed the part about Stata.

 

I'm not sure how Stata would combine the data sets, but SAS would definitely multiply the number of matches by giving you all combinations as illustrated in my example.  Depending on the characteristics of Stata, that might still be the difference between the two results.

Reeza
Super User

I don't know the answer, but I have an idea on how I'd find it. 

 

1. Generate two small sample data sets that represent your merge type, in this case many-to-many.
2. Determine what the expected output is.
3. Join using both Stata and SAS.
4. See which is correct.

With a smaller dataset it's more likely you can determine what type of join Stata is using.

If you want to search online, I would look for how Stata handles the many-to-many merge. The SAS SQL full outer join is pretty explicit and common, but what Stata is doing may differ. My two-minute search didn't yield anything obvious, but I'm not a Stata user at all.

m1986MM
Obsidian | Level 7

Thanks,

I also tried an example, and it confirmed your explanation. I'm documenting it here in case someone needs it one day.

Suppose we want to merge these two tables by ID and Weight:

Name   ID   Weight
Sara   11   110
Rose   11   110

ID   Weight   Height
11   110      6
11   110      5

 

The SAS (PROC SQL full outer join) output will be:

Name   ID   Weight   Height
Sara   11   110      6
Sara   11   110      5
Rose   11   110      6
Rose   11   110      5

The Stata (merge m:m) output will be:

Name   ID   Weight   Height
Sara   11   110      6
Rose   11   110      5
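
For anyone who wants to replay this example without SAS or Stata, here is a hedged Python sketch that groups both tables on the compound key (ID, Weight) and builds the two outputs. The helper `group_by_key` and the variable names are made up for this illustration; the position-wise pairing uses `zip`, which is enough here only because both groups have the same length.

```python
from collections import defaultdict
from itertools import product

names = [("Sara", 11, 110), ("Rose", 11, 110)]   # Name, ID, Weight
heights = [(11, 110, 6), (11, 110, 5)]           # ID, Weight, Height

def group_by_key(rows, key_of, value_of):
    """Collect the non-key value of each row under its (ID, Weight) key."""
    groups = defaultdict(list)
    for row in rows:
        groups[key_of(row)].append(value_of(row))
    return groups

left = group_by_key(names, lambda r: (r[1], r[2]), lambda r: r[0])
right = group_by_key(heights, lambda r: (r[0], r[1]), lambda r: r[2])

sql_like, stata_like = [], []
for key in left.keys() & right.keys():  # every key matches in this example
    id_, weight = key
    # Full outer join: all combinations within the matching key group.
    sql_like += [(n, id_, weight, h) for n, h in product(left[key], right[key])]
    # m:m-style merge: pair rows by position within the key group.
    stata_like += [(n, id_, weight, h) for n, h in zip(left[key], right[key])]

print(sql_like)    # 4 rows: Sara and Rose each paired with heights 6 and 5
print(stata_like)  # 2 rows: (Sara, 6) and (Rose, 5)
```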
