Difference between merging by Proc sql in SAS and merge m:m in Stata

Contributor
Posts: 65

Hello everyone,

I tried to merge two datasets on two key variables each:

     have1, with key variables var11 and var12

     have2, with key variables var21 and var22

I used the following code:

proc sql;
  create table want as
  select L.*, R.*
  from have1 as L full outer join have2 as R
    on L.var11 = R.var21 and L.var12 = R.var22;
quit;

 

Then I tried to do the same thing in Stata. I renamed the key variables in both datasets to var1 and var2 and ran the following code:

use file1
merge m:m var1 var2 using file2

 

The two approaches give conflicting results. The first code produces 10M observations, but the output of the second has only 300K observations.

I read the manuals, but I couldn't figure out why these two methods give different results.

I would appreciate any help.

 


Accepted Solutions
Solution
‎01-12-2016 12:30 PM
Trusted Advisor
Posts: 1,115

Re: Difference between merging by Proc sql in SAS and merge m:m in Stata

According to the Stata documentation for the merge command (http://www.stata.com/manuals13/dmerge.pdf), "m:m specifies a many-to-many merge and is a bad idea."

 

What they describe is exactly what happens in a SAS data step if you merge two datasets by a key with repeats of BY values in both datasets (i.e. one-to-one within the BY group, as pointed out by @Astounding, and repeating the last observation of the shorter BY group, if any).
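
The data step behavior described above can be sketched as follows. This is a minimal illustration, not the original poster's data: the data set names (left3, right2, merged) and variables (id, item, code) are made up for the example, and the BY group deliberately has 3 observations on one side and 2 on the other.

```sas
/* Hypothetical data: the same key (id=1) appears 3 times here ... */
data left3;
  input id item $;
  datalines;
1 A
1 B
1 C
;
run;

/* ... and 2 times here. */
data right2;
  input id code $;
  datalines;
1 X
1 Y
;
run;

/* Match-merge: observations are paired in order within the BY group.
   When right2 runs out, its last values (code=Y) are retained,
   so the result has max(3,2) = 3 rows: A/X, B/Y, C/Y.
   The log also notes that more than one data set has repeats of BY values. */
data merged;
  merge left3 right2;
  by id;
run;

proc print data=merged noobs;
run;
```

Both inputs must be sorted by the BY variable; the sample data already are.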

 

In contrast, when PROC SQL joins two tables, it conceptually creates a Cartesian product of the two tables (i.e. each row of the first combined with all rows of the second) as the internal "input to the rest of the query" (PROC SQL documentation, section "Joining Tables"); the ON condition then selects the matching combinations. Please see @Astounding's example or the example "Full Outer Join" in the linked documentation.

 

So, if a BY group in table L has m observations and there is a matching BY group in table R with n observations, the Stata merge of the two tables would result in max(m,n) observations for that BY group, whereas the full outer join of PROC SQL would produce m·n observations. Please note that m·n > max(m,n) if and only if m>1 and n>1. Moreover, non-matching BY groups would be copied to the output dataset both in the Stata merge and in the full outer join. Therefore, the difference of 10M vs. 300K observations that you observed implies that your datasets L and R have matching "BY groups" with non-unique keys (probably many of them).
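
One way to check this arithmetic on the actual data is to count the observations per key in each table and add up m·n versus max(m,n). The following is only a sketch under assumptions: it uses the table and key names from the question (have1 with var11/var12, have2 with var21/var22) and assumes the corresponding keys have the same type in both tables. The names key_counts, key1, key2, m, n, rows_sql_full_join and rows_match_merge are invented for the illustration.

```sas
proc sql;
  /* One row per key combination, with the number of
     observations in have1 (m) and in have2 (n). */
  create table key_counts as
  select coalesce(a.var11, b.var21) as key1,
         coalesce(a.var12, b.var22) as key2,
         coalesce(a.m, 0) as m,
         coalesce(b.n, 0) as n
  from (select var11, var12, count(*) as m
        from have1 group by var11, var12) as a
       full join
       (select var21, var22, count(*) as n
        from have2 group by var21, var22) as b
    on a.var11 = b.var21 and a.var12 = b.var22;

  /* Matching keys contribute m*n rows to the full outer join;
     non-matching keys are copied as they are (m+n, one of them is 0).
     A match-merge contributes max(m,n) rows per key. */
  select sum(case when m > 0 and n > 0 then m*n else m+n end)
           as rows_sql_full_join,
         sum(max(m, n)) as rows_match_merge
  from key_counts;
quit;
```

If the explanation above applies, the two totals should be close to the 10M and 300K figures, and sorting key_counts by descending m*n shows which keys cause most of the blow-up.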

 

Here's what the Stata documentation linked above says about such merges:

  • "Because m:m merges are such a bad idea, we are not going to show you an example."
  • "First, if you think you need to perform an m:m merge, then we suspect you are wrong."



All Replies
Super User
Posts: 5,099

Re: Difference between merging by Proc sql in SAS and merge m:m in Stata

The second combination using MERGE doesn't look like a real SAS program, but let's assume that you successfully completed a MERGE.  

 

The number of observations would be different if there are multiple observations in both data sets with the same values for VAR1 and VAR2.  SQL gives you all combinations, but MERGE matches observation by observation.  Let's take a simpler case where you are just joining by 1 variable.

 

Have1:

VAR1     tracker1
1        A
1        B
1        C

Have2:

VAR1     tracker2
1        X
1        Y
1        Z

 

If you were to MERGE these two data sets, you would get observations matched 1 by 1:

 

VAR1     tracker1     tracker2
1        A            X
1        B            Y
1        C            Z

 

You would also get a note in the log about more than one data set containing repeated values for the BY variable.

 

If you were to join with SQL, you would get all combinations:

 

VAR1     tracker1     tracker2
1        A            X
1        A            Y
1        A            Z
1        B            X
1        B            Y
1        B            Z
1        C            X
1        C            Y
1        C            Z
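
Both result tables above can be reproduced with a short program along these lines. It is a sketch that uses the Have1 and Have2 data from this post; the output data set names merged and joined are made up for the example.

```sas
data have1;
  input var1 tracker1 $;
  datalines;
1 A
1 B
1 C
;
run;

data have2;
  input var1 tracker2 $;
  datalines;
1 X
1 Y
1 Z
;
run;

/* Match-merge: observations are paired in order within the BY group,
   giving 3 rows (A/X, B/Y, C/Z). */
data merged;
  merge have1 have2;
  by var1;
run;

/* SQL join: every combination within the matching key,
   giving 3 x 3 = 9 rows (row order is not guaranteed). */
proc sql;
  create table joined as
  select coalesce(L.var1, R.var1) as var1,
         L.tracker1,
         R.tracker2
  from have1 as L full join have2 as R
    on L.var1 = R.var1;
quit;
```

The key column is selected once through COALESCE rather than with L.*, R.*, because both tables contain a variable named var1.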

 

That's just how the tools work.

 

Good luck.

Contributor
Posts: 65

Re: Difference between merging by Proc sql in SAS and merge m:m in Stata

Thanks for your response. As you noted, the second program is not SAS code; it is Stata. I want to compare these two commands: the first one in SAS and the second one in Stata.

Super User
Posts: 5,099

Re: Difference between merging by Proc sql in SAS and merge m:m in Stata

Oops ... missed the part about Stata.

 

I'm not sure how Stata would combine the data sets, but SAS would definitely multiply the number of matches by giving you all combinations as illustrated in my example.  Depending on the characteristics of Stata, that might still be the difference between the two results.

Super User
Posts: 17,963

Re: Difference between merging by Proc sql in SAS and merge m:m in Stata

I don't know the answer, but I have an idea on how I'd find it. 

 

  • Generate two small sample data sets that represent your merge type, in this case many-to-many.
  • Determine what the expected output is.
  • Join them using both Stata and SAS.
  • See which is correct.

With smaller data sets, it's more likely that you can determine what type of join Stata is using.

 

If you are searching online, I would look for how Stata handles the many-to-many merge. The SAS SQL full outer join is pretty explicit and common, but what Stata is doing may differ. My two-minute search didn't yield anything obvious, but I'm not a Stata user at all.

Contributor
Posts: 65

Re: Difference between merging by Proc sql in SAS and merge m:m in Stata

Thanks,

I also tried an example, and it clearly confirmed your explanation. I'm documenting it here in case someone needs it one day.

If we want to merge these two tables by ID and Weight:

Name     ID     Weight
Sara     11     110
Rose     11     110

ID     Weight     Height
11     110        6
11     110        5

 

The SAS (PROC SQL full outer join) output will be:

Name     ID     Weight     Height
Sara     11     110        6
Sara     11     110        5
Rose     11     110        6
Rose     11     110        5

 

The Stata (merge m:m) output will be:

Name     ID     Weight     Height
Sara     11     110        6
Rose     11     110        5
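
For anyone who wants to rerun this example in SAS, here is a minimal sketch. The data set names (people, heights, sql_join, step_merge) are made up for the illustration. The data step MERGE stands in for Stata's merge m:m, on the assumption, taken from the accepted solution, that the two behave the same way.

```sas
data people;
  input Name $ ID Weight;
  datalines;
Sara 11 110
Rose 11 110
;
run;

data heights;
  input ID Weight Height;
  datalines;
11 110 6
11 110 5
;
run;

/* Full outer join on both keys: 2 x 2 = 4 rows. */
proc sql;
  create table sql_join as
  select L.Name,
         coalesce(L.ID, R.ID) as ID,
         coalesce(L.Weight, R.Weight) as Weight,
         R.Height
  from people as L full join heights as R
    on L.ID = R.ID and L.Weight = R.Weight;
quit;

/* Match-merge on both keys: observations are paired in order,
   giving 2 rows (Sara/6, Rose/5), like the Stata output above. */
data step_merge;
  merge people heights;
  by ID Weight;
run;
```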
