Hello everyone,
I tried to merge two datasets:
have1 with two common variables: var11 and var12
have2 with two common variables: var21 and var22
So I used the following code:
proc sql;
  create table want as
  select L.*, R.*
  from have1 as L full outer join have2 as R
  on L.var11=R.var21 and L.var12=R.var22;
quit;
Then I tried to do the same thing in Stata. I renamed the common variables in both datasets to var1 and var2 and ran the following code:
use file1
merge m:m var1 var2 using file2
The two approaches give different results. The first code produces 10M observations, but the second produces only 300K.
I read the manuals but couldn't figure out why the two methods give different results.
I would appreciate any help.
Accepted Solutions
According to the Stata documentation of the merge command (http://www.stata.com/manuals13/dmerge.pdf), "m:m specifies a many-to-many merge and is a bad idea."
What they describe is exactly what happens in a SAS data step if you merge two datasets by a key with repeats of BY values in both datasets (i.e. one-to-one within the BY group, as pointed out by @Astounding, and repeating the last observation of the shorter BY group, if any).
In contrast, when PROC SQL joins two tables, it conceptually creates a Cartesian product of the two tables (i.e. each row of the first combined with all rows of the second) as the internal "input to the rest of the query" (PROC SQL documentation, section "Joining Tables"); the ON condition then keeps only the matching combinations. Please see @Astounding's example or the example "Full Outer Join" in the linked documentation.
So, if a BY group in table L has m observations and there is a matching BY group in table R with n observations, the Stata merge of the two tables would result in max(m,n) observations for that BY group, whereas the full outer join of PROC SQL would produce m·n observations. Please note that m·n > max(m,n) if and only if m>1 and n>1. Moreover, non-matching BY groups would be copied to the output dataset both in the Stata merge and in the full outer join. That is, the difference of 10M vs. 300K observations that you observed implies that your datasets L and R have matching "BY groups" with non-unique keys (probably many of them).
Here's what the Stata documentation linked above says about such merges:
- "Because m:m merges are such a bad idea, we are not going to show you an example."
- "First, if you think you need to perform an m:m merge, then we suspect you are wrong."
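The row-count argument above (m·n versus max(m,n) per BY group) can be checked with a small tool-neutral sketch. It is written in Python rather than SAS or Stata, the function name row_counts and the sample keys are hypothetical, and it only counts rows; it does not reproduce either tool's actual implementation.

```python
from collections import Counter

def row_counts(left_keys, right_keys):
    """Return (full_outer_join_rows, sequential_merge_rows) for two key lists."""
    left = Counter(left_keys)    # m per key in the left table
    right = Counter(right_keys)  # n per key in the right table
    join_rows = 0
    merge_rows = 0
    for key in set(left) | set(right):
        m, n = left[key], right[key]  # Counter returns 0 for absent keys
        if m and n:
            join_rows += m * n        # matching group: all combinations
        else:
            join_rows += m + n        # non-matching group: rows copied as-is
        merge_rows += max(m, n)       # one-to-one pairing within the group
    return join_rows, merge_rows

# Hypothetical keys: "a" occurs 3 times on each side, "b" and "c" are unmatched.
print(row_counts(["a", "a", "a", "b"], ["a", "a", "a", "c"]))  # (11, 5)
```

Running the same function on the key columns of the real datasets would show whether repeated keys on both sides are enough to explain a gap like 10M versus 300K.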
The second combination using MERGE doesn't look like a real SAS program, but let's assume that you successfully completed a MERGE.
The number of observations would be different if there are multiple observations in both data sets with the same values for VAR1 and VAR2. SQL gives you all combinations, but MERGE matches observation by observation. Let's take a simpler case where you are just joining by 1 variable.
Have1:
VAR1 tracker1
1 A
1 B
1 C
Have2:
VAR1 tracker2
1 X
1 Y
1 Z
If you were to MERGE these two data sets, you would get observations matched 1 by 1:
VAR1 tracker1 tracker2
1 A X
1 B Y
1 C Z
You would also get a note in the log about more than one data set containing repeated values for the BY variable.
If you were to join with SQL, you would get all combinations:
VAR1 tracker1 tracker2
1 A X
1 A Y
1 A Z
1 B X
1 B Y
1 B Z
1 C X
1 C Y
1 C Z
That's just how the tools work.
Good luck.
Thanks for your response, but as you said, the second piece of code is not SAS; it's Stata. I want to compare these two commands: the first one in SAS and the second one in Stata.
Oops ... missed the part about Stata.
I'm not sure how Stata would combine the data sets, but the SQL join in SAS would definitely multiply the number of matches by giving you all combinations, as illustrated in my example. Depending on the characteristics of Stata, that might still be the difference between the two results.
I don't know the answer, but I have an idea on how I'd find it.
Generate two small sample data sets that represent your merge type, in this case many-to-many.
Determine what the expected output is.
Join them using both Stata and SAS.
See which result matches the expected output.
With a smaller dataset, it's more likely you can determine what type of join Stata is using.
If you were trying to google it, I would look for how Stata handles the many-to-many merge. The SAS SQL full outer join is pretty explicit and common, but what Stata is doing may differ. My two-minute search didn't yield anything obvious, but I'm not a Stata user at all.
Thanks,
I also worked through an example, and it clearly confirmed your explanation. I'm documenting it here in case someone needs it one day.
If we want to merge these two tables by ID and Weight:
Name ID Weight
Sara 11 110
Rose 11 110
ID Weight Height
11 110 6
11 110 5
The SAS output (PROC SQL full outer join) will be:
Name ID Weight Height
Sara 11 110 6
Sara 11 110 5
Rose 11 110 6
Rose 11 110 5
The Stata output (merge m:m) will be:
Name ID Weight Height
Sara 11 110 6
Rose 11 110 5
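For anyone who wants to replay this example without either tool, here is a minimal Python sketch of the two behaviours described in this thread. It is an illustration under stated assumptions, not how SAS or Stata are implemented: the function names are hypothetical, rows are plain dictionaries, an unmatched row simply lacks the other table's fields, and the output is ordered by key only to make it reproducible (SQL itself guarantees no row order).

```python
from collections import defaultdict

def group_by_key(rows, keys):
    """Group rows (dicts) by the tuple of their key values, keeping input order."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row)
    return groups

def sql_full_join(left, right, keys):
    """All combinations within each matching key group (m*n rows per group)."""
    lg, rg = group_by_key(left, keys), group_by_key(right, keys)
    out = []
    for k in sorted(set(lg) | set(rg)):
        ls = lg.get(k) or [{}]  # an unmatched side contributes one empty row
        rs = rg.get(k) or [{}]
        out.extend({**l, **r} for l in ls for r in rs)
    return out

def sequential_merge(left, right, keys):
    """Pair rows in order within each key group (max(m, n) rows per group).

    The shorter side repeats its last row, as described for merge m:m above.
    """
    lg, rg = group_by_key(left, keys), group_by_key(right, keys)
    out = []
    for k in sorted(set(lg) | set(rg)):
        ls, rs = lg.get(k, []), rg.get(k, [])
        for i in range(max(len(ls), len(rs))):
            l = ls[min(i, len(ls) - 1)] if ls else {}
            r = rs[min(i, len(rs) - 1)] if rs else {}
            out.append({**l, **r})
    return out

names = [{"Name": "Sara", "ID": 11, "Weight": 110},
         {"Name": "Rose", "ID": 11, "Weight": 110}]
heights = [{"ID": 11, "Weight": 110, "Height": 6},
           {"ID": 11, "Weight": 110, "Height": 5}]

joined = sql_full_join(names, heights, ["ID", "Weight"])
merged = sequential_merge(names, heights, ["ID", "Weight"])
print([(r["Name"], r["Height"]) for r in joined])
# [('Sara', 6), ('Sara', 5), ('Rose', 6), ('Rose', 5)]
print([(r["Name"], r["Height"]) for r in merged])
# [('Sara', 6), ('Rose', 5)]
```

The four-row and two-row results match the two output tables above.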