03-02-2014 05:54 PM
Working on merging two data sets with two identical variables. Each data set contains 768 observations. After using a proc merge, I got 288 observations. Did I do anything wrong?
how many observations should ideally be in the final/merged data set?
03-02-2014 06:17 PM
Anywhere from 0 to significantly more than 768, if you have many to many matches.
The answer depends on your business context and what you expect. Did you expect a one-to-one merge, with all records in each table?
03-03-2014 02:41 PM
Thanks Reeza for the speedy response. I was expecting a one to one merge. However, after altering the code a bit, I did manage to have a dataset with 768 obs after the merge. However, I still don't understand what happened the first time, where there was 288 obs in after the merge. At the very least I would have expected 2x 768. What would have made SAS drop 480 obs?
03-02-2014 06:45 PM
A normal SAS MERGE statement you cannot end up with fewer than the number in the larger of the two sets.
Total will N1 + N2 - (number that are in both).
So if I merge (1,2,3,4) with (4,5,6,7) I will end up with 7 observations. 7=4+4-1 .
03-03-2014 07:06 PM
You posted code down thread where you adding an IF in addition to the MERGE statement. So your code is only looking only for the INTERSECTION of the two files. So the number of records for an intersection of two tables could be as small as zero or as large as the number of records the larger of the two files. And anywhere in between.
03-03-2014 07:13 PM
Thanks for pointing the fact that the IF statement points at an intersection of the two data sets...
So I guess in my code, SAS only looked for where the two variables were aligned in both the data sets, and output that number of obs. Hence the 288 obs.
03-03-2014 02:36 PM
Thank you for your help
Below is the code:
tab1 and tab2 has 768
*creating srno using automatic _N_ variable to perform merge;
*sort tab1 dataset by var1 and srno for merge;
PROC SORT DATA=tab1 OUT=sr_tab1;
BY var1 srno;
*creating varible srno using automatic _N_ to create a common variable;
*sort tab2 dataset by var1 and srno for merge;
PROC SORT DATA=tab2 OUT=sr_tab2;
BY var1 srno;
*merging datasets across var1 and srno variables;
MERGE sr_tab1 (IN=a)
BY var1 srno;
IF a AND b;
03-03-2014 07:16 PM
Why did you add the record counter? Are you doing a many to many merge? SAS data step is not well suited for that. For each level of your id variable (VAR1) you will get the number of observations equal to the larger of those that are in either source data set. The records will be matched one to one and when one of the datasets runs out of records for that level of the id variable the values for the other variables from the dataset will be retained (replicated) onto each extra row. So if merge (1,1),(1,2) with (1,A),(1,B),(1,C) you will get (1,1,A),(1,2,B),(1,2,C). Looks like from your code you want instead to have that third observation be (1,.,C). That can be done without the extra recno variables by adding a OUTPUT and CALL MISSING() statements.
Let's assume that TAB1 has extra variable X and TAB2 has extra variable Z.
MERGE tab1 (IN=a)
BY var1 ;
IF a AND b;
Normally what people want when doing many-to-many joins is a full outer join that you can get from SQL.
proc sql ;
create table tabf as select * from tab1 full join tab2 on tab1.var1 = tab2.var1 ;
So for my example about with 2 records from TAB1 and 3 records from TAB2 this would yield 6 records (2*3=6).
(1,1,A), (1,1,B), (1,1,C), (1,2,A),(1,2,B),(1,2,C)
03-03-2014 03:58 PM
I don't think that anyone can answer your question about what initially went wrong without seeing both the code you originally submitted and the data.
Since you got the expected results running the code that you did run, it sounds like you only needed to interleave the two datasets. e.g.:
input var x;
input var y;