- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Working on merging two data sets with two identical variables. Each data set contains 768 observations. After using a proc merge, I got 288 observations. Did I do anything wrong?
how many observations should ideally be in the final/merged data set?
-Sam
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Anywhere from 0 to significantly more than 768, if you have many to many matches.
The answer depends on your business context and what you expect. Did you expect a one-to-one merge, with all records in each table?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thanks Reeza for the speedy response. I was expecting a one to one merge. However, after altering the code a bit, I did manage to have a dataset with 768 obs after the merge. However, I still don't understand what happened the first time, where there was 288 obs in after the merge. At the very least I would have expected 2x 768. What would have made SAS drop 480 obs?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
A normal SAS MERGE statement you cannot end up with fewer than the number in the larger of the two sets.
Total will N1 + N2 - (number that are in both).
So if I merge (1,2,3,4) with (4,5,6,7) I will end up with 7 observations. 7=4+4-1 .
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi Tom. thanks for speedy response.Your explanation was very helpful.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You posted code down thread where you adding an IF in addition to the MERGE statement. So your code is only looking only for the INTERSECTION of the two files. So the number of records for an intersection of two tables could be as small as zero or as large as the number of records the larger of the two files. And anywhere in between.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi Tom,
Thanks for pointing the fact that the IF statement points at an intersection of the two data sets...
So I guess in my code, SAS only looked for where the two variables were aligned in both the data sets, and output that number of obs. Hence the 288 obs.
-Sam
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
if you show code, we might be able to explain
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thank you for your help
Below is the code:
tab1 and tab2 has 768
*creating srno using automatic _N_ variable to perform merge;
DATA tab1;
SET raw.tab1;
srno=_N_;
RUN;
*sort tab1 dataset by var1 and srno for merge;
PROC SORT DATA=tab1 OUT=sr_tab1;
BY var1 srno;
RUN;
*creating varible srno using automatic _N_ to create a common variable;
DATA tab2;
SET raw.tab2;
srno=_N_;
RUN;
*sort tab2 dataset by var1 and srno for merge;
PROC SORT DATA=tab2 OUT=sr_tab2;
BY var1 srno;
RUN;
*merging datasets across var1 and srno variables;
DATA tabf;
MERGE sr_tab1 (IN=a)
sr_tab2 (IN=b)
;
BY var1 srno;
IF a AND b;
RUN;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Why did you add the record counter? Are you doing a many to many merge? SAS data step is not well suited for that. For each level of your id variable (VAR1) you will get the number of observations equal to the larger of those that are in either source data set. The records will be matched one to one and when one of the datasets runs out of records for that level of the id variable the values for the other variables from the dataset will be retained (replicated) onto each extra row. So if merge (1,1),(1,2) with (1,A),(1,B),(1,C) you will get (1,1,A),(1,2,B),(1,2,C). Looks like from your code you want instead to have that third observation be (1,.,C). That can be done without the extra recno variables by adding a OUTPUT and CALL MISSING() statements.
Let's assume that TAB1 has extra variable X and TAB2 has extra variable Z.
DATA tabf;
MERGE tab1 (IN=a)
tab2 (IN=b)
;
BY var1 ;
IF a AND b;
OUTPUT;
call missing(X,Z);
RUN;
Normally what people want when doing many-to-many joins is a full outer join that you can get from SQL.
proc sql ;
create table tabf as select * from tab1 full join tab2 on tab1.var1 = tab2.var1 ;
quit;
So for my example about with 2 records from TAB1 and 3 records from TAB2 this would yield 6 records (2*3=6).
(1,1,A), (1,1,B), (1,1,C), (1,2,A),(1,2,B),(1,2,C)
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
DATA DSN3;
MERGE DSN2(IN=A) DSN2(IN=B);
BY VAR;
IF A AND B;
RUN;
ARE YOU SURE YOU ARE USING THE 'MERGE' STATEMENT AND NOT THE 'SET' STATEMENT?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi Statdork,
Thanks for the response. I have posted the code in a lower post for review.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
There is no proc (that I'm aware of) called proc merge
Show the code you ran and some sample data that produces the observed discrepancy.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Your are absolutely correct Authur, there is no proc merge, that was an error, I was using the merge in a datastep. Thanks for the correction.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I don't think that anyone can answer your question about what initially went wrong without seeing both the code you originally submitted and the data.
Since you got the expected results running the code that you did run, it sounds like you only needed to interleave the two datasets. e.g.:
data tab1;
input var x;
cards;
1 1
2 2
3 3
4 4
5 5
;
data tab2;
input var y;
cards;
1 21
2 22
3 23
4 24
5 25
;
data want;
set tab1;
set tab2;
run;