SAS Procedures

samchat · Posted 03-02-2014 05:54 PM

Working on merging two data sets with two identical variables. Each data set contains 768 observations. After using a proc merge, I got 288 observations. Did I do anything wrong?

how many observations should ideally be in the final/merged data set?

-Sam

Reeza · Posted 03-02-2014 06:17 PM

Anywhere from 0 to significantly more than 768, if you have many to many matches.

The answer depends on your business context and what you expect. Did you expect a one-to-one merge, with all records in each table?

samchat · Posted 03-03-2014 02:41 PM

Thanks Reeza for the speedy response. I was expecting a one to one merge. However, after altering the code a bit, I did manage to have a dataset with 768 obs after the merge. However, I still don't understand what happened the first time, where there was 288 obs in after the merge. At the very least I would have expected 2x 768. What would have made SAS drop 480 obs?

Tom · Posted 03-02-2014 06:45 PM

A normal SAS MERGE statement you cannot end up with fewer than the number in the larger of the two sets.

Total will N1 + N2 - (number that are in both).

So if I merge (1,2,3,4) with (4,5,6,7) I will end up with 7 observations. 7=4+4-1 .

samchat · Posted 03-03-2014 02:43 PM

Hi Tom. thanks for speedy response.Your explanation was very helpful.

Tom · Posted 03-03-2014 07:06 PM

You posted code down thread where you adding an IF in addition to the MERGE statement. So your code is only looking only for the INTERSECTION of the two files. So the number of records for an intersection of two tables could be as small as zero or as large as the number of records the larger of the two files. And anywhere in between.

samchat · Posted 03-03-2014 07:13 PM

Hi Tom,

Thanks for pointing the fact that the IF statement points at an intersection of the two data sets...

So I guess in my code, SAS only looked for where the two variables were aligned in both the data sets, and output that number of obs. Hence the 288 obs.

-Sam

Peter_C · Posted 03-03-2014 01:15 PM

if you show code, we might be able to explain

samchat · Posted 03-03-2014 02:36 PM

Thank you for your help

Below is the code:

tab1 and tab2 has 768

*creating srno using automatic _N_ variable to perform merge;

DATA tab1;

SET raw.tab1;

srno=_N_;

RUN;

*sort tab1 dataset by var1 and srno for merge;

PROC SORT DATA=tab1 OUT=sr_tab1;

BY var1 srno;

RUN;

*creating varible srno using automatic _N_ to create a common variable;

DATA tab2;

SET raw.tab2;

srno=_N_;

RUN;

*sort tab2 dataset by var1 and srno for merge;

PROC SORT DATA=tab2 OUT=sr_tab2;

BY var1 srno;

RUN;

*merging datasets across var1 and srno variables;

DATA tabf;

MERGE sr_tab1 (IN=a)

sr_tab2 (IN=b)

;

BY var1 srno;

IF a AND b;

RUN;

Tom · Posted 03-03-2014 07:16 PM

Why did you add the record counter? Are you doing a many to many merge? SAS data step is not well suited for that. For each level of your id variable (VAR1) you will get the number of observations equal to the larger of those that are in either source data set. The records will be matched one to one and when one of the datasets runs out of records for that level of the id variable the values for the other variables from the dataset will be retained (replicated) onto each extra row. So if merge (1,1),(1,2) with (1,A),(1,B),(1,C) you will get (1,1,A),(1,2,B),(1,2,C). Looks like from your code you want instead to have that third observation be (1,.,C). That can be done without the extra recno variables by adding a OUTPUT and CALL MISSING() statements.

Let's assume that TAB1 has extra variable X and TAB2 has extra variable Z.

DATA tabf;

MERGE tab1 (IN=a)

tab2 (IN=b)

;

BY var1 ;

IF a AND b;

OUTPUT;

call missing(X,Z);

RUN;

Normally what people want when doing many-to-many joins is a full outer join that you can get from SQL.

proc sql ;

create table tabf as select * from tab1 full join tab2 on tab1.var1 = tab2.var1 ;

quit;

So for my example about with 2 records from TAB1 and 3 records from TAB2 this would yield 6 records (2*3=6).

(1,1,A), (1,1,B), (1,1,C), (1,2,A),(1,2,B),(1,2,C)

Statdork · Posted 03-03-2014 01:36 PM

DATA DSN3;

MERGE DSN2(IN=A) DSN2(IN=B);

BY VAR;

IF A AND B;

RUN;

ARE YOU SURE YOU ARE USING THE 'MERGE' STATEMENT AND NOT THE 'SET' STATEMENT?

samchat · Posted 03-03-2014 02:44 PM

Hi Statdork,

Thanks for the response. I have posted the code in a lower post for review.

art297 · Posted 03-03-2014 02:12 PM

There is no proc (that I'm aware of) called proc merge

Show the code you ran and some sample data that produces the observed discrepancy.

samchat · Posted 03-03-2014 02:37 PM

Your are absolutely correct Authur, there is no proc merge, that was an error, I was using the merge in a datastep. Thanks for the correction.

art297 · Posted 03-03-2014 03:58 PM

I don't think that anyone can answer your question about what initially went wrong without seeing both the code you originally submitted and the data.

Since you got the expected results running the code that you did run, it sounds like you only needed to interleave the two datasets. e.g.:

data tab1;

input var x;

cards;

1 1

2 2

3 3

4 4

5 5

;

data tab2;

input var y;

cards;

1 21

2 22

3 23

4 24

5 25

;

data want;

set tab1;

set tab2;

run;

SAS Procedures

Merging two datasets

Re: Merging two datasets

Re: Merging two datasets

Re: Merging two datasets

Re: Merging two datasets

Re: Merging two datasets

Re: Merging two datasets

Re: Merging two datasets

Re: Merging two datasets

Re: Merging two datasets

Re: Merging two datasets

Re: Merging two datasets

Re: Merging two datasets

Re: Merging two datasets

Re: Merging two datasets

Join vs Merge 1.2 TB with 110 GB Datasets

Merging By Variable

set and merge

Help Merging Datasets

Merging 3 Datasets

Follow Us

What is...

SAS Procedures

Our biggest data and AI event of the year.

SAS Training: Just a Click Away

Follow Us

What is...