Help using Base SAS procedures

Merging two datasets

Reply
Occasional Contributor
Posts: 14

Merging two datasets

Working on merging two data sets with two identical variables. Each data set contains 768 observations. After using a proc merge, I got 288 observations. Did I do anything wrong?

how many observations should ideally be in the final/merged data set?

-Sam

Super User
Posts: 17,734

Re: Merging two datasets

Anywhere from 0 to significantly more than 768, if you have many to many matches. 

The answer depends on your business context and what you expect. Did you expect a one-to-one merge, with all records in each table?

Occasional Contributor
Posts: 14

Re: Merging two datasets

Thanks Reeza for the speedy response. I was expecting a one to one merge. However, after altering the code a bit, I did manage to have a dataset with 768 obs after the merge. However, I still don't understand what happened the first time, where there was 288 obs in after the merge. At the very least I would have expected 2x 768. What would have made SAS drop 480 obs?

Super User
Super User
Posts: 6,495

Re: Merging two datasets

A normal SAS MERGE statement you cannot end up with fewer than the number in the larger of the two sets.

Total will N1 + N2 - (number that are in both).

So if I merge (1,2,3,4) with (4,5,6,7) I will end up with 7 observations. 7=4+4-1 .

Occasional Contributor
Posts: 14

Re: Merging two datasets

Hi Tom. thanks for speedy response.Your explanation was very helpful.

Super User
Super User
Posts: 6,495

Re: Merging two datasets

You posted code down thread where you adding an IF in addition to the MERGE statement.  So your code is only looking only for the INTERSECTION of the two files. So the number of records for an intersection of two tables could be as small as zero or as large as the number of records the larger of the two files. And anywhere in between.

Occasional Contributor
Posts: 14

Re: Merging two datasets

Hi Tom,

Thanks for pointing the fact that the IF statement points at an intersection of the two data sets...

So I guess in my code, SAS only looked for where the two variables were aligned in both the data sets, and output that number of obs. Hence the 288 obs.

-Sam

Valued Guide
Posts: 2,174

Re: Merging two datasets

if you show code, we might be able to explain

Occasional Contributor
Posts: 14

Re: Merging two datasets

Thank you for your help

Below is the code:

tab1 and tab2 has 768

*creating srno using automatic _N_ variable to perform merge;

DATA tab1;

    SET raw.tab1;

    srno=_N_;

RUN;

*sort tab1 dataset by var1 and srno for merge;

PROC SORT DATA=tab1  OUT=sr_tab1;

    BY var1 srno;

RUN;

  *creating varible srno using automatic _N_  to create a common variable;

DATA tab2;

    SET raw.tab2;

    srno=_N_;

RUN;

      *sort tab2 dataset by var1 and srno for merge;

PROC SORT DATA=tab2 OUT=sr_tab2;

    BY var1 srno;

RUN;

      *merging datasets across var1 and srno variables;

DATA tabf;

    MERGE sr_tab1 (IN=a)

                 sr_tab2 (IN=b)

          ;

    BY var1 srno;

    IF a AND b;

RUN;

Super User
Super User
Posts: 6,495

Re: Merging two datasets

Why did you add the record counter?  Are you doing a many to many merge?  SAS data step is not well suited for that.  For each level of your id variable (VAR1)  you will get the number of observations equal to the larger of those that are in either source data set.  The records will be matched one to one and when one of the datasets runs out of records for that level of the id variable the values for the other variables from the dataset will be retained (replicated) onto each extra row.  So if merge (1,1),(1,2) with (1,A),(1,B),(1,C) you will get (1,1,A),(1,2,B),(1,2,C).  Looks like from your code you want instead to have that third observation be (1,.,C).  That can be done without the extra recno variables by adding a OUTPUT and CALL MISSING() statements.

Let's assume that TAB1 has extra variable X and TAB2 has extra variable Z.

DATA tabf;

    MERGE tab1 (IN=a)

                 tab2 (IN=b)

          ;

    BY var1 ;

    IF a AND b;

   OUTPUT;

    call missing(X,Z);

RUN;

Normally what people want when doing many-to-many joins is a full outer join that you can get from SQL.

proc sql ;

  create table tabf as select * from tab1 full join tab2 on tab1.var1 = tab2.var1 ;

quit;

So for my example about with 2 records from TAB1 and 3 records from TAB2 this would yield 6 records (2*3=6). 

(1,1,A), (1,1,B), (1,1,C), (1,2,A),(1,2,B),(1,2,C)

Occasional Contributor
Posts: 10

Re: Merging two datasets

DATA DSN3;

     MERGE DSN2(IN=A) DSN2(IN=B);

     BY VAR;

     IF A AND B;

RUN;

ARE YOU SURE YOU ARE USING THE 'MERGE' STATEMENT AND NOT THE 'SET' STATEMENT?

Occasional Contributor
Posts: 14

Re: Merging two datasets

Hi Statdork,

Thanks for the response. I have posted the code in a lower post for review.

PROC Star
Posts: 7,356

Re: Merging two datasets

There is no proc (that I'm aware of) called proc merge

Show the code you ran and some sample data that produces the observed discrepancy.

Occasional Contributor
Posts: 14

Re: Merging two datasets

Your are absolutely correct Authur, there is no proc merge, that was an error, I was using the merge in a datastep. Thanks for the correction.

PROC Star
Posts: 7,356

Re: Merging two datasets

I don't think that anyone can answer your question about what initially went wrong without seeing both the code you originally submitted and the data.

Since you got the expected results running the code that you did run, it sounds like you only needed to interleave the two datasets.  e.g.:

data tab1;

  input var x;

  cards;

1 1

2 2

3 3

4 4

5 5

;

data tab2;

  input var y;

  cards;

1 21

2 22

3 23

4 24

5 25

;

data want;

  set tab1;

  set tab2;

run;

Ask a Question
Discussion stats
  • 14 replies
  • 459 views
  • 3 likes
  • 6 in conversation