BookmarkSubscribeRSS Feed
samchat
Calcite | Level 5

Working on merging two data sets with two identical variables. Each data set contains 768 observations. After using a proc merge, I got 288 observations. Did I do anything wrong?

how many observations should ideally be in the final/merged data set?

-Sam

14 REPLIES 14
Reeza
Super User

Anywhere from 0 to significantly more than 768, if you have many to many matches. 

The answer depends on your business context and what you expect. Did you expect a one-to-one merge, with all records in each table?

samchat
Calcite | Level 5

Thanks Reeza for the speedy response. I was expecting a one to one merge. However, after altering the code a bit, I did manage to have a dataset with 768 obs after the merge. However, I still don't understand what happened the first time, where there was 288 obs in after the merge. At the very least I would have expected 2x 768. What would have made SAS drop 480 obs?

Tom
Super User Tom
Super User

A normal SAS MERGE statement you cannot end up with fewer than the number in the larger of the two sets.

Total will N1 + N2 - (number that are in both).

So if I merge (1,2,3,4) with (4,5,6,7) I will end up with 7 observations. 7=4+4-1 .

samchat
Calcite | Level 5

Hi Tom. thanks for speedy response.Your explanation was very helpful.

Tom
Super User Tom
Super User

You posted code down thread where you adding an IF in addition to the MERGE statement.  So your code is only looking only for the INTERSECTION of the two files. So the number of records for an intersection of two tables could be as small as zero or as large as the number of records the larger of the two files. And anywhere in between.

samchat
Calcite | Level 5

Hi Tom,

Thanks for pointing the fact that the IF statement points at an intersection of the two data sets...

So I guess in my code, SAS only looked for where the two variables were aligned in both the data sets, and output that number of obs. Hence the 288 obs.

-Sam

Peter_C
Rhodochrosite | Level 12

if you show code, we might be able to explain

samchat
Calcite | Level 5

Thank you for your help

Below is the code:

tab1 and tab2 has 768

*creating srno using automatic _N_ variable to perform merge;

DATA tab1;

    SET raw.tab1;

    srno=_N_;

RUN;

*sort tab1 dataset by var1 and srno for merge;

PROC SORT DATA=tab1  OUT=sr_tab1;

    BY var1 srno;

RUN;

  *creating varible srno using automatic _N_  to create a common variable;

DATA tab2;

    SET raw.tab2;

    srno=_N_;

RUN;

      *sort tab2 dataset by var1 and srno for merge;

PROC SORT DATA=tab2 OUT=sr_tab2;

    BY var1 srno;

RUN;

      *merging datasets across var1 and srno variables;

DATA tabf;

    MERGE sr_tab1 (IN=a)

                 sr_tab2 (IN=b)

          ;

    BY var1 srno;

    IF a AND b;

RUN;

Tom
Super User Tom
Super User

Why did you add the record counter?  Are you doing a many to many merge?  SAS data step is not well suited for that.  For each level of your id variable (VAR1)  you will get the number of observations equal to the larger of those that are in either source data set.  The records will be matched one to one and when one of the datasets runs out of records for that level of the id variable the values for the other variables from the dataset will be retained (replicated) onto each extra row.  So if merge (1,1),(1,2) with (1,A),(1,B),(1,C) you will get (1,1,A),(1,2,B),(1,2,C).  Looks like from your code you want instead to have that third observation be (1,.,C).  That can be done without the extra recno variables by adding a OUTPUT and CALL MISSING() statements.

Let's assume that TAB1 has extra variable X and TAB2 has extra variable Z.

DATA tabf;

    MERGE tab1 (IN=a)

                 tab2 (IN=b)

          ;

    BY var1 ;

    IF a AND b;

   OUTPUT;

    call missing(X,Z);

RUN;

Normally what people want when doing many-to-many joins is a full outer join that you can get from SQL.

proc sql ;

  create table tabf as select * from tab1 full join tab2 on tab1.var1 = tab2.var1 ;

quit;

So for my example about with 2 records from TAB1 and 3 records from TAB2 this would yield 6 records (2*3=6). 

(1,1,A), (1,1,B), (1,1,C), (1,2,A),(1,2,B),(1,2,C)

Statdork
Calcite | Level 5

DATA DSN3;

     MERGE DSN2(IN=A) DSN2(IN=B);

     BY VAR;

     IF A AND B;

RUN;

ARE YOU SURE YOU ARE USING THE 'MERGE' STATEMENT AND NOT THE 'SET' STATEMENT?

samchat
Calcite | Level 5

Hi Statdork,

Thanks for the response. I have posted the code in a lower post for review.

art297
Opal | Level 21

There is no proc (that I'm aware of) called proc merge

Show the code you ran and some sample data that produces the observed discrepancy.

samchat
Calcite | Level 5

Your are absolutely correct Authur, there is no proc merge, that was an error, I was using the merge in a datastep. Thanks for the correction.

art297
Opal | Level 21

I don't think that anyone can answer your question about what initially went wrong without seeing both the code you originally submitted and the data.

Since you got the expected results running the code that you did run, it sounds like you only needed to interleave the two datasets.  e.g.:

data tab1;

  input var x;

  cards;

1 1

2 2

3 3

4 4

5 5

;

data tab2;

  input var y;

  cards;

1 21

2 22

3 23

4 24

5 25

;

data want;

  set tab1;

  set tab2;

run;

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 14 replies
  • 3322 views
  • 3 likes
  • 6 in conversation