Solved: Merging

shynu · Posted 06-17-2017 07:17 AM

data a;
input m p c;
cards;
102 113 114
102 124 134
103 114 115
103 124 135
104 115 116
104 125 136
run;

data b;
input m p c;
cards;
102 103 104
103 104 105
104 105 106
run;

data e1;
merge a b;
by m;
run;
proc print data=e;
run;

OUTPUT:

Obs m p c
1 102 103 104
2 102 124 134
3 103 104 105
4 103 124 135
5 104 105 106
6 104 125 136

data a;
input m p c;
cards;
102 103 104
103 104 105
104 105 106
run;

data b;
input m p c ;
cards;
102 113 114
102 103 104
102 124 134
103 114 115
103 104 105
103 124 135
104 115 116
104 105 106
104 125 136
run;

data e2;
merge a b;
by m;
run;
proc print data=e;
run;

OUTPUT:

Obs m p c
1 102 113 114
2 102 103 104
3 102 124 134
4 103 114 115
5 103 104 105
6 103 124 135
7 104 115 116
8 104 105 106
9 104 125 136

Data A;
Input ID Name$ Height;
cards;
1 A 1
3 B 2
5 C 2
7 D 2
9 E 2
;
run;

Data B;
Input ID Name$ Weight;
cards;
2 A 2
4 B 3
5 C 4
7 D 5
;
run;

Data dummy;
Merge A B;
by ID;
run;
proc print data = dummy;
run;

OUTPUT:

Obs ID Name Height Weight
1 1 A 1 .
2 2 A . 2
3 3 B 2 .
4 4 B . 3
5 5 C 2 4
6 7 D 2 5
7 9 E 2 .

I have used a simple merge condition in all the 3 codes, but very confused after seeing the ouput.. Can anyone please clarify my confusion?

Tom · Posted 06-17-2017 10:11 AM

If you just walk through the steps involved in merging you can easily see why you are getting the output you posted.

Let's take the case where M=102.

data a;
  input m p c;
cards;
102 113 114
102 124 134
;
data b;
  input m p c;
cards;
102 103 104
;
data want ;
  merge a b;
  by m;
run;

So on the first iteration (_N_=1) the MERGE statement will find a record from both A and B.

So it first loads the record from A and sets the variables to these values

m=102 p=113 c=114

It then loads the record from B and doing this will replace the values of P and C (I guess it might also replace M but since it is the same value it doesn't change anything).

This will result in

m=102 p=103 c=104

We then reach the end and this is output and the next iteration _n_=2 starts. Now this time there are no more records from B to load so we only load the new recrod from A. So we get

m=102 p=124 c=134

which is then output at the end of the data step.

In general it is a bad idea to merge datasets that have common variables that are not part of the BY variables.

View solution in original post

PaigeMiller · Posted 06-17-2017 07:24 AM

We don't know what is your confusion. You haven't explained what you are confused about. Give us details.

--
Paige Miller

shynu · Posted 06-17-2017 07:30 AM

when this type of merging is done we need to get all the obs from both the datasets right? But why is the output varying in all the cases?

Kurt_Bremser · Posted 06-17-2017 08:26 AM

Your third example is impossible. None of your input datasets contains ID 6.

Examples 1 & 2 perfectly document the behaviour of datasteps with a by, no conditions, and identically named variables in the input datasets. With identical variables, the sequence of reading the input datasets has an impact on what will end up in the output.

Having identical variables in a data step merge is therefore (usually) a BAD IDEA.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

shynu · Posted 06-17-2017 12:13 PM

but there is no 6 in ID in output dataset too....the first column which has 6 is the obs number

Tom · Posted 06-17-2017 10:11 AM

If you just walk through the steps involved in merging you can easily see why you are getting the output you posted.

Let's take the case where M=102.

data a;
  input m p c;
cards;
102 113 114
102 124 134
;
data b;
  input m p c;
cards;
102 103 104
;
data want ;
  merge a b;
  by m;
run;

So on the first iteration (_N_=1) the MERGE statement will find a record from both A and B.

So it first loads the record from A and sets the variables to these values

m=102 p=113 c=114

It then loads the record from B and doing this will replace the values of P and C (I guess it might also replace M but since it is the same value it doesn't change anything).

This will result in

m=102 p=103 c=104

We then reach the end and this is output and the next iteration _n_=2 starts. Now this time there are no more records from B to load so we only load the new recrod from A. So we get

m=102 p=124 c=134

which is then output at the end of the data step.

In general it is a bad idea to merge datasets that have common variables that are not part of the BY variables.

shynu · Posted 06-17-2017 12:18 PM

thank you ... Now I understood how it works 🙂

Merging

Re: Merging

Re: Merging

Re: Merging

Re: Merging

Re: Merging

Re: Merging

Re: Merging

Classroom Training Available!