Merging question.

deleted_user · Posted 12-16-2010 06:45 AM

If

1. every ID in A can be found in B

and

2. there is no other ID in B than in A,

shouldn't the following examples give the same result?

data C;
merge A(in=x) B(in=y);
by ID;
if x and y;
run;

data C;
merge A(in=x) B;
by ID;
if x;
run;

art297 · Posted 12-16-2010 09:17 AM

Ernesto,

I would think that the only instances where it might produce different results are:

1. if either file has variables with the name x or y
2. if one or both files have multiple records for some IDs

Art

SPR · Posted 12-16-2010 10:44 AM

Hello EmestoC,

Your conditions claim that IDs in both datasets are the same. The only possible difference you allow is to have multiple IDs in both datasets. However, this does not change the conclusion. I think the output should be the same in both merges.

Sincerely,
SPR

chang_y_chung_hotmail_com · Posted 12-16-2010 11:24 AM

SPR wrote:

> Your conditions claim that IDs in both datasets are

> the same. The only possible difference you allow is

> to have multiple IDs in both datasets. However, this

> does not change the conclusion. I think the output

> should be the same in both merges.

If A is a subset of B and B is a subset of A, then A is equal to B for sure. But OP's question is not about set theory or of logic. And OP's assertion is easy to be shown false with a simple example like so:

/* a counter example */

data a;

id=1; output;

run;

data b;

id=1; y=0; output;

run;

data c;

merge a(in=x) b(in=y);

by id;

if x and y;

run;

data d;

merge a(in=x) b;

by id;

if x;

run;

proc compare base=c compare=d;

run;

/* on log -- in part

WARNING: No matching observations were found.

NOTE: There were 0 observations read from the data set WORK.C.

NOTE: There were 1 observations read from the data set WORK.D.

*/

SPR · Posted 12-16-2010 01:35 PM

Hello Chang_y_chung,

"If A is a subset of B and B is a subset of A, then A is equal to B for sure."

How you arrived to this conclusion? OP said: "1. every ID in A can be found in B" it means that multiple obs with the same ID are possible, so A should NOT ALWAYS be equal B.

As to (in=y). my EG 4.3 produces the following message:

WARNING: The variable y exists on an input data set and is also set by an I/O statement option. The variable will not be included on any output data set and unexpected results may occur.

Sincerely,
SPR

ArtC · Posted 12-16-2010 02:03 PM

The data set option IN= creates a temporary variable, unless that variable already exists on the PDV. In that case the warning is fortunately issued. However the results are sometimes surprising.

Here the use of IN=Y forces Y to become a temporary variable and the variable Y from the data set B is dropped.
[pre]
data d;
set a(in=y) b;
run;
[/pre]

deleted_user · Posted 12-16-2010 12:41 PM

ID is a numeric variable.

My point 2 was wrong, I have now found out. There were records in B with missing values( = . ) for ID.

That explains why I got different results by the two examples.

Thank's for your answers.

Merging question.

Re: Merging question.

Re: Merging question.

Re: Merging question.

Re: Merging question.

Re: Merging question.

Re: Merging question.

Classroom Training Available!