Re: merge

HeatherNewton · Posted 02-21-2024 05:48 AM

data pd_score;
merge pd_score(in=a) long_run_pd;
by segment_id;
if pd>long_run_pd then long_run_pd=pd;
if a;
run;

in the above, if segment_id is the same, pd is different in pd_score and long_run_pd, which pd do we get in the resulting table in the merging process?

also the if then statement, does it matter if it is after the statement if a or before it ...

I was also looking at the following merge:

data long3;

merge long_run_pd pd_score;

by segment_id sampdate;

run;

here I am sure if segment_id, sampdate same, it will take pd from sampdate which we can think of as b

but after this I cant figure out how it works for the other merge with if A anymore...

himself · Posted 02-21-2024 06:08 AM

Hi to help you better would recommend you go through the following link:

Merge with Caution by Joshua M. Horstman

However

data pd_score;
merge pd_score(in=a) long_run_pd;
by segment_id;
if pd>long_run_pd then long_run_pd=pd;
if a;
run;

In this merge, if segment_id is the same but pd is different in pd_score and long_run_pd, the pd value from pd_score will be used in the resulting table. This is because pd_score is listed first in the merge statement, so its values have precedence.

The if pd>long_run_pd then long_run_pd=pd; statement is updating long_run_pd to be equal to pd if pd is greater than long_run_pd.

The if a; statement is a subsetting if statement. It means that only the observations that come from the pd_score dataset (where a is true) will be output to the resulting dataset.

I hope this helps!

HeatherNewton · Posted 02-21-2024 06:44 AM

data long3;
merge long_run_pd pd_score;
by segment_id sampdate;
run;

What about the above, do we take pd from long run pd or pd score, i tested it takes from pd score

What are the rules really?

Astounding · Posted 02-21-2024 08:18 AM

The answer is actually more complex than you would think. It depends in part on whether you have a one-to-one match or a many-to-one match. If you are interested in reading and learning, here is a link to an old paper than explains the process:

https://www.lexjansen.com/nesug/nesug99/ad/ad155.pdf

HeatherNewton · Posted 02-21-2024 07:04 PM

If one to one, which one does it take? I thought would be b

Patrick · Posted 02-21-2024 08:10 PM

@HeatherNewton wrote:
If one to one, which one does it take? I thought would be b

Please read some of the very good papers already shared with you that explain all of this.

See for example 6. Overlapping variables from Merge with Caution

Tom · Posted 02-21-2024 08:14 PM

@HeatherNewton wrote:
If one to one, which one does it take? I thought would be b

No. It value will reflect the last observation read in.

So in a 1 to 1 match if both A and B contribute the the value from B is read last.

But if there that value of KEY variables do not exist in B then the value from A stays since nothing was read to overwrite it.

Patrick · Posted 02-21-2024 08:34 PM

Creating small sample programs to test what's happening often help to clarify things.

data a;
  id=1;var='a1';output;
  id=2;var='a2';output;
run;
data b;
  id=1;var='b1';output;
  id=1;var='b1';output;
  id=3;var='b3';output;
run;

data test;
  merge a b;
  by id;
run;

proc print data=test;
run;

PaigeMiller · Posted 02-21-2024 07:34 AM

You could do something like this, which would make it very clear which PD is which

data pd_score;
    merge pd_score(in=a rename=(pd=pd_from_pd_score)) long_run_pd(rename=(pd=pd_from_long_run));
    by segment_id;
/* More data step commands as necessary */
run;

--
Paige Miller

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away