topic Re: many many merge issue in SAS Programming

many many merge issue

mapuchito — Thu, 06 Jul 2023 13:40:14 GMT

Hello all,

I have this dataset:

subjid stdy

103002 1

103002 141

103003 1

103003 534

and i need to merge with this dataset:

subjid ady aval

103002 42 stable disease

103002 98 progressive disease

103002 140 NE

103003 45 stable disease

103003 101 stable disease

103003 157 stable disease

103003 213 stable disease

103003 269 stable disease

103003 325 stable disease

103003 380 stable disease

103003 436 stable disease

103003 491 progressive disease

103003 549 stable disease

I need to keep that order ady in the final dataset and avoid the message : MERGE statement has more than one dataset with repeat by values.

Any help?

thanks in advance.

Re: many many merge issue

Reeza — Thu, 06 Jul 2023 15:24:10 GMT

Please show expected results.

Re: many many merge issue

KRusso — Thu, 06 Jul 2023 16:33:58 GMT

If you need to preserve the order after merging, you could create a sortkey that is the row number. sortkey= _N_;

Re: many many merge issue

Kurt_Bremser — Thu, 06 Jul 2023 17:03:55 GMT

@mapuchito wrote:

I need to keep that order ady in the final dataset and avoid the message : MERGE statement has more than one dataset with repeat by values.

To avoid this message, you need to de-duplicate all but one of the datasets used in the MERGE.

Please show the expected result of the merge of your datasets..

Re: many many merge issue

Tom — Thu, 06 Jul 2023 17:16:14 GMT

Either MERGE with both SUBJECT and DAY.

data want;
  merge one(in=in1 rename=(stdy=day)) two(in=in2 rename=(ady=day));
  by subjid day;
  if in1 then stdy=day;
  if in2 then ady=day;
run;

Or depending on what you are actually trying to do INTERLEAVE instead. So to find the most recent STDY value for each ADY value you could use something like:

data want;
  set one(in=in1 rename=(stdy=day)) two(in=in2 rename=(ady=day));
  by subjid day;
  if in1 then stdy=day;
  if in2 and first.subjid then call missing(stdy);
  retain stdy;
  if in2 then ady=day;
run;

Re: many many merge issue

mapuchito — Fri, 07 Jul 2023 08:43:14 GMT

the expected results in my dataset is:

103002 1 42 stable disease

103002 141 98 progressive disease

103002 141 140 NE

103003 1 45 stable disease

103003 534 101 stable disease

103003 534 549 stable disease

the first row (stdy=1) corresponding the first ady (aval),

and the rest of the ady (aval) values go to to the last row (stdy)

I hope it is more clear now.

Thanks in advance.

Re: many many merge issue

Kurt_Bremser — Fri, 07 Jul 2023 10:11:37 GMT

Which means you will get the message about multiple datasets with repeats. No way around that.

Re: many many merge issue

Tom — Fri, 07 Jul 2023 12:29:00 GMT

That is exactly what a MERGE statement will do with a many to many merge. Ignore the note as you are expecting there to be a many to many merge.

data one ;
  input subjid stdy ;
cards;
103002 1
103002 141
103003 1
103003 534
;

data two;
  input subjid ady aval $40. ;
cards;
103002  42 stable disease
103002  98 progressive disease
103002 140 NE
103003  45 stable disease
103003 101 stable disease
103003 157 stable disease 
103003 213 stable disease
103003 269 stable disease
103003 325 stable disease 
103003 380 stable disease
103003 436 stable disease
103003 491 progressive disease
103003 549 stable disease
;

data expect ;
  input subjid stdy ady aval $40.;
cards;
103002 1 42 stable disease
103002 141 98 progressive disease
103002 141 140 NE
103003 1 45 stable disease
103003 534 101 stable disease
;

data want;
  merge one two;
  by subjid;
run;

proc compare data=want compare=expect;
run;

What do you want to happen if there is only one observation in the second dataset for a subject?

If you want to avoid having ADY and AVAL repeated onto every value of STDY then add and explicit OUTPUT and a CALL MISSING().

data want;
  merge one two;
  by subjid;
  output;
  call missing(ady,aval);
run;

Re: many many merge issue

mapuchito — Fri, 07 Jul 2023 12:32:17 GMT

i cannot ignore the note, i am afraid.

it should be away to do it, to avoid the note.

thanks.

Re: many many merge issue

Tom — Fri, 07 Jul 2023 12:52:42 GMT

@mapuchito wrote:

i cannot ignore the note, i am afraid.

it should be away to do it, to avoid the note.

thanks.

Why?

You can add your own note.

79   data want;
80     merge one two;
81     by subjid;
82     if _N_=1 then put 'NOTE: MERGE statement is EXPECTED to have more than one data set with repeats of BY values.';
83   run;

NOTE: MERGE statement is EXPECTED to have more than one data set with repeats of BY values.
NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 4 observations read from the data set WORK.ONE.
NOTE: There were 13 observations read from the data set WORK.TWO.
NOTE: The data set WORK.WANT has 13 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

Re: many many merge issue

Tom — Fri, 07 Jul 2023 13:53:55 GMT

If you really need to replicate the action of MERGE without the NOTE about multiple repeats then you need to add a new counting variable that counts only up to the number of observations per subject in the smaller dataset. Then you can replicate by adding that new variable to the BY statement.

If you know ONE will have at most two observations per subject then just do:

data one;
  set one;
  by subjid;
  row = not first.subjid;
run;
data two;
  set two;
  row = not first.subjid;
run;
data want;
   merge one two;
   by subjid row;
run;

If you don't know the maximum number of observations per ID in ONE, but you know it will never be more than the maximum number in TWO then you could to this to add a ROW value to both datasets.


data oneB oneC(keep=subjid row rename=(row=maxrow));
  set one;
  by subjid ;
  if first.subjid then row=0;
  row+1;
  output oneB;
  if last.subjid then output oneC;
run;

data twoB ;
  merge two onec;
  by subjid;
  if first.subjid then row=0;
  row+1;
  row=min(row,maxrow);
  drop maxrow;
run;


data want;
  merge oneB twoB;
  by subjid row;
run;

If it is possible that for some ids the maximum will be the TWO dataset then the logic will need to be more complicated.

data maxrow;
  set one(in=in1) two(in=in2);
  by subjid;
  if first.subjid then call missing(max1,max2);
  max1+in1;
  max2+in2;
  if last.subjid;
  maxrow=max(max1,max2);
  keep subjid maxrow;
run;

data oneB;
  merge one(in=in1) maxrow;
  by subjid;
  if first.subjid then row=0;
  row+1;
  row=min(row,maxrow);
  if in1;
  drop maxrow;
run;

data twoB;
  merge two(in=in2) maxrow;
  by subjid;
  if first.subjid then row=0;
  row+1;
  row=min(row,maxrow);
  if in2;
  drop maxrow;
run;

data want;
  merge oneB twoB ;
  by subjid row;
run;

Re: many many merge issue

mkeintz — Sat, 08 Jul 2023 13:50:02 GMT

You can replicate the merge results in a single data step and avoid the "multiple data sets have repeats of by values" as below. Why one would need to do this, I don't see.

data merge_without_notes (drop=_:);
  /* For each subjid, read and count obs in each dataset */
  do until (last.subjid);
    set one (in=in1)  two (in=in2);
    by subjid;
    _n1+in1;
    _n2+in2;
  end;
  /* Reread both datasets, output, and  decrement each count to zero */
  do until (max(_n1,_n2)=0);
    if _n1>0 then set one;
    if _n2>0 then set two;
    output;
    _n1=ifn(_n1>0,_n1-1,0);
    _n2=ifn(_n2>0,_n2-1,0);
  end;
run;

Re: many many merge issue

Kurt_Bremser — Sat, 08 Jul 2023 06:02:23 GMT

@mapuchito wrote:

i cannot ignore the note, i am afraid.

Why? It's a NOTE, not an ERROR or WARNING. It does not cause a non-zero exit code in batch mode.

Re: many many merge issue

mapuchito — Wed, 12 Jul 2023 20:20:11 GMT

Brilliant, this is exactly what I wanted. Thanks a lot.