Solved: Re: many many merge issue

mapuchito · Posted 07-06-2023 09:40 AM

Hello all,

I have this dataset:

subjid stdy

103002 1

103002 141

103003 1

103003 534

and i need to merge with this dataset:

subjid ady aval

103002 42 stable disease

103002 98 progressive disease

103002 140 NE

103003 45 stable disease

103003 101 stable disease

103003 157 stable disease

103003 213 stable disease

103003 269 stable disease

103003 325 stable disease

103003 380 stable disease

103003 436 stable disease

103003 491 progressive disease

103003 549 stable disease

I need to keep that order ady in the final dataset and avoid the message : MERGE statement has more than one dataset with repeat by values.

Any help?

thanks in advance.

mkeintz · Posted 07-07-2023 11:00 PM

You can replicate the merge results in a single data step and avoid the "multiple data sets have repeats of by values" as below. Why one would need to do this, I don't see.

data merge_without_notes (drop=_:);
  /* For each subjid, read and count obs in each dataset */
  do until (last.subjid);
    set one (in=in1)  two (in=in2);
    by subjid;
    _n1+in1;
    _n2+in2;
  end;
  /* Reread both datasets, output, and  decrement each count to zero */
  do until (max(_n1,_n2)=0);
    if _n1>0 then set one;
    if _n2>0 then set two;
    output;
    _n1=ifn(_n1>0,_n1-1,0);
    _n2=ifn(_n2>0,_n2-1,0);
  end;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

Reeza · Posted 07-06-2023 11:24 AM

Please show expected results.

KRusso · Posted 07-06-2023 12:33 PM

If you need to preserve the order after merging, you could create a sortkey that is the row number. sortkey= _N_;

Kurt_Bremser · Posted 07-06-2023 01:03 PM

@mapuchito wrote:

I need to keep that order ady in the final dataset and avoid the message : MERGE statement has more than one dataset with repeat by values.

To avoid this message, you need to de-duplicate all but one of the datasets used in the MERGE.

Please show the expected result of the merge of your datasets..

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Tom · Posted 07-06-2023 01:13 PM

Either MERGE with both SUBJECT and DAY.

data want;
  merge one(in=in1 rename=(stdy=day)) two(in=in2 rename=(ady=day));
  by subjid day;
  if in1 then stdy=day;
  if in2 then ady=day;
run;

Or depending on what you are actually trying to do INTERLEAVE instead. So to find the most recent STDY value for each ADY value you could use something like:

data want;
  set one(in=in1 rename=(stdy=day)) two(in=in2 rename=(ady=day));
  by subjid day;
  if in1 then stdy=day;
  if in2 and first.subjid then call missing(stdy);
  retain stdy;
  if in2 then ady=day;
run;

mapuchito · Posted 07-07-2023 04:43 AM

the expected results in my dataset is:

103002 1 42 stable disease

103002 141 98 progressive disease

103002 141 140 NE

103003 1 45 stable disease

103003 534 101 stable disease

.

103003 534 549 stable disease

the first row (stdy=1) corresponding the first ady (aval),

and the rest of the ady (aval) values go to to the last row (stdy)

I hope it is more clear now.

Thanks in advance.

Kurt_Bremser · Posted 07-07-2023 06:11 AM

Which means you will get the message about multiple datasets with repeats. No way around that.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Tom · Posted 07-07-2023 08:29 AM

That is exactly what a MERGE statement will do with a many to many merge. Ignore the note as you are expecting there to be a many to many merge.

data one ;
  input subjid stdy ;
cards;
103002 1
103002 141
103003 1
103003 534
;

data two;
  input subjid ady aval $40. ;
cards;
103002  42 stable disease
103002  98 progressive disease
103002 140 NE
103003  45 stable disease
103003 101 stable disease
103003 157 stable disease 
103003 213 stable disease
103003 269 stable disease
103003 325 stable disease 
103003 380 stable disease
103003 436 stable disease
103003 491 progressive disease
103003 549 stable disease
;

data expect ;
  input subjid stdy ady aval $40.;
cards;
103002 1 42 stable disease
103002 141 98 progressive disease
103002 141 140 NE
103003 1 45 stable disease
103003 534 101 stable disease
;

data want;
  merge one two;
  by subjid;
run;

proc compare data=want compare=expect;
run;

What do you want to happen if there is only one observation in the second dataset for a subject?

If you want to avoid having ADY and AVAL repeated onto every value of STDY then add and explicit OUTPUT and a CALL MISSING().

data want;
  merge one two;
  by subjid;
  output;
  call missing(ady,aval);
run;

mapuchito · Posted 07-07-2023 08:32 AM

i cannot ignore the note, i am afraid.

it should be away to do it, to avoid the note.

thanks.

Tom · Posted 07-07-2023 08:52 AM

@mapuchito wrote:

i cannot ignore the note, i am afraid.

it should be away to do it, to avoid the note.

thanks.

Why?

You can add your own note.

79   data want;
80     merge one two;
81     by subjid;
82     if _N_=1 then put 'NOTE: MERGE statement is EXPECTED to have more than one data set with repeats of BY values.';
83   run;

NOTE: MERGE statement is EXPECTED to have more than one data set with repeats of BY values.
NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 4 observations read from the data set WORK.ONE.
NOTE: There were 13 observations read from the data set WORK.TWO.
NOTE: The data set WORK.WANT has 13 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

Tom · Posted 07-07-2023 09:34 AM

If you really need to replicate the action of MERGE without the NOTE about multiple repeats then you need to add a new counting variable that counts only up to the number of observations per subject in the smaller dataset. Then you can replicate by adding that new variable to the BY statement.

If you know ONE will have at most two observations per subject then just do:

data one;
  set one;
  by subjid;
  row = not first.subjid;
run;
data two;
  set two;
  row = not first.subjid;
run;
data want;
   merge one two;
   by subjid row;
run;

If you don't know the maximum number of observations per ID in ONE, but you know it will never be more than the maximum number in TWO then you could to this to add a ROW value to both datasets.


data oneB oneC(keep=subjid row rename=(row=maxrow));
  set one;
  by subjid ;
  if first.subjid then row=0;
  row+1;
  output oneB;
  if last.subjid then output oneC;
run;

data twoB ;
  merge two onec;
  by subjid;
  if first.subjid then row=0;
  row+1;
  row=min(row,maxrow);
  drop maxrow;
run;


data want;
  merge oneB twoB;
  by subjid row;
run;

If it is possible that for some ids the maximum will be the TWO dataset then the logic will need to be more complicated.

data maxrow;
  set one(in=in1) two(in=in2);
  by subjid;
  if first.subjid then call missing(max1,max2);
  max1+in1;
  max2+in2;
  if last.subjid;
  maxrow=max(max1,max2);
  keep subjid maxrow;
run;

data oneB;
  merge one(in=in1) maxrow;
  by subjid;
  if first.subjid then row=0;
  row+1;
  row=min(row,maxrow);
  if in1;
  drop maxrow;
run;

data twoB;
  merge two(in=in2) maxrow;
  by subjid;
  if first.subjid then row=0;
  row+1;
  row=min(row,maxrow);
  if in2;
  drop maxrow;
run;

data want;
  merge oneB twoB ;
  by subjid row;
run;

mkeintz · Posted 07-07-2023 11:00 PM

You can replicate the merge results in a single data step and avoid the "multiple data sets have repeats of by values" as below. Why one would need to do this, I don't see.

data merge_without_notes (drop=_:);
  /* For each subjid, read and count obs in each dataset */
  do until (last.subjid);
    set one (in=in1)  two (in=in2);
    by subjid;
    _n1+in1;
    _n2+in2;
  end;
  /* Reread both datasets, output, and  decrement each count to zero */
  do until (max(_n1,_n2)=0);
    if _n1>0 then set one;
    if _n2>0 then set two;
    output;
    _n1=ifn(_n1>0,_n1-1,0);
    _n2=ifn(_n2>0,_n2-1,0);
  end;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

mapuchito · Posted 07-12-2023 04:20 PM

Brilliant, this is exactly what I wanted. Thanks a lot.

Kurt_Bremser · Posted 07-08-2023 02:02 AM

@mapuchito wrote:

i cannot ignore the note, i am afraid.

Why? It's a NOTE, not an ERROR or WARNING. It does not cause a non-zero exit code in batch mode.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Registration is open

SAS Training: Just a Click Away