Hello all,
I have this dataset:
subjid stdy
103002 1
103002 141
103003 1
103003 534
and i need to merge with this dataset:
subjid ady aval
103002 42 stable disease
103002 98 progressive disease
103002 140 NE
103003 45 stable disease
103003 101 stable disease
103003 157 stable disease
103003 213 stable disease
103003 269 stable disease
103003 325 stable disease
103003 380 stable disease
103003 436 stable disease
103003 491 progressive disease
103003 549 stable disease
I need to keep that order ady in the final dataset and avoid the message : MERGE statement has more than one dataset with repeat by values.
Any help?
thanks in advance.
You can replicate the merge results in a single data step and avoid the "multiple data sets have repeats of by values" as below. Why one would need to do this, I don't see.
data merge_without_notes (drop=_:);
/* For each subjid, read and count obs in each dataset */
do until (last.subjid);
set one (in=in1) two (in=in2);
by subjid;
_n1+in1;
_n2+in2;
end;
/* Reread both datasets, output, and decrement each count to zero */
do until (max(_n1,_n2)=0);
if _n1>0 then set one;
if _n2>0 then set two;
output;
_n1=ifn(_n1>0,_n1-1,0);
_n2=ifn(_n2>0,_n2-1,0);
end;
run;
If you need to preserve the order after merging, you could create a sortkey that is the row number. sortkey= _N_;
@mapuchito wrote:
I need to keep that order ady in the final dataset and avoid the message : MERGE statement has more than one dataset with repeat by values.
To avoid this message, you need to de-duplicate all but one of the datasets used in the MERGE.
Please show the expected result of the merge of your datasets..
Either MERGE with both SUBJECT and DAY.
data want;
merge one(in=in1 rename=(stdy=day)) two(in=in2 rename=(ady=day));
by subjid day;
if in1 then stdy=day;
if in2 then ady=day;
run;
Or depending on what you are actually trying to do INTERLEAVE instead. So to find the most recent STDY value for each ADY value you could use something like:
data want;
set one(in=in1 rename=(stdy=day)) two(in=in2 rename=(ady=day));
by subjid day;
if in1 then stdy=day;
if in2 and first.subjid then call missing(stdy);
retain stdy;
if in2 then ady=day;
run;
the expected results in my dataset is:
103002 1 42 stable disease
103002 141 98 progressive disease
103002 141 140 NE
103003 1 45 stable disease
103003 534 101 stable disease
.
.
.
103003 534 549 stable disease
the first row (stdy=1) corresponding the first ady (aval),
and the rest of the ady (aval) values go to to the last row (stdy)
I hope it is more clear now.
Thanks in advance.
Which means you will get the message about multiple datasets with repeats. No way around that.
That is exactly what a MERGE statement will do with a many to many merge. Ignore the note as you are expecting there to be a many to many merge.
data one ;
input subjid stdy ;
cards;
103002 1
103002 141
103003 1
103003 534
;
data two;
input subjid ady aval $40. ;
cards;
103002 42 stable disease
103002 98 progressive disease
103002 140 NE
103003 45 stable disease
103003 101 stable disease
103003 157 stable disease
103003 213 stable disease
103003 269 stable disease
103003 325 stable disease
103003 380 stable disease
103003 436 stable disease
103003 491 progressive disease
103003 549 stable disease
;
data expect ;
input subjid stdy ady aval $40.;
cards;
103002 1 42 stable disease
103002 141 98 progressive disease
103002 141 140 NE
103003 1 45 stable disease
103003 534 101 stable disease
;
data want;
merge one two;
by subjid;
run;
proc compare data=want compare=expect;
run;
What do you want to happen if there is only one observation in the second dataset for a subject?
If you want to avoid having ADY and AVAL repeated onto every value of STDY then add and explicit OUTPUT and a CALL MISSING().
data want;
merge one two;
by subjid;
output;
call missing(ady,aval);
run;
i cannot ignore the note, i am afraid.
it should be away to do it, to avoid the note.
thanks.
@mapuchito wrote:
i cannot ignore the note, i am afraid.
it should be away to do it, to avoid the note.
thanks.
Why?
You can add your own note.
79 data want; 80 merge one two; 81 by subjid; 82 if _N_=1 then put 'NOTE: MERGE statement is EXPECTED to have more than one data set with repeats of BY values.'; 83 run; NOTE: MERGE statement is EXPECTED to have more than one data set with repeats of BY values. NOTE: MERGE statement has more than one data set with repeats of BY values. NOTE: There were 4 observations read from the data set WORK.ONE. NOTE: There were 13 observations read from the data set WORK.TWO. NOTE: The data set WORK.WANT has 13 observations and 4 variables. NOTE: DATA statement used (Total process time): real time 0.00 seconds cpu time 0.00 seconds
If you really need to replicate the action of MERGE without the NOTE about multiple repeats then you need to add a new counting variable that counts only up to the number of observations per subject in the smaller dataset. Then you can replicate by adding that new variable to the BY statement.
If you know ONE will have at most two observations per subject then just do:
data one;
set one;
by subjid;
row = not first.subjid;
run;
data two;
set two;
row = not first.subjid;
run;
data want;
merge one two;
by subjid row;
run;
If you don't know the maximum number of observations per ID in ONE, but you know it will never be more than the maximum number in TWO then you could to this to add a ROW value to both datasets.
data oneB oneC(keep=subjid row rename=(row=maxrow));
set one;
by subjid ;
if first.subjid then row=0;
row+1;
output oneB;
if last.subjid then output oneC;
run;
data twoB ;
merge two onec;
by subjid;
if first.subjid then row=0;
row+1;
row=min(row,maxrow);
drop maxrow;
run;
data want;
merge oneB twoB;
by subjid row;
run;
If it is possible that for some ids the maximum will be the TWO dataset then the logic will need to be more complicated.
data maxrow;
set one(in=in1) two(in=in2);
by subjid;
if first.subjid then call missing(max1,max2);
max1+in1;
max2+in2;
if last.subjid;
maxrow=max(max1,max2);
keep subjid maxrow;
run;
data oneB;
merge one(in=in1) maxrow;
by subjid;
if first.subjid then row=0;
row+1;
row=min(row,maxrow);
if in1;
drop maxrow;
run;
data twoB;
merge two(in=in2) maxrow;
by subjid;
if first.subjid then row=0;
row+1;
row=min(row,maxrow);
if in2;
drop maxrow;
run;
data want;
merge oneB twoB ;
by subjid row;
run;
You can replicate the merge results in a single data step and avoid the "multiple data sets have repeats of by values" as below. Why one would need to do this, I don't see.
data merge_without_notes (drop=_:);
/* For each subjid, read and count obs in each dataset */
do until (last.subjid);
set one (in=in1) two (in=in2);
by subjid;
_n1+in1;
_n2+in2;
end;
/* Reread both datasets, output, and decrement each count to zero */
do until (max(_n1,_n2)=0);
if _n1>0 then set one;
if _n2>0 then set two;
output;
_n1=ifn(_n1>0,_n1-1,0);
_n2=ifn(_n2>0,_n2-1,0);
end;
run;
@mapuchito wrote:
i cannot ignore the note, i am afraid.
Why? It's a NOTE, not an ERROR or WARNING. It does not cause a non-zero exit code in batch mode.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 16. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.