BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
mapuchito
Calcite | Level 5

Hello all,

 

I have this dataset:

 

subjid stdy 

103002 1

103002 141

103003 1

103003 534

 

and i need to merge with this dataset:

subjid ady aval

103002  42 stable disease

103002 98 progressive disease

103002 140 NE

103003 45 stable disease

103003 101 stable disease

103003 157 stable disease 

103003 213 stable disease

103003 269 stable disease

103003 325 stable disease 

103003 380 stable disease

103003 436 stable disease

103003 491 progressive disease

103003 549 stable disease

 

I need to keep that order ady in the final dataset and avoid the message : MERGE statement has more than one dataset with repeat by values.

 

Any help?

 

thanks in advance.

 

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

You can replicate the merge results in a single data step and avoid the "multiple data sets have repeats of by values" as below.  Why one would need to do this, I don't see.

 

data merge_without_notes (drop=_:);
  /* For each subjid, read and count obs in each dataset */
  do until (last.subjid);
    set one (in=in1)  two (in=in2);
    by subjid;
    _n1+in1;
    _n2+in2;
  end;
  /* Reread both datasets, output, and  decrement each count to zero */
  do until (max(_n1,_n2)=0);
    if _n1>0 then set one;
    if _n2>0 then set two;
    output;
    _n1=ifn(_n1>0,_n1-1,0);
    _n2=ifn(_n2>0,_n2-1,0);
  end;
run;

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

13 REPLIES 13
Reeza
Super User
Please show expected results.
KRusso
Obsidian | Level 7

If you need to preserve the order after merging, you could create a sortkey that is the row number.  sortkey= _N_; 

Kurt_Bremser
Super User

@mapuchito wrote:

 

I need to keep that order ady in the final dataset and avoid the message : MERGE statement has more than one dataset with repeat by values.


To avoid this message, you need  to de-duplicate all but one of the datasets used in the MERGE.

Please show the expected result of the merge of your datasets..

Tom
Super User Tom
Super User

Either MERGE with both SUBJECT and DAY.

data want;
  merge one(in=in1 rename=(stdy=day)) two(in=in2 rename=(ady=day));
  by subjid day;
  if in1 then stdy=day;
  if in2 then ady=day;
run;

Or depending on what you are actually trying to do INTERLEAVE instead.  So to find the most recent STDY value for each ADY value you could use something like:

data want;
  set one(in=in1 rename=(stdy=day)) two(in=in2 rename=(ady=day));
  by subjid day;
  if in1 then stdy=day;
  if in2 and first.subjid then call missing(stdy);
  retain stdy;
  if in2 then ady=day;
run;
mapuchito
Calcite | Level 5

the expected results in my dataset is:

 

103002 1 42 stable disease

103002 141 98 progressive disease

103002 141 140 NE

103003 1 45 stable disease

103003 534 101 stable disease

.

.

.

103003 534 549 stable disease

 

the first row (stdy=1) corresponding the first ady (aval),

and the rest of the ady (aval) values go to to the last row (stdy)

 

I hope it is more clear now.

 

Thanks in advance.

Tom
Super User Tom
Super User

That is exactly what a MERGE statement will do with a many to many merge.  Ignore the note as you are expecting there to be a many to many merge.

data one ;
  input subjid stdy ;
cards;
103002 1
103002 141
103003 1
103003 534
;

data two;
  input subjid ady aval $40. ;
cards;
103002  42 stable disease
103002  98 progressive disease
103002 140 NE
103003  45 stable disease
103003 101 stable disease
103003 157 stable disease 
103003 213 stable disease
103003 269 stable disease
103003 325 stable disease 
103003 380 stable disease
103003 436 stable disease
103003 491 progressive disease
103003 549 stable disease
;

data expect ;
  input subjid stdy ady aval $40.;
cards;
103002 1 42 stable disease
103002 141 98 progressive disease
103002 141 140 NE
103003 1 45 stable disease
103003 534 101 stable disease
;

data want;
  merge one two;
  by subjid;
run;

proc compare data=want compare=expect;
run;

What do you want to happen if there is only one observation in the second dataset for a subject?

If you want to avoid having ADY and AVAL repeated onto every value of STDY then add and explicit OUTPUT and a CALL MISSING().

data want;
  merge one two;
  by subjid;
  output;
  call missing(ady,aval);
run;
mapuchito
Calcite | Level 5

i cannot ignore the note, i am afraid. 

it should be away to do it, to avoid the note.

thanks.

Tom
Super User Tom
Super User

@mapuchito wrote:

i cannot ignore the note, i am afraid. 

it should be away to do it, to avoid the note.

thanks.


Why?

 

You can add your own note.

79   data want;
80     merge one two;
81     by subjid;
82     if _N_=1 then put 'NOTE: MERGE statement is EXPECTED to have more than one data set with repeats of BY values.';
83   run;

NOTE: MERGE statement is EXPECTED to have more than one data set with repeats of BY values.
NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 4 observations read from the data set WORK.ONE.
NOTE: There were 13 observations read from the data set WORK.TWO.
NOTE: The data set WORK.WANT has 13 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds
Tom
Super User Tom
Super User

If you really need to replicate the action of MERGE without the NOTE about multiple repeats then you need to add a new counting variable that counts only up to the number of observations per subject in the smaller dataset.  Then you can replicate by adding that new variable to the BY statement.

 

If you know ONE will have at most two observations per subject then just do:

data one;
  set one;
  by subjid;
  row = not first.subjid;
run;
data two;
  set two;
  row = not first.subjid;
run;
data want;
   merge one two;
   by subjid row;
run;

If you don't know the maximum number of observations per ID in ONE, but you know it will never be more than the maximum number in TWO then you could to this to add a ROW value to both datasets.


data oneB oneC(keep=subjid row rename=(row=maxrow));
  set one;
  by subjid ;
  if first.subjid then row=0;
  row+1;
  output oneB;
  if last.subjid then output oneC;
run;

data twoB ;
  merge two onec;
  by subjid;
  if first.subjid then row=0;
  row+1;
  row=min(row,maxrow);
  drop maxrow;
run;


data want;
  merge oneB twoB;
  by subjid row;
run;

If it is possible that for some ids the maximum will be the TWO dataset then the logic will need to be more complicated.

data maxrow;
  set one(in=in1) two(in=in2);
  by subjid;
  if first.subjid then call missing(max1,max2);
  max1+in1;
  max2+in2;
  if last.subjid;
  maxrow=max(max1,max2);
  keep subjid maxrow;
run;

data oneB;
  merge one(in=in1) maxrow;
  by subjid;
  if first.subjid then row=0;
  row+1;
  row=min(row,maxrow);
  if in1;
  drop maxrow;
run;

data twoB;
  merge two(in=in2) maxrow;
  by subjid;
  if first.subjid then row=0;
  row+1;
  row=min(row,maxrow);
  if in2;
  drop maxrow;
run;

data want;
  merge oneB twoB ;
  by subjid row;
run;
mkeintz
PROC Star

You can replicate the merge results in a single data step and avoid the "multiple data sets have repeats of by values" as below.  Why one would need to do this, I don't see.

 

data merge_without_notes (drop=_:);
  /* For each subjid, read and count obs in each dataset */
  do until (last.subjid);
    set one (in=in1)  two (in=in2);
    by subjid;
    _n1+in1;
    _n2+in2;
  end;
  /* Reread both datasets, output, and  decrement each count to zero */
  do until (max(_n1,_n2)=0);
    if _n1>0 then set one;
    if _n2>0 then set two;
    output;
    _n1=ifn(_n1>0,_n1-1,0);
    _n2=ifn(_n2>0,_n2-1,0);
  end;
run;

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
mapuchito
Calcite | Level 5
Brilliant, this is exactly what I wanted. Thanks a lot.
Kurt_Bremser
Super User

@mapuchito wrote:

i cannot ignore the note, i am afraid. 


Why? It's a NOTE, not an ERROR or WARNING. It does not cause a non-zero exit code in batch mode.

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 13 replies
  • 1382 views
  • 1 like
  • 6 in conversation