Re: How to do a one-to-many merge while avoiding log note?

Hello_there · Posted 01-23-2023 10:56 AM

Hi,

The log note says "MERGE statement has more than one data set w/ repeats of BY values"

How do i avoid it?

data have1; 
infile datalines dsd dlm=",";
	input id $ visitnum var1 var2 var3 var4;
datalines;
001, 0, 56, 78, 39, 94,
001, 1, 54, 73, , ,
001, 1, 94, 95, , ,
001, 2, 94, 29, , ,
002, 0, 57, 73, 23, 11,
002, 1, 51, 75, , ,
002, 1, 90, 91, , ,
002, 2, 89, 30, , ,
;
run;

data have2;
infile datalines dsd dlm=",";
	input id $ visitnum var3 var4;
datalines;
001, 1, 40, 96
002, 1, 24, 14
;
run;

desired output:

PaigeMiller · Posted 01-23-2023 11:01 AM

You can avoid it by doing merges that are not one-to-many but are one-to-one, if possible.

Otherwise, NOTEs are a good thing, you shouldn't want to avoid them, this could lead to some unexpected and unpleasant outcomes.

--
Paige Miller

Hello_there · Posted 01-23-2023 11:03 AM

I wish i could, but this is necessary for my use case. My logs get checked and unfortunately this would not pass and i would have to redo it. Is there a way to do this in PROC SQL?

Kurt_Bremser · Posted 01-23-2023 02:06 PM

Whenever you expect a one-to-many MERGE, but get this NOTE, your data is not what you expected, and you either need to get correct data in the first place, or deduplicate the datasets which are supposed to be unique with regards to the BY variable(s).

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Tom · Posted 01-23-2023 11:13 AM

By not having more than one dataset that has "many".

Your example data does not have any trouble.

340  data want ;
341    merge have1 have2 ;
342    by id visitnum;
343  run;

NOTE: There were 8 observations read from the data set WORK.HAVE1.
NOTE: There were 2 observations read from the data set WORK.HAVE2.
NOTE: The data set WORK.WANT has 8 observations and 6 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

Which of the two dataset should be the "one" dataset? Is it HAVE2?

To test if HAVE2 has duplicates run a test like:

data have2_unique have2_dups;
  set have2 ;
  by id visitnum;
  if first.visitnum then output have2_unique;
  if not (first.visitnum and last.visitnum) then output have2_dups;
run;

Hello_there · Posted 01-23-2023 11:28 AM

Hi Tom,
Thanks for taking time to reply. My example data was not good, so that's why the log note didn't populate. I was in a hurry and couldn't think of good enough data to replicate it.

The good news it that i solved my problem by renaming the variables and doing a proc sql left join.
Thanks again!

Tom · Posted 01-23-2023 11:29 AM

@Hello_there wrote:
Hi Tom,
Thanks for taking time to reply. My example data was not good, so that's why the log note didn't populate. I was in a hurry and couldn't think of good enough data to replicate it.

The good news it that i solved my problem by renaming the variables and doing a proc sql left join.
Thanks again!

Unless you got rid of the duplicates you didn't solve the underlying problem, you just used a tool that wouldn't complain about it.

Hello_there · Posted 01-23-2023 11:33 AM

The issue is that for my use case, there are multiple values of the by statement bc there were two identical visits and no other unique variable to further make the key variables distinct. It is common for there to be messy data and sometimes it requires a somewhat unrefined solution.

But you could be right that i didn't address the underlying problem which might cause potential issues later on, I have to investigate further and collaborate w/ someone to see if they come to the same output as me.

Hello_there · Posted 01-23-2023 02:55 PM

Isn't many defined as duplicates of by variables?

In my case, isn't my "many" the duplicates of id and visitnum when the visit is equal to 1? In other words, two id=001, visitnum=1 and two id=002, vistnum=1.

Tom · Posted 01-23-2023 03:22 PM

@Hello_there wrote:

Isn't many defined as duplicates of by variables?

In my case, isn't my "many" the duplicates of id and visitnum when the visit is equal to 1? In other words, two id=001, visitnum=1 and two id=002, vistnum=1.

No. Your example is showing two distinct values since the variable ID is different. It would only be consider MANY if you use only VISITNUM as the key, but that does not make much sense with the data you posted. If VISITNUM was the key then you should not have the ID variable on both datasets.

If both dataset have ID and VISITNUM then those should be the keys.

If you want to use only one of them as the key then you should not have the other in both dataset.

Reeza · Posted 01-23-2023 11:17 AM

Your sample data doesn't generate those notes. Show the code used, especially because you have multiple variables that overlap (id visitnum var3 var4).

I suspect your actual data has duplicates in each of the files so you have a many to many merge, not a one to many merge.

 69         data want;
 70         merge have1 have2;
 71         by id ;
 72         run;
 
 NOTE: There were 8 observations read from the data set WORK.HAVE1.
 NOTE: There were 2 observations read from the data set WORK.HAVE2.
 NOTE: The data set WORK.WANT has 8 observations and 8 variables.
 NOTE: DATA statement used (Total process time):
       real time           0.00 seconds
       user cpu time       0.00 seconds
       system cpu time     0.00 seconds
       memory              1241.06k
       OS Memory           20908.00k
       Timestamp           01/23/2023 04:16:32 PM
       Step Count                        61  Switch Count  2
       Page Faults                       0
       Page Reclaims                     247
       Page Swaps                        0
       Voluntary Context Switches        9
       Involuntary Context Switches      0
       Block Input Operations            0
       Block Output Operations           264
       
 
 73         
 74         data want;
 75         merge have1 have2;
 76         by id visitnum;
 77         run;
 
 NOTE: There were 8 observations read from the data set WORK.HAVE1.
 NOTE: There were 2 observations read from the data set WORK.HAVE2.
 NOTE: The data set WORK.WANT has 8 observations and 8 variables.
 NOTE: DATA statement used (Total process time):
       real time           0.00 seconds
       user cpu time       0.00 seconds
       system cpu time     0.00 seconds
       memory              1251.15k
       OS Memory           20908.00k
       Timestamp           01/23/2023 04:16:32 PM
       Step Count                        62  Switch Count  2
       Page Faults                       0
       Page Reclaims                     184
       Page Swaps                        0
       Voluntary Context Switches        10
       Involuntary Context Switches      0
       Block Input Operations            0
       Block Output Operations           272

Hello_there · Posted 01-23-2023 11:29 AM

Hi Reeza,
Thanks for looking at my issue. My example code was hastily done so I couldn't think of a good way to replicate my issue.
I did however solve my issue by renaming my variables and doing a proc sql left join.
Thanks though for looking at it.

Reeza · Posted 01-23-2023 12:26 PM

SQL wouldn't have fixed your issue, your join conditions need to be modified.

PaigeMiller · Posted 01-23-2023 11:29 AM

So, @Hello_there according to the others, these notes are a good things, they are caused by many-to-many merges, which is not what you thought you were doing.

--
Paige Miller

Hello_there · Posted 01-23-2023 11:35 AM

Hi PaigeMiller,

I think my example was not a good one and not reflective of the true nature of my problem.
I did come up with a rather unrefined solution, but i think it achieved the end result of what i wanted.

The 2025 SAS Hackathon has begun!

SAS Training: Just a Click Away