Hi,
The log note says "MERGE statement has more than one data set w/ repeats of BY values"
How do i avoid it?
data have1;
infile datalines dsd dlm=",";
input id $ visitnum var1 var2 var3 var4;
datalines;
001, 0, 56, 78, 39, 94,
001, 1, 54, 73, , ,
001, 1, 94, 95, , ,
001, 2, 94, 29, , ,
002, 0, 57, 73, 23, 11,
002, 1, 51, 75, , ,
002, 1, 90, 91, , ,
002, 2, 89, 30, , ,
;
run;
data have2;
infile datalines dsd dlm=",";
input id $ visitnum var3 var4;
datalines;
001, 1, 40, 96
002, 1, 24, 14
;
run;
desired output:
You can avoid it by doing merges that are not one-to-many but are one-to-one, if possible.
Otherwise, NOTEs are a good thing, you shouldn't want to avoid them, this could lead to some unexpected and unpleasant outcomes.
Whenever you expect a one-to-many MERGE, but get this NOTE, your data is not what you expected, and you either need to get correct data in the first place, or deduplicate the datasets which are supposed to be unique with regards to the BY variable(s).
By not having more than one dataset that has "many".
Your example data does not have any trouble.
340 data want ;
341 merge have1 have2 ;
342 by id visitnum;
343 run;
NOTE: There were 8 observations read from the data set WORK.HAVE1.
NOTE: There were 2 observations read from the data set WORK.HAVE2.
NOTE: The data set WORK.WANT has 8 observations and 6 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
Which of the two dataset should be the "one" dataset? Is it HAVE2?
To test if HAVE2 has duplicates run a test like:
data have2_unique have2_dups;
set have2 ;
by id visitnum;
if first.visitnum then output have2_unique;
if not (first.visitnum and last.visitnum) then output have2_dups;
run;
@Hello_there wrote:
Hi Tom,
Thanks for taking time to reply. My example data was not good, so that's why the log note didn't populate. I was in a hurry and couldn't think of good enough data to replicate it.
The good news it that i solved my problem by renaming the variables and doing a proc sql left join.
Thanks again!
Unless you got rid of the duplicates you didn't solve the underlying problem, you just used a tool that wouldn't complain about it.
The issue is that for my use case, there are multiple values of the by statement bc there were two identical visits and no other unique variable to further make the key variables distinct. It is common for there to be messy data and sometimes it requires a somewhat unrefined solution.
But you could be right that i didn't address the underlying problem which might cause potential issues later on, I have to investigate further and collaborate w/ someone to see if they come to the same output as me.
Isn't many defined as duplicates of by variables?
In my case, isn't my "many" the duplicates of id and visitnum when the visit is equal to 1? In other words, two id=001, visitnum=1 and two id=002, vistnum=1.
@Hello_there wrote:
Isn't many defined as duplicates of by variables?
In my case, isn't my "many" the duplicates of id and visitnum when the visit is equal to 1? In other words, two id=001, visitnum=1 and two id=002, vistnum=1.
No. Your example is showing two distinct values since the variable ID is different. It would only be consider MANY if you use only VISITNUM as the key, but that does not make much sense with the data you posted. If VISITNUM was the key then you should not have the ID variable on both datasets.
If both dataset have ID and VISITNUM then those should be the keys.
If you want to use only one of them as the key then you should not have the other in both dataset.
Your sample data doesn't generate those notes. Show the code used, especially because you have multiple variables that overlap (id visitnum var3 var4).
I suspect your actual data has duplicates in each of the files so you have a many to many merge, not a one to many merge.
69 data want;
70 merge have1 have2;
71 by id ;
72 run;
NOTE: There were 8 observations read from the data set WORK.HAVE1.
NOTE: There were 2 observations read from the data set WORK.HAVE2.
NOTE: The data set WORK.WANT has 8 observations and 8 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
user cpu time 0.00 seconds
system cpu time 0.00 seconds
memory 1241.06k
OS Memory 20908.00k
Timestamp 01/23/2023 04:16:32 PM
Step Count 61 Switch Count 2
Page Faults 0
Page Reclaims 247
Page Swaps 0
Voluntary Context Switches 9
Involuntary Context Switches 0
Block Input Operations 0
Block Output Operations 264
73
74 data want;
75 merge have1 have2;
76 by id visitnum;
77 run;
NOTE: There were 8 observations read from the data set WORK.HAVE1.
NOTE: There were 2 observations read from the data set WORK.HAVE2.
NOTE: The data set WORK.WANT has 8 observations and 8 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
user cpu time 0.00 seconds
system cpu time 0.00 seconds
memory 1251.15k
OS Memory 20908.00k
Timestamp 01/23/2023 04:16:32 PM
Step Count 62 Switch Count 2
Page Faults 0
Page Reclaims 184
Page Swaps 0
Voluntary Context Switches 10
Involuntary Context Switches 0
Block Input Operations 0
Block Output Operations 272
So, @Hello_there according to the others, these notes are a good things, they are caused by many-to-many merges, which is not what you thought you were doing.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.