You will have to find where the message appears in your log. Then right above that you will see the DATA step code that generated the message, including the names of the data sets, and a BY statement that tells you the names of the BY variable(s).
Sometimes when I merge a lot of datasets into one (using a by) this message appears. It appears when two or more datasets are having repeats of by values. The log tells me, but it does not tell me which of the merged datasets are 'guilty'. I can find out by deduplicating them one by one of course, but maybe the log can tell me...
I am still trying to present a example using small datasets, so far without success....
data a; do id=1 to 5; avar = id; output a; end; run;
data b; id=1; bvar=id; output; id = 4; bvar=id; output; id = 4; bvar=id; output; run;
data c; id=1; cvar=id; output; id = 3; cvar=id; output; id = 4; cvar=id; output; id = 4; evar=id; output; id = 5; cvar=id; output; run;
data d; do id=1 to 5; dvar = id; output d; end; run;
data total;
merge a b c d;
by id;
run;
The log shows:
NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 5 observations read from the data set WORK.A.
NOTE: There were 3 observations read from the data set WORK.B.
NOTE: There were 5 observations read from the data set WORK.C.
NOTE: There were 5 observations read from the data set WORK.D.
NOTE: The data set WORK.TOTAL has 6 observations and 6 variables.
NOTE: DATA statement used (Total process time):
But how do I know which of the datasets are having (the same) repeats of by values. (yes, they are b and c). I can find out by using proc sort nodupkey but wouldn't it be nice if the log tells me?
Yeah, it would be nice. But then, the next thing one might want is the value(s) of the BY variables that were duplicates. Personally, I think it's fair that SAS only gives a note (which I treat as an error), and then it's left up to you to investigate. Generally, if I have any doubt about the uniqueness of a dataset I'm about to merge, I will run %DuplicateCheck(work.a,by=id), which does something like:
data _null_;
set work.a(keep=id);
by id;
if not (first.id and last.id) then put "ERROR: found a dup " id= ;
run;
That said, you can code up something what you want, using in= variables and forcing them to reset to 0 at the top of the data step loop, something like:
40 data total;
41 a=0;b=0;c=0;d=0;
42 merge a (in=a)
43 b (in=b)
44 c (in=c)
45 d (in=d);
46 by id;
47 if first.id=0 and sum(a,b,c,d)>1 then do;
48 put "WARNING: Duplicates in multiple datasets " (id a b c d)(=);
49 end;
50 run;
WARNING: Duplicates in multiple datasets id=4 a=0 b=1 c=1 d=0
NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 5 observations read from the data set WORK.A.
NOTE: There were 3 observations read from the data set WORK.B.
NOTE: There were 5 observations read from the data set WORK.C.
NOTE: There were 5 observations read from the data set WORK.D.
NOTE: The data set WORK.TOTAL has 6 observations and 6 variables
I can see your point about merging multiple data sets, not just 2. I guess this would be relatively easy to implement. After all, SAS is already doing most of the required work in order to print the current warning message.
Also note, the warning message appears when there is no many-to-many matching. For example, it would appear here:
@Astounding, I've only seen the message when there is many-to-many matching. I don't get the message from your test data:
85 data a;
86 id=1;
87 output;output;output;
88 id=2;
89 output;
90 run;
NOTE: The data set WORK.A has 4 observations and 1 variables.
91
92 data b;
93 id=1;
94 output;
95 id=2;
96 output;output;output;
97 run;
NOTE: The data set WORK.B has 4 observations and 1 variables.
98
99 data c;
100 merge a b;
101 by id;
102 run;
NOTE: There were 4 observations read from the data set WORK.A.
NOTE: There were 4 observations read from the data set WORK.B.
NOTE: The data set WORK.C has 6 observations and 1 variables.
I think the NOTE is generated when the MERGE statement executes and encounters a case where the BY value does not change and records are read from multiple data sets.