MERGE statement has more than one data set with repeats of BY values

EH · ‎02-11-2016

Hello SAS,

I would like to see the names of those datasets.

Or maybe it is already included in SAS and I cannot find it 🙂

Cheers,

Eric

Reeza · ‎02-11-2016

This isn't a clear idea, can you please clarify your suggestion and possibly include sample data that will generate the warning/error.

Astounding · ‎02-11-2016

You will have to find where the message appears in your log. Then right above that you will see the DATA step code that generated the message, including the names of the data sets, and a BY statement that tells you the names of the BY variable(s).

EH · ‎02-12-2016

Sometimes when I merge a lot of datasets into one (using a by) this message appears. It appears when two or more datasets are having repeats of by values. The log tells me, but it does not tell me which of the merged datasets are 'guilty'. I can find out by deduplicating them one by one of course, but maybe the log can tell me...

I am still trying to present a example using small datasets, so far without success....

Cheers,

Eric

EH · ‎02-12-2016

Ok, below a working example.

data a; do id=1 to 5; avar = id; output a; end; run;
data b; id=1; bvar=id; output; id = 4; bvar=id; output; id = 4; bvar=id; output; run;
data c; id=1; cvar=id; output; id = 3; cvar=id; output; id = 4; cvar=id; output; id = 4; evar=id; output; id = 5; cvar=id; output; run;
data d; do id=1 to 5; dvar = id; output d; end; run;

data total;
merge a b c d;
by id;
run;

The log shows:

NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 5 observations read from the data set WORK.A.
NOTE: There were 3 observations read from the data set WORK.B.
NOTE: There were 5 observations read from the data set WORK.C.
NOTE: There were 5 observations read from the data set WORK.D.
NOTE: The data set WORK.TOTAL has 6 observations and 6 variables.
NOTE: DATA statement used (Total process time):

But how do I know which of the datasets are having (the same) repeats of by values. (yes, they are b and c). I can find out by using proc sort nodupkey but wouldn't it be nice if the log tells me?

Cheers,

Eric

Quentin · ‎02-12-2016

Yeah, it would be nice. But then, the next thing one might want is the value(s) of the BY variables that were duplicates. Personally, I think it's fair that SAS only gives a note (which I treat as an error), and then it's left up to you to investigate. Generally, if I have any doubt about the uniqueness of a dataset I'm about to merge, I will run %DuplicateCheck(work.a,by=id), which does something like:

data _null_;
  set work.a(keep=id);
  by id;
  if not (first.id and last.id) then put "ERROR: found a dup " id= ;
run;

That said, you can code up something what you want, using in= variables and forcing them to reset to 0 at the top of the data step loop, something like:

40   data total;
41     a=0;b=0;c=0;d=0;
42     merge a (in=a)
43           b (in=b)
44           c (in=c)
45           d (in=d);
46     by id;
47     if first.id=0 and sum(a,b,c,d)>1 then do;
48       put "WARNING: Duplicates in multiple datasets " (id a b c d)(=);
49     end;
50   run;

WARNING: Duplicates in multiple datasets id=4 a=0 b=1 c=1 d=0
NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 5 observations read from the data set WORK.A.
NOTE: There were 3 observations read from the data set WORK.B.
NOTE: There were 5 observations read from the data set WORK.C.
NOTE: There were 5 observations read from the data set WORK.D.
NOTE: The data set WORK.TOTAL has 6 observations and 6 variables

Astounding · ‎02-12-2016

I can see your point about merging multiple data sets, not just 2. I guess this would be relatively easy to implement. After all, SAS is already doing most of the required work in order to print the current warning message.

Also note, the warning message appears when there is no many-to-many matching. For example, it would appear here:

ID in A ID in B

1 1

1 2

2 2

Quentin · ‎02-12-2016

@Astounding, I've only seen the message when there is many-to-many matching. I don't get the message from your test data&colon;

85   data a;
86     id=1;
87     output;output;output;
88     id=2;
89     output;
90   run;

NOTE: The data set WORK.A has 4 observations and 1 variables.

91
92   data b;
93     id=1;
94     output;
95     id=2;
96     output;output;output;
97   run;

NOTE: The data set WORK.B has 4 observations and 1 variables.

98
99   data c;
100    merge a b;
101    by id;
102  run;

NOTE: There were 4 observations read from the data set WORK.A.
NOTE: There were 4 observations read from the data set WORK.B.
NOTE: The data set WORK.C has 6 observations and 1 variables.

I think the NOTE is generated when the MERGE statement executes and encounters a case where the BY value does not change and records are read from multiple data sets.

Astounding · ‎02-12-2016

OK, that's a change then. I did test it, but that was years ago when the message first started appearing.

EH · ‎02-13-2016

Hello Quentin and Astounding,

Thanks for your comments. The NOTE only appears when there are datasets with the same repeats of by values.

My first suggestion would be to call it a WARNING.

And still I would like SAS to name the suspected datasets. Sometimes I merge over ten large datasets.

Cheers,

Eric

ChrisNZ · ‎01-17-2017

I agree. I made a ballot entry.

https://communities.sas.com/t5/forums/editpage/board-id/sas_ideas/message-id/2332