@Theo_Gh, in your log there is a message:
19 data data3; 20 merge theo.to_be_merged(in=a) theo.security_daily(in=b); 21 by gvkey; 22 if b; 23 run; NOTE: MERGE statement has more than one data set with repeats of BY values.
that means, your datasets are in relation N:M per gvkey.
Run next test code and compare results of merge vs sql -
data tst_nok vs tst_ok:
/* merge M:n vs SQL */
data tst1;
key = 1; varn = 5; varx='A'; output;
key = 1; varn = 6; varx='B'; output;
key = 1; varn = 7; varx='C'; output;
run;
data tst2;
key = 1; varm = 8; vary='X'; output;
key = 1; varm = 9; vary='Y'; output;
run;
data tst_nok;
merge tst1 tst2;
by key;
run;
proc sql;
create table tst_ok as
select a.*, b.varm, b.vary
from tst1 as a
left join tst2 as b
on a.key = b.key
order by key,varn;
quit;
I recommend use sql to join data from both datasets or
use @mkeintz code if it fits your needs.
When I use sql, I do not get " MERGE statement has more than one data set with repeats of BY values" . But will it work when one data set is annual and the other is daily ?
Thank you.
You do not get this message because SQL creates a cartesian product in many-to-many situations.
See this code
data have1;
input id value1;
cards;
1 1
1 2
;
run;
data have2;
input id value2;
cards;
1 3
1 4
1 5
;
run;
proc sql;
create table want as select
a.id, a.value1, b.value2
from have1 a inner join have2 b
on a.id = b.id
;
quit;
proc print data=want noobs;
run;
The result is
id value1 value2 1 1 3 1 2 3 1 1 4 1 2 4 1 1 5 1 2 5
Now a data step works differently:
data want2;
merge
have1 (in=a)
have2 (in=b)
;
by id;
if a and b;
run;
proc print data=want2 noobs;
run;
You get the NOTE: MERGE statement has more than one data set with repeats of BY values.
And this is the result:
id value1 value2 1 1 3 1 2 4 1 2 5
@Theo_Gh wrote:
Does one have advantage over the other? Do they give significantly different research outcomes?
Sorry to you all for the back and forth; I'm new to SAS and research in general
The difference IS siginificant. Just look at the outputs of my example.
In most cases, when you have a many-to-many relationship, you will want to go with SQL and join every instance of A with every instance of B, thereby getting A*B records(observations).
If you don't want that type of expansion, it is usually best to summarize one (or more) of the datasets so you get 1 observation per key, and then do the join. Or make a conscious decision which one of the obervations per key is interesting for you at a given moment and filter for that.
Basically, everytime I get the NOTE about more than one datasets having repeating by values, I know I either have unexpected data or my code is faulty.
It depends on the contents. Do you need an overall sum, or an average? Do you want to keep an earliest or last timestamp/date, or both? Or do you have character values where one takes precedence over others, or which you might want to concatenate into a string? Or would a proc transpose make sense?
This has to be determined by the logic behind the data. Once that has been done, the code can be written.
I want to thank everybody for your help. I just wanted codes but your answers to my questions meant I had to look at my data itself; I have learnt so much here. Thanks, everybody!
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.