Solved: Re: merging compustat global annual data and compusat global security ... - Page 2

Theo_Gh · Posted 02-12-2017 06:02 AM

Thank you.

Shmuel · Posted 02-12-2017 03:35 PM

@Theo_Gh, in your log there is a message:

19   data data3;
20       merge theo.to_be_merged(in=a) theo.security_daily(in=b);
21       by gvkey;
22       if b;
23   run;
 
NOTE: MERGE statement has more than one data set with repeats of BY values.

that means, your datasets are in relation N:M per gvkey.

Run next test code and compare results of merge vs sql -

data tst_nok vs tst_ok:

/* merge M:n vs SQL */
data tst1;
  key = 1; varn = 5; varx='A'; output;
  key = 1; varn = 6; varx='B'; output;
  key = 1; varn = 7; varx='C'; output;
run;

data tst2;
  key = 1; varm = 8; vary='X'; output;
  key = 1; varm = 9; vary='Y'; output;
run;

data tst_nok;
merge tst1 tst2;
  by key;
run;

proc sql;
  create table tst_ok as 
  select a.*, b.varm, b.vary
  from tst1 as a
  left join tst2 as b 
  on a.key = b.key
  order by key,varn;
quit;

I recommend use sql to join data from both datasets or

use @mkeintz code if it fits your needs.

Theo_Gh · Posted 02-13-2017 12:05 AM

Thank you very much. Will try it.

Theo_Gh · Posted 02-14-2017 06:03 AM

When I use sql, I do not get " MERGE statement has more than one data set with repeats of BY values" . But will it work when one data set is annual and the other is daily ?

Thank you.

Kurt_Bremser · Posted 02-14-2017 06:40 AM

You do not get this message because SQL creates a cartesian product in many-to-many situations.

See this code

data have1;
input id value1;
cards;
1 1
1 2
;
run;

data have2;
input id value2;
cards;
1 3
1 4
1 5
;
run;

proc sql;
create table want as select
a.id, a.value1, b.value2
from have1 a inner join have2 b
on a.id = b.id
;
quit;

proc print data=want noobs;
run;

The result is

id    value1    value2

 1       1         3  
 1       2         3  
 1       1         4  
 1       2         4  
 1       1         5  
 1       2         5

Now a data step works differently:

data want2;
merge
  have1 (in=a)
  have2 (in=b)
;
by id;
if a and b;
run;

proc print data=want2 noobs;
run;

You get the NOTE: MERGE statement has more than one data set with repeats of BY values.
And this is the result:

id    value1    value2

 1       1         3  
 1       2         4  
 1       2         5

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Theo_Gh · Posted 02-14-2017 07:59 AM

Does one have advantage over the other? Do they give significantly different research outcomes?
Sorry to you all for the back and forth; I'm new to SAS and research in general

Kurt_Bremser · Posted 02-14-2017 08:10 AM

@Theo_Gh wrote:
Does one have advantage over the other? Do they give significantly different research outcomes?
Sorry to you all for the back and forth; I'm new to SAS and research in general

The difference IS siginificant. Just look at the outputs of my example.

In most cases, when you have a many-to-many relationship, you will want to go with SQL and join every instance of A with every instance of B, thereby getting A*B records(observations).

If you don't want that type of expansion, it is usually best to summarize one (or more) of the datasets so you get 1 observation per key, and then do the join. Or make a conscious decision which one of the obervations per key is interesting for you at a given moment and filter for that.

Basically, everytime I get the NOTE about more than one datasets having repeating by values, I know I either have unexpected data or my code is faulty.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Theo_Gh · Posted 02-14-2017 09:18 AM

How do you summarize dataset(s) to get one observation per key?

Kurt_Bremser · Posted 02-14-2017 09:33 AM

It depends on the contents. Do you need an overall sum, or an average? Do you want to keep an earliest or last timestamp/date, or both? Or do you have character values where one takes precedence over others, or which you might want to concatenate into a string? Or would a proc transpose make sense?

This has to be determined by the logic behind the data. Once that has been done, the code can be written.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Theo_Gh · Posted 02-18-2017 03:18 AM

I want to thank everybody for your help. I just wanted codes but your answers to my questions meant I had to look at my data itself; I have learnt so much here. Thanks, everybody!

Re: merging compustat global annual data and compusat global security daily data

Re: merging compustat global annual data and compusat global security daily data

Re: merging compustat global annual data and compusat global security daily data

Re: merging compustat global annual data and compusat global security daily data

Re: merging compustat global annual data and compusat global security daily data

Re: merging compustat global annual data and compusat global security daily data

Re: merging compustat global annual data and compusat global security daily data

Re: merging compustat global annual data and compusat global security daily data

Re: merging compustat global annual data and compusat global security daily data

Re: merging compustat global annual data and compusat global security daily data

Catch up on SAS Innovate 2026

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away