BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
anildonmez
Calcite | Level 5

Hi SAS community,


This is my very first post to this forum, I hope I can ask my question while abiding by the forum rules.

I have an unbalanced panel data set of daily security log returns. This data set is constructed by merging two data sets: security return data (firm id is gvkey) and another data set containing firm-specific information (firm id is gvkey+iid). Each observation has two dates (date and date2) together with a dummy variable and I'd like to calculate the cumulative log returns between these two dates for those having dummy=1. I'd also like to note that the data is not uniquely defined in the gvkey-date pair since the data set containing firm characteristics has a more disaggregated firm id definition (gvkey+iid). Below, I included a screenshot of the part of my data set:

 

sample.png

What I want to do is to calculate the cumulative returns between date and date2 only for row numbers 301, 302, and 310 (where dummy=1). Below screenshot displays what I am trying to get in the end:

sample2.png

Just to make it clearer, the cumulative return for row 310 is calculated as the sum of log-returns shown within red rectangles (or, the sum of single daily log returns from 12/21/2010 to 12/31/2010). Of course, this is just a sample data set, the original data has millions of observations with multiple gvkeys and a larger time frame.

 

I have some experience with SAS but I couldn't think a clever way to tackle this specific problem. I'd appreciate your help!

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

@anildonmez wrote:

 

I have an unbalanced panel data set of daily security log returns. This data set is constructed by merging two data sets: security return data (firm id is gvkey) and another data set containing firm-specific information (firm id is gvkey+iid). Each observation has two dates (date and date2) together with a dummy variable and I'd like to calculate the cumulative log returns between these two dates for those having dummy=1. I'd also like to note that the data is not uniquely defined in the gvkey-date pair since the data set containing firm characteristics has a more disaggregated firm id definition (gvkey+iid). Below, I included a screenshot of the part of my data set:

First things first:

In Compustat GVKEY is the unique firm id, and IID is the issue id within that firm.  So gvkey+iid is essentially the unique identifier for a stock issue - it's not really a firm id.

 

As far as I can tell from the boxes you drew, for each record with dummy=1 you want to aggregate the log returns for all dates between DATE and DATE2 in which dummy is not a zero:

 

%let begdate=01jan2010;
%let enddate=31dec2010;

data want (drop=_:);
  array ret_history {%sysevalf("&begdate"d):%sysevalf("&enddate"d)} _temporary_;
  call missing(of ret_history{*});

  do until (last.gvkey);
    set have;
    by gvkey;
    if dummy ^=0 then ret_history{date}=log_ret;
  end;

  do until(last.gvkey);
    set have;
	by gvkey;
	cum_ret=.;
	if dummy=1 then do _d=date to date2;
	  cum_ret=sum(cum_ret,ret_history{_d});
	end;
    output;
  end;
run;

If you know the date range of your data, then set the macrovars BEGDATE and ENDDATE accordingly.  This allows the definition of an array RET_HISTORY, indexed by date.  For each GVKEY, RET_HISTORY is initialized to missing, and then each of the dates with dummy^=0 has its return put in ret_history.   That's the first pass of each gvkey, just to establish ret_history.

 

Then reread the gvkey.  Each time a dummy=1 aggregate the log_ret values from ret_history from date to date2, followed by an output.

 

Note this assumes that any date with dummy=.  or dummy=0 has only a single unique value for log_ret. 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

2 REPLIES 2
mkeintz
PROC Star

@anildonmez wrote:

 

I have an unbalanced panel data set of daily security log returns. This data set is constructed by merging two data sets: security return data (firm id is gvkey) and another data set containing firm-specific information (firm id is gvkey+iid). Each observation has two dates (date and date2) together with a dummy variable and I'd like to calculate the cumulative log returns between these two dates for those having dummy=1. I'd also like to note that the data is not uniquely defined in the gvkey-date pair since the data set containing firm characteristics has a more disaggregated firm id definition (gvkey+iid). Below, I included a screenshot of the part of my data set:

First things first:

In Compustat GVKEY is the unique firm id, and IID is the issue id within that firm.  So gvkey+iid is essentially the unique identifier for a stock issue - it's not really a firm id.

 

As far as I can tell from the boxes you drew, for each record with dummy=1 you want to aggregate the log returns for all dates between DATE and DATE2 in which dummy is not a zero:

 

%let begdate=01jan2010;
%let enddate=31dec2010;

data want (drop=_:);
  array ret_history {%sysevalf("&begdate"d):%sysevalf("&enddate"d)} _temporary_;
  call missing(of ret_history{*});

  do until (last.gvkey);
    set have;
    by gvkey;
    if dummy ^=0 then ret_history{date}=log_ret;
  end;

  do until(last.gvkey);
    set have;
	by gvkey;
	cum_ret=.;
	if dummy=1 then do _d=date to date2;
	  cum_ret=sum(cum_ret,ret_history{_d});
	end;
    output;
  end;
run;

If you know the date range of your data, then set the macrovars BEGDATE and ENDDATE accordingly.  This allows the definition of an array RET_HISTORY, indexed by date.  For each GVKEY, RET_HISTORY is initialized to missing, and then each of the dates with dummy^=0 has its return put in ret_history.   That's the first pass of each gvkey, just to establish ret_history.

 

Then reread the gvkey.  Each time a dummy=1 aggregate the log_ret values from ret_history from date to date2, followed by an output.

 

Note this assumes that any date with dummy=.  or dummy=0 has only a single unique value for log_ret. 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
anildonmez
Calcite | Level 5
Thanks a lot, it worked just great!