BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
anildonmez
Calcite | Level 5

Hi SAS community,


This is my very first post to this forum, I hope I can ask my question while abiding by the forum rules.

I have an unbalanced panel data set of daily security log returns. This data set is constructed by merging two data sets: security return data (firm id is gvkey) and another data set containing firm-specific information (firm id is gvkey+iid). Each observation has two dates (date and date2) together with a dummy variable and I'd like to calculate the cumulative log returns between these two dates for those having dummy=1. I'd also like to note that the data is not uniquely defined in the gvkey-date pair since the data set containing firm characteristics has a more disaggregated firm id definition (gvkey+iid). Below, I included a screenshot of the part of my data set:

 

sample.png

What I want to do is to calculate the cumulative returns between date and date2 only for row numbers 301, 302, and 310 (where dummy=1). Below screenshot displays what I am trying to get in the end:

sample2.png

Just to make it clearer, the cumulative return for row 310 is calculated as the sum of log-returns shown within red rectangles (or, the sum of single daily log returns from 12/21/2010 to 12/31/2010). Of course, this is just a sample data set, the original data has millions of observations with multiple gvkeys and a larger time frame.

 

I have some experience with SAS but I couldn't think a clever way to tackle this specific problem. I'd appreciate your help!

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

@anildonmez wrote:

 

I have an unbalanced panel data set of daily security log returns. This data set is constructed by merging two data sets: security return data (firm id is gvkey) and another data set containing firm-specific information (firm id is gvkey+iid). Each observation has two dates (date and date2) together with a dummy variable and I'd like to calculate the cumulative log returns between these two dates for those having dummy=1. I'd also like to note that the data is not uniquely defined in the gvkey-date pair since the data set containing firm characteristics has a more disaggregated firm id definition (gvkey+iid). Below, I included a screenshot of the part of my data set:

First things first:

In Compustat GVKEY is the unique firm id, and IID is the issue id within that firm.  So gvkey+iid is essentially the unique identifier for a stock issue - it's not really a firm id.

 

As far as I can tell from the boxes you drew, for each record with dummy=1 you want to aggregate the log returns for all dates between DATE and DATE2 in which dummy is not a zero:

 

%let begdate=01jan2010;
%let enddate=31dec2010;

data want (drop=_:);
  array ret_history {%sysevalf("&begdate"d):%sysevalf("&enddate"d)} _temporary_;
  call missing(of ret_history{*});

  do until (last.gvkey);
    set have;
    by gvkey;
    if dummy ^=0 then ret_history{date}=log_ret;
  end;

  do until(last.gvkey);
    set have;
	by gvkey;
	cum_ret=.;
	if dummy=1 then do _d=date to date2;
	  cum_ret=sum(cum_ret,ret_history{_d});
	end;
    output;
  end;
run;

If you know the date range of your data, then set the macrovars BEGDATE and ENDDATE accordingly.  This allows the definition of an array RET_HISTORY, indexed by date.  For each GVKEY, RET_HISTORY is initialized to missing, and then each of the dates with dummy^=0 has its return put in ret_history.   That's the first pass of each gvkey, just to establish ret_history.

 

Then reread the gvkey.  Each time a dummy=1 aggregate the log_ret values from ret_history from date to date2, followed by an output.

 

Note this assumes that any date with dummy=.  or dummy=0 has only a single unique value for log_ret. 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

2 REPLIES 2
mkeintz
PROC Star

@anildonmez wrote:

 

I have an unbalanced panel data set of daily security log returns. This data set is constructed by merging two data sets: security return data (firm id is gvkey) and another data set containing firm-specific information (firm id is gvkey+iid). Each observation has two dates (date and date2) together with a dummy variable and I'd like to calculate the cumulative log returns between these two dates for those having dummy=1. I'd also like to note that the data is not uniquely defined in the gvkey-date pair since the data set containing firm characteristics has a more disaggregated firm id definition (gvkey+iid). Below, I included a screenshot of the part of my data set:

First things first:

In Compustat GVKEY is the unique firm id, and IID is the issue id within that firm.  So gvkey+iid is essentially the unique identifier for a stock issue - it's not really a firm id.

 

As far as I can tell from the boxes you drew, for each record with dummy=1 you want to aggregate the log returns for all dates between DATE and DATE2 in which dummy is not a zero:

 

%let begdate=01jan2010;
%let enddate=31dec2010;

data want (drop=_:);
  array ret_history {%sysevalf("&begdate"d):%sysevalf("&enddate"d)} _temporary_;
  call missing(of ret_history{*});

  do until (last.gvkey);
    set have;
    by gvkey;
    if dummy ^=0 then ret_history{date}=log_ret;
  end;

  do until(last.gvkey);
    set have;
	by gvkey;
	cum_ret=.;
	if dummy=1 then do _d=date to date2;
	  cum_ret=sum(cum_ret,ret_history{_d});
	end;
    output;
  end;
run;

If you know the date range of your data, then set the macrovars BEGDATE and ENDDATE accordingly.  This allows the definition of an array RET_HISTORY, indexed by date.  For each GVKEY, RET_HISTORY is initialized to missing, and then each of the dates with dummy^=0 has its return put in ret_history.   That's the first pass of each gvkey, just to establish ret_history.

 

Then reread the gvkey.  Each time a dummy=1 aggregate the log_ret values from ret_history from date to date2, followed by an output.

 

Note this assumes that any date with dummy=.  or dummy=0 has only a single unique value for log_ret. 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
anildonmez
Calcite | Level 5
Thanks a lot, it worked just great!

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 2 replies
  • 769 views
  • 1 like
  • 2 in conversation