Re: Create rolling 12 month window for daily data - Page 2

Ksharp · Posted 11-18-2016 02:22 AM

Oh. You need use IML or make a macro to call PROC REG.
Some skeleton code like.


data date;
 set have(keep=date);
end_date=intnx('month',date,12,'e');
run;

data _null_;
 set date;
call execute('proc reg data=have(where=(date between '|| date ||' and '||end_date ||' ))  outest=xxx'|| strip(_n_)||'  ;');
..............
run;

data want;
 set xxx:;
run;

mkeintz · Posted 11-18-2016 07:53 AM

@Ksharp:

While a you offer a nice neat program, I don't think it will solve @yotsuba88 problem.

It does avoid the need for a lot of disk space, but:

It generates about 20 windows for each month, starting on EACH trading day of the month and ending on the last trading day 11 months later, i.e. 10/03/2016-09/ 26/2017, 10/04/2016-09/26/2017, ..... yotsuba presumably wants only one window beginning in October 2016. This of course is easily fixed (say "if intck('month',lag(date),date)^=0;").
Data set HAVE is probably tens of millions of records sorted by PERMNO (company id) and DATE (I am familiar with the standard layout of database in question). So the WHERE filter in each PROC REG will also need to filter on PERMNO . But more importantly the where clause will repeatedly filter millions of obs just to get about 200 records for each window. It'll take forever.

However, your approach would become a lot more efficient if, instead of generating a "where" filter, you could generate FirStOBS and OBS data set parameters to put into the CALL EXECUTE.

regards,

Mark

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Ksharp · Posted 11-18-2016 10:09 PM

@mkeintz,
Maybe I misunderstood what OP means .
I think the best choice and fast way is using IML code .Which is very flexible for such kind of question.

mkeintz · Posted 11-18-2016 10:44 PM

@Ksharp

I would need a lot of convincing to think that IML is the best solution for this user's problem, not because I don't recognize the flexibility of PROC IML, but mainly because of the likely size of the data. Although he didn't say so, he is probably like a lot of users of the CRSP database, in that he's looking at possibly thousands of stocks (identified by variable PERMNO) and several years for each stock.

So if a given firm has, say 20 years of data in his sample (the CRSP database actually goes back to 1929), then for each firm there will be 231 complete windows of 12-months duration. So for, say 4,000 stocks (not unusual for finance researchers) that means about 924,000 regressions, with each regression having about 240 observations (number of trading days in 12 months). It's hard for me to believe that IML is the way to go for this.

I would recommend a first pass through the data using a DATA step, generating a Unadjusted SSCP matrix for each month. For (say) a dependent, 3 independents, and an intercept, that's 5 rows per month. And the data step could also accumulate the first 12 months of data to have a USSCP for a complete year. Output that USSCP, assigning variable WINDOWID=1. Then read in month 13, calculate its USSCP, add it to the 12-month total, and subtract out the first month USSCP. You now have USSCP for months 2-13 - output that, assigning WINDOWID=2. And so on.

This would generate a reasonably sized data set (either view or file) of 5*924,000=4,620,000 records (instead of 240*924,000 =221,760,000 original data records) accompanied by the WINDOWID variable. And note that, unlike regressing on the original data, the SSCP for a given month is generated only once, even though it contributes to 12 different windows.

This data set could be read in by PROC REG, with a BY WINDOWID statement. (The documentation on PROC CORR, or PROC REG describes what a sas TYPE=SSCP data set should look like to be recognized by PROC REG).

That is how I understand the OP's original request. I would applaud an IML solution that would handle that data volume.

Regards,

Mark

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

yotsuba88 · Posted 11-19-2016 04:48 AM

I really appreciate both of you, guys. Now I am still struggling to understand how it works and how I write the code. Anw, I will try all suggestions to find the best one.

Best,

yotsuba88 · Posted 11-20-2016 05:40 AM

I am really confused how to use your code because the daily data in one year is not the same. Sometimes, there are 270 days or nearly 300 days because data of some days are missing. How can I use it flexible and shift for one month forward.
Many thanks

mkeintz · Posted 11-17-2016 09:25 PM

How many stocks, for how many windows, do you intend to process? The CRSP (Center for Research in Stock Prices) that you are using has daily data on thousands of stocks going back to 1929, so your potential universe is big.

SQL is NOT the solution to such a problem. It's going to compare EVERY record in "crsp as b" to each record in "crsp as a" to determine which records to put in the window. In other words, SQL will ignore the data order (PERMNO/DATE) in most CRSP data sets. Even if SAS marks the dataset as sorted, I suspect SQL won't take advantage (but that could be tested).

And consider the sheer size of the data you want. I suppose you want a new window every month, so that's 12 windows per year. With about 200 trading days per 12 months in each window, you are asking for each stock about 2400 records for a year of data. So multiply N(stocks)*N(years)*2400*(record size) to determine your disk space requirements.

But wait -- there's more! You apparently plan to do a regression on each of the windows, meaning you intend to recalculate sums of squares for 12 month windows, even though you could inherit 11 of those monthly sum-of-squares from the prior window.

Unless you have an unusually small portfolio and date range, try to avoid making a data set of the base data for every windows.

Now that I've sounded the alarm, it is possible that you might get away with making a data set VIEW as opposed to a data set FILE. You could then submit that view to a PROC REG, with a BY WINDOWID statement. It would still be calculating sums-of-squares 12 times as much as needed, but you'd reduce disk space requirements.

Regards,
Mark

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

yotsuba88 · Posted 11-17-2016 10:04 PM

Thank you very much for your suggestion. So I think I will try to another code to solve this.

mkeintz · Posted 11-17-2016 10:20 PM

@yotsuba88: what other code do you imagine would solve this problem?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

yotsuba88 · Posted 11-17-2016 10:31 PM

I dont know, I read your alarm message first then your 2 papers. So it is misunderstood, I am reading your papers and trying to apply them.

Best,

mkeintz · Posted 11-20-2016 01:47 PM

OK, here is a program that deconstructs the task, losing efficiency (for instance it does not combine generation of monthly SSCP with accumulating 12-month rolling SSCP:

Notes:

The program is untested, so I recommend you test it on 2 permnos with 13 months each (i.e. use something like
   where permno in (12404,16801) and date between ('01jan2011'd and 31jan2012'd
in the first data step.

That's 4 rolling 12-month windows (JAN-DEC and FEB-JAN for each permno). Then you can test the regression using this program against a direct PROC REG. Just run the direct proc reg's as follows:
proc reg data=have;
    where permno=12404 and date between '01jan2011'd and '31dec2012'd;
    model ...
quit;
proc reg data=have;
    where permno=12404 and date between '01feb2011'd and '31jan2012'd;

Do same for permno 16801

I suggest you not only compare the proc reg results, but take a look to the intermediate datasets to see the work that is taking place.
When you run the real data, do NOT run the PROC REG at the end as I have specified it. You'll need at least to put in a NOPRINT statement and also some keywords to output the r-square, total residuals, and estimated coefficients. I have it there just as a placeholder.
Notice you have to specific two macrovars:
1. varnames .. a list of variables that MIGHT be in the model. You can list, say 10 variables, and later on when you run PROC REGs, you can specific a subset in your model statement. I've just put in
  %let varnames= RET FACTOR1 FACTOR2 FACTOR3
  as an example
2. NM .. number of months in your rolling windows. I put in %let NM=12;
  
  A third macro variable NRC is created by the program. NRC is the number of rows (and number of columns) in the generated SSCP matrix. It equals the number of vars in VARNAMES plus 1 (for the intercept term).
The sequence of steps are these:
1. data vtemp / view=vtemp
  Make a data set with a fixed value MONTH_END_DATE for every record. This is needed by the BY STATEMENT in the next proc.
2. PROC REG. Notice that this reg procedure does NOT estimate models. All is does is generate an SSCP for each permno/month_end_date. These are the value that need to be aggregated into 12-month rolling windows.
3. DATA SSCP_FINAL This one aggregate the rolling windows and puts them into a data set file (SSCP_FINAL), for submission to any regression you want to run. SSCP_FINAL has the "TYPE=SSCP" data set name parameter. Setting this attribute for the data set is neccessary for subsequent proc reg's to recognize that these are not regular data files. Also note it has the "if month_n>=&nm then output;" This prevent outputting windows before the first 12-months have been read in.
  
  IMPORTANT, IMPORTANT: This program assumes there are no "holes" in any window. I.e. a given permno has at least one active trading date in every month from its first month to its last month. This is probably a safe assumption in most cases, but remember: a stock can be termporarily delisted from an exchange, often when its price goes below a certain value. If it regains value it can be relisted.
4. PROC REG. This is where you run you model or models. Note this can be done later, because once you have generated SSCP_FINAL, you have the 12-month rolling SSCP values for all the estimations you need. Just remember that all the PROC REGs you run against SSCP_FINAL have to have a "by permno month_end_date;" statement.
  
  And as mentioned earlier, you probably want to put a NOPRINT option on this proc reg, and then use various parameters on it to store you R-squared, total residuals, and estimated coefficents in a separated data set.
5. Again, the efficieny loss here is that the data are passed through 2 times to get SSCP_FINAL. Both tasks could be done in a single DATA step, but the program simplification here might be worth it.

Regards,

Mark

editted addition. Notice that in the SSCP_FINAL step I use lag&nrc (which is LAG5 in this case). That is I am using a 5-deep lag queue. The reason is the every "record" in the incoming SSCP is a single row in the 5*5 matrix. One row for _NAME_="intercept", one for _NAME_="ret", one for _name_="FACTOR1" through "FACTOR3". This means that the each month has 5 records, so to get lagged values for corresponding records, I use LAG5, not LAG. Of course, if the user specifies, say 8 variables in macrovar VARNAMES, then there are 9 rows per month. That's why this program uses LAG&nrc - it automatically adjusts for the size of the SSCP matrix.

MK

/* Names of variables that might be part of models*/
%let varnames=ret factor1 factor2 factor3; 

%let NM=12;  /* Number of months per rolling window */

/* Get size of SSCP matrix (one row/col per variable & 1 row/col for intercept)*/
%let nrc=%eval(1+  %sysfunc(countw(&varnames,%str( ))));


/* Make a dataset view with fixed value (month_end_date) for each month*/
data vtemp / view=vtemp;
  set have;
  month_end_date=intnx('month',date,0,'end');
  format month_end_date yymmddn8.;
run;

/* Use proc reg to make SSCP for each month */
/* Notice there is no MODEL statement       */
proc reg data=vtemp noprint outsscp=sscp (where=(_type_='SSCP')) ;
  var &varnames;
  by permno month_end_date;
run;

/* Now accumlate rolling total 12-month SSCP values */
data sscp_final (type=sscp drop=row col month_n);

  array total_sscp{&nrc,&nrc} _temporary_ ;

  do row=1 to &nrc; do col=1 to &nrc; total_sscp{row,col}=0; end; end;

  do month_n=1 by 1 until (last.permno);
    do row=1 to &nrc;
      set sscp;
      by permno;
      array vars  {*} intercept &varnames;
	  do col=1 to &nrc;
        total_sscp{row,col}=total_sscp{row,col}+vars{col}-ifn(month_n>&nm,lag&nrc(vars{col}),0);
      end;
	  do col=1 to &nrc;
	    vars{col}=total_sscp{row,col};
	  end;
	  if month_n>=&nm then output;
    end;
  end;
run;

/* And run the regression for each permno/month_end_date */
proc reg data=sscp_final ;
  by permno month_end_date;
  var &varnames;
  model ret=factor1 factor2 factor3 ;
quit;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

yotsuba88 · Posted 11-21-2016 09:31 AM

Hi Mark,

I really appreciate your help. I did try with one permno and compare results with your code. It is correct with the first window, other windows are different. The end of months are correct but I dont know how to check the first date of every loop, I believe something is wrong with this.

Could you please help me again? Could I use SSCP for proc means like Proc reg?

Thank you so much,

Ha

mkeintz · Posted 11-21-2016 10:32 AM

PROC CORR also can generate TYPE=SSCP (and TYPE=CORR, and TYPE=CSSCP I believe) datasets. And of course you could run a DATA step to create such data set, i.e. just start out with

data newsscp (type=SSCP) ;

As to your inconsistent results, why not post 13 months of data for 1 or 2 PERMNO's? Then any particpant could assess the inconsistency. Otherwise you are asking us to read your mind.

Mark

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

SAS Innovate 2025: Register Now

SAS Training: Just a Click Away