Oh. You need to use IML or write a macro that calls PROC REG. Some skeleton code:
data date;
  set have(keep=date);
  end_date=intnx('month',date,12,'e');
run;
data _null_;
  set date;
  call execute('proc reg data=have(where=(date between '|| date ||' and '||end_date ||' )) outest=xxx'|| strip(_n_)||' ;');
  ..............
run;
data want;
  set xxx:;
run;
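For readers who want to see the macro route spelled out, here is a minimal sketch of the brute-force idea, not the poster's actual code. The macro name ROLL_REG, its parameters, the EST_ data set prefix, and the variable names PERMNO, RET and FACTOR1-FACTOR3 are assumptions for illustration. HAVE is assumed to be sorted by PERMNO and DATE.

```sas
/* Hypothetical macro: START is a numeric SAS date, NWIN the number of windows */
%macro roll_reg(data=have, start=, nwin=, nm=12);
  %local i beg fin;
  %do i=0 %to %eval(&nwin-1);
    /* window i runs from the start of month i to the end of month i+nm-1 */
    %let beg=%sysfunc(intnx(month,&start,&i,b));
    %let fin=%sysfunc(intnx(month,&start,%eval(&i+&nm-1),e));
    proc reg data=&data(where=(date between &beg and &fin))
             outest=est_&i noprint;
      by permno;
      model ret=factor1 factor2 factor3;
    run; quit;
  %end;
  /* stack the per-window estimates; WINDOW_DS records the source data set */
  data want;
    length window_ds $41;
    set est_: indsname=src;
    window_ds=src;
  run;
%mend roll_reg;

/* example call: three 12-month windows starting January 2000 */
%roll_reg(start=%sysfunc(mdy(1,1,2000)), nwin=3)
```

Note that this reads the raw data once per window, which is exactly the cost discussed in the replies that follow.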
While you offer a nice, neat program, I don't think it will solve @yotsuba88's problem.
It does avoid the need for a lot of disk space, but:
regards,
Mark
@mkeintz, maybe I misunderstood what the OP means. I think the best and fastest choice is IML code, which is very flexible for this kind of question.
I would need a lot of convincing to think that IML is the best solution for this user's problem, not because I don't recognize the flexibility of PROC IML, but mainly because of the likely size of the data. Although he didn't say so, he is probably like a lot of users of the CRSP database, in that he's looking at possibly thousands of stocks (identified by variable PERMNO) and several years for each stock.
So if a given firm has, say, 20 years of data in his sample (the CRSP database actually goes back to the end of 1925), then for each firm there will be 229 complete windows of 12 months' duration (240 months - 12 + 1). So for, say, 4,000 stocks (not unusual for finance researchers) that means about 916,000 regressions, with each regression having about 240 observations (number of trading days in 12 months). It's hard for me to believe that IML is the way to go for this.
I would recommend a first pass through the data using a DATA step, generating an unadjusted SSCP (USSCP) matrix for each month. For (say) a dependent, 3 independents, and an intercept, that's 5 rows per month. The DATA step could also accumulate the first 12 months of data to have a USSCP for a complete year. Output that USSCP, assigning variable WINDOWID=1. Then read in month 13, calculate its USSCP, add it to the 12-month total, and subtract out the first month's USSCP. You now have the USSCP for months 2-13; output that, assigning WINDOWID=2. And so on.
This would generate a reasonably sized data set (either view or file) of 5*916,000=4,580,000 records (instead of 240*916,000=219,840,000 original data records) accompanied by the WINDOWID variable. And note that, unlike regressing on the original data, the SSCP for a given month is generated only once, even though it contributes to 12 different windows.
This data set could be read in by PROC REG, with a BY WINDOWID statement. (The documentation on PROC CORR or PROC REG describes what a SAS TYPE=SSCP data set should look like to be recognized by PROC REG.)
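To see what such a data set looks like, one option is to let PROC REG write one and print it. This is only an illustrative sketch: SASHELP.CLASS stands in for real data, and the output data set name CLASS_SSCP is made up.

```sas
/* SASHELP.CLASS is used only as a stand-in for real data */
proc reg data=sashelp.class outsscp=class_sscp noprint;
  var weight height age;   /* no MODEL statement is needed to get the SSCP */
run; quit;

/* inspect the layout: _TYPE_, _NAME_, Intercept, and one column per variable */
proc print data=class_sscp;
run;
```

The printed rows should show one _TYPE_='SSCP' record per variable plus the intercept, which is the shape a hand-built or accumulated SSCP data set needs to mimic.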
That is how I understand the OP's original request. I would applaud an IML solution that would handle that data volume.
Regards,
Mark
I really appreciate both of you guys. I am still struggling to understand how it works and how to write the code. Anyway, I will try all the suggestions to find the best one.
Best,
How many stocks, for how many windows, do you intend to process? The CRSP (Center for Research in Security Prices) database that you are using has daily data on thousands of stocks going back to the end of 1925, so your potential universe is big.
SQL is NOT the solution to such a problem. It's going to compare EVERY record in "crsp as b" to each record in "crsp as a" to determine which records to put in the window. In other words, SQL will ignore the data order (PERMNO/DATE) in most CRSP data sets. Even if SAS marks the dataset as sorted, I suspect SQL won't take advantage (but that could be tested).
And consider the sheer size of the data you want. I suppose you want a new window every month, so that's 12 windows per year. With about 200 trading days per 12 months in each window, you are asking for each stock about 2400 records for a year of data. So multiply N(stocks)*N(years)*2400*(record size) to determine your disk space requirements.
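The multiplication above can be done in a throwaway DATA step. Every input value below is an illustrative assumption (only the 2400 records per stock-year comes from the estimate above), so substitute your own counts and record length.

```sas
/* All inputs are assumptions for illustration, not figures for any real sample */
data _null_;
  n_stocks   = 4000;  /* assumed number of PERMNOs */
  n_years    = 20;    /* assumed years of data per stock */
  rec_per_yr = 2400;  /* 12 windows * about 200 trading days, as estimated above */
  rec_bytes  = 48;    /* assumed: 6 numeric variables at 8 bytes each */
  total_bytes = n_stocks*n_years*rec_per_yr*rec_bytes;
  total_gb    = total_bytes/1024**3;
  put total_bytes= comma20. total_gb= 8.1;
run;
```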
But wait -- there's more! You apparently plan to do a regression on each of the windows, meaning you intend to recalculate sums of squares for 12 month windows, even though you could inherit 11 of those monthly sum-of-squares from the prior window.
Unless you have an unusually small portfolio and date range, try to avoid making a data set of the base data for every window.
Now that I've sounded the alarm, it is possible that you might get away with making a data set VIEW as opposed to a data set FILE. You could then submit that view to a PROC REG, with a BY WINDOWID statement. It would still be calculating sums-of-squares 12 times as much as needed, but you'd reduce disk space requirements.
Regards,
Mark
Thank you very much for your suggestion. So I think I will try another code to solve this.
@yotsuba88: what other code do you imagine would solve this problem?
OK, here is a program that deconstructs the task, losing some efficiency (for instance, it does not combine generation of the monthly SSCP with accumulation of the 12-month rolling SSCP):
Notes:
Regards,
Mark
Edited addition: Notice that in the SSCP_FINAL step I use lag&nrc (which is LAG5 in this case). That is, I am using a 5-deep lag queue. The reason is that every "record" in the incoming SSCP is a single row in the 5*5 matrix: one row for _NAME_="intercept", one for _NAME_="ret", and one each for _NAME_="FACTOR1" through "FACTOR3". This means that each month has 5 records, so to get lagged values for corresponding records, I use LAG5, not LAG. Of course, if the user specifies, say, 8 variables in macrovar VARNAMES, then there are 9 rows per month. That's why this program uses LAG&nrc: it automatically adjusts for the size of the SSCP matrix.
MK
/* Names of variables that might be part of models*/
%let varnames=ret factor1 factor2 factor3;
%let NM=12; /* Number of months per rolling window */
/* Get size of SSCP matrix (one row/col per variable & 1 row/col for intercept)*/
%let nrc=%eval(1+ %sysfunc(countw(&varnames,%str( ))));
/* Make a dataset view with fixed value (month_end_date) for each month*/
data vtemp / view=vtemp;
  set have;
  month_end_date=intnx('month',date,0,'end');
  format month_end_date yymmddn8.;
run;
/* Use proc reg to make SSCP for each month */
/* Notice there is no MODEL statement */
proc reg data=vtemp noprint outsscp=sscp (where=(_type_='SSCP')) ;
  var &varnames;
  by permno month_end_date;
run;
/* Now accumulate rolling total 12-month SSCP values */
data sscp_final (type=sscp drop=row col month_n);
  array total_sscp{&nrc,&nrc} _temporary_ ;
  do row=1 to &nrc; do col=1 to &nrc; total_sscp{row,col}=0; end; end;
  do month_n=1 by 1 until (last.permno);
    do row=1 to &nrc;
      set sscp;
      by permno;
      array vars {*} intercept &varnames;
      do col=1 to &nrc;
        total_sscp{row,col}=total_sscp{row,col}+vars{col}-ifn(month_n>&nm,lag&nrc(vars{col}),0);
      end;
      do col=1 to &nrc;
        vars{col}=total_sscp{row,col};
      end;
      if month_n>=&nm then output;
    end;
  end;
run;
/* And run the regression for each permno/month_end_date */
proc reg data=sscp_final ;
  by permno month_end_date;
  var &varnames;
  model ret=factor1 factor2 factor3 ;
quit;
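A note for anyone adapting the SSCP_FINAL step: the value that must leave the rolling total is the same matrix cell from &NM months earlier. A single LAGn call inside the DO COL loop shares one queue across all columns, so that queue advances &nrc times per record, and LAG&nrc may therefore not return the cell you expect. The sketch below is one possible alternative, not the original poster's method. It keeps the last &NM monthly matrices in a temporary array (a ring buffer). The names SSCP_ROLL, HIST, TOTAL and SLOT are made up, and it assumes every PERMNO has one SSCP block per month with no missing months.

```sas
/* Same settings as in the program above */
%let varnames=ret factor1 factor2 factor3;
%let nm=12;
%let nrc=%eval(1+%sysfunc(countw(&varnames,%str( ))));

data sscp_roll (type=sscp drop=row col month_n slot);
  /* hist keeps the last &nm monthly SSCP matrices; total is their sum */
  array hist{0:%eval(&nm-1),&nrc,&nrc} _temporary_;
  array total{&nrc,&nrc} _temporary_;
  /* this DATA step iterates once per PERMNO, so reset the total here */
  do row=1 to &nrc; do col=1 to &nrc; total{row,col}=0; end; end;
  do month_n=1 by 1 until (last.permno);
    slot=mod(month_n,&nm);  /* buffer slot that held month_n - &nm */
    do row=1 to &nrc;
      set sscp;
      by permno;
      array vars{*} intercept &varnames;
      do col=1 to &nrc;
        /* remove the month that drops out of the window, if there is one */
        if month_n>&nm then total{row,col}=total{row,col}-hist{slot,row,col};
        hist{slot,row,col}=vars{col};            /* remember this month's cell */
        total{row,col}=total{row,col}+vars{col}; /* add it to the rolling sum */
        vars{col}=total{row,col};                /* write the window total back */
      end;
      if month_n>=&nm then output;
    end;
  end;
run;
```

Comparing one window from SSCP_ROLL against a PROC REG OUTSSCP= run on the same 12 months of raw data is a quick way to check whether the accumulation is doing what you intend.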
Hi Mark,
I really appreciate your help. I tried it with one permno and compared the results with your code. It is correct for the first window, but the other windows are different. The month-end dates are correct, but I don't know how to check the first date of every loop; I believe something is wrong with this.
Could you please help me again? Could I use SSCP for PROC MEANS like PROC REG?
Thank you so much,
Ha
PROC CORR also can generate TYPE=SSCP (and TYPE=CORR, and TYPE=CSSCP I believe) data sets. And of course you could run a DATA step to create such a data set, i.e. just start out with
data newsscp (type=SSCP) ;
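As a rough illustration of where that DATA step could go, here is a hand-built SSCP for four made-up points (x,y) = (1,2), (2,3), (3,5), (4,4). The layout copies what PROC REG writes with OUTSSCP= (an Intercept column, one column per variable, and a final _TYPE_='N' record); treat that layout as an assumption to verify against your own OUTSSCP= output.

```sas
/* Sums for the four points: n=4, sum(x)=10, sum(y)=14, */
/* sum(x*x)=30, sum(x*y)=39, sum(y*y)=54                */
data newsscp (type=SSCP);
  length _type_ $8 _name_ $32;
  input _type_ $ _name_ $ Intercept x y;
  datalines;
SSCP Intercept 4 10 14
SSCP x 10 30 39
SSCP y 14 39 54
N . 4 4 4
;
run;

/* Hand calculation for these points gives slope 0.8 and intercept 1.5 */
proc reg data=newsscp;
  model y=x;
run; quit;
```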
As to your inconsistent results, why not post 13 months of data for 1 or 2 PERMNOs? Then any participant could assess the inconsistency. Otherwise you are asking us to read your mind.
Mark