BookmarkSubscribeRSS Feed
pedrogares
Calcite | Level 5

Hi,

 

I am trying to run PROC corr in order to obtain some rolling correlations based on 3 years. My main problem is that there is no enough internal space as the utility temporary file is really big ( I am dealing with more than 100M datapoints). Is there any way to avoid this utility temporary file to be saved on the SAS temporary files? I tried many things but neither of them worked.

 

This is my code 

 

DATA rwin / view=rwin;
array _X {1116} _temporary_ ;
array _Y {1116} _temporary_ ;
array _D {1116} _temporary_ ;
set CORRELATIONS3;
by PERMNO;
retain N 0;
N = ifn(first.PERMNO,1,N+1);
I=mod(N-1,1116)+1;
_X{I}=marketRET1;
_Y{I}=stockRET1;
_D{I}=day;
if N>=1116 then do I= 1 to 1116;
x=_X{I};
y=_Y{I};
d=_D{I};
output;
end;
run;


proc corr data=rwin noprint NOPROB NOSIMPLE outp=Babcorr1(where=(_type_='CORR'));
by permno N;
var x;
with y;
quit;

7 REPLIES 7
ballardw
Super User

Are you actually using the _X _Y and _D variables later? If not you might consider dropping them.

If you do but don't want them around for the Coor try

 

proc corr data=rwin (keep = permno n x y) <remaining code for CORR>

 

to reduce the number of variables read into the proc.

 

Another might be to break up the analysis into chunks of Permno or N values, create a separate output set for each and recombine afterwards.

 

 

Reeza
Super User

Usually it makes sense to create the big dataset and then do the calculations in one pass. 

 

In this case, if you're running out of space, it may be easier to write a macro to do each loop and then apppend the results into a table.

ballardw
Super User

Can you provide a few lines of data from the correlations3 data set, at least the variables used in making rwin?

And explain where the magic number 1116 comes from?

I tried simulating a data set but the results I get for rwin have an extremely large number of missing values for x and y so I likely have not guessed correctly quite what role 1116 plays.

pedrogares
Calcite | Level 5

Hi,

 

I am trying to compute the BAB factor in the Betting against Beta paper. I am using overlapping 3 days log returns of the market returns and the stock returns. I am assuming that every month has 31 days and as they look for at least 3 years of data this is the way I arrive to 1116. I impose to get at least 750 trading days with non-missing returns as they require in the paper. Thanks to assuming this, I do not have problems when computing the rolling windows. The last step is to use the proc corr in order to arrive to the correlations. I am following what some of you recommend me I am geting rid of the \view and trying to run the proc corr for the essential elements I need x y PERMNO and N

 

 

mkeintz
PROC Star

First, months may have 30 to 31 calendar days,but they typically have only 20 trading days.  Three years of trading with daily data is actually closer to 3*12*20=720 daily records, not 1116.

 

Second you only want PERMNO's with three years or more of trading.  Don't use the daily returns with the array to find out which permno's qualify.  Go to the other CRSP datasets.  There's one out there that has the earliest and latest trading dates for each permno.  You can use that to identify the ones you want to keep.

 

Or use PROC SUMMARY on your returns data to get the date range and count of records for each PERMNO, as in

 

proc summary data=mydata noprint n min max;

   class permno;

   var date;

   output n(date)=ndates min(date)=mindate max(date)=maxdate  out=permno_stats;

run;

 

Then you can merge that with your data and filter which permno's to process using conditions on NDATES or on MINDATE/MAXDATE.

 

But most important  -  don't drop the view technique.  Instead drop the brute force calculation of correlations by providing complete windows of original returns data.  Instead use proc  expand to create rolling windows of SSCP  (preceded by a very helpful data set view).  From the sscp data, it's easy to calculate correlations. 

 

My example using sashelp.stocks shows my recommendation.  It has 5 steps:

 

  1. PROC SORT: The data in my example is monthly, sorted by STOCK (like PERMNO) and DATE.

  2. PROC SUMMARY: The proc summary generates the dataset stock_stats, with one record per stock, and variables NDATES,MINDATE,MAXDATE. This will allow you to filter either  on date count (ndates), or on years covered.(mindate, maxdate).

  3. DATA NEED/VIEW=NEED.  Because proc expand can't directly generate rolling cross products, a preceding data set VIEW generates record-by-record cross products for proc expand to make rolling window SSCP's.  And this is where I put a filter on which stocks to process.

  4. PROC EXPAND: generate rolling windows (12 months size in my case) to make all the sscp values, means, and std's to calculate correlations. Note each window is identified by stock and date, where date is the last date of the window.  Change all the 12's in my proc expand to whatever window size you want to analyse.

  5. DATA CORRDATA:  Calculate correlations.

 

This should eliminate a lot of calculations - primarily because proc expand will generate rolling sscp's by adding the latest squares and cross-products while subtracting the 12-month old equivalents.  Not by reprocessing all the records in the new window.  The larger the window size, the more beneficial this will be.

 

/* Get monthly stock data for ibm, intel,msft */
proc sort data=sashelp.stocks (keep=stock date open close) 
   out=stocks (rename=(open=x close=y));
 by stock date;
run;

/* Get the profiles of each stock */
 proc summary data=sashelp.stocks noprint n min max  nway;
  class stock;
  var date;
  output n(date)=ndates min(date)=mindate max(date)=maxdate 
    out=stock_stats (keep=stock ndates mindate maxdate);
run;

/* Prepare to use PROC EXPAND to generate rolling SSCP values */
/* and simultaneously only keep stock with the desired date range */

 data need  /view=need;
   merge stocks  stock_stats;
   by stock;
   if ndates>200;
   xx=x*x; xy=x*y; yx=y*x; yy=y*y;
 run;

/* Now make rolling 12-month windows with sums of square and cross products*/
/* and also rolling means and std                                          */
proc expand data=need out=sscpdata (where=(_n=12)) method= none;
  by stock;
  id date;
  convert y=_n / transformin=(*0) transformout=(+1 MOVSUM 12);
  convert x=xsum y=ysum xy=xysum xx=xxsum yy=yysum / transformout=(MOVSUM 12);
  convert x=xmean y=ymean / transformout=(MOVAVE 12);
  convert x=xstd  y=ystd  / transformout=(MOVSTD 12);
run;

/* Now get the correlation values */
data corrdata;
 set sscpdata;
 corrxy=(xysum-_n*xmean*ymean)/((_n-1)*xstd*ystd);
run;

 

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Rick_SAS
SAS Super FREQ

Just a comment on mkeintz's excellent response. The trick of using PROC EXPAND and the DATA step to create a rolling correlation is explained in Vora (2008) "Easy Rolling Statistics with PROC EXPAND."

pedrogares
Calcite | Level 5

Dear all,

 

Thanks for all your comments. I found that this temporary utility problem is mainly coming from the use of \view. I  considered now all the observations and I did not have any problem about running proc corr statement (It ony took more time to do it). On the other hand, I used this 31 days assumption in order to take into account missing returns in some days. At the end, I am always imposing to get at least 750 trading days per stock but this way it is very easy to create the rolling windows. The only thing I do is to use a macro in order to compute the number of non-missing returns in these windows and impuse at least 750 days.

 

Thanks a lot to all of you

 

Best

 

Pedro

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 16. Read more here about why you should contribute and what is in it for you!

Submit your idea!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 1528 views
  • 3 likes
  • 5 in conversation