Solved: Re: Rolling Skewness with Moving Window

Junyong · Posted 08-27-2020 07:13 PM

I have 5000 daily time series (var0001-var5000) from 01/01/2019 to 12/31/2019 and want to compute their rolling skewness each month. This post introduces proc summary for the rolling skewness.

data have(drop=i);
format t yymmddn6.;
array var(5000) var0001-var5000;
do t="1jan2019"d to "31dec2019"d;
do i=1 to 5000;
var(i)=rannor(1);
end;
output;
end;
run;

proc summary nway;
class t;
format t yymmn6.;
var var0001-var5000;
output out=want skew=;
run;

So I can quickly compute the 5000 skewness in January, February, and so on. Now suppose I want to compute each monthly skewness using two—or N more generally—consecutive months rather than just one month—for example, the February skewness using January and February data, the March skewness using February and March data, and so on. Can proc summary do this?

mkeintz · Posted 08-27-2020 09:24 PM

OK, I've thought some more about it. Here's a solution that submits an expanded data set (data set view not data set file) to proc summary.

Test this out by using, say 10 or so variables (%let NV=10). Then you can examine results more closely.

%let nv=5000;
%let nv=0010;
data have(drop=i);
  format t yymmddn8.;
  array var{&nv} var0001-var&nv;
  do t="1jan2019"d to "31dec2019"d;
    do i=1 to &nv;
      var(i)=rannor(1);
    end;
    output;
  end;
run;

%let win_size=5;
%let up_bound=%eval(&win_size-1);

data need (rename=(t=t_end) drop=_:)  / view=need ;
  retain window_id;
  format t_beg yymmddn8. ;
  set have;
  t_beg=lag&up_bound(t);

  array vars {&nv} var0001-var&nv ;
  array win_data{0:&up_bound,&nv} _temporary_;

  _row=mod(_n_,&win_size);
  do _v=1 to dim(vars);
    win_data{_row,_v}=vars{_v};
  end;

  if _n_>=&win_size;
  window_id+1;

  output ;   /* Output the current record */
  /* Now output the earlier window constituents */
  do _obs= _n_-1 to _n_-&up_bound by -1;
    _row=mod(_obs,&win_size);
	do _v=1 to dim(vars);
	  vars{_v}=win_data{_row,_v};
	end;
	output;
  end;
run;

proc summary data=need noprint ;
  by window_id t_beg t_end ;
  var var: ;
  output out=want  skew= / autoname;
run;

The strategy above is to keep a 2-dimensional array with 5 rows (for winsize=5) and 5000 columns (number of variables. Each incoming observation replaces the row of the array that is 5 observations old. Also for each incoming observation calculate t_beg=lag4(t) - i.e. the starting date of the size 5 window.

Then comes the output. Once you have a full window, increment window_id by 1. OUTPUT the record-in-hand. Then for the prior 4 days, transcribe the corresponding row of the array to the original variables, and output. You now have a dataset, sorted by WINDOW_ID (with associated variables T_BEG and T_END) with 5 records per id. Note it doesn't matter if the observations within the window are in chronological order - you're calculating skewness, not some time-series statistic.

Then submit to PROC SUMMARY, with a BY WINDOW_ID T_BEG T_END; statement.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

mkeintz · Posted 08-27-2020 08:32 PM

The only way to get proc summary to generate rolling window skewness (or rolling window anything) is to expand the original data set.

Let's say your rolling window sizes are 5. The you could make a data set sorted by the new variable WINDOW_ID, in sequences of 5 records. So for window_id=1, you would have records 1-5, for window_id=2, you would have records 2-6, etc. So that's a dataset 5 times your original dataset. Imagine if you want window sizes of, say, 20 to 30.

But in any case, you cuold then run the PROC SUMMARY with a BY WINDOW_ID statement. I don't suggest this.

Editted note: However, you might not use too many resources by making a data set VIEW instead of a dataset FILE.

Do you have SAS/ETS (econometrics times series)? If so, then proc expand is what you want.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

mkeintz · Posted 08-27-2020 09:24 PM

OK, I've thought some more about it. Here's a solution that submits an expanded data set (data set view not data set file) to proc summary.

Test this out by using, say 10 or so variables (%let NV=10). Then you can examine results more closely.

%let nv=5000;
%let nv=0010;
data have(drop=i);
  format t yymmddn8.;
  array var{&nv} var0001-var&nv;
  do t="1jan2019"d to "31dec2019"d;
    do i=1 to &nv;
      var(i)=rannor(1);
    end;
    output;
  end;
run;

%let win_size=5;
%let up_bound=%eval(&win_size-1);

data need (rename=(t=t_end) drop=_:)  / view=need ;
  retain window_id;
  format t_beg yymmddn8. ;
  set have;
  t_beg=lag&up_bound(t);

  array vars {&nv} var0001-var&nv ;
  array win_data{0:&up_bound,&nv} _temporary_;

  _row=mod(_n_,&win_size);
  do _v=1 to dim(vars);
    win_data{_row,_v}=vars{_v};
  end;

  if _n_>=&win_size;
  window_id+1;

  output ;   /* Output the current record */
  /* Now output the earlier window constituents */
  do _obs= _n_-1 to _n_-&up_bound by -1;
    _row=mod(_obs,&win_size);
	do _v=1 to dim(vars);
	  vars{_v}=win_data{_row,_v};
	end;
	output;
  end;
run;

proc summary data=need noprint ;
  by window_id t_beg t_end ;
  var var: ;
  output out=want  skew= / autoname;
run;

The strategy above is to keep a 2-dimensional array with 5 rows (for winsize=5) and 5000 columns (number of variables. Each incoming observation replaces the row of the array that is 5 observations old. Also for each incoming observation calculate t_beg=lag4(t) - i.e. the starting date of the size 5 window.

Then comes the output. Once you have a full window, increment window_id by 1. OUTPUT the record-in-hand. Then for the prior 4 days, transcribe the corresponding row of the array to the original variables, and output. You now have a dataset, sorted by WINDOW_ID (with associated variables T_BEG and T_END) with 5 records per id. Note it doesn't matter if the observations within the window are in chronological order - you're calculating skewness, not some time-series statistic.

Then submit to PROC SUMMARY, with a BY WINDOW_ID T_BEG T_END; statement.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

RichardDeVen · Posted 08-28-2020 07:22 AM

Using a data row in a rolling window computation means the data row needs to be replicated for association to each window it will be part of (i.e. partitions).

A data row will thus need to be repeated with an appropriate window identifier that iterates from window (date) to window+N (of date). A month based window id from a date (your variable t) can be computed with INTCK('month',...)

Example:

Using @mkeintz data generator and summary.

* generate some data;

%let nv=5000;
*%let nv=0010;
data have(drop=i);
  format t yymmddn8.;
  array var{&nv} var0001-var&nv;
  do t="1jan2019"d to "31dec2019"d;
    do i=1 to &nv;
      var(i)=rannor(1);
    end;
    output;
  end;
  format var: 8.4;
run;

* compute max month based window id;
proc sql noprint;
  select max(intck('month', '01jan1960'd, t)) 
  into :max_seqnum 
  from have;

%let MONTH_WINDOW = 2;

data stage(index=(seqdate));
  length seqnum 8;

  set have;
  by t;  * enforce requirement of data being sorted by t;

  * convert date to a month based sequence number (window id);
  seqnum = intck('month', '01jan1960'd, t); 

  * create replicates of row, associating data to N different month window partitions;
  do seqnum = seqnum to min(seqnum + &MONTH_WINDOW - 1, &MAX_SEQNUM);
    seqdate = intnx('month', '01jan1960'd, seqnum);  * first of month for later reporting;
    output;
  end;
  format seqdate monyy7.;
run;

proc summary data=stage;
  by seqdate;
  var var:;
  output out=want skew= / autoname;
run;

Sample output:

_FREQ_ is number of dates contributing to MONTH_WINDOW = 2 rolling month statistic (skew).

Rolling Skewness with Moving Window

Re: Rolling Skewness with Moving Window

Re: Rolling Skewness with Moving Window

Re: Rolling Skewness with Moving Window

Re: Rolling Skewness with Moving Window

Registration is open

SAS Training: Just a Click Away