DATA Step, Macro, Functions and more

conditional rolling standard deviation

Accepted Solution Solved
Reply
Super Contributor
Posts: 441
Accepted Solution

conditional rolling standard deviation

[ Edited ]

Hi,

 

based on Ksharp's answer in https://communities.sas.com/t5/SAS-Procedures/rolling-standard-deviation-calculation/m-p/160027#M417... I want to calculate the rolling standard deviation but only for certain observations - the ones that have flag=1:

 

data have;
infile cards dlm=',' truncover;
input code year x flag;
cards;
255956,1980,,
255956,1981,,
255956,1982,,
255956,1983,,
255956,1984,5,
255956,1985,7,
255956,1986,4,
255956,1987,6,1
255956,1988,2,
255956,1989,1,
255964,1980,5,
255964,1981,7,
255964,1982,,
255964,1983,,
255964,1984,5,
255964,1985,3,
255964,1986,7,
255964,1987,3,
255964,1988,8,1
255964,1989,7,
;
run;

When I did the following code (also based on Ksharp's code):

 

proc sql;
create table want as
 select *,(select std(x) from have where year between a.year-2 and a.year and code=a.code) as rolling_std
  from have as a
where flag =1;
quit;

I obtained a data set containing only 2 observations. Indeed these are the observations where flag=1, but I would like to keep the original data as well. Also, I assume that here the rolling standard deviation was calculated for all observations where possible, but since my data is very big, it might be helpful to calculate the rolling standard deviation ONLY for the observations that have flag=1.

 

Thank you!

 

 


Accepted Solutions
Solution
3 weeks ago
PROC Star
Posts: 102

Re: conditional rolling standard deviation

@ilikesas:

I think that what you want to do is this:

proc sql;
  create table want as 
  select *, 
    case flag   
      when 1 then (select std(x) from have where year between a.year-2 and a.year and code=a.code)
      else .  
    end as rolling_std  
from have as a;
quit;

So that you get all observations, but only calculate the standard deviation when Flag=1.

 

View solution in original post


All Replies
Solution
3 weeks ago
PROC Star
Posts: 102

Re: conditional rolling standard deviation

@ilikesas:

I think that what you want to do is this:

proc sql;
  create table want as 
  select *, 
    case flag   
      when 1 then (select std(x) from have where year between a.year-2 and a.year and code=a.code)
      else .  
    end as rolling_std  
from have as a;
quit;

So that you get all observations, but only calculate the standard deviation when Flag=1.

 

Super Contributor
Posts: 441

Re: conditional rolling standard deviation

Hi s_lassen,

 

thank you for the code!

 

As I said to mkeint below, in reality I have daily stock price data and I need the standard deviation over the past 4 years.

 

I ran your code for a sample of my data, and it took a lot of time. Is it possible to make the code run faster?

 

Thanks!

PROC Star
Posts: 102

Re: conditional rolling standard deviation

@ilikesas:

If you do not have any missing dates in your table, you can calculate the standard deviation for 1461 days, (4 years exactly, unless you are around year 1900 or 2100) like this: 

data want;
  set have;
  by id;
  array values(0:1460) 8 _temporary_;
  if first.id then
    call missing(of values(*));
  values(mod(_N_,1461))=x;
  if flag=1 then
    rolling_std=std(of values(*));
run;

If some days are missing in your table, it becomes more complicated. One way is to use SET with POINT= to find the values that you want to delete from the array:

data want;
  set have;
  by id;
  array values(0:1460) 8 _temporary_;
  if first.id then do;
    call missing(of values(*));
    _p_first=_N_;
    _date_first=date;
    retain _p_first _date_first;
    drop _p_first _date_first;
    end;
  values(mod(_N_,1461))=x;
  do while(_date_first<date-1460);
    _p_first+1;
    set have(keep=date rename=(date=_date_first)) point=_p_first;
    call missing(values(mod(_p_first,1461)));
    end;
  if flag=1 then
    rolling_std=std(of values(*));
run;

Notice that I zero-based the array so that the MOD function can be used to calculate an index directly.

 

While this approach may be a bit slower than calculating the sum, N, and sum-of-squares, it is also a lot simpler.

Trusted Advisor
Posts: 1,022

Re: conditional rolling standard deviation

I understand the appeal of using PROC SQL to do this code, but using expressions like " where year between a.year-2 and a.year and code=a.code" will introduce a lot of background record comparisons.  This is mostly unnecessary since it fails to take advantage of the apparent order of your data, which seem to be sorted by CODE/YEAR.  It's likely that a DATA step like this will be better:

 

data have;
infile cards dlm=',' dsd truncover;
input code year x flag;
cards;
255956,1980,,
255956,1981,,
255956,1982,,
255956,1983,,
255956,1984,5,
255956,1985,7,
255956,1986,4,
255956,1987,6,1
255956,1988,2,
255956,1989,1,
255964,1980,5,
255964,1981,7,
255964,1982,,
255964,1983,,
255964,1984,5,
255964,1985,3,
255964,1986,7,
255964,1987,3,
255964,1988,8,1
255964,1989,7,
;
run;

data want;
  set have;
  by code;
  array history{1978:1990} _temporary_;
  if first.code then call missing(of history{*});
  history{year}=x;
  if flag=1 then rolling_std=std(history{year-2},history{year-1},history{year});
run;

 

This program assumes (1) data are sorted by code/year, and (2) year is always an integer between 1980 and 1990.

 

The lower bound of the HISTORY array is set to 1978, because the earliest expected year is 1980.  This prevents the calculation of rolling_std from generating an "array subscript out of range" message is a 1980 record has flag=1.

 

Super Contributor
Posts: 441

Re: conditional rolling standard deviation

[ Edited ]

Hi mkeintz,

 

I probably should have mentioned it in the question, but in my real data instead of years I have actual dates, and I need for a specific flag date to calculate the rolling stdev for the previous 4 years, so the stdev is calculated between date and date-1440    (since 365*4 = 1440).

 

my data is bw 2001 and 2016, so I guess that the "history" array in your code would have to be at least of size 17*365=6205.

 

And the part 

if flag=1 then rolling_std=std(history{year-2},history{year-1},history{year})

would need to have 1440 parameters for the std function.

 

Given this new information, is it still possible to find a more efficient solution with the DATA step?

 

Thanks! 

 

 

PS: the data is indeed sorted by CODE/YEAR

Trusted Advisor
Posts: 1,022

Re: conditional rolling standard deviation

[ Edited ]

Using dates, not years, won't be a problem in terms of programming.  You can easily have an array with 17*365 elements (actually 21*365 if you want to avoid problems with FLAG=1 within the first 4 years.  

 

I see two problems.

 

  1.   The easy problem - how to transparently generate lower and upper bounds for the array.  It would be nice if you could use a statement like
        array history{'01jan1997'd:'31dec2016'd) _temporary_;
    But the array statement doesn't accept date literals as bounds specification.  So use the macro facility, as in:

    %let begdate=01jan1997;
    %let enddate=31dec2016;
    %let LB=%sysfunc(inputn(&begdate,date9.));
    %let UP=%sysfunc(inputn(&enddate,date9.));

    array history {&LB:&UB} _temporary_;

  2. But the real problem is in calculating the rolling STD.  If you were to use my earlier suggestion, you would have two issues:
    1. the problem of how to compactly specify 1440 arguments to the STD function (SAS does not accept syntax like
          STD(of X{I-1440}-X{I})
      and
    2. you would be neglecting the fact that the STD for DATE i-1440 through i  (call it STD{i}) can be expressed as a function of three arguments:      STD{i-1}  ,X{i-1441}, and X{i}.  In other words, use the preceding STD, and the values of the case being dropped and the case being added.

      [Editted additional note].  You'd also need a couple other arguments:  the Sum-of-squares corresponding to STD{i-1} and the Mean corresponding to STD{i-1}. 

      This also might allow you to efficiently calculate STD for every date, not just flag=1.

 

Before I make a suggestion, are you generating

Trusted Advisor
Posts: 1,022

Re: conditional rolling standard deviation

[ Edited ]

Let's assume dataset HAVE has three variables: ID,DATE, and X, sorted by ID/DATE, where DATE is a SAS data value that ranges from 01jan2001 through 31dec2016.   You want STD of daily values for rolling 4 year windows.

 

Take advantage of the fact that

 

      STD(x)   = SQRT{  (1/(n-1)   [SUM(X{i}**2)  - N*(mean(x)**2  ]   }

 

which means you can keep track of rolling stats such as rolling sums-of-squares and rolling_sums.  With each new observation you can add to rolling 4-year sums of squares and rolling 4-year sums.  And you can delete any newly bypassed old observations from those sums:

 

This program is untested:

%let begdate=01jan2001;
%let enddate=31dec2016;

%let LB=%sysfunc(intnx(year,%sysfunc(inputn(&begdate,date9.)),-4,sameday));
%let UB=%sysfunc(inputn(&enddate,date9.));


data want (drop=_:);
  set have (keep=id date x);
  by id;

  retain _rolling_ss   /* Rolling sum of squares*/
         _rolling_sum  /* Rolling sum */
         _rolling_n    /* Rolling count */
; array history {&lb:&ub} _temporary_; if first.id then call missing(of history{*},of _rolling:); history{date}=x; if x^=. then do; _rolling_n+ 1; _rolling_sum+ x; _rolling_ss+ x**2; end; _most_recent_drop_date=intnx('year',date,-4,'sameday'); _prior_most_recent_dd=lag(_most_recent_drop_date); if first.id=0 then do _drop_date=_prior_most_recent_dd+1 to _most_recent_drop_date; if history{_drop_date} ^= . then do; _rolling_n+ (-1); _rolling_sum+ (-history{_drop_date}); _rolling_ss+ (-history{_drop_date}**2); end; end; _rolling_mean=_rolling_sum/_rolling_n; if _rolling_n>1 then rolling_std=sqrt( ( _rolling_ss - _rolling_n*(_rolling_mean**2))/(_rolling_n-1) ); run;

 

 

 

 

 

Super Contributor
Posts: 441

Re: conditional rolling standard deviation

Hi mkeintz,

 

I tried running the code but got many errors. I just changed id to cusip and x to prc

 

array history {&lb:&ub} _temporary_;
                     -       -
                     22      200
                     200
ERROR 22-322: Syntax error, expecting one of the following: a name, a quoted string,
              a numeric constant, a datetime constant, a missing value, (, -, :, ;, _ALL_,
              _CHARACTER_, _CHAR_, _NUMERIC_.

ERROR 200-322: The symbol is not recognized and will be ignored.

411!   array history {&lb:&ub} _temporary_;
                             -
                             22
ERROR 22-322: Syntax error, expecting one of the following: a quoted string, a numeric constant,
              a datetime constant, a missing value, iterator, (, ), ','.

412
413    if first.cusip then call missing(of history{*});
ERROR: Undeclared array referenced: history.
ERROR: The ARRAYNAME[*] specification requires an array.
414
415    history{date}=prc;
ERROR: Undeclared array referenced: history.
ERROR: Variable history has not been declared as an array.
416
417    if prc^=. then do;
418      _rolling_n+ 1;
419      _rolling_sum+ x;
420      _rolling_ss+ x**2;
421    end;
422
423    _most_recent_drop_date=intnx('year',date,-4,'sameday');
424    _prior_most_recent_dd=lag(_most_recent_drop_date)+1;
425
426    if first.cusip=0 then do _drop_date=_prior_most_recent_dd+1 to _most_recent_drop_date;
427      if history{_drop_date} ^= . then do;
ERROR: Undeclared array referenced: history.
ERROR: Variable history has not been declared as an array.
428        _rolling_n+ (-1);
429        _rolling_sum+ (-history{_drop_date});
ERROR: Undeclared array referenced: history.
ERROR: Variable history has not been declared as an array.
430        _rolling_ss+ (-history{_drop_date}**2);
ERROR: Undeclared array referenced: history.
ERROR: Variable history has not been declared as an array.
431      end;
432    end;
Trusted Advisor
Posts: 1,022

Re: conditional rolling standard deviation

Consider what "the symbol is not recognized and will be ignored" must mean.  In your case it means the macrovars (macro symbols) LB and UB are not defined.  You did not show the macro statement that I include assigning values to them, so that's your first task.

Super Contributor
Posts: 441

Re: conditional rolling standard deviation

In fact I did include them the first time (it just didn't show on what I pasted)

 

%let begdate=01jan2001;
%let enddate=31dec2016;

%let LB=%sysfunc(intnx(year,%sysfunc(inputn(&begdate,date9.)),-4,sameday));
%let UB=%sysfunc(inputn(&enddate,date9.));

But I still get the error messages.

Trusted Advisor
Posts: 1,022

Re: conditional rolling standard deviation

Then it's time to debug.  Issue the

 

%put _user_;

 

statement after the macrovar assignments, to confirm that they were created as intended.

 

Super Contributor
Posts: 441

Re: conditional rolling standard deviation

I did, and  I see

 

GLOBAL BEGDATE 01jan2001

 

GLOBAL ENDDATE 31dec2016

 

 

 

Is it ok if I attach to you a small sample of my data?

Trusted Advisor
Posts: 1,022

Re: conditional rolling standard deviation

And what values do you get for macrovars  LB and UB?

Super Contributor
Posts: 441

Re: conditional rolling standard deviation

here is what I get:

 

46   %put _user_;
GLOBAL LB 13515
GLOBAL BEGDATE 01jan2001
GLOBAL UB 20819
GLOBAL ENDDATE 31dec2016

so apparently LB and UB are in number form instead of date form.

In case you are interested, here I attach a sample of my data that contains 2 cusips (companies) and their daily prices.

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 21 replies
  • 246 views
  • 13 likes
  • 3 in conversation