BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
ilikesas
Barite | Level 11

Hi,

 

based on Ksharp's answer in https://communities.sas.com/t5/SAS-Procedures/rolling-standard-deviation-calculation/m-p/160027#M417... I want to calculate the rolling standard deviation but only for certain observations - the ones that have flag=1:

 

data have;
infile cards dlm=',' truncover;
input code year x flag;
cards;
255956,1980,,
255956,1981,,
255956,1982,,
255956,1983,,
255956,1984,5,
255956,1985,7,
255956,1986,4,
255956,1987,6,1
255956,1988,2,
255956,1989,1,
255964,1980,5,
255964,1981,7,
255964,1982,,
255964,1983,,
255964,1984,5,
255964,1985,3,
255964,1986,7,
255964,1987,3,
255964,1988,8,1
255964,1989,7,
;
run;

When I did the following code (also based on Ksharp's code):

 

proc sql;
create table want as
 select *,(select std(x) from have where year between a.year-2 and a.year and code=a.code) as rolling_std
  from have as a
where flag =1;
quit;

I obtained a data set containing only 2 observations. Indeed these are the observations where flag=1, but I would like to keep the original data as well. Also, I assume that here the rolling standard deviation was calculated for all observations where possible, but since my data is very big, it might be helpful to calculate the rolling standard deviation ONLY for the observations that have flag=1.

 

Thank you!

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
s_lassen
Meteorite | Level 14

@ilikesas:

I think that what you want to do is this:

proc sql;
  create table want as 
  select *, 
    case flag   
      when 1 then (select std(x) from have where year between a.year-2 and a.year and code=a.code)
      else .  
    end as rolling_std  
from have as a;
quit;

So that you get all observations, but only calculate the standard deviation when Flag=1.

 

View solution in original post

21 REPLIES 21
s_lassen
Meteorite | Level 14

@ilikesas:

I think that what you want to do is this:

proc sql;
  create table want as 
  select *, 
    case flag   
      when 1 then (select std(x) from have where year between a.year-2 and a.year and code=a.code)
      else .  
    end as rolling_std  
from have as a;
quit;

So that you get all observations, but only calculate the standard deviation when Flag=1.

 

ilikesas
Barite | Level 11

Hi s_lassen,

 

thank you for the code!

 

As I said to mkeint below, in reality I have daily stock price data and I need the standard deviation over the past 4 years.

 

I ran your code for a sample of my data, and it took a lot of time. Is it possible to make the code run faster?

 

Thanks!

s_lassen
Meteorite | Level 14

@ilikesas:

If you do not have any missing dates in your table, you can calculate the standard deviation for 1461 days, (4 years exactly, unless you are around year 1900 or 2100) like this: 

data want;
  set have;
  by id;
  array values(0:1460) 8 _temporary_;
  if first.id then
    call missing(of values(*));
  values(mod(_N_,1461))=x;
  if flag=1 then
    rolling_std=std(of values(*));
run;

If some days are missing in your table, it becomes more complicated. One way is to use SET with POINT= to find the values that you want to delete from the array:

data want;
  set have;
  by id;
  array values(0:1460) 8 _temporary_;
  if first.id then do;
    call missing(of values(*));
    _p_first=_N_;
    _date_first=date;
    retain _p_first _date_first;
    drop _p_first _date_first;
    end;
  values(mod(_N_,1461))=x;
  do while(_date_first<date-1460);
    _p_first+1;
    set have(keep=date rename=(date=_date_first)) point=_p_first;
    call missing(values(mod(_p_first,1461)));
    end;
  if flag=1 then
    rolling_std=std(of values(*));
run;

Notice that I zero-based the array so that the MOD function can be used to calculate an index directly.

 

While this approach may be a bit slower than calculating the sum, N, and sum-of-squares, it is also a lot simpler.

mkeintz
PROC Star

I understand the appeal of using PROC SQL to do this code, but using expressions like " where year between a.year-2 and a.year and code=a.code" will introduce a lot of background record comparisons.  This is mostly unnecessary since it fails to take advantage of the apparent order of your data, which seem to be sorted by CODE/YEAR.  It's likely that a DATA step like this will be better:

 

data have;
infile cards dlm=',' dsd truncover;
input code year x flag;
cards;
255956,1980,,
255956,1981,,
255956,1982,,
255956,1983,,
255956,1984,5,
255956,1985,7,
255956,1986,4,
255956,1987,6,1
255956,1988,2,
255956,1989,1,
255964,1980,5,
255964,1981,7,
255964,1982,,
255964,1983,,
255964,1984,5,
255964,1985,3,
255964,1986,7,
255964,1987,3,
255964,1988,8,1
255964,1989,7,
;
run;

data want;
  set have;
  by code;
  array history{1978:1990} _temporary_;
  if first.code then call missing(of history{*});
  history{year}=x;
  if flag=1 then rolling_std=std(history{year-2},history{year-1},history{year});
run;

 

This program assumes (1) data are sorted by code/year, and (2) year is always an integer between 1980 and 1990.

 

The lower bound of the HISTORY array is set to 1978, because the earliest expected year is 1980.  This prevents the calculation of rolling_std from generating an "array subscript out of range" message is a 1980 record has flag=1.

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
ilikesas
Barite | Level 11

Hi mkeintz,

 

I probably should have mentioned it in the question, but in my real data instead of years I have actual dates, and I need for a specific flag date to calculate the rolling stdev for the previous 4 years, so the stdev is calculated between date and date-1440    (since 365*4 = 1440).

 

my data is bw 2001 and 2016, so I guess that the "history" array in your code would have to be at least of size 17*365=6205.

 

And the part 

if flag=1 then rolling_std=std(history{year-2},history{year-1},history{year})

would need to have 1440 parameters for the std function.

 

Given this new information, is it still possible to find a more efficient solution with the DATA step?

 

Thanks! 

 

 

PS: the data is indeed sorted by CODE/YEAR

mkeintz
PROC Star

Using dates, not years, won't be a problem in terms of programming.  You can easily have an array with 17*365 elements (actually 21*365 if you want to avoid problems with FLAG=1 within the first 4 years.  

 

I see two problems.

 

  1.   The easy problem - how to transparently generate lower and upper bounds for the array.  It would be nice if you could use a statement like
        array history{'01jan1997'd:'31dec2016'd) _temporary_;
    But the array statement doesn't accept date literals as bounds specification.  So use the macro facility, as in:

    %let begdate=01jan1997;
    %let enddate=31dec2016;
    %let LB=%sysfunc(inputn(&begdate,date9.));
    %let UP=%sysfunc(inputn(&enddate,date9.));

    array history {&LB:&UB} _temporary_;

  2. But the real problem is in calculating the rolling STD.  If you were to use my earlier suggestion, you would have two issues:
    1. the problem of how to compactly specify 1440 arguments to the STD function (SAS does not accept syntax like
          STD(of X{I-1440}-X{I})
      and
    2. you would be neglecting the fact that the STD for DATE i-1440 through i  (call it STD{i}) can be expressed as a function of three arguments:      STD{i-1}  ,X{i-1441}, and X{i}.  In other words, use the preceding STD, and the values of the case being dropped and the case being added.

      [Editted additional note].  You'd also need a couple other arguments:  the Sum-of-squares corresponding to STD{i-1} and the Mean corresponding to STD{i-1}. 

      This also might allow you to efficiently calculate STD for every date, not just flag=1.

 

Before I make a suggestion, are you generating

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
mkeintz
PROC Star

Let's assume dataset HAVE has three variables: ID,DATE, and X, sorted by ID/DATE, where DATE is a SAS data value that ranges from 01jan2001 through 31dec2016.   You want STD of daily values for rolling 4 year windows.

 

Take advantage of the fact that

 

      STD(x)   = SQRT{  (1/(n-1)   [SUM(X{i}**2)  - N*(mean(x)**2  ]   }

 

which means you can keep track of rolling stats such as rolling sums-of-squares and rolling_sums.  With each new observation you can add to rolling 4-year sums of squares and rolling 4-year sums.  And you can delete any newly bypassed old observations from those sums:

 

This program is untested:

%let begdate=01jan2001;
%let enddate=31dec2016;

%let LB=%sysfunc(intnx(year,%sysfunc(inputn(&begdate,date9.)),-4,sameday));
%let UB=%sysfunc(inputn(&enddate,date9.));


data want (drop=_:);
  set have (keep=id date x);
  by id;

  retain _rolling_ss   /* Rolling sum of squares*/
         _rolling_sum  /* Rolling sum */
         _rolling_n    /* Rolling count */
; array history {&lb:&ub} _temporary_; if first.id then call missing(of history{*},of _rolling:); history{date}=x; if x^=. then do; _rolling_n+ 1; _rolling_sum+ x; _rolling_ss+ x**2; end; _most_recent_drop_date=intnx('year',date,-4,'sameday'); _prior_most_recent_dd=lag(_most_recent_drop_date); if first.id=0 then do _drop_date=_prior_most_recent_dd+1 to _most_recent_drop_date; if history{_drop_date} ^= . then do; _rolling_n+ (-1); _rolling_sum+ (-history{_drop_date}); _rolling_ss+ (-history{_drop_date}**2); end; end; _rolling_mean=_rolling_sum/_rolling_n; if _rolling_n>1 then rolling_std=sqrt( ( _rolling_ss - _rolling_n*(_rolling_mean**2))/(_rolling_n-1) ); run;

 

 

 

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
ilikesas
Barite | Level 11

Hi mkeintz,

 

I tried running the code but got many errors. I just changed id to cusip and x to prc

 

array history {&lb:&ub} _temporary_;
                     -       -
                     22      200
                     200
ERROR 22-322: Syntax error, expecting one of the following: a name, a quoted string,
              a numeric constant, a datetime constant, a missing value, (, -, :, ;, _ALL_,
              _CHARACTER_, _CHAR_, _NUMERIC_.

ERROR 200-322: The symbol is not recognized and will be ignored.

411!   array history {&lb:&ub} _temporary_;
                             -
                             22
ERROR 22-322: Syntax error, expecting one of the following: a quoted string, a numeric constant,
              a datetime constant, a missing value, iterator, (, ), ','.

412
413    if first.cusip then call missing(of history{*});
ERROR: Undeclared array referenced: history.
ERROR: The ARRAYNAME[*] specification requires an array.
414
415    history{date}=prc;
ERROR: Undeclared array referenced: history.
ERROR: Variable history has not been declared as an array.
416
417    if prc^=. then do;
418      _rolling_n+ 1;
419      _rolling_sum+ x;
420      _rolling_ss+ x**2;
421    end;
422
423    _most_recent_drop_date=intnx('year',date,-4,'sameday');
424    _prior_most_recent_dd=lag(_most_recent_drop_date)+1;
425
426    if first.cusip=0 then do _drop_date=_prior_most_recent_dd+1 to _most_recent_drop_date;
427      if history{_drop_date} ^= . then do;
ERROR: Undeclared array referenced: history.
ERROR: Variable history has not been declared as an array.
428        _rolling_n+ (-1);
429        _rolling_sum+ (-history{_drop_date});
ERROR: Undeclared array referenced: history.
ERROR: Variable history has not been declared as an array.
430        _rolling_ss+ (-history{_drop_date}**2);
ERROR: Undeclared array referenced: history.
ERROR: Variable history has not been declared as an array.
431      end;
432    end;
mkeintz
PROC Star

Consider what "the symbol is not recognized and will be ignored" must mean.  In your case it means the macrovars (macro symbols) LB and UB are not defined.  You did not show the macro statement that I include assigning values to them, so that's your first task.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
ilikesas
Barite | Level 11

In fact I did include them the first time (it just didn't show on what I pasted)

 

%let begdate=01jan2001;
%let enddate=31dec2016;

%let LB=%sysfunc(intnx(year,%sysfunc(inputn(&begdate,date9.)),-4,sameday));
%let UB=%sysfunc(inputn(&enddate,date9.));

But I still get the error messages.

mkeintz
PROC Star

Then it's time to debug.  Issue the

 

%put _user_;

 

statement after the macrovar assignments, to confirm that they were created as intended.

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
ilikesas
Barite | Level 11

I did, and  I see

 

GLOBAL BEGDATE 01jan2001

 

GLOBAL ENDDATE 31dec2016

 

 

 

Is it ok if I attach to you a small sample of my data?

mkeintz
PROC Star

And what values do you get for macrovars  LB and UB?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
ilikesas
Barite | Level 11

here is what I get:

 

46   %put _user_;
GLOBAL LB 13515
GLOBAL BEGDATE 01jan2001
GLOBAL UB 20819
GLOBAL ENDDATE 31dec2016

so apparently LB and UB are in number form instead of date form.

In case you are interested, here I attach a sample of my data that contains 2 cusips (companies) and their daily prices.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 21 replies
  • 1910 views
  • 13 likes
  • 3 in conversation