10242017 08:11 PM  edited 10242017 10:29 PM
Hi,
based on Ksharp's answer in https://communities.sas.com/t5/SASProcedures/rollingstandarddeviationcalculation/mp/160027#M417... I want to calculate the rolling standard deviation but only for certain observations  the ones that have flag=1:
data have;
infile cards dlm=',' truncover;
input code year x flag;
cards;
255956,1980,,
255956,1981,,
255956,1982,,
255956,1983,,
255956,1984,5,
255956,1985,7,
255956,1986,4,
255956,1987,6,1
255956,1988,2,
255956,1989,1,
255964,1980,5,
255964,1981,7,
255964,1982,,
255964,1983,,
255964,1984,5,
255964,1985,3,
255964,1986,7,
255964,1987,3,
255964,1988,8,1
255964,1989,7,
;
run;
When I did the following code (also based on Ksharp's code):
proc sql;
create table want as
select *,(select std(x) from have where year between a.year2 and a.year and code=a.code) as rolling_std
from have as a
where flag =1;
quit;
I obtained a data set containing only 2 observations. Indeed these are the observations where flag=1, but I would like to keep the original data as well. Also, I assume that here the rolling standard deviation was calculated for all observations where possible, but since my data is very big, it might be helpful to calculate the rolling standard deviation ONLY for the observations that have flag=1.
Thank you!
10252017 06:15 AM
I think that what you want to do is this:
proc sql; create table want as select *, case flag when 1 then (select std(x) from have where year between a.year2 and a.year and code=a.code) else . end as rolling_std from have as a; quit;
So that you get all observations, but only calculate the standard deviation when Flag=1.
10252017 06:15 AM
I think that what you want to do is this:
proc sql; create table want as select *, case flag when 1 then (select std(x) from have where year between a.year2 and a.year and code=a.code) else . end as rolling_std from have as a; quit;
So that you get all observations, but only calculate the standard deviation when Flag=1.
10252017 04:53 PM
Hi s_lassen,
thank you for the code!
As I said to mkeint below, in reality I have daily stock price data and I need the standard deviation over the past 4 years.
I ran your code for a sample of my data, and it took a lot of time. Is it possible to make the code run faster?
Thanks!
11012017 08:42 AM
If you do not have any missing dates in your table, you can calculate the standard deviation for 1461 days, (4 years exactly, unless you are around year 1900 or 2100) like this:
data want; set have; by id; array values(0:1460) 8 _temporary_; if first.id then call missing(of values(*)); values(mod(_N_,1461))=x; if flag=1 then rolling_std=std(of values(*)); run;
If some days are missing in your table, it becomes more complicated. One way is to use SET with POINT= to find the values that you want to delete from the array:
data want; set have; by id; array values(0:1460) 8 _temporary_; if first.id then do; call missing(of values(*)); _p_first=_N_; _date_first=date; retain _p_first _date_first; drop _p_first _date_first; end; values(mod(_N_,1461))=x; do while(_date_first<date1460); _p_first+1; set have(keep=date rename=(date=_date_first)) point=_p_first; call missing(values(mod(_p_first,1461))); end; if flag=1 then rolling_std=std(of values(*)); run;
Notice that I zerobased the array so that the MOD function can be used to calculate an index directly.
While this approach may be a bit slower than calculating the sum, N, and sumofsquares, it is also a lot simpler.
10252017 10:17 AM
I understand the appeal of using PROC SQL to do this code, but using expressions like " where year between a.year2 and a.year and code=a.code" will introduce a lot of background record comparisons. This is mostly unnecessary since it fails to take advantage of the apparent order of your data, which seem to be sorted by CODE/YEAR. It's likely that a DATA step like this will be better:
data have;
infile cards dlm=',' dsd truncover;
input code year x flag;
cards;
255956,1980,,
255956,1981,,
255956,1982,,
255956,1983,,
255956,1984,5,
255956,1985,7,
255956,1986,4,
255956,1987,6,1
255956,1988,2,
255956,1989,1,
255964,1980,5,
255964,1981,7,
255964,1982,,
255964,1983,,
255964,1984,5,
255964,1985,3,
255964,1986,7,
255964,1987,3,
255964,1988,8,1
255964,1989,7,
;
run;
data want;
set have;
by code;
array history{1978:1990} _temporary_;
if first.code then call missing(of history{*});
history{year}=x;
if flag=1 then rolling_std=std(history{year2},history{year1},history{year});
run;
This program assumes (1) data are sorted by code/year, and (2) year is always an integer between 1980 and 1990.
The lower bound of the HISTORY array is set to 1978, because the earliest expected year is 1980. This prevents the calculation of rolling_std from generating an "array subscript out of range" message is a 1980 record has flag=1.
10252017 10:41 AM  edited 10252017 10:43 AM
Hi mkeintz,
I probably should have mentioned it in the question, but in my real data instead of years I have actual dates, and I need for a specific flag date to calculate the rolling stdev for the previous 4 years, so the stdev is calculated between date and date1440 (since 365*4 = 1440).
my data is bw 2001 and 2016, so I guess that the "history" array in your code would have to be at least of size 17*365=6205.
And the part
if flag=1 then rolling_std=std(history{year2},history{year1},history{year})
would need to have 1440 parameters for the std function.
Given this new information, is it still possible to find a more efficient solution with the DATA step?
Thanks!
PS: the data is indeed sorted by CODE/YEAR
10252017 11:38 AM  edited 10252017 11:42 AM
Using dates, not years, won't be a problem in terms of programming. You can easily have an array with 17*365 elements (actually 21*365 if you want to avoid problems with FLAG=1 within the first 4 years.
I see two problems.
Before I make a suggestion, are you generating
10252017 02:45 PM  edited 10312017 10:13 PM
Let's assume dataset HAVE has three variables: ID,DATE, and X, sorted by ID/DATE, where DATE is a SAS data value that ranges from 01jan2001 through 31dec2016. You want STD of daily values for rolling 4 year windows.
Take advantage of the fact that
STD(x) = SQRT{ (1/(n1) [SUM(X{i}**2)  N*(mean(x)**2 ] }
which means you can keep track of rolling stats such as rolling sumsofsquares and rolling_sums. With each new observation you can add to rolling 4year sums of squares and rolling 4year sums. And you can delete any newly bypassed old observations from those sums:
This program is untested:
%let begdate=01jan2001;
%let enddate=31dec2016;
%let LB=%sysfunc(intnx(year,%sysfunc(inputn(&begdate,date9.)),4,sameday));
%let UB=%sysfunc(inputn(&enddate,date9.));
data want (drop=_:);
set have (keep=id date x);
by id;
retain _rolling_ss /* Rolling sum of squares*/
_rolling_sum /* Rolling sum */
_rolling_n /* Rolling count */
;
array history {&lb:&ub} _temporary_;
if first.id then call missing(of history{*},of _rolling:);
history{date}=x;
if x^=. then do;
_rolling_n+ 1;
_rolling_sum+ x;
_rolling_ss+ x**2;
end;
_most_recent_drop_date=intnx('year',date,4,'sameday');
_prior_most_recent_dd=lag(_most_recent_drop_date);
if first.id=0 then do _drop_date=_prior_most_recent_dd+1 to _most_recent_drop_date;
if history{_drop_date} ^= . then do;
_rolling_n+ (1);
_rolling_sum+ (history{_drop_date});
_rolling_ss+ (history{_drop_date}**2);
end;
end;
_rolling_mean=_rolling_sum/_rolling_n;
if _rolling_n>1 then rolling_std=sqrt( ( _rolling_ss  _rolling_n*(_rolling_mean**2))/(_rolling_n1) );
run;
10252017 04:47 PM
Hi mkeintz,
I tried running the code but got many errors. I just changed id to cusip and x to prc
array history {&lb:&ub} _temporary_;
 
22 200
200
ERROR 22322: Syntax error, expecting one of the following: a name, a quoted string,
a numeric constant, a datetime constant, a missing value, (, , :, ;, _ALL_,
_CHARACTER_, _CHAR_, _NUMERIC_.
ERROR 200322: The symbol is not recognized and will be ignored.
411! array history {&lb:&ub} _temporary_;

22
ERROR 22322: Syntax error, expecting one of the following: a quoted string, a numeric constant,
a datetime constant, a missing value, iterator, (, ), ','.
412
413 if first.cusip then call missing(of history{*});
ERROR: Undeclared array referenced: history.
ERROR: The ARRAYNAME[*] specification requires an array.
414
415 history{date}=prc;
ERROR: Undeclared array referenced: history.
ERROR: Variable history has not been declared as an array.
416
417 if prc^=. then do;
418 _rolling_n+ 1;
419 _rolling_sum+ x;
420 _rolling_ss+ x**2;
421 end;
422
423 _most_recent_drop_date=intnx('year',date,4,'sameday');
424 _prior_most_recent_dd=lag(_most_recent_drop_date)+1;
425
426 if first.cusip=0 then do _drop_date=_prior_most_recent_dd+1 to _most_recent_drop_date;
427 if history{_drop_date} ^= . then do;
ERROR: Undeclared array referenced: history.
ERROR: Variable history has not been declared as an array.
428 _rolling_n+ (1);
429 _rolling_sum+ (history{_drop_date});
ERROR: Undeclared array referenced: history.
ERROR: Variable history has not been declared as an array.
430 _rolling_ss+ (history{_drop_date}**2);
ERROR: Undeclared array referenced: history.
ERROR: Variable history has not been declared as an array.
431 end;
432 end;
10252017 04:58 PM
Consider what "the symbol is not recognized and will be ignored" must mean. In your case it means the macrovars (macro symbols) LB and UB are not defined. You did not show the macro statement that I include assigning values to them, so that's your first task.
10252017 05:05 PM
In fact I did include them the first time (it just didn't show on what I pasted)
%let begdate=01jan2001;
%let enddate=31dec2016;
%let LB=%sysfunc(intnx(year,%sysfunc(inputn(&begdate,date9.)),4,sameday));
%let UB=%sysfunc(inputn(&enddate,date9.));
But I still get the error messages.
10252017 05:16 PM
Then it's time to debug. Issue the
%put _user_;
statement after the macrovar assignments, to confirm that they were created as intended.
10252017 06:40 PM
I did, and I see
GLOBAL BEGDATE 01jan2001
GLOBAL ENDDATE 31dec2016
Is it ok if I attach to you a small sample of my data?
10262017 08:03 AM
And what values do you get for macrovars LB and UB?
10262017 03:39 PM
here is what I get:
46 %put _user_;
GLOBAL LB 13515
GLOBAL BEGDATE 01jan2001
GLOBAL UB 20819
GLOBAL ENDDATE 31dec2016
so apparently LB and UB are in number form instead of date form.
In case you are interested, here I attach a sample of my data that contains 2 cusips (companies) and their daily prices.
Need further help from the community? Please ask a new question.