Hi All,
I have some data that look like this:
year month Company x
1970 1 A 0.01
1970 1 B 0.02
1970 1 C 0.03
. ..
1970 2 A 0.05
1970 2 B 0.04
I would like to calculate the moving 90th percentile (p90) of x, across all companies and dates, using 15 years of lagged data.
So, the variable I want would start having non-missing values in 1/1985, and always be calculated from the distribution of 15 years of past data, i.e., the value in 1/1985 would be based on the distribution from 1/1970-12/1984, the value in 2/1985 would be based on 2/1970-1/1985, etc.
Thank you very much in advance.
Costas
@costasRO wrote:
Hi All,
I have some data that look like this:
year month Company x
1970 1 A 0.01
1970 1 B 0.02
1970 1 C 0.03
. ..
1970 2 A 0.05
1970 2 B 0.04
I would like to calculate the moving 90th percentile (p90) of x, across all companies and dates, using 15 years of lagged data.
So, the variable I want would start having non-missing values in 1/1985, and always be calculated from the distribution of 15 years of past data, i.e., the value in 1/1985 would be based on the distribution from 1/1970-12/1984, the value in 2/1985 would be based on 2/1970-1/1985, etc.
Thank you very much in advance.
Costas
data have;
set sashelp.stocks;
run;
*sort for faster processing and add index;
proc sort data=have out=have (index=(date));
by date;
run;
options mprint symbolgen;
%macro rolling_p90(ds_in = , window = , ds_out=);
*select minimum date and maximum date;
proc sql noprint;
select min(date) into :min_date
from &ds_in.;
select max(date) into :max_date
from &ds_in.;
quit;
*determine number of months between start and end date;
%let n_months = %sysfunc(intck(month, &min_date, &max_date));
%put &n_months.;
%put &window.;
*start of loop;
%do i=1 %to %eval(&n_months - &window.);
*get start of period;
%let period_start = %sysfunc(intnx(month, &min_date., %eval(&i-1), b));
%put &period_start;
*calculate end of period;
%let period_end = %sysfunc(intnx(month, &period_start., &window.-1, e));
*calculate summary statistics;
proc means data=&ds_in. noprint nway;
where date between &period_start. and &period_end.;
var open;
output out=summary p90=p90;
run;
*add in date window for comparison;
data summary;
set summary;
period_start = &period_start;
period_end = &period_end.;
format period_: date9.;
run;
*append to main data set;
proc append base=&ds_out data=summary force;
run;
*drop table between loops;
proc sql;
drop table summary;
quit;
%end;
%mend;
%rolling_p90(ds_in = have , window = 24 , ds_out= want);
A simple search of the SAS Community finds a bunch of possible solutions, such as
https://communities.sas.com/t5/SAS-Procedures/Computation-of-backward-moving-percentiles/td-p/129651
I leave it up to you to find others.
I have tried this code without success as I do not have a set number of lags (here 252) because the number of companies changes. So the code must determine the lags based on time.
You can use this method in which thousands of regressions are performed (scroll down to the section entitled "The BY way for many models"). Instead of running PROC REG, use PROC UNIVARIATE to compute the desired percentiles.
@costasRO wrote:
I have tried this code without success as I do not have a set number of lags (here 252) because the number of companies changes. So the code must determine the lags based on time.
One way might be to make sure all the companies are present for all the time periods. Missing values for the value of the variable(s) you are calculating statistics for should be handled correctly. You would have to calculate the lag value I very much doubt your data matches the OP in that other thread. The magic number 252 was based on his data.
@Reeza wrote:
You probably want to create a macro ...
Why not BY groups, and no macro at all?
Well, that is the BY group trick for rolling statistics, and I guess the best method depends on how big the data set really is.
One thing that is not clear in the description of the problem from @costasRO is exactly what dimensions are we doing the rolling in. Do we want rolling statistics within a company? Or do we perform rolling statistics across all companies and time periods? Or something else?
Ok, that's a good example of computing rolling statistics. I'm not convinced it would be faster than the BY variable approach, but unless I choose to program both approaches up, I am not certain which would be faster.
@costasRO wrote:
Hi All,
I have some data that look like this:
year month Company x
1970 1 A 0.01
1970 1 B 0.02
1970 1 C 0.03
. ..
1970 2 A 0.05
1970 2 B 0.04
I would like to calculate the moving 90th percentile (p90) of x, across all companies and dates, using 15 years of lagged data.
So, the variable I want would start having non-missing values in 1/1985, and always be calculated from the distribution of 15 years of past data, i.e., the value in 1/1985 would be based on the distribution from 1/1970-12/1984, the value in 2/1985 would be based on 2/1970-1/1985, etc.
Thank you very much in advance.
Costas
data have;
set sashelp.stocks;
run;
*sort for faster processing and add index;
proc sort data=have out=have (index=(date));
by date;
run;
options mprint symbolgen;
%macro rolling_p90(ds_in = , window = , ds_out=);
*select minimum date and maximum date;
proc sql noprint;
select min(date) into :min_date
from &ds_in.;
select max(date) into :max_date
from &ds_in.;
quit;
*determine number of months between start and end date;
%let n_months = %sysfunc(intck(month, &min_date, &max_date));
%put &n_months.;
%put &window.;
*start of loop;
%do i=1 %to %eval(&n_months - &window.);
*get start of period;
%let period_start = %sysfunc(intnx(month, &min_date., %eval(&i-1), b));
%put &period_start;
*calculate end of period;
%let period_end = %sysfunc(intnx(month, &period_start., &window.-1, e));
*calculate summary statistics;
proc means data=&ds_in. noprint nway;
where date between &period_start. and &period_end.;
var open;
output out=summary p90=p90;
run;
*add in date window for comparison;
data summary;
set summary;
period_start = &period_start;
period_end = &period_end.;
format period_: date9.;
run;
*append to main data set;
proc append base=&ds_out data=summary force;
run;
*drop table between loops;
proc sql;
drop table summary;
quit;
%end;
%mend;
%rolling_p90(ds_in = have , window = 24 , ds_out= want);
Dear @Reeza
Indeed this is WRDS data, and I have thousands of companies per year-month.
The macro below works perfectly.
Thank you very much!
Assuming there are no missing values of X, and that Companies are ignored for the computation, we can use sashelp.stocks as an example data set. Ksharp's code is adapted and just for verification a window size of 4 is used and PCTL = 50 (median) is computed. You could modify to 180 for window and 90 for PCTL.
data stocks;
set sashelp.stocks (keep = stock date close);
run;
%let window_size = 4;
data want;
set stocks;
array k[0:%eval(&window_size. -1)] _temporary_;
k[mod(_n_, &window_size)] = close;
pct_50 = pctl(50,of k[*]);
if _n_ < &window_size then pct_50 = .;
run;
Hope this helps.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.
Find more tutorials on the SAS Users YouTube channel.