BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
costasRO
Fluorite | Level 6

Hi All,

 

I have some data that look like this:

 

year   month      Company     x

1970     1             A                 0.01

1970      1            B                 0.02

1970      1            C                  0.03

.  .. 

1970     2             A                   0.05

1970     2             B                   0.04

 

I would like to calculate the moving  90th percentile (p90) of x, across all companies and dates, using 15 years of lagged data.

 

So,  the variable I want  would start having non-missing values in 1/1985, and always be calculated from the distribution of 15 years of past data, i.e., the value in 1/1985 would be based on the distribution from 1/1970-12/1984, the value in 2/1985 would be based on 2/1970-1/1985, etc. 

 

Thank you very much in advance.

 

Costas

 

1 ACCEPTED SOLUTION

Accepted Solutions
Reeza
Super User

@costasRO wrote:

Hi All,

 

I have some data that look like this:

 

year   month      Company     x

1970     1             A                 0.01

1970      1            B                 0.02

1970      1            C                  0.03

.  .. 

1970     2             A                   0.05

1970     2             B                   0.04

 

I would like to calculate the moving  90th percentile (p90) of x, across all companies and dates, using 15 years of lagged data.

 

So,  the variable I want  would start having non-missing values in 1/1985, and always be calculated from the distribution of 15 years of past data, i.e., the value in 1/1985 would be based on the distribution from 1/1970-12/1984, the value in 2/1985 would be based on 2/1970-1/1985, etc. 

 

Thank you very much in advance.

 

Costas

 


data have;
set sashelp.stocks;
run;

*sort for faster processing and add index;
proc sort data=have out=have (index=(date));
by date;
run;


options mprint symbolgen;
%macro rolling_p90(ds_in = , window = , ds_out=);

*select minimum date and maximum date;
proc sql noprint;
select min(date) into :min_date
from &ds_in.;
select max(date) into :max_date
from &ds_in.;
quit;

*determine number of months between start and end date;
%let n_months = %sysfunc(intck(month, &min_date, &max_date));

%put &n_months.;
%put &window.;

*start of loop;

%do i=1 %to %eval(&n_months - &window.);

*get start of period;
%let period_start = %sysfunc(intnx(month, &min_date., %eval(&i-1), b));
%put &period_start;
*calculate end of period;
%let period_end = %sysfunc(intnx(month, &period_start., &window.-1, e));

*calculate summary statistics;
proc means data=&ds_in. noprint nway;
where date between  &period_start. and &period_end.;
var open;
output out=summary p90=p90;
run;

*add in date window for comparison;
data summary;
set summary;
period_start = &period_start;
period_end = &period_end.;
format period_: date9.;
run;

*append to main data set;
proc append base=&ds_out data=summary force;
run;

*drop table between loops;
proc sql;
drop table summary;
quit;

%end;

%mend;

%rolling_p90(ds_in = have , window = 24 , ds_out= want);

View solution in original post

13 REPLIES 13
PaigeMiller
Diamond | Level 26

A simple search of the SAS Community finds a bunch of possible solutions, such as 

https://communities.sas.com/t5/SAS-Procedures/Computation-of-backward-moving-percentiles/td-p/129651

 

I leave it up to you to find others.

--
Paige Miller
costasRO
Fluorite | Level 6

I have tried this code without success as I do not have a set number of lags (here 252) because the number of companies changes. So the code must determine the lags based on time.

PaigeMiller
Diamond | Level 26

You can use this method in which thousands of regressions are performed (scroll down to the section entitled "The BY way for many models"). Instead of running PROC REG, use PROC UNIVARIATE to compute the desired percentiles. 

--
Paige Miller
ballardw
Super User

@costasRO wrote:

I have tried this code without success as I do not have a set number of lags (here 252) because the number of companies changes. So the code must determine the lags based on time.


One way might be to make sure all the companies are present for all the time periods. Missing values for the value of the variable(s) you are calculating statistics for should be handled correctly. You would have to calculate the lag value I very much doubt your data matches the OP in that other thread. The magic number 252 was based on his data.

Reeza
Super User
You probably want to create a macro then that filters based on the date and loops across the dates.

First combine year/month to create a date variable. If your data set is large, I'd suggest creating an index on that variable.
Then filter based on dates, 15*12 = 180 month using a WHERE statement, and calculate your stats using PROC MEANS/SUMMARY/UNIVARIATE.

Once that is accomplished, you can use a macro loop to iterate through and do all your intervals and combine the results. In this method the number of companies is irrelevant.
PaigeMiller
Diamond | Level 26

@Reeza wrote:
You probably want to create a macro ...

Why not BY groups, and no macro at all?

--
Paige Miller
Reeza
Super User
Because this screams WRDS data that is big and for BY group and moving stats you typically need to replicate the data as well unless I'm missing some trick for BY groups handling moving windows.

PaigeMiller
Diamond | Level 26

Well, that is the BY group trick for rolling statistics, and I guess the best method depends on how big the data set really is.

 

One thing that is not clear in the description of the problem from @costasRO is exactly what dimensions are we doing the rolling in. Do we want rolling statistics within a company? Or do we perform rolling statistics across all companies and time periods? Or something else?

--
Paige Miller
Reeza
Super User
From what I understood it's across all companies, monthly rolling for essentially 180 month window.

They've stated that they can't use arrays as it's across all companies so its not within companies. SASHELP.STOCKS it a fairly similar structure I would guess based on what they've posted so far, I just don't have time to code a solution.

Something roughly like this would be my guess - the macro loop would need to be adjusted.
https://gist.github.com/statgeek/e5e43ff45a4ba1f64d0873ff3bc35974
PaigeMiller
Diamond | Level 26

Ok, that's a good example of computing rolling statistics. I'm not convinced it would be faster than the BY variable approach, but unless I choose to program both approaches up, I am not certain which would be faster.

--
Paige Miller
Reeza
Super User

@costasRO wrote:

Hi All,

 

I have some data that look like this:

 

year   month      Company     x

1970     1             A                 0.01

1970      1            B                 0.02

1970      1            C                  0.03

.  .. 

1970     2             A                   0.05

1970     2             B                   0.04

 

I would like to calculate the moving  90th percentile (p90) of x, across all companies and dates, using 15 years of lagged data.

 

So,  the variable I want  would start having non-missing values in 1/1985, and always be calculated from the distribution of 15 years of past data, i.e., the value in 1/1985 would be based on the distribution from 1/1970-12/1984, the value in 2/1985 would be based on 2/1970-1/1985, etc. 

 

Thank you very much in advance.

 

Costas

 


data have;
set sashelp.stocks;
run;

*sort for faster processing and add index;
proc sort data=have out=have (index=(date));
by date;
run;


options mprint symbolgen;
%macro rolling_p90(ds_in = , window = , ds_out=);

*select minimum date and maximum date;
proc sql noprint;
select min(date) into :min_date
from &ds_in.;
select max(date) into :max_date
from &ds_in.;
quit;

*determine number of months between start and end date;
%let n_months = %sysfunc(intck(month, &min_date, &max_date));

%put &n_months.;
%put &window.;

*start of loop;

%do i=1 %to %eval(&n_months - &window.);

*get start of period;
%let period_start = %sysfunc(intnx(month, &min_date., %eval(&i-1), b));
%put &period_start;
*calculate end of period;
%let period_end = %sysfunc(intnx(month, &period_start., &window.-1, e));

*calculate summary statistics;
proc means data=&ds_in. noprint nway;
where date between  &period_start. and &period_end.;
var open;
output out=summary p90=p90;
run;

*add in date window for comparison;
data summary;
set summary;
period_start = &period_start;
period_end = &period_end.;
format period_: date9.;
run;

*append to main data set;
proc append base=&ds_out data=summary force;
run;

*drop table between loops;
proc sql;
drop table summary;
quit;

%end;

%mend;

%rolling_p90(ds_in = have , window = 24 , ds_out= want);
costasRO
Fluorite | Level 6

Dear @Reeza 

Indeed this is WRDS data, and I have thousands of companies per year-month.

The macro below works perfectly.

Thank you very much!

 

 

 

KachiM
Rhodochrosite | Level 12

Assuming there are no missing values of X, and that Companies are ignored for the computation, we can use sashelp.stocks as an example data set. Ksharp's code is adapted and just for verification a window size of 4 is used and PCTL = 50 (median) is computed. You could modify to 180 for window and  90 for PCTL.

data stocks;
	set sashelp.stocks (keep =  stock date close);
run;

%let window_size = 4;

data want;
	set stocks;
	array k[0:%eval(&window_size. -1)] _temporary_;
	k[mod(_n_, &window_size)] = close;
	pct_50 = pctl(50,of k[*]);
	if _n_ < &window_size then pct_50 = .; 
run;

Hope this helps.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 13 replies
  • 1309 views
  • 0 likes
  • 5 in conversation