- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi all,
I have a sample of 40 observations. The plan is to:
- Take the first 20 observations (i.e., obs 1 to 20).
- Run a regression
- Keep the result
- Take the next 20 observation (i.e., obs 2 to 21).
- Repeat until you reach the maximum (i.e., obs 21 to obs 40).
Any help will be greatly appreciated. If possible, I would like to keep the line as little as possible as I have quite a lot of observations.
Thanks in advance.
Best,
David
data have; input x1 x2 y; datalines; 69 0.901225752 0.535917754 26 0.979065064 0.328516725 21 0.462043623 0.928984209 70 0.372254307 0.020944067 26 0.70178162 0.015753384 22 0.755669857 0.287674489 45 0.621848244 0.876177263 23 0.049950103 0.524321342 75 0.323187042 0.458245799 62 0.988737694 0.528055761 59 0.047655873 0.637754194 59 0.499943648 0.997049298 25 0.597043632 0.556921095 92 0.679854964 0.433165676 29 0.686543133 0.720057785 35 0.861474371 0.344762962 79 0.591574823 0.876250534 53 0.741976868 0.430488835 71 0.146723447 0.605425714 75 0.435811738 0.861299126 96 0.785859466 0.717045755 99 0.520055579 0.784497337 58 0.394083913 0.712998746 61 0.875958722 0.667760878 15 0.561292166 0.836592031 56 0.652887463 0.236309836 9 0.344428535 0.224147253 6 0.785126736 0.435528686 72 0.847421203 0.160322663 78 0.663642034 0.466674263 83 0.333921138 0.029331531 77 0.486751999 0.706800698 92 0.801502963 0.788823137 9 0.578615463 0.347239912 41 0.053458445 0.094070384 68 0.251134914 0.103399917 2 0.49184594 0.102494486 97 0.120083405 0.055981267 78 0.5423958 0.181672902 17 0.834739212 0.142969285 ; run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You want rolling window regressions, with window size of 20.
Before I get into the code for doing this, how many regressions do you expect to produce. For 1,000 observations, you are asking for 981 regressions. So you better decide how you want the output of those regressions produced (I don't think you want 981 regression reports in the form of printouts).
Here is code to produce a data set of two windows (1-20 and 2-21), identified as windows W=1 and W=2 respectively:
%let window_size=20;
data have;
input x1 x2 y;
datalines;
69 0.901225752 0.535917754
26 0.979065064 0.328516725
21 0.462043623 0.928984209
70 0.372254307 0.020944067
26 0.70178162 0.015753384
22 0.755669857 0.287674489
45 0.621848244 0.876177263
23 0.049950103 0.524321342
75 0.323187042 0.458245799
62 0.988737694 0.528055761
59 0.047655873 0.637754194
59 0.499943648 0.997049298
25 0.597043632 0.556921095
92 0.679854964 0.433165676
29 0.686543133 0.720057785
35 0.861474371 0.344762962
79 0.591574823 0.876250534
53 0.741976868 0.430488835
71 0.146723447 0.605425714
75 0.435811738 0.861299126
96 0.785859466 0.717045755
99 0.520055579 0.784497337
58 0.394083913 0.712998746
61 0.875958722 0.667760878
15 0.561292166 0.836592031
56 0.652887463 0.236309836
9 0.344428535 0.224147253
6 0.785126736 0.435528686
72 0.847421203 0.160322663
78 0.663642034 0.466674263
83 0.333921138 0.029331531
77 0.486751999 0.706800698
92 0.801502963 0.788823137
9 0.578615463 0.347239912
41 0.053458445 0.094070384
68 0.251134914 0.103399917
2 0.49184594 0.102494486
97 0.120083405 0.055981267
78 0.5423958 0.181672902
17 0.834739212 0.142969285
run;
data rolling_windows (drop=i) / view=rolling_windows;
array _y {&window_size} _temporary_;
array _x1 {&window_size} _temporary_;
array _x2 {&window_size} _temporary_;
set have ;
_y{mod(_n_,&window_size)+1}=y;
_x1{mod(_n_,&window_size)+1}=x1;
_x2{mod(_n_,&window_size)+1}=x2;
if _n_>=20;
W+1 ; /*window number*/
do i=1 to &window_size;
y=_y{i};
x1=_x1{i};
x2=_x2{i};
output;
end;
if w=2 then stop;
run;
Take a look at the resulting data set view, and decide what PROC REG can do for you that you want for the 2 windows, and what format you want to see the results of PROC REG (i.e. what should the resulting dataset look like). Then one can construct some sort of
proc reg data=rolling_windows ..... ;
by W;
model ... ;
... other proc reg statements .... ;
run;
Run a few of the above with PROC REG statements you determine that you need. Then get back to us.
BTW, the larger the window size, the longer this takes. Even though there are maringally fewer windows, the PROC REG has to calculate SSCP for more observations - meaning each observation enters into more proc regs. One could consider generating rolling SSCP's, but I wouldn't recommend that at this poin.
Edtted note: Oh yes, is this a single series? Or is there some sort of ID that changes over the course of the data stream?
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Your favorite internet search engine finds lots of hits for "Rolling window regression in SAS"
Code has already been written for this, so you don't have to write it yourself.
Paige Miller
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi @DavidLie ,
You didn't specify but I assume the output of each loop is the regression coefficients and fit statistics.
Please see below an example macro which takes the input data and preferred step size to create a loop of regressions which get collapsed into a single performance table. Is that what you were after?
*Create Data;
data have;
input x1 x2 y ;
datalines;
69 0.901225752 0.535917754
26 0.979065064 0.328516725
21 0.462043623 0.928984209
70 0.372254307 0.020944067
26 0.70178162 0.015753384
22 0.755669857 0.287674489
45 0.621848244 0.876177263
23 0.049950103 0.524321342
75 0.323187042 0.458245799
62 0.988737694 0.528055761
59 0.047655873 0.637754194
59 0.499943648 0.997049298
25 0.597043632 0.556921095
92 0.679854964 0.433165676
29 0.686543133 0.720057785
35 0.861474371 0.344762962
79 0.591574823 0.876250534
53 0.741976868 0.430488835
71 0.146723447 0.605425714
75 0.435811738 0.861299126
96 0.785859466 0.717045755
99 0.520055579 0.784497337
58 0.394083913 0.712998746
61 0.875958722 0.667760878
15 0.561292166 0.836592031
56 0.652887463 0.236309836
9 0.344428535 0.224147253
6 0.785126736 0.435528686
72 0.847421203 0.160322663
78 0.663642034 0.466674263
83 0.333921138 0.029331531
77 0.486751999 0.706800698
92 0.801502963 0.788823137
9 0.578615463 0.347239912
41 0.053458445 0.094070384
68 0.251134914 0.103399917
2 0.49184594 0.102494486
97 0.120083405 0.055981267
78 0.5423958 0.181672902
17 0.834739212 0.142969285
;
run;
*Define Macro;
%macro regressionLoop(input,step,out);
%let tblin=work.have;
*add rownumber to input table;
data indx;
set &input;
rownum+1;
run;
*get row count;
proc sql noprint; select count(*) into :numRecs from &input;quit;
%let stepping = %eval(&step - 1);
*derive query parameters;
data ranges(keep=start_num end_num);
cnt = &step;
do while(cnt lt &numRecs+1);
start_num+1;
end_num = start_num + &stepping;
cnt+1;
output;
end;
run;
*get number of loops needed by step;
proc sql noprint; select count(*) into :numQueries from ranges;quit;
%put("starting loop");
%put("number of loops: &numQueries");
*loop build regressions;
%do i=1 %to &numQueries;
%put("loop number &i");
*get upper limit;
proc sql noprint; select end_num into :upper from ranges where start_num = &&i.;quit;
*get dataset based on range;
proc sql noprint; create table regIn as
select x1,x2,y from indx where rownum between &&i. and &upper.;quit;
*generate regression;
proc reg data=regIn alpha=0.05 outest=outest noprint;
model y=x1 x2/ aic bic edf;
run;
quit;
data reg_&&i.;
set outest;
_MODEL_ = "Range &&i. - %cmpres(&upper.)";
run;
%end;
%put("ending loop");
*collect reg tables into output table;
data &out;
set reg_1 - reg_%cmpres(&numQueries.);
run;
%mend regressionLoop;
*Call Macro;
%regressionLoop(have,20,out_stats);
Output looks something like this:
Thanks
Harry
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@DavidLie wrote:
Hi all,
I have a sample of 40 observations. The plan is to:
- Take the first 20 observations (i.e., obs 1 to 20).
- Run a regression
- Keep the result
- Take the next 20 observation (i.e., obs 2 to 21).
- Repeat until you reach the maximum (i.e., obs 21 to obs 40).
Any help will be greatly appreciated. If possible, I would like to keep the line as little as possible as I have quite a lot of observations.
Thanks in advance.
Best,
Daviddata have; input x1 x2 y; datalines; 69 0.901225752 0.535917754 26 0.979065064 0.328516725 21 0.462043623 0.928984209 70 0.372254307 0.020944067 26 0.70178162 0.015753384 22 0.755669857 0.287674489 45 0.621848244 0.876177263 23 0.049950103 0.524321342 75 0.323187042 0.458245799 62 0.988737694 0.528055761 59 0.047655873 0.637754194 59 0.499943648 0.997049298 25 0.597043632 0.556921095 92 0.679854964 0.433165676 29 0.686543133 0.720057785 35 0.861474371 0.344762962 79 0.591574823 0.876250534 53 0.741976868 0.430488835 71 0.146723447 0.605425714 75 0.435811738 0.861299126 96 0.785859466 0.717045755 99 0.520055579 0.784497337 58 0.394083913 0.712998746 61 0.875958722 0.667760878 15 0.561292166 0.836592031 56 0.652887463 0.236309836 9 0.344428535 0.224147253 6 0.785126736 0.435528686 72 0.847421203 0.160322663 78 0.663642034 0.466674263 83 0.333921138 0.029331531 77 0.486751999 0.706800698 92 0.801502963 0.788823137 9 0.578615463 0.347239912 41 0.053458445 0.094070384 68 0.251134914 0.103399917 2 0.49184594 0.102494486 97 0.120083405 0.055981267 78 0.5423958 0.181672902 17 0.834739212 0.142969285 ; run;
You want to create a data set (or view) with each observation group identified with a variable that can be used in a BY statement for the stat proc you want to call. That will be more efficient that looping the whole process as you describe above.
data have;
infile cards expandtabs;
_obs_=_n_;
input x1 x2 y;
datalines;
69 0.901225752 0.535917754
26 0.979065064 0.328516725
21 0.462043623 0.928984209
70 0.372254307 0.020944067
26 0.70178162 0.015753384
22 0.755669857 0.287674489
45 0.621848244 0.876177263
23 0.049950103 0.524321342
75 0.323187042 0.458245799
62 0.988737694 0.528055761
59 0.047655873 0.637754194
59 0.499943648 0.997049298
25 0.597043632 0.556921095
92 0.679854964 0.433165676
29 0.686543133 0.720057785
35 0.861474371 0.344762962
79 0.591574823 0.876250534
53 0.741976868 0.430488835
71 0.146723447 0.605425714
75 0.435811738 0.861299126
96 0.785859466 0.717045755
99 0.520055579 0.784497337
58 0.394083913 0.712998746
61 0.875958722 0.667760878
15 0.561292166 0.836592031
56 0.652887463 0.236309836
9 0.344428535 0.224147253
6 0.785126736 0.435528686
72 0.847421203 0.160322663
78 0.663642034 0.466674263
83 0.333921138 0.029331531
77 0.486751999 0.706800698
92 0.801502963 0.788823137
9 0.578615463 0.347239912
41 0.053458445 0.094070384
68 0.251134914 0.103399917
2 0.49184594 0.102494486
97 0.120083405 0.055981267
78 0.5423958 0.181672902
17 0.834739212 0.142969285
;
run;
%let s = 20; /*sample size*/
data analysis / view=analysis;;
s = &s;
f = 1;
l = s;
do w = 1 by 1 while(l le nobs);
do i = f to l while(l le nobs);
set have nobs=nobs point=i;
output;
end;
f = f+1;
l = l+1;
end;
call symputx('w',w-1);
stop;
run;
%put NOTE: &=w;
sasfile have load;
proc reg noprint data=analysis outest=est;
by w;
model y = x: / aic bic edf;
run;
sasfile have close;
proc print data=est;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
The downside of the SET POINT=, in my experience, is that it can generate a lot more input/ouput activity than necessary.
Each obs in this case would be read in 20 times. The actual input activity would be mitigated of course by operating system caching. But window size 90? Then read each obs 90 times. At some point this will have a major impact on performance.
In anticipation of that possibility, I would STONGLY recommend use of a memory-resident technique, so that each obs is read in once, but written out 20 times. That's the main purpose of suggesting _TEMPORARY_ arrays to store the rolling series of X1, X2, and Y.
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@mkeintz This is why I used SASFILE to load HAVE into memory.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@data_null__ wrote:
@mkeintz This is why I used SASFILE to load HAVE into memory.
Oops.
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------