SAS Regression Loop in Set of 20 Observations

DavidLie · Posted 05-18-2022 09:13 AM

Hi all,

I have a sample of 40 observations. The plan is to:

Take the first 20 observations (i.e., obs 1 to 20).
Run a regression
Keep the result
Take the next 20 observation (i.e., obs 2 to 21).
Repeat until you reach the maximum (i.e., obs 21 to obs 40).

Any help will be greatly appreciated. If possible, I would like to keep the line as little as possible as I have quite a lot of observations.

Thanks in advance.

Best,
David

data have;
input x1 x2 y;
datalines;
69	0.901225752	0.535917754
26	0.979065064	0.328516725
21	0.462043623	0.928984209
70	0.372254307	0.020944067
26	0.70178162	0.015753384
22	0.755669857	0.287674489
45	0.621848244	0.876177263
23	0.049950103	0.524321342
75	0.323187042	0.458245799
62	0.988737694	0.528055761
59	0.047655873	0.637754194
59	0.499943648	0.997049298
25	0.597043632	0.556921095
92	0.679854964	0.433165676
29	0.686543133	0.720057785
35	0.861474371	0.344762962
79	0.591574823	0.876250534
53	0.741976868	0.430488835
71	0.146723447	0.605425714
75	0.435811738	0.861299126
96	0.785859466	0.717045755
99	0.520055579	0.784497337
58	0.394083913	0.712998746
61	0.875958722	0.667760878
15	0.561292166	0.836592031
56	0.652887463	0.236309836
9	0.344428535	0.224147253
6	0.785126736	0.435528686
72	0.847421203	0.160322663
78	0.663642034	0.466674263
83	0.333921138	0.029331531
77	0.486751999	0.706800698
92	0.801502963	0.788823137
9	0.578615463	0.347239912
41	0.053458445	0.094070384
68	0.251134914	0.103399917
2	0.49184594	0.102494486
97	0.120083405	0.055981267
78	0.5423958	0.181672902
17	0.834739212	0.142969285
;
run;

mkeintz · Posted 05-18-2022 10:26 AM

You want rolling window regressions, with window size of 20.

Before I get into the code for doing this, how many regressions do you expect to produce. For 1,000 observations, you are asking for 981 regressions. So you better decide how you want the output of those regressions produced (I don't think you want 981 regression reports in the form of printouts).

Here is code to produce a data set of two windows (1-20 and 2-21), identified as windows W=1 and W=2 respectively:

%let window_size=20;
data have;
input x1 x2 y;
datalines;
69	0.901225752	0.535917754
26	0.979065064	0.328516725
21	0.462043623	0.928984209
70	0.372254307	0.020944067
26	0.70178162	0.015753384
22	0.755669857	0.287674489
45	0.621848244	0.876177263
23	0.049950103	0.524321342
75	0.323187042	0.458245799
62	0.988737694	0.528055761
59	0.047655873	0.637754194
59	0.499943648	0.997049298
25	0.597043632	0.556921095
92	0.679854964	0.433165676
29	0.686543133	0.720057785
35	0.861474371	0.344762962
79	0.591574823	0.876250534
53	0.741976868	0.430488835
71	0.146723447	0.605425714
75	0.435811738	0.861299126
96	0.785859466	0.717045755
99	0.520055579	0.784497337
58	0.394083913	0.712998746
61	0.875958722	0.667760878
15	0.561292166	0.836592031
56	0.652887463	0.236309836
9	0.344428535	0.224147253
6	0.785126736	0.435528686
72	0.847421203	0.160322663
78	0.663642034	0.466674263
83	0.333921138	0.029331531
77	0.486751999	0.706800698
92	0.801502963	0.788823137
9	0.578615463	0.347239912
41	0.053458445	0.094070384
68	0.251134914	0.103399917
2	0.49184594	0.102494486
97	0.120083405	0.055981267
78	0.5423958	0.181672902
17	0.834739212	0.142969285
run;
data rolling_windows (drop=i) / view=rolling_windows;
  array _y  {&window_size} _temporary_;
  array _x1 {&window_size} _temporary_;
  array _x2 {&window_size} _temporary_;

  set have ;
  _y{mod(_n_,&window_size)+1}=y;
  _x1{mod(_n_,&window_size)+1}=x1;
  _x2{mod(_n_,&window_size)+1}=x2;

  if _n_>=20;
  W+1 ;    /*window number*/
  do i=1 to &window_size;
    y=_y{i};
    x1=_x1{i};
    x2=_x2{i};
    output;
  end;
  if w=2 then stop;
run;

Take a look at the resulting data set view, and decide what PROC REG can do for you that you want for the 2 windows, and what format you want to see the results of PROC REG (i.e. what should the resulting dataset look like). Then one can construct some sort of

proc reg data=rolling_windows ..... ;
  by W;
  model ... ;
  ... other proc reg statements .... ;
run;

Run a few of the above with PROC REG statements you determine that you need. Then get back to us.

BTW, the larger the window size, the longer this takes. Even though there are maringally fewer windows, the PROC REG has to calculate SSCP for more observations - meaning each observation enters into more proc regs. One could consider generating rolling SSCP's, but I wouldn't recommend that at this poin.

Edtted note: Oh yes, is this a single series? Or is there some sort of ID that changes over the course of the data stream?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

PaigeMiller · Posted 05-18-2022 10:31 AM

Your favorite internet search engine finds lots of hits for "Rolling window regression in SAS"

Code has already been written for this, so you don't have to write it yourself.

--
Paige Miller

HarrySnart · Posted 05-18-2022 10:49 AM

Hi @DavidLie ,

You didn't specify but I assume the output of each loop is the regression coefficients and fit statistics.

Please see below an example macro which takes the input data and preferred step size to create a loop of regressions which get collapsed into a single performance table. Is that what you were after?

*Create Data;
data have;
input x1 x2 y ;
datalines;
69 0.901225752 0.535917754
26 0.979065064 0.328516725
21 0.462043623 0.928984209
70 0.372254307 0.020944067
26 0.70178162 0.015753384
22 0.755669857 0.287674489
45 0.621848244 0.876177263
23 0.049950103 0.524321342
75 0.323187042 0.458245799
62 0.988737694 0.528055761
59 0.047655873 0.637754194
59 0.499943648 0.997049298
25 0.597043632 0.556921095
92 0.679854964 0.433165676
29 0.686543133 0.720057785
35 0.861474371 0.344762962
79 0.591574823 0.876250534
53 0.741976868 0.430488835
71 0.146723447 0.605425714
75 0.435811738 0.861299126
96 0.785859466 0.717045755
99 0.520055579 0.784497337
58 0.394083913 0.712998746
61 0.875958722 0.667760878
15 0.561292166 0.836592031
56 0.652887463 0.236309836
9 0.344428535 0.224147253
6 0.785126736 0.435528686
72 0.847421203 0.160322663
78 0.663642034 0.466674263
83 0.333921138 0.029331531
77 0.486751999 0.706800698
92 0.801502963 0.788823137
9 0.578615463 0.347239912
41 0.053458445 0.094070384
68 0.251134914 0.103399917
2 0.49184594 0.102494486
97 0.120083405 0.055981267
78 0.5423958 0.181672902
17 0.834739212 0.142969285
;
run;

*Define Macro;
%macro regressionLoop(input,step,out);
%let tblin=work.have;

*add rownumber to input table;
data indx;
set &input;
rownum+1;
run;

*get row count;
proc sql noprint; select count(*) into :numRecs from &input;quit;

%let stepping = %eval(&step - 1);

*derive query parameters;
data ranges(keep=start_num end_num);
cnt = &step;
do while(cnt lt &numRecs+1);
start_num+1;
end_num = start_num + &stepping;
cnt+1;
output;
end;
run;

*get number of loops needed by step;
proc sql noprint; select count(*) into :numQueries from ranges;quit;
%put("starting loop");

%put("number of loops: &numQueries");

*loop build regressions;
%do i=1 %to &numQueries;
%put("loop number &i");

*get upper limit;
proc sql noprint; select end_num into :upper from ranges where start_num = &&i.;quit;

*get dataset based on range;
proc sql noprint; create table regIn as 
select x1,x2,y from indx where rownum between &&i. and &upper.;quit; 

*generate regression;
proc reg data=regIn alpha=0.05  outest=outest   noprint;
	model y=x1 x2/ aic bic edf;
	run;
quit;

data reg_&&i.;
set outest;
_MODEL_ = "Range &&i. - %cmpres(&upper.)";
run;

%end;

%put("ending loop");

*collect reg tables into output table;

data &out;
set reg_1 - reg_%cmpres(&numQueries.);
run;

%mend regressionLoop;

*Call Macro;
%regressionLoop(have,20,out_stats);

Output looks something like this:

Thanks

Harry

data_null__ · Posted 05-18-2022 01:12 PM

@DavidLie wrote:

Hi all,

I have a sample of 40 observations. The plan is to:

Take the first 20 observations (i.e., obs 1 to 20).
Run a regression
Keep the result
Take the next 20 observation (i.e., obs 2 to 21).
Repeat until you reach the maximum (i.e., obs 21 to obs 40).

Any help will be greatly appreciated. If possible, I would like to keep the line as little as possible as I have quite a lot of observations.

Thanks in advance.

Best,
David

data have;
input x1 x2 y;
datalines;
69	0.901225752	0.535917754
26	0.979065064	0.328516725
21	0.462043623	0.928984209
70	0.372254307	0.020944067
26	0.70178162	0.015753384
22	0.755669857	0.287674489
45	0.621848244	0.876177263
23	0.049950103	0.524321342
75	0.323187042	0.458245799
62	0.988737694	0.528055761
59	0.047655873	0.637754194
59	0.499943648	0.997049298
25	0.597043632	0.556921095
92	0.679854964	0.433165676
29	0.686543133	0.720057785
35	0.861474371	0.344762962
79	0.591574823	0.876250534
53	0.741976868	0.430488835
71	0.146723447	0.605425714
75	0.435811738	0.861299126
96	0.785859466	0.717045755
99	0.520055579	0.784497337
58	0.394083913	0.712998746
61	0.875958722	0.667760878
15	0.561292166	0.836592031
56	0.652887463	0.236309836
9	0.344428535	0.224147253
6	0.785126736	0.435528686
72	0.847421203	0.160322663
78	0.663642034	0.466674263
83	0.333921138	0.029331531
77	0.486751999	0.706800698
92	0.801502963	0.788823137
9	0.578615463	0.347239912
41	0.053458445	0.094070384
68	0.251134914	0.103399917
2	0.49184594	0.102494486
97	0.120083405	0.055981267
78	0.5423958	0.181672902
17	0.834739212	0.142969285
;
run;

You want to create a data set (or view) with each observation group identified with a variable that can be used in a BY statement for the stat proc you want to call. That will be more efficient that looping the whole process as you describe above.

data have;
   infile cards expandtabs;
   _obs_=_n_;
   input x1 x2 y;
   datalines;
69	0.901225752	0.535917754
26	0.979065064	0.328516725
21	0.462043623	0.928984209
70	0.372254307	0.020944067
26	0.70178162	0.015753384
22	0.755669857	0.287674489
45	0.621848244	0.876177263
23	0.049950103	0.524321342
75	0.323187042	0.458245799
62	0.988737694	0.528055761
59	0.047655873	0.637754194
59	0.499943648	0.997049298
25	0.597043632	0.556921095
92	0.679854964	0.433165676
29	0.686543133	0.720057785
35	0.861474371	0.344762962
79	0.591574823	0.876250534
53	0.741976868	0.430488835
71	0.146723447	0.605425714
75	0.435811738	0.861299126
96	0.785859466	0.717045755
99	0.520055579	0.784497337
58	0.394083913	0.712998746
61	0.875958722	0.667760878
15	0.561292166	0.836592031
56	0.652887463	0.236309836
9	0.344428535	0.224147253
6	0.785126736	0.435528686
72	0.847421203	0.160322663
78	0.663642034	0.466674263
83	0.333921138	0.029331531
77	0.486751999	0.706800698
92	0.801502963	0.788823137
9	0.578615463	0.347239912
41	0.053458445	0.094070384
68	0.251134914	0.103399917
2	0.49184594	0.102494486
97	0.120083405	0.055981267
78	0.5423958	0.181672902
17	0.834739212	0.142969285
;
run;


%let s = 20; /*sample size*/
data analysis / view=analysis;;
   s = &s;
   f = 1;
   l = s;
   do w = 1 by 1 while(l le nobs);
      do i = f to l while(l le nobs);
         set have nobs=nobs point=i;      
         output;
         end;
      f = f+1;
      l = l+1;
      end;
   call symputx('w',w-1);
   stop;
   run;
%put NOTE: &=w;
sasfile have load;
proc reg noprint data=analysis outest=est;
   by w;
   model y = x: / aic bic edf;
   run;
sasfile have close;
proc print data=est;
   run;

mkeintz · Posted 05-18-2022 02:57 PM

@data_null__

The downside of the SET POINT=, in my experience, is that it can generate a lot more input/ouput activity than necessary.

Each obs in this case would be read in 20 times. The actual input activity would be mitigated of course by operating system caching. But window size 90? Then read each obs 90 times. At some point this will have a major impact on performance.

In anticipation of that possibility, I would STONGLY recommend use of a memory-resident technique, so that each obs is read in once, but written out 20 times. That's the main purpose of suggesting _TEMPORARY_ arrays to store the rolling series of X1, X2, and Y.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

data_null__ · Posted 05-18-2022 03:02 PM

@mkeintz This is why I used SASFILE to load HAVE into memory.

mkeintz · Posted 05-18-2022 03:04 PM

@data_null__ wrote:

@mkeintz This is why I used SASFILE to load HAVE into memory.

Oops.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

SAS Regression Loop in Set of 20 Observations

Re: SAS Regression Loop in Set of 20 Observations

Re: SAS Regression Loop in Set of 20 Observations

Re: SAS Regression Loop in Set of 20 Observations

Re: SAS Regression Loop in Set of 20 Observations

Re: SAS Regression Loop in Set of 20 Observations

Re: SAS Regression Loop in Set of 20 Observations

Re: SAS Regression Loop in Set of 20 Observations

SAS Innovate 2025: Call for Content

Classroom Training Available!