Solved: Re: How do I run a do-loop by groups?

eramirez · Posted 08-19-2020 06:59 PM

Hello

I found some SAS code online that calculates a moving average (I don't have SAS/ETS). It works well, but my new data is structured vertically, with five variables: Date, Zipcode, PCT, COUNT, and TOTAL. I want to run the code over each group of zip codes, calculating a separate moving average for each, so that once it encounters 60602 a new moving average starts. I have a feeling it's something simple, but I can't think of it. Thank you.

data zips1;
  infile datalines truncover;
  input Date:DATE9. zipcode:32. pct:32. count:32. total:32.;
  format Date DATE9.;
  datalines;
01JAN2020 60601 16.666667 1 6
02JAN2020 60601 0 0 8
03JAN2020 60601 14.285714 1 7
04JAN2020 60601 0 0 5
05JAN2020 60601 0 0 7
06JAN2020 60601 0 0 8
07JAN2020 60601 0 0 6
08JAN2020 60601 0 0 8
09JAN2020 60601 20 1 5
10JAN2020 60601 0 0 6
11JAN2020 60601 0 0 8
12JAN2020 60601 0 0 4
13JAN2020 60601 0 0 8
14JAN2020 60601 0 0 10
15JAN2020 60601 0 0 9
16JAN2020 60601 25 1 4
17JAN2020 60601 0 0 4
18JAN2020 60601 0 0 6
19JAN2020 60601 0 0 4
20JAN2020 60601 0 0 6
;

data zips2;
  set zips1;
  keep date zipcode pct count total n meanxi sumxi;
  retain sumxi n;

  /* Treat a missing count as 0 and flag it as a non-observation */
  if missing(count) then do;
    obs = 0;
    count = 0.0;
  end;
  else obs = 1;

  /* Values leaving the 7-row window */
  xi7 = lag7(count);
  obs7 = lag7(obs);
  if missing(xi7) then xi7 = 0.0;
  if missing(obs7) then obs7 = 0;
  ldate = lag2(date);
  format ldate date9.;

  /* Initialize the running totals on the first row */
  if _n_ = 1 then do;
    sumxi = 0.0;
    n = 0;
  end;

  /* Add the newest value and drop the one from 7 rows back */
  sumxi = sumxi + count - xi7;
  n = n + obs - obs7;
  meanxi = sumxi / n;
run;

Reeza · Posted 08-19-2020 07:49 PM

Here’s a quick example.
https://gist.github.com/statgeek/27e23c015eae7953eff2

Change the min/max to mean/median or whatever stat you’re calculating and of course the array lengths as needed.


mkeintz · Posted 08-19-2020 07:46 PM

What do you want the output to look like? Since you are doing 7-day rolling statistics, do you want to start each zip code with the 7th observation, such that it is the first with a completely populated 7-day window?


eramirez · Posted 08-20-2020 09:44 AM

Hello

Thanks for the reply. The results can start with the 7th observation if that helps. The variable N indicates when the 7th observation begins, so when I overlay the MEANXI (line) values on the COUNT (bar) values, I can use N>6 to exclude the first six values.

Thanks

Enrique


eramirez · Posted 08-20-2020 05:29 PM

Thank you, this works too and is useful for other stats!

mkeintz · Posted 08-20-2020 03:54 PM

In cases like this, I would suggest maintaining an array containing the most recent 7 values of the variables in question. Here's an example getting the 7-day rolling mean of COUNT:

data zips1;
infile datalines truncover;
input Date:DATE9. zipcode:32. pct:32. count:32. total:32.;
format Date DATE9.;
datalines;
01JAN2020 60601 16.666667 1 6
02JAN2020 60601 0 0 8
03JAN2020 60601 14.285714 1 7
04JAN2020 60601 0 0 5
05JAN2020 60601 0 0 7
06JAN2020 60601 0 0 8
07JAN2020 60601 0 0 6
08JAN2020 60601 0 0 8
09JAN2020 60601 20 1 5
10JAN2020 60601 0 0 6
11JAN2020 60601 0 0 8
12JAN2020 60601 0 0 4
13JAN2020 60601 0 0 8
14JAN2020 60601 0 0 10
15JAN2020 60601 0 0 9
16JAN2020 60601 25 1 4
17JAN2020 60601 0 0 4
18JAN2020 60601 0 0 6
19JAN2020 60601 0 0 4
20JAN2020 60601 0 0 6
run;
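data zips2;
  set zips1;
  by zipcode;
  /* OBS counts rows within the current zip code */
  obs+1;
  if first.zipcode then obs=1;
  /* Circular buffer holding the 7 most recent COUNT values */
  array _cntarray {0:6} _temporary_;
  _cntarray{mod(obs,7)}=count;
  /* Keep only rows that have a full 7-row window */
  if obs>=7;
  mean_count=mean(of _cntarray{*});
run;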
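
If the first six rows of each zip code should be kept rather than dropped, a variant along these lines might work. This is only a sketch: it assumes ZIPS1 exists as created above and is sorted by zipcode and date, and the names zips_partial, _win, and n_in_window are made up for illustration.

```sas
/* Sketch only: assumes ZIPS1 exists and is sorted by zipcode and date. */
data zips_partial;
  set zips1;
  by zipcode;
  array _win {0:6} _temporary_;
  if first.zipcode then do;
    obs = 0;
    call missing(of _win{*});  /* empty the window for the new zip code */
  end;
  obs + 1;                     /* row number within the zip code */
  _win{mod(obs, 7)} = count;   /* overwrite the oldest slot */
  n_in_window = n(of _win{*}); /* nonmissing values currently in the window */
  mean_count = mean(of _win{*});
run;
```

Because the window is cleared at each new zip code, the early rows average only the values seen so far, and n_in_window can be used to filter them out later (for example, keeping rows where n_in_window = 7 when plotting).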

Now, if you really want to maintain a rolling sum to which you add the current COUNT and subtract lag7(count), followed by division-by-7, you could do this:

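data zips2;
  set zips1;
  by zipcode date ;
  obs+1;
  if first.zipcode then obs=1;
  /* Add the current COUNT and subtract the one from 7 rows back.
     SUM_COUNT is never reset: it always holds the last 7 rows read,
     and for rows with OBS>=7 all 7 belong to the current zip code. */
  sum_count + count + -coalesce(lag7(count),0);
  if obs>=7;
  mean_count=sum_count/7;
run;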

The problem with the second approach is that, for long time series, you could accumulate some minor computational rounding errors, such that the end of the series might not exactly equal one-seventh of the sum of the last 7 obs.
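
One way to see whether that drift matters for a given series is to compute both versions in a single step and look at the difference. The sketch below assumes ZIPS1 as created above and that COUNT is never missing (otherwise the two means differ for reasons other than rounding); zips_check, _win, mean_array, mean_sum, and diff are illustrative names.

```sas
/* Sketch only: assumes ZIPS1 is sorted by zipcode and date,
   and that COUNT has no missing values. */
data zips_check;
  set zips1;
  by zipcode date;
  if first.zipcode then obs = 0;
  obs + 1;

  /* Approach 1: circular buffer of the last 7 values */
  array _win {0:6} _temporary_;
  _win{mod(obs, 7)} = count;

  /* Approach 2: rolling sum; LAG7 runs on every row so its queue stays in step */
  sum_count + (count - coalesce(lag7(count), 0));

  if obs >= 7;
  mean_array = mean(of _win{*});
  mean_sum   = sum_count / 7;
  diff       = mean_array - mean_sum;  /* any nonzero value is accumulated rounding */
run;
```

Summarizing diff (for example with PROC MEANS) on a long series would show how large the discrepancy gets in practice.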

OTOH, the second approach could be a bit faster, especially if you want, say, a 40-day rolling window instead of a 7-day window.
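
For a different window length, the rolling-sum step could be parameterized with a macro variable. This is a sketch under assumptions: the macro variable name w and the data set name zips_rollsum are made up here, ZIPS1 is the data set created above, and a missing COUNT is treated as 0.

```sas
%let w = 7;  /* hypothetical window length; set to 40 for a 40-row window */

/* Sketch only: assumes ZIPS1 is sorted by zipcode and date. */
data zips_rollsum;
  set zips1;
  by zipcode date;
  if first.zipcode then obs = 0;
  obs + 1;
  /* lag&w resolves to LAG7, LAG40, etc.; it must run on every row */
  old = coalesce(lag&w(count), 0);
  sum_count + (coalesce(count, 0) - old);
  if obs >= &w;                 /* keep rows with a full window in this zip code */
  mean_count = sum_count / &w;
  drop old;
run;
```

Note that with the 20-row sample data, a window of 40 would leave no output rows, since no zip code reaches 40 observations.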


eramirez · Posted 08-20-2020 05:24 PM

Thank you so much! The first approach works great. I may have long time series data, so thank you for the second method as well. I will keep them both handy.
