Re: Calculate a moving sum/average depending on date

mat_n · Posted 08-22-2018 08:59 AM

My data has the following structure

data have;
    infile datalines delimiter='    ';
    input customerid : 8.
        date : date9.
        opens : 3.
    ;
    datalines;
2123780 11APR2017   0
2123780 13APR2017   0
2123780 16APR2017   1
2123780 18APR2017   0
2123780 19APR2017   2
2123780 20APR2017   0
2123780 21APR2017   0
2123780 23APR2017   0
2123780 25APR2017   0
2123780 26APR2017   0
2123780 28APR2017   0
2123780 29APR2017   3
2123780 01MAY2017   3
2123780 03MAY2017   2
2123780 04MAY2017   5
2123780 05MAY2017   1
2123780 07MAY2017   2
2123780 09MAY2017   2
2123780 11MAY2017   3
2123780 13MAY2017   3
2123780 14MAY2017   0
2123780 16MAY2017   2
2123780 17MAY2017   2
;
run;

What I like to achieve is a moving total, average, sd, etc. for the opens variable (and many more), which contain the values which fall within the last 7, 14, 30, etc. days PRIOR to the current observation, by customerid. As you can see, the observations occur irregularly. Sometimes there are large gaps between, sometimes there are several on the same day. Therefore, I cannot use PROC EXPAND (correct me if I'm wrong). Furthermore, I do not want to compress my date into e.g. one observation per week but keep them as they are.

The solution I came up with is an ugly piece of LAG()-coding and if-clauses. An example for one variable and 7 days:

%macro loop;

    data want(drop= lag_kdnr_num -- lag_mahnung min7 -- min365 minimum);
        set have;
        week_opens=0;

        %do i=1 %to 500;
            lag_customerID=lag&i.(customerID);
            date_7=lag&i.(date)+7;
            lag_opens=lag&i(opens);

            if ((customerID=lag_customerID) and (dsate < date_7)) then
                do;
                    week_opens=sum(week_opens + lag_opens);
                end;
        %end;

        min7=minimum + 7;

        if date < min7 then
            do;
                week_opens=.;
            end;
    run;

%MEND;

%loop;

which is giving me this:

data want2;
    infile datalines delimiter='    ';
    input customerid : 8.
        date : date9.
        opens : 3.
        week_opens : 3.
    ;
    datalines;
2123780 11APR2017   0   .
2123780 13APR2017   0   .
2123780 16APR2017   1   .
2123780 18APR2017   0   1
2123780 19APR2017   2   1
2123780 20APR2017   0   3
2123780 21APR2017   0   3
2123780 23APR2017   0   2
2123780 25APR2017   0   2
2123780 26APR2017   0   0
2123780 28APR2017   0   0
2123780 29APR2017   3   0
2123780 01MAY2017   3   3
2123780 03MAY2017   2   6
2123780 04MAY2017   5   8
2123780 05MAY2017   1   13
2123780 07MAY2017   2   11
2123780 09MAY2017   2   10
2123780 11MAY2017   3   5
2123780 13MAY2017   3   7
2123780 14MAY2017   0   8
2123780 16MAY2017   2   6
2123780 17MAY2017   2   8
;
run;

Due to the huge amount of data, this is really slow and creates a lot of unused variables which I drop at the end. Is there a faster, more elegant way to get this result e.g. via a temporary array or SAS/ETS? Thank you in advance!

Final remark: I want to use this information as covariates in a survival analysis.

Kurt_Bremser · Posted 08-22-2018 10:05 AM

See if this satisfies your needs:

data have;
    infile datalines delimiter='    ';
    input customerid : 8.
        date : date9.
        opens : 3.
    ;
    format date yymmddd10.;
    datalines;
2123780 11APR2017   0
2123780 13APR2017   0
2123780 16APR2017   1
2123780 18APR2017   0
2123780 19APR2017   2
2123780 20APR2017   0
2123780 21APR2017   0
2123780 23APR2017   0
2123780 25APR2017   0
2123780 26APR2017   0
2123780 28APR2017   0
2123780 29APR2017   3
2123780 01MAY2017   3
2123780 03MAY2017   2
2123780 04MAY2017   5
2123780 05MAY2017   1
2123780 07MAY2017   2
2123780 09MAY2017   2
2123780 11MAY2017   3
2123780 13MAY2017   3
2123780 14MAY2017   0
2123780 16MAY2017   2
2123780 17MAY2017   2
;
run;

%let window=7;

data want;
set have;
by customerid;
array window {&window};
retain window:;
format lastdate yymmddd10.;
lastdate = lag(date);
if first.customerid
then do;
  do i = 1 to &window - 1;
    window{i} = 0;
  end;
  window{dim(window)} = opens;
  total = opens;
end;
else do;
  diff = date - lastdate;
  do i = 1 to dim(window) - diff;
    window{i} = window{i + diff};
  end;
  do i = dim(window) - diff + 1 to dim(window) - 1;
    window{i} = 0;
  end;
  window{dim(window)} = opens;
  total = sum(of window:);
end;
run;

I intentionally did not drop anything, so you can see the evolution of the array.

If you want to get more than one window, define additional arrays and implement the code for every array (with a separate totals variable).

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

mat_n · Posted 08-22-2018 10:39 AM

Looks great! the only minor shortcomming is that I do not want to have the current value included, but it is easy to adjust for this. Thanks a lot and I will evaluate it in detail soon 🙂

mat_n · Posted 08-24-2018 04:50 AM

After I have checked your solution in more detail I noticed a problem which I did not highlight so far. Sometimes there are two or more observations on the same day. I can put them in the right order since I also do have a timestamp. Due to that, there are some 7-day-intervals which contain more than 7 observations. I would not like to sum them up since this would result in meaningless resluts for calcuations like standard deviation.

Is there any way to create a temporary array for each line which has as many elements as there are observations in the last 7 days and fill the array with these values?

I learned a lot from your first answer, so thank you again!

data have;
    infile datalines delimiter='    ';
    input customerid : 8.
        date : date9.
        opens : 3.
    ;
    format date yymmddd10.;
    datalines;
2123780 11APR2017   0
2123780 13APR2017   0
2123780 16APR2017   1
2123780 18APR2017   0
2123780 19APR2017   2
2123780 20APR2017   0
2123780 21APR2017   0
2123780 23APR2017   0
2123780 25APR2017   0
2123780 26APR2017   0
2123780 28APR2017   0
2123780 29APR2017   3
2123780 01MAY2017   3
2123780 01MAY2017   2
2123780 04MAY2017   5
2123781 05MAY2017   1
2123781 07MAY2017   2
2123781 09MAY2017   2
2123781 11MAY2017   3
2123781 13MAY2017   3
2123781 14MAY2017   0
2123781 16MAY2017   2
2123781 17MAY2017   2
;
run;

Kurt_Bremser · Posted 08-24-2018 04:59 AM

Sum the data up first:

proc sql;
create table int as
select customerid, date, sum(opens) as opens
from have
group by customerid, date;
quit;

and use that as input for my previous code.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

mat_n · Posted 08-24-2018 05:43 AM

I guess your solution also contains a hint that there is no such way to use 'flexible' arrays. Thank you anyway!

Kurt_Bremser · Posted 08-24-2018 06:22 AM

I'm quite sure there is a way to do it, but I wouldn't want to read the resulting code 😉

The sum by groups OTOH is very simple code. You can also do a sort and use proc summary to achieve the same state.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Calculate a moving sum/average depending on date