My data has the following structure
data have;
infile datalines delimiter=' ';
input customerid : 8.
date : date9.
opens : 3.
;
datalines;
2123780 11APR2017 0
2123780 13APR2017 0
2123780 16APR2017 1
2123780 18APR2017 0
2123780 19APR2017 2
2123780 20APR2017 0
2123780 21APR2017 0
2123780 23APR2017 0
2123780 25APR2017 0
2123780 26APR2017 0
2123780 28APR2017 0
2123780 29APR2017 3
2123780 01MAY2017 3
2123780 03MAY2017 2
2123780 04MAY2017 5
2123780 05MAY2017 1
2123780 07MAY2017 2
2123780 09MAY2017 2
2123780 11MAY2017 3
2123780 13MAY2017 3
2123780 14MAY2017 0
2123780 16MAY2017 2
2123780 17MAY2017 2
;
run;
What I like to achieve is a moving total, average, sd, etc. for the opens
variable (and many more), which contain the values which fall within the last 7, 14, 30, etc. days PRIOR to the current observation, by customerid
. As you can see, the observations occur irregularly. Sometimes there are large gaps between, sometimes there are several on the same day. Therefore, I cannot use PROC EXPAND
(correct me if I'm wrong). Furthermore, I do not want to compress my date into e.g. one observation per week but keep them as they are.
The solution I came up with is an ugly piece of LAG()
-coding and if-clauses. An example for one variable and 7 days:
%macro loop;
data want(drop= lag_kdnr_num -- lag_mahnung min7 -- min365 minimum);
set have;
week_opens=0;
%do i=1 %to 500;
lag_customerID=lag&i.(customerID);
date_7=lag&i.(date)+7;
lag_opens=lag&i(opens);
if ((customerID=lag_customerID) and (dsate < date_7)) then
do;
week_opens=sum(week_opens + lag_opens);
end;
%end;
min7=minimum + 7;
if date < min7 then
do;
week_opens=.;
end;
run;
%MEND;
%loop;
which is giving me this:
data want2;
infile datalines delimiter=' ';
input customerid : 8.
date : date9.
opens : 3.
week_opens : 3.
;
datalines;
2123780 11APR2017 0 .
2123780 13APR2017 0 .
2123780 16APR2017 1 .
2123780 18APR2017 0 1
2123780 19APR2017 2 1
2123780 20APR2017 0 3
2123780 21APR2017 0 3
2123780 23APR2017 0 2
2123780 25APR2017 0 2
2123780 26APR2017 0 0
2123780 28APR2017 0 0
2123780 29APR2017 3 0
2123780 01MAY2017 3 3
2123780 03MAY2017 2 6
2123780 04MAY2017 5 8
2123780 05MAY2017 1 13
2123780 07MAY2017 2 11
2123780 09MAY2017 2 10
2123780 11MAY2017 3 5
2123780 13MAY2017 3 7
2123780 14MAY2017 0 8
2123780 16MAY2017 2 6
2123780 17MAY2017 2 8
;
run;
Due to the huge amount of data, this is really slow and creates a lot of unused variables which I drop at the end. Is there a faster, more elegant way to get this result e.g. via a temporary array or SAS/ETS? Thank you in advance!
Final remark: I want to use this information as covariates in a survival analysis.