BookmarkSubscribeRSS Feed
mat_n
Obsidian | Level 7

My data has the following structure

data have;
    infile datalines delimiter='    ';
    input customerid : 8.
        date : date9.
        opens : 3.
    ;
    datalines;
2123780 11APR2017   0
2123780 13APR2017   0
2123780 16APR2017   1
2123780 18APR2017   0
2123780 19APR2017   2
2123780 20APR2017   0
2123780 21APR2017   0
2123780 23APR2017   0
2123780 25APR2017   0
2123780 26APR2017   0
2123780 28APR2017   0
2123780 29APR2017   3
2123780 01MAY2017   3
2123780 03MAY2017   2
2123780 04MAY2017   5
2123780 05MAY2017   1
2123780 07MAY2017   2
2123780 09MAY2017   2
2123780 11MAY2017   3
2123780 13MAY2017   3
2123780 14MAY2017   0
2123780 16MAY2017   2
2123780 17MAY2017   2
;
run;

What I like to achieve is a moving total, average, sd, etc. for the opens variable (and many more), which contain the values which fall within the last 7, 14, 30, etc. days PRIOR to the current observation, by customerid. As you can see, the observations occur irregularly. Sometimes there are large gaps between, sometimes there are several on the same day. Therefore, I cannot use PROC EXPAND (correct me if I'm wrong). Furthermore, I do not want to compress my date into e.g. one observation per week but keep them as they are.

The solution I came up with is an ugly piece of LAG()-coding and if-clauses. An example for one variable and 7 days:

%macro loop;

    data want(drop= lag_kdnr_num -- lag_mahnung min7 -- min365 minimum);
        set have;
        week_opens=0;

        %do i=1 %to 500;
            lag_customerID=lag&i.(customerID);
            date_7=lag&i.(date)+7;
            lag_opens=lag&i(opens);

            if ((customerID=lag_customerID) and (dsate < date_7)) then
                do;
                    week_opens=sum(week_opens + lag_opens);
                end;
        %end;

        min7=minimum + 7;

        if date < min7 then
            do;
                week_opens=.;
            end;
    run;

%MEND;

%loop;

which is giving me this:

data want2;
    infile datalines delimiter='    ';
    input customerid : 8.
        date : date9.
        opens : 3.
        week_opens : 3.
    ;
    datalines;
2123780 11APR2017   0   .
2123780 13APR2017   0   .
2123780 16APR2017   1   .
2123780 18APR2017   0   1
2123780 19APR2017   2   1
2123780 20APR2017   0   3
2123780 21APR2017   0   3
2123780 23APR2017   0   2
2123780 25APR2017   0   2
2123780 26APR2017   0   0
2123780 28APR2017   0   0
2123780 29APR2017   3   0
2123780 01MAY2017   3   3
2123780 03MAY2017   2   6
2123780 04MAY2017   5   8
2123780 05MAY2017   1   13
2123780 07MAY2017   2   11
2123780 09MAY2017   2   10
2123780 11MAY2017   3   5
2123780 13MAY2017   3   7
2123780 14MAY2017   0   8
2123780 16MAY2017   2   6
2123780 17MAY2017   2   8
;
run;

Due to the huge amount of data, this is really slow and creates a lot of unused variables which I drop at the end. Is there a faster, more elegant way to get this result e.g. via a temporary array or SAS/ETS? Thank you in advance!

Final remark: I want to use this information as covariates in a survival analysis.

6 REPLIES 6
Kurt_Bremser
Super User

See if this satisfies your needs:

data have;
    infile datalines delimiter='    ';
    input customerid : 8.
        date : date9.
        opens : 3.
    ;
    format date yymmddd10.;
    datalines;
2123780 11APR2017   0
2123780 13APR2017   0
2123780 16APR2017   1
2123780 18APR2017   0
2123780 19APR2017   2
2123780 20APR2017   0
2123780 21APR2017   0
2123780 23APR2017   0
2123780 25APR2017   0
2123780 26APR2017   0
2123780 28APR2017   0
2123780 29APR2017   3
2123780 01MAY2017   3
2123780 03MAY2017   2
2123780 04MAY2017   5
2123780 05MAY2017   1
2123780 07MAY2017   2
2123780 09MAY2017   2
2123780 11MAY2017   3
2123780 13MAY2017   3
2123780 14MAY2017   0
2123780 16MAY2017   2
2123780 17MAY2017   2
;
run;

%let window=7;

data want;
set have;
by customerid;
array window {&window};
retain window:;
format lastdate yymmddd10.;
lastdate = lag(date);
if first.customerid
then do;
  do i = 1 to &window - 1;
    window{i} = 0;
  end;
  window{dim(window)} = opens;
  total = opens;
end;
else do;
  diff = date - lastdate;
  do i = 1 to dim(window) - diff;
    window{i} = window{i + diff};
  end;
  do i = dim(window) - diff + 1 to dim(window) - 1;
    window{i} = 0;
  end;
  window{dim(window)} = opens;
  total = sum(of window:);
end;
run;

I intentionally did not drop anything, so you can see the evolution of the array.

If you want to get more than one window, define additional arrays and implement the code for every array (with a separate totals variable).

mat_n
Obsidian | Level 7
Looks great! the only minor shortcomming is that I do not want to have the current value included, but it is easy to adjust for this. Thanks a lot and I will evaluate it in detail soon 🙂
mat_n
Obsidian | Level 7

After I have checked your solution in more detail I noticed a problem which I did not highlight so far. Sometimes there are two or more observations on the same day. I can put them in the right order since I also do have a timestamp. Due to that, there are some 7-day-intervals which contain more than 7 observations. I would not like to sum them up since this would result in meaningless resluts for calcuations like standard deviation.

Is there any way to create a temporary array for each line which has as many elements as there are observations in the last 7 days and fill the array with these values?

 

I learned a lot from your first answer, so thank you again!

 

data have;
    infile datalines delimiter='    ';
    input customerid : 8.
        date : date9.
        opens : 3.
    ;
    format date yymmddd10.;
    datalines;
2123780 11APR2017   0
2123780 13APR2017   0
2123780 16APR2017   1
2123780 18APR2017   0
2123780 19APR2017   2
2123780 20APR2017   0
2123780 21APR2017   0
2123780 23APR2017   0
2123780 25APR2017   0
2123780 26APR2017   0
2123780 28APR2017   0
2123780 29APR2017   3
2123780 01MAY2017   3
2123780 01MAY2017   2
2123780 04MAY2017   5
2123781 05MAY2017   1
2123781 07MAY2017   2
2123781 09MAY2017   2
2123781 11MAY2017   3
2123781 13MAY2017   3
2123781 14MAY2017   0
2123781 16MAY2017   2
2123781 17MAY2017   2
;
run;
Kurt_Bremser
Super User

Sum the data up first:

proc sql;
create table int as
select customerid, date, sum(opens) as opens
from have
group by customerid, date;
quit;

and use that as input for my previous code.

mat_n
Obsidian | Level 7

I guess your solution also contains a hint that there is no such way to use 'flexible' arrays. Thank you anyway!

Kurt_Bremser
Super User

I'm quite sure there is a way to do it, but I wouldn't want to read the resulting code 😉

The sum by groups OTOH is very simple code. You can also do a sort and use proc summary to achieve the same state.

 

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 2975 views
  • 0 likes
  • 2 in conversation