DATA Step, Macro, Functions and more

Summing the Future for a Given Time Period

Reply
Occasional Contributor
Posts: 18

Summing the Future for a Given Time Period

Hello kind person,

In a large data set I have observations consisting of variables ID, Date, and X. I want to form a new variable, Y, that for each observation, is the sum of X for all observations that occurred in the 120 days following that observation, for that ID.

I realize that I can sort by descending date to make this problem a little easier. Using a specified number of lags is difficult because 120 days will contain different numbers of observations.

Any suggestions as to how to create this variable are greatly appreciated.
Trusted Advisor
Posts: 2,115

Re: Summing the Future for a Given Time Period

PROC SQL; will do it fairly reasonably. However, your question is not well framed as I can interpret it at least two ways that yield very different results. Perhaps you could show an example with your base data and desired results.

Doc Muhlbaier
Duke
Occasional Contributor
Posts: 18

Re: Summing the Future for a Given Time Period

Thank you,

Consider the following:

ID Date X Desired Y
1 10000 0 1
1 10000 1 1
1 10010 0 2
1 10119 1 2
1 10121 1 1
1 10125 0 1
1 10131 1 0
2 90000 0 1
2 90001 1 0
2 90150 1 1 Message was edited by: SColby
Super User
Posts: 10,018

Re: Summing the Future for a Given Time Period

I am confused
[pre]
ID Date X Desired Y
1 10000 0 1 <---- Is it correct ? in 120 days ,it should include '10119'
1 10000 1 1
1 10010 0 2
1 10119 1 2
1 10121 1 1
1 10125 0 1
1 10131 1 0
2 90000 0 1
2 90001 1 0
2 90150 1 1



Ksharp

Message was edited by: Ksharp
Occasional Contributor
Posts: 18

Re: Summing the Future for a Given Time Period

Yes, that is correct because the next observation with x = 1 occurred on the same date, and so does not count. The "1" in the Y column comes from the 10119 observation.

Thanks,
Scott
Super User
Posts: 10,018

Re: Summing the Future for a Given Time Period

OK.
But I have another question.
[pre]
ID Date X Desired Y
1 10000 0 1
1 10000 1 1
1 10010 0 2
1 10119 1 2
1 10121 1 1
1 10125 0 1
1 10131 1 0<-----it is zero.
2 90000 0 1
2 90001 1 0
2 90150 1 1<-------is it correct ?but the above group is 0.


Assuming it is 0.





data temp;
input ID Date X;
datalines;
1 10000 0
1 10000 1
1 10010 0
1 10119 1
1 10121 1
1 10125 0
1 10131 1
2 90000 0
2 90001 1
2 90150 1
;
run;
proc sort data=temp;
by id date;
run;
data result(drop=_: count);
set temp end=last;
count+1;
y=0;
if not last then do;
_count=count ;
do until(_date-date gt 120 or id ne _id);
_count+1;
set temp(rename=(id=_id date=_date x=_x)) point=_count;
if _date-date ge 1 and _date-date le 120 and id eq _id then y+_x;
end;
end;
run;





Ksharp Message was edited by: Ksharp
Occasional Contributor
Posts: 18

Re: Summing the Future for a Given Time Period

You are correct: that last 1 under "Desired Y" ought to be 0.

I ran that code and it is still running 4 hours later. Maybe I adapted it incorrectly, but for now I am going to work with the solution offered by Zineg.

Thank for your interest and help. This has been tremendously useful.

Scott
Super User
Posts: 10,018

Re: Summing the Future for a Given Time Period

May be proc sort waste lots of your time.
If your data has sorted so plz remove it.


Ksharp
Super User
Posts: 10,018

Re: Summing the Future for a Given Time Period

Hi.
But after run Zineg 's code I found a problem:
[pre]
data sample_input;
input ID Date X;
datalines;
1 10000 0
1 10000 1
1 10010 0
1 10119 1
1 10121 1
1 10125 0
1 10131 1
2 90000 0
2 90001 1
2 90150 1
;
run;

proc sql;
create table sample_output as
select ID
, date
, x
, (select sum(b.x) from sample_input b where a.id=b.id and b.date between a.date+1 and a.date+120) as y
from sample_input a
;
quit;

proc print noobs ;run;





ID Date X y

1 10000 0 1
1 10000 1 1
1 10010 0 2
1 10119 1 2
1 10121 1 1
1 10125 0 1
1 10131 1 .<-------it should be zero
2 90000 0 1
2 90001 1 .<------it should be zero
2 90150 1 .<------it should be zero





It is what you can accept?


Ksharp Message was edited by: Ksharp
Occasional Contributor
Posts: 18

Re: Summing the Future for a Given Time Period

That is interesting because when I ran it on the actual data set there are no missing values generated and everything works fine. Where there should be zeros there are zeros. I get the same thing as you when I run the test data. Hm.

The data is presorted so I don't sort first.

Thanks,
Scott
Super User
Posts: 10,018

Re: Summing the Future for a Given Time Period

If you like sql.I fixed it.
[pre]
data sample_input;
input ID Date X;
datalines;
1 10000 0
1 10000 1
1 10010 0
1 10119 1
1 10121 1
1 10125 0
1 10131 1
2 90000 0
2 90001 1
2 90150 1
;
run;

proc sql;
create table sample_output as
select ID
, date
, x
, coalesce((select sum(b.x) from sample_input b where a.id=b.id and b.date between a.date+1 and a.date+120),0) as y
from sample_input a
;
quit;

proc print noobs ;run;
[pre]



Ksharp
Super User
Posts: 10,018

Re: Summing the Future for a Given Time Period

I rethink and refine my code ,maybe it will be more efficient.
[pre]
data temp;
input ID Date X;
datalines;
1 10000 0
1 10000 1
1 10010 0
1 10119 1
1 10121 1
1 10125 0
1 10131 1
2 90000 0
2 90001 1
2 90150 1
;
run;

data result(drop=_: );
set temp nobs=_nobs;
y=0;
count=_n_ ;
do until(_date-date gt 120 or id ne _id or count ge _nobs);
count+1;
if count le _nobs then do;
set temp(rename=(id=_id date=_date x=_x)) point=count;
if _date-date ge 1 and _date-date le 120 and id eq _id then y+_x;
end;
end;
run;
[/pre]



Ksharp Message was edited by: Ksharp
SAS Employee
Posts: 174

Re: Summing the Future for a Given Time Period

Perhaps something like this ?
[pre]data sample_input;
input ID Date X;
datalines;
1 10000 0
1 10000 1
1 10010 0
1 10119 1
1 10121 1
1 10125 0
1 10131 1
2 90000 0
2 90001 1
2 90150 1
;
run;

proc sql;
create table sample_output as
select ID
, date
, x
, (select sum(b.x) from sample_input b where a.id=b.id and b.date between a.date and a.date+120) as y
from sample_input a
;
quit;
[/pre]
Occasional Contributor
Posts: 18

Re: Summing the Future for a Given Time Period

Posted in reply to GertNissen
Zineg,

After being confused about the meaning of "a" and "b" I am able to get your code to work. Should I think of their use as denoting two separate copies of the data set?

I am very happy that this code has worked, and I hate to be greedy, but are there any obvious ways to make this procedure more efficient? I have to do this to 51 data sets that each contain ~150,000 observations. It took 32 minutes to do it to one. (I have already parsed the data set to only those variables I need for this operation.)

Thanks again,
Scott

Message was edited by: SColby Message was edited by: SColby
SAS Employee
Posts: 174

Re: Summing the Future for a Given Time Period

The quick solution is to sort your data by id and date AND add an index - try that and see how that performs.
Ask a Question
Discussion stats
  • 14 replies
  • 213 views
  • 0 likes
  • 4 in conversation