BookmarkSubscribeRSS Feed
SColby
Calcite | Level 5
Hello kind person,

In a large data set I have observations consisting of variables ID, Date, and X. I want to form a new variable, Y, that for each observation, is the sum of X for all observations that occurred in the 120 days following that observation, for that ID.

I realize that I can sort by descending date to make this problem a little easier. Using a specified number of lags is difficult because 120 days will contain different numbers of observations.

Any suggestions as to how to create this variable are greatly appreciated.
14 REPLIES 14
Doc_Duke
Rhodochrosite | Level 12
PROC SQL; will do it fairly reasonably. However, your question is not well framed as I can interpret it at least two ways that yield very different results. Perhaps you could show an example with your base data and desired results.

Doc Muhlbaier
Duke
SColby
Calcite | Level 5
Thank you,

Consider the following:

ID Date X Desired Y
1 10000 0 1
1 10000 1 1
1 10010 0 2
1 10119 1 2
1 10121 1 1
1 10125 0 1
1 10131 1 0
2 90000 0 1
2 90001 1 0
2 90150 1 1 Message was edited by: SColby
Ksharp
Super User
I am confused
[pre]
ID Date X Desired Y
1 10000 0 1 <---- Is it correct ? in 120 days ,it should include '10119'
1 10000 1 1
1 10010 0 2
1 10119 1 2
1 10121 1 1
1 10125 0 1
1 10131 1 0
2 90000 0 1
2 90001 1 0
2 90150 1 1



Ksharp

Message was edited by: Ksharp
SColby
Calcite | Level 5
Yes, that is correct because the next observation with x = 1 occurred on the same date, and so does not count. The "1" in the Y column comes from the 10119 observation.

Thanks,
Scott
Ksharp
Super User
OK.
But I have another question.
[pre]
ID Date X Desired Y
1 10000 0 1
1 10000 1 1
1 10010 0 2
1 10119 1 2
1 10121 1 1
1 10125 0 1
1 10131 1 0<-----it is zero.
2 90000 0 1
2 90001 1 0
2 90150 1 1<-------is it correct ?but the above group is 0.


Assuming it is 0.





data temp;
input ID Date X;
datalines;
1 10000 0
1 10000 1
1 10010 0
1 10119 1
1 10121 1
1 10125 0
1 10131 1
2 90000 0
2 90001 1
2 90150 1
;
run;
proc sort data=temp;
by id date;
run;
data result(drop=_: count);
set temp end=last;
count+1;
y=0;
if not last then do;
_count=count ;
do until(_date-date gt 120 or id ne _id);
_count+1;
set temp(rename=(id=_id date=_date x=_x)) point=_count;
if _date-date ge 1 and _date-date le 120 and id eq _id then y+_x;
end;
end;
run;





Ksharp Message was edited by: Ksharp
SColby
Calcite | Level 5
You are correct: that last 1 under "Desired Y" ought to be 0.

I ran that code and it is still running 4 hours later. Maybe I adapted it incorrectly, but for now I am going to work with the solution offered by Zineg.

Thank for your interest and help. This has been tremendously useful.

Scott
Ksharp
Super User
May be proc sort waste lots of your time.
If your data has sorted so plz remove it.


Ksharp
Ksharp
Super User
Hi.
But after run Zineg 's code I found a problem:
[pre]
data sample_input;
input ID Date X;
datalines;
1 10000 0
1 10000 1
1 10010 0
1 10119 1
1 10121 1
1 10125 0
1 10131 1
2 90000 0
2 90001 1
2 90150 1
;
run;

proc sql;
create table sample_output as
select ID
, date
, x
, (select sum(b.x) from sample_input b where a.id=b.id and b.date between a.date+1 and a.date+120) as y
from sample_input a
;
quit;

proc print noobs ;run;





ID Date X y

1 10000 0 1
1 10000 1 1
1 10010 0 2
1 10119 1 2
1 10121 1 1
1 10125 0 1
1 10131 1 .<-------it should be zero
2 90000 0 1
2 90001 1 .<------it should be zero
2 90150 1 .<------it should be zero





It is what you can accept?


Ksharp Message was edited by: Ksharp
SColby
Calcite | Level 5
That is interesting because when I ran it on the actual data set there are no missing values generated and everything works fine. Where there should be zeros there are zeros. I get the same thing as you when I run the test data. Hm.

The data is presorted so I don't sort first.

Thanks,
Scott
Ksharp
Super User
If you like sql.I fixed it.
[pre]
data sample_input;
input ID Date X;
datalines;
1 10000 0
1 10000 1
1 10010 0
1 10119 1
1 10121 1
1 10125 0
1 10131 1
2 90000 0
2 90001 1
2 90150 1
;
run;

proc sql;
create table sample_output as
select ID
, date
, x
, coalesce((select sum(b.x) from sample_input b where a.id=b.id and b.date between a.date+1 and a.date+120),0) as y
from sample_input a
;
quit;

proc print noobs ;run;
[pre]



Ksharp
Ksharp
Super User
I rethink and refine my code ,maybe it will be more efficient.
[pre]
data temp;
input ID Date X;
datalines;
1 10000 0
1 10000 1
1 10010 0
1 10119 1
1 10121 1
1 10125 0
1 10131 1
2 90000 0
2 90001 1
2 90150 1
;
run;

data result(drop=_: );
set temp nobs=_nobs;
y=0;
count=_n_ ;
do until(_date-date gt 120 or id ne _id or count ge _nobs);
count+1;
if count le _nobs then do;
set temp(rename=(id=_id date=_date x=_x)) point=count;
if _date-date ge 1 and _date-date le 120 and id eq _id then y+_x;
end;
end;
run;
[/pre]



Ksharp Message was edited by: Ksharp
GertNissen
Barite | Level 11
Perhaps something like this ?
[pre]data sample_input;
input ID Date X;
datalines;
1 10000 0
1 10000 1
1 10010 0
1 10119 1
1 10121 1
1 10125 0
1 10131 1
2 90000 0
2 90001 1
2 90150 1
;
run;

proc sql;
create table sample_output as
select ID
, date
, x
, (select sum(b.x) from sample_input b where a.id=b.id and b.date between a.date and a.date+120) as y
from sample_input a
;
quit;
[/pre]
SColby
Calcite | Level 5
Zineg,

After being confused about the meaning of "a" and "b" I am able to get your code to work. Should I think of their use as denoting two separate copies of the data set?

I am very happy that this code has worked, and I hate to be greedy, but are there any obvious ways to make this procedure more efficient? I have to do this to 51 data sets that each contain ~150,000 observations. It took 32 minutes to do it to one. (I have already parsed the data set to only those variables I need for this operation.)

Thanks again,
Scott

Message was edited by: SColby Message was edited by: SColby
GertNissen
Barite | Level 11
The quick solution is to sort your data by id and date AND add an index - try that and see how that performs.

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 14 replies
  • 1798 views
  • 0 likes
  • 4 in conversation