BookmarkSubscribeRSS Feed
SColby
Calcite | Level 5
Hello kind person,

In a large data set I have observations consisting of variables ID, Date, and X. I want to form a new variable, Y, that for each observation, is the sum of X for all observations that occurred in the 120 days following that observation, for that ID.

I realize that I can sort by descending date to make this problem a little easier. Using a specified number of lags is difficult because 120 days will contain different numbers of observations.

Any suggestions as to how to create this variable are greatly appreciated.
14 REPLIES 14
Doc_Duke
Rhodochrosite | Level 12
PROC SQL; will do it fairly reasonably. However, your question is not well framed as I can interpret it at least two ways that yield very different results. Perhaps you could show an example with your base data and desired results.

Doc Muhlbaier
Duke
SColby
Calcite | Level 5
Thank you,

Consider the following:

ID Date X Desired Y
1 10000 0 1
1 10000 1 1
1 10010 0 2
1 10119 1 2
1 10121 1 1
1 10125 0 1
1 10131 1 0
2 90000 0 1
2 90001 1 0
2 90150 1 1 Message was edited by: SColby
Ksharp
Super User
I am confused
[pre]
ID Date X Desired Y
1 10000 0 1 <---- Is it correct ? in 120 days ,it should include '10119'
1 10000 1 1
1 10010 0 2
1 10119 1 2
1 10121 1 1
1 10125 0 1
1 10131 1 0
2 90000 0 1
2 90001 1 0
2 90150 1 1



Ksharp

Message was edited by: Ksharp
SColby
Calcite | Level 5
Yes, that is correct because the next observation with x = 1 occurred on the same date, and so does not count. The "1" in the Y column comes from the 10119 observation.

Thanks,
Scott
Ksharp
Super User
OK.
But I have another question.
[pre]
ID Date X Desired Y
1 10000 0 1
1 10000 1 1
1 10010 0 2
1 10119 1 2
1 10121 1 1
1 10125 0 1
1 10131 1 0<-----it is zero.
2 90000 0 1
2 90001 1 0
2 90150 1 1<-------is it correct ?but the above group is 0.


Assuming it is 0.





data temp;
input ID Date X;
datalines;
1 10000 0
1 10000 1
1 10010 0
1 10119 1
1 10121 1
1 10125 0
1 10131 1
2 90000 0
2 90001 1
2 90150 1
;
run;
proc sort data=temp;
by id date;
run;
data result(drop=_: count);
set temp end=last;
count+1;
y=0;
if not last then do;
_count=count ;
do until(_date-date gt 120 or id ne _id);
_count+1;
set temp(rename=(id=_id date=_date x=_x)) point=_count;
if _date-date ge 1 and _date-date le 120 and id eq _id then y+_x;
end;
end;
run;





Ksharp Message was edited by: Ksharp
SColby
Calcite | Level 5
You are correct: that last 1 under "Desired Y" ought to be 0.

I ran that code and it is still running 4 hours later. Maybe I adapted it incorrectly, but for now I am going to work with the solution offered by Zineg.

Thank for your interest and help. This has been tremendously useful.

Scott
Ksharp
Super User
May be proc sort waste lots of your time.
If your data has sorted so plz remove it.


Ksharp
Ksharp
Super User
Hi.
But after run Zineg 's code I found a problem:
[pre]
data sample_input;
input ID Date X;
datalines;
1 10000 0
1 10000 1
1 10010 0
1 10119 1
1 10121 1
1 10125 0
1 10131 1
2 90000 0
2 90001 1
2 90150 1
;
run;

proc sql;
create table sample_output as
select ID
, date
, x
, (select sum(b.x) from sample_input b where a.id=b.id and b.date between a.date+1 and a.date+120) as y
from sample_input a
;
quit;

proc print noobs ;run;





ID Date X y

1 10000 0 1
1 10000 1 1
1 10010 0 2
1 10119 1 2
1 10121 1 1
1 10125 0 1
1 10131 1 .<-------it should be zero
2 90000 0 1
2 90001 1 .<------it should be zero
2 90150 1 .<------it should be zero





It is what you can accept?


Ksharp Message was edited by: Ksharp
SColby
Calcite | Level 5
That is interesting because when I ran it on the actual data set there are no missing values generated and everything works fine. Where there should be zeros there are zeros. I get the same thing as you when I run the test data. Hm.

The data is presorted so I don't sort first.

Thanks,
Scott
Ksharp
Super User
If you like sql.I fixed it.
[pre]
data sample_input;
input ID Date X;
datalines;
1 10000 0
1 10000 1
1 10010 0
1 10119 1
1 10121 1
1 10125 0
1 10131 1
2 90000 0
2 90001 1
2 90150 1
;
run;

proc sql;
create table sample_output as
select ID
, date
, x
, coalesce((select sum(b.x) from sample_input b where a.id=b.id and b.date between a.date+1 and a.date+120),0) as y
from sample_input a
;
quit;

proc print noobs ;run;
[pre]



Ksharp
Ksharp
Super User
I rethink and refine my code ,maybe it will be more efficient.
[pre]
data temp;
input ID Date X;
datalines;
1 10000 0
1 10000 1
1 10010 0
1 10119 1
1 10121 1
1 10125 0
1 10131 1
2 90000 0
2 90001 1
2 90150 1
;
run;

data result(drop=_: );
set temp nobs=_nobs;
y=0;
count=_n_ ;
do until(_date-date gt 120 or id ne _id or count ge _nobs);
count+1;
if count le _nobs then do;
set temp(rename=(id=_id date=_date x=_x)) point=count;
if _date-date ge 1 and _date-date le 120 and id eq _id then y+_x;
end;
end;
run;
[/pre]



Ksharp Message was edited by: Ksharp
GertNissen
Barite | Level 11
Perhaps something like this ?
[pre]data sample_input;
input ID Date X;
datalines;
1 10000 0
1 10000 1
1 10010 0
1 10119 1
1 10121 1
1 10125 0
1 10131 1
2 90000 0
2 90001 1
2 90150 1
;
run;

proc sql;
create table sample_output as
select ID
, date
, x
, (select sum(b.x) from sample_input b where a.id=b.id and b.date between a.date and a.date+120) as y
from sample_input a
;
quit;
[/pre]
SColby
Calcite | Level 5
Zineg,

After being confused about the meaning of "a" and "b" I am able to get your code to work. Should I think of their use as denoting two separate copies of the data set?

I am very happy that this code has worked, and I hate to be greedy, but are there any obvious ways to make this procedure more efficient? I have to do this to 51 data sets that each contain ~150,000 observations. It took 32 minutes to do it to one. (I have already parsed the data set to only those variables I need for this operation.)

Thanks again,
Scott

Message was edited by: SColby Message was edited by: SColby
GertNissen
Barite | Level 11
The quick solution is to sort your data by id and date AND add an index - try that and see how that performs.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 14 replies
  • 1167 views
  • 0 likes
  • 4 in conversation