Hi all, I have the following situation. I have two data sets (DS1 and DS2) that I want to use as a source to create a third data set (DS3). In short, the third data set (DS3) will be DS1 with an extra variable as the sum of variable X. To get the sum, you use the dates in DS1 and then sum any X variable between the previous date and current date defined in DS1. For example for A1 Visit 1, sum(X)={sum of X where date<=07/01/14}; for A1 Visit 2, sum(X)={sum of X where 07/01/14<date<=08/02/14}; etc. This code would have have to work across different subjects. I tried to use merge, first.Subject last.Subject, etc but this one stumps me. I'm not sure if a hash table would work here.
I am not proficient in proc sql so if there is a data step option would be preferred but an sql statement will do as well. Thank you for your help!
DS1
Subject | Visit | Date |
---|---|---|
A1 | 1 | 07/01/14 |
A1 | 2 | 08/02/14 |
A1 | 3 | 08/15/14 |
A2 | 1 | 05/14/15 |
A2 | 2 | 07/22/15 |
data ds1;
input subid $ visit date mmddyy8.;
datalines;
A1 1 07/01/14
A1 2 08/02/14
A1 3 08/15/14
A2 1 05/14/15
A2 2 07/22/15
;
DS2
Subject | Date | X |
---|---|---|
A1 | 06/02/14 | 3 |
A1 | 07/01/14 | 0 |
A1 | 08/02/14 | 1 |
A1 | 08/10/14 | 10 |
A1 | 08/15/14 | 2 |
A2 | 05/14/15 | 2 |
A2 | 05/23/15 | 3 |
A2 | 06/01/15 | 0 |
A2 | 07/22/15 | 7 |
data ds2;
input subid $ date mmddyy8. x;
datalines;
A1 06/02/14 3
A1 07/01/14 0
A1 08/02/14 1
A1 08/10/14 10
A1 08/15/14 2
A2 05/14/15 2
A2 05/23/15 3
A2 06/01/15 0
A2 07/22/15 7
;
DS3
Subject | Visit | Date | Sum(X) |
---|---|---|---|
A1 | 1 | 07/01/14 | 3+0=3 |
A1 | 2 | 08/02/14 | 1 |
A1 | 3 | 08/15/14 | 10+2=12 |
A2 | 1 | 05/14/15 | 2 |
A2 | 2 | 07/22/15 | 3+0+7=10 |
Here's a fairly straightforward DATA step version:
data want (drop=X);
set ds2 (in=from_ds2) ds1 (in=from_ds1);
by subid date;
if first.subid then total_x=0;
if from_ds2 then total_x + x;
if from_ds1;
output;
total_x=0;
run;
There are a few sophisticated pieces here, but the code is pretty short and each piece should be comprehensible.
Good luck.
I believe one can do it using data step point= or Hash, and it should be faster than the following proc sql approach, except that the latter is easier to code, let us know this is sufficient or we will move on to the data step approach:
data ds1;
infile cards truncover dlm='09'x;
input Subject$ Visit Date :mmddyy8.;
cards;
A1 1 07/01/14
A1 2 08/02/14
A1 3 08/15/14
A2 1 05/14/15
A2 2 07/22/15
;
data ds2;
infile cards truncover dlm='09'x;
input Subject$ Date :mmddyy8. X;
cards;
A1 06/02/14 3
A1 07/01/14 0
A1 08/02/14 1
A1 08/10/14 10
A1 08/15/14 2
A2 05/14/15 2
A2 05/23/15 3
A2 06/01/15 0
A2 07/22/15 7
;
proc sql;
create table ds3 as
select *, (select sum(x) from ds2 where subject=a.subject and date <= a.date and
date > coalesce((select date from ds1 where subject = a.subject and visit=a.visit-1),.)) as sumx
from ds1 a
;
quit;
Thank you!!! This works as it should. That said, I humbly request a data step solution if possible. This would greatly help me understand how to do this for future data sets as well (my knowledge of sql is very low).
Here's a fairly straightforward DATA step version:
data want (drop=X);
set ds2 (in=from_ds2) ds1 (in=from_ds1);
by subid date;
if first.subid then total_x=0;
if from_ds2 then total_x + x;
if from_ds1;
output;
total_x=0;
run;
There are a few sophisticated pieces here, but the code is pretty short and each piece should be comprehensible.
Good luck.
nice! I think somehow you need to switch the table order in the 'set' statement or modify the downstream 'from_' variable to make it work.
You caught it ... fixed now.
Here is a data step approach,
data ds1;
input subid $ visit date mmddyy8.;
format date date9.;
datalines;
A1 1 07/01/14
A1 2 08/02/14
A1 3 08/15/14
A2 1 05/14/15
A2 2 07/22/15
;
data ds2;
input subid $ date mmddyy8. x;
format date date9.;
datalines;
A1 06/02/14 3
A1 07/01/14 0
A1 08/02/14 1
A1 08/10/14 10
A1 08/15/14 2
A2 05/14/15 2
A2 05/23/15 3
A2 06/01/15 0
A2 07/22/15 7
;
data want;
if _n_=1 then
do;
dcl hash h(dataset:'ds2(rename=date=_date)', multidata:'y');
h.definekey('subid');
h.definedata(all:
'y');
h.definedone();
call missing(_date, x);
end;
set ds1;
by subid;
pre_date=lag(date);
call missing(sumx);
if first.subid then
do;
call missing(pre_date);
end;
do rc=h.find() by 0 while (rc=0);
if pre_date < _date <= date then
sumx+x;
rc=h.find_next();
end;
keep subid vist date sumx;
run;
data ds1;
infile cards truncover ;
input Subject$ Visit Date :mmddyy8.;
cards;
A1 1 07/01/14
A1 2 08/02/14
A1 3 08/15/14
A2 1 05/14/15
A2 2 07/22/15
;
data ds2;
infile cards truncover ;
input Subject$ Date :mmddyy8. X;
cards;
A1 06/02/14 3
A1 07/01/14 0
A1 08/02/14 1
A1 08/10/14 10
A1 08/15/14 2
A2 05/14/15 2
A2 05/23/15 3
A2 06/01/15 0
A2 07/22/15 7
;
data ds1;
set ds1;
by subject;
start_date=lag(date);
if first.subject then call missing(start_date);
format start_date date mmddyy10.;
run;
proc sql;
create table want as
select a.Subject,a.visit,a.date,sum(x) as sum_x
from ds1 as a left join ds2 as b
on a.subject=b.subject and a.start_date <b.date<=a.date
group by a.Subject,a.visit,a.date;
quit;
To all, thank you for the tremendous help and multiple solutions!
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.