I have each subject's series of exposure date and also series of adverse event date, and want to compare and find the last exposure date prior to each adverse event. For example, exposure dates are in the first two columns, adverse events dates are in the third and forth columns, and I want to get the last column for each adverse event. Can anyone give me some hints? Thanks
The first two columns are in one dataset (EX), and the 3rd and 4th columns are in the second dataset (AE); the 5th column is the column I want to add to have. I put them together for convenience.
ID | EX_date | ID | AE_date | Last EX_date before AE_date |
1 | 8-Mar-16 | 1 | 9-Mar-16 | 8-Mar-16 |
1 | 22-Mar-16 | 1 | 9-Mar-16 | 8-Mar-16 |
1 | 12-Apr-16 | 1 | 9-Mar-16 | 8-Mar-16 |
1 | 26-Apr-16 | 1 | 4-Mar-16 | |
2 | 30-Aug-16 | 1 | 20-Mar-16 | 8-Mar-16 |
2 | 16-Sep-16 | 1 | 9-Mar-16 | 8-Mar-16 |
2 | 30-Sep-16 | 1 | 3-Apr-16 | 22-Mar-16 |
2 | 14-Oct-16 | 1 | 26-Apr-16 | 12-Apr-16 |
2 | 9-Nov-16 | 1 | 26-Apr-16 | 12-Apr-16 |
3 | 13-Jul-16 | 1 | 20-May-16 | 26-Apr-16 |
3 | 27-Jul-16 | 1 | 16-May-16 | 26-Apr-16 |
3 | 14-Sep-16 | 2 | 12-Sep-16 | 30-Aug-16 |
3 | 12-Oct-16 | 2 | 12-Sep-16 | 30-Aug-16 |
3 | 26-Oct-16 | 2 | 28-Oct-16 | 14-Oct-16 |
3 | 9-Nov-16 | 2 | 28-Oct-16 | 14-Oct-16 |
4 | 19-Oct-16 | 3 | 27-Jul-16 | 13-Jul-16 |
4 | 2-Nov-16 | 3 | 14-Sep-16 | 27-Jul-16 |
4 | 16-Nov-16 | 3 | 12-Aug-16 | 27-Jul-16 |
4 | 30-Nov-16 | 3 | 10-Aug-16 | 27-Jul-16 |
5 | 30-Jun-16 | 3 | 14-Sep-16 | 27-Jul-16 |
5 | 14-Jul-16 | 3 | 19-Oct-16 | 12-Oct-16 |
5 | 20-Aug-16 | 4 | 5-Dec-16 | 30-Nov-16 |
5 | 13-Sep-16 | 4 | 15-Dec-16 | 30-Nov-16 |
5 | 27-Sep-16 | 4 | 15-Dec-16 | 30-Nov-16 |
5 | 11-Oct-16 | 4 | 25-Nov-16 | 30-Nov-16 |
5 | 25-Oct-16 | 4 | 13-Dec-16 | 30-Nov-16 |
5 | 8-Nov-16 | 4 | 27-Dec-16 | 30-Nov-16 |
5 | 22-Nov-16 | 5 | 19-Jul-16 | 14-Jul-16 |
5 | 6-Dec-16 | 5 | 16-Aug-16 | … |
5 | 27-Dec-16 | 5 | 30-Aug-16 | |
6 | 10-Jan-17 | 5 | 30-Aug-16 | |
6 | 1-Dec-16 | 5 | 13-Sep-16 | |
6 | 10-Nov-16 | 5 | 28-Sep-16 | |
6 | 28-Nov-16 | 5 | 28-Sep-16 | |
6 | 13-Dec-16 | 5 | 28-Sep-16 | |
6 | 28-Dec-16 | 5 | 28-Sep-16 | |
5 | 12-Oct-16 | |||
5 | 16-Oct-16 | |||
5 | 14-Oct-16 | |||
5 | 31-Oct-16 | |||
6 | 10-Jan-17 | |||
6 | 10-Jan-17 |
Assuming there are two separate tables, one for the Exposure date and one for the Adverse effects, you could create the fifth column like this:
PROC SQL;
SELECT tbl_A.ID
,tbl_A.AE_Date
,(SELECT MAX(tbl_B.EX_Date)
FROM tbl_Exposure AS tbl_B
WHERE tbl_A.ID = tbl_B.ID
AND tbl_A.AE_Date >= tbl_B.Exp_Date)
FROM tbl_AdvEff AS tbl_A;
;QUIT;
Is your data sorted by id? Your sample is not.
Assuming you have dataset EXDATES, sorted by ID and ex_date, and dataset AEDATES, sorted by ID and ae_date, then you can use the SET ... BY combination of statements to interleave the two datasets. You can retain last_exdate every time an ex_dates observation is read, and you can use the subsetting IF to only keep records read from AEDATES:
data exdates;
input ID EX_date :date9.;
format ex_date yymmddn8.;
datalines;
1 8mar2016
1 22mar2016
1 12Apr2016
1 26Apr2016
2 30Aug2016
2 16Sep2016
2 30Sep2016
2 14Oct2016
2 9Nov2016
3 13Jul2016
3 27Jul2016
3 14Sep2016
3 12Oct2016
3 26Oct2016
3 9Nov2016
4 19Oct2016
4 2Nov2016
4 16Nov2016
4 30Nov2016
5 30Jun2016
5 14Jul2016
5 20Aug2016
5 13Sep2016
5 27Sep2016
5 11Oct2016
5 25Oct2016
5 8Nov2016
5 22Nov2016
5 6Dec2016
5 27Dec2016
6 10Jan17
6 1Dec2016
6 10Nov2016
6 28Nov2016
6 13Dec2016
6 28Dec2016
run;
proc sort ; by id ex_date;run;
data aedates;
input ID AE_date :date9. ;
format ae_date yymmddn8.;
datalines;
1 9mar2016
1 9mar2016
1 9mar2016
1 4mar2016
1 20mar2016
1 9mar2016
1 3Apr2016
1 26Apr2016
1 26Apr2016
1 20May2016
1 16May2016
2 12Sep2016
2 12Sep2016
2 28Oct2016
2 28Oct2016
3 27Jul2016
3 14Sep2016
3 12Aug2016
3 10Aug2016
3 14Sep2016
3 19Oct2016
4 5Dec2016
4 15Dec2016
4 15Dec2016
4 25Nov2016
4 13Dec2016
4 27Dec2016
5 19Jul2016
5 16Aug2016
5 30Aug2016
5 30Aug2016
5 13Sep2016
5 28Sep2016
5 28Sep2016
5 28Sep2016
5 28Sep2016
5 12Oct2016
5 16Oct2016
5 14Oct2016
5 31Oct2016
6 10Jan2017
6 10Jan2017
run;
proc sort; by id ae_date;run;
data want;
set exdates (in=inex rename=(ex_date=ae_date))
aedates (in=inae);
by id ae_date;
retain last_exdate;
if first.id then last_exdate=.;
if inex then last_exdate=ae_date;
if inae;
format last_exdate yymmddn8.;
run;
Thank you, mkeintz. Your method should work too. Thank you very much for your time and input.
I always like seeing a DATA step approach for problems like this since SQL is more of my strong suit and always my first take on a solution. Seeing the same problem solved using both helps me see the problem from a different perspective and learn a bit more.
As long as the datasets are not large, I think you'll be happy with the performance of sql. But it is probably not as scalable as the data step approach, especially if your datasets are already sorted.
Assuming there are two separate tables, one for the Exposure date and one for the Adverse effects, you could create the fifth column like this:
PROC SQL;
SELECT tbl_A.ID
,tbl_A.AE_Date
,(SELECT MAX(tbl_B.EX_Date)
FROM tbl_Exposure AS tbl_B
WHERE tbl_A.ID = tbl_B.ID
AND tbl_A.AE_Date >= tbl_B.Exp_Date)
FROM tbl_AdvEff AS tbl_A;
;QUIT;
Thank you. This is a good clue.
I am thinking about left join
That would work too, it's largely a matter of preference on small datasets. On bigger datasets there may be some performance implications for choosing one way or another, at least when passing through to database, I'm not as experienced at pure PROC SQL performance tuning as I am at pass-through stuff.
@Sven111: Just in case you hadn't noticed, neither of the proposed solutions give you exactly what you asked for. To correct for that, given the SQL solution, change the line:
AND tbl_A.AE_Date >= tbl_B.Exp_Date
to
AND tbl_A.AE_Date > tbl_B.Exp_Date
HTH,
Art, CEO, AnalystFinder.com
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.