Dear community,
I am trying to merge EPOCH when SESTDTC<=ZEDTC<SEENDTC or ZEDTC>=SESTDTC (if SEENDTC is missing) and keep the latest value where this is true. Proc SQL is time consuming when dealing with huge datasets, so I am looking for alternatives.
USUBJID | TAETORD | EPOCH | SESTDTC | SEENDTC |
1234 | 1 | SCREENING | 2019-02-20 | 2019-03-04T08:25 |
1234 | 2 | TREATMENT | 2019-03-04T08:25 | 2019-03-09 |
1234 | 3 | FOLLOW-UP | 2019-03-09 | 2019-04-01 |
1000 | 1 | SCREENING | 2019-07-07 |
USUBJID | VISITNUM | ZESPID | ZEDTC | EPOCH |
1234 | 1 | 1 | 2019-02-24 | SCREENING |
1234 | 2 | 1 | 2019-03-04T08:25 | TREATMENT |
1234 | 2 | 2 | 2019-03-04 | TREATMENT |
1234 | 3 | 1 | 2019-03 | FOLLOW-UP |
1234 | 3 | 2 | 2019-03-09T12:35 | FOLLOW-UP |
1000 | 1 | 1 | 2019-07-20 | SCREENING |
Things to consider:
- the EPOCH variable in blue is the result after merging both tables
- all date variables are in ISO8601 format with or without time and ZEDTC can even have partial date values
Please let me know if you need additional information. Thanks.
Try just setting them by subject and date (start date / event date). Set the epoch dataset first and retain the EPOCH value from it going forward.
data via_ds;
set se(in=se rename=(sestdtc=zedtc)) ze (in=ze) ;
by usubjid zedtc ;
length new_epoch $9;
retain new_epoch;
if first.usubjid then new_epoch=' ';
if se then new_epoch=epoch;
if ze ;
keep USUBJID VISITNUM ZESPID ZEDTC new_epoch;
rename new_epoch=EPOCH;
run;
Results:
Obs USUBJID zedtc VISITNUM ZESPID EPOCH 1 1000 2019-07-20 1 1 SCREENING 2 1234 2019-02-24 1 1 SCREENING 3 1234 2019-03 3 1 SCREENING 4 1234 2019-03-04 2 2 SCREENING 5 1234 2019-03-04T08:25 2 1 TREATMENT 6 1234 2019-03-09T12:35 3 2 FOLLOW-UP
Please show us examples for the datasets before joining. Use data steps with datalines, so we can quickly recreate the datasets for testing. Use the "little running man" to post the data step codes.
Please see below:
data se;
length USUBJID $4 TAETORD 8 EPOCH $9 SESTDTC SEENDTC $16;
infile datalines dsd;
input USUBJID TAETORD EPOCH SESTDTC SEENDTC;
datalines;
1234,1,SCREENING,2019-02-20,2019-03-04T08:25
1234,2,TREATMENT,2019-03-04T08:25,2019-03-09
1234,3,FOLLOW-UP,2019-03-09,2019-04-01
1000,1,SCREENING,2019-07-07,
;
run;
data ze;
length USUBJID $4 VISITNUM 8 ZESPID $1 ZEDTC $16;
infile datalines dsd;
input USUBJID VISITNUM ZESPID ZEDTC;
datalines;
1234,1,1,2019-02-24
1234,2,1,2019-03-04T08:25
1234,2,2,2019-03-04
1234,3,1,2019-03
1234,3,2,2019-03-09T12:35
1000,1,1,2019-07-20
;
run;
How many observations do you have in both of these datasets in real life?
ZE between 20000 and 60000 and SE between 100 and 500.
I'm working on a SAS EG server and having 3-4 SQL procedures in one program can make the program run for 15-30 minutes.
What is the result you want from that data? In particular which of these two results do you want?
Obs USUBJID VISITNUM ZESPID ZEDTC EPOCH 5 1234 2 1 2019-03-04T08:25 TREATMENT 6 1234 2 1 2019-03-04T08:25 SCREENING
Try just setting them by subject and date (start date / event date). Set the epoch dataset first and retain the EPOCH value from it going forward.
data via_ds;
set se(in=se rename=(sestdtc=zedtc)) ze (in=ze) ;
by usubjid zedtc ;
length new_epoch $9;
retain new_epoch;
if first.usubjid then new_epoch=' ';
if se then new_epoch=epoch;
if ze ;
keep USUBJID VISITNUM ZESPID ZEDTC new_epoch;
rename new_epoch=EPOCH;
run;
Results:
Obs USUBJID zedtc VISITNUM ZESPID EPOCH 1 1000 2019-07-20 1 1 SCREENING 2 1234 2019-02-24 1 1 SCREENING 3 1234 2019-03 3 1 SCREENING 4 1234 2019-03-04 2 2 SCREENING 5 1234 2019-03-04T08:25 2 1 TREATMENT 6 1234 2019-03-09T12:35 3 2 FOLLOW-UP
This is what I wanted.
Thank you all for your suggestions.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.