I have an AE dataset and need to create 3 variables, lasteventfl, aegroup and firstdt, per the following specs:
BY subject aesoccalc aeptcalc temporary_suppae_qval_ctcae temporary_ae_aeterm
aestdat visno recordid
Within each subject/AESOCCALC/AEPTCALC/temporary_suppae_qval_ctcae/temporary_ae_aeterm combination:
Rule: If a record has the same AESTDAT as another record in that group, it should have the same AEENDAT.
Implementation:
For each record, compare to the NEXT record in sort order within the same subject/AESOCCALC/AEPTCALC/temporary_suppae_qval_ctcae/temporary_ae_aeterm group:
Set LASTEVENTFL = 'N' if ANY of these conditions are true:
Otherwise, set LASTEVENTFL = 'Y'
Special rule: Among multiple records with the same AESTDAT, only the LAST record (by visno, recordid) should get the 'Y' or 'N' determined above. All other records with that same AESTDAT should get 'N'.
Within each subject/AESOCCALC/AEPTCALC/temporary_suppae_qval_ctcae/temporary_ae_aeterm combination:
For each aegroup: FIRSTDT = minimum AESTDAT within that aegroup
Can somebody please help me get the correct code to what I need it to per the specs? I've been trying Claude for hours but for some reason, it cannot seem to get it to work.
Here is a sample input dataset, along with expected output:
data sample_input;
length subject $10 aesoccalc $50 aeptcalc $50
temporary_suppae_qval_ctcae $50 temporary_ae_aeterm $50
visno $20;
format aestdat aeendat date9.;
/* Example 1: Skin infection - all same start, no ends */
subject='001'; aesoccalc='Infections'; aeptcalc='Skin infection';
temporary_suppae_qval_ctcae='Skin infection'; temporary_ae_aeterm='SKIN INFECTION';
recordid=1; visno='Cycle 08'; aestdat='15OCT2023'd; aeendat=.; output;
recordid=2; visno='Cycle 15'; aestdat='15OCT2023'd; aeendat=.; output;
recordid=3; visno='Cycle 28'; aestdat='15OCT2023'd; aeendat=.; output;
/* Expected: All N except recordid=3 which is Y */
/* Example 2: Rash - two events same start, then gap, then another */
subject='002'; aesoccalc='Skin disorders'; aeptcalc='Rash';
temporary_suppae_qval_ctcae='Rash maculo-papular'; temporary_ae_aeterm='MACULO-PAPULAR RASH';
recordid=4; visno='Cycle 02'; aestdat='22APR2023'd; aeendat='15MAY2023'd; output;
recordid=5; visno='Cycle 03'; aestdat='22APR2023'd; aeendat='15MAY2023'd; output;
recordid=6; visno='Cycle 13'; aestdat='15MAY2023'd; aeendat='25MAY2023'd; output;
recordid=7; visno='Cycle 14'; aestdat='08APR2024'd; aeendat='28DEC2024'd; output;
/* Expected:
recordid=4: N (same start as next)
recordid=5: N (not last with this start date)
recordid=6: Y (last with this start, and next start != 25MAY or 26MAY)
recordid=7: Y (no next record)
*/
/* Example 3: Nausea - next starts on current end */
subject='003'; aesoccalc='GI disorders'; aeptcalc='Nausea';
temporary_suppae_qval_ctcae='Nausea'; temporary_ae_aeterm='NAUSEA';
recordid=8; visno='Cycle 01'; aestdat='01MAR2024'd; aeendat='28MAY2024'd; output;
recordid=9; visno='Cycle 02'; aestdat='28MAY2024'd; aeendat='30JUN2024'd; output;
/* Expected:
recordid=8: N (next start = current end)
recordid=9: Y (no next record)
*/
run;
/* EXPECTED OUTPUT */
data expected_output;
set sample_input;
length lasteventfl $1;
format firstdt date9.;
if recordid=1 then do; lasteventfl='N'; aegroup=1; firstdt='15OCT2023'd; end;
if recordid=2 then do; lasteventfl='N'; aegroup=1; firstdt='15OCT2023'd; end;
if recordid=3 then do; lasteventfl='Y'; aegroup=1; firstdt='15OCT2023'd; end;
if recordid=4 then do; lasteventfl='N'; aegroup=1; firstdt='22APR2023'd; end;
if recordid=5 then do; lasteventfl='N'; aegroup=1; firstdt='22APR2023'd; end;
if recordid=6 then do; lasteventfl='Y'; aegroup=2; firstdt='15MAY2023'd; end;
if recordid=7 then do; lasteventfl='Y'; aegroup=3; firstdt='08APR2024'd; end;
if recordid=8 then do; lasteventfl='N'; aegroup=1; firstdt='01MAR2024'd; end;
if recordid=9 then do; lasteventfl='Y'; aegroup=1; firstdt='01MAR2024'd; end;
run;
At what point in this process does the output not match what you're expecting? I assume this code is AI-generated as there are some strange things going on (my brain stopped working around the ...upcase(strip("variable_to_recode"n) part). I also see the LAG() function being used conditionally (...else if lag(...)...), which is basically never a good idea. Perhaps someone will have the patience to go through this, but really I would say you need to run this step by step and figure out exactly where it's going off-track, then, if needed, post that here.
Your specifications are unclear.
In Step 2 for example:
If there is only one end date you want to use that, Ok
What if there are multiple end dates? Which one do you want to use? The maximum? The minimum?
And where are you supposed to find a non-missing end date to carry forward? It would not make sense to use a date from some other subject or from some other AE for this subject.
You are trying to do steps 3 to 5 in the wrong order.
First determine the groups (step 4) then it will be trivial to find the first date (step 5) and to attach your flag to the last observation in a group(step 3).
We are looking for end dates within the same subject/AESOCCALC/AEPTCALC/temporary_suppae_qval_ctcae/temporary_ae_aeterm group. So, for example let's say we have a subject with 3 records for Skin rash. The first 2 records have end date=31MAY2023 but the last one has end date missing. In that case, we'd carry forward the 31MAY2023 end date for the last record. So you carry forward the non-missing end date as long as it's for the same subject/AESOCCALC/AEPTCALC/temporary_suppae_qval_ctcae/temporary_ae_aeterm group, and as long as the start date for the record with the non-missing end date is the same as the start date for the record with the missing end date.
Regarding the aegroup determination, the LASTEVENTFL has to be created first since aegroup determination depends on LASTEVENTFL. Hopefully that makes more sense.
LOCF is easy to implement using the UDPATE statement.
I don't like to re-type long list of variable names so let's put that list into a macro variable.
%let keys=subject aesoccalc aeptcalc
temporary_suppae_qval_ctcae temporary_ae_aeterm
;
proc sort data=sample_input out=step1;
BY &keys aestdat visno recordid ;
run;
data step2;
update step1(obs=0) step1;
by &keys aestdat;
output;
run;
But that will not back fill the missing dates that appeared first in a group.
You could fix that by changing the SORT ORDER so a non missing end date (if one exists) appears on the first observation.
proc sort data=sample_input out=step1;
by &keys aestdat descending aeendat ;
run;
If there are other variables (like the VISNO and RECORDID variable you mentioned) that you do not want carried forward then add an extra SET statement to re-read the observation after the UPDATE and before the OUTPUT.
For example to only carry forward AEENDAT you could use:
data step2;
update step1(obs=0) step1;
by &keys aestdat;
set step1(drop=aeendat);
output;
run;
Note the LOCF logic will not handle the cases where there are multiple different AEENDAT values for the same start date. For that you need to summarize first and then use the new summary variable instead of the original variable. So assuming you want to use the MAX AEENDAT in that case your code for step2 might look like this.
data step2;
do until(last.aestdat);
set step1;
by &keys aestdat;
max_end = max(max_end,aeendat);
end;
do until(last.aestdat);
set step1;
by &keys aestdat;
aeendat=max_end;
output;
end;
drop max_end;
run;
Once you have consistent AESTDAT and AEENDAT values then calculating the AEGROUP variable is simple and does not require any look ahead.
Just increment the AEGROUP variable when you see that AESTDAT is larger than the previous AEENDAT value. Be careful of the cases with only missing values of AEENDAT. Also do not get confused by end dates from a previous AE term. Reset the counter for each new subject.
data step5 ;
set step2 ;
by &keys aestdat;
if first.temporary_ae_aeterm
or (aestdat>lag(aeendat)>.Z) then aegroup+1;
if first.subject then aegroup=1;
run;
Now finding the first date and flagging the last observation is simple.
data want;
set step5;
by &keys aegroup ;
length lasteventfl $1 firstdt 8;
if first.aegroup then firstdt = aestdat;
if last.aegroup then lasteventfl='Y';
else lasteventfl='N';
retain firstdt;
format firstdt date9.;
run;
So running that on your example
data sample_input ;
infile datalines dsd truncover;
input subject :$10. aesoccalc :$50. aeptcalc :$50.
temporary_suppae_qval_ctcae :$50. temporary_ae_aeterm :$50.
visno :$20. aestdat :date. aeendat :date. recordid
;
format aestdat date9. aeendat date9. ;
datalines4;
001,Infections,Skin infection,Skin infection,SKIN INFECTION,Cycle 08,15OCT2023,,1
001,Infections,Skin infection,Skin infection,SKIN INFECTION,Cycle 15,15OCT2023,,2
001,Infections,Skin infection,Skin infection,SKIN INFECTION,Cycle 28,15OCT2023,,3
002,Skin disorders,Rash,Rash maculo-papular,MACULO-PAPULAR RASH,Cycle 02,22APR2023,15MAY2023,4
002,Skin disorders,Rash,Rash maculo-papular,MACULO-PAPULAR RASH,Cycle 03,22APR2023,15MAY2023,5
002,Skin disorders,Rash,Rash maculo-papular,MACULO-PAPULAR RASH,Cycle 13,15MAY2023,25MAY2023,6
002,Skin disorders,Rash,Rash maculo-papular,MACULO-PAPULAR RASH,Cycle 14,08APR2024,28DEC2024,7
003,GI disorders,Nausea,Nausea,NAUSEA,Cycle 01,01MAR2024,28MAY2024,8
003,GI disorders,Nausea,Nausea,NAUSEA,Cycle 02,28MAY2024,30JUN2024,9
;;;;
data expected_output ;
infile datalines dsd truncover;
input subject :$10. aesoccalc :$50. aeptcalc :$50.
temporary_suppae_qval_ctcae :$50. temporary_ae_aeterm :$50.
visno :$20. aestdat :date. aeendat :date. recordid lasteventfl :$1.
firstdt :date. aegroup
;
format aestdat date9. aeendat date9. firstdt date9. ;
datalines4;
001,Infections,Skin infection,Skin infection,SKIN INFECTION,Cycle 08,15OCT2023,,1,N,15OCT2023,1
001,Infections,Skin infection,Skin infection,SKIN INFECTION,Cycle 15,15OCT2023,,2,N,15OCT2023,1
001,Infections,Skin infection,Skin infection,SKIN INFECTION,Cycle 28,15OCT2023,,3,Y,15OCT2023,1
002,Skin disorders,Rash,Rash maculo-papular,MACULO-PAPULAR RASH,Cycle 02,22APR2023,15MAY2023,4,N,22APR2023,1
002,Skin disorders,Rash,Rash maculo-papular,MACULO-PAPULAR RASH,Cycle 03,22APR2023,15MAY2023,5,N,22APR2023,1
002,Skin disorders,Rash,Rash maculo-papular,MACULO-PAPULAR RASH,Cycle 13,15MAY2023,25MAY2023,6,Y,15MAY2023,2
002,Skin disorders,Rash,Rash maculo-papular,MACULO-PAPULAR RASH,Cycle 14,08APR2024,28DEC2024,7,Y,08APR2024,3
003,GI disorders,Nausea,Nausea,NAUSEA,Cycle 01,01MAR2024,28MAY2024,8,N,01MAR2024,1
003,GI disorders,Nausea,Nausea,NAUSEA,Cycle 02,28MAY2024,30JUN2024,9,Y,01MAR2024,1
;;;;
%let keys=subject aesoccalc aeptcalc
temporary_suppae_qval_ctcae temporary_ae_aeterm
;
* Step 1;
proc sort data=sample_input out=step1;
by &keys aestdat visno recordid ;
run;
* Step 2;
data step2;
do until(last.aestdat);
set step1;
by &keys aestdat;
max_end = max(max_end,aeendat);
end;
do until(last.aestdat);
set step1;
by &keys aestdat;
aeendat=max_end;
output;
end;
drop max_end;
run;
* Step 5;
data step5 ;
set step2 ;
by &keys aestdat;
if first.temporary_ae_aeterm
or (aestdat>lag(aeendat)>.Z) then aegroup+1;
if first.subject then aegroup=1;
run;
* Step 3 and 4 ;
data want;
set step5;
by &keys aegroup ;
length lasteventfl $1 firstdt 8;
if first.aegroup then firstdt = aestdat;
if last.aegroup then lasteventfl='Y';
else lasteventfl='N';
retain firstdt;
format firstdt date9.;
run;
proc print data=want(drop=aesoccalc aeptcalc temporary_suppae_qval_ctcae ); run;
proc print data=expected_output(drop=aesoccalc aeptcalc temporary_suppae_qval_ctcae ); run;
proc compare data=want compare=expected_output;
run;
finds one difference with your expected output.
For SUBJECT 003 your expected results collapsed records 8 and 9 into one group. Why? Is it because the start and end dates are the same? But you did not that for the similar situation in SUBJECT 002 between records 5 and 6.
If you don't want to collapse when the start = previous end then just change the test:
if first.temporary_ae_aeterm
or (aestdat+1>lag(aeendat)>.Z) then aegroup+1;
Results with adjacent records NOT collapsed:
Thank you, Tom! Sorry for any confusion and thank you for pointing out the inconsistency with subject 003. I just realized that there was an error with the expected results for this participant. Both records should be in aegroup 1 and only the last record should have LASTEVENTFL=Y.
@Merdock wrote:
We are looking for end dates within the same subject/AESOCCALC/AEPTCALC/temporary_suppae_qval_ctcae/temporary_ae_aeterm group. So, for example let's say we have a subject with 3 records for Skin rash. The first 2 records have end date=31MAY2023 but the last one has end date missing. In that case, we'd carry forward the 31MAY2023 end date for the last record. So you carry forward the non-missing end date as long as it's for the same subject/AESOCCALC/AEPTCALC/temporary_suppae_qval_ctcae/temporary_ae_aeterm group, and as long as the start date for the record with the non-missing end date is the same as the start date for the record with the missing end date.
Regarding the aegroup determination, the LASTEVENTFL has to be created first since aegroup determination depends on LASTEVENTFL. Hopefully that makes more sense.
It looks to me like the second bullet of step 2 conflicts with the bolded part above:
I think the bullet is saying you can carry forward an end date from an earlier record with a different start date. Which seems odd, and could lead to cases where the imputed end date is earlier than the start date.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.