BookmarkSubscribeRSS Feed
Merdock
Quartz | Level 8

 

I have an AE dataset and need to create 3 variables, lasteventfl, aegroup and firstdt, per the following specs:

Step 1: Sort Data

 
 
BY subject aesoccalc aeptcalc temporary_suppae_qval_ctcae temporary_ae_aeterm 
   aestdat visno recordid

Step 2: Fill Missing End Dates

Within each subject/AESOCCALC/AEPTCALC/temporary_suppae_qval_ctcae/temporary_ae_aeterm combination:

Rule: If a record has the same AESTDAT as another record in that group, it should have the same AEENDAT.

Implementation:

  • For records with the same start date, use the non-missing end date if one exists
  • If no end date exists for that start date group, carry forward the most recent non-missing end date from prior records

Step 3: Determine LASTEVENTFL

For each record, compare to the NEXT record in sort order within the same subject/AESOCCALC/AEPTCALC/temporary_suppae_qval_ctcae/temporary_ae_aeterm group:

Set LASTEVENTFL = 'N' if ANY of these conditions are true:

  1. Next record's AESTDAT = Current record's AESTDAT
  2. Next record's AESTDAT = Current record's AEENDAT (filled)
  3. Next record's AESTDAT = Current record's AEENDAT (filled) + 1

Otherwise, set LASTEVENTFL = 'Y'

Special rule: Among multiple records with the same AESTDAT, only the LAST record (by visno, recordid) should get the 'Y' or 'N' determined above. All other records with that same AESTDAT should get 'N'.

Step 4: Create AEGROUP

Within each subject/AESOCCALC/AEPTCALC/temporary_suppae_qval_ctcae/temporary_ae_aeterm combination:

  • First record: aegroup = 1
  • When you encounter a record where the PREVIOUS record had lasteventfl='Y': increment aegroup

Step 5: Calculate FIRSTDT

For each aegroup: FIRSTDT = minimum AESTDAT within that aegroup

 

Can somebody please help me get the correct code to what I need it to per the specs? I've been trying Claude for hours but for some reason, it cannot seem to get it to work.


Here is a sample input dataset, along with expected output:

data sample_input;
    length subject $10 aesoccalc $50 aeptcalc $50 
           temporary_suppae_qval_ctcae $50 temporary_ae_aeterm $50 
           visno $20;
    format aestdat aeendat date9.;
    
    /* Example 1: Skin infection - all same start, no ends */
    subject='001'; aesoccalc='Infections'; aeptcalc='Skin infection'; 
    temporary_suppae_qval_ctcae='Skin infection'; temporary_ae_aeterm='SKIN INFECTION';
    recordid=1; visno='Cycle 08'; aestdat='15OCT2023'd; aeendat=.; output;
    recordid=2; visno='Cycle 15'; aestdat='15OCT2023'd; aeendat=.; output;
    recordid=3; visno='Cycle 28'; aestdat='15OCT2023'd; aeendat=.; output;
    /* Expected: All N except recordid=3 which is Y */
    
    /* Example 2: Rash - two events same start, then gap, then another */
    subject='002'; aesoccalc='Skin disorders'; aeptcalc='Rash'; 
    temporary_suppae_qval_ctcae='Rash maculo-papular'; temporary_ae_aeterm='MACULO-PAPULAR RASH';
    recordid=4; visno='Cycle 02'; aestdat='22APR2023'd; aeendat='15MAY2023'd; output;
    recordid=5; visno='Cycle 03'; aestdat='22APR2023'd; aeendat='15MAY2023'd; output;
    recordid=6; visno='Cycle 13'; aestdat='15MAY2023'd; aeendat='25MAY2023'd; output;
    recordid=7; visno='Cycle 14'; aestdat='08APR2024'd; aeendat='28DEC2024'd; output;
    /* Expected: 
       recordid=4: N (same start as next)
       recordid=5: N (not last with this start date)
       recordid=6: Y (last with this start, and next start != 25MAY or 26MAY)
       recordid=7: Y (no next record)
    */
    
    /* Example 3: Nausea - next starts on current end */
    subject='003'; aesoccalc='GI disorders'; aeptcalc='Nausea'; 
    temporary_suppae_qval_ctcae='Nausea'; temporary_ae_aeterm='NAUSEA';
    recordid=8; visno='Cycle 01'; aestdat='01MAR2024'd; aeendat='28MAY2024'd; output;
    recordid=9; visno='Cycle 02'; aestdat='28MAY2024'd; aeendat='30JUN2024'd; output;
    /* Expected:
       recordid=8: N (next start = current end)
       recordid=9: Y (no next record)
    */
run;

/* EXPECTED OUTPUT */
data expected_output;
    set sample_input;
    length lasteventfl $1;
    format firstdt date9.;
    
    if recordid=1 then do; lasteventfl='N'; aegroup=1; firstdt='15OCT2023'd; end;
    if recordid=2 then do; lasteventfl='N'; aegroup=1; firstdt='15OCT2023'd; end;
    if recordid=3 then do; lasteventfl='Y'; aegroup=1; firstdt='15OCT2023'd; end;
    
    if recordid=4 then do; lasteventfl='N'; aegroup=1; firstdt='22APR2023'd; end;
    if recordid=5 then do; lasteventfl='N'; aegroup=1; firstdt='22APR2023'd; end;
    if recordid=6 then do; lasteventfl='Y'; aegroup=2; firstdt='15MAY2023'd; end;
    if recordid=7 then do; lasteventfl='Y'; aegroup=3; firstdt='08APR2024'd; end;
    
    if recordid=8 then do; lasteventfl='N'; aegroup=1; firstdt='01MAR2024'd; end;
    if recordid=9 then do; lasteventfl='Y'; aegroup=1; firstdt='01MAR2024'd; end;
run;

 

7 REPLIES 7
quickbluefish
Barite | Level 11

At what point in this process does the output not match what you're expecting?  I assume this code is AI-generated as there are some strange things going on (my brain stopped working around the ...upcase(strip("variable_to_recode"n) part).  I also see the LAG() function being used conditionally (...else if lag(...)...), which is basically never a good idea.  Perhaps someone will have the patience to go through this, but really I would say you need to run this step by step and figure out exactly where it's going off-track, then, if needed, post that here.  

Merdock
Quartz | Level 8
Thank you for your reply! Unfortunately, I'm in a bit of a time pinch at this point so to just simplify things, I removed my clunky code and just provided a sample dataset and expected output. Hopefully somebody can help me with some working code that does what I need it to.
Tom
Super User Tom
Super User

Your specifications are unclear.

 

In Step 2 for example:

If there is only one end date you want to use that, Ok

What if there are multiple end dates? Which one do you want to use? The maximum? The minimum?

And where are you supposed to find a non-missing end date to carry forward?  It would not make sense to use a date from some other subject or from some other AE for this subject.

 

You are trying to do steps 3 to 5 in the wrong order. 

 

First determine the groups (step 4) then it will be trivial to find the first date (step 5) and to attach your flag to the last observation in a group(step 3).

 

 

Merdock
Quartz | Level 8

We are looking for end dates within the same subject/AESOCCALC/AEPTCALC/temporary_suppae_qval_ctcae/temporary_ae_aeterm group. So, for example let's say we have a subject with 3 records for Skin rash. The first 2 records have end date=31MAY2023 but the last one has end date missing. In that case, we'd carry forward the 31MAY2023 end date for the last record. So you carry forward the non-missing end date as long as it's for the same subject/AESOCCALC/AEPTCALC/temporary_suppae_qval_ctcae/temporary_ae_aeterm group, and as long as the start date for the record with the non-missing end date is the same as the start date for the record with the missing end date.
Regarding the aegroup determination, the LASTEVENTFL has to be created first since aegroup determination depends on LASTEVENTFL. Hopefully that makes more sense.

Tom
Super User Tom
Super User

LOCF is easy to implement using the UDPATE statement.

 

I don't like to re-type long list of variable names so let's put that list into a macro variable.

%let keys=subject aesoccalc aeptcalc
 temporary_suppae_qval_ctcae temporary_ae_aeterm 
;

proc sort data=sample_input out=step1;
  BY &keys aestdat visno recordid ;
run;

data step2;
  update step1(obs=0) step1;
  by &keys aestdat;
  output;
run;

But that will not back fill the missing dates that appeared first in a group.

 

You could fix that by changing the SORT ORDER so a non missing end date (if one exists) appears on the first observation.

proc sort data=sample_input out=step1;
  by &keys aestdat descending aeendat ;
run;

 

If there are other variables (like the VISNO and RECORDID variable you mentioned) that you do not want carried forward then add an extra SET statement to re-read the observation after the UPDATE and before the OUTPUT.

 

For example to only carry forward AEENDAT you could use:

data step2;
  update step1(obs=0) step1;
  by &keys aestdat;
  set step1(drop=aeendat);
  output;
run;

 

Note the LOCF logic will not handle the cases where there are multiple different AEENDAT values for the same start date.  For that you need to summarize first and then use the new summary variable instead of the original variable.  So assuming you want to use the MAX AEENDAT in that case your code for step2 might look like this.

data step2;
  do until(last.aestdat);
    set step1;
    by &keys aestdat;
    max_end = max(max_end,aeendat);
  end;
  do until(last.aestdat);
    set step1;
    by &keys aestdat;
    aeendat=max_end;
    output;
  end;
  drop max_end;
run;

 

Once you have consistent AESTDAT and AEENDAT values then calculating the AEGROUP variable is simple and does not require any look ahead.

Just increment the AEGROUP variable when you see that AESTDAT is larger than the previous AEENDAT value.  Be careful of the cases with only missing values of AEENDAT.  Also do not get confused by end dates from a previous AE term.  Reset the counter for each new subject.

data step5 ;
   set step2 ;
   by &keys aestdat;
   if first.temporary_ae_aeterm
    or (aestdat>lag(aeendat)>.Z) then aegroup+1;
   if first.subject then aegroup=1;
run;

Now finding the first date and flagging the last observation is simple.

data want;
   set step5;
   by &keys aegroup ;
   length lasteventfl $1 firstdt 8;
   if first.aegroup then firstdt = aestdat;
   if last.aegroup then lasteventfl='Y';
   else lasteventfl='N';
   retain firstdt;
   format firstdt date9.;
run;

So running that on your example

Spoiler
data sample_input ;
  infile datalines dsd truncover;
  input subject :$10. aesoccalc :$50. aeptcalc :$50.
    temporary_suppae_qval_ctcae :$50. temporary_ae_aeterm :$50.
    visno :$20. aestdat :date. aeendat :date. recordid
  ;
  format aestdat date9. aeendat date9. ;
datalines4;
001,Infections,Skin infection,Skin infection,SKIN INFECTION,Cycle 08,15OCT2023,,1
001,Infections,Skin infection,Skin infection,SKIN INFECTION,Cycle 15,15OCT2023,,2
001,Infections,Skin infection,Skin infection,SKIN INFECTION,Cycle 28,15OCT2023,,3
002,Skin disorders,Rash,Rash maculo-papular,MACULO-PAPULAR RASH,Cycle 02,22APR2023,15MAY2023,4
002,Skin disorders,Rash,Rash maculo-papular,MACULO-PAPULAR RASH,Cycle 03,22APR2023,15MAY2023,5
002,Skin disorders,Rash,Rash maculo-papular,MACULO-PAPULAR RASH,Cycle 13,15MAY2023,25MAY2023,6
002,Skin disorders,Rash,Rash maculo-papular,MACULO-PAPULAR RASH,Cycle 14,08APR2024,28DEC2024,7
003,GI disorders,Nausea,Nausea,NAUSEA,Cycle 01,01MAR2024,28MAY2024,8
003,GI disorders,Nausea,Nausea,NAUSEA,Cycle 02,28MAY2024,30JUN2024,9
;;;;

data expected_output ;
  infile datalines dsd truncover;
  input subject :$10. aesoccalc :$50. aeptcalc :$50.
    temporary_suppae_qval_ctcae :$50. temporary_ae_aeterm :$50.
    visno :$20. aestdat :date. aeendat :date. recordid lasteventfl :$1. 
    firstdt :date. aegroup
  ;
  format aestdat date9. aeendat date9. firstdt date9. ;
datalines4;
001,Infections,Skin infection,Skin infection,SKIN INFECTION,Cycle 08,15OCT2023,,1,N,15OCT2023,1
001,Infections,Skin infection,Skin infection,SKIN INFECTION,Cycle 15,15OCT2023,,2,N,15OCT2023,1
001,Infections,Skin infection,Skin infection,SKIN INFECTION,Cycle 28,15OCT2023,,3,Y,15OCT2023,1
002,Skin disorders,Rash,Rash maculo-papular,MACULO-PAPULAR RASH,Cycle 02,22APR2023,15MAY2023,4,N,22APR2023,1
002,Skin disorders,Rash,Rash maculo-papular,MACULO-PAPULAR RASH,Cycle 03,22APR2023,15MAY2023,5,N,22APR2023,1
002,Skin disorders,Rash,Rash maculo-papular,MACULO-PAPULAR RASH,Cycle 13,15MAY2023,25MAY2023,6,Y,15MAY2023,2
002,Skin disorders,Rash,Rash maculo-papular,MACULO-PAPULAR RASH,Cycle 14,08APR2024,28DEC2024,7,Y,08APR2024,3
003,GI disorders,Nausea,Nausea,NAUSEA,Cycle 01,01MAR2024,28MAY2024,8,N,01MAR2024,1
003,GI disorders,Nausea,Nausea,NAUSEA,Cycle 02,28MAY2024,30JUN2024,9,Y,01MAR2024,1
;;;;

%let keys=subject aesoccalc aeptcalc
 temporary_suppae_qval_ctcae temporary_ae_aeterm 
;

* Step 1;
proc sort data=sample_input out=step1;
  by &keys aestdat visno recordid ;
run;

* Step 2;
data step2;
  do until(last.aestdat);
    set step1;
    by &keys aestdat;
    max_end = max(max_end,aeendat);
  end;
  do until(last.aestdat);
    set step1;
    by &keys aestdat;
    aeendat=max_end;
    output;
  end;
  drop max_end;
run;

* Step 5;
data step5 ;
   set step2 ;
   by &keys aestdat;
   if first.temporary_ae_aeterm
    or (aestdat>lag(aeendat)>.Z) then aegroup+1;
   if first.subject then aegroup=1;
run;

* Step 3 and 4 ;
data want;
   set step5;
   by &keys aegroup ;
   length lasteventfl $1 firstdt 8;
   if first.aegroup then firstdt = aestdat;
   if last.aegroup then lasteventfl='Y';
   else lasteventfl='N';
   retain firstdt;
   format firstdt date9.;
run;

proc print data=want(drop=aesoccalc aeptcalc temporary_suppae_qval_ctcae ); run;
proc print data=expected_output(drop=aesoccalc aeptcalc temporary_suppae_qval_ctcae ); run;
proc compare data=want compare=expected_output;
run;

finds one difference with your expected output. 

 

For SUBJECT 003 your expected results collapsed records 8 and 9 into one group.  Why?  Is it because the start and end dates are the same?  But you did not that for the similar situation in SUBJECT 002 between records 5 and 6.

 

If you don't want to collapse when the start = previous end then just change the test:

   if first.temporary_ae_aeterm
    or (aestdat+1>lag(aeendat)>.Z) then aegroup+1;

Results with adjacent records NOT collapsed:

Tom_0-1759770838843.png

 

 

 

 

Merdock
Quartz | Level 8

Thank you, Tom! Sorry for any confusion and thank you for pointing out the inconsistency with subject 003. I just realized that there was an error with the expected results for this participant. Both records should be in aegroup 1 and only the last record should have LASTEVENTFL=Y.

Quentin
Super User

@Merdock wrote:

We are looking for end dates within the same subject/AESOCCALC/AEPTCALC/temporary_suppae_qval_ctcae/temporary_ae_aeterm group. So, for example let's say we have a subject with 3 records for Skin rash. The first 2 records have end date=31MAY2023 but the last one has end date missing. In that case, we'd carry forward the 31MAY2023 end date for the last record. So you carry forward the non-missing end date as long as it's for the same subject/AESOCCALC/AEPTCALC/temporary_suppae_qval_ctcae/temporary_ae_aeterm group, and as long as the start date for the record with the non-missing end date is the same as the start date for the record with the missing end date.
Regarding the aegroup determination, the LASTEVENTFL has to be created first since aegroup determination depends on LASTEVENTFL. Hopefully that makes more sense.


It looks to me like the second bullet of step 2 conflicts with the bolded part above:

  • If no end date exists for that start date group, carry forward the most recent non-missing end date from prior records

I think the bullet is saying you can carry forward an end date from an earlier record with a different start date.  Which seems odd, and could lead to cases where the imputed end date is earlier than the start date.

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 7 replies
  • 152 views
  • 5 likes
  • 4 in conversation