Re: Programming a time-varying cumulative covariate in long format

dwhitney · Posted 01-30-2026 09:53 AM

I have a dataset in a long format where each subsequent row per ID is a new dental surgery event; folks have anywhere from 2-5 distinct surgery dates (tx_date) within 30 days of the first tx_date. The outcome is examined up to 30 days after the first tx_date. For those that developed the outcome (postop_yn=1), the subsequent dental surgeries occur before or on the same day as the outcome. So, all unique tx_dates per ID are within 30 days of the very first tx_date through the end of the 30 day follow-up or on/before developing the outcome.

At each surgery, a certain number of teeth get extracted (Num_teeth_per_txdate). I need help in figuring out how to code the dataset so that I get a start (day 1) to end (the day before the subsequent tx_date) per tx_date while cumulatively summing the Num_teeth_per_txdate until the last tx_date.

1. One question is how to code the start/stop so that the stop is 1 day before the subsequent tx_date; right now, I am getting the stop to equal the next tx_date.

2. The other question is how to get the Num_teeth_per_txdate to be cumulatively summing over the tx_dates.

Below is the first 50 observations of the dataset. After that is the code I am struggling with.

id	tx_date	tx_visit	days_diff	Num_teeth_per_txdate	postop_yn	postop_days
120587	10/3/2022	1	0	2	0	30
120587	10/28/2022	2	25	1	0	30
149268	3/25/2021	1	0	5	0	30
149268	4/15/2021	2	21	3	0	30
152067	1/28/2021	1	0	3	0	30
152067	2/11/2021	2	14	1	0	30
154198	12/10/2020	1	0	5	0	30
154198	12/17/2020	2	7	2	0	30
154198	1/6/2021	3	27	3	0	30
170135	9/26/2022	1	0	1	0	30
170135	10/20/2022	2	24	1	0	30
181736	3/12/2024	1	0	1	0	30
181736	3/19/2024	2	7	1	0	30
309868	6/16/2023	1	0	3	0	30
309868	7/14/2023	2	28	2	0	30
318151	4/4/2023	1	0	6	1	30
318151	4/18/2023	2	14	9	1	30
323719	11/9/2022	1	0	1	0	30
323719	11/15/2022	2	6	8	0	30
327936	11/23/2020	1	0	1	0	30
327936	12/22/2020	2	29	1	0	30
329053	12/23/2021	1	0	3	0	30
329053	1/11/2022	2	19	1	0	30
336334	3/5/2024	1	0	1	0	30
336334	3/27/2024	2	22	3	0	30
337798	5/5/2021	1	0	6	0	30
337798	5/14/2021	2	9	4	0	30
338227	11/29/2023	1	0	3	0	30
338227	12/19/2023	2	20	4	0	30
340983	1/20/2021	1	0	4	0	30
340983	1/28/2021	2	8	4	0	30
340983	2/3/2021	3	14	2	0	30
345539	10/3/2025	1	0	3	0	30
345539	10/17/2025	2	14	3	0	30
345539	10/31/2025	3	28	3	0	30
348457	7/6/2023	1	0	1	1	21
348457	7/17/2023	2	11	3	1	21
349118	7/21/2020	1	0	1	0	30
349118	7/30/2020	2	9	4	0	30
349118	8/13/2020	3	23	2	0	30
349948	10/12/2022	1	0	1	1	16
349948	10/19/2022	2	7	4	1	16
350382	1/24/2024	1	0	1	0	30
350382	2/7/2024	2	14	4	0	30
354806	2/23/2021	1	0	2	0	30
354806	3/2/2021	2	7	2	0	30
363310	11/12/2020	1	0	4	0	30
363310	11/17/2020	2	5	6	0	30
367820	5/20/2022	1	0	5	0	30
367820	6/14/2022	2	25	2	0	30

Code that needs some help.

data v2a; set v2;

by id;

retain start stop tvteeth teeth;

if first.id then do; start=1; stop=1; teeth=.; tvteeth=Num_teeth_per_txdate; end;

if tx_visit ne 1 and days_diff < postop_days then do; start=stop; stop=days_diff; teeth=lag(Num_teeth_per_txdate); tvteeth=tvteeth+teeth; event=0; output; end;

if last.id then do; start=stop; stop=postop_days; teeth=lag(Num_teeth_per_txdate); tvteeth=tvteeth+teeth; event=(postop_yn=1); output; end;

run;

PaigeMiller · Posted 01-30-2026 10:06 AM

@dwhitney wrote:

1. One question is how to code the start/stop so that the stop is 1 day before the subsequent tx_date; right now, I am getting the stop to equal the next tx_date.

I am mystified by this question. I feel either I have completely missed the point, or you have missed something obvious. Can you not just subtract 1 and get the quantity you want?

2. The other question is how to get the Num_teeth_per_txdate to be cumulatively summing over the tx_dates.

Cumulative number of teeth, you can add this into your existing SAS data step code:

if first.id then cum_teeth=0;
cum_teeth+num_teeth_per_txdate;

--
Paige Miller

dwhitney · Posted 01-30-2026 10:55 AM

Hi Paige,

Thank you for the reply!

1. I can easily subtract 1 from STOP, but in a follow up DATA STEP as I can isolate the ID order number. In the current DATA STEP, if I subtract or add 1 to either the START or STOP for lines "if tx_visit ne 1..." and "if last.id...", it does this calculation on rows that I DO and do NOT want it to. For example, the below code (bolded segments are the revision) yields the correct START/STOP for the first and last ID rows, but not middle rows for those with 3+ tx_dates (see last ID in below table; I bolded the incorrect row).

data v2a; set v2;

by id;

retain start stop tvteeth teeth;

if first.id then cum_teeth=0; cum_teeth+num_teeth_per_txdate;

if first.id then do; start=1; stop=1; teeth=.; tvteeth=Num_teeth_per_txdate; end;

if tx_visit ne 1 and days_diff < postop_days then do; start=stop; stop=days_diff-1; teeth=lag(Num_teeth_per_txdate); tvteeth=tvteeth+teeth; event=0; output; end;

if last.id then do; start=stop+1; stop=postop_days; teeth=lag(Num_teeth_per_txdate); tvteeth=tvteeth+teeth; event=(postop_yn=1); output; end;

run;

id	tx_date	days_diff	start	stop
120587	10/3/2022	0	1	24
120587	10/28/2022	25	25	30
149268	3/25/2021	0	1	20
149268	4/15/2021	21	21	30
152067	1/28/2021	0	1	13
152067	2/11/2021	14	14	30
154198	12/10/2020	0	1	6
154198	12/17/2020	7	6	26
154198	1/6/2021	27	27	30

2. I have added your suggested code in several few different places, but it is not getting the cumulative sum, and I am not sure why. Below is the first 9 observations where the cum_teeth variable is showing what it SHOULD be, but I can't get that. Any suggestions? Below is the last version of the code I tried.

id	tx_date	start	stop	cum_teeth
120587	10/28/2022	1	24	2
120587	10/28/2022	25	30	3
149268	4/15/2021	1	20	5
149268	4/15/2021	21	30	8
152067	2/11/2021	1	13	3
152067	2/11/2021	14	30	4
154198	12/17/2020	1	6	5
154198	1/6/2021	6	26	7
154198	1/6/2021	27	30	10

data v2a; set v2;

by id;

retain start stop teeth cum_teeth;

if first.id then do; start=1; stop=1; cum_teeth=0; end;

if tx_visit ne 1 and days_diff < postop_days then do; start=stop; stop=days_diff-1; cum_teeth+num_teeth_per_txdate; event=0; output; end;

if last.id then do; start=stop+1; stop=postop_days; cum_teeth+num_teeth_per_txdate; event=(postop_yn=1); output; end;

run;

PaigeMiller · Posted 01-30-2026 11:12 AM

Regarding #2, the summing to get cum_teeth should not be inside an IF statement. That line must execute on every record.

Regarding #1, I am going to request that you provide code inside a code box (click on the little running man icon and paste your code in there) with one command per line, and with proper indentation.

And in fact, it is also requested (by me, if you want me to run code on your data) to provide the data as working SAS data step code (examples and instructions).

--
Paige Miller

dwhitney · Posted 01-30-2026 11:34 AM

data v2a;  set v2; 

by id; 

retain start stop cum_teeth; 

if first.id then do; start=1; stop=1; cum_teeth=0; end;

if tx_visit ne 1 and days_diff < postop_days then do; start=stop; stop=days_diff-1; event=0; output; end;

if last.id then do; start=stop+1; stop=postop_days; event=(postop_yn=1); output; end;

cum_teeth+num_teeth_per_txdate;
			
run;

Above is the code that is still not working to get the cumulative teeth.

Below is the working SAS data step code. I am only taking the first ~50 observations.

data 50.V2;
infile datalines dsd truncover;
input id:BEST. tx_date:MMDDYY10. tx_visit:32. days_diff:32. Num_teeth_per_txdate:32. postop_yn:32. postop_days:32.;
format id BEST. tx_date MMDDYY10.;
label id="ID" tx_date="Tx_date";
datalines4;
120587,10/03/2022,1,0,2,0,30
120587,10/28/2022,2,25,1,0,30
149268,03/25/2021,1,0,5,0,30
149268,04/15/2021,2,21,3,0,30
152067,01/28/2021,1,0,3,0,30
152067,02/11/2021,2,14,1,0,30
154198,12/10/2020,1,0,5,0,30
154198,12/17/2020,2,7,2,0,30
154198,01/06/2021,3,27,3,0,30
170135,09/26/2022,1,0,1,0,30
170135,10/20/2022,2,24,1,0,30
181736,03/12/2024,1,0,1,0,30
181736,03/19/2024,2,7,1,0,30
309868,06/16/2023,1,0,3,0,30
309868,07/14/2023,2,28,2,0,30
318151,04/04/2023,1,0,6,1,30
318151,04/18/2023,2,14,9,1,30
323719,11/09/2022,1,0,1,0,30
323719,11/15/2022,2,6,8,0,30
327936,11/23/2020,1,0,1,0,30
327936,12/22/2020,2,29,1,0,30
329053,12/23/2021,1,0,3,0,30
329053,01/11/2022,2,19,1,0,30
336334,03/05/2024,1,0,1,0,30
336334,03/27/2024,2,22,3,0,30
337798,05/05/2021,1,0,6,0,30
337798,05/14/2021,2,9,4,0,30
338227,11/29/2023,1,0,3,0,30
338227,12/19/2023,2,20,4,0,30
340983,01/20/2021,1,0,4,0,30
340983,01/28/2021,2,8,4,0,30
340983,02/03/2021,3,14,2,0,30
345539,10/03/2025,1,0,3,0,30
345539,10/17/2025,2,14,3,0,30
345539,10/31/2025,3,28,3,0,30
348457,07/06/2023,1,0,1,1,21
348457,07/17/2023,2,11,3,1,21
349118,07/21/2020,1,0,1,0,30
349118,07/30/2020,2,9,4,0,30
349118,08/13/2020,3,23,2,0,30
349948,10/12/2022,1,0,1,1,16
349948,10/19/2022,2,7,4,1,16
350382,01/24/2024,1,0,1,0,30
350382,02/07/2024,2,14,4,0,30
354806,02/23/2021,1,0,2,0,30
354806,03/02/2021,2,7,2,0,30
363310,11/12/2020,1,0,4,0,30
363310,11/17/2020,2,5,6,0,30
367820,05/20/2022,1,0,5,0,30
367820,06/14/2022,2,25,2,0,30
369007,12/01/2025,1,0,5,0,30
369007,12/18/2025,2,17,5,0,30

;;;;

PaigeMiller · Posted 01-30-2026 12:38 PM

Please comply with my request, with emphasis in red.

Regarding #1, I am going to request that you provide code inside a code box (click on the little running man icon and paste your code in there) with one command per line, and with proper indentation.

--
Paige Miller

dwhitney · Posted 01-30-2026 01:12 PM

Apologies, I must not know what proper indentation is. I googled it and tried the below.

data v2a;  
    set v2; 
        by id; 
            retain start stop cum_teeth; 
            if first.id then do; 
                start=1; stop=1; cum_teeth=0; 
            end;
            if tx_visit ne 1 and days_diff < postop_days then do; 
                start=stop; stop=days_diff-1; event=0; 
           output; 
           end;
           if last.id then do; 
               start=stop+1; stop=postop_days; event=(postop_yn=1); 
           output; 
           end;
        cum_teeth+num_teeth_per_txdate;		
run;

PaigeMiller · Posted 01-30-2026 01:29 PM

You still have three commands on a single line.

Nevertheless, this is much more workable, so thank you. Here is my solution.

data v2a;  
    set v2; 
    by id; 
    prev_tx_date=lag(tx_date);
    if first.id then do; 
        days_diff=0;
        cum_teeth=0; 
    end;
    else do;
        days_diff=tx_date-prev_tx_date-1;
    end;
    cum_teeth+num_teeth_per_txdate;	
    drop prev_tx_date;
run;

--
Paige Miller

dwhitney · Posted 01-30-2026 01:57 PM

Ah, I see now what is meant by a command! I will do that in the future. Thank you for working me on this.

Your code works well to get the cum_teeth, but not the START and STOP (in days), which I need for the time-varying proc phreg model. I tried using your code and adding START/STOP variables, but to no luck.

If first.ID, then START should equal 1 and the STOP should be 1 day minus the next tx_date per ID. Then for subsequent tx_dates per ID, the START should equal the diff_days (tx_date of that row - tx_date from the row before) and the STOP should equal either: (1) 1 day minus the next tx_date per ID if not the last.ID or (2) the postop_days if the last.ID. Is there a way to build that into your code?

Tom · Posted 01-30-2026 02:23 PM

Not sure I follow your whole thread but is seems you might need to know the NEXT value of START to do what you want. That is harder than finding the PREVIOUS value since it is much harder to predict the future than to remember the past.

SAS makes it easy to find the PREVIOUS value by using the LAG() function to remember the past value.

Here is simple trick to predict the future. Use an additional SET statement to read in the same dataset only using FIRSTOBS=2 to skip the first observation. I like to add a third read of just the first observation to make the two SET statements have the same number of observations.

Remember to take care not to use the NEXT value when at the end of a BY group since it comes from the following BY group. Just like you took care not to use the PREVIOUS value at the start of the BY group since it came from an earlier BY group.

data v2a;  
    set v2; 
    by id tx_date ; 
    prev_tx_date=lag(tx_date);
    set v2(keep=tx_date rename=(tx_date=next_tx_date) firstobs=2)
        v2(obs=1 drop=_all_)
    ;
    if first.id then do; 
        days_diff=0;
        cum_teeth=0; 
    end;
    else do;
        days_diff=tx_date-prev_tx_date-1;
    end;
    if last.id then call missing(next_tx_date);
    else stop_date = next_tx_date -1 ;
    format stop_date date9.;
    cum_teeth+num_teeth_per_txdate;	
    drop prev_tx_date next_tx_date ;
run;

PaigeMiller · Posted 01-30-2026 02:36 PM

Maybe someone else can figure out how to do this in one step, but here is my solution which takes several steps

data v2a;  
    set v2; 
    by id; 
    prev_tx_date=lag(tx_date);
    if first.id then do; 
        days_diff=0;
        cum_teeth=0; 
        start=1;
    end;
    else do;
        days_diff=tx_date-prev_tx_date-1;
    end;
    cum_teeth+num_teeth_per_txdate;	
    drop prev_tx_date;
run;

/* Find max tx_date and min tx_date per id */
proc summary nway data=v2a;
    class id;
    var tx_date;
    output out=max_tx_date min=min_tx_date max=max_tx_date;
run;
data v2b;
    merge v2a max_tx_date(keep=max_tx_date min_tx_date id);
    by id;
run;
data v2c;
    merge v2b v2b(firstobs=2 keep=tx_date rename=(tx_date=next_tx_date));
    if days_diff=0 then do; 
        start=1;
        stop=next_tx_date-tx_date-1; 
    end;
    else do;
        if tx_date^=max_tx_date then do;
            start=tx_date-min_tx_date;
            stop=next_tx_date-min_tx_date-1; 
        end;
        else do;
            start=tx_date-min_tx_date;
            stop=postop_days;
        end;
    end;
    drop min_tx_date max_tx_date next_tx_date;
run;

--
Paige Miller

quickbluefish · Posted 01-30-2026 03:42 PM

You might try something like this. That said, I don't think I really follow the purpose of the 'post_op' variable here.

proc sql noprint;
select max(nrecs) into :maxrecs trimmed 
from 
	(select id, count(*) as nrecs from tx group by id);
quit;

%let maxrecs=%eval(&maxrecs+1);

proc sort data=tx; by id tx_date; run;

data tx2;
set tx;
by id;
array dts {&maxrecs} _temporary_;
array tth {&maxrecs} _temporary_;
length n startdt enddt 4 nteeth 3;
format startdt enddt date9.;
if first.id then call missing(of dts[*], of tth[*], n, nteeth);
n+1;
dts[n]=tx_date;
tth[n]=num_teeth_per_txdate;
if last.id then do;
	do i=1 to n;
		startdt=dts[i];
		enddt=dts[i+1]-1;
		nteeth+tth[i];
		output;
	end;
end;
keep id startdt enddt nteeth;
run;

dwhitney · Posted 01-30-2026 04:02 PM

Thank you, kindly, to everyone! Using a bit of everyone's approach, I finally got the dataset that I needed. This has been an incredibly helpful experience. Thank you, everyone!

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away