I have a dataset in a long format where each subsequent row per ID is a new dental surgery event; folks have anywhere from 2-5 distinct surgery dates (tx_date) within 30 days of the first tx_date. The outcome is examined up to 30 days after the first tx_date. For those that developed the outcome (postop_yn=1), the subsequent dental surgeries occur before or on the same day as the outcome. So, all unique tx_dates per ID are within 30 days of the very first tx_date through the end of the 30 day follow-up or on/before developing the outcome.
At each surgery, a certain number of teeth get extracted (Num_teeth_per_txdate). I need help in figuring out how to code the dataset so that I get a start (day 1) to end (the day before the subsequent tx_date) per tx_date while cumulatively summing the Num_teeth_per_txdate until the last tx_date.
1. One question is how to code the start/stop so that the stop is 1 day before the subsequent tx_date; right now, I am getting the stop to equal the next tx_date.
2. The other question is how to get the Num_teeth_per_txdate to be cumulatively summing over the tx_dates.
Below is the first 50 observations of the dataset. After that is the code I am struggling with.
| id | tx_date | tx_visit | days_diff | Num_teeth_per_txdate | postop_yn | postop_days |
| 120587 | 10/3/2022 | 1 | 0 | 2 | 0 | 30 |
| 120587 | 10/28/2022 | 2 | 25 | 1 | 0 | 30 |
| 149268 | 3/25/2021 | 1 | 0 | 5 | 0 | 30 |
| 149268 | 4/15/2021 | 2 | 21 | 3 | 0 | 30 |
| 152067 | 1/28/2021 | 1 | 0 | 3 | 0 | 30 |
| 152067 | 2/11/2021 | 2 | 14 | 1 | 0 | 30 |
| 154198 | 12/10/2020 | 1 | 0 | 5 | 0 | 30 |
| 154198 | 12/17/2020 | 2 | 7 | 2 | 0 | 30 |
| 154198 | 1/6/2021 | 3 | 27 | 3 | 0 | 30 |
| 170135 | 9/26/2022 | 1 | 0 | 1 | 0 | 30 |
| 170135 | 10/20/2022 | 2 | 24 | 1 | 0 | 30 |
| 181736 | 3/12/2024 | 1 | 0 | 1 | 0 | 30 |
| 181736 | 3/19/2024 | 2 | 7 | 1 | 0 | 30 |
| 309868 | 6/16/2023 | 1 | 0 | 3 | 0 | 30 |
| 309868 | 7/14/2023 | 2 | 28 | 2 | 0 | 30 |
| 318151 | 4/4/2023 | 1 | 0 | 6 | 1 | 30 |
| 318151 | 4/18/2023 | 2 | 14 | 9 | 1 | 30 |
| 323719 | 11/9/2022 | 1 | 0 | 1 | 0 | 30 |
| 323719 | 11/15/2022 | 2 | 6 | 8 | 0 | 30 |
| 327936 | 11/23/2020 | 1 | 0 | 1 | 0 | 30 |
| 327936 | 12/22/2020 | 2 | 29 | 1 | 0 | 30 |
| 329053 | 12/23/2021 | 1 | 0 | 3 | 0 | 30 |
| 329053 | 1/11/2022 | 2 | 19 | 1 | 0 | 30 |
| 336334 | 3/5/2024 | 1 | 0 | 1 | 0 | 30 |
| 336334 | 3/27/2024 | 2 | 22 | 3 | 0 | 30 |
| 337798 | 5/5/2021 | 1 | 0 | 6 | 0 | 30 |
| 337798 | 5/14/2021 | 2 | 9 | 4 | 0 | 30 |
| 338227 | 11/29/2023 | 1 | 0 | 3 | 0 | 30 |
| 338227 | 12/19/2023 | 2 | 20 | 4 | 0 | 30 |
| 340983 | 1/20/2021 | 1 | 0 | 4 | 0 | 30 |
| 340983 | 1/28/2021 | 2 | 8 | 4 | 0 | 30 |
| 340983 | 2/3/2021 | 3 | 14 | 2 | 0 | 30 |
| 345539 | 10/3/2025 | 1 | 0 | 3 | 0 | 30 |
| 345539 | 10/17/2025 | 2 | 14 | 3 | 0 | 30 |
| 345539 | 10/31/2025 | 3 | 28 | 3 | 0 | 30 |
| 348457 | 7/6/2023 | 1 | 0 | 1 | 1 | 21 |
| 348457 | 7/17/2023 | 2 | 11 | 3 | 1 | 21 |
| 349118 | 7/21/2020 | 1 | 0 | 1 | 0 | 30 |
| 349118 | 7/30/2020 | 2 | 9 | 4 | 0 | 30 |
| 349118 | 8/13/2020 | 3 | 23 | 2 | 0 | 30 |
| 349948 | 10/12/2022 | 1 | 0 | 1 | 1 | 16 |
| 349948 | 10/19/2022 | 2 | 7 | 4 | 1 | 16 |
| 350382 | 1/24/2024 | 1 | 0 | 1 | 0 | 30 |
| 350382 | 2/7/2024 | 2 | 14 | 4 | 0 | 30 |
| 354806 | 2/23/2021 | 1 | 0 | 2 | 0 | 30 |
| 354806 | 3/2/2021 | 2 | 7 | 2 | 0 | 30 |
| 363310 | 11/12/2020 | 1 | 0 | 4 | 0 | 30 |
| 363310 | 11/17/2020 | 2 | 5 | 6 | 0 | 30 |
| 367820 | 5/20/2022 | 1 | 0 | 5 | 0 | 30 |
| 367820 | 6/14/2022 | 2 | 25 | 2 | 0 | 30 |
Code that needs some help.
@dwhitney wrote:
1. One question is how to code the start/stop so that the stop is 1 day before the subsequent tx_date; right now, I am getting the stop to equal the next tx_date.
I am mystified by this question. I feel either I have completely missed the point, or you have missed something obvious. Can you not just subtract 1 and get the quantity you want?
2. The other question is how to get the Num_teeth_per_txdate to be cumulatively summing over the tx_dates.
Cumulative number of teeth, you can add this into your existing SAS data step code:
if first.id then cum_teeth=0;
cum_teeth+num_teeth_per_txdate;
Hi Paige,
Thank you for the reply!
1. I can easily subtract 1 from STOP, but in a follow up DATA STEP as I can isolate the ID order number. In the current DATA STEP, if I subtract or add 1 to either the START or STOP for lines "if tx_visit ne 1..." and "if last.id...", it does this calculation on rows that I DO and do NOT want it to. For example, the below code (bolded segments are the revision) yields the correct START/STOP for the first and last ID rows, but not middle rows for those with 3+ tx_dates (see last ID in below table; I bolded the incorrect row).
| id | tx_date | days_diff | start | stop |
| 120587 | 10/3/2022 | 0 | 1 | 24 |
| 120587 | 10/28/2022 | 25 | 25 | 30 |
| 149268 | 3/25/2021 | 0 | 1 | 20 |
| 149268 | 4/15/2021 | 21 | 21 | 30 |
| 152067 | 1/28/2021 | 0 | 1 | 13 |
| 152067 | 2/11/2021 | 14 | 14 | 30 |
| 154198 | 12/10/2020 | 0 | 1 | 6 |
| 154198 | 12/17/2020 | 7 | 6 | 26 |
| 154198 | 1/6/2021 | 27 | 27 | 30 |
2. I have added your suggested code in several few different places, but it is not getting the cumulative sum, and I am not sure why. Below is the first 9 observations where the cum_teeth variable is showing what it SHOULD be, but I can't get that. Any suggestions? Below is the last version of the code I tried.
| id | tx_date | start | stop | cum_teeth |
| 120587 | 10/28/2022 | 1 | 24 | 2 |
| 120587 | 10/28/2022 | 25 | 30 | 3 |
| 149268 | 4/15/2021 | 1 | 20 | 5 |
| 149268 | 4/15/2021 | 21 | 30 | 8 |
| 152067 | 2/11/2021 | 1 | 13 | 3 |
| 152067 | 2/11/2021 | 14 | 30 | 4 |
| 154198 | 12/17/2020 | 1 | 6 | 5 |
| 154198 | 1/6/2021 | 6 | 26 | 7 |
| 154198 | 1/6/2021 | 27 | 30 | 10 |
Regarding #2, the summing to get cum_teeth should not be inside an IF statement. That line must execute on every record.
Regarding #1, I am going to request that you provide code inside a code box (click on the little running man icon and paste your code in there) with one command per line, and with proper indentation.
And in fact, it is also requested (by me, if you want me to run code on your data) to provide the data as working SAS data step code (examples and instructions).
data v2a; set v2;
by id;
retain start stop cum_teeth;
if first.id then do; start=1; stop=1; cum_teeth=0; end;
if tx_visit ne 1 and days_diff < postop_days then do; start=stop; stop=days_diff-1; event=0; output; end;
if last.id then do; start=stop+1; stop=postop_days; event=(postop_yn=1); output; end;
cum_teeth+num_teeth_per_txdate;
run;Above is the code that is still not working to get the cumulative teeth.
Below is the working SAS data step code. I am only taking the first ~50 observations.
data 50.V2;
infile datalines dsd truncover;
input id:BEST. tx_date:MMDDYY10. tx_visit:32. days_diff:32. Num_teeth_per_txdate:32. postop_yn:32. postop_days:32.;
format id BEST. tx_date MMDDYY10.;
label id="ID" tx_date="Tx_date";
datalines4;
120587,10/03/2022,1,0,2,0,30
120587,10/28/2022,2,25,1,0,30
149268,03/25/2021,1,0,5,0,30
149268,04/15/2021,2,21,3,0,30
152067,01/28/2021,1,0,3,0,30
152067,02/11/2021,2,14,1,0,30
154198,12/10/2020,1,0,5,0,30
154198,12/17/2020,2,7,2,0,30
154198,01/06/2021,3,27,3,0,30
170135,09/26/2022,1,0,1,0,30
170135,10/20/2022,2,24,1,0,30
181736,03/12/2024,1,0,1,0,30
181736,03/19/2024,2,7,1,0,30
309868,06/16/2023,1,0,3,0,30
309868,07/14/2023,2,28,2,0,30
318151,04/04/2023,1,0,6,1,30
318151,04/18/2023,2,14,9,1,30
323719,11/09/2022,1,0,1,0,30
323719,11/15/2022,2,6,8,0,30
327936,11/23/2020,1,0,1,0,30
327936,12/22/2020,2,29,1,0,30
329053,12/23/2021,1,0,3,0,30
329053,01/11/2022,2,19,1,0,30
336334,03/05/2024,1,0,1,0,30
336334,03/27/2024,2,22,3,0,30
337798,05/05/2021,1,0,6,0,30
337798,05/14/2021,2,9,4,0,30
338227,11/29/2023,1,0,3,0,30
338227,12/19/2023,2,20,4,0,30
340983,01/20/2021,1,0,4,0,30
340983,01/28/2021,2,8,4,0,30
340983,02/03/2021,3,14,2,0,30
345539,10/03/2025,1,0,3,0,30
345539,10/17/2025,2,14,3,0,30
345539,10/31/2025,3,28,3,0,30
348457,07/06/2023,1,0,1,1,21
348457,07/17/2023,2,11,3,1,21
349118,07/21/2020,1,0,1,0,30
349118,07/30/2020,2,9,4,0,30
349118,08/13/2020,3,23,2,0,30
349948,10/12/2022,1,0,1,1,16
349948,10/19/2022,2,7,4,1,16
350382,01/24/2024,1,0,1,0,30
350382,02/07/2024,2,14,4,0,30
354806,02/23/2021,1,0,2,0,30
354806,03/02/2021,2,7,2,0,30
363310,11/12/2020,1,0,4,0,30
363310,11/17/2020,2,5,6,0,30
367820,05/20/2022,1,0,5,0,30
367820,06/14/2022,2,25,2,0,30
369007,12/01/2025,1,0,5,0,30
369007,12/18/2025,2,17,5,0,30
;;;;
Please comply with my request, with emphasis in red.
Regarding #1, I am going to request that you provide code inside a code box (click on the little running man icon and paste your code in there) with one command per line, and with proper indentation.
Apologies, I must not know what proper indentation is. I googled it and tried the below.
data v2a;
set v2;
by id;
retain start stop cum_teeth;
if first.id then do;
start=1; stop=1; cum_teeth=0;
end;
if tx_visit ne 1 and days_diff < postop_days then do;
start=stop; stop=days_diff-1; event=0;
output;
end;
if last.id then do;
start=stop+1; stop=postop_days; event=(postop_yn=1);
output;
end;
cum_teeth+num_teeth_per_txdate;
run;
You still have three commands on a single line.
Nevertheless, this is much more workable, so thank you. Here is my solution.
data v2a;
set v2;
by id;
prev_tx_date=lag(tx_date);
if first.id then do;
days_diff=0;
cum_teeth=0;
end;
else do;
days_diff=tx_date-prev_tx_date-1;
end;
cum_teeth+num_teeth_per_txdate;
drop prev_tx_date;
run;
Ah, I see now what is meant by a command! I will do that in the future. Thank you for working me on this.
Your code works well to get the cum_teeth, but not the START and STOP (in days), which I need for the time-varying proc phreg model. I tried using your code and adding START/STOP variables, but to no luck.
If first.ID, then START should equal 1 and the STOP should be 1 day minus the next tx_date per ID. Then for subsequent tx_dates per ID, the START should equal the diff_days (tx_date of that row - tx_date from the row before) and the STOP should equal either: (1) 1 day minus the next tx_date per ID if not the last.ID or (2) the postop_days if the last.ID. Is there a way to build that into your code?
Not sure I follow your whole thread but is seems you might need to know the NEXT value of START to do what you want. That is harder than finding the PREVIOUS value since it is much harder to predict the future than to remember the past.
SAS makes it easy to find the PREVIOUS value by using the LAG() function to remember the past value.
Here is simple trick to predict the future. Use an additional SET statement to read in the same dataset only using FIRSTOBS=2 to skip the first observation. I like to add a third read of just the first observation to make the two SET statements have the same number of observations.
Remember to take care not to use the NEXT value when at the end of a BY group since it comes from the following BY group. Just like you took care not to use the PREVIOUS value at the start of the BY group since it came from an earlier BY group.
data v2a;
set v2;
by id tx_date ;
prev_tx_date=lag(tx_date);
set v2(keep=tx_date rename=(tx_date=next_tx_date) firstobs=2)
v2(obs=1 drop=_all_)
;
if first.id then do;
days_diff=0;
cum_teeth=0;
end;
else do;
days_diff=tx_date-prev_tx_date-1;
end;
if last.id then call missing(next_tx_date);
else stop_date = next_tx_date -1 ;
format stop_date date9.;
cum_teeth+num_teeth_per_txdate;
drop prev_tx_date next_tx_date ;
run;
Maybe someone else can figure out how to do this in one step, but here is my solution which takes several steps
data v2a;
set v2;
by id;
prev_tx_date=lag(tx_date);
if first.id then do;
days_diff=0;
cum_teeth=0;
start=1;
end;
else do;
days_diff=tx_date-prev_tx_date-1;
end;
cum_teeth+num_teeth_per_txdate;
drop prev_tx_date;
run;
/* Find max tx_date and min tx_date per id */
proc summary nway data=v2a;
class id;
var tx_date;
output out=max_tx_date min=min_tx_date max=max_tx_date;
run;
data v2b;
merge v2a max_tx_date(keep=max_tx_date min_tx_date id);
by id;
run;
data v2c;
merge v2b v2b(firstobs=2 keep=tx_date rename=(tx_date=next_tx_date));
if days_diff=0 then do;
start=1;
stop=next_tx_date-tx_date-1;
end;
else do;
if tx_date^=max_tx_date then do;
start=tx_date-min_tx_date;
stop=next_tx_date-min_tx_date-1;
end;
else do;
start=tx_date-min_tx_date;
stop=postop_days;
end;
end;
drop min_tx_date max_tx_date next_tx_date;
run;
You might try something like this. That said, I don't think I really follow the purpose of the 'post_op' variable here.
proc sql noprint;
select max(nrecs) into :maxrecs trimmed
from
(select id, count(*) as nrecs from tx group by id);
quit;
%let maxrecs=%eval(&maxrecs+1);
proc sort data=tx; by id tx_date; run;
data tx2;
set tx;
by id;
array dts {&maxrecs} _temporary_;
array tth {&maxrecs} _temporary_;
length n startdt enddt 4 nteeth 3;
format startdt enddt date9.;
if first.id then call missing(of dts[*], of tth[*], n, nteeth);
n+1;
dts[n]=tx_date;
tth[n]=num_teeth_per_txdate;
if last.id then do;
do i=1 to n;
startdt=dts[i];
enddt=dts[i+1]-1;
nteeth+tth[i];
output;
end;
end;
keep id startdt enddt nteeth;
run;
Thank you, kindly, to everyone! Using a bit of everyone's approach, I finally got the dataset that I needed. This has been an incredibly helpful experience. Thank you, everyone!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.