Peter_Y

Hello:

I am trying to fit Cox PH models with a time-dependent covariate using both the counting-process syntax and programming statements. The results are always slightly different, and I am not sure why.

 

I am using an example from the SAS website:

https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.3/statug/statug_code_phrex6.htm

This example fits a Cox model using programming statements:

proc phreg data= Heart;
   model Time*Status(0)= XStatus Acc_Age;
   if (WaitTime = . or Time < WaitTime) then XStatus = 0;
   else  XStatus = 1;
run;

(Screenshot: PROC PHREG output for the model with programming statements)

 

I converted the dataset Heart to long format:

data heart_long;
  set heart;

  if WaitTime = . or Time < WaitTime then do;
     Xstatus = 0;
     start = 0; 
     end=Time;
     status2 = status;
     output;
  end;
  else do;
     Xstatus = 0;
     start = 0; 
     end=WaitTime-1;
     status2 = 0;
     output;

     Xstatus = 1;
     start = WaitTime-1; 
     end=Time;
     status2 = status;
     output;
  end;
run;

I used WaitTime-1 in the calculation of the start/end times because, according to the SAS documentation, the interval is right-closed, i.e., (start, end].

I then fitted the same model:

proc phreg data= Heart_long;
   model (start,end)*Status2(0)= XStatus Acc_Age;
run;

(Screenshot: PROC PHREG output for the counting-process model)

The results are slightly different. Is this discrepancy due to a mistake in my code, or is it related to how SAS fits the model when using different syntax?

 

Thanks,

Peter

 

 

FreelanceReinh

Hello @Peter_Y,

 

Thanks for the interesting question and sorry for my late reply.

 

In this particular example the patient with ID=15 makes the difference: After excluding this ID (e.g., with a WHERE statement) from the first PROC PHREG step the results in terms of fit statistics, global tests and parameter estimates are the same as from your Heart_long dataset (up to trivial differences like <1E-14). This patient no. 15 died on the date of acceptance and hence has TIME=0 in the original Heart dataset. This translates to START=END=0 in your Heart_long dataset. However, PROC PHREG excludes all observations with START>=END (because of the semiclosed (START, END] time intervals) or END<0. Thus you are losing this patient's event (of death) and include only 74 events rather than 75, which affects the model statistics.
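The exclusion described above can be reproduced with a step like the following. It is only a sketch: it assumes that the patient identifier in Heart is the variable ID, as in the documentation example.

```sas
/* Original programming-statements model, refitted without patient 15 (TIME=0) */
proc phreg data=Heart;
   where ID ne 15;
   model Time*Status(0)= XStatus Acc_Age;
   if (WaitTime = . or Time < WaitTime) then XStatus = 0;
   else XStatus = 1;
run;
```

The fit statistics, global tests and parameter estimates from this step should then agree with those from the counting-process model fitted to the Heart_long dataset from the question.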

 

The "Number of Observations Used" (compared to the "Number of Observations Read") in the output reveals that five additional observations from Heart_long have been excluded -- without further affecting the model statistics: They are from IDs 3, 39, 45, 46 and 95 (all with START>=END, two also have END<0). In some of these cases it's plausible that omitting the observations doesn't change the results. For example, patient no. 3 received their transplant on the date of acceptance. Hence there's no need for an observation representing the zero days before the transplant (START=0, END=-1 in your Heart_long dataset).
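To list the observations that PROC PHREG drops, a simple check along these lines may help. It is meant to be run against Heart_long as created in the question, and it assumes that the ID variable was carried over from Heart by the SET statement.

```sas
/* Observations that PROC PHREG excludes: START>=END or END<0 */
proc print data=heart_long;
   where start >= end or end < 0;
   var ID start end status2 Xstatus;
run;
```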

 

In other cases it's not obvious (to me at least) that dropping the observation would leave the results unchanged. For example, patient no. 39 has one day between Acc_Date and Xpl_Date (leading to START=END=0 in the first of their two observations in Heart_long). It appears that due to the discreteness of the data the model statistics stay the same when certain small changes are applied to the data (e.g., when the value of TIME for ID 39 is increased from 52 to 53, 54 or 55, everything else being the same).

 

I suggest this modified version of your DATA step:

data heart_long;
  set heart;

  if WaitTime = . then do;
     Xstatus = 0;
     start = 0; 
     end = Time+1;
     status2 = status;
     output;
  end;
  else do;
     Xstatus = 0;
     start = 0; 
     end = WaitTime;
     status2 = 0;
     if end>0 then output; /* PROC PHREG would exclude obs. with start>=end anyway. */

     Xstatus = 1;
     start = WaitTime;
     end = Time+1;
     status2 = status;
     output;
  end;
run;

 

The changes to the definitions of START and END (END = Time+1 instead of Time, and WaitTime instead of WaitTime-1 as the split point) ensure that single days are counted (the interval (t, t] would be empty), that no negative END values occur and that no events are disregarded.

 

The changes to the IF conditions (dropping "or Time < WaitTime" from the first condition and adding "if end>0" before the first OUTPUT in the ELSE branch) are optional: The case Time < WaitTime is relevant for the programming statements in the PROC PHREG step from the documentation (because the Time values used there are not limited to those in the input dataset), but it's impossible (and does not occur) in dataset Heart. Observations with START>=END (in particular: START=END=0) would be redundant and not be used by PROC PHREG anyway.

 

Using the above modification, PROC PHREG produced the same statistics (up to trivial differences) for the "long" dataset as for the original dataset -- not only for Heart vs. Heart_long, but also for simulated datasets with thousands of observations that I created for testing purposes.
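One possible way to make such a comparison without reading the printed output is to capture the estimates with ODS OUTPUT and compare them with PROC COMPARE. This is a sketch, not the exact check I ran: the dataset names pe_orig and pe_long and the tolerance 1e-10 are arbitrary choices, and heart_long is assumed to come from the modified DATA step above.

```sas
/* Parameter estimates from the programming-statements model */
ods output ParameterEstimates=pe_orig;
proc phreg data=Heart;
   model Time*Status(0)= XStatus Acc_Age;
   if (WaitTime = . or Time < WaitTime) then XStatus = 0;
   else XStatus = 1;
run;

/* Parameter estimates from the counting-process model */
ods output ParameterEstimates=pe_long;
proc phreg data=heart_long;
   model (start,end)*status2(0)= XStatus Acc_Age;
run;

/* Treat absolute differences up to 1e-10 as equal */
proc compare base=pe_orig compare=pe_long criterion=1e-10 method=absolute;
run;
```

The same pattern should work for the FitStatistics and GlobalTests ODS tables.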
