counting vs programming statement in Cox model

Peter_Y · Posted 11-25-2021 09:47 PM

Hello:

I am trying to fit Cox PH models with a time-dependent covariate using both the counting process syntax and programming statements. The results are always slightly different and I am not sure why.

I am using an example from the SAS website:

https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.3/statug/statug_code_phrex6.htm

This example fits a Cox model using programming statements:

proc phreg data= Heart;
   model Time*Status(0)= XStatus Acc_Age;
   if (WaitTime = . or Time < WaitTime) then XStatus = 0;
   else  XStatus = 1;
run;

I converted the dataset Heart to long format:

data heart_long;
  set heart;

  if WaitTime = . or Time < WaitTime then do;
     Xstatus = 0;
     start = 0; 
     end=Time;
     status2 = status;
     output;
  end;
  else do;
     Xstatus = 0;
     start = 0; 
     end=WaitTime-1;
     status2 = 0;
     output;

     Xstatus = 1;
     start = WaitTime-1; 
     end=Time;
     status2 = status;
     output;
  end;
run;

I used WaitTime-1 in the calculation of the start/end times because, according to the SAS documentation, the interval is right-closed, i.e. (start, end].

I then fitted the same model:

proc phreg data= Heart_long;
   model (start,end)*Status2(0)= XStatus Acc_Age;
run;

The results are slightly different. Is this discrepancy due to a mistake in my code, or is it related to how SAS fits the model when using different syntax?

Thanks,

Peter

FreelanceReinh · Posted 11-30-2021 11:46 AM

Hello @Peter_Y,

Thanks for the interesting question and sorry for my late reply.

In this particular example the patient with ID=15 makes the difference: After excluding this ID (e.g., with a WHERE statement) from the first PROC PHREG step the results in terms of fit statistics, global tests and parameter estimates are the same as from your Heart_long dataset (up to trivial differences like <1E-14). This patient no. 15 died on the date of acceptance and hence has TIME=0 in the original Heart dataset. This translates to START=END=0 in your Heart_long dataset. However, PROC PHREG excludes all observations with START>=END (because of the semiclosed (START, END] time intervals) or END<0. As a result, you lose this patient's event (death) and include only 74 events rather than 75, which affects the model statistics.
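To see which rows are affected, a DATA step along these lines could list the observations of the original Heart_long dataset that fall under the exclusion rule described above. This is only a sketch: the dataset name dropped_rows is made up for illustration, and it assumes the ID variable from Heart is still present in Heart_long.

```sas
/* Sketch: collect the rows of Heart_long that do not form a valid (start, end] interval. */
data dropped_rows;               /* hypothetical dataset name */
  set heart_long;
  if start >= end or end < 0;    /* subsetting IF: keep only rows PROC PHREG would exclude */
run;

proc print data=dropped_rows;
  var ID start end Xstatus status2;
run;
```

A row with status2=1 in this listing is an event that never reaches the model, which is the case for ID 15 here.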

The "Number of Observations Used" (compared to the "Number of Observations Read") in the output reveals that five additional observations from Heart_long have been excluded -- without further affecting the model statistics: They are from IDs 3, 39, 45, 46 and 95 (all with START>=END, two also have END<0). In some of these cases it's plausible that omitting the observations doesn't change the results. For example, patient no. 3 received their transplant on the date of acceptance. Hence there's no need for an observation representing the zero days before the transplant (START=0, END=-1 in your Heart_long dataset).
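If only these counts are of interest, the output can be restricted to the relevant tables. The sketch below assumes that the ODS table names NObs and CensoredSummary are the ones PROC PHREG uses for the observation counts and the event/censoring summary; please verify them against the ODS table list in the PROC PHREG documentation for your release.

```sas
/* Sketch: show only the observation counts and the event/censoring summary. */
/* NObs and CensoredSummary are assumed ODS table names (check with ODS TRACE ON). */
ods select NObs CensoredSummary;
proc phreg data=heart_long;
   model (start,end)*status2(0)= XStatus Acc_Age;
run;
```

Running the same ODS SELECT statement before the PROC PHREG step on the original Heart dataset gives the event count to compare against.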

In other cases it's not obvious (to me at least) that dropping the observation would leave the results unchanged. For example, patient no. 39 has one day between Acc_Date and Xpl_Date (leading to START=END=0 in the first of their two observations in Heart_long). It appears that due to the discreteness of the data the model statistics stay the same when certain small changes are applied to the data (e.g., when the value of TIME for ID 39 is increased from 52 to 53, 54 or 55, everything else being the same).

I suggest this modified version of your DATA step:

data heart_long;
  set heart;

  if WaitTime = . then do;
     Xstatus = 0;
     start = 0; 
     end = Time+1;
     status2 = status;
     output;
  end;
  else do;
     Xstatus = 0;
     start = 0; 
     end = WaitTime;
     status2 = 0;
     if end>0 then output; /* PROC PHREG would exclude obs. with start>=end anyway. */

     Xstatus = 1;
     start = WaitTime;
     end = Time+1;
     status2 = status;
     output;
  end;
run;

The changes to the definitions of START and END (WaitTime instead of WaitTime-1, and Time+1 instead of Time) ensure that single days are counted (the interval (t, t] would be empty), that no negative END values occur and that no events are disregarded.

The changes to the IF conditions (dropping "Time < WaitTime" from the first condition and adding "if end>0" before the first OUTPUT in the ELSE branch) are optional: The case Time < WaitTime is relevant for the programming statements in the PROC PHREG step from the documentation (because the Time values used there are not limited to those in the input dataset), but it's impossible (and does not occur) in dataset Heart. Observations with START>=END (in particular: START=END=0) would be redundant and not be used by PROC PHREG anyway.

Using the above modification, PROC PHREG produced the same statistics (up to trivial differences) for the "long" dataset as for the original dataset -- not only for Heart vs. Heart_long, but also for simulated datasets with thousands of observations that I created for testing purposes.
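One way to make such a comparison less manual is to capture the parameter estimates from both fits and compare them with PROC COMPARE. The following is a sketch, not the exact code used for the tests mentioned above: the dataset names pe_wide and pe_long and the tolerance 1E-8 are arbitrary choices for illustration, and it assumes the modified heart_long dataset from the DATA step above.

```sas
/* Fit 1: original data, time-dependent covariate via programming statements. */
ods output ParameterEstimates=pe_wide;   /* pe_wide: hypothetical name */
proc phreg data=Heart;
   model Time*Status(0)= XStatus Acc_Age;
   if (WaitTime = . or Time < WaitTime) then XStatus = 0;
   else XStatus = 1;
run;

/* Fit 2: long data, counting process syntax. */
ods output ParameterEstimates=pe_long;   /* pe_long: hypothetical name */
proc phreg data=heart_long;
   model (start,end)*status2(0)= XStatus Acc_Age;
run;

/* Compare estimates and standard errors row by row (same parameter order in both fits). */
/* The absolute tolerance 1E-8 is an arbitrary illustrative choice. */
proc compare base=pe_wide compare=pe_long criterion=1e-8 method=absolute;
   var Estimate StdErr;
run;
```

With this setup, PROC COMPARE should flag only differences larger than the chosen tolerance, so rounding-level discrepancies are ignored.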
