I recently use different time-based to model cox proportional model.
I have read some papers that the counting process should be the same results as it does in only one survival time variable.
However, I found something wrong here.
In my following example, if i use the counting process, the group 1 and group 2 are actually the same data, but with different time scale. The t1 and t 2 variables in the group 2 have a different time-based, so the t1 and t2 plus the time_shift is the t1 and t2 in the group 2 with the same id.
So, the two groups have the same period of t1 to t2, and the same id, drug use, and outcome (i.e., hypertension).
However, if i use the counting process, i will get a different HR (for group 1 and group, i.e., 2, 1.07 and 2.26, respectively).
When i use hte time-on-study, i will get another different estimate, 1.19, but they are the same in the two groups.
I tried to replicate in the R, the results are the estiamtes are the same in counting process or time-on-study, they are 2.26 and 1.19, although they are still different in counting process and time-on-study.
Can anyone explain why they are different?
data t;
INPUT g id hypertension t1 t2 drug time time_shift;
cards;
1 1 0 21916 24940 1 3024 16699
1 2 1 21916 21976 0 60 14510
1 3 0 21916 27089 0 5173 14550
1 4 1 21916 23765 0 1849 17177
1 5 1 21916 23067 0 1151 14896
1 6 0 21916 21949 0 33 17837
1 6 0 21949 23802 1 1853 17837
1 7 1 21916 22999 1 1083 18612
1 8 1 21916 24659 0 2743 16401
1 9 0 21916 22006 0 90 19633
1 10 0 21916 23691 0 1775 17948
1 11 1 21916 22470 0 554 14488
1 12 0 21916 24201 0 2285 16684
1 12 0 24201 24955 1 754 16684
1 13 0 21916 21951 0 35 17085
1 13 0 21951 24554 1 2603 17085
1 14 0 21916 21946 0 30 16824
1 14 1 21946 24625 1 2679 16824
1 15 0 21916 22422 0 506 17888
1 15 0 22422 23751 1 1329 17888
1 16 0 21916 23628 0 1712 18011
1 17 0 21916 23896 0 1980 17743
1 18 1 21916 23275 1 1359 16971
1 19 0 21916 21961 0 45 16993
1 19 1 21961 23665 1 1704 16993
1 20 1 21916 22364 1 448 17821
2 1 0 38615 41639 1 3024 0
2 2 1 36426 36486 0 60 0
2 3 0 36466 41639 0 5173 0
2 4 1 39093 40942 0 1849 0
2 5 1 36812 37963 0 1151 0
2 6 0 39753 39786 0 33 0
2 6 0 39786 41639 1 1853 0
2 7 1 40528 41611 1 1083 0
2 8 1 38317 41060 0 2743 0
2 9 0 41549 41639 0 90 0
2 10 0 39864 41639 0 1775 0
2 11 1 36404 36958 0 554 0
2 12 0 38600 40885 0 2285 0
2 12 0 40885 41639 1 754 0
2 13 0 39001 39036 0 35 0
2 13 0 39036 41639 1 2603 0
2 14 0 38740 38770 0 30 0
2 14 1 38770 41449 1 2679 0
2 15 0 39804 40310 0 506 0
2 15 0 40310 41639 1 1329 0
2 16 0 39927 41639 0 1712 0
2 17 0 39659 41639 0 1980 0
2 18 1 38887 40246 1 1359 0
2 19 0 38909 38954 0 45 0
2 19 1 38954 40658 1 1704 0
2 20 1 39737 40185 1 448 0
;
proc phreg data=t COVS (AGGREGATE);
model (t1, t2) * hypertension (0) = drug/risklimits;
id id;
by g;
run;
proc phreg data=t COVS (AGGREGATE);
model time * hypertension (0) = drug/risklimits;
id id;
by g;
run;
Hello @ffgsdf,
Sorry to see that nobody has replied yet. I have investigated this question a bit, using a simplified input dataset -- basically a subset of only five patients from your sample data, with shifted and rounded times, but I think maintaining the relevant characteristics of the original data:
data have;
input g id hypertension t1 t2 drug time time_shift;
cards;
1 4 1 100 2000 0 1900 17000
1 8 1 100 3000 0 2900 16400
1 14 0 100 130 0 30 16800
1 14 1 130 2800 1 2670 16800
1 15 0 100 600 0 500 17900
1 15 0 600 1900 1 1300 17900
1 20 1 100 500 1 400 17800
2 4 1 17100 19000 0 1900 0
2 8 1 16500 19400 0 2900 0
2 14 0 16900 16930 0 30 0
2 14 1 16930 19600 1 2670 0
2 15 0 18000 18500 0 500 0
2 15 0 18500 19800 1 1300 0
2 20 1 17900 18300 1 400 0
;
With this HAVE dataset it is feasible to reproduce at least the parameter estimates and hazard ratios from the two PROC PHREG steps by manual calculation:
Luckily, there are no ties in the event times, so the formula for the partial likelihood function (to be maximized) simplifies from the expressions (e.g., Breslow's) given in the documentation: Partial Likelihood Function for the Cox Model. The relevant formula can also be found in Klein and Moeschberger (2003): formula (9.2.1) on page 297.
There is only one dichotomous covariate, variable drug, which simplifies the formula further: Only one parameter (b) is to be estimated and the partial likelihood L(b) is a product of a few fractions -- one for each event time. The numerators of these fractions are either exp(b) or exp(0)=1, depending on whether the relevant value of drug equals 1 or 0. The denominators are sums of a few of such terms (i.e., exp(b) or 1), the number of which depends on the risk set at the relevant point in time.
Let's start with group 1 (g=1) using the counting process data (t1, t2). We have four distinct event times: t1=500 (from id 20; not to be confused with variable t1), t2=2000 (from id 4), t3=2800 (id 14) and t4=3000 (id 8). Note that id 15 (hypertension=0) contributes only censored observations, no event times. The first factor in the product L(b), i.e., the factor for event time t1=500, has the numerator exp(b) because the patient with id=20 was on drug (drug=1) when the event (hypertension=1) occurred. The denominator is 1 + 1 + exp(b) + 1 + exp(b) = 3 + 2exp(b) -- the sum of the corresponding terms for the patients at risk at time t1=500: id=4 with drug=0 (hence the first 1), id=8 with drug=0 (hence the second 1), id=14 with drug=1 (hence exp(b)), and id=15 with drug=0 (at that time, hence the third 1), and id=20 itself with drug=1 (hence the second exp(b), as in the numerator).
The second factor in the product L(b), i.e., the factor for event time t2=2000, has the numerator 1 because the corresponding patient (id=4) was not on drug (drug=0) when the event (hypertension=1) occurred. The denominator is 1 + 1 + exp(b) -- the sum of the corresponding terms for the patients at risk at time t2=2000: id=4 itself with drug=0 (hence the first 1), id=8 with drug=0 (hence the second 1) and id=14 with drug=1 (hence exp(b)). The other two patients are no longer at risk at time t2=2000 (id 15 was censored at time 1900, id 20 has had their event already at time 500).
Similarly, we obtain exp(b)/(1+exp(b)) for the third factor in the product L(b) and simply 1/1 = 1 for the fourth factor, so L(b) = exp(2b)/((3+2exp(b))*(2+exp(b))*(1+exp(b))). This function (see a plot) exhibits a unique maximum at b≈1.07307 -- matching the parameter estimate from PROC PHREG exactly. The hazard ratio of "drug=1 vs. drug=0" is therefore exp(b)=exp(1.07307)≈2.924, again matching the PROC PHREG output.
Performing the analogous calculation for the second group (g=2) -- the four event times are now t1=18300 (from id 20), t2=19000 (id 4), t3=19400 (id 8) and t4=19600 (id 14) --, it becomes obvious that a constant time_shift would not change the results, whereas time_shift values varying between patients have a potential impact. Indeed, this is the case for your data (and for our simplified dataset HAVE as well). For example, patients 14 and 15 are now in the risk set at the event time of patient 8 (t3=19400), which was not the case for g=1. We end up with a different partial likelihood function, L(b) = exp(b)/((3+2exp(b))*(2+2exp(b))*(1+2exp(b))*2) (one factor exp(b) cancels out) and, not surprisingly, this function has its maximum at a quite different place: b≈-0.82168 -- matching the parameter estimate from PROC PHREG (-0.82158) almost exactly. Why only "almost"? It turns out that it takes a tighter convergence criterion to get the more accurate parameter estimate -0.82168 from PROC PHREG: adding the option xconv=1e-8, for instance, to the MODEL statement does the trick. The hazard ratio exp(b)≈0.440 this time.
It is a question for the study statistician familiar with the research question (and the subject matter in general), what time shifts (if any) are appropriate to ensure that corresponding times of different patients are "synchronized" so that the risk sets at each event time are built correctly.
Now let's turn to the second PROC PHREG step, using variable time instead of t1 and t2, with event times t1=400 (id 20), t2=1900 (id 4), t3=2670 (id 14) and t4=2900 (id 8). I strongly suspect that this is inappropriate with the dataset in its current form. It appears to me that the two observations for id=14 are treated by PROC PHREG as if they were from different (albeit somewhat related [ID statement]) patients: one with drug=1 and an event, one with drug=0 and censored. Similarly, the two observations with id=15 have different drug values, but both are censored. The different drug values are not treated as they should for a time-dependent covariate. As a matter of fact, we obtain yet another partial likelihood function, L(b) = exp(2b)/((3+3exp(b))*(2+exp(b))*(1+exp(b))), resulting in the estimate b≈0.94061 (again, use xconv=1e-8 to get the last two decimals right) and the hazard ratio exp(b)≈2.561.
To correct this, you can aggregate the data to one observation per patient and then use programming statements in the PROC PHREG step to create a really time-dependent covariate d:
data have_td; /* "td" indicating the time-dependent covariate */
set have;
by g id;
if last.id;
run;
proc phreg data=have_td COVS (AGGREGATE);
model time * hypertension (0) = d/risklimits;
*id id; /* ID statement is now redundant */
where g=1; /* no difference between the groups */
if id=14 then d=(time>30);
else if id=15 then d=(time>500);
else d=drug;
run;
(In the PROC PHREG step I used hardcoded IDs and time thresholds for simplicity; see Example 92.7 Time-Dependent Repeated Measurements of a Covariate for another technique.)
Finally, the results, parameter estimate b≈1.07307 and hazard ratio exp(b)≈2.924, match exactly those from g=1 with the data in counting process style (see above) and are not impacted by the constant time shift from time=0 to t1=100 as the start time.
Hello @ffgsdf,
Sorry to see that nobody has replied yet. I have investigated this question a bit, using a simplified input dataset -- basically a subset of only five patients from your sample data, with shifted and rounded times, but I think maintaining the relevant characteristics of the original data:
data have;
input g id hypertension t1 t2 drug time time_shift;
cards;
1 4 1 100 2000 0 1900 17000
1 8 1 100 3000 0 2900 16400
1 14 0 100 130 0 30 16800
1 14 1 130 2800 1 2670 16800
1 15 0 100 600 0 500 17900
1 15 0 600 1900 1 1300 17900
1 20 1 100 500 1 400 17800
2 4 1 17100 19000 0 1900 0
2 8 1 16500 19400 0 2900 0
2 14 0 16900 16930 0 30 0
2 14 1 16930 19600 1 2670 0
2 15 0 18000 18500 0 500 0
2 15 0 18500 19800 1 1300 0
2 20 1 17900 18300 1 400 0
;
With this HAVE dataset it is feasible to reproduce at least the parameter estimates and hazard ratios from the two PROC PHREG steps by manual calculation:
Luckily, there are no ties in the event times, so the formula for the partial likelihood function (to be maximized) simplifies from the expressions (e.g., Breslow's) given in the documentation: Partial Likelihood Function for the Cox Model. The relevant formula can also be found in Klein and Moeschberger (2003): formula (9.2.1) on page 297.
There is only one dichotomous covariate, variable drug, which simplifies the formula further: Only one parameter (b) is to be estimated and the partial likelihood L(b) is a product of a few fractions -- one for each event time. The numerators of these fractions are either exp(b) or exp(0)=1, depending on whether the relevant value of drug equals 1 or 0. The denominators are sums of a few of such terms (i.e., exp(b) or 1), the number of which depends on the risk set at the relevant point in time.
Let's start with group 1 (g=1) using the counting process data (t1, t2). We have four distinct event times: t1=500 (from id 20; not to be confused with variable t1), t2=2000 (from id 4), t3=2800 (id 14) and t4=3000 (id 8). Note that id 15 (hypertension=0) contributes only censored observations, no event times. The first factor in the product L(b), i.e., the factor for event time t1=500, has the numerator exp(b) because the patient with id=20 was on drug (drug=1) when the event (hypertension=1) occurred. The denominator is 1 + 1 + exp(b) + 1 + exp(b) = 3 + 2exp(b) -- the sum of the corresponding terms for the patients at risk at time t1=500: id=4 with drug=0 (hence the first 1), id=8 with drug=0 (hence the second 1), id=14 with drug=1 (hence exp(b)), and id=15 with drug=0 (at that time, hence the third 1), and id=20 itself with drug=1 (hence the second exp(b), as in the numerator).
The second factor in the product L(b), i.e., the factor for event time t2=2000, has the numerator 1 because the corresponding patient (id=4) was not on drug (drug=0) when the event (hypertension=1) occurred. The denominator is 1 + 1 + exp(b) -- the sum of the corresponding terms for the patients at risk at time t2=2000: id=4 itself with drug=0 (hence the first 1), id=8 with drug=0 (hence the second 1) and id=14 with drug=1 (hence exp(b)). The other two patients are no longer at risk at time t2=2000 (id 15 was censored at time 1900, id 20 has had their event already at time 500).
Similarly, we obtain exp(b)/(1+exp(b)) for the third factor in the product L(b) and simply 1/1 = 1 for the fourth factor, so L(b) = exp(2b)/((3+2exp(b))*(2+exp(b))*(1+exp(b))). This function (see a plot) exhibits a unique maximum at b≈1.07307 -- matching the parameter estimate from PROC PHREG exactly. The hazard ratio of "drug=1 vs. drug=0" is therefore exp(b)=exp(1.07307)≈2.924, again matching the PROC PHREG output.
Performing the analogous calculation for the second group (g=2) -- the four event times are now t1=18300 (from id 20), t2=19000 (id 4), t3=19400 (id 8) and t4=19600 (id 14) --, it becomes obvious that a constant time_shift would not change the results, whereas time_shift values varying between patients have a potential impact. Indeed, this is the case for your data (and for our simplified dataset HAVE as well). For example, patients 14 and 15 are now in the risk set at the event time of patient 8 (t3=19400), which was not the case for g=1. We end up with a different partial likelihood function, L(b) = exp(b)/((3+2exp(b))*(2+2exp(b))*(1+2exp(b))*2) (one factor exp(b) cancels out) and, not surprisingly, this function has its maximum at a quite different place: b≈-0.82168 -- matching the parameter estimate from PROC PHREG (-0.82158) almost exactly. Why only "almost"? It turns out that it takes a tighter convergence criterion to get the more accurate parameter estimate -0.82168 from PROC PHREG: adding the option xconv=1e-8, for instance, to the MODEL statement does the trick. The hazard ratio exp(b)≈0.440 this time.
It is a question for the study statistician familiar with the research question (and the subject matter in general), what time shifts (if any) are appropriate to ensure that corresponding times of different patients are "synchronized" so that the risk sets at each event time are built correctly.
Now let's turn to the second PROC PHREG step, using variable time instead of t1 and t2, with event times t1=400 (id 20), t2=1900 (id 4), t3=2670 (id 14) and t4=2900 (id 8). I strongly suspect that this is inappropriate with the dataset in its current form. It appears to me that the two observations for id=14 are treated by PROC PHREG as if they were from different (albeit somewhat related [ID statement]) patients: one with drug=1 and an event, one with drug=0 and censored. Similarly, the two observations with id=15 have different drug values, but both are censored. The different drug values are not treated as they should for a time-dependent covariate. As a matter of fact, we obtain yet another partial likelihood function, L(b) = exp(2b)/((3+3exp(b))*(2+exp(b))*(1+exp(b))), resulting in the estimate b≈0.94061 (again, use xconv=1e-8 to get the last two decimals right) and the hazard ratio exp(b)≈2.561.
To correct this, you can aggregate the data to one observation per patient and then use programming statements in the PROC PHREG step to create a really time-dependent covariate d:
data have_td; /* "td" indicating the time-dependent covariate */
set have;
by g id;
if last.id;
run;
proc phreg data=have_td COVS (AGGREGATE);
model time * hypertension (0) = d/risklimits;
*id id; /* ID statement is now redundant */
where g=1; /* no difference between the groups */
if id=14 then d=(time>30);
else if id=15 then d=(time>500);
else d=drug;
run;
(In the PROC PHREG step I used hardcoded IDs and time thresholds for simplicity; see Example 92.7 Time-Dependent Repeated Measurements of a Covariate for another technique.)
Finally, the results, parameter estimate b≈1.07307 and hazard ratio exp(b)≈2.924, match exactly those from g=1 with the data in counting process style (see above) and are not impacted by the constant time shift from time=0 to t1=100 as the start time.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.
Find more tutorials on the SAS Users YouTube channel.