Hello @ffgsdf,
Sorry to see that nobody has replied yet. I have investigated this question a bit, using a simplified input dataset -- basically a subset of only five patients from your sample data, with shifted and rounded times, but I think maintaining the relevant characteristics of the original data:
data have;
input g id hypertension t1 t2 drug time time_shift;
cards;
1 4 1 100 2000 0 1900 17000
1 8 1 100 3000 0 2900 16400
1 14 0 100 130 0 30 16800
1 14 1 130 2800 1 2670 16800
1 15 0 100 600 0 500 17900
1 15 0 600 1900 1 1300 17900
1 20 1 100 500 1 400 17800
2 4 1 17100 19000 0 1900 0
2 8 1 16500 19400 0 2900 0
2 14 0 16900 16930 0 30 0
2 14 1 16930 19600 1 2670 0
2 15 0 18000 18500 0 500 0
2 15 0 18500 19800 1 1300 0
2 20 1 17900 18300 1 400 0
;
With this HAVE dataset it is feasible to reproduce at least the parameter estimates and hazard ratios from the two PROC PHREG steps by manual calculation:
Luckily, there are no ties in the event times, so the formula for the partial likelihood function (to be maximized) simplifies from the expressions (e.g., Breslow's) given in the documentation: Partial Likelihood Function for the Cox Model. The relevant formula can also be found in Klein and Moeschberger (2003): formula (9.2.1) on page 297.
There is only one dichotomous covariate, variable drug, which simplifies the formula further: Only one parameter (b) is to be estimated and the partial likelihood L(b) is a product of a few fractions -- one for each event time. The numerators of these fractions are either exp(b) or exp(0)=1, depending on whether the relevant value of drug equals 1 or 0. The denominators are sums of a few of such terms (i.e., exp(b) or 1), the number of which depends on the risk set at the relevant point in time.
Let's start with group 1 (g=1) using the counting process data (t1, t2). We have four distinct event times: t 1 =500 (from id 20; not to be confused with variable t1), t 2 =2000 (from id 4), t 3 =2800 (id 14) and t 4 =3000 (id 8). Note that id 15 (hypertension=0) contributes only censored observations, no event times. The first factor in the product L(b), i.e., the factor for event time t 1 =500, has the numerator exp(b) because the patient with id=20 was on drug (drug=1) when the event (hypertension=1) occurred. The denominator is 1 + 1 + exp(b) + 1 + exp(b) = 3 + 2exp(b) -- the sum of the corresponding terms for the patients at risk at time t 1 =500: id=4 with drug=0 (hence the first 1), id=8 with drug=0 (hence the second 1), id=14 with drug=1 (hence exp(b)), and id=15 with drug=0 (at that time, hence the third 1), and id=20 itself with drug=1 (hence the second exp(b), as in the numerator).
The second factor in the product L(b), i.e., the factor for event time t 2 =2000, has the numerator 1 because the corresponding patient (id=4) was not on drug (drug=0) when the event (hypertension=1) occurred. The denominator is 1 + 1 + exp(b) -- the sum of the corresponding terms for the patients at risk at time t 2 =2000: id=4 itself with drug=0 (hence the first 1), id=8 with drug=0 (hence the second 1) and id=14 with drug=1 (hence exp(b)). The other two patients are no longer at risk at time t 2 =2000 (id 15 was censored at time 1900, id 20 has had their event already at time 500).
Similarly, we obtain exp(b)/(1+exp(b)) for the third factor in the product L(b) and simply 1/1 = 1 for the fourth factor, so L(b) = exp(2b)/((3+2exp(b))*(2+exp(b))*(1+exp(b))). This function (see a plot) exhibits a unique maximum at b≈1.07307 -- matching the parameter estimate from PROC PHREG exactly. The hazard ratio of "drug=1 vs. drug=0" is therefore exp(b)=exp(1.07307)≈2.924, again matching the PROC PHREG output.
Performing the analogous calculation for the second group (g=2) -- the four event times are now t 1 =18300 (from id 20), t 2 =19000 (id 4), t 3 =19400 (id 8) and t 4 =19600 (id 14) --, it becomes obvious that a constant time_shift would not change the results, whereas time_shift values varying between patients have a potential impact. Indeed, this is the case for your data (and for our simplified dataset HAVE as well). For example, patients 14 and 15 are now in the risk set at the event time of patient 8 (t 3 =19400), which was not the case for g=1. We end up with a different partial likelihood function, L(b) = exp(b)/((3+2exp(b))*(2+2exp(b))*(1+2exp(b))*2) (one factor exp(b) cancels out) and, not surprisingly, this function has its maximum at a quite different place: b≈-0.82168 -- matching the parameter estimate from PROC PHREG (-0.82158) almost exactly. Why only "almost"? It turns out that it takes a tighter convergence criterion to get the more accurate parameter estimate -0.82168 from PROC PHREG: adding the option xconv=1e-8, for instance, to the MODEL statement does the trick. The hazard ratio exp(b)≈0.440 this time.
It is a question for the study statistician familiar with the research question (and the subject matter in general), what time shifts (if any) are appropriate to ensure that corresponding times of different patients are "synchronized" so that the risk sets at each event time are built correctly.
Now let's turn to the second PROC PHREG step, using variable time instead of t1 and t2, with event times t 1 =400 (id 20), t 2 =1900 (id 4), t 3 =2670 (id 14) and t 4 =2900 (id 8). I strongly suspect that this is inappropriate with the dataset in its current form. It appears to me that the two observations for id=14 are treated by PROC PHREG as if they were from different (albeit somewhat related [ID statement]) patients: one with drug=1 and an event, one with drug=0 and censored. Similarly, the two observations with id=15 have different drug values, but both are censored. The different drug values are not treated as they should for a time-dependent covariate. As a matter of fact, we obtain yet another partial likelihood function, L(b) = exp(2b)/((3+3exp(b))*(2+exp(b))*(1+exp(b))), resulting in the estimate b≈0.94061 (again, use xconv=1e-8 to get the last two decimals right) and the hazard ratio exp(b)≈2.561.
To correct this, you can aggregate the data to one observation per patient and then use programming statements in the PROC PHREG step to create a really time-dependent covariate d:
data have_td; /* "td" indicating the time-dependent covariate */
set have;
by g id;
if last.id;
run;
proc phreg data=have_td COVS (AGGREGATE);
model time * hypertension (0) = d/risklimits;
*id id; /* ID statement is now redundant */
where g=1; /* no difference between the groups */
if id=14 then d=(time>30);
else if id=15 then d=(time>500);
else d=drug;
run;
(In the PROC PHREG step I used hardcoded IDs and time thresholds for simplicity; see Example 92.7 Time-Dependent Repeated Measurements of a Covariate for another technique.)
Finally, the results, parameter estimate b≈1.07307 and hazard ratio exp(b)≈2.924, match exactly those from g=1 with the data in counting process style (see above) and are not impacted by the constant time shift from time=0 to t1=100 as the start time.
... View more