Solved: using time-on-study and counting process with different time-based cha...

ffgsdf · Posted 06-14-2023 12:56 PM

I recently use different time-based to model cox proportional model.

I have read some papers that the counting process should be the same results as it does in only one survival time variable.

However, I found something wrong here.

In my following example, if i use the counting process, the group 1 and group 2 are actually the same data, but with different time scale. The t1 and t 2 variables in the group 2 have a different time-based, so the t1 and t2 plus the time_shift is the t1 and t2 in the group 2 with the same id.

So, the two groups have the same period of t1 to t2, and the same id, drug use, and outcome (i.e., hypertension).

However, if i use the counting process, i will get a different HR (for group 1 and group, i.e., 2, 1.07 and 2.26, respectively).

When i use hte time-on-study, i will get another different estimate, 1.19, but they are the same in the two groups.

I tried to replicate in the R, the results are the estiamtes are the same in counting process or time-on-study, they are 2.26 and 1.19, although they are still different in counting process and time-on-study.

Can anyone explain why they are different?

data t;
INPUT g	id	hypertension	t1	t2	drug	time	time_shift;
cards;
1	1	0	21916	24940	1	3024	16699
1	2	1	21916	21976	0	60	14510
1	3	0	21916	27089	0	5173	14550
1	4	1	21916	23765	0	1849	17177
1	5	1	21916	23067	0	1151	14896
1	6	0	21916	21949	0	33	17837
1	6	0	21949	23802	1	1853	17837
1	7	1	21916	22999	1	1083	18612
1	8	1	21916	24659	0	2743	16401
1	9	0	21916	22006	0	90	19633
1	10	0	21916	23691	0	1775	17948
1	11	1	21916	22470	0	554	14488
1	12	0	21916	24201	0	2285	16684
1	12	0	24201	24955	1	754	16684
1	13	0	21916	21951	0	35	17085
1	13	0	21951	24554	1	2603	17085
1	14	0	21916	21946	0	30	16824
1	14	1	21946	24625	1	2679	16824
1	15	0	21916	22422	0	506	17888
1	15	0	22422	23751	1	1329	17888
1	16	0	21916	23628	0	1712	18011
1	17	0	21916	23896	0	1980	17743
1	18	1	21916	23275	1	1359	16971
1	19	0	21916	21961	0	45	16993
1	19	1	21961	23665	1	1704	16993
1	20	1	21916	22364	1	448	17821
2	1	0	38615	41639	1	3024	0
2	2	1	36426	36486	0	60	0
2	3	0	36466	41639	0	5173	0
2	4	1	39093	40942	0	1849	0
2	5	1	36812	37963	0	1151	0
2	6	0	39753	39786	0	33	0
2	6	0	39786	41639	1	1853	0
2	7	1	40528	41611	1	1083	0
2	8	1	38317	41060	0	2743	0
2	9	0	41549	41639	0	90	0
2	10	0	39864	41639	0	1775	0
2	11	1	36404	36958	0	554	0
2	12	0	38600	40885	0	2285	0
2	12	0	40885	41639	1	754	0
2	13	0	39001	39036	0	35	0
2	13	0	39036	41639	1	2603	0
2	14	0	38740	38770	0	30	0
2	14	1	38770	41449	1	2679	0
2	15	0	39804	40310	0	506	0
2	15	0	40310	41639	1	1329	0
2	16	0	39927	41639	0	1712	0
2	17	0	39659	41639	0	1980	0
2	18	1	38887	40246	1	1359	0
2	19	0	38909	38954	0	45	0
2	19	1	38954	40658	1	1704	0
2	20	1	39737	40185	1	448	0
;

proc phreg data=t COVS (AGGREGATE);
model (t1, t2) * hypertension (0) = drug/risklimits; 
id id;
by g;
run;

proc phreg data=t COVS (AGGREGATE);
model time * hypertension (0) = drug/risklimits; 
id id;
by g;
run;

FreelanceReinh · Posted 06-17-2023 06:35 PM

Hello @ffgsdf,

Sorry to see that nobody has replied yet. I have investigated this question a bit, using a simplified input dataset -- basically a subset of only five patients from your sample data, with shifted and rounded times, but I think maintaining the relevant characteristics of the original data:

data have;
input g id hypertension t1 t2 drug time time_shift;
cards;
1  4 1   100  2000 0 1900 17000
1  8 1   100  3000 0 2900 16400
1 14 0   100   130 0   30 16800
1 14 1   130  2800 1 2670 16800
1 15 0   100   600 0  500 17900
1 15 0   600  1900 1 1300 17900
1 20 1   100   500 1  400 17800
2  4 1 17100 19000 0 1900     0
2  8 1 16500 19400 0 2900     0
2 14 0 16900 16930 0   30     0
2 14 1 16930 19600 1 2670     0
2 15 0 18000 18500 0  500     0
2 15 0 18500 19800 1 1300     0
2 20 1 17900 18300 1  400     0
;

With this HAVE dataset it is feasible to reproduce at least the parameter estimates and hazard ratios from the two PROC PHREG steps by manual calculation:

Luckily, there are no ties in the event times, so the formula for the partial likelihood function (to be maximized) simplifies from the expressions (e.g., Breslow's) given in the documentation: Partial Likelihood Function for the Cox Model. The relevant formula can also be found in Klein and Moeschberger (2003): formula (9.2.1) on page 297.

There is only one dichotomous covariate, variable drug, which simplifies the formula further: Only one parameter (b) is to be estimated and the partial likelihood L(b) is a product of a few fractions -- one for each event time. The numerators of these fractions are either exp(b) or exp(0)=1, depending on whether the relevant value of drug equals 1 or 0. The denominators are sums of a few of such terms (i.e., exp(b) or 1), the number of which depends on the risk set at the relevant point in time.

Let's start with group 1 (g=1) using the counting process data (t1, t2). We have four distinct event times: t₁=500 (from id 20; not to be confused with variable t1), t₂=2000 (from id 4), t₃=2800 (id 14) and t₄=3000 (id 8). Note that id 15 (hypertension=0) contributes only censored observations, no event times. The first factor in the product L(b), i.e., the factor for event time t₁=500, has the numerator exp(b) because the patient with id=20 was on drug (drug=1) when the event (hypertension=1) occurred. The denominator is 1 + 1 + exp(b) + 1 + exp(b) = 3 + 2exp(b) -- the sum of the corresponding terms for the patients at risk at time t₁=500: id=4 with drug=0 (hence the first 1), id=8 with drug=0 (hence the second 1), id=14 with drug=1 (hence exp(b)), and id=15 with drug=0 (at that time, hence the third 1), and id=20 itself with drug=1 (hence the second exp(b), as in the numerator).

The second factor in the product L(b), i.e., the factor for event time t₂=2000, has the numerator 1 because the corresponding patient (id=4) was not on drug (drug=0) when the event (hypertension=1) occurred. The denominator is 1 + 1 + exp(b) -- the sum of the corresponding terms for the patients at risk at time t₂=2000: id=4 itself with drug=0 (hence the first 1), id=8 with drug=0 (hence the second 1) and id=14 with drug=1 (hence exp(b)). The other two patients are no longer at risk at time t₂=2000 (id 15 was censored at time 1900, id 20 has had their event already at time 500).

Similarly, we obtain exp(b)/(1+exp(b)) for the third factor in the product L(b) and simply 1/1 = 1 for the fourth factor, so L(b) = exp(2b)/((3+2exp(b))*(2+exp(b))*(1+exp(b))). This function (see a plot) exhibits a unique maximum at b≈1.07307 -- matching the parameter estimate from PROC PHREG exactly. The hazard ratio of "drug=1 vs. drug=0" is therefore exp(b)=exp(1.07307)≈2.924, again matching the PROC PHREG output.

Performing the analogous calculation for the second group (g=2) -- the four event times are now t₁=18300 (from id 20), t₂=19000 (id 4), t₃=19400 (id 8) and t₄=19600 (id 14) --, it becomes obvious that a constant time_shift would not change the results, whereas time_shift values varying between patients have a potential impact. Indeed, this is the case for your data (and for our simplified dataset HAVE as well). For example, patients 14 and 15 are now in the risk set at the event time of patient 8 (t₃=19400), which was not the case for g=1. We end up with a different partial likelihood function, L(b) = exp(b)/((3+2exp(b))*(2+2exp(b))*(1+2exp(b))*2) (one factor exp(b) cancels out) and, not surprisingly, this function has its maximum at a quite different place: b≈-0.82168 -- matching the parameter estimate from PROC PHREG (-0.82158) almost exactly. Why only "almost"? It turns out that it takes a tighter convergence criterion to get the more accurate parameter estimate -0.82168 from PROC PHREG: adding the option xconv=1e-8, for instance, to the MODEL statement does the trick. The hazard ratio exp(b)≈0.440 this time.

It is a question for the study statistician familiar with the research question (and the subject matter in general), what time shifts (if any) are appropriate to ensure that corresponding times of different patients are "synchronized" so that the risk sets at each event time are built correctly.

Now let's turn to the second PROC PHREG step, using variable time instead of t1 and t2, with event times t₁=400 (id 20), t₂=1900 (id 4), t₃=2670 (id 14) and t₄=2900 (id 8). I strongly suspect that this is inappropriate with the dataset in its current form. It appears to me that the two observations for id=14 are treated by PROC PHREG as if they were from different (albeit somewhat related [ID statement]) patients: one with drug=1 and an event, one with drug=0 and censored. Similarly, the two observations with id=15 have different drug values, but both are censored. The different drug values are not treated as they should for a time-dependent covariate. As a matter of fact, we obtain yet another partial likelihood function, L(b) = exp(2b)/((3+3exp(b))*(2+exp(b))*(1+exp(b))), resulting in the estimate b≈0.94061 (again, use xconv=1e-8 to get the last two decimals right) and the hazard ratio exp(b)≈2.561.

To correct this, you can aggregate the data to one observation per patient and then use programming statements in the PROC PHREG step to create a really time-dependent covariate d:

data have_td; /* "td" indicating the time-dependent covariate */
set have;
by g id;
if last.id;
run;

proc phreg data=have_td COVS (AGGREGATE);
model time * hypertension (0) = d/risklimits; 
*id id; /* ID statement is now redundant */
where g=1; /* no difference between the groups */
if id=14 then d=(time>30);
else if id=15 then d=(time>500);
else d=drug;
run;

(In the PROC PHREG step I used hardcoded IDs and time thresholds for simplicity; see Example 92.7 Time-Dependent Repeated Measurements of a Covariate for another technique.)

Finally, the results, parameter estimate b≈1.07307 and hazard ratio exp(b)≈2.924, match exactly those from g=1 with the data in counting process style (see above) and are not impacted by the constant time shift from time=0 to t1=100 as the start time.

View solution in original post

FreelanceReinh · Posted 06-17-2023 06:35 PM

Hello @ffgsdf,

Sorry to see that nobody has replied yet. I have investigated this question a bit, using a simplified input dataset -- basically a subset of only five patients from your sample data, with shifted and rounded times, but I think maintaining the relevant characteristics of the original data:

data have;
input g id hypertension t1 t2 drug time time_shift;
cards;
1  4 1   100  2000 0 1900 17000
1  8 1   100  3000 0 2900 16400
1 14 0   100   130 0   30 16800
1 14 1   130  2800 1 2670 16800
1 15 0   100   600 0  500 17900
1 15 0   600  1900 1 1300 17900
1 20 1   100   500 1  400 17800
2  4 1 17100 19000 0 1900     0
2  8 1 16500 19400 0 2900     0
2 14 0 16900 16930 0   30     0
2 14 1 16930 19600 1 2670     0
2 15 0 18000 18500 0  500     0
2 15 0 18500 19800 1 1300     0
2 20 1 17900 18300 1  400     0
;