Hello!
I am trying to perform survival analysis on a sample with 100,000 observations.
The sample is 90% censored, so there are around 10,000 events.
(I chose such a large sample to ensure an adequate number of events, given the 90% censoring rate in the population.)
The survival time is in days from date of birth, the event is death.
Left truncation is accounted for by specifying the ENTRY= option in the MODEL statement.
I used PROC PHREG to fit a semi-parametric Cox proportional hazards model and followed this procedure:
1. Model with two variables: education level (4 levels) and sex (2 levels).
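proc phreg data=model_data;
   class sex(ref='M') edu_n(ref='3');   /* reference levels: male, education level 3 */
   /* event=0 marks censored observations; ENTRY= gives the left-truncation time */
   model surv_t_dob*event(0) = edu_n sex / entry=surv_t_till_s;
   output ressch=_all_;
run;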
2. The PH assumption is violated for both variables. I verified this by inspecting log cumulative hazard plots and Schoenfeld residuals, and by testing the significance of time-dependent interactions.
3. To remedy the PH violation, I stratified on sex and included three time interactions (education level i * survival time), one for each level of education except the reference level.
proc phreg data=model_data;
   class sex(ref='M') edu_n(ref='3');
   strata sex;                          /* separate baseline hazard for each sex */
   model surv_t_dob*event(0) =
         edu_n edu_nt1 edu_nt2 edu_nt4 / entry=surv_t_till_s;
   /* time-dependent interactions: education level indicator * survival time */
   edu_nt1 = (edu_n=1)*surv_t_dob;
   edu_nt2 = (edu_n=2)*surv_t_dob;
   edu_nt4 = (edu_n=4)*surv_t_dob;
   output ressch=_all_;
run;
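To make explicit what I am trying to interpret, here is the model from step 3 written out as I understand it. The symbols \beta_j, \gamma_j and h_{0s} are my own notation (they do not appear in the SAS output), with s indexing the sex stratum, j the non-reference education levels, and t the time in days from birth:

```latex
% Stratified Cox model with education-by-time interactions (reference level: 3)
h_s(t \mid \text{edu}) = h_{0s}(t)\,
  \exp\!\Big( \sum_{j \in \{1,2,4\}} \big(\beta_j + \gamma_j\, t\big)\,
  \mathbf{1}\{\text{edu} = j\} \Big)

% Hazard ratio of education level j versus level 3 at time t, within a sex stratum
\mathrm{HR}_j(t) = \exp\big(\beta_j + \gamma_j\, t\big), \qquad j \in \{1, 2, 4\}
```

If this is right, there is no single hazard ratio per education level: exp(\beta_j) alone would be the ratio at t = 0, and the ratio at any other age has to be evaluated from both coefficients. Since t is in days, each \gamma_j is a per-day change in the log hazard ratio.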
Now I want to know whether I can correctly interpret the hazard ratios of this extended Cox model. My initial plan was to look at the model fit statistics and the Schoenfeld residuals.
The fit statistics tell me that the model performs better than a null model:
Model Fit Statistics
Criterion | Without Covariates | With Covariates
-2 LOG L | 188952.79 | 188658.3
AIC | 188952.8 | 188670.3
SBC | 188952.8 | 188714.6
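As a sanity check on these numbers, here is the arithmetic behind the comparison, assuming the usual definition AIC = -2 log L + 2k, where k is the number of estimated parameters (k is my notation):

```latex
% Likelihood ratio statistic: null model versus model with covariates
\chi^2_{LR} = 188952.79 - 188658.3 = 294.49

% Number of parameters implied by the AIC penalty
\text{AIC} - (-2\log L) = 188670.3 - 188658.3 = 12.0 = 2k
\;\Rightarrow\; k = 6
```

The six parameters match the three education main effects plus the three time interactions, so the likelihood ratio statistic of 294.49 would be compared against a chi-square distribution with 6 degrees of freedom.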
The problem, however, is that PROC PHREG does not create an output data set when time-dependent covariates are defined using programming statements.
I tried creating the time-dependent variables separately in a DATA step, but according to this discussion (Link), that method is incorrect.
Another option was to use the counting-process style of input, but according to this note (Link), the survival estimates are wrong when counting-process input is used with time-dependent covariates, and there is no workaround. Hence, I assume that the residuals will also be incorrect.
My questions are:
Q1: How do I validate this extended Cox model in SAS? Under what conditions can I appropriately interpret the hazard ratios?
Q2: Is it possible to examine the residuals of such a model? Why does SAS not create an output data set when time-dependent covariates are included?