I'm trying to figure out how SAS calculates the per level estimates using contrast statements, as well as the predicted survival per person in a survival model with an interaction coefficient.
For example, I'm trying to predict an outcome from bmi, a binary categorical variable (1 vs. 2), a binary covariate (0 vs. 1), and an interaction term (bmi_lvl=bmi*lvl). Here's the data:
data temp;
input ID foltime bmi cov lvl bmi_lvl outcome;
datalines;
1 1.2 20 0 2 40 1
2 0.5 23 1 2 46 1
3 4.5 30 0 1 30 0
4 3.6 18 0 2 36 0
5 1.0 25 0 1 25 1
6 1.2 22 0 1 22 1
7 4.0 26 1 2 52 0
8 3.3 26 0 1 26 1
9 0.8 24 1 2 48 0
10 1.2 32 0 2 64 0
11 3.0 22 1 2 44 1
12 3.3 23 1 2 46 1
13 1.0 30 0 1 30 0
14 1.8 36 0 2 72 1
15 0.9 25 1 2 50 1
;
run;
I plug it into the model:
proc phreg data=temp outest=beta;
class cov (ref='0');
model foltime*outcome(0)=bmi lvl cov bmi_lvl/ties=efron RL;
*output statement provides the per person estimated survival;
output out=pred survival=survpred;
*contrast statement provides hazard estimates at specific BMI and level, at reference category for covariate;
*contrasting hazard at BMI 24 for levels 1 and 2;
contrast "lvl 1 at bmi 24" lvl 1 bmi 0 cov 0 bmi_lvl 24/estimate=exp e;
contrast "lvl 2 at bmi 24" lvl 2 bmi 0 cov 0 bmi_lvl 48/estimate=exp e;
*contrasting hazard at BMI 28 for levels 1 and 2;
contrast "lvl 1 at BMI 28" lvl 1 bmi 0 cov 0 bmi_lvl 28/estimate=exp e;
contrast "lvl 2 at BMI 28" lvl 2 bmi 0 cov 0 bmi_lvl 56/estimate=exp e;
run;
I set BMI to 0 in the contrast statement based on example 2, page 10 of this paper, https://support.sas.com/resources/papers/proceedings10/253-2010.pdf, however, I'm not sure why that is.
The coefficients are -0.8357 for BMI, -11.398 for lvl, 0.4375 for cov, and 0.4345 for bmi_lvl. Attached are the results for the contrast statements, as well as the per person estimated survival.
How does SAS calculate these?
Thanks!
Hello @Caetreviop543,
Thanks for posting this interesting question and for providing usable sample data and code.
The "Details" sections of the statistical procedures' documentation are mostly helpful regarding computation formulas.
Let's first see what we find out about the predicted survival (probability) and use ID 1 as our example:
You are using the default method in the OUTPUT statement, i.e. METHOD=BRESLOW.
Section "Survivor Function Estimators" of the PROC PHREG documentation informs us that the Breslow estimate of the survivor function is computed as exp(minus some "Lambda hat") and that "Lambda hat," in turn, is the exponentiated linear combination of the explanatory variables (vector x) with the parameter estimates (vector "b hat") as coefficients -- multiplied with a sum with as many terms as there are distinct uncensored times (ti) less than or equal to the time t for which we want to compute the predicted survival probability (and this is the foltime of ID 1, i.e. t=1.2). There are four such times: t1=0.5, t2=0.9, t3=1.0 and t4=1.2 (note that 0.8 is a censored time, hence disregarded here).
Now, let's calculate each of the four terms in the sum. The numerators are easy: d(ti) "is the number of subjects that have an event at" ti, i.e. 1, 1, 1, 2 for i=1, 2, 3, 4, respectively (ID 10, which was censored at t=1.2, is not counted). The denominators are sums with 15 terms (one for each ID). The factors Yj(t), for our calculation: Yj(ti), are easy: The indicator function defining them is 1 if foltime>=ti, else 0. In particular, for i=1 (the term corresponding to the smallest event time) Yj(t1)=1 for all j=1, ..., 15. The other factor is computed from the explanatory variables (exp of the linear combination as above):
data calc;
if _n_=1 then set beta(keep=bmi lvl cov1 bmi_lvl rename=(bmi=_bmi lvl=_lvl cov1=_cov1 bmi_lvl=_bmi_lvl));
set temp(rename=(foltime=t));
b=bmi*_bmi+lvl*_lvl+cov*_cov1+bmi_lvl*_bmi_lvl;
e=exp(b);
y1=(t>=0.5);
y2=(t>=0.9);
y3=(t>=1);
y4=(t>=1.2);
run;
Let's quickly use PROC MEANS to compute the four denominators:
proc means data=calc(where=(y1)) sum; var e; run;
proc means data=calc(where=(y2)) sum; var e; run;
proc means data=calc(where=(y3)) sum; var e; run;
proc means data=calc(where=(y4)) sum; var e; run;
Finally, we are ready to compute the predicted survival probability for ID 1 at time 1.2 from the value of variable E in dataset CALC (see above) for ID 1 (2.4507E-10), the d(ti) values (numerators) and the four sums from the PROC MEANS steps (denominators):
exp(-2.4507E-10 * (1/6.4531E-9 + 1/5.5999E-9 + 1/5.1514E-9 + 2/4.5907E-9))=0.78972, which confirms the value found in dataset PRED.
Now, let's turn to the contrasts and take the first one as an example. For this contrast you specified the L matrix (which is simply a row vector in this case) as (0, 1, 0, 24), where the four components correspond to bmi, lvl, cov and bmi_lvl, respectively. The estimate of Lb is, of course, L "b hat". Thanks to the zeros in the L matrix this simplifies to the parameter estimate of lvl plus 24 times the parameter estimate of bmi_lvl: -11.39812+24*0.43454=-0.96926 (rounding error corrected!). By means of the EXP option of the CONTRAST statement you requested exp(-0.96926)=0.3794 (exactly what is shown in the "Estimate" column in your table "Contrast Estimation and Testing Results by Row").
Section Type 3 Tests and Joint Tests of the documentation tells us how the Wald Chi-Square test statistic is calculated: Since vector L "b hat" is just a number in our case, it's sufficient to multiply the middle term, "[...]^-1", by the square of -0.96926 (see above). The middle term is a number as well (hence no matrix inversion required), but it involves the estimated model-based covariance matrix. This can be requested by adding the COVB option to the MODEL statement (after "ties=efron RL").
Estimated Covariance Matrix Parameter bmi lvl cov1 bmi_lvl bmi 0.24457185 3.12231124 0.01192306 -0.12710150 lvl 3.12231124 42.06677017 -0.52534029 -1.67093145 cov1 cov 1 0.01192306 -0.52534029 0.80994750 0.00021602 bmi_lvl -0.12710150 -1.67093145 0.00021602 0.06777288
Thanks to the symmetry of this matrix and the zeros in L we only need three different entries of the matrix (call them vij, e.g. v24=-1.67093145):
c²=(-0.96926)**2/(v22+48*v24+576*v44)=1.0447, which confirms the result in the output (including the p-value
1-probchi(1.0447,1)=0.3067).
Looking at the Cox proportional hazards model we see that exp(Lb) is a hazard ratio: L=(0,1,0,24)=Z'1-Z'2, where the first component (0) means "equal bmi values", the second (1) means a difference of 1 between the lvl values (hence must be lvl=2 vs. lvl=1, which I would include in the label of the contrast), the third (0) means "equal cov values" and the fourth (24) means a difference of 24 between the bmi_lvl values, which implies (because of the first and second component and the definition of bmi_lvl=bmi*lvl) bmi=24. So, I'd say the estimate 0.3794 suggests a (not significantly -- see the upper confidence limit >1) reduced risk for "level 2" subjects with BMI 24 compared to "level 1" subjects with the same BMI and the same COV (be it 0 or 1).
I don't see a similar interpretation of the second contrast you defined ("lvl 2 at bmi 24"), though, because a difference of 2 between two lvl values (taken from {1, 2}) is impossible. Note that the factor 2 cancels out (twice) in the computation of the Wald Chi-Square value, which is therefore the same as for the first contrast, while the estimate (0.1439) is just the square of the other one (0.3794).
Hello @Caetreviop543,
Thanks for posting this interesting question and for providing usable sample data and code.
The "Details" sections of the statistical procedures' documentation are mostly helpful regarding computation formulas.
Let's first see what we find out about the predicted survival (probability) and use ID 1 as our example:
You are using the default method in the OUTPUT statement, i.e. METHOD=BRESLOW.
Section "Survivor Function Estimators" of the PROC PHREG documentation informs us that the Breslow estimate of the survivor function is computed as exp(minus some "Lambda hat") and that "Lambda hat," in turn, is the exponentiated linear combination of the explanatory variables (vector x) with the parameter estimates (vector "b hat") as coefficients -- multiplied with a sum with as many terms as there are distinct uncensored times (ti) less than or equal to the time t for which we want to compute the predicted survival probability (and this is the foltime of ID 1, i.e. t=1.2). There are four such times: t1=0.5, t2=0.9, t3=1.0 and t4=1.2 (note that 0.8 is a censored time, hence disregarded here).
Now, let's calculate each of the four terms in the sum. The numerators are easy: d(ti) "is the number of subjects that have an event at" ti, i.e. 1, 1, 1, 2 for i=1, 2, 3, 4, respectively (ID 10, which was censored at t=1.2, is not counted). The denominators are sums with 15 terms (one for each ID). The factors Yj(t), for our calculation: Yj(ti), are easy: The indicator function defining them is 1 if foltime>=ti, else 0. In particular, for i=1 (the term corresponding to the smallest event time) Yj(t1)=1 for all j=1, ..., 15. The other factor is computed from the explanatory variables (exp of the linear combination as above):
data calc;
if _n_=1 then set beta(keep=bmi lvl cov1 bmi_lvl rename=(bmi=_bmi lvl=_lvl cov1=_cov1 bmi_lvl=_bmi_lvl));
set temp(rename=(foltime=t));
b=bmi*_bmi+lvl*_lvl+cov*_cov1+bmi_lvl*_bmi_lvl;
e=exp(b);
y1=(t>=0.5);
y2=(t>=0.9);
y3=(t>=1);
y4=(t>=1.2);
run;
Let's quickly use PROC MEANS to compute the four denominators:
proc means data=calc(where=(y1)) sum; var e; run;
proc means data=calc(where=(y2)) sum; var e; run;
proc means data=calc(where=(y3)) sum; var e; run;
proc means data=calc(where=(y4)) sum; var e; run;
Finally, we are ready to compute the predicted survival probability for ID 1 at time 1.2 from the value of variable E in dataset CALC (see above) for ID 1 (2.4507E-10), the d(ti) values (numerators) and the four sums from the PROC MEANS steps (denominators):
exp(-2.4507E-10 * (1/6.4531E-9 + 1/5.5999E-9 + 1/5.1514E-9 + 2/4.5907E-9))=0.78972, which confirms the value found in dataset PRED.
Now, let's turn to the contrasts and take the first one as an example. For this contrast you specified the L matrix (which is simply a row vector in this case) as (0, 1, 0, 24), where the four components correspond to bmi, lvl, cov and bmi_lvl, respectively. The estimate of Lb is, of course, L "b hat". Thanks to the zeros in the L matrix this simplifies to the parameter estimate of lvl plus 24 times the parameter estimate of bmi_lvl: -11.39812+24*0.43454=-0.96926 (rounding error corrected!). By means of the EXP option of the CONTRAST statement you requested exp(-0.96926)=0.3794 (exactly what is shown in the "Estimate" column in your table "Contrast Estimation and Testing Results by Row").
Section Type 3 Tests and Joint Tests of the documentation tells us how the Wald Chi-Square test statistic is calculated: Since vector L "b hat" is just a number in our case, it's sufficient to multiply the middle term, "[...]^-1", by the square of -0.96926 (see above). The middle term is a number as well (hence no matrix inversion required), but it involves the estimated model-based covariance matrix. This can be requested by adding the COVB option to the MODEL statement (after "ties=efron RL").
Estimated Covariance Matrix Parameter bmi lvl cov1 bmi_lvl bmi 0.24457185 3.12231124 0.01192306 -0.12710150 lvl 3.12231124 42.06677017 -0.52534029 -1.67093145 cov1 cov 1 0.01192306 -0.52534029 0.80994750 0.00021602 bmi_lvl -0.12710150 -1.67093145 0.00021602 0.06777288
Thanks to the symmetry of this matrix and the zeros in L we only need three different entries of the matrix (call them vij, e.g. v24=-1.67093145):
c²=(-0.96926)**2/(v22+48*v24+576*v44)=1.0447, which confirms the result in the output (including the p-value
1-probchi(1.0447,1)=0.3067).
Looking at the Cox proportional hazards model we see that exp(Lb) is a hazard ratio: L=(0,1,0,24)=Z'1-Z'2, where the first component (0) means "equal bmi values", the second (1) means a difference of 1 between the lvl values (hence must be lvl=2 vs. lvl=1, which I would include in the label of the contrast), the third (0) means "equal cov values" and the fourth (24) means a difference of 24 between the bmi_lvl values, which implies (because of the first and second component and the definition of bmi_lvl=bmi*lvl) bmi=24. So, I'd say the estimate 0.3794 suggests a (not significantly -- see the upper confidence limit >1) reduced risk for "level 2" subjects with BMI 24 compared to "level 1" subjects with the same BMI and the same COV (be it 0 or 1).
I don't see a similar interpretation of the second contrast you defined ("lvl 2 at bmi 24"), though, because a difference of 2 between two lvl values (taken from {1, 2}) is impossible. Note that the factor 2 cancels out (twice) in the computation of the Wald Chi-Square value, which is therefore the same as for the first contrast, while the estimate (0.1439) is just the square of the other one (0.3794).
Ok, thank you so much for your thorough and clear response. That really helps, and will hopefully help others too. You're right about the two level increase in the contrast statement. I thought they were measuring the hazard at a specific level, rather than a ratio. In other words, I thought each statement corresponded to the hazard at level 1 or level 2 for the same BMI, rather than a change in hazard per unit increase in level.
Thanks again!
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.
Find more tutorials on the SAS Users YouTube channel.