Referencing this post and community members as it is now marked as solved and locked:
I really liked @FreelanceReinh's and @ballardw's explanations in @TL93's post on PROC PHREG not recognizing the reference category, and I would like to understand the issue further.
Even though SAS ignores the category because it is absorbed by the first variable, I believe SAS still accounts for those observations through the other variables, where nothing is absorbed. Take an observation with immigrant = 0, ysmcat = 0, occ = 2 and ind = 3:
For those with immigrant = 0, the ysmcat = 0 estimate is set to missing in the output and immigrant = 0 is used as the reference. No calculations are reported for it; however, those observations aren't dropped, and this observation's occ = 2 and ind = 3 are still used in calculating the hazards for those variables and the subsequent ratios.
Additionally, the effect of the next strata for ysmcat above the original ysmcat = 0 (so ysmcat =1) may become the effect of immigrant = 1. I am seeing this in a similar model that I have run. Would you know why this happens and what is going on in the model?
My take is that in PROC PHREG, when a stratum of one variable (immigrant = 0) is replicated by a stratum of another variable (ysmcat = 0), SAS does not take that stratum into consideration and implicitly steps up or down to the next stratum of the variable that is not linearly dependent (ysmcat = 1 or ysmcat = 6, depending on the REF parameters). Since there is no dependence for the other variables (occ = 2, ind = 3), the HRs for occ and ind for that individual are still considered in the output.
I notice this as observations would not be dropped due to the linear dependence of immigrant strata 0 and ysmcat strata 0 ("Number of Observations Read" in the output). But I would like to know more about what is happening, if you or others have insight. It would be great if @TL93 could share whether they saw this happening as well. Thank you.
Hello @soosas,
@soosas wrote:
... observations would not be dropped due to the linear dependence of immigrant strata 0 and ysmcat strata 0 ("Number of Observations Read" in the output).
Correct. The observations are used appropriately in the model. It's just one or the other parameter estimate that is set to zero because of the linear dependence between the dummy variables.
@soosas wrote:
Additionally, the effect of the next strata for ysmcat above the original ysmcat = 0 (so ysmcat =1) may become the effect of immigrant = 1. I am seeing this in a similar model that I have run. Would you know why this happens and what is going on in the model?
Yes, this happens because the coefficients (parameter estimates) are not uniquely determined due to the linear dependence.
We don't have the data from the 2018 thread you are referring to, so let me create a similar dataset from the VALung dataset found in the PROC PHREG documentation:
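/* Make Cell='large' and Therapy='standard' describe exactly the same patients */
data valung1;
   set valung;
   if cell='large' then therapy='standard';
   if therapy='standard' then cell='large';
run;

/* Cross-tabulate the two variables to confirm the dependence */
proc freq data=valung1;
   tables cell*therapy / nopercent norow nocol;
run;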
Table of Cell by Therapy

Cell(cell type)     Therapy(type of treatment)

Frequency|standard|test    |  Total
---------+--------+--------+
adeno    |      0 |     18 |     18
---------+--------+--------+
large    |     81 |      0 |     81
---------+--------+--------+
small    |      0 |     18 |     18
---------+--------+--------+
squamous |      0 |     20 |     20
---------+--------+--------+
Total          81       56      137
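The pattern of zero cells can also be checked row by row. This is only a minimal sketch, assuming the VALung1 data set created above. The data set name DummyCheck and the indicator names (is_test, is_adeno, etc.) are made up for illustration; they merely mimic the 0/1 dummy variables that reference coding creates for the non-reference levels.

```sas
/* Hand-made 0/1 indicators (hypothetical names, not created by PROC PHREG) */
data DummyCheck;
   set valung1;
   is_test     = (therapy = 'test');
   is_adeno    = (cell = 'adeno');
   is_small    = (cell = 'small');
   is_squamous = (cell = 'squamous');
   /* diff is 0 in a row if is_test = is_adeno + is_small + is_squamous there */
   diff = is_test - (is_adeno + is_small + is_squamous);
run;

/* If both MIN and MAX of diff are 0, the relation holds for every observation */
proc means data=DummyCheck n min max;
   var diff;
run;
```

If MIN and MAX of diff are both 0, the Therapy indicator is an exact linear combination of the Cell indicators, which is the kind of dependence discussed in this thread.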
Now we have a situation analogous to the one in the 2018 thread: Therapy='standard' is equivalent to Cell='large'. And these are the reference categories for Therapy and Cell, respectively, in the model below:
proc phreg data=VALung1;
   class Prior(ref='no') Cell(ref='large') Therapy(ref='standard');
   model Time*Status(0) = Kps Duration Age Therapy Cell Prior;
run;
Result:
                    Analysis of Maximum Likelihood Estimates

                         Parameter   Standard                             Hazard
Parameter           DF    Estimate      Error   Chi-Square   Pr > ChiSq    Ratio   Label
Kps                  1    -0.03125    0.00542      33.2568       <.0001    0.969   Karnofsky performance scale
Duration             1    -0.00361    0.00915       0.1560       0.6929    0.996   months from diagnosis to randomization
Age                  1    -0.00680    0.00906       0.5639       0.4527    0.993   age in years
Therapy   test       1    -0.57661    0.29581       3.7996       0.0513    0.562   type of treatment test
Cell      adeno      1     1.28090    0.38983      10.7966       0.0010    3.600   cell type adeno
Cell      small      1     1.45078    0.39951      13.1868       0.0003    4.266   cell type small
Cell      squamous   0           0          .            .            .        .   cell type squamous
Prior     yes        1     0.04539    0.22642       0.0402       0.8411    1.046   prior therapy yes
If the order of Therapy and Cell is changed in the MODEL statement, leaving everything else the same,
model Time*Status(0) = Kps Duration Age Cell Therapy Prior;
the parameter estimates change as follows:
                         Parameter   Standard                             Hazard
Parameter           DF    Estimate      Error   Chi-Square   Pr > ChiSq    Ratio   Label
Kps                  1    -0.03125    0.00542      33.2568       <.0001    0.969   Karnofsky performance scale
Duration             1    -0.00361    0.00915       0.1560       0.6929    0.996   months from diagnosis to randomization
Age                  1    -0.00680    0.00906       0.5639       0.4527    0.993   age in years
Cell      adeno      1     0.70430    0.28973       5.9092       0.0151    2.022   cell type adeno
Cell      small      1     0.87417    0.30022       8.4782       0.0036    2.397   cell type small
Cell      squamous   1    -0.57661    0.29581       3.7996       0.0513    0.562   cell type squamous
Therapy   test       0           0          .            .            .        .   type of treatment test
Prior     yes        1     0.04539    0.22642       0.0402       0.8411    1.046   prior therapy yes
Indeed, the coefficient and all pertinent statistics for Therapy='test' in the first model are now those of one of the Cell categories -- Cell='squamous' -- (and vice versa) and the statistics for the other Cell categories are also affected (but not the statistics of the other model variables, i.e., Kps, Duration, etc.).
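One way to convince yourself that both orderings describe the same fitted model is to compare the observation counts and fit statistics of the two runs. The following is a sketch under the assumption that the VALung1 data set from above is available; it uses ODS SELECT with the PROC PHREG table names NObs and FitStatistics to restrict the output to those two tables.

```sas
/* Model 1: Therapy listed before Cell */
ods select NObs FitStatistics;
proc phreg data=VALung1;
   class Prior(ref='no') Cell(ref='large') Therapy(ref='standard');
   model Time*Status(0) = Kps Duration Age Therapy Cell Prior;
run;

/* Model 2: Cell listed before Therapy, everything else unchanged */
/* (ODS SELECT has to be repeated because it applies to one step only) */
ods select NObs FitStatistics;
proc phreg data=VALung1;
   class Prior(ref='no') Cell(ref='large') Therapy(ref='standard');
   model Time*Status(0) = Kps Duration Age Cell Therapy Prior;
run;
```

Since both MODEL statements contain the same set of effects, the number of observations used and the -2 LOG L values should agree between the two runs; only the way the coefficients are distributed among the linearly dependent dummy variables differs.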
As mentioned above, this is due to the linear dependence of the dummy variables. Let's denote the dummy variables for 'test', 'adeno', 'small' and 'squamous' by $t$, $a$, $b$ and $c$, respectively. Then we have the linear relation

$$t = a + b + c$$

(cf. the PROC FREQ output above). As a consequence, any linear combination of $t$, $a$, $b$ and $c$, i.e., an expression of the form $b_1 t + b_2 a + b_3 b + b_4 c$, can be written equivalently in various ways, as the $b_i$ are not uniquely determined. For example, we can force the coefficient of $c$ to be zero. Since $c = t - a - b$, we have the equation

$$b_1 t + b_2 a + b_3 b + b_4 c = (b_1 + b_4)\,t + (b_2 - b_4)\,a + (b_3 - b_4)\,b,$$

so $c$ is eliminated. This is what SAS did in the first model: The parameter estimate of 'squamous' is 0. Similarly, in the second model $t$ is eliminated, i.e., its coefficient is set to zero. This is based on the equation

$$b_1 t + b_2 a + b_3 b + b_4 c = (b_1 + b_2)\,a + (b_1 + b_3)\,b + (b_1 + b_4)\,c$$

(obtained by substituting $a + b + c$ for $t$ in the linear combination).

Note that the coefficient of $c$ in this equation, $(b_1 + b_4)$, is exactly what was the coefficient of $t$ in the former model. This is what we observed in the PROC PHREG table of maximum likelihood estimates regarding the parameter estimates of 'test' and 'squamous' when we swapped the two variables Therapy and Cell in the MODEL statement.

We also see how exactly the coefficients of 'adeno' and 'small' have changed: The coefficient $(b_1 + b_2)$ of dummy variable $a$ in the second model can be written as $(b_1 + b_4) + (b_2 - b_4)$, i.e., as the sum of the coefficients of $t$ and $a$ in the first model. Indeed, the parameter estimate for 'adeno' in the second model, 0.70430, is the sum of the parameter estimates for 'test' and 'adeno' in the first model, −0.57661+1.28090 (up to rounding error). Similarly, −0.57661+1.45078 = 0.87417 yields the new parameter estimate for 'small' by adding the old parameter estimates for 'test' and 'small'. This is, again, a valid equation of coefficients: $b_1 + b_3 = (b_1 + b_4) + (b_3 - b_4)$.
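The arithmetic can be replayed in a short DATA step. This is just an illustrative sketch: the variable names (test1, adeno1, adeno2, etc.) are made up here, and the input values are copied from the first table of maximum likelihood estimates above.

```sas
data _null_;
   /* Parameter estimates from the first model (Therapy before Cell) */
   test1  = -0.57661;
   adeno1 =  1.28090;
   small1 =  1.45078;

   /* Estimates implied for the second model (Cell before Therapy) */
   adeno2    = test1 + adeno1;   /* 'test' + 'adeno'               */
   small2    = test1 + small1;   /* 'test' + 'small'               */
   squamous2 = test1;            /* 'squamous' takes over 'test'   */

   /* Write the results to the log with five decimals */
   put adeno2= 8.5 small2= 8.5 squamous2= 8.5;
run;
```

The sums are 0.70429 and 0.87417, which agree with the second model's reported estimates 0.70430 and 0.87417 up to rounding of the displayed values.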