Re: proc phreg not recognizing reference category (solved but a follow...


Posted 05-15-2023 10:31 AM
(624 views)

Referencing this post and community members as it is now marked as solved and locked:

I really liked the explanations by @FreelanceReinh and @ballardw in @TL93's post on PROC PHREG not recognizing the reference category, and I would like to understand the behavior further.

Even though SAS ignores the variable because it is absorbed by the first variable, I believe SAS still accounts for the observations in the other categories, where it is not absorbed. So, say, immigrant = 0, ysmcat = 0, occ = 2 and ind = 3:

For those with immigrant = 0, the ysmcat = 0 estimate is set to missing in the output and immigrant = 0 is used as the reference. The calculations aren't output; however, those observations aren't dropped, and this observation's occ = 2 and ind = 3 are used in the calculation of hazards for those variables and the subsequent ratios.

Additionally, the effect of the next strata for ysmcat above the original ysmcat = 0 (so ysmcat = 1) may become the effect of immigrant = 1. I am seeing this in a similar model that I have run. Would you know why this happens and what is going on in the model?

My take is that in PROC PHREG, when a strata in one variable (immigrant = 0) can be replicated by a strata of another variable (ysmcat = 0), SAS does not take those variables' strata into consideration and implicitly steps up or down to the next strata of the variable that is not linearly dependent (ysmcat = 1 or ysmcat = 6, depending on the REF parameters). Since there is no dependence for the other variables (occ = 2, ind = 3), the HR for occ and ind for that individual is still considered in the output.

I notice this as observations would not be dropped due to the linear dependence of immigrant strata 0 and ysmcat strata 0 ("Number of Observations Read" in the output). But I would like to know more about what is happening, if you or others have insight. It would also be great if @TL93 could share whether they saw this happening as well. Thank you.

1 ACCEPTED SOLUTION


Hello @soosas,

@soosas wrote:

... observations would not be dropped due to the linear dependence of immigrant strata 0 and ysmcat strata 0 ("Number of Observations Read" in the output).

Correct. The observations are used appropriately in the model. It's just one or the other parameter estimate that is set to zero because of the linear dependence between the dummy variables.

@soosas wrote:

Additionally, the effect of the next strata for ysmcat above the original ysmcat = 0 (so ysmcat = 1) may become the effect of immigrant = 1. I am seeing this in a similar model that I have run. Would you know why this happens and what is going on in the model?

Yes, this happens because the coefficients (parameter estimates) are not uniquely determined due to the linear dependence.

We don't have the data from the 2018 thread you are referring to, so let me create a similar dataset from the VALung dataset found in the PROC PHREG documentation:

```
data valung1;
set valung;
if cell='large' then therapy='standard';
if therapy='standard' then cell='large';
run;
proc freq data=valung1;
tables cell*therapy / nopercent norow nocol;
run;
```

Table of Cell by Therapy (frequencies; Cell = cell type, Therapy = type of treatment)

| Cell | standard | test | Total |
|----------|---:|---:|----:|
| adeno | 0 | 18 | 18 |
| large | 81 | 0 | 81 |
| small | 0 | 18 | 18 |
| squamous | 0 | 20 | 20 |
| Total | 81 | 56 | 137 |
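The dependence visible in the crosstab can also be checked row by row. The following is only a sketch: the dataset name `check` and the variables `t`, `a`, `b`, `c` and `diff` are made up for illustration, and the dummy variables are built by hand rather than taken from PROC PHREG's internal coding. If the relation "test = adeno + small + squamous" holds for every observation, the minimum and maximum of `diff` should both be 0.

```sas
/* Hand-built 0/1 dummy variables for the non-reference levels */
data check;
  set valung1;
  t = (Therapy = 'test');      /* 1 if Therapy is 'test', else 0 */
  a = (Cell = 'adeno');
  b = (Cell = 'small');
  c = (Cell = 'squamous');
  diff = t - (a + b + c);      /* 0 in every row if t = a + b + c */
run;

/* Min and max of diff are both 0 exactly when the relation holds everywhere */
proc means data=check min max;
  var diff;
run;
```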

Now we have a situation analogous to the one in the 2018 thread: Therapy='standard' is equivalent to Cell='large'. And these are the reference categories for Therapy and Cell, respectively, in the model below:

```
proc phreg data=VALung1;
class Prior(ref='no') Cell(ref='large') Therapy(ref='standard');
model Time*Status(0) = Kps Duration Age Therapy Cell Prior;
run;
```

Result:

Analysis of Maximum Likelihood Estimates

| Parameter | | DF | Parameter Estimate | Standard Error | Chi-Square | Pr > ChiSq | Hazard Ratio | Label |
|---|---|---:|---:|---:|---:|---:|---:|---|
| Kps | | 1 | -0.03125 | 0.00542 | 33.2568 | <.0001 | 0.969 | Karnofsky performance scale |
| Duration | | 1 | -0.00361 | 0.00915 | 0.1560 | 0.6929 | 0.996 | months from diagnosis to randomization |
| Age | | 1 | -0.00680 | 0.00906 | 0.5639 | 0.4527 | 0.993 | age in years |
| Therapy | test | 1 | -0.57661 | 0.29581 | 3.7996 | 0.0513 | 0.562 | type of treatment test |
| Cell | adeno | 1 | 1.28090 | 0.38983 | 10.7966 | 0.0010 | 3.600 | cell type adeno |
| Cell | small | 1 | 1.45078 | 0.39951 | 13.1868 | 0.0003 | 4.266 | cell type small |
| Cell | squamous | 0 | 0 | . | . | . | . | cell type squamous |
| Prior | yes | 1 | 0.04539 | 0.22642 | 0.0402 | 0.8411 | 1.046 | prior therapy yes |

If the order of Therapy and Cell is changed in the MODEL statement, leaving everything else the same,

`model Time*Status(0) = Kps Duration Age Cell Therapy Prior;`

the parameter estimates change as follows:

| Parameter | | DF | Parameter Estimate | Standard Error | Chi-Square | Pr > ChiSq | Hazard Ratio | Label |
|---|---|---:|---:|---:|---:|---:|---:|---|
| Kps | | 1 | -0.03125 | 0.00542 | 33.2568 | <.0001 | 0.969 | Karnofsky performance scale |
| Duration | | 1 | -0.00361 | 0.00915 | 0.1560 | 0.6929 | 0.996 | months from diagnosis to randomization |
| Age | | 1 | -0.00680 | 0.00906 | 0.5639 | 0.4527 | 0.993 | age in years |
| Cell | adeno | 1 | 0.70430 | 0.28973 | 5.9092 | 0.0151 | 2.022 | cell type adeno |
| Cell | small | 1 | 0.87417 | 0.30022 | 8.4782 | 0.0036 | 2.397 | cell type small |
| Cell | squamous | 1 | -0.57661 | 0.29581 | 3.7996 | 0.0513 | 0.562 | cell type squamous |
| Therapy | test | 0 | 0 | . | . | . | . | type of treatment test |
| Prior | yes | 1 | 0.04539 | 0.22642 | 0.0402 | 0.8411 | 1.046 | prior therapy yes |

Indeed, the coefficient and all pertinent statistics for Therapy='test' in the first model are now those of one of the Cell categories -- Cell='squamous' -- (and vice versa) and the statistics for the other Cell categories are also affected (but *not* the statistics of the other model variables, i.e., Kps, Duration, etc.).

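As mentioned above, this is due to the linear dependence of the dummy variables: Let's denote the dummy variables for 'test', 'adeno', 'small' and 'squamous' by *t*, *a*, *b* and *c*, respectively. Then we have the linear relation *t = a + b + c* (cf. the PROC FREQ output above). As a consequence, any linear combination of *t*, *a*, *b* and *c*, i.e., an expression of the form $b_1 t + b_2 a + b_3 b + b_4 c$, can be written equivalently in various different ways, for example with *t* eliminated by substituting *a + b + c* for it:

$$b_1 t + b_2 a + b_3 b + b_4 c = (b_1 + b_2)\,a + (b_1 + b_3)\,b + (b_1 + b_4)\,c$$

Note that the coefficient of *c* -- **$(b_1 + b_4)$** -- is exactly what was the coefficient of *t* in the first model: there $b_4 = 0$, so $b_1 + b_4 = b_1 = -0.57661$, the estimate now reported for Cell='squamous'.

We also see how exactly the coefficients of 'adeno' and 'small' have changed: The coefficient $(b_1 + b_2)$ of dummy variable *a* is $-0.57661 + 1.28090 = 0.70429$, which agrees with the 0.70430 of the second model up to rounding of the displayed estimates, and likewise the coefficient $(b_1 + b_3)$ of *b* is $-0.57661 + 1.45078 = 0.87417$.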
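The relation between the two sets of estimates can be checked with a few lines of arithmetic. This is only a sketch: the values are typed in by hand from the first results table (Therapy listed before Cell), and the names `b1` to `b4`, `adeno`, `small` and `squamous` are illustrative. The sums written to the log should agree with the Cell estimates of the second model, up to rounding of the displayed values.

```sas
/* Estimates copied by hand from the first model (Therapy before Cell) */
data _null_;
  b1 = -0.57661;   /* Therapy test */
  b2 =  1.28090;   /* Cell adeno */
  b3 =  1.45078;   /* Cell small */
  b4 =  0;         /* Cell squamous, set to 0 by PROC PHREG */

  /* Coefficients after eliminating t via t = a + b + c */
  adeno    = b1 + b2;
  small    = b1 + b3;
  squamous = b1 + b4;

  /* Write the three sums to the SAS log with five decimals */
  put adeno= 8.5 small= 8.5 squamous= 8.5;
run;
```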
