Solved: Re: proc phreg not recognizing reference category

TL93 · Posted 10-31-2018 05:33 PM

Hi SAS Community,

I am running proc phreg to predict the incidence of cancer, based on socio-demographic variables:

proc phreg data=<data_name>;
class immigrant (ref='0') ysmcat (ref='0') ind (ref='0') occ (ref='0') /param=reference;
model duration*status_cancer(0)=immigrant ysmcat ind occ /rl ties=efron;
run;

*note: I have quite a few more predictor variables but these four are the ones giving me trouble

ysmcat = years since migration; categories 0 to 6;

0=non-immigrants (reference group); 1=0 years, 2=1-5 years, and so on.

occ = occupation; categories 0 to 5;

0=no occupation (reference group); 1=occupation group 1, and so on.

My issue is highlighted in the following photo of my output:

I am getting periods (missing estimates) for the last category of the ysmcat and occ variables. Furthermore, the degrees of freedom for those categories are 0; I'm assuming that SAS is treating these categories as the reference group, even though I specified category 0 as the reference group. Category 0 for both variables are not present in the model. When I run a frequency table, it shows that all categories are coded correctly, including category 0.

For ysmcat, I thought the issue might be caused by the immigrant variable. So I dropped immigrant from the regression and, lo and behold, ysmcat was fine. The same goes for occ when I drop ind from the model. I have a feeling it has something to do with multicollinearity, or with the overlap in reference groups for these variables, but I cannot explain it.

Can anyone help explain what is going on and how to rectify the issue?

Thank you so much! More info on these four variables below:

immigrant

0=not an immigrant; 1=immigrant

ysmcat (years since migration cat)

0=not an immigrant; 1=0 years; 2=1-5 years; 3=6-10 years; 4=year_range; 5=year_range; 6=year_range

occ (occupation)

0=no occupation; 1=management; 2=professional; 3=occ_cat; 4=occ_cat; 5=occ_cat

ind (industry)

0=no industry; 1=primary; 2=manufacturing; 3=construction; 4=ind_cat; 5=ind_cat; 6=ind_cat; 7=ind_cat; 8=ind_cat

immigrant 0 is equal to ysmcat 0 (since it is capturing all non-immigrants)

occ 0 is equal to ind 0 (since it is capturing all those who are jobless)

Again, thank you for your time.


ballardw · Posted 10-31-2018 06:01 PM

One thing: the proper syntax is (ref='0').

You state

Category 0 for both variables are not present in the model.

If you specify a category that does not exist, the system falls back to the default value of REF=, which is LAST (the largest value).

TL93 · Posted 10-31-2018 09:13 PM

Thank you for your response, ballardw! Sorry, there was a typo in my original post. My code did use (ref='0'), and it works for all my other categorical variables (not listed here). Additionally, I ran frequency tables to make sure that the category does exist. For some reason these specific 4 variables are giving me trouble.

I have just made edits to my post after reading your response. Thanks again!

Reeza · Posted 10-31-2018 06:18 PM

I believe REF= needs to be the formatted value as well, i.e. ref='no industry', not the 0 value.

REF='level' | keyword
specifies the reference level for PARAM=EFFECT, PARAM=REFERENCE, and their orthogonalizations. For PARAM=GLM, the REF= option specifies a level of the classification variable to be put at the end of the list of levels. This level thus corresponds to the reference level in the usual interpretation of the linear estimates with a singular parameterization.

For an individual variable REF= option (but not for a global REF= option), you can specify the level of the variable to use as the reference level. Specify the formatted value of the variable if a format is assigned. For a global or individual variable REF= option, you can use one of the following keywords. The default is REF=LAST.

https://documentation.sas.com/?docsetId=statug&docsetVersion=14.3&docsetTarget=statug_phreg_syntax06...
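
To illustrate the documentation excerpt above, here is a minimal sketch of REF= with a formatted value. It only applies if a format is actually assigned to the variable; if none is assigned, ref='0' is the right form. The format name indfmt and the data set name work.cohort are hypothetical placeholders, and only the labels for levels 0 to 3 are taken from the thread.

```sas
/* Hypothetical format: labels for levels 0-3 come from the thread, */
/* the remaining levels are left unformatted on purpose.            */
proc format;
  value indfmt
    0 = 'no industry'
    1 = 'primary'
    2 = 'manufacturing'
    3 = 'construction';
run;

/* work.cohort is a placeholder data set name */
proc phreg data=work.cohort;
  format ind indfmt.;                               /* a format is assigned...   */
  class ind (ref='no industry') / param=reference;  /* ...so REF= uses its label */
  model duration*status_cancer(0) = ind / rl ties=efron;
run;
```

The "Class Level Information" table in the output shows which level was actually used as the reference.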

FreelanceReinh · Posted 10-31-2018 07:14 PM

Hi @TL93,

I think the reason (in both cases) is the linear dependence of the design variables (see table "Class Level Information" in the output): The last design variable can be expressed as a linear combination of the other design variables because the "0" categories of OCC and IND coincide (and analogously for IMMIGRANT and YSMCAT).

Example with two variables a, b, each with categories 0, 1, 2, a=0 <==> b=0, and design variables a1, a2, b1, b2 for a=1, a=2, etc.: b2=a1+a2-b1.
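
The identity in this example can be checked on a toy data set. The sketch below is an illustration only: the data set name toy and the variable check are made up here. Both a1+a2 and b1+b2 equal 1 exactly when the category is non-zero, and since a=0 <==> b=0 they are equal on every row, which gives b2=a1+a2-b1.

```sas
/* Enumerate every (a, b) combination that satisfies a=0 <==> b=0 */
data toy;
  do a = 0 to 2;
    do b = 0 to 2;
      if (a = 0) = (b = 0) then do;
        /* reference-coded design variables, reference level 0 */
        a1 = (a = 1);
        a2 = (a = 2);
        b1 = (b = 1);
        b2 = (b = 2);
        /* 1 if the linear dependence holds on this row */
        check = (b2 = a1 + a2 - b1);
        output;
      end;
    end;
  end;
run;

proc print data=toy noobs;
run;
```

The five admissible combinations all give check=1, so b2 carries no information beyond a1, a2 and b1.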

So, the effect of the last design variable is already "absorbed" by the preceding design variables. (You've "run out of degrees of freedom".) You can decide which of the two variables (e.g. OCC or IND) is affected by changing the order of the two in the MODEL statement (occ ind vs. ind occ).
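
One way to confirm that the "0" categories really coincide in the actual data is a cross-tabulation of each pair of variables. This is a sketch only; work.cohort is a placeholder for the real data set name.

```sas
/* work.cohort is a placeholder data set name */
proc freq data=work.cohort;
  /* counts only; MISSING also shows missing values as a level */
  tables immigrant*ysmcat occ*ind / norow nocol nopercent missing;
run;
```

If row 0 has observations only in column 0, and column 0 only in row 0, the reference levels of the two variables identify the same subjects and the linear dependence described above applies.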

TL93 · Posted 10-31-2018 09:21 PM

Thank you, FreelanceReinhard! That was very informative. Yes, I have a feeling this is more a statistical issue than a programming issue but I wanted to get everyone else's insight as well. When I change the order of immigrant and ysmcat so that ysmcat comes first, it is the immigrant variable that has issues. I have yet to switch the order for occ and ind.

I might try a couple other things (like what Reeza mentioned above) before I consider dropping immigrant and occ (or ind) from my models.

Take care!

Reeza · Posted 10-31-2018 09:43 PM

If you're referring to the fact that one level is always missing when dummy coding, that's the nature of dummy coding. It's also why you create N-1 dummy variables for a variable with N levels. To fit a full model, I don't think you can use reference coding, and you would then need to interpret the coefficients differently because your hypothesis is different. It gets asked pretty regularly on here, usually under Statistics. I think someone (Paige Miller or PGSTATs) has an example of how to get all estimates, but that's beyond me at the moment since it's my bedtime.

FreelanceReinh · Posted 11-01-2018 05:29 AM

@TL93 wrote:

I might try a couple other things (like what Reeza mentioned above) before I consider dropping immigrant and occ (or ind) from my models.

I don't think the (necessarily) missing estimate for one of several non-reference categories of OCC or IND is a reason for dropping either variable. The only redundant variable is IMMIGRANT because its value is completely determined by the value of YSMCAT. So, the decision would be to use either IMMIGRANT or YSMCAT with its refined categories in the model (provided they are significant).
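
As a sketch of that suggestion, the model from the original post could be refit with IMMIGRANT removed and YSMCAT kept. The data set name work.cohort is a placeholder, and the other predictors mentioned in the original post are omitted.

```sas
/* work.cohort is a placeholder data set name */
proc phreg data=work.cohort;
  /* immigrant dropped: its value is fully determined by ysmcat */
  class ysmcat (ref='0') ind (ref='0') occ (ref='0') / param=reference;
  model duration*status_cancer(0) = ysmcat ind occ / rl ties=efron;
run;
```

With this specification YSMCAT should get estimates for all of its non-reference categories. Because the "0" categories of OCC and IND still coincide, one category of whichever of the two is listed last in the MODEL statement would still be expected to show a missing estimate with 0 degrees of freedom.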
