TL93
Obsidian | Level 7

Hi SAS Community,

 

I am running proc phreg to predict the incidence of cancer, based on socio-demographic variables:

proc phreg data=<data_name>;
class immigrant (ref='0') ysmcat (ref='0') ind (ref='0') occ (ref='0') /param=reference;
model duration*status_cancer(0)=immigrant ysmcat ind occ /rl ties=efron;
run;

*note: I have quite a few more predictor variables but these four are the ones giving me trouble

 

ysmcat = years since migration; categories 0 to 6;

         0=non-immigrants (reference group); 1=0 years, 2=1-5 years, and so on.

occ = occupation; categories 0 to 5;

         0=no occupation (reference group); 1=occupation group 1, and so on.

 

My issue is highlighted in the following photo of my output:

 

[Screenshot: proc phreg issue.png]

 

I am getting periods (missing estimates) for the last category of the ysmcat and occ variables, and the degrees of freedom for those categories are 0. I'm assuming that SAS is treating these categories as the reference group, even though I specified category 0 as the reference group. Category 0 for both variables are not present in the model. When I run a frequency table, it shows that all categories are coded correctly, including category 0.

 

I thought this issue might be caused by the immigrant variable, for ysmcat. So I dropped immigrant from the regression and lo and behold, the ysmcat was fine. Same goes for occ when I drop ind from the model. I have a feeling it has something to do with multicollinearity, or the overlap in reference groups for these variables but I cannot explain it.

 

Can anyone help me explain what is going on and how to rectify the issue?

 

Thank you so much! More info on these four variables below:

 

immigrant

0=not an immigrant;    1=immigrant

 

ysmcat (years since migration cat)

0=not an immigrant;    1=0 years;    2=1-5 years;    3=6-10 years;    4=year_range;    5=year_range;    6=year_range

 

occ (occupation)

0=no occupation;    1=management;    2=professional;    3=occ_cat;    4=occ_cat;    5=occ_cat

 

ind (industry)

0=no industry;    1=primary;    2=manufacturing;    3=construction;    4=ind_cat;    5=ind_cat;    6=ind_cat;    7=ind_cat;    8=ind_cat

 

immigrant 0 is equal to ysmcat 0 (since it is capturing all non-immigrants)

occ 0 is equal to ind 0 (since it is capturing all those who are jobless)

 

Again, thank you for your time.

1 ACCEPTED SOLUTION

Accepted Solutions
FreelanceReinh
Jade | Level 19

Hi @TL93,

 

I think the reason (in both cases) is the linear dependence of the design variables (see table "Class Level Information" in the output): The last design variable can be expressed as a linear combination of the other design variables because the "0" categories of OCC and IND coincide (and analogously for IMMIGRANT and YSMCAT).

 

Example with two variables a, b, each with categories 0, 1, 2, a=0 <==> b=0, and design variables a1, a2, b1, b2 for a=1, a=2, etc.: b2=a1+a2-b1.
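
A short derivation of that identity, as a sketch (the indicator notation $\mathbf{1}[\cdot]$ is added here for illustration and is not from the original post). With reference coding, each design variable is an indicator of one non-reference level, so the design variables of a variable sum to the indicator of "not in the reference group":

```latex
\begin{align*}
a_1 + a_2 &= \mathbf{1}[a \neq 0], \qquad b_1 + b_2 = \mathbf{1}[b \neq 0] \\
a = 0 \iff b = 0 \;&\Longrightarrow\; \mathbf{1}[a \neq 0] = \mathbf{1}[b \neq 0] \\
&\Longrightarrow\; a_1 + a_2 = b_1 + b_2 \\
&\Longrightarrow\; b_2 = a_1 + a_2 - b_1
\end{align*}
```

The same argument applied to the model in the question would give ysmcat6 = immigrant1 - (ysmcat1 + ... + ysmcat5), and occ5 = (ind1 + ... + ind8) - (occ1 + ... + occ4), which matches the two categories reported with 0 degrees of freedom.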

 

So, the effect of the last design variable is already "absorbed" by the preceding design variables. (You've "run out of degrees of freedom".) You can decide which of the two variables (e.g.  OCC or IND) is affected by changing the order of the two in the MODEL statement (occ ind  vs. ind occ).
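
To try this on a small scale, here is a minimal sketch with simulated data. The data set name toy, the variables a, b, duration and status, the seed, and all probabilities are made up for illustration; only the structure (a=0 exactly when b=0) mirrors the situation described above.

```sas
/* Hypothetical toy data: a and b share the same "0" group (a=0 <=> b=0) */
data toy;
  call streaminit(2024);
  do id = 1 to 500;
    if rand('uniform') < 0.4 then do;
      a = 0;
      b = 0;
    end;
    else do;
      a = 1 + (rand('uniform') < 0.5);   /* level 1 or 2 */
      b = 1 + (rand('uniform') < 0.5);   /* level 1 or 2, drawn independently of a */
    end;
    duration = 10 * rand('exponential'); /* arbitrary follow-up time */
    status   = (rand('uniform') < 0.7);  /* 1 = event, 0 = censored */
    output;
  end;
run;

/* Order "a b": the last design variable of b should be the linearly dependent one */
proc phreg data=toy;
  class a (ref='0') b (ref='0') / param=ref;
  model duration*status(0) = a b / rl ties=efron;
run;

/* Order "b a": the dependence should now show up on the last design variable of a */
proc phreg data=toy;
  class a (ref='0') b (ref='0') / param=ref;
  model duration*status(0) = b a / rl ties=efron;
run;
```

Comparing the parameter estimate tables of the two runs should show the zero-degrees-of-freedom row moving from one variable to the other, as described above.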


7 REPLIES 7
ballardw
Super User

One thing: the proper syntax is (ref='0').

You state

Category 0 for both variables are not present in the model.

If you specify a category that does not exist, the system falls back to the default value of REF=, which is LAST, i.e. the largest value.

 

 

 


 

 

TL93
Obsidian | Level 7

Thank you for your response, ballardw! Sorry, there was a typo in my original post. My code did use (ref='0'), and it works for all my other categorical variables (not listed here). Additionally, I ran frequency tables to make sure that the category does exist. For some reason these specific 4 variables are giving me trouble.

 

I have just made edits to my post after reading your response. Thanks again!

Reeza
Super User

I believe REF= needs to be the formatted value as well, i.e. ref='no industry', not the 0 value.

REF='level' | keyword
specifies the reference level for PARAM=EFFECT, PARAM=REFERENCE, and their orthogonalizations. For PARAM=GLM, the REF= option specifies a level of the classification variable to be put at the end of the list of levels. This level thus corresponds to the reference level in the usual interpretation of the linear estimates with a singular parameterization.

For an individual variable REF= option (but not for a global REF= option), you can specify the level of the variable to use as the reference level. Specify the formatted value of the variable if a format is assigned. For a global or individual variable REF= option, you can use one of the following keywords. The default is REF=LAST.

https://documentation.sas.com/?docsetId=statug&docsetVersion=14.3&docsetTarget=statug_phreg_syntax06...
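
A minimal sketch of what specifying the formatted value could look like. This only applies if a format is actually assigned to the variable; the format name indfmt and the data set name mydata are hypothetical, and only the labels listed in the original post are included.

```sas
/* Hypothetical format for IND; levels 4-8 are left unformatted here */
proc format;
  value indfmt
    0 = 'no industry'
    1 = 'primary'
    2 = 'manufacturing'
    3 = 'construction';
run;

proc phreg data=mydata;
  format ind indfmt.;
  /* With a format assigned, REF= takes the formatted value, not the raw 0 */
  class ind (ref='no industry') / param=ref;
  model duration*status_cancer(0) = ind / rl ties=efron;
run;
```

If no format is assigned to the variable, ref='0' remains the value to use.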

TL93
Obsidian | Level 7

Thank you, FreelanceReinhard! That was very informative. Yes, I have a feeling this is more a statistical issue than a programming issue but I wanted to get everyone else's insight as well. When I change the order of immigrant and ysmcat so that ysmcat comes first, it is the immigrant variable that has issues. I have yet to switch the order for occ and ind.

 

I might try a couple other things (like what Reeza mentioned above) before I consider dropping immigrant and occ (or ind) from my models.

 

Take care!

Reeza
Super User
If you're referring to the fact that one level is always missing when dummy coding, that's the nature of dummy coding: you create N-1 dummy variables for a variable with N levels. To fit a full model, I don't think you can use reference coding, and then you need to interpret the coefficients differently and your hypothesis is different. This gets asked pretty regularly on here, usually under Statistics. I think someone (Paige Miller or PGStats) has an example of how to get all estimates, but that's beyond me at the moment since it's my bedtime.
FreelanceReinh
Jade | Level 19

@TL93 wrote:

 

I might try a couple other things (like what Reeza mentioned above) before I consider dropping immigrant and occ (or ind) from my models.

 


I don't think the (necessarily) missing estimate for one of several non-reference categories of OCC or IND is a reason for dropping either variable. The only redundant variable is IMMIGRANT because its value is completely determined by the value of YSMCAT. So, the decision would be to use either IMMIGRANT or YSMCAT with its refined categories in the model (provided they are significant).
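
One way to act on this, sketched with a placeholder data set name (mydata is hypothetical; the variable names are taken from the original post): first confirm with a cross-tabulation that IMMIGRANT is fully determined by YSMCAT, then refit without IMMIGRANT.

```sas
/* Every YSMCAT level should fall into exactly one IMMIGRANT level */
proc freq data=mydata;
  tables immigrant*ysmcat / norow nocol nopercent;
run;

/* Refit using YSMCAT only; IMMIGRANT carries no additional information */
proc phreg data=mydata;
  class ysmcat (ref='0') ind (ref='0') occ (ref='0') / param=ref;
  model duration*status_cancer(0) = ysmcat ind occ / rl ties=efron;
run;
```

Because the "0" categories of OCC and IND still coincide, one design variable of whichever of the two comes last in the MODEL statement would still be expected to have 0 degrees of freedom in this fit.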
