Hi SAS Community,
I am running proc phreg to predict the incidence of cancer, based on socio-demographic variables:
proc phreg data=<data_name>;
class immigrant (ref='0') ysmcat (ref='0') ind (ref='0') occ (ref='0') /param=reference;
model duration*status_cancer(0)=immigrant ysmcat ind occ /rl ties=efron;
run;
*note: I have quite a few more predictor variables but these four are the ones giving me trouble
ysmcat = years since migration; categories 0 to 6;
0=non-immigrants (reference group); 1=0 years, 2=1-5 years, and so on.
occ = occupation; categories 0 to 5;
0=no occupation (reference group); 1=occupation group 1, and so on.
My issue is highlighted in the following photo of my output:
I am getting periods for the last category of the ysmcat and occ variables. Furthermore, the degrees of freedom are 0; I'm assuming that SAS is treating these categories as the reference group, even though I specified category 0 as the reference group. Category 0 for both variables are not present in the model. When I run a frequency table, it shows that all categories are coded correctly, even the 0 category.
I thought this issue might be caused by the immigrant variable, for ysmcat. So I dropped immigrant from the regression and lo and behold, the ysmcat was fine. Same goes for occ when I drop ind from the model. I have a feeling it has something to do with multicollinearity, or the overlap in reference groups for these variables but I cannot explain it.
Can anyone help me explain what is going on and how to rectify the issue?
Thank you so much! More info on these four variables below:
immigrant
0=not an immigrant; 1=immigrant
ysmcat (years since migration cat)
0=not an immigrant; 1=0 years; 2=1-5 years; 3=6-10 years; 4=year_range; 5=year_range; 6=year_range
occ (occupation)
0=no occupation; 1=management; 2=professional; 3=occ_cat; 4=occ_cat; 5=occ_cat
ind (industry)
0=no industry; 1=primary; 2=manufacturing; 3=construction; 4=ind_cat; 5=ind_cat; 6=ind_cat; 7=ind_cat; 8=ind_cat
immigrant 0 is equal to ysmcat 0 (since it is capturing all non-immigrants)
occ 0 is equal to ind 0 (since it is capturing all those who are jobless)
Again, thank you for your time.
Hi @TL93,
I think the reason (in both cases) is the linear dependence of the design variables (see table "Class Level Information" in the output): The last design variable can be expressed as a linear combination of the other design variables because the "0" categories of OCC and IND coincide (and analogously for IMMIGRANT and YSMCAT).
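Example: take two variables a and b, each with categories 0, 1, 2, where a=0 <==> b=0, and let a1, a2, b1, b2 be the design variables for a=1, a=2, b=1, b=2. Then a1+a2 and b1+b2 are both 1 exactly when the observation is outside the shared "0" category, so a1+a2=b1+b2 and hence b2=a1+a2-b1.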
So, the effect of the last design variable is already "absorbed" by the preceding design variables. (You've "run out of degrees of freedom".) You can decide which of the two variables (e.g. OCC or IND) is affected by changing the order of the two in the MODEL statement (occ ind vs. ind occ).
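To see the dependence concretely, here is a minimal sketch of the toy example above. The data set name toy and the variable check are made up for illustration; a1, a2, b1, b2 are built by hand the way PARAM=REFERENCE with ref='0' would code them. The second step cross-tabulates the real variables (same <data_name> placeholder as in the original post) to confirm that the "0" categories coincide.

```sas
/* Toy data (hypothetical): a and b each have levels 0, 1, 2,
   and a=0 exactly when b=0 */
data toy;
   input a b @@;
   /* hand-built reference-cell design variables (reference level 0) */
   a1 = (a = 1);  a2 = (a = 2);
   b1 = (b = 1);  b2 = (b = 2);
   /* check is 0 on every row if b2 = a1 + a2 - b1 holds */
   check = b2 - (a1 + a2 - b1);
   datalines;
0 0  0 0  1 1  1 2  2 1  2 2
;
run;

proc print data=toy noobs;
run;

/* On the real data: if the 0 categories coincide, the row for
   immigrant=0 has counts only in the ysmcat=0 column, and likewise
   for occ=0 and ind=0. Replace <data_name> with the actual data set. */
proc freq data=<data_name>;
   tables immigrant*ysmcat occ*ind / norow nocol nopercent;
run;
```

If check is 0 on every row, b2 carries no information beyond a1, a2 and b1, which is why one estimate comes out missing with 0 degrees of freedom.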
One thing: the proper syntax is (ref='0').
You state
Category 0 for both variables are not present in the model.
Since you specify a category that does not exist, the system falls back to the default value of REF=, which is LAST, i.e. the largest value.
Thank you for your response, ballardw! Sorry, there was a typo in my original post. My code did use (ref='0'), and it works for all my other categorical variables (not listed here). Additionally, I ran frequency tables to make sure that the category does exist. For some reason these specific 4 variables are giving me trouble.
I have just made edits to my post after reading your response. Thanks again!
I believe REF= needs to be the formatted value as well, i.e. ref='no industry', not the 0 value.
REF='level' | keyword
specifies the reference level for PARAM=EFFECT, PARAM=REFERENCE, and their orthogonalizations. For PARAM=GLM, the REF= option specifies a level of the classification variable to be put at the end of the list of levels. This level thus corresponds to the reference level in the usual interpretation of the linear estimates with a singular parameterization.
For an individual variable REF= option (but not for a global REF= option), you can specify the level of the variable to use as the reference level. Specify the formatted value of the variable if a format is assigned. For a global or individual variable REF= option, you can use one of the following keywords. The default is REF=LAST.
https://documentation.sas.com/?docsetId=statug&docsetVersion=14.3&docsetTarget=statug_phreg_syntax06...
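As a rough sketch of what the documentation describes: this only matters if a format is actually assigned to the variable. The format name indfmt below is hypothetical, and the labels are the first three IND categories from the original post; the remaining levels are left unformatted. The model is cut down to IND alone to keep the example short.

```sas
/* Hypothetical format for IND; only levels 0-2 are labelled here */
proc format;
   value indfmt 0 = 'no industry'
                1 = 'primary'
                2 = 'manufacturing';
run;

/* With a format attached, REF= takes the formatted value,
   not the underlying code. Replace <data_name> with the actual data set. */
proc phreg data=<data_name>;
   format ind indfmt.;
   class ind (ref='no industry') / param=reference;
   model duration*status_cancer(0) = ind / rl ties=efron;
run;
```

Without a FORMAT statement or a permanently assigned format, ref='0' remains the right way to name the level.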
Thank you, FreelanceReinhard! That was very informative. Yes, I have a feeling this is more a statistical issue than a programming issue but I wanted to get everyone else's insight as well. When I change the order of immigrant and ysmcat so that ysmcat comes first, it is the immigrant variable that has issues. I have yet to switch the order for occ and ind.
I might try a couple other things (like what Reeza mentioned above) before I consider dropping immigrant and occ (or ind) from my models.
Take care!
@TL93 wrote:
I might try a couple other things (like what Reeza mentioned above) before I consider dropping immigrant and occ (or ind) from my models.
I don't think the (necessarily) missing estimate for one of several non-reference categories of OCC or IND is a reason for dropping either variable. The only redundant variable is IMMIGRANT because its value is completely determined by the value of YSMCAT. So, the decision would be to use either IMMIGRANT or YSMCAT with its refined categories in the model (provided they are significant).
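A minimal sketch of that choice, assuming you keep YSMCAT and drop IMMIGRANT (everything else is taken from the original call, including the <data_name> placeholder):

```sas
/* IMMIGRANT removed: its value is fully determined by YSMCAT.
   OCC and IND both stay in the model. */
proc phreg data=<data_name>;
   class ysmcat (ref='0') ind (ref='0') occ (ref='0') / param=reference;
   model duration*status_cancer(0) = ysmcat ind occ / rl ties=efron;
run;
```

With IND listed before OCC, the missing estimate (0 degrees of freedom) should still show up on one OCC category, since the "0" categories of the two variables coincide; swapping the order in the MODEL statement moves it to IND.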