PROC LOGISTIC often overrides all ref= statements when used with a multinomial model. Can SAS output results or a dataset indicating which reference category was selected for the outcome (dependent) variable? The highest category is generally selected by default, but it would be nice to have output to confirm this, particularly if one wishes to change the reference category and confirm that it has changed. For example, can this be determined using the outputted OUTMODEL dataset? The model in question is a (partial) proportional odds cumulative logit model with three different categories (although this applies to any predictor with at least three categories). This is SAS version 9.4M7.
proc logistic data=datain order=internal outest=testout_est outmodel=testout_model;
class outcome(ref=first) catvar1(ref=first) catvar2(ref=last) / param=ref ref=first; * param=effect is default, which compares average effect across all values. param=ref compares to reference category;
model outcome(event='1') = catvar1 catvar2 / link=clogit unequalslopes=(catvar1);
run;
NOTE: The REF= option for the response variable is ignored.
NOTE: PROC LOGISTIC is fitting the cumulative logit model. The probabilities modeled are summed over the responses having the lower Ordered Values in the Response Profile table.
OUTMODEL output (testout_model). Some _MISC_ values have been changed slightly or replaced with xxx.
_TYPE_ | _NAME_ | _CATEGORY_ | _NAMEIDX_ | _CATIDX_ | _MISC_ |
L | 7 | ||||
M | NYYNYNNN | 7 | |||
G | outcome | outcome=0 | 0 | 0 | 10 |
G | outcome | outcome=1 | 0 | 1 | 10 |
G | outcome | outcome=2 | 0 | 2 | -10 |
G | outcome | -1 | 0 | 13 | |
G | outcome | -1 | 1 | 8 | |
G | outcome | -1 | 2 | 35 | |
G | outcome | -1 | -2 | -16 | |
G | catvar1 | 1 | 1 | 0 | -1 |
G | catvar1 | 2 | 1 | 1 | 1 |
G | catvar1 | -2 | -1 | 3 | |
G | catvar1 | -2 | -2 | -6 | |
G | catvar2 | 1 | 2 | 0 | 2 |
G | catvar2 | 2 | 2 | 1 | 2 |
G | catvar2 | 3 | 2 | 2 | -2 |
G | catvar2 | -3 | -1 | 3 | |
G | catvar2 | -3 | -2 | -11 | |
E | Intercept | E | 0 | 0 | xxx |
E | Intercept | E | 0 | 1 | xxx |
E | EFFECT | G | 0 | 0 | 1 |
E | EFFECT | X | 0 | 0 | 1 |
E | EFFECT | E | 0 | 0 | xxx |
E | EFFECT | E | 0 | 1 | xxx |
E | EFFECT | Q | 0 | 0 | 0 |
E | EFFECT | G | 1 | 0 | 3 |
E | EFFECT | X | 1 | 0 | 3 |
E | EFFECT | E | 1 | 0 | xxx |
E | EFFECT | E | 1 | 1 | xxx |
E | EFFECT | Q | 1 | 1 | 1 |
E | EFFECT | V | 0 | xxx | |
E | EFFECT | V | 1 | xxx | |
E | EFFECT | V | 2 | xxx | |
E | EFFECT | V | 3 | xxx | |
E | EFFECT | V | 4 | xxx | |
E | EFFECT | V | 5 | xxx | |
E | EFFECT | V | 6 | xxx | |
E | EFFECT | V | 7 | xxx | |
E | EFFECT | V | 8 | xxx | |
E | EFFECT | V | 9 | xxx | |
E | EFFECT | V | 10 | xxx | |
E | EFFECT | V | 11 | xxx | |
E | EFFECT | V | 12 | xxx | |
E | EFFECT | V | 13 | xxx | |
E | EFFECT | V | 14 | xxx | |
E | EFFECT | V | 15 | xxx | |
E | EFFECT | V | 16 | xxx | |
E | EFFECT | V | 17 | xxx | |
E | EFFECT | V | 18 | xxx | |
E | EFFECT | V | 19 | xxx | |
E | EFFECT | V | 20 | xxx | |
X | 52 | 27 | 232 | xxx |
First, you should never specify the response variable (OUTCOME) in the CLASS statement. Any options that you want to apply to the response levels should be specified in parentheses after the response variable in the MODEL statement. These are called the response variable options. You are fitting a cumulative logit model for an ordered response, so only the sorting and ordering of the response levels are relevant. Neither the EVENT= option, which applies only to a binary response, nor the REF= option is relevant here, and both are ignored. Since your response is ordinal, you should be concerned with whether the response levels are in proper ascending or descending order. The order being used is shown in the Response Profile table. For instance, if the response has levels High, Medium, and Low, you don't want the Response Profile table showing the response levels in the order Medium, Low, High. If the displayed order is not properly ascending or descending, you can use the ORDER= response variable option, or you can create a format for the response whose values will sort properly. If the levels are in proper descending order but you want to model probabilities of higher response levels, then also add the DESCENDING response variable option. See Response Level Ordering in the Details section of the LOGISTIC documentation and this note.
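As a rough sketch of the advice above, the OP's call could be rewritten with the response removed from the CLASS statement and the ordering options moved into the MODEL statement. The dataset and variable names (datain, outcome, catvar1, catvar2) are taken from the question; resp_profile is a hypothetical dataset name, and whether DESCENDING is wanted depends on which direction you want to model. Capturing the Response Profile table through ODS gives a dataset that confirms the order actually used.

```sas
/* Save the Response Profile table so the level ordering can be checked */
/* (resp_profile is a hypothetical dataset name)                        */
ods output ResponseProfile=resp_profile;

proc logistic data=datain;
  /* Predictors only: the response does not belong in CLASS */
  class catvar1(ref=first) catvar2(ref=last) / param=ref;
  /* Response variable options go in parentheses after the response. */
  /* DESCENDING reverses the sort order of the response levels.      */
  model outcome(order=internal descending) = catvar1 catvar2
        / link=clogit unequalslopes=(catvar1);
run;

/* One row per response level, in the Ordered Value sequence used by the fit */
proc print data=resp_profile noobs;
run;
```

For a cumulative logit model there is no reference level as such; the printed table shows the ordering, and the log NOTE states that probabilities are summed over the lower Ordered Values.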
ods output ClassLevelInfo= ClassLevelInfo;
proc logistic data=sashelp.heart;
class bp_status sex;
model status=bp_status sex weight height;
run;
Apologies; I should have clarified that I am referring to the reference category of the outcome variable, not the predictor variables. I have edited this into the OP.
This was intended for a multinomial outcome (i.e. not binary). For example:
proc logistic data=sashelp.heart;
class bp_status sex;
model Chol_Status = bp_status sex weight height;
output out=want p=pred;
run;
The output dataset shows predicted values for each observation for the Borderline and Desirable response levels only. That makes me think that the remaining category, High, is the reference category.
In the output, the reference category will not have a parameter estimate. In this case, "Acura" does not have a parameter estimate. It will also show up in the Odds Ratio table, where all the non-reference levels are compared to the reference level. It also shows up in the Class Level Information output (I leave it as a homework assignment for you to look at the table and determine the reference level).
proc logistic data=sashelp.cars(obs=100);
class make(ref='Acura');
model origin=enginesize weight make;
run;
The only way I know of specifying the reference level for the response variable is to shift to fitting a generalized logit model to the multinomial distribution. If you go that way, you can specify any particular level of the response variable as the reference using the ref=' ' method. The NOTE: about the REF= option being ignored for the response variable is caused by the link chosen. As I mentioned in the first sentence, you have to specify LINK=GLOGIT for the reference level to be applied. There are other PROCs that operate similarly (HPGENSELECT and GLIMMIX, for example).
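A minimal sketch of the generalized logit approach, using the sashelp.heart example from earlier in the thread. The choice of 'High' as the reference and the dataset names glogit_rp and glogit_pe are assumptions made for illustration only.

```sas
/* Capture the response ordering and the estimates as datasets */
/* (glogit_rp and glogit_pe are hypothetical names)            */
ods output ResponseProfile=glogit_rp
           ParameterEstimates=glogit_pe;

proc logistic data=sashelp.heart;
  class bp_status sex / param=ref;
  /* REF= on the response is honored only with LINK=GLOGIT */
  model Chol_Status(ref='High') = bp_status sex weight height
        / link=glogit;
run;

/* With GLOGIT the estimates carry a Response column listing the     */
/* non-reference levels; the level that never appears there is the   */
/* reference.                                                        */
proc print data=glogit_pe noobs;
run;
```

As far as I recall, the log and the text under the Response Profile table also state which response level the logits use as the reference, so that is another place to confirm the change took effect.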
Just thought of another way, but it requires formatting the levels of the response variable. Just set up your format so that the level you want as the reference is either LAST (default) or FIRST (needs a REF=FIRST in either the MODEL or CLASS statement).
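Something like the following could do it for the sashelp.heart example, assuming you want Desirable as the reference. The format name $cholref, the numbered labels, and the dataset name rp_fmt are all made up for illustration; the only point is that the labels sort so the desired level comes last.

```sas
/* Labels are prefixed so that Desirable sorts last in formatted order */
proc format;
  value $cholref
    'High'       = '1 High'
    'Borderline' = '2 Borderline'
    'Desirable'  = '3 Desirable';
run;

ods output ResponseProfile=rp_fmt;  /* hypothetical dataset name */

proc logistic data=sashelp.heart;
  format Chol_Status $cholref.;
  class bp_status sex / param=ref;
  /* ORDER=FORMATTED sorts response levels by their formatted values; */
  /* with GLOGIT the last ordered level is the default reference.     */
  model Chol_Status(order=formatted) = bp_status sex weight height
        / link=glogit;
run;
```

The Response Profile table (saved here as rp_fmt) should then list '3 Desirable' as the last Ordered Value.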
SteveDenham
Since your Y variable Chol_Status is a multinomial variable with three levels, you can fit TWO logistic models separately:
1)
where Chol_Status in ('Borderline' 'High');
model Chol_Status (event='Borderline')=
2)
where Chol_Status in ('Desirable' 'High');
model Chol_Status (event='Desirable')=
For GLIMMIX and HPGENSELECT, you must specify the response variable in the CLASS statement if you are fitting a generalized logit link to a multinomial distribution. For LOGISTIC, it is as @StatDave says - don't put the response variable in the CLASS statement. I don't know why it isn't consistent.
SteveDenham