chapidi99
Fluorite | Level 6

Hi,

I am new to SAS and am implementing logistic regression. I would like to know what a reference category is in logistic regression and how it is useful. I have a categorical variable called "Level of pain" with the values no pain, less pain, medium, high and extreme, and I have created dummy variables from the categories. Which of the dummy variables needs to be given as the reference category? And what options do I need to specify in PROC LOGISTIC to choose the best reference category?

 

Many thanks for the help!!

1 ACCEPTED SOLUTION

Accepted Solutions
StatDave
SAS Super FREQ

A predictor variable is declared as categorical by including it in the CLASS statement, which then creates dummy variables for its levels. If the variable is not in the CLASS statement, it is treated as continuous and is used directly as a column in the design matrix. You can see in your results that all variables have multiple coefficients, one per dummy variable, when you used the CLASS statement, but only a single coefficient when the CLASS statement is not used. P-values aren't even comparable between these two ways of treating the predictors. What doesn't make sense is that the coefficient associated with the last (reference) level of each CLASS predictor is not zero, which it should be whenever PARAM=GLM is used. Maybe that second set of output wasn't really generated by the code that you show, or there were error or warning messages in the log. There was probably a message about "separation", since some of the standard errors are very large, which is typical when the data are too sparse and cause the separation condition.
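One way to look for the sparseness mentioned above is to cross-tabulate each CLASS predictor against the response before fitting the model. The sketch below reuses the data set and variable names from the code posted in this thread (Work.Dataset, GCPS_Binry, Age, Breath); which predictors to check is an assumption, and the list can be extended as needed.

```sas
/* Cross-tabulate CLASS predictors against the binary response.      */
/* A level with a zero (or very small) count in one response column  */
/* is a typical source of separation and very large standard errors. */
proc freq data=Work.Dataset;
   tables (Age Breath)*GCPS_Binry / norow nocol nopercent;
run;
```

It is also worth reading the log of the PROC LOGISTIC run itself, since any warnings about separation or convergence are reported there.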


13 REPLIES
Reeza
Super User

AFAIK, there is no such thing as a 'best reference category', and you don't need to create dummy variables for logistic regression in SAS; it does that automatically.

 

Have you worked through the examples in the PROC LOGISTIC documentation? It includes full code and I believe the second example is about categorical variables. The documentation uses the GLM method of parameterization for categorical variables but the usual desired option is the REF method.

 

Documentation examples

https://documentation.sas.com/?cdcId=pgmsascdc&cdcVersion=9.4_3.4&docsetId=statug&docsetTarget=statu...

 

An example on the REF option is here:
https://stats.idre.ucla.edu/sas/dae/logit-regression/

 

The different types of parameterization methods are outlined here, but not all are available in every procedure:

https://documentation.sas.com/?cdcId=pgmsascdc&cdcVersion=9.4_3.4&docsetId=statug&docsetTarget=statu...

 

That should be enough to get you started, feel free to post any further questions.

 



Chapi
Obsidian | Level 7

Thank you Reeza for pointing me to the examples in the PROC LOGISTIC documentation. I understand now that I don't need to create the dummy variables separately. I have run PROC LOGISTIC both ways, with and without a reference category, and I don't understand why the Pr > ChiSq values differ so drastically between the two. WITHOUT a reference category, all the variables had Pr > ChiSq values below 0.05, but WITH a reference category, the categories within the variables are not below 0.05. I don't understand why.

StatDave
SAS Super FREQ

I assume that your Level of Pain variable is a predictor in the model rather than the response variable. In that case, you do not need to create dummy variables because that is what the CLASS statement does for you. It also allows you to pick the reference category with the REF= option. For example, if your original variable is called LevelOfPain with values 1, 2, 3, 4, or 5, and you want to use level 1 as the reference level, then specify

class LevelOfPain(ref="1") / param=glm;

Then include LevelOfPain in your MODEL statement. There is no "best" reference category. The choice is arbitrary and is made for convenience of interpretation. The above CLASS statement will create the conventional 0,1-coded dummy variables with level 1 as the reference level (all dummies equal 0). The parameter estimates will be interpreted as the difference in effect of each level compared to the reference level, 1.
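Put together, the complete step might look like the sketch below. Only LevelOfPain and the CLASS statement come from the post above; the data set work.pain, the response Outcome, and the simulated values are hypothetical and exist purely so the step can be run as is.

```sas
/* Simulated toy data: work.pain and Outcome are hypothetical names */
data work.pain;
   call streaminit(2024);
   do id = 1 to 500;
      LevelOfPain = ceil(5 * rand("uniform"));          /* levels 1-5 */
      /* Event probability rises with the level (arbitrary choice) */
      p = 1 / (1 + exp(-(-1.5 + 0.4 * LevelOfPain)));
      Outcome = rand("bernoulli", p);
      output;
   end;
   drop id p;
run;

proc logistic data=work.pain;
   /* Level 1 is the reference; PARAM=GLM gives 0/1 dummy coding */
   class LevelOfPain(ref="1") / param=glm;
   /* EVENT="1" models the probability that Outcome equals 1 */
   model Outcome(event="1") = LevelOfPain;
run;
```

In the parameter estimates table, the row for level 1 should show an estimate of 0, and each remaining row is the difference in log odds between that level and level 1.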

Chapi
Obsidian | Level 7

Thank you Dave. I applied the statement and it gave the desired results, but the Pr > ChiSq values increased drastically. I don't understand whether I should still use those variables or not. Could you please explain?

PaigeMiller
Diamond | Level 26

@Chapi wrote:

Thank you Dave. I applied the statement and it gave the desired results, but the Pr > ChiSq values increased drastically. I don't understand whether I should still use those variables or not. Could you please explain?


It is not clear what you changed when the Pr > ChiSq values changed. Could you show us the code and the corresponding output before and after?

 

Could you also please clarify if this reference category you want is for an independent variable or for the dependent variable?

--
Paige Miller
Chapi
Obsidian | Level 7

 

Hello,

 

proc logistic data=Work.Dataset desc plots(only)=roc;
   class Age Breath Blood Water Heart Stomach Heavey other UBEL water2
         Eyesight Dialysis hearing hearingdevice glasses water3
         Psychiatri pregnancy / param=glm;
   model GCPS_Binry = Age Breath Blood Water Heart Stomach Heavey other UBEL
                      water2 Eyesight Dialysis hearing hearingdevice glasses
                      water3 Psychiatri pregnancy
         / selection=stepwise;
   output out=out3 p=pred1;
run;

 

Previous results, without the CLASS statement:

Analysis of Maximum Likelihood Estimates

Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept    1    -0.9657           0.1175           67.5333       <.0001
Age          1     0.3497           0.1139            9.4265       0.0021
Breath       1     0.2859           0.1198            5.6935       0.017
Blood        1     0.2656           0.1151            5.326        0.021
Water        1     0.2992           0.1099            7.4155       0.0065
Heart        1     0.2311           0.1054            4.8034       0.0284
Stomach      1     0.2595           0.1132            5.2547       0.0219
Water3       1     0.594            0.1425           17.3817       <.0001
Glasses      1     0.2656           0.1062            6.2524       0.0124
UBEL         1     0.4403           0.1086           16.4331       <.0001
Eyesight     1     0.3095           0.115             7.2445       0.0071

 

Latest results, with the CLASS statement used for reference categories. As you can see, Pr > ChiSq is greater than 0.05 for some of the categories.

Analysis of Maximum Likelihood Estimates

Parameter   Level         DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept                  1     5.9508           1.6798           12.5494       0.0004
Breath      -0.45075276    1     9.599          180.8               0.0028       0.9577
Breath      1.91797094     1    10.5818         180.8               0.0034       0.9533
Age         -0.59358016    1   -11.4667         180.8               0.004        0.9494
Age         1.39111034     1   -10.8176         180.8               0.0036       0.9523
Age         -0.76335864    1    -0.348            1.2791            0.074        0.7856
Age         1.01428795     1     0.2879           1.2865            0.0501       0.8229
Age         -0.70972869    1    -1.1224           0.6951            2.6078       0.1063
Age         1.20129764     1    -0.4857           0.7085            0.4699       0.493
Breath      -0.45413734    1    -2.4323           1.018             5.7091       0.0169
Breath      1.29755151     1    -2.025            1.0334            3.8399       0.05
Water       -2.46127958    1    -2.1778           0.5524           15.5422       <.0001
Water       -0.87911213    1    -0.4818           0.3552            1.8396       0.175
Water       -0.67641088    1    -0.9268           0.6435            2.0743       0.1498
Water       0.39799746     1     0.00767          0.3008            0.0007       0.9797
Glasses     -0.69546905    1    -1.7676           0.8233            4.6099       0.0318
Glasses     0.86088881     1    -1.5396           1.0549            2.1302       0.1444
Eyesight    1.27360571     1    -1.006            0.8363            1.4469       0.229
Eyesight    -1.04615533    1    -0.6488           0.237             7.4918       0.0062
Chapi
Obsidian | Level 7
Hello, please see above the code and the results before and after using the CLASS statement in PROC LOGISTIC.
PaigeMiller
Diamond | Level 26

I'm really having a lot of trouble understanding the problem. You start by talking about "level of pain" as a variable, but I don't see it in your code. And it's still not clear to me whether the "level of pain" variable is the dependent variable or an independent variable. Could you please clarify this?

 

As far as your p-values, only the categorical variables go in the CLASS statement. The continuous variables do not go in the CLASS statement.
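As a sketch of that rule, the step below lists only the categorical predictor in the CLASS statement. Work.Dataset, GCPS_Binry and glasses are taken from the code posted earlier in the thread; AgeYears is a hypothetical continuous variable that does not appear in the original post.

```sas
proc logistic data=Work.Dataset descending;
   /* Categorical predictor: one dummy per non-reference level */
   class glasses / param=ref ref=first;
   /* AgeYears (hypothetical, continuous) is not in CLASS,     */
   /* so it enters the model with a single slope coefficient   */
   model GCPS_Binry = AgeYears glasses;
run;
```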

--
Paige Miller
Chapi
Obsidian | Level 7

Sorry for the confusion. All independent variables relate to pain in a specific part of the body, and the dependent variable is pain / no pain for the patient. All variables included in the CLASS statement are categorical. For example, Age is categorised by applying a WOE transformation.

 

PaigeMiller
Diamond | Level 26

@Chapi wrote:

Sorry for the confusion. All independent variables relate to pain in a specific part of the body, and the dependent variable is pain / no pain for the patient. All variables included in the CLASS statement are categorical. For example, Age is categorised by applying a WOE transformation.


So is your original question about the reference category referring to the independent variables or the dependent variable (or both)?

 

Your p-values are not comparable across the two different models. Once you switch to categorizing Age (and other variables) by WOE, you can't expect the same answers as when Age was used as a continuous variable; they may not even be close.

--
Paige Miller
Chapi
Obsidian | Level 7

My question was about referencing independent variables.

Both outputs were generated after applying the WOE transformation, with Age as a categorical variable in both models. The only difference is that the latest output uses a reference category and the previous one does not.

 

I have a question about the p-values: should we judge significance for the variable as a whole rather than for the individual categories of the variable?

PaigeMiller
Diamond | Level 26

I have a question about the p-values: should we judge significance for the variable as a whole rather than for the individual categories of the variable?

PROC LOGISTIC produces a coefficient for each level of the CLASS variable (where one level should have a zero coefficient). Each coefficient is tested to see if it is zero, and a p-value is reported. PROC LOGISTIC also produces a Type III test, which tests whether the coefficients are equal across all levels of the CLASS variable. This is a different test from the one you show, with a different meaning and different p-values.

 

So you might want to look at both the Type III test and the tests of the individual coefficients, and interpret them together.
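A reduced version of the posted model, kept to two CLASS predictors so the two kinds of test are easy to compare, might look like this. The data set and variable names are taken from the code earlier in the thread; dropping the stepwise selection and the other predictors is a simplification made only for illustration.

```sas
proc logistic data=Work.Dataset descending;
   class Age Water / param=glm;
   model GCPS_Binry = Age Water;
run;

/* Tables to compare in the output:                                  */
/*   "Type 3 Analysis of Effects": one joint test per CLASS variable */
/*   "Analysis of Maximum Likelihood Estimates": one Wald test per   */
/*   level, each comparing that level with the reference level       */
```

A variable can be significant in the joint test even when several of its individual level-versus-reference comparisons are not, which is why the two tables are best read together.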

--
Paige Miller
