Hi,
I am new to SAS and am implementing logistic regression. What is a reference category in logistic regression, and how is it useful? I have a categorical variable called "Level of pain" with the categories no pain, less pain, medium, high, and extreme. I have created dummy variables from the categories. Which of the dummy variables should be given as the reference category? And which options do I need to specify in PROC LOGISTIC to choose the best reference category?
Many thanks for the help!!
AFAIK, there is no such thing as a 'best reference category', and you don't need to create dummy variables for logistic regression in SAS; it creates them automatically.
Have you worked through the examples in the PROC LOGISTIC documentation? It includes full code and I believe the second example is about categorical variables. The documentation uses the GLM method of parameterization for categorical variables but the usual desired option is the REF method.
Documentation examples
An example on the REF option is here:
https://stats.idre.ucla.edu/sas/dae/logit-regression/
The different types of parameterization methods are outlined here, but not all are available in every procedure:
That should be enough to get you started. Feel free to post any further questions.
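To see the dummy variables that the CLASS statement builds, one option is to write out the design matrix without fitting the model. This is a minimal sketch: the data set WORK.PAIN, the response Outcome, and the predictor LevelOfPain are hypothetical names, not taken from the thread.

```sas
/* Sketch only: WORK.PAIN, Outcome, and LevelOfPain are hypothetical names. */
proc logistic data=work.pain outdesign=work.design outdesignonly;
   class LevelOfPain / param=ref;   /* reference (0/1 dummy) coding */
   model Outcome = LevelOfPain;     /* OUTDESIGNONLY skips the model fit */
run;

/* Inspect the first rows of the generated design columns */
proc print data=work.design(obs=10);
run;
```

With PARAM=REF, the reference level has no column of its own, so a five-level variable should appear as four dummy columns.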
Thank you, Rezza, for pointing me to the examples in the PROC LOGISTIC documentation. I understand now that I don't need to create the dummy variables separately. I have run PROC LOGISTIC both ways, with and without a reference category, and I don't understand why the Pr > ChiSq values differ so drastically between the two. WITHOUT a reference category, all the variables had Pr > ChiSq values below 0.05, but WITH a reference category, the categories within the variables are not below 0.05. I don't understand why.
I assume that your Level of Pain variable is a predictor in the model rather than the response variable. In that case, you do not need to create dummy variables because that is what the CLASS statement does for you. It also allows you to pick the reference category with the REF= option. For example, if your original variable is called LevelOfPain with values 1, 2, 3, 4, or 5, and you want to use level 1 as the reference level, then specify
class LevelOfPain(ref="1") / param=glm;
Then include LevelOfPain in your MODEL statement. There is no "best" reference category. The choice is arbitrary and is made for convenience of interpretation. The above CLASS statement will create the conventional 0,1-coded dummy variables with level 1 as the reference level (all dummies equal 0). The parameter estimates will be interpreted as the difference in effect of each level compared to the reference level, 1.
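Put together, the CLASS statement above might sit in a complete step like the following. This is a sketch under assumptions: the data set WORK.PAIN and the binary response Outcome (coded 0/1) are hypothetical names used only for illustration.

```sas
/* Sketch only: WORK.PAIN and Outcome are hypothetical names. */
proc logistic data=work.pain;
   /* Level 1 is the reference; each estimate compares a level with level 1 */
   class LevelOfPain(ref="1") / param=glm;
   /* EVENT= names the response level whose probability is modeled */
   model Outcome(event="1") = LevelOfPain;
run;
```

Changing ref="1" to another level changes which comparisons the estimates express, not the fit of the model itself.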
Thank you, Dave. I applied the statement below and it gave the desired results, but the Pr > ChiSq values increased drastically. I don't understand whether I still need to use those variables or not. Could you please explain?
@Chapi wrote:
Thank you, Dave. I applied the statement below and it gave the desired results, but the Pr > ChiSq values increased drastically. I don't understand whether I still need to use those variables or not. Could you please explain?
It is not clear what you changed when the Pr > ChiSq values changed. Could you show us the code before and after, plus the corresponding outputs?
Could you also please clarify if this reference category you want is for an independent variable or for the dependent variable?
Paige Miller
Hello,
proc logistic data=Work.Dataset desc plots(only)=roc;
   /* Every predictor is declared categorical; PARAM=GLM creates one 0/1 dummy per level */
   class Age Breath Blood Water Heart Stomach Heavey other UBEL water2 Eyesight Dialysis hearing hearingdevice glasses water3 Psychiatri pregnancy
         / param=glm;
   /* Stepwise selection over the same predictors */
   model GCPS_Binry = Age Breath Blood Water Heart Stomach Heavey other UBEL water2 Eyesight Dialysis hearing hearingdevice glasses water3 Psychiatri pregnancy
         / selection=stepwise;
   /* Save the predicted probabilities in OUT3 */
   output out=out3 p=pred1;
run;
Previous results, when the CLASS statement was not used:
Analysis of Maximum Likelihood Estimates
Parameter | DF | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq |
Intercept | 1 | -0.9657 | 0.1175 | 67.5333 | <.0001 |
Age | 1 | 0.3497 | 0.1139 | 9.4265 | 0.0021 |
Breath | 1 | 0.2859 | 0.1198 | 5.6935 | 0.017 |
Blood | 1 | 0.2656 | 0.1151 | 5.326 | 0.021 |
Water | 1 | 0.2992 | 0.1099 | 7.4155 | 0.0065 |
Heart | 1 | 0.2311 | 0.1054 | 4.8034 | 0.0284 |
Stomach | 1 | 0.2595 | 0.1132 | 5.2547 | 0.0219 |
Water3 | 1 | 0.594 | 0.1425 | 17.3817 | <.0001 |
Glasses | 1 | 0.2656 | 0.1062 | 6.2524 | 0.0124 |
UBEL | 1 | 0.4403 | 0.1086 | 16.4331 | <.0001 |
Eyesight | 1 | 0.3095 | 0.115 | 7.2445 | 0.0071 |
Latest results, when the CLASS statement was used for reference categories. As you can see, Pr > ChiSq is greater than 0.05 for some of the categories.
Analysis of Maximum Likelihood Estimates
Parameter | | DF | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq |
Intercept | | 1 | 5.9508 | 1.6798 | 12.5494 | 0.0004 |
Breath | -0.45075276 | 1 | 9.599 | 180.8 | 0.0028 | 0.9577 |
Breath | 1.91797094 | 1 | 10.5818 | 180.8 | 0.0034 | 0.9533 |
Age | -0.59358016 | 1 | -11.4667 | 180.8 | 0.004 | 0.9494 |
Age | 1.39111034 | 1 | -10.8176 | 180.8 | 0.0036 | 0.9523 |
Age | -0.76335864 | 1 | -0.348 | 1.2791 | 0.074 | 0.7856 |
Age | 1.01428795 | 1 | 0.2879 | 1.2865 | 0.0501 | 0.8229 |
Age | -0.70972869 | 1 | -1.1224 | 0.6951 | 2.6078 | 0.1063 |
Age | 1.20129764 | 1 | -0.4857 | 0.7085 | 0.4699 | 0.493 |
Breath | -0.45413734 | 1 | -2.4323 | 1.018 | 5.7091 | 0.0169 |
Breath | 1.29755151 | 1 | -2.025 | 1.0334 | 3.8399 | 0.05 |
Water | -2.46127958 | 1 | -2.1778 | 0.5524 | 15.5422 | <.0001 |
Water | -0.87911213 | 1 | -0.4818 | 0.3552 | 1.8396 | 0.175 |
Water | -0.67641088 | 1 | -0.9268 | 0.6435 | 2.0743 | 0.1498 |
Water | 0.39799746 | 1 | 0.00767 | 0.3008 | 0.0007 | 0.9797 |
Glasses | -0.69546905 | 1 | -1.7676 | 0.8233 | 4.6099 | 0.0318 |
Glasses | 0.86088881 | 1 | -1.5396 | 1.0549 | 2.1302 | 0.1444 |
Eyesight | 1.27360571 | 1 | -1.006 | 0.8363 | 1.4469 | 0.229 |
Eyesight | -1.04615533 | 1 | -0.6488 | 0.237 | 7.4918 | 0.0062 |
I'm really having a lot of trouble understanding the problem. You start by talking about "level of pain" as a variable, but I don't see it in your code. And it's still not clear to me whether the "level of pain" variable is the dependent variable or an independent variable. Could you please clarify this?
As for your p-values, only the categorical variables go in the CLASS statement. The continuous variables do not go in the CLASS statement.
Paige Miller
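As an illustration of the point about the CLASS statement, here is a sketch in which one predictor is kept out of CLASS. It assumes, purely for the example, that Age is a continuous variable while Breath and Blood are categorical; the data set and response names are taken from the code posted above.

```sas
/* Sketch only: treating Age as continuous is an assumption for illustration. */
proc logistic data=Work.Dataset descending;
   class Breath Blood / param=ref;        /* categorical: dummies per non-reference level */
   model GCPS_Binry = Age Breath Blood;   /* Age is not in CLASS: a single slope */
run;
```

In the parameter estimates, Age would then get one row, while each CLASS variable gets one row per non-reference level.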
Sorry for the confusion. All the independent variables relate to pain in a specific part of the body, and the dependent variable is whether the patient has pain or no pain. All variables included in the CLASS statement are categorical variables. For example, Age is categorised by applying a WOE (weight of evidence) transformation.
@Chapi wrote:
Sorry for the confusion. All the independent variables relate to pain in a specific part of the body, and the dependent variable is whether the patient has pain or no pain. All variables included in the CLASS statement are categorical variables. For example, Age is categorised by applying a WOE (weight of evidence) transformation.
So is your original question about reference category referring to the independent variables or the dependent variable (or both)?
Your p-values are not comparable across the two different models. Once you switch to categorizing Age (and other variables) by WOE, you can't expect the same answers as when Age was used as a continuous variable. They may not even be close.
Paige Miller
My question was about the reference category for the independent variables.
Both outputs were generated after applying the WOE transformation, with Age as a categorical variable in both models. The only difference is that the latest output uses a reference category and the previous one does not.
I also have a question about the p-values: should we judge whether the variable as a whole is significant, rather than the individual categories of the variable?
A predictor variable is declared as categorical by including it in the CLASS statement, which then creates dummy variables for its levels. If the variable is not in the CLASS statement, it is treated as continuous and is used directly as a column in the design matrix. You can see in your results that each variable has multiple coefficients, one per dummy variable, when you used the CLASS statement, but only a single coefficient when the CLASS statement is not used. P-values aren't even comparable between these two ways of treating the predictors. What doesn't make sense is that the coefficient associated with the last (reference) level of each CLASS predictor is not zero, which it should be whenever PARAM=GLM is used. Maybe that second set of output wasn't really generated by the code that you show, or there were some error or warning messages in the log. There was probably a message about "separation", since some of the standard errors are very large, which is typical when the data are too sparse and the separation condition occurs.
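One way to check for the sparse data mentioned above is to cross-tabulate each CLASS predictor against the response and look for empty or nearly empty cells. This is a sketch using the data set and variable names from the posted code; the choice of which predictors to tabulate is only an example.

```sas
/* Sketch: counts of each predictor level by response level */
proc freq data=Work.Dataset;
   tables (Age Breath Water Glasses Eyesight) * GCPS_Binry
          / norow nocol nopercent;   /* show raw cell counts only */
run;
```

A predictor level that occurs with only one value of GCPS_Binry is a likely source of a separation warning and of very large standard errors.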
I also have a question about the p-values: should we judge whether the variable as a whole is significant, rather than the individual categories of the variable?
PROC LOGISTIC produces a coefficient for each level of the CLASS variable (where one level should have a zero coefficient). Each coefficient is tested to see whether it is zero, and a p-value is reported. PROC LOGISTIC also produces a Type III test, which tests whether the coefficients are equal across all levels of the CLASS variable. This is a different test from the one you show, with a different meaning and different p-values.
So you might want to look at both the Type III test and the tests of the individual coefficients, and interpret them together.
Paige Miller
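To view the two kinds of tests side by side, the output can be limited to the Type 3 table and the parameter estimates. This is a sketch: the data set and variable names come from the posted code, and using only two predictors is an assumption made to keep the example short.

```sas
/* Sketch: show only the per-variable and per-level tests */
ods select Type3 ParameterEstimates;
proc logistic data=Work.Dataset descending;
   class Age Breath / param=glm;
   model GCPS_Binry = Age Breath;
run;
```

The Type3 table has one row per variable, while ParameterEstimates has one row per level of each CLASS variable.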