Solved: What Reference Category in Logistic regression

chapidi99 · Posted 11-26-2020 01:18 PM

Hi,

I am new to SAS and am implementing logistic regression. I would like to know what a reference category is in logistic regression and how it is useful. I have a categorical variable called "Level of pain" with the values no pain, less pain, medium, high and extreme. I have created dummy variables from the categories. Which of the dummy variables needs to be given as the reference category? And what options do I need to give in PROC LOGISTIC to choose the best reference category?

Many thanks for the help!!

Reeza · Posted 11-26-2020 01:49 PM

AFAIK, there is no such thing as a 'best reference category', and you don't need to create dummy variables for logistic regression in SAS; it does that automatically.

Have you worked through the examples in the PROC LOGISTIC documentation? It includes full code and I believe the second example is about categorical variables. The documentation uses the GLM method of parameterization for categorical variables but the usual desired option is the REF method.

Documentation examples

https://documentation.sas.com/?cdcId=pgmsascdc&cdcVersion=9.4_3.4&docsetId=statug&docsetTarget=statu...

An example on the REF option is here:
https://stats.idre.ucla.edu/sas/dae/logit-regression/

The different types of parameterization methods are outlined here, but not all are available in every procedure:

https://documentation.sas.com/?cdcId=pgmsascdc&cdcVersion=9.4_3.4&docsetId=statug&docsetTarget=statu...

That should be enough to get you started, feel free to post any further questions.

Chapi · Posted 11-27-2020 06:00 AM

Thank you, Reeza, for providing the PROC LOGISTIC documentation examples. I understand now that I don't need to create the dummy variables separately. I have run PROC LOGISTIC both ways, with and without a reference category, and I don't understand why the Pr > ChiSq values differ so drastically between the two. When I ran PROC LOGISTIC WITHOUT a reference category, all the variables had Pr > ChiSq values below 0.05, but WITH a reference category the categories within the variables are not below 0.05. I don't understand why.

StatDave · Posted 11-26-2020 01:50 PM

I assume that your Level of Pain variable is a predictor in the model rather than the response variable. In that case, you do not need to create dummy variables because that is what the CLASS statement does for you. It also allows you to pick the reference category with the REF= option. For example, if your original variable is called LevelOfPain with values 1, 2, 3, 4, or 5, and you want to use level 1 as the reference level, then specify

class LevelOfPain(ref="1") / param=glm;

Then include LevelOfPain in your MODEL statement. There is no "best" reference category. The choice is arbitrary and is made for convenience of interpretation. The above CLASS statement will create the conventional 0,1-coded dummy variables with level 1 as the reference level (all dummies equal 0). The parameter estimates will be interpreted as the difference in effect of each level compared to the reference level, 1.
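
For context, the CLASS statement above might sit in a complete step like the following. This is only a minimal sketch: the data set name work.pain_study and the binary response Outcome are hypothetical names chosen for illustration and do not come from the thread.

```sas
/* Sketch only: work.pain_study and Outcome are hypothetical names */
proc logistic data=work.pain_study;
   /* Level 1 is the reference level; PARAM=GLM gives 0/1 dummy coding */
   class LevelOfPain(ref="1") / param=glm;
   /* Model the probability that Outcome equals 1 */
   model Outcome(event="1") = LevelOfPain;
run;
```

With this coding, each LevelOfPain estimate would be read as the difference in log odds between that level and level 1.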

Chapi · Posted 11-27-2020 06:02 AM

Thank you Dave, I applied the below statement and it gave the desired results. But the Pr > ChiSq values increased drastically. I don't understand whether I still need to use those variables or not. Could you please explain?

PaigeMiller · Posted 11-27-2020 06:54 AM

It's not clear what you changed when the Pr > ChiSq values changed. Could you show us the code before and after, plus the corresponding outputs?

Could you also please clarify if this reference category you want is for an independent variable or for the dependent variable?

--
Paige Miller

Chapi · Posted 11-27-2020 07:13 AM

Hello,

proc logistic data=Work.Dataset desc plots(only)=roc;
   class Age Breath Blood Water Heart Stomach Heavey other UBEL water2
         Eyesight Dialysis hearing hearingdevice glasses water3 Psychiatri pregnancy
         / param=glm;
   model GCPS_Binry = Age Breath Blood Water Heart Stomach Heavey other UBEL water2
         Eyesight Dialysis hearing hearingdevice glasses water3 Psychiatri pregnancy
         / selection=stepwise;
   output out=out3 p=pred1;
run;

Previous results, when the CLASS statement was not used:

Analysis of Maximum Likelihood Estimates
Parameter	DF	Estimate	Standard Error	Wald Chi-Square	Pr > ChiSq
Intercept	1	-0.9657	0.1175	67.5333	<.0001
Age	1	0.3497	0.1139	9.4265	0.0021
Breath	1	0.2859	0.1198	5.6935	0.017
Blood	1	0.2656	0.1151	5.326	0.021
Water	1	0.2992	0.1099	7.4155	0.0065
Heart	1	0.2311	0.1054	4.8034	0.0284
Stomach	1	0.2595	0.1132	5.2547	0.0219
Water3	1	0.594	0.1425	17.3817	<.0001
Glasses	1	0.2656	0.1062	6.2524	0.0124
UBEL	1	0.4403	0.1086	16.4331	<.0001
Eyesight	1	0.3095	0.115	7.2445	0.0071

Latest results, when the CLASS statement was used for reference categories. As you can see, Pr > ChiSq is greater than 0.05 for some of the categories.

Analysis of Maximum Likelihood Estimates
Parameter	Level	DF	Estimate	Standard Error	Wald Chi-Square	Pr > ChiSq
Intercept		1	5.9508	1.6798	12.5494	0.0004
Breath	-0.45075276	1	9.599	180.8	0.0028	0.9577
Breath	1.91797094	1	10.5818	180.8	0.0034	0.9533
Age	-0.59358016	1	-11.4667	180.8	0.004	0.9494
Age	1.39111034	1	-10.8176	180.8	0.0036	0.9523
Age	-0.76335864	1	-0.348	1.2791	0.074	0.7856
Age	1.01428795	1	0.2879	1.2865	0.0501	0.8229
Age	-0.70972869	1	-1.1224	0.6951	2.6078	0.1063
Age	1.20129764	1	-0.4857	0.7085	0.4699	0.493
Breath	-0.45413734	1	-2.4323	1.018	5.7091	0.0169
Breath	1.29755151	1	-2.025	1.0334	3.8399	0.05
Water	-2.46127958	1	-2.1778	0.5524	15.5422	<.0001
Water	-0.87911213	1	-0.4818	0.3552	1.8396	0.175
Water	-0.67641088	1	-0.9268	0.6435	2.0743	0.1498
Water	0.39799746	1	0.00767	0.3008	0.0007	0.9797
Glasses	-0.69546905	1	-1.7676	0.8233	4.6099	0.0318
Glasses	0.86088881	1	-1.5396	1.0549	2.1302	0.1444
Eyesight	1.27360571	1	-1.006	0.8363	1.4469	0.229
Eyesight	-1.04615533	1	-0.6488	0.237	7.4918	0.0062

Chapi · Posted 11-27-2020 07:14 AM

Hello, please see the code above and the results before and after using the CLASS statement in PROC LOGISTIC.

PaigeMiller · Posted 11-27-2020 07:34 AM

I'm really having a lot of trouble understanding the problem. You start by talking about "level of pain" as a variable, but I don't see it in your code. And it's still not clear to me whether the "level of pain" variable is the dependent variable or an independent variable. Could you please clarify this?

As for your p-values, only the categorical variables go in the CLASS statement. The continuous variables do not go in the CLASS statement.

--
Paige Miller

Chapi · Posted 11-27-2020 07:48 AM

Sorry for the confusion. All independent variables are related to pain in a specific part of the body, and the dependent variable is used to predict pain / no pain for the patient. All variables included in the CLASS statement are categorical variables. For example, Age is categorised by applying a WOE transformation.

PaigeMiller · Posted 11-27-2020 08:00 AM

So is your original question about reference category referring to the independent variables or the dependent variable (or both)?

Your p-values are not comparable across the two different models. Once you switch to categorizing Age (and other variables) by WOE, you can't expect the same answers as when Age was used as a continuous variable; they may not even be close.

--
Paige Miller

Chapi · Posted 11-27-2020 08:18 AM

My question was about referencing independent variables.

Both outputs were generated after implementing the WOE transformation, with the Age variable categorical in both models. The only difference is that a reference category was applied in the latest output and not in the previous one.

I have a question about the p-value: should we look at whether the whole variable is significant rather than the individual categories of the variable?

StatDave · Posted 11-27-2020 10:52 AM

A predictor variable is declared as categorical by including it in the CLASS statement, which then creates dummy variables for its levels. If the variable is not in the CLASS statement, it is treated as continuous and is used directly as a column in the design matrix. You can see in your results that all variables have multiple coefficients, one per dummy variable, when you used the CLASS statement, but only a single coefficient when the CLASS statement is not used. P-values aren't even comparable between these two ways of treating the predictors. What doesn't make sense is that the coefficient associated with the last (reference) level of each CLASS predictor is not zero, which it should be whenever PARAM=GLM is used. Maybe that second set of output wasn't really generated by the code that you show, or there were error or warning messages in the log. Probably there was a message about "separation", since some of the standard errors are very large, which is typical when the data are too sparse, causing the separation condition.
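
One way to look for the kind of sparseness described here is to cross-tabulate each CLASS predictor against the response and check for empty or near-empty cells. The following is a minimal sketch that assumes the data set and variable names from the code posted earlier in the thread; Age and Breath are used only as an illustration because their rows show the very large standard errors.

```sas
/* Sketch: count observations for each predictor level by response value. */
/* Cells with zero or very few observations can lead to separation.       */
proc freq data=Work.Dataset;
   tables (Age Breath)*GCPS_Binry / nopercent norow nocol;
run;
```

If a level of a predictor occurs with only one value of GCPS_Binry, the estimate for that level can become unstable, which would be consistent with standard errors such as 180.8.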

PaigeMiller · Posted 11-27-2020 12:16 PM

@Chapi wrote:

I have a question about the p-value, Should we look at the whole variable as significant rather that the categories of the variables?

PROC LOGISTIC produces a coefficient for each level of the CLASS variable (where one level should have a zero coefficient). Each coefficient is tested to see whether it is zero, and a p-value is reported. PROC LOGISTIC also produces a Type III test, which tests whether the coefficients are equal across all levels of the CLASS variable. This is a different test from the one you show, with a different meaning and different p-values.

So, you might want to look at both the Type III test and the test of the individual coefficients, and interpret both together simultaneously.
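
To view the two sets of tests side by side, the output can be restricted to those two tables. This is a sketch under assumptions: it reuses the data set and variable names from the code posted earlier in the thread, fits only two of the predictors without stepwise selection to keep the example short, and relies on Type3 and ParameterEstimates being the ODS table names for these results.

```sas
/* Sketch: show only the per-variable and per-level tests. */
ods select Type3 ParameterEstimates;
proc logistic data=Work.Dataset descending;
   class Age Breath / param=glm;
   model GCPS_Binry = Age Breath;
run;
```

The Type 3 table has one row per CLASS variable, while the parameter estimates table has one row per level, so the first addresses the whole variable and the second addresses each category against the reference.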

--
Paige Miller
