Re: Categorical Variables and Dummy Variables

jjsingh04 · Posted 11-12-2022 03:21 AM

I'm trying to analyze political effects in US Presidential elections.

Red States= Republican wins by >=5%

Blue States = Democrat wins by >=5%

Battleground States = In-Between

Hypotheses:

H1 null (H0): Red vs Battleground is more significant than Blue vs Battleground in affecting Y

H1 alternative (HA): Blue vs Battleground is more significant than Red vs Battleground in affecting Y

H2 null (H0): Political extremism in a state (i.e. EITHER Red or Blue) is more significant than Battleground in affecting Y

H2 alternative (HA): Battleground is more significant than political extremism in a state (i.e. EITHER Red or Blue) in affecting Y

How would I set up dummy or categorical explanatory variables for this type of model? It seems perplexing.

I've thought of 2 possible solutions, but they both seem wrong:

a) I have a dummy variable RED (=1 if state is Red State; =0 if Blue or Battleground),

I have a dummy variable BATTLE (=1 if state is Battleground; =0 if not Battleground).

b) I have a categorical variable RED (=1 if state is Red State; =0 if Battleground; =-1 if Blue State),

I have a dummy variable BATTLE (=1 if state is Battleground; =0 if not Battleground).

What if I instead have variables like these:

c) I have a dummy variable RED (=1 if state is Red State; =0 if Blue or Battleground),

I have a dummy variable REDBLUE (=1 if state is Red or Blue; =0 if Battleground).

Then the following combinations of RED,REDBLUE would mean the following:

1,1 Red State

0,1 Blue State

0,0 Battleground

Comparing the significances of the RED coefficient (vs 0) helps us resolve Hypothesis H1;

and of the REDBLUE coefficient, Hypothesis H2.

Am I making the correct conclusion here?

Our lives are enriched by the people around us.
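
One way to check what the RED, BATTLE and REDBLUE coefficients in options (a) to (c) would actually measure is to solve for them from the three group values. The sketch below does this in Python (not SAS) with made-up group values on the model scale. The numbers, the `solve` helper and the variable names are assumptions for illustration, and an intercept is assumed in every model.

```python
from fractions import Fraction

def solve(rows, rhs):
    """Solve a small square linear system exactly by Gauss-Jordan elimination."""
    n = len(rows)
    aug = [[Fraction(v) for v in row] + [Fraction(b)] for row, b in zip(rows, rhs)]
    for col in range(n):
        # Find a row with a non-zero entry in this column and move it into place.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]

# Made-up group values on the model scale (for example log-odds of Y=1).
value = {"Red": 3, "Blue": 1, "Battle": 4}
groups = ["Red", "Blue", "Battle"]

# Each tuple is (intercept, first variable, second variable) for one group.
codings = {
    "a": {"Red": (1, 1, 0), "Blue": (1, 0, 0), "Battle": (1, 0, 1)},   # RED, BATTLE
    "b": {"Red": (1, 1, 0), "Blue": (1, -1, 0), "Battle": (1, 0, 1)},  # RED (1/0/-1), BATTLE
    "c": {"Red": (1, 1, 1), "Blue": (1, 0, 1), "Battle": (1, 0, 0)},   # RED, REDBLUE
}

# Coefficients (intercept, first, second) that reproduce the group values exactly.
coefs = {
    name: solve([rows[g] for g in groups], [value[g] for g in groups])
    for name, rows in codings.items()
}

for name, (b0, b1, b2) in coefs.items():
    print(name, "intercept:", b0, "first:", b1, "second:", b2)
```

Under these assumptions, the RED coefficient in option (c) equals Red minus Blue (not Red minus Battleground) and the REDBLUE coefficient equals Blue minus Battleground. In option (b) the BATTLE coefficient equals Battleground minus the average of Red and Blue. All three codings reproduce the same three group values, so none of them constrains the fit; they differ only in which comparison each coefficient reports.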

PaigeMiller · Posted 11-12-2022 05:42 AM

@jjsingh04 wrote:

How would I set up dummy or categorical explanatory variables for this type of model? It seems perplexing.

No need to create dummy variables at all. Most times a categorical variable works well. Exactly how you set it up probably doesn't matter, as long as you have three values of the categorical variable. Then just about any PROC in SAS that you choose to do the analysis will be able to work with this categorical variable.

--
Paige Miller

jjsingh04 · Posted 11-12-2022 03:26 PM

But I need to have 2 variables to test the 2 hypotheses right? And if one of them is a categorical variable, it will create a constrained effect that will equate the effects of redness and blueness that would be incorrect, right?


PaigeMiller · Posted 11-12-2022 04:46 PM

@jjsingh04 wrote:

But I need to have 2 variables to test the 2 hypotheses right? And if one of them is a categorical variable, it will create a constrained effect that will equate the effects of redness and blueness that would be incorrect, right?

No. You create one set of categorical variables, and then you can ask SAS to do both comparisons. For example, if Y is continuous, you can use PROC GLM and do both comparisons with two CONTRAST or two ESTIMATE statements. Similarly, if Y is categorical, you could do the same kind of comparisons in PROC LOGISTIC or PROC GENMOD.

Simple examples in PROC GLM using the ESTIMATE statement: https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.4/statug/statug_glm_syntax07.htm

--
Paige Miller

jjsingh04 · Posted 11-12-2022 11:43 PM

Y is a dummy actually (1,0) so I do logistic. So you're saying 2 separate logistic regressions, 1 with each variable?


jjsingh04 · Posted 11-12-2022 11:48 PM

I've read in several places online about the problem of constrained effects with categorical variables too--the implicit assumption of equal spacing and equal magnitude being implied: https://stats.stackexchange.com/questions/278837/numerical-coding-and-constraints-for-categorical-va...

I want to avoid all that 🙂


PaigeMiller · Posted 11-13-2022 05:25 AM

@jjsingh04 wrote:

I've read in several places online about the problem of constrained effects with categorical variables too--the implicit assumption of equal spacing and equal magnitude being implied: https://stats.stackexchange.com/questions/278837/numerical-coding-and-constraints-for-categorical-va...

I want to avoid all that 🙂

That doesn't apply here. I am not suggesting you use numerical coding at all. What I was talking about was not a "constraint", anyway.

--
Paige Miller

PaigeMiller · Posted 11-13-2022 03:10 PM

@jjsingh04 wrote:

Y is a dummy actually (1,0) so I do logistic. So you're saying 2 separate logistic regressions, 1 with each variable?

I don't see anywhere that I say to do 2 separate logistic regressions, 1 with each variable, which wouldn't make any sense. I said "you can do both comparisons using two CONTRAST or two ESTIMATE statements". You can have multiple CONTRAST or multiple ESTIMATE statements in one regression.

--
Paige Miller

jjsingh04 · Posted 11-13-2022 06:53 PM

I think I now understand. Is this example illustrative of what you mean?:

https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.4/statug/statug_logistic_examples02.htm

This example contrasts the 3-level categorical variable treatment: A vs P, B vs P, and A vs B, which in my case is analogous to Red vs Battle, Blue vs Battle, and Red vs Blue.

But I want to also test a hypothesis of (Either Red or Blue) vs (Battle), which I'm not sure how I would set up.


PaigeMiller · Posted 11-14-2022 05:50 AM

@jjsingh04 wrote:

I think I now understand. Is this example illustrative of what you mean?:

https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.4/statug/statug_logistic_examples02.htm

This example contrasts the 3-level categorical variable treatment: A vs P, B vs P, and A vs B, which in my case is analogous to Red vs Battle, Blue vs Battle, and Red vs Blue.

But I want to also test a hypothesis of (Either Red or Blue) vs (Battle), which I'm not sure how I would set up.

Yes, that is it except it uses CONTRAST instead of ESTIMATE, which is fine.

What math would you use to test the hypothesis of "(Either Red or Blue) vs (Battle)"? Don't explain with SAS code; explain with words and simple formulas.

--
Paige Miller

jjsingh04 · Posted 11-14-2022 03:28 PM

Well, I had been thinking that if there was a separate dummy variable called BATTLE, =1 for Battleground States, and =0 for Red or Blue States, then the significance of that dummy variable would be measure enough to test that hypothesis, no?

And if so, would that go in the same logistic regression as what we were just discussing, or in a separate one?


PaigeMiller · Posted 11-14-2022 03:37 PM

And what if you didn't use dummy variables at all (because you don't need them)? Suppose you just had three categories: Red, Blue and Battle. How do you compare Battle to Red and Blue?

Hint: there's an actual SAS example in the link I gave earlier, comparing one category to two other categories.

I'm also trying to get you to stop thinking about dummy variables; they're not helpful here, and not helpful in most cases in SAS. Yes, you learn about them in your university training, but SAS makes them obsolete in most cases: it computes the dummy variables behind the scenes, so you can think about the categories and the comparisons you want to make between them, and avoid thinking about dummy variables.

--
Paige Miller

jjsingh04 · Posted 11-15-2022 11:41 AM

Do you mean the A1+A2 vs A3 example in the documentation of the DIVISOR= option? Does that work with CONTRAST as well? The documentation says:

DIVISOR=number

specifies a value by which to divide all coefficients so that fractional coefficients can be entered as integer numerators. For example, you can use

estimate '1/3(A1+A2) - 2/3A3' a 1 1 -2 / divisor=3;

instead of

estimate '1/3(A1+A2) - 2/3A3' a 0.33333 0.33333 -0.66667;


PaigeMiller · Posted 11-16-2022 07:23 AM

Yes, that's it.

(1/2) * A1 + (1/2) * A2 - A3 is the comparison, where A1 A2 and A3 are the means for each group.

estimate 'Mean(A1,A2) - A3' a 1 1 -2 / divisor=2;

The divisor is not actually needed for the test: rescaling the coefficients changes the size of the estimate but not its significance.

Also, you can do

(1/2) * A1 + (1/2) * A3 - A2, which compares the mean of A1 and A3 to A2. The one remaining comparison of this kind is (1/2) * A2 + (1/2) * A3 - A1.

You can use the CONTRAST statement similarly; that would work fine for this problem.

--
Paige Miller
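
The point about the divisor can be checked with a little arithmetic. The sketch below is in Python (not SAS) and uses made-up estimates and variances for A1, A2 and A3. It assumes the three group estimates are independent, which is a simplification; the `contrast` helper and all numbers are hypothetical.

```python
import math

def contrast(coefs, estimates, variances, divisor=1.0):
    """Return (estimate, z) for a linear contrast of independent group estimates."""
    c = [k / divisor for k in coefs]
    est = sum(ci * ei for ci, ei in zip(c, estimates))
    # Variance of a linear combination of independent estimates.
    se = math.sqrt(sum(ci * ci * vi for ci, vi in zip(c, variances)))
    return est, est / se

# Made-up per-group estimates (A1, A2, A3) and their variances.
estimates = [3.0, 1.0, 4.0]
variances = [0.25, 0.25, 0.5]

# Coefficients 1 1 -2 with divisor=2, i.e. (1/2)*A1 + (1/2)*A2 - A3.
est_div, z_div = contrast([1, 1, -2], estimates, variances, divisor=2)
# The same coefficients without the divisor, i.e. A1 + A2 - 2*A3.
est_raw, z_raw = contrast([1, 1, -2], estimates, variances)

print("with divisor:   ", est_div, z_div)
print("without divisor:", est_raw, z_raw)
```

With these numbers the estimate doubles when the divisor is dropped, but the test statistic is the same in both cases, so the conclusion about significance does not change.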

jjsingh04 · Posted 11-17-2022 01:01 AM

In my case, which variables would refer to Red States, Blue States, and Battleground States, respectively?

Is it A1, A2, A3 or A1, A3, A2?
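
Which level is A1, A2 or A3 depends on how the procedure orders the levels of the CLASS variable. Below is a minimal sketch in Python (not SAS). It assumes the levels are character values named 'Red', 'Blue' and 'Battleground' (hypothetical labels; the thread does not give them) and that they are ordered by sorted value, which is the usual default for SAS CLASS variables (ORDER=FORMATTED).

```python
# Hypothetical level labels for the three-level categorical variable.
levels = ["Red", "Blue", "Battleground"]

# Assumed ordering: levels sorted by their (character) value.
ordered = sorted(levels)

def coefficients(weights):
    """Lay out contrast weights in the assumed level order (missing levels get 0)."""
    return [weights.get(level, 0) for level in ordered]

# Average of Red and Blue vs Battleground, to be used with divisor=2.
extremes_vs_battle = coefficients({"Red": 1, "Blue": 1, "Battleground": -2})
# Red vs Battleground and Blue vs Battleground.
red_vs_battle = coefficients({"Red": 1, "Battleground": -1})
blue_vs_battle = coefficients({"Blue": 1, "Battleground": -1})

print(ordered)
print(extremes_vs_battle, red_vs_battle, blue_vs_battle)
```

Under that assumption Battleground would be A1, Blue A2 and Red A3, so the coefficients are written in that order. The Class Level Information table that SAS prints shows the order actually used, so it is worth checking the coefficients against it.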

