disguy
Calcite | Level 5

Consider making an indicator variable for a predictor with 3 levels. Suppose the color variable can be red, blue, or white.

select(color);
   when('red')
      do;
         i_Red = 1;
         i_Blue = 0;
      end;
   when('blue')
      do;
         i_Red = 0;
         i_Blue = 1;
      end;
   when('white')
      do;
         i_Red = 0;
         i_Blue = 0;
      end;
end;

If we had a fourth level, how would I generalize the above procedure? If we added the color purple, would I have to redo all the code to something along the lines of:

when('purple')
   do;
      i_Purple = 1;
      i_Red = 0;
      i_Blue = 0;
      i_White = 0;
   end;

And after I have successfully created an indicator variable, how do I go about calling it for a procedure? If I wanted to run the regression procedure, would it be something like:

proc reg data = data_1;
model y*color;
when color = 'white'
end;

Thanks in advance.

12 REPLIES
SteveDenham
Jade | Level 19

Look at PROC GLMMOD. It will create a dataset that contains the indicator variables. Your syntax for proc reg will not work--check the model statement.
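
A minimal sketch of that suggestion, assuming the data set data_1 from the question holds a numeric response y and a character variable color with the three levels red, blue, and white. The output data set names design and parms are made up for illustration.

```sas
/* Sketch only: data_1, y, and color come from the question;
   design and parms are hypothetical output names. */
proc glmmod data=data_1 outdesign=design outparm=parms noprint;
   class color;
   model y = color;
run;

/* parms maps each Col# column in design to the effect and level it codes,
   so check it before choosing which columns to use. */
proc print data=parms;
run;

/* Col1 is the intercept and Col2, Col3, ... hold one indicator per level.
   With three levels, leaving out Col4 makes that level the reference. */
proc reg data=design;
   model y = col2 col3;
run;
quit;
```

PROC GLMMOD sorts the levels by default, so with lowercase values the columns should come out as blue, red, white, which would make white the reference here.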

Steve Denham

Reeza
Super User

Depending on the procedure, SAS will do that for you with a CLASS statement. Look at how it parameterizes the variable, though: GLM, EFFECT, or REF coding. Generally you want REF, but it's not the default method.

That would be my starting point. What procs are you looking at?

Coding systems for categorical variables in regression analysis
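
For a procedure whose CLASS statement accepts the PARAM= option, reference coding might be requested as sketched below. This assumes the data set data_1 with y and color from the question. PROC GLMSELECT is used only as one example of such a procedure, and SELECTION=NONE is there so that it fits the full model instead of selecting effects.

```sas
/* Sketch only: data_1, y, and color are taken from the question. */
proc glmselect data=data_1;
   /* PARAM=REF asks for reference coding; REF='white' makes white the
      level that gets 0 in every indicator column. */
   class color(ref='white') / param=ref;
   model y = color / selection=none;
run;
```

The CLASS options differ from procedure to procedure, so check the documentation for the proc you end up using.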

disguy
Calcite | Level 5

I was told that I could not use proc glm. I'm supposed to get familiar with how to do things without it. So I'm trying to generalize the method I learned for categorical variables with 3 levels.

I'm mainly limited to using proc reg.

SteveDenham
Jade | Level 19

So this is a homework type problem?

Almost every SAS proc has access to a CLASS statement. For those that don't, the GLMMOD procedure was created to generate consistent coding for categorical variables. To not use it would be equivalent (to me) to requiring that you write the code in assembler language.

Steve Denham

disguy
Calcite | Level 5

Correct, this is a homework question. I'm not very well-acquainted with SAS, so I'm not sure what you mean by assembler language.

SteveDenham
Jade | Level 19

Assembler language is what was used back in mainframe days. Fortran mapped commands into assembler, which was then converted into the actual binary code. Asking you to write code to do something that has already been verified to work correctly when you use the given PROC falls into that kind of request, to my mind. Why write low-level code when a higher-level version is readily available?

(As you may have guessed, I'm not a real great programmer--but I can do an awful lot in SAS using various PROCs.)

Steve Denham

disguy
Calcite | Level 5

Ah, I see. Well, my teacher explained that a lot of people use proc glm for everything without knowing what they're doing or what they're looking for, because you can't know what it actually does unless you know what some of the lower-level code does.

I'm assuming that for indicator variables, when you have 4 levels, you want 3 indicators. So three of the levels each get a 1 in their own indicator and 0's in the others, and the fourth level gets all 0's?
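
A sketch of how that could look in a DATA step, assuming data_1 has y and color as above, purple is the added fourth level, and white is the level left without an indicator. The output name data_2 is made up for illustration.

```sas
/* Sketch only: data_2 is a hypothetical name; white is assumed
   to be the reference level. */
data data_2;
   set data_1;
   /* A comparison evaluates to 1 when true and 0 when false,
      so each line builds one indicator. */
   i_Red    = (color = 'red');
   i_Blue   = (color = 'blue');
   i_Purple = (color = 'purple');
   /* white has no indicator: all three variables are 0 for it */
run;

proc reg data=data_2;
   model y = i_Red i_Blue i_Purple;
run;
quit;
```

Adding another level then means adding one line instead of editing every WHEN block. Note that a missing or misspelled color would also come out as all 0's, the same as white.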

SteveDenham
Jade | Level 19

Now I see. Your instructor is wise--I've seen a lot of GLM used inappropriately because people didn't know what they were doing (which can be corrected fairly easily, and is what your instructor is trying to do with this exercise) or they assumed it did something that it just cannot do.

So, look in the documentation under Shared Concepts and Topics for Levelization of Classification Variables and Parameterization of Model Effects.  In particular, take a look at REF and GLM coding.  I'm just not the right person to ask about data step coding to get there.

Steve Denham

data_null__
Jade | Level 19

This is an example that uses PROC TRANSPOSE to create indicator variables. You don't have to know how many levels of COLOR you have. It then uses PROC STDIZE to poke zeros into the missing values, fits the model with REG, and includes a TEST statement to compute the COLOR main effect. (You will need to know the names of the indicator variables to write this statement, or you could use code gen.) Then it does GLM to check.

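data colors;
   length color $8;
   do i = 1 to 40;
      /* pick one of four colors at random */
      color = chooseC(rantbl(1234,.25,.25,.25),'Red','Blue','Green','Yellow');
      y = rannor(0);
      output;
   end;
   retain one 1;   /* constant that becomes the 1 in each indicator column */
run;

/* one row per observation, one i_<color> column per level of color */
proc transpose data=colors out=indic(drop=_name_) prefix=i_;
   by i y;
   var one;
   id color;
run;

/* replace the missing values left by the transpose with 0 */
proc stdize data=indic reponly missing=0 out=indic2;
   var i_:;
run;

proc reg data=indic2;
   model y = i_:;
   color: test i_red, i_blue, i_green;
run;
quit;

/* check against the CLASS statement */
proc glm data=colors;
   class color;
   model y = color;
run;
quit;
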
SteveDenham
Jade | Level 19

See--that's what a real SAS programmer, not a PROC hacker like me, can come up with.

Steve Denham

disguy
Calcite | Level 5

Thanks for the help everyone.

Reeza
Super User

I'd actually disagree. It's useful to teach students how categorical variables are treated in regression, but they should learn both methods.

Do it in PROC REG and then compare with PROC GLM or whatever the applicable proc is.

And PROC GLM is not GLMMOD, different procs, different purposes entirely.

The GLMMOD procedure constructs the design matrix for a general linear model; it essentially constitutes the model-building front end for the GLM procedure.
