JamesBlack
Fluorite | Level 6

I may be missing something elementary, but I can't get rid of the feeling that HPGENSELECT handles interactions incorrectly.

 

Let's say I have 2 categorical variables. One with 2 levels and the other with 3 levels. Both of them have a reference category set in the CLASS statement.

var1 = {A, B}

var2 = {X, Y, Z}

 

I want to have their interaction in the model. Therefore, I include var1*var2 in the MODEL statement.

 

Oddly, I only get parameter estimates for the combinations {A,X} and {A,Y}, even though there should be 5 parameters estimated. Ideally, there should be estimates for {A,X}, {A,Y}, {A,Z}, {B,X}, {B,Y}.

 

 

This manual says that hpgenselect "permits any degree of interaction effects that involve classification and continuous variables".

 

 

Any hint, please?

SAS EG 7.15 HF3 (7.100.5.6132) (64-bit)

 

__

I could use GENMOD, which works fine, but it doesn't support stepwise selection of variables...


PaigeMiller
Diamond | Level 26

@JamesBlack wrote:

 

Let's say I have 2 categorical variables. One with 2 levels and the other with 3 levels. Both of them have a reference category set in the CLASS statement.

var1 = {A, B}

var2 = {X, Y, Z}

 

I want to have their interaction in the model. Therefore, I include var1*var2 in the MODEL statement.

 

Oddly, I only get parameter estimates for the combinations {A,X} and {A,Y}, even though there should be 5 parameters estimated. Ideally, there should be estimates for {A,X}, {A,Y}, {A,Z}, {B,X}, {B,Y}.

There are actually six combinations; you left out {B,Z}.

 

However, there are only 2 degrees of freedom for the interaction. This is how the math works out for this type of design. Only two of the interaction coefficients can be estimated, even though there are six interaction combinations. This isn't SAS deciding this; it is basic statistics, and it would be the same if the analysis were done with pencil and paper.

 

But the interaction is still in the model. SAS is doing the right thing. You can select (or not) this interaction via HPGENSELECT.
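
The degrees-of-freedom count can be written out explicitly. With an intercept, a 2-level var1, and a 3-level var2, the six cell means are accounted for as follows (a worked count, not output from either procedure):

$$
\underbrace{1}_{\text{intercept}} + \underbrace{(2-1)}_{\text{var1}} + \underbrace{(3-1)}_{\text{var2}} + \underbrace{(2-1)(3-1)}_{\text{var1*var2}} = 1 + 1 + 2 + 2 = 6
$$

So once the intercept and both main effects are in the model, only 2 further parameters remain for the interaction.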

--
Paige Miller
JamesBlack
Fluorite | Level 6

Yeah, my bad... I forgot to mention that I have a model with an intercept, so I left out {B,Z} as the reference category.

 

If you say it's basic statistics and SAS is doing the right thing, can you please explain why GENMOD gives (in my opinion, correctly) estimates for 5 parameters and HPGENSELECT only for 2?

 

Are both procedures doing the right thing, you think?

PaigeMiller
Diamond | Level 26

In general, SAS does what you tell it to do. Rarely, if ever, are mistakes made where degrees of freedom are handled improperly. SAS worked all of this out a long time ago, and over bazillions of uses over decades there seem to be no complaints about how it figures out degrees of freedom (and it's not that hard to do anyway).

 

But since I don't have your data and I don't have your code, I cannot explain the difference. But since SAS does what you tell it to do, then as a guess based on no information, you have specified different models for each of the two procedures.

--
Paige Miller
SAS_Rob
SAS Employee

If you are using the REF= option on the CLASS statement in HPGENSELECT then it will automatically choose reference coding.  This is a full-rank parameterization, and only the parameters associated with non-reference levels will appear in the Parameter Estimates table.

 

If you remove the REF= option then HPGENSELECT will use GLM coding, which is the same coding that GENMOD uses.  You will notice that, just like GENMOD, HPGENSELECT will set the parameters for the reference levels to 0 and still list those levels in the Parameter Estimates table.

 

Either way, however, the results from the two procedures will correspond to the same model fit, and thus log-likelihood values, predicted values, etc. will all be the same.  The differences will only be in the Parameter Estimates table, and that is simply because of the different parameterization.

JamesBlack
Fluorite | Level 6

You are definitely right in what you are saying. However, there are still a couple of things that I (and my colleagues) don't understand.

 

I have prepared some data...

 

DATA data;
 INPUT id var1 $ var2 $ target ;
 DATALINES;
1 A X 6
2 A X 2
3 A X 4
4 A Y 6
5 A Y 3
6 A Y 3
7 A Z 0
8 A Z 1
9 A Z 2
10 B X 5
11 B X 7
12 B X 8
13 B Y 0
14 B Y 9
15 B Y 3
16 B Z 7
17 B Z 7
18 B Z 2
19 A Y 1
20 B X 3
 ;
 RUN;

 

I am using Poisson regression...

 

proc genmod data=data;
   class var1(ref = "A") var2(ref = "X");
   model target = var1 var2 var1*var2
/dist=Poisson link=log type1 type3;
run;

proc hpgenselect data=data;
   class var1(ref = "A") var2(ref = "X");
   model target = var1 var2 var1*var2
	/dist=Poisson link=log;
	selection method=NONE details=all;
run;

 

So far, so good... both procedures give identical results. The problem appears when I want only the interaction in the model. Like this:

 

proc genmod data=data;
   class var1(ref = "A") var2(ref = "X");
   model target = var1*var2
/dist=Poisson link=log type1 type3;
run;

proc hpgenselect data=data;
   class var1(ref = "A") var2(ref = "X");
   model target = var1*var2
	/dist=Poisson link=log;
	selection method=NONE details=all;
run;

 

This way GENMOD gives exactly what's needed. HPGENSELECT insists on the previous coding and gives estimates only for the intercept, {B,Y} and {B,Z}. In my opinion, that is incorrect. Of course, this way the two models give different results, predicted values, etc.

 

As you say, I can get all interaction combinations by leaving the REF= setting out of the CLASS statement, like this...

 

proc hpgenselect data=data;
   class var1 var2;
   model target = var1*var2
	/dist=Poisson link=log;
	selection method=NONE details=all;
run;

 

The problem is that I do need to have reference categories set. Seems like a dead end to me... or am I still missing something?

 

Thank you for your reply.

SAS_Rob
SAS Employee

With GLM coding (the default in GENMOD) the procedure implicitly fits the main effects as well in an interaction only model, that is, it absorbs those DF in the calculation of parameter estimates.  HPGENSELECT would do the same thing if you use GLM coding.

 

proc hpgenselect data=data;
   class var1(ref = "A") var2(ref = "X") / param=glm;
   model target = var1*var2
      / dist=Poisson link=log;
   selection method=NONE details=all;
run;
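
One way to see why the coding matters for an interaction-only model is to build the interaction columns by hand. The sketch below is illustrative only: it assumes the `data` set posted earlier in the thread, and the data set name `design` and the column names `d_B`, `d_Y`, `d_Z`, `r_BY`, `r_BZ` and `cell` are hypothetical. Under reference coding with A and X as references, `var1*var2` contributes just the two product columns `r_BY` and `r_BZ`, so an intercept plus those two columns cannot give each of the six cells its own mean. Under GLM coding the interaction has one indicator per cell, which also spans the main effects.

```sas
/* Hand-built design columns for var1*var2 (illustrative names only) */
data design;
   set data;
   /* Reference-coded main-effect dummies (references: A and X) */
   d_B = (var1 = "B");
   d_Y = (var2 = "Y");
   d_Z = (var2 = "Z");
   /* Reference-coded interaction: products of the dummies, 2 columns */
   r_BY = d_B * d_Y;
   r_BZ = d_B * d_Z;
   /* GLM-coded interaction: one level per cell, 6 levels in total */
   length cell $3;
   cell = catx("_", var1, var2);
run;

/* One row per distinct combination of cell and product columns */
proc freq data=design;
   tables cell*r_BY*r_BZ / list nocum nopercent;
run;
```

In the listing, the cells A_X, A_Y, A_Z and B_X all have `r_BY = r_BZ = 0`, so under reference coding the interaction-only model forces those four cells to share the intercept. That is consistent with seeing estimates only for the intercept, {B,Y} and {B,Z}.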
JamesBlack
Fluorite | Level 6
Thank you very much indeed!

I would be very grateful if you had any advice on how to set overdispersion in this type of regression.

I know I can set a fixed number via "dispersion", but is there something similar to the

scale=pearson

option which is in GENMOD?
SAS_Rob
SAS Employee

HPGENSELECT does not allow you to set the dispersion parameter to anything other than a constant.  In any case, I think most statisticians would suggest that if you suspect over- or under-dispersion in a Poisson model, it is best to switch to a negative binomial model, which can be written so that it is essentially a Poisson distribution with an extra dispersion parameter.
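
As a rough sketch of that suggestion, the interaction-only model could be refit with a negative binomial response, whose variance is mu + k*mu**2 with the dispersion k estimated from the data rather than fixed. This assumes the `data` set and CLASS settings from earlier in the thread, and it assumes the DIST=NEGATIVEBINOMIAL keyword is available in your release of HPGENSELECT, so check the documentation before relying on it. The GENMOD step is included only for comparison with the SCALE=PEARSON approach mentioned in the question.

```sas
/* Negative binomial in HPGENSELECT: the dispersion is estimated, not fixed */
proc hpgenselect data=data;
   class var1(ref = "A") var2(ref = "X") / param=glm;
   model target = var1*var2 / dist=negativebinomial link=log;
   selection method=none details=all;
run;

/* For comparison: Poisson in GENMOD with a Pearson-based scale estimate */
proc genmod data=data;
   class var1(ref = "A") var2(ref = "X");
   model target = var1*var2 / dist=poisson link=log scale=pearson;
run;
```

The two steps are not equivalent: SCALE=PEARSON rescales the Poisson standard errors after fitting, while the negative binomial model changes the likelihood itself.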
