Jie111
Quartz | Level 8

I am using proc logistic to analyze the association between blood pressure and CVD (categorical outcomes: heart disease only, stroke only, heart and stroke). 

 

I want to account for a cluster variable such as hospital (the data come from 3 hospitals). PROC GENMOD supports only the ordinal multinomial model, not the nominal (categorical) one.

 

How can I adjust for the cluster variable when fitting a multinomial regression with a nominal (categorical) dependent variable?

15 REPLIES
PaigeMiller
Diamond | Level 26

You could add the variable HOSPITAL to the model in PROC LOGISTIC.
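
A minimal sketch of that suggestion, reusing the variable names from the original poster's later code. The variable name hospital is an assumption, since the thread never shows it:

```sas
/* Sketch only: hospital is an assumed name for the 3-level cluster variable */
proc logistic data=data descend;
   class hospital blood_pressure_g(ref="2") / param=ref;
   /* hospital enters as an ordinary fixed effect alongside the exposure */
   model inc_multinorminal = blood_pressure_g hospital / link=glogit;
run;
```

With only 3 hospitals this adds two dummy parameters per response function, which is why the approach works for a handful of clusters but not for many.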

--
Paige Miller
Jie111
Quartz | Level 8

Thanks for the quick reply.

 

Sorry, I did not describe the question correctly. If the cluster variable is family, and we have about 1000 families, it is not possible to adjust for it directly in the logistic model.

 

proc logistic data=data descend;
       class  blood_pressure_g(ref="2") / param=ref;
       model inc_multinorminal=blood_pressure_g/ link=glogit;
  run;



proc genmod data=data descend;
      class family blood_pressure_g(ref="2") / param=ref;
       model inc_multinorminal=blood_pressure_g/ dist=MULTINOMIAL link=cumlogit;
            repeated family=alt_pairid / corr=IND covb;
run;
PaigeMiller
Diamond | Level 26

I don't really have a good answer.

 

When you have a categorical variable with one thousand levels, I don't really know of any modeling technique that will do a good job here. There are two reasons for this: potentially small number of data points for most (all?) levels, and that you may run out of memory. Of course, you can still try it and see what results, but I am not optimistic.

--
Paige Miller
SteveDenham
Jade | Level 19

To me, this sounds like an analysis where PROC GLIMMIX can be applied.  It can model nominal multinomial responses, and by treating clustering variables as random effects, should be able to accomplish what you wish to do.

proc glimmix data=data method=laplace;
      class family blood_pressure_g(ref="2");
       /* DESCENDING is a response option on the MODEL statement in GLIMMIX, not a PROC option */
       model inc_multinorminal(descending)=blood_pressure_g/ dist=MULTINOMIAL link=glogit;
         RANDOM family/subject=alt_pairid ;
run;

For this, be sure to sort your dataset by alt_pairid (which I assume is numeric).  If it is not numeric,  then it should be added to the CLASS statement.   I removed the global param=ref, as it is not supported in GLIMMIX, which uses GLM parameterization.

 

SteveDenham

Jie111
Quartz | Level 8

Thanks for the kind reply.

 

I tried the code below, but got: ERROR: The SAS System stopped processing this step because of insufficient memory.

Maybe family has too many levels...

 

 

proc glimmix data=data  method=laplace;
      class family blood_pressure_g(ref="2");
       model inc_multinorminal=blood_pressure_g/ dist=MULTINOMIAL link=glogit;
         RANDOM blood_pressure_g/subject=family group=inc_multinorminal;
run;
PaigeMiller
Diamond | Level 26

Yes, this is a major issue when trying to model a categorical variable with 1000 levels.

 

You could try to somehow cluster the families together, so instead of 1000 families, you have 25 clusters ... but I don't have any ideas off the top of my head how to do this.

--
Paige Miller
Jie111
Quartz | Level 8
Thanks a lot, Paige.
I will try it.
SteveDenham
Jade | Level 19

This line is causing a lot of the problem:

 

  RANDOM blood_pressure_g/subject=family group=inc_multinorminal;

The way this reads, blood pressure is both a fixed and a random effect, and the random effect has >1000 levels, which are further subdivided by the number of levels in inc_multinorminal, which is your dependent variable.  Unless I am missing something, you have clustering by family.  You may also have heterogeneity of variance by blood pressure group, so the possible RANDOM statements would be:

 

  RANDOM intercept/subject=family;

/*OR*/

  RANDOM intercept/subject=family group=blood_pressure_g;

The first (random intercept model) should not stress GLIMMIX: there is a single variance estimate, with as many BLUPs from that as you have subjects.

 

Please try that and see how the memory situation works.

 

SteveDenham

 

Jie111
Quartz | Level 8

Hi SteveDenham, thanks for the help.

 

I changed the code as follows

proc glimmix data=data   method=laplace;
         class   family    blood_pressure_g(ref="2");
        model inc_multinorminal (ref='0')=blood_pressure_g/ dist=MULTINOMIAL link=glogit;
          RANDOM intercept/subject=family   group=blood_pressure_g;
 run;

 

SAS then reports:

ERROR: Nominal models require that the response variable is a group effect on RANDOM statements.
You need to add 'GROUP=inc_multinorminal'.

 

 

So I changed the GROUP= option as follows:

 

proc glimmix data=data   method=laplace;
         class   family    blood_pressure_g(ref="2");
        model inc_multinorminal (ref='0')=blood_pressure_g/ dist=MULTINOMIAL link=glogit;
          RANDOM intercept/subject=family   group=inc_multinorminal;
 run;

 

Still, ERROR: The SAS System stopped processing this step because of insufficient memory.

 

 

SteveDenham
Jade | Level 19

Digging around, I came up with one possibility, and one question for the wider audience.

 

The possibility would be to bootstrap your results, by using several subsets of the subject variable family.  You could use PROC SURVEYSELECT to sample with replacement from all families, then fit each of these subsets separately.  The parameter estimates and standard errors for the full group could then be obtained by model averaging (either straightforward or using PROC MIANALYZE in a clever way).  The key would be finding out what size the subsets need to be to avert the memory issue.
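
A rough sketch of how that bootstrap could be set up. The subset size (200), the number of replicates (20), the seed, the dataset names families, fam_samples, boot and boot_pe, and the variable draw_id are all arbitrary or hypothetical choices; the RANDOM statement simply mirrors the form that SAS asked for earlier in this thread:

```sas
/* One row per family (the cluster) */
proc sort data=data(keep=family) out=families nodupkey;
   by family;
run;

/* Sample families with replacement (METHOD=URS); OUTHITS writes one row per draw.
   n=200 and reps=20 are placeholders to be tuned against the memory limit. */
proc surveyselect data=families out=fam_samples
                  method=urs n=200 reps=20 outhits seed=12345;
run;

/* Give every draw its own id so a family drawn twice counts as two clusters */
data fam_samples;
   set fam_samples;
   draw_id + 1;
run;

/* Attach the observations of each sampled family */
proc sql;
   create table boot as
   select s.Replicate, s.draw_id, d.*
   from fam_samples as s inner join work.data as d
     on s.family = d.family
   order by s.Replicate, s.draw_id;
quit;

/* Fit the same model to each replicate and collect the fixed-effect estimates */
ods output ParameterEstimates=boot_pe;
proc glimmix data=boot method=laplace;
   by Replicate;
   class draw_id blood_pressure_g(ref="2");
   model inc_multinorminal(ref='0') = blood_pressure_g
         / dist=multinomial link=glogit solution;
   random intercept / subject=draw_id group=inc_multinorminal;
run;
```

The estimates in boot_pe could then be summarized across replicates (for example with PROC MEANS, BY effect and response level). Whether 200 families per replicate actually avoids the memory error is something to find by trial.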

 

The question for the wider audience is this: It is not obvious to me that you must include group=<response_variable> in the RANDOM statement. The error message points out that it must be included. Could someone point me to where this is covered in the documentation (@StatDave, @Rick_SAS)? Thanks to anyone who has info on this.

 

SteveDenham

StatDave
SAS Super FREQ

While PROC GENMOD does not support nominal multinomial logistic regression with clustered data, the newer PROC GEE does. You can specify DIST=MULT and LINK=GLOGIT and then use the REPEATED statement. For this and other types of logistic models that are available, see this note.

SteveDenham
Jade | Level 19

This really looks like a great alternative, @StatDave .

 

The example here looks directly applicable to this analysis.  Here is the code I would consider:

 

proc gee data=data descend;
      class family blood_pressure_g ;
       model inc_multinorminal=blood_pressure_g/ dist=MULTINOMIAL link=glogit;
            repeated subject=family / within=alt_pairid;
run;

You may need to sort the data by family and alt_pairid to get this to work.  Also, I don't know if alt_pairid is numeric so that it could be used in the within= option without including it in the CLASS statement. Also, it appears that PROC GEE only uses a GLM parameterization, and doesn't appear to support the ref= option in the CLASS statement, so interpretation will have to be made carefully.

 

SteveDenham

 

StatDave
SAS Super FREQ
For both GENMOD and GEE:
- The data do not need to be sorted by the SUBJECT= variable.
- Any variable in the SUBJECT= or WITHIN= option must be specified in the CLASS statement.
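
Applied to the PROC GEE call suggested above, those two points would give something like the following sketch. It assumes alt_pairid identifies the members within each family, which the thread does not confirm:

```sas
/* Sketch only: no PROC SORT step, and both family and alt_pairid are in CLASS */
proc gee data=data descend;
   class family alt_pairid blood_pressure_g;
   model inc_multinorminal = blood_pressure_g / dist=multinomial link=glogit;
   /* family defines the clusters; alt_pairid (assumed) orders members within a cluster */
   repeated subject=family / within=alt_pairid;
run;
```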
Jie111
Quartz | Level 8

Thanks a lot for your help, @SteveDenham @StatDave .

 

It seems that proc gee could help to deal with the question.

Unfortunately, my SAS installation reports that procedure GEE is not found.

I will try it and report back here once PROC GEE is available.😀
