Re: GLIMMIX for multilevel multinomial logistic regression

JeremyGelb · Posted 12-08-2016 11:14 AM

Dear all,

I'm a student and I want to model migration using individual-level data. Because the individuals are nested within many municipalities, I want to run a multilevel analysis with the intercept as the only random effect. My response variable is multinomial (not ordinal) and has 3 categories:

0 : no migration (reference)

1 : short migration (less than 40km)

2 : long migration (more than 40km)

So I'm trying to use PROC GLIMMIX, but the options are confusing and I didn't find an example for multinomial data.

Could you help me choose the right syntax?

For example, for the empty model I use this syntax:

proc glimmix data=Mob_06.Datas method=LAPLACE NOCLPRINT;
   class DCRAN Migration;
   model Migration (ref=first) = /CL link=glogit dist=MULTINOMIAL solution;
   RANDOM intercept/SUBJECT=DCRAN GROUP=Migration  TYPE=VC SOLUTION CL;
   COVTEST / WALD;
run;

Migration is the Y variable, DCRAN is the municipal code.

I'm not sure that Migration has to be in the CLASS statement, but without it the model fails with this error:

"Model is too large to be fit by PROC GLIMMIX in a reasonable amount of time on this
system. Consider changing your model"

Thank you for your help.

Damien_Mather · Posted 12-08-2016 06:49 PM

My advice would be to use PROC SQL to generate a unique list of municipalities, then use PROC SURVEYSELECT with METHOD=SRS to select a much smaller random sample of them, then use PROC SQL again to inner join the resulting municipality sample with your original data. Run your model on that sample. Keep taking smaller or larger samples until you find the tipping point for the error. The model at that point might be your stopping point, or it may let you usefully investigate other approaches that give equivalent results without being so memory hungry.
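
A rough sketch of those three steps, assuming the data set (Mob_06.Datas) and municipality code (DCRAN) from the original post. The WORK table names, the sample size of 50 municipalities, and the seed are placeholders I made up, so adjust them to your data:

```sas
/* Step 1: one row per municipality (work.muni_list is a made-up name) */
proc sql;
   create table work.muni_list as
   select distinct DCRAN
   from Mob_06.Datas;
quit;

/* Step 2: simple random sample of municipalities.
   SAMPSIZE=50 and SEED= are arbitrary placeholders. */
proc surveyselect data=work.muni_list out=work.muni_sample
                  method=srs sampsize=50 seed=20161208;
run;

/* Step 3: keep every individual who lives in a sampled municipality */
proc sql;
   create table work.mob_sample as
   select a.*
   from Mob_06.Datas as a
        inner join work.muni_sample as b
        on a.DCRAN = b.DCRAN;
quit;
```

Then point PROC GLIMMIX at work.mob_sample, and rerun with a larger or smaller SAMPSIZE= until the error appears or disappears. SAMPSIZE= must not exceed the number of municipalities in the list.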

SteveDenham · Posted 12-09-2016 10:52 AM

First, I believe your multinomial response is ordinal. Consider that it will be generated by the following:

if distance_migrated = 0 then migration=0;

if 0<distance_migrated<=40 then migration=1;

if distance_migrated>40 then migration=2;

Consequently, you could change the link from glogit to cumlogit, which would go a long way toward reducing the model size and memory requirements.
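
Something along these lines is what I have in mind for the empty model. It is an untested sketch that reuses the data set and variable names from the original post and assumes Migration is coded 0, 1, 2 so that the levels sort in the intended order:

```sas
/* Empty (intercept-only) ordinal model with a random municipality intercept.
   Untested sketch: data set and variable names are taken from the original post. */
proc glimmix data=Mob_06.Datas method=laplace noclprint;
   class DCRAN;                         /* only the grouping variable goes in CLASS */
   model Migration = / dist=multinomial link=cumlogit solution;
   random intercept / subject=DCRAN;    /* one variance component, no GROUP= */
run;
```

With the cumulative logit there is a single random intercept variance instead of one per response category, which is where the savings come from. Check the Response Profile table in the output to confirm which direction of the ordering is being modeled.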

But why categorize the response variable? You will always lose some power by categorizing the response variable (see Frank Harrell's website for more on this http://biostat.mc.vanderbilt.edu/wiki/Main/CatContinuous).

If you fit a continuous model (such as a spline) with an appropriate distribution, I believe your results will be more interpretable, more powerful and much more precise.

Steve Denham