BookmarkSubscribeRSS Feed
righcoastmike
Quartz | Level 8

Hi All, 

 

I posted a similar question a little while ago, but I think I can articulate the problem better now so I thought I'd try again. I've managed to pick myslef a doozy of a model, and I'm hoping that someone can give me some insight. I'm running a regression on a population of approximately 75,000 individuals with a total of approximately 150,000 total records (multiple records per individual) . These individuals are also nested within postal codes.

 

here's the code I'm running:

 

proc glimmix data=(data) ic=q;
class Postal_code study_id catvar1 catvar2 catvar3;
model (binary outcome variable) = catvar1 catvar2 catvar3 fixedvar1.....fixedvar20 / offset=logexposure solution dist=possion link=log ddfm=satterthwait;
random intercept / solution subject=postal_code ;
random _residual_ / solution subject=study_id(postal_code) ;
covtest/ wald;
nloptions tech=nrridge;
run;


 A couple of notes:

1. I'm runnining _residual_ on the second random intercept statement because I kept getting memory errors with the "intercept command" I'm guessing it's because I have so many levels in the patient_id class. I think this should be fine, but I'm open to other solutions. 

 

2. The model actually converges fine without the offset, but as soon as I include it something breaks and I no longer have convergence. 

 

I've been struggling to get this right for over a week now. Any insight into a) whether the _residual command is a legit way to get around the memory issues, or b) why the offset might be affecting convergence would be very much appreicated. 

 

Thanks so much. 

 

Mike 

5 REPLIES 5
sld
Rhodochrosite | Level 12 sld
Rhodochrosite | Level 12

For those who are interested, the previous post is here:https://communities.sas.com/t5/SAS-Statistical-Procedures/Does-this-PROC-GLIMMIX-Code-make-sense/m-p...

 

Something old/something new thoughts:

 

1. I still don't think the second RANDOM statement is appropriate.

 

2. If the response variable is binary, why specify DIST=POISSON instead of BINARY? I'm guessing that the response variable is actually a count?

 

3. Have you tried using study_id as a continuous variable with the dataset sorted appropriately? If so, how did that work out? I'm actually quite curious about that....

 

4. Have you tried METHOD=LAPLACE on the PROC GLIMMIX statement? Sometimes that option gives me convergence when I didn't have it before; sometimes it provides awful results. 

 

5. If the response variable is a count, perhaps it does not follow a Poisson distribution. Overdispersed? Zero-altered? 

 

righcoastmike
Quartz | Level 8
Hi sld,

Thanks for sticking with me on this one, so far no luck, here’s where we’re at:

1. I still don't think the second RANDOM statement is appropriate.
I was doing some research about this and came up with the following:
"Alternatively, for a random intercept model, the equivalent model can be specified using the REPEATED statement rather than the RANDOM statement. The REPEATED statement is less memory intensive than the RANDOM statement, especially when there are many levels of the SUBJECT= effect. In PROC GLIMMIX, you specify a repeated structure by adding the _RESIDUAL_ or RSIDE keyword to the RANDOM statement”
(SOURCE: http://support.sas.com/resources/papers/proceedings12/332-2012.pdf). I think it should be OK, but if you see something I don’t your thoughts would be much appreciated.

2. If the response variable is binary, why specify DIST=POISSON instead of BINARY? I'm guessing that the response variable is actually a count?
You are correct, the variable is in fact count data that I have recoded into binary because my data set has a variety of categories of the non event. (e.g. 1=event 2= non-event “A” 3=non-event “B” etc.) Since I don’t really care about the type of non event, I recoded to 1 for event and 0 for non event.

3. Have you tried using study_id as a continuous variable with the dataset sorted appropriately? If so, how did that work out? I'm actually quite curious about that….
This threw up the error “Model is too large to be fit by PROC GLIMMIX in a reasonable amount of time on this system. Consider changing your model.

4. Have you tried METHOD=LAPLACE on the PROC GLIMMIX statement? Sometimes that option gives me convergence when I didn't have it before; sometimes it provides awful results.
Same error as #3

5. If the response variable is a count, perhaps it does not follow a Poisson distribution. Overdispersed? Zero-altered?
I’ve done a couple of tests and it doesn’t look like I’ve got either of these problems, however I’m going to keep at it.

Thanks again for working through this with me, it’s much appreciated!

Mike
sld
Rhodochrosite | Level 12 sld
Rhodochrosite | Level 12

Hi Mike,

 

1. Oh, yes, this is an excellent reparameterization trick! To achieve the same model, you must add TYPE=CS

 

random _residual_ / solution subject=study_id(postal_code) type=cs;

Here's a demo

data demo2;
    call streaminit(87564);
    mu = 5;
    array aa[3] _temporary_ (1 0.5 0);
    do b= 1 to 20;
        var_b = rand("normal",0,2);
        do c= 1 to 10;
            var_c = rand("normal",0,1);
            do a= 1 to 3;
                y = mu + aa[a] + var_b + +var_c + rand("normal",0,1);
                output;
                end;
            end;
        end;
    run;
proc glimmix data=demo2;
    class a b c;
    model y = a / solution ddfm=kr2;
    random intercept / subject=b;
    random a / subject=b;
    random intercept / subject=c(b);
    run;
proc glimmix data=demo2;
    class a b c;
    model y = a / solution ddfm=kr2;
    random intercept / subject=b;
    random a / subject=b;
    random _residual_ / subject=c(b) type=cs;
    run;

2. The original variable with multiple, non-ordered categorical outcomes strikes me as a multinomial variable, not a count variable; I don't think that the Poisson distribution was ever pertinent. When you compile the various outcomes into event/nonevent, you've created a variable with a binary (Bernoulli) distribution; your model now is a binary logistic regression. Using an offset with a binary distribution to accommodate varying exposures is not as straightforward as replacing "poisson" with "binary". One approach is to use the complementary log-log link; see https://www.r-bloggers.com/modelling-occurence-of-events-with-some-exposure/ and https://rpubs.com/bbolker/logregexp, or search on "logistic regression offset" for more hits.

 

3. Too bad. I notice that this approach is presented on p 3 of the Kiernan et al paper you link to in your response. But maybe once you fix the distribution specification, you might try it again with more success.

 

Let me know what works and doesn't!

 

NMB82
Obsidian | Level 7

Did you ever finalize this model? I'm having a similar problem with a 3-level model nesting hospital admissions (n=32,000) within patients (n=20,000) and patients within hospitals (n=9).

 

https://communities.sas.com/t5/Statistical-Procedures/GLIMMIX-3-level-model-stops-with-quot-insuffic... 

STAT_Kathleen
SAS Employee

If you are still having issues with convergence or memory if you can provide the data set I can see if there are any recommendations that will help your  model run without errors.  Please keep in mind you may also contact Technical Support for help when you encounter convergence or out of memory errors and we will try to provide recommendations.

 

Kathleen

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 5 replies
  • 2233 views
  • 1 like
  • 4 in conversation