zemastear
Calcite | Level 5

Hello there,

I am trying to fit a multilevel random effect model on my data.

The model looks like this:

proc glimmix data=dataname initglm /*abspconv=1E-4*/ method=quad;
    model smoking(event="1") = dum60 dum70 dum80 age age*dum60 age*dum70 age*dum80 / dist=binary link=logit cl covb s;
    random intercept age / subject=id;
    by sex;
run; quit;
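
(Side note, since the model runs BY SEX: the input has to be sorted by sex first. A one-liner in case the dataset isn't already ordered, with dataname standing in for the real name:)

proc sort data=dataname;
    by sex;
run;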

Smoking is a dichotomous outcome variable (yes = 1, no = 0); age, a continuous variable, is one of the predictors; and there is an interaction of age with generation. I made dummies (dum50 as the reference category, plus dum60, dum70 and dum80) for each generation. Generations are defined by age at baseline: the generation that was age 60-69 at baseline is dum60, and so on.

I am using a long-format dataset with 4 observations per subject (cohort study). Hence we are using PROC GLIMMIX, to correct for the repeated measurements.

I have a few problems with this model.

- Note the statement "by sex": I want to analyse men and women separately. When I run the model, everything works perfectly for men; for women, the model fails with: ERROR: Infeasible parameter values for evaluation of objective function with 1 quadrature point. I tried to google this, but couldn't find it anywhere. I also talked to a statistician, but he couldn't really help me beyond suggesting that I change the convergence criteria. I tried a few things (collected in the sketch after this list):

1) NLOPTIONS GCONV=1E-3 FCONV=1E-3;

2) ABSPCONV=1E-4 on the PROC GLIMMIX statement;

3) changing METHOD=QUAD to METHOD=LAPLACE.

None of these worked.
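
For reference, this is how those three attempts attach to the procedure (an untested sketch showing all three in one place; they were tried separately):

proc glimmix data=dataname initglm abspconv=1E-4 method=laplace; /* 2) ABSPCONV=, 3) LAPLACE instead of QUAD */
    nloptions gconv=1E-3 fconv=1E-3; /* 1) looser convergence criteria */
    model smoking(event="1") = dum60 dum70 dum80 age age*dum60 age*dum70 age*dum80 / dist=binary link=logit cl covb s;
    random intercept age / subject=id;
    by sex;
run;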

- Another, totally different problem: when the model does run with intercept and age as random effects, the prevalence estimates do not correspond to the prevalences I observe when making frequency tables.

For example, I ran some estimates together with the model:

estimate 'gen 60-69 at age = 70' intercept 1 gendum60 1 leeftijd 70 leeftijd*gendum60 70/ilink;

It gave me a prevalence at age 70 of about 1%, using the ILINK option, which transforms the estimate back to a prevalence (right?). When I run frequency tables for this generation, I see that at age 70 they have a prevalence of ~20%. And it's not only this specific generation at this age; it's every estimate I do. The model doesn't represent my data well at all. Sometimes it even gives prevalences of 1E-7, which is of course very weird. Again, I talked to the statistician and we tried running the model without the random intercept and age. The prevalences estimated by that model were very accurate compared to the real data! But my question is: what happens when you remove the random intercept and age? I understand what you correct for when using them, but when removing them, am I still correcting for repeated measurements within each subject? One thing I noticed is that without random effects, 'Subjects in Blocks' is 1, instead of the ~1000 I usually have for men.
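
(For completeness: one pattern that keeps a correction for repeated measurements without G-side random effects is an R-side working correlation. The following is an untested sketch that reuses the names from the model above. With RANDOM _RESIDUAL_ the estimation switches to pseudo-likelihood, the default METHOD=RSPL, and "Subjects (Blocks in V)" should again equal the number of subjects rather than 1.)

proc glimmix data=dataname;
    model smoking(event="1") = dum60 dum70 dum80 age age*dum60 age*dum70 age*dum80 / dist=binary link=logit cl s;
    random _residual_ / subject=id type=cs; /* compound-symmetry correlation across a subject's repeated measurements */
    by sex;
run;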

Lots of stuff; I hope you guys can help me! :)

SteveDenham
Jade | Level 19

See my reply in the SAS Procedures forum.  After thinking another 10 minutes, and seeing what is going on when you remove the random effect of age, I am more convinced than ever that you need to include the random effect of age in your estimate statement.

Try:

estimate 'gen 60-69 at age = 70' intercept 1 gendum60 1 leeftijd 70 leeftijd*gendum60 70 | leeftijd 70 /ilink; /* Added leeftijd as a random effect */

This is untested, and I will be curious if it gives you what you might be looking for.

Steve Denham

zemastear
Calcite | Level 5

Smoking full model

The GLIMMIX Procedure

Data Set **
Response Variable smoker
Response Distribution Binary
Link Function Logit
Variance Function Default
Variance Matrix Blocked By id
Estimation Technique Maximum Likelihood
Likelihood Approximation Gauss-Hermite Quadrature
Degrees of Freedom Method Containment

Number of Observations Read 5790
Number of Observations Used 3473

Ordered Value   smoker   Total Frequency
1               0        2651
2               1        822

The GLIMMIX procedure is modeling the probability that roker='1'.

G-side Cov. Parameters 2
Columns in X 8
Columns in Z per Subject 2
Subjects (Blocks in V) 931
Max Obs per Subject 6

Optimization Technique Dual Quasi-Newton
Parameters in Optimization 10
Lower Boundaries 2
Upper Boundaries 0
Fixed Effects Not Profiled
Starting From GLM estimates
Quadrature Points 7

    

Iteration History
Iteration   Restarts   Evaluations   Objective Function   Change         Max Gradient
 0          0           4            2664.2076557          .             247518.4
 1          0          15            2531.8794796         132.32817606   10382.23
 2          0           3            2413.6629589         118.21652075   12986.84
 3          0           4            2411.987591            1.67536786   12781.79
 4          0           4            2411.7530459           0.23454511   12714.24
 5          0           3            2410.6125111           1.14053478   12417.07
 6          0           4            2389.3924796          21.22003154   2125.985
 7          0           2            2370.5918749          18.80060463   8539.3
 8          0           2            2340.5986904          29.99318457   10467.24
 9          0           3            2333.1846772           7.41401318   20957.11
10          0           4            2309.6450045          23.53967266   3675.545
11          0           3            2305.5393632           4.10564133   3792.51
12          0           3            2303.766906            1.77245715   1238.087
13          0           3            2303.3039486           0.46295749   929.9581
14          0           3            2303.0939216           0.21002696   224.6986
15          0           4            2299.8431727           3.25074887   2301.9
16          0           2            2297.2327543           2.61041844   1237.504
17          0           2            2293.4149064           3.81784791   535.4416
18          0           2            2288.5668127           4.84809373   2638.173
19          0           4            2287.8921513           0.67466138   1248.713
20          0           3            2287.5832383           0.30891300   125.6845
21          0           3            2287.5658631           0.01737515   127.9899
22          0           2            2287.5444346           0.02142856   101.2652
23          0           6            2287.0016694           0.54276516   394.4775
24          0           3            2286.6363686           0.36530084   150.4604
25          0           3            2286.4653755           0.17099304   193.7834
26          0           3            2286.3540602           0.11131532   36.6376
27          0           3            2286.3430099           0.01105026   35.43862
28          0           3            2286.3427712           0.00023876   9.727946
29          0           3            2286.3427548           0.00001641   3.908154

Convergence criterion (GCONV=1E-8) satisfied.

-2 Log Likelihood 2286.34
AIC (smaller is better) 2306.34
AICC (smaller is better) 2306.41
BIC (smaller is better) 2354.71
CAIC (smaller is better) 2364.71
HQIC (smaller is better) 2324.79

-2 log L(smoker| r. effects) 690.83
Pearson Chi-Square 690.37
Pearson Chi-Square / DF 0.20

   

Covariance Parameter Estimates
Cov Parm    Subject   Estimate   Standard Error
Intercept   respnr    24.9367    4.9515
age         respnr    0.002104   .

        

Solutions for Fixed Effects
Effect Estimate Standard Error DF t Value Pr > |t| Alpha Lower Upper
Intercept 10.7707 2.1945 927 4.91 <.0001 0.05 6.4641 15.0774
dum60 1.6981 3.0139 1610 0.56 0.5732 0.05 -4.2135 7.6096
dum70 7.3854 4.1763 1610 1.77 0.0772 0.05 -0.8062 15.5770
dum80 3.7303 9.8024 1610 0.38 0.7036 0.05 -15.4965 22.9571
age -0.2081 0.03392 928 -6.13 <.0001 0.05 -0.2747 -0.1415
dum60*age -0.03188 0.04490 1610 -0.71 0.4778 0.05 -0.1200 0.05619
dum70*age -0.07526 0.05662 1610 -1.33 0.1840 0.05 -0.1863 0.03580
dum80*age -0.02561 0.1173 1610 -0.22 0.8272 0.05 -0.2557 0.2045

    

Type III Tests of Fixed Effects
Effect Num DF Den DF F Value Pr > F
dum60 1 1610 0.32 0.5732
dum70 1 1610 3.13 0.0772
dum80 1 1610 0.14 0.7036
age 1 928 37.62 <.0001
dum60*age 1 1610 0.50 0.4778
dum70*age 1 1610 1.77 0.1840
dum80*age 1 1610 0.05 0.8272

         

Covariance matrix for fixed effects (rows and columns ordered: Intercept, dum60, dum70, dum80, age, dum60*age, dum70*age, dum80*age)
Intercept    4.8156   -4.6886   -4.6077   -4.6707   -0.07191   0.06941    0.06868    0.06894
dum60       -4.6886    9.0835    4.8457    4.8053    0.06926  -0.1317    -0.07167   -0.07161
dum70       -4.6077    4.8457   17.4418    4.8857    0.06754  -0.07216   -0.2319    -0.07319
dum80       -4.6707    4.8053    4.8857   96.0872    0.06902  -0.07145   -0.07225   -1.1404
age         -0.07191   0.06926   0.06754   0.06902   0.001151 -0.00110   -0.00108   -0.00109
dum60*age    0.06941  -0.1317   -0.07216  -0.07145  -0.00110   0.002016   0.001139   0.001138
dum70*age    0.06868  -0.07167  -0.2319   -0.07225  -0.00108   0.001139   0.003206   0.001154
dum80*age    0.06894  -0.07161  -0.07319  -1.1404   -0.00109   0.001138   0.001154   0.01377

Estimates used

estimate 'gen 55-59 vs. gen 60-69 at age = 80' dum60 -1 age*dum60 -70 / cl;

estimate 'gen 55-59 at age = 70' intercept 1 age 70 / ilink;

estimate 'gen 60-69 at age = 70' intercept 1 dum60 1 age 70 age*dum60 70 / ilink;

Label   Estimate   Standard Error   DF   t Value   Pr > |t|   Alpha   Lower   Upper   Mean   Standard Error Mean   Lower Mean   Upper Mean

gen 60-69 vs. gen 70-79 at age = 80 -2.2174 0.7119 1610 -3.11 0.0019 0.05 -3.6138 -0.8210 Non-est . . .
gen 60-69 at age = 80 -6.7284 0.5968 1610 -11.27 <.0001 0.05 -7.8991 -5.5577 0.001195 0.000712 0.000371 0.003843
gen 70-79 at age = 80 -4.5110 0.5080 1610 -8.88 <.0001 0.05 -5.5074 -3.5147 0.01087 0.005461 0.004040 0.02890

This is the output of my analysis (as asked for in the other topic, but let's continue here).

I will try adding the random effect in the estimate statement. Do I need to add the random effect to all my estimates? So, also to the first estimate, for the difference between two generations (gen 60-69 vs. gen 70-79 at age = 80)? If so, do I add it the same way as in the other statement (| leeftijd 70)?

Thanks a lot Steve, I will let you know what happens.

zemastear
Calcite | Level 5

Adding the random effect of age in the estimate statement (I did it for the second and third estimates, not the first) did cause the prevalence to rise, but nowhere near as high as it 'should' be / as I want it to be / as the raw data look.
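
(One candidate explanation for the persistent gap, offered as a rough back-of-the-envelope check rather than anything established in this thread: the ESTIMATE statement evaluates the subject-specific, conditional model at random effects of zero, whereas frequency tables are population-averaged, i.e. marginal, and with an intercept variance near 25 those two scales diverge enormously. Using the common logistic-normal approximation p_marginal ≈ logistic(eta / sqrt(1 + 0.346*sigma²)) with the variance components from the output above:)

data marginal_check; /* hypothetical dataset name, illustration only */
    eta    = -6.7284;                  /* 'gen 60-69 at age = 80' estimate from the output */
    sigma2 = 24.9367 + 0.002104*80**2; /* intercept variance plus age-slope variance at age 80 (TYPE=VC, no covariance) */
    p_cond = logistic(eta);            /* conditional prevalence: ~0.0012, as printed */
    p_marg = logistic(eta / sqrt(1 + 0.346*sigma2)); /* marginal approximation: ~0.14, far closer to the raw ~20% */
    put p_cond= p_marg=;
run;

If that is the mechanism, no choice of ESTIMATE coefficients will close the gap: conditional and marginal prevalences are simply different quantities when the random-effect variance is this large.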

SteveDenham
Jade | Level 19

After seeing the estimates of the random effect, I don't think that will put things in the right order of magnitude.  The logit of the response is log (822/3473) - log (2651/3473) = -1.171.  When I plug in values, I get values that are off by the magnitude you are reporting.

What happens when you use the LSMEANS statement?

Try

lsmeans gendum60/at means ilink e;

lsmeans gendum60/at age=70 ilink e;

Now I don't expect this to really work, since I think you have correctly specified the estimate statement, but it might trigger something in my dinosaur brain on a Monday morning.  If the e option doesn't give back what is going into the estimate statement, then we have what I am looking for.

There is something niggling at my thoughts--Why not make the 4 age cohorts class variables?  I think we can get at the comparisons of interest through the LSMESTIMATE statement.  It would also enable the use of the noint option on the model statement.  Something like:

proc glimmix data=dataname initglm /*abspconv=1E-4*/ method=quad;
    class cohort(ref=first); /* Assumes that this is on the dataset already, and sets the age50 cohort as the reference group */
    model smoking(event="1") = cohort age age*cohort / noint dist=binary link=logit cl covb s;
    random intercept age / subject=id;
    lsmeans cohort / at means ilink e;
    lsmeans cohort / at age=70 ilink e;
    lsmeans cohort / at age=80 ilink e;
    by sex;
run; quit;

See if these give you values that look more appropriate.  If they look about right, then we can work on comparisons of interest.  Some of these will have values, but be meaningless--for instance, the 80-89 cohort at age=70.  If the lsmeans on the ilink scale are still off by several orders of magnitude, then I think some examination of data, etc. is where we need to go.

Steve Denham
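
(Steve mentions the LSMESTIMATE statement without spelling it out. As an untested sketch under the CLASS coding above, and assuming the cohort levels sort as the 50s, 60s, 70s and 80s groups, a cohort comparison at a fixed age could look like:)

lsmestimate cohort 'gen 60-69 vs. gen 70-79 at age = 80' 0 1 -1 0 / at age=80 cl;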


zemastear
Calcite | Level 5

Hey Steve,

Thanks again.

I am far from an expert in SAS, but I had looked around a bit myself the other day and already suspected that it should somehow be possible to model what I want, and LSMEANS/LSMESTIMATE looked like a good solution at first sight. I had no clue how to do it, though, since I only know the statistical procedures I have used in the past; everything else is new to me, as is this.

I talked to my statistician yesterday and we decided to go another way (a way I do not really like, especially if your suggestion turns out to work), but for now I will carry on with what I already have, since time is running out for my project. I expect to have more time near the end of this month / the start of next month (while the co-authors review my article, I should have plenty of time in between), and that is when I want to try your method. So I will leave this here for now, but I will get back to it at some point. I just think it is impossible that there is no way in SAS to do what I want to do, so I am interested in experimenting with your suggestion in the near future.

SteveDenham
Jade | Level 19

Good luck.  I hope that your method works out (just guessing that you will be calculating odds ratios using frequency tables).

Steve Denham

zemastear
Calcite | Level 5

We will just present the differences in prevalences based on the raw data. In order to say whether those differences are significant, we will look at the model.

So, if we see a difference of, let's say, 8% between generation 55-59 (reference) and generation 60-69 at age 70 in the raw data, we will present that number. Then we will look at this estimate:

estimate 'gen 55-59 vs. gen 60-69 at age = 70' dum60 -1 age*dum60 -70 / cl;

If this is significant ("Pr > |t|" < 0.05), then we will say: the 8% difference is significant.

I don't like this method so much....

SteveDenham
Jade | Level 19

Me either, since the individual estimates are so poorly aligned with the raw data.

Picking battles that you can win is the key to military success.  Also to statistical consulting.

Steve Denham

zemastear
Calcite | Level 5

Here I am again. The results of my analyses were so poor that I decided to try this.

After seeing the estimates of the random effect, I don't think that will put things in the right order of magnitude.  The logit of the response is log (822/3473) - log (2651/3473) = -1.171.  When I plug in values, I get values that are off by the magnitude you are reporting.

I don't know exactly what you are calculating there, or whether it matters, but when I calculate this I get -0.50853791?
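
(A likely reconciliation, for the record: the two numbers appear to be the same quantity in different log bases. A quick check, natural logs on the first line and base-10 on the second:)

data _null_;
    ln_diff  = log(822/3473)   - log(2651/3473);   /* -1.1710: natural logs, Steve's figure    */
    l10_diff = log10(822/3473) - log10(2651/3473); /* -0.5085: base-10 logs, the recomputation */
    put ln_diff= l10_diff=;
run;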

What happens when you use the LSMEANS statement?

Try

lsmeans gendum60/at means ilink e;

lsmeans gendum60/at age=70 ilink e;

lsmeans gendum60/at means ilink e;

ERROR: Only class variables allowed in this effect.

Smoking Full

The GLIMMIX Procedure

Data Set                     WORK.dataname
Response Variable            smoking
Response Distribution        Binary
Link Function                Logit
Variance Function            Default
Variance Matrix Blocked By   id
Estimation Technique         Maximum Likelihood
Likelihood Approximation     Gauss-Hermite Quadrature
Degrees of Freedom Method    Containment

Class      Levels   Values
ageclass   4        1 2 3 4

Number of Observations Read   11964
Number of Observations Used   8182

Ordered Value   smoking   Total Frequency
1               0         5605
2               1         2577

G-side Cov. Parameters       2
Columns in X                 9
Columns in Z per Subject     2
Subjects (Blocks in V)       2991
Max Obs per Subject          3

Optimization Technique       Dual Quasi-Newton
Parameters in Optimization   10
Lower Boundaries             2
Upper Boundaries             0
Fixed Effects                Not Profiled
Starting From                GLM estimates
Quadrature Points            7

[Iteration history omitted: 33 dual quasi-Newton iterations, objective function decreasing from 7924.0808061 to 7072.945793.]

Convergence criterion (GCONV=1E-8) satisfied.

-2 Log Likelihood         7072.95
AIC (smaller is better)   7092.95
AICC (smaller is better)  7092.97
BIC (smaller is better)   7152.98
CAIC (smaller is better)  7162.98
HQIC (smaller is better)  7114.54

-2 log L(smoking | r. effects)   2009.74
Pearson Chi-Square               1444.48
Pearson Chi-Square / DF          0.18

Covariance Parameter Estimates
Cov Parm    Subject   Estimate   Standard Error
Intercept   id        23.5344    2.4922
age         id        0.004184   0.000835

Solutions for Fixed Effects
Effect         ageclass   Estimate   Standard Error   DF     t Value   Pr > |t|   Alpha   Lower     Upper
ageclass       1           0.8444    0.7731           2209    1.09     0.2749     0.05    -0.6717    2.3605
ageclass       2           1.7424    0.7304           2209    2.39     0.0171     0.05     0.3101    3.1746
ageclass       3           1.0363    0.9537           2209    1.09     0.2774     0.05    -0.8340    2.9066
ageclass       4           9.0983    1.6102           2209    5.65     <.0001     0.05     5.9406    12.2560
age                       -0.2268    0.02837          2978   -8.00     <.0001     0.05    -0.2825   -0.1712
age*ageclass   1           0.1352    0.03717          2209    3.64     0.0003     0.05     0.06233   0.2081
age*ageclass   2           0.1237    0.03288          2209    3.76     0.0002     0.05     0.05918   0.1881
age*ageclass   3           0.1466    0.03351          2209    4.37     <.0001     0.05     0.08086   0.2123
age*ageclass   4           0         .                .       .        .          .        .         .

Type III Tests of Fixed Effects
Effect         Num DF   Den DF   F Value   Pr > F
ageclass       4        2209     9.67      <.0001
age            1        2978     107.62    <.0001
age*ageclass   3        2209     7.02      0.0001

Covariance matrix for fixed effects (rows and columns ordered: ageclass 1-4, age, age*ageclass 1-3; the age*ageclass 4 row and column contain no estimable entries)
ageclass 1       0.5977     0.002968   0.004430   0.01476   -0.00031   -0.01738    0.000216   0.000188
ageclass 2       0.002968   0.5334     0.009498   0.03279   -0.00068    0.000594  -0.01201    0.000424
ageclass 3       0.004430   0.009498   0.9096     0.04726   -0.00098    0.000860   0.000693  -0.01710
ageclass 4       0.01476    0.03279    0.04726    2.5928    -0.04467    0.04389    0.04336    0.04314
age             -0.00031   -0.00068   -0.00098   -0.04467    0.000805  -0.00079   -0.00078   -0.00077
age*ageclass 1  -0.01738    0.000594   0.000860   0.04389   -0.00079    0.001382   0.000763   0.000759
age*ageclass 2   0.000216  -0.01201    0.000693   0.04336   -0.00078    0.000763   0.001081   0.000753
age*ageclass 3   0.000188   0.000424  -0.01710    0.04314   -0.00077    0.000759   0.000753   0.001123

Coefficients for ageclass least squares means (E option, AT MEANS): each column sets its own ageclass indicator to 1, age to 45.45, and its own age*ageclass term to 45.45.

ageclass Least Squares Means (at mean age = 45.45)
ageclass   age     Estimate   Standard Error   DF     t Value   Pr > |t|   Mean      Standard Error Mean
1          45.45   -3.3197    0.5074           2209    -6.54    <.0001     0.03490   0.01709
2          45.45   -2.9478    0.2642           2209   -11.16    <.0001     0.04984   0.01251
3          45.45   -2.6121    0.2471           2209   -10.57    <.0001     0.06837   0.01574
4          45.45   -1.2118    0.4404           2209    -2.75    0.0060     0.2294    0.07785

Coefficients (E option, AT age=40): same pattern, with age = 40 in every column.

ageclass Least Squares Means (at age = 40)
ageclass   age     Estimate   Standard Error   DF     t Value   Pr > |t|   Mean      Standard Error Mean
1          40.00   -2.8204    0.4056           2209    -6.95    <.0001     0.05623   0.02153
2          40.00   -2.3854    0.2292           2209   -10.41    <.0001     0.08429   0.01769
3          40.00   -2.1746    0.2810           2209    -7.74    <.0001     0.1021    0.02575
4          40.00    0.02444   0.5534           2209     0.04    0.9648     0.5061    0.1383

Coefficients (E option, AT age=50; this table prints the age row as leeftijd): same pattern, with age = 50 in every column.

ageclass Least Squares Means (at age = 50)
ageclass   age     Estimate   Standard Error   DF     t Value   Pr > |t|   Mean      Standard Error Mean
1          50.00   -3.7366    0.6025           2209    -6.20    <.0001     0.02328   0.01370
2          50.00   -3.4173    0.3153           2209   -10.84    <.0001     0.03176   0.009695
3          50.00   -2.9773    0.2523           2209   -11.80    <.0001     0.04846   0.01163
4          50.00   -2.2440    0.3700           2209    -6.07    <.0001     0.09587   0.03207

Basically I just ran the model you suggested and pasted the output.

The REF=FIRST option somehow didn't work, so I left it out. The model is now using category 4 as the reference, right? I don't think it should matter much for the estimates/LSMeans we are interested in, though it would be nice to have the youngest ageclass as the reference. I tried ORDER=DATA, but that didn't change things.

Note: I used different age classes now (20-29, 30-39, 40-49, 50-59), since I am using another dataset (I have two datasets for this project). But that shouldn't matter, since I had the problem of not being able to estimate the correct prevalences in both datasets. I think the method is what's incorrect, and therefore I don't expect the choice of dataset to influence the outcomes.

So, what do we get here? We have three LSMEANS statements: one that estimates the prevalence for every ageclass at the mean age, and two others that estimate the prevalence at age = 40 and age = 50. Looking at the tables, it still gives low prevalences. In these age classes, based on the raw data, I expect prevalences of around 30% (ranging from 20% to 40%) depending on the ageclass of interest, but definitely not the 2-10% we are seeing now. Am I right?
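
(The same conditional-versus-marginal arithmetic sketched earlier in the thread seems to fit this output as well; again only a rough check, using the logistic-normal approximation and the covariance estimates above:)

data marginal_check2; /* hypothetical dataset name, illustration only */
    eta    = -2.8204;                  /* ageclass 1 LS-mean at age 40 */
    sigma2 = 23.5344 + 0.004184*40**2; /* intercept variance plus age-slope variance at age 40 */
    p_cond = logistic(eta);            /* ~0.056, as printed */
    p_marg = logistic(eta / sqrt(1 + 0.346*sigma2)); /* ~0.30, right in the 20-40% raw-data range */
    put p_cond= p_marg=;
run;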

SteveDenham
Jade | Level 19

The mean-age LSMeans are computed at the mean age across all groups, not within each group.

The longer this goes on, the more I worry that age is missing for a lot of these observations, but that your calculation of raw prevalences still includes the subjects with missing age values. Is that a possibility?

Steve Denham

zemastear
Calcite | Level 5

I took a look, but age is not missing. It's actually quite complete.

SteveDenham
Jade | Level 19

Any miscodings?  Possible high leverage points?  I am at a loss for explanations at this point.

Steve Denham

zemastear
Calcite | Level 5

I do not expect anything to be wrong with the data. It has been used for many other projects and has been around for years. It is also constantly managed by someone who does all the data operations on it. I am only using the data as it is; I did some minor editing, which I checked, and I am sure it is all correct.

If we are not able to find the cause of this, then too bad, I'd say. Thanks a lot anyway.
