Re: Modeling smoking in PROC GLIMMIX

zemastear · Posted 12-07-2012 10:39 AM

Hello there,

I am trying to fit a multilevel random effect model on my data.

The model looks like this:

proc glimmix data=dataname initglm /*abspconv=1E-4*/ method=quad;

model smoking(event="1") = dum60 dum70 dum80 age age*dum60 age*dum70 age*dum80 / dist=binary link=logit cl covb s;

random intercept age / subject=id;

by sex;

run; quit;

Smoking is a dichotomous outcome variable (yes = 1, no = 0), age as a continuous variable is one of the predictors and there is an interaction of age with generation. I made dummies (dum50 (reference category), dum60, dum70, dum80) for each generation. Generations are defined by their age at baseline. So, we have a generation which is age 60-69 at baseline (dum60), etc.

I am using a long dataset and we have 4 observations for the subjects (cohort study). Hence we are using proc glimmix, to correct for repeated measurements.

I have a few problems with this model.

- Note the statement "by sex". I want to analyse men and women separately. When I run the model, everything works perfectly for men. For women, the model "doesn't work". I get an error: ERROR: Infeasible parameter values for evaluation of objective function with 1 quadrature point. I tried to google this, but I couldn't find it anywhere. I talked to a statistician already, but he couldn't really help me, except that he told me to change the converging criteria. I did a few things:

1) nloptions gconv = 1E-3 fconv = 1E-3.

2) abspconv=1E-4.

3) change method from=quad to method=laplace.

All didn't work.

- Another, totally different problem, is that when the model works with age and intercept as random effects, the prevalence estimates do not correspond with the prevalence I observe when making frequency tables.

For example, I ran some estimates together with the model:

estimate 'gen 60-69 at age = 70' intercept 1 gendum60 1 leeftijd 70 leeftijd*gendum60 70/ilink;

It gave me a prevalence at age 70 of 1% or something, using the ilink feature, which calcs the outcome back to prevalences (right?). When I run frequency tables for this generation, I see that at age 70, they have a prevalence of ~20%. And it's not only for this specific generation at this age, it's for every estimate I do. The model doesn't represent my data well, at all. Somethimes it even gives prevalences of 1E-7, which is of course very weird. Again, I talked to the statistician and we tried to run the model w/o random intercept and age. The prevalences as estimated by the model were very accurate as compared to the real data! But my question is: what happens when you remove the random intercept and age? I understand where you correct for, when using them. But when removing them, am I still correcting for repeated measurements for every subject? One thing that I noted is that without random effects, 'Subjects in Blocks' is 1, instead of the ~1000 I usually have for men.

Lots of stuff, I hope you guys can help me

SteveDenham · Posted 12-10-2012 08:23 AM

See my reply in the SAS Procedures forum. After thinking another 10 minutes, and seeing what is going on when you remove the random effect of age, I am more convinced than ever that you need to include the random effect of age in your estimate statement.

Try:

estimate 'gen 60-69 at age = 70' intercept 1 gendum60 1 leeftijd 70 leeftijd*gendum60 70 | leeftijd 70 /ilink; /* Added leeftijd as a random effect */

This is untested, and I will be curious if it gives you what you might be looking for.

Steve Denham

zemastear · Posted 12-10-2012 08:50 AM

Smoking full model

The GLIMMIX Procedure

sex respondent=1

Data Set	**
Response Variable	smoker
Response Distribution	Binary
Link Function	Logit
Variance Function	Default
Variance Matrix Blocked By	id
Estimation Technique	Maximum Likelihood
Likelihood Approximation	Gauss-Hermite Quadrature
Degrees of Freedom Method	Containment

Number of Observations Read	5790
Number of Observations Used	3473

Ordered Value	smoker	Total Frequency
1	0	2651
2	1	822

The GLIMMIX procedure is modeling

the probability that roker='1'.

G-side Cov. Parameters	2
Columns in X	8
Columns in Z per Subject	2
Subjects (Blocks in V)	931
Max Obs per Subject	6

Optimization Technique	Dual Quasi-Newton
Parameters in Optimization	10
Lower Boundaries	2
Upper Boundaries	0
Fixed Effects	Not Profiled
Starting From	GLM estimates
Quadrature Points	7

0	4	2664.2076557	.	247518.4
1	15	2531.8794796	132.32817606	10382.23
2	3	2413.6629589	118.21652075	12986.84
3	4	2411.987591	1.67536786	12781.79
4	4	2411.7530459	0.23454511	12714.24
5	3	2410.6125111	1.14053478	12417.07
6	4	2389.3924796	21.22003154	2125.985
7	2	2370.5918749	18.80060463	8539.3
8	2	2340.5986904	29.99318457	10467.24
9	3	2333.1846772	7.41401318	20957.11
10	4	2309.6450045	23.53967266	3675.545
11	3	2305.5393632	4.10564133	3792.51
12	3	2303.766906	1.77245715	1238.087
13	3	2303.3039486	0.46295749	929.9581
14	3	2303.0939216	0.21002696	224.6986
15	4	2299.8431727	3.25074887	2301.9
16	2	2297.2327543	2.61041844	1237.504
17	2	2293.4149064	3.81784791	535.4416
18	2	2288.5668127	4.84809373	2638.173
19	4	2287.8921513	0.67466138	1248.713
20	3	2287.5832383	0.30891300	125.6845
21	3	2287.5658631	0.01737515	127.9899
22	2	2287.5444346	0.02142856	101.2652
23	6	2287.0016694	0.54276516	394.4775
24	3	2286.6363686	0.36530084	150.4604
25	3	2286.4653755	0.17099304	193.7834
26	3	2286.3540602	0.11131532	36.6376
27	3	2286.3430099	0.01105026	35.43862
28	3	2286.3427712	0.00023876	9.727946
29	3	2286.3427548	0.00001641	3.908154

Convergence criterion (GCONV=1E-8) satisfied.

-2 Log Likelihood	2286.34
AIC (smaller is better)	2306.34
AICC (smaller is better)	2306.41
BIC (smaller is better)	2354.71
CAIC (smaller is better)	2364.71
HQIC (smaller is better)	2324.79

-2 log L(smoker\| r. effects)	690.83
Pearson Chi-Square	690.37
Pearson Chi-Square / DF	0.20

Covariance Parameter Estimates

Cov Parm

Subject

Estimate

Standard Error

Intercept

respnr

24.9367

4.9515

age

respnr

0.002104

.

Solutions for Fixed Effects
Effect	Estimate	Standard Error	DF	t Value	Pr > \|t\|	Alpha	Lower	Upper
Intercept	10.7707	2.1945	927	4.91	<.0001	0.05	6.4641	15.0774
dum60	1.6981	3.0139	1610	0.56	0.5732	0.05	-4.2135	7.6096
dum70	7.3854	4.1763	1610	1.77	0.0772	0.05	-0.8062	15.5770
dum80	3.7303	9.8024	1610	0.38	0.7036	0.05	-15.4965	22.9571
age	-0.2081	0.03392	928	-6.13	<.0001	0.05	-0.2747	-0.1415
dum60*age	-0.03188	0.04490	1610	-0.71	0.4778	0.05	-0.1200	0.05619
dum70*age	-0.07526	0.05662	1610	-1.33	0.1840	0.05	-0.1863	0.03580
dum80*age	-0.02561	0.1173	1610	-0.22	0.8272	0.05	-0.2557	0.2045

Type III Tests of Fixed Effects
Effect	Num DF	Den DF	F Value	Pr > F
dum60	1	1610	0.32	0.5732
dum70	1	1610	3.13	0.0772
dum80	1	1610	0.14	0.7036
age	1	928	37.62	<.0001
dum60*age	1	1610	0.50	0.4778
dum70*age	1	1610	1.77	0.1840
dum80*age	1	1610	0.05	0.8272

Covariance matrix for fixed effects
Intercept	1	4.8156	-4.6886	-4.6077	-4.6707	-0.07191	0.06941	0.06868	0.06894
dum60	2	-4.6886	9.0835	4.8457	4.8053	0.06926	-0.1317	-0.07167	-0.07161
dum70	3	-4.6077	4.8457	17.4418	4.8857	0.06754	-0.07216	-0.2319	-0.07319
dum80	4	-4.6707	4.8053	4.8857	96.0872	0.06902	-0.07145	-0.07225	-1.1404
age	5	-0.07191	0.06926	0.06754	0.06902	0.001151	-0.00110	-0.00108	-0.00109
dum60*age	6	0.06941	-0.1317	-0.07216	-0.07145	-0.00110	0.002016	0.001139	0.001138
dum70*age	7	0.06868	-0.07167	-0.2319	-0.07225	-0.00108	0.001139	0.003206	0.001154
dum80*age	8	0.06894	-0.07161	-0.07319	-1.1404	-0.00109	0.001138	0.001154	0.01377

Estimates used

estimate 'gen 55-59 vs. gen 60-69 at age = 80' dum60 -1 age*dum60 -70 / cl;

estimate 'gen 55-59 at age = 70' intercept 1 age 70 / ilink;

estimate 'gen 60-69 at age = 70' intercept 1 dum60 1 age 70 age*dum60 70 / ilink;

Label	Estimate	Standard Error	DF	t Value	Pr > \|t\|	Alpha	Lower	Upper	Mean	Standard Error Mean	Lower Mean	Upper Mean
gen 60-69 vs. gen 70-79 at age = 80	-2.2174	0.7119	1610	-3.11	0.0019	0.05	-3.6138	-0.8210	Non-est	.	.	.
gen 60-69 at age = 80	-6.7284	0.5968	1610	-11.27	<.0001	0.05	-7.8991	-5.5577	0.001195	0.000712	0.000371	0.003843
gen 70-79 at age = 80	-4.5110	0.5080	1610	-8.88	<.0001	0.05	-5.5074	-3.5147	0.01087	0.005461	0.004040	0.02890

This is the output of my analysis (as asked for in the other topic, but let's continue here).

I will try adding the random effect in the estimate statement. Do I need to add the random effect in all my estimates? So, also in the first estimate for the difference between two generations (gen 60-69 vs. gen 70-79 at age = 80)? If so, do I put it in the same way as for the other statement? (| leeftijd 70)?

Thnx a lot Steve, I will let you know what happens.

zemastear · Posted 12-10-2012 09:21 AM

Adding the random effect of age in the estimate statement (did it for the second and third estimate, not the first) did cause the prevalence to rise, but not near as high as it 'should' be / as I want it to be / as the raw data looks like.

SteveDenham · Posted 12-10-2012 10:10 AM

After seeing the estimates of the random effect, I don't think that will put things in the right order of magnitude. The logit of the response is log (822/3473) - log (2651/3473) = -1.171. When I plug in values, I get values that are off by the magnitude you are reporting.

What happens when you use the LSMEANS statement?

Try

lsmeans gendum60/at means ilink e;

lsmeans gendum60/at age=70 ilink e;

Now I don't expect this to really work, since I think you have correctly specified the estimate statement, but it might trigger something in my dinosaur brain on a Monday morning. If the e option doesn't give back what is going into the estimate statement, then we have what I am looking for.

There is something niggling at my thoughts--Why not make the 4 age cohorts class variables? I think we can get at the comparisons of interest through the LSMESTIMATE statement. It would also enable the use of the noint option on the model statement. Something like:

proc glimmix data=dataname initglm /*abspconv=1E-4*/ method=quad;

class cohort(ref=first);/* Assumes that this is on the dataset already, and sets the age50 cohort as the reference group */

model smoking(event="1") = cohort age age*cohort/ noint dist=binary link=logit cl covb s;

random intercept age / subject=id;

lsmeans cohort/at means ilink e;

lsmeans cohort/at age=70 ilink e;

lsmeans cohort/at age=80 ilink e;

by sex;

run; quit;

See if these give you values that look more appropriate. If they look about right, then we can work on comparisons of interest. Some of these will have values, but be meaningless--for instance, the 80-89 cohort at age=70. If the lsmeans on the ilink scale are still off by several orders of magnitude, then I think some examination of data, etc. is where we need to go.

Steve Denham

Message was edited by: Steve Denham

zemastear · Posted 12-11-2012 08:06 AM

Hey Steve,

Thanks again.

I am far from an expert in sas, but I had looked around for myself a bit the other day and I already thought that it should be possible somehow to model what I want to model, and that lsmeans/lsmestimate looked like the good solution of first sight. I had no clue how to do it though, since I only know the statistics procedure that I have done in the past. Everything else is new to me, as is this.

I talked to my statistician yesterday and now we decided to go another way (a way I do not really like so much, especially if your suggestions turns out to be working), but for now I will go on with what I have already, since time is running out for my project. I expect to have more time near the end of this month / start of next month (as I will be reviewing my article with the co-authors, I will have plenty of time inbetween I think) and that is when I want to try your method. So, I will let this here for now, but I will get back to it sometime. I just think its impossible that there is no way in sas to do what I want to do. Therefore I am interested in experimenting with your suggestion somewhere in the near future.

SteveDenham · Posted 12-11-2012 08:23 AM

Good luck. I hope that your method works out (just guessing that you will be calculating odds ratios using frequency tables).

Steve Denham

zemastear · Posted 12-11-2012 08:30 AM

We will just present the difference in prevalences based on the raw data. In order to say it those differences are significant, we will look at the model.

So, if we see a difference of let's say 8% between generation 55-59 (reference) and 60-69at age 70, in the raw data, we will present that number. Then, we will look at this estimate:

estimate 'gen 55-59 vs. gen 60-69 at age = 70' dum60 -1 age*dum60 -70 / cl;

If this is significant ("Pr > |t|" < 0.05) then we will say: 8% difference is significant.

I don't like this method so much....

SteveDenham · Posted 12-11-2012 08:35 AM

Me either, since the individual estimates are so poorly aligned with the raw data.

Picking battles that you can win is the key to military success. Also to statistical consulting.

Steve Denham

zemastear · Posted 12-12-2012 10:17 AM

Here I am again. The results of my analyses were so poor that I decided to try this.

After seeing the estimates of the random effect, I don't think that will put things in the right order of magnitude. The logit of the response is log (822/3473) - log (2651/3473) = -1.171. When I plug in values, I get values that are off by the magnitude you are reporting.

I don't know exactly what you are calculating there and if it is important, but when I calculate this, I get -0.50853791?

What happens when you use the LSMEANS statement?

Try

lsmeans gendum60/at means ilink e;

lsmeans gendum60/at age=70 ilink e;

lsmeans gendum60/at means ilink e;

ERROR: Only class variables allowed in this effect.




Smoking Full





The GLIMMIX Procedure

sex=1

Data Set	WORK.dataname
Response Variable	smoking
Response Distribution	Binary
Link Function	Logit
Variance Function	Default
Variance Matrix Blocked By	id
Estimation Technique	Maximum Likelihood
Likelihood Approximation	Gauss-Hermite Quadrature
Degrees of Freedom Method	Containment

ageclass	4	1 2 3 4

Number of Observations Read	11964
Number of Observations Used	8182

1	0	5605
2	1	2577

G-side Cov. Parameters	2
Columns in X	9
Columns in Z per Subject	2
Subjects (Blocks in V)	2991
Max Obs per Subject	3

Optimization Technique	Dual Quasi-Newton
Parameters in Optimization	10
Lower Boundaries	2
Upper Boundaries	0
Fixed Effects	Not Profiled
Starting From	GLM estimates
Quadrature Points	7

0	4	7924.0808061	.	296043.2
1	14	7464.3488984	459.73190764	11431.85
2	3	7235.5407215	228.80817687	1259.722
3	4	7222.9621138	12.57860771	1491.717
4	4	7217.0681408	5.89397306	1635.135
5	4	7216.2814485	0.78669222	1605.412
6	4	7206.613118	9.66833050	3964.965
7	4	7142.9150265	63.69809154	10381.62
8	3	7138.4727633	4.44226324	3083.605
9	2	7131.8228528	6.64991052	4798.037
10	4	7111.9430664	19.87978634	2394.794
11	2	7101.0562478	10.88681862	4136.373
12	3	7096.4972303	4.55901754	2491.948
13	3	7094.4543614	2.04286881	918.8764
14	4	7086.9710653	7.48329611	4893.221
15	2	7082.7288234	4.24224195	1764.441
16	3	7080.0529968	2.67582655	442.8867
17	3	7079.6473463	0.40565057	864.7367
18	2	7079.4226727	0.22467353	608.6162
19	2	7079.0440924	0.37858034	45.46735
20	4	7078.2098257	0.83426673	1435.201
21	4	7076.2935373	1.91628834	893.0885
22	3	7075.9743026	0.31923472	297.7529
23	3	7075.8962293	0.07807334	104.861
24	4	7075.6627013	0.23352799	1142.108
25	4	7073.7881887	1.87451260	1057.395
26	3	7073.0449224	0.74326630	340.2352
27	3	7072.9644057	0.08051665	157.9887
28	3	7072.957057	0.00734868	67.15604
29	3	7072.9554633	0.00159374	35.61595
30	4	7072.9476656	0.00779774	52.25779
31	3	7072.9459177	0.00174785	6.514212
32	3	7072.9457961	0.00012161	0.446825
33	3	7072.945793	0.00000312	0.040506

Convergence criterion (GCONV=1E-8) satisfied.

-2 Log Likelihood	7072.95
AIC (smaller is better)	7092.95
AICC (smaller is better)	7092.97
BIC (smaller is better)	7152.98
CAIC (smaller is better)	7162.98
HQIC (smaller is better)	7114.54

-2 log L(smoking \| r. effects)	2009.74
Pearson Chi-Square	1444.48
Pearson Chi-Square / DF	0.18

Intercept	id	23.5344	2.4922
age	id	0.004184	0.000835

ageclass	1	0.8444	0.7731	2209	1.09	0.2749	0.05	-0.6717	2.3605
ageclass	2	1.7424	0.7304	2209	2.39	0.0171	0.05	0.3101	3.1746
ageclass	3	1.0363	0.9537	2209	1.09	0.2774	0.05	-0.8340	2.9066
ageclass	4	9.0983	1.6102	2209	5.65	<.0001	0.05	5.9406	12.2560
age		-0.2268	0.02837	2978	-8.00	<.0001	0.05	-0.2825	-0.1712
age*ageclass	1	0.1352	0.03717	2209	3.64	0.0003	0.05	0.06233	0.2081
age*ageclass	2	0.1237	0.03288	2209	3.76	0.0002	0.05	0.05918	0.1881
age*ageclass	3	0.1466	0.03351	2209	4.37	<.0001	0.05	0.08086	0.2123
age*ageclass	4	0	.	.	.	.	.	.	.

ageclass	4	2209	9.67	<.0001
age	1	2978	107.62	<.0001
age*ageclass	3	2209	7.02	0.0001

ageclass	1	1	0.5977	0.002968	0.004430	0.01476	-0.00031	-0.01738	0.000216	0.000188
ageclass	2	2	0.002968	0.5334	0.009498	0.03279	-0.00068	0.000594	-0.01201	0.000424
ageclass	3	3	0.004430	0.009498	0.9096	0.04726	-0.00098	0.000860	0.000693	-0.01710
ageclass	4	4	0.01476	0.03279	0.04726	2.5928	-0.04467	0.04389	0.04336	0.04314
age		5	-0.00031	-0.00068	-0.00098	-0.04467	0.000805	-0.00079	-0.00078	-0.00077
age*ageclass	1	6	-0.01738	0.000594	0.000860	0.04389	-0.00079	0.001382	0.000763	0.000759
age*ageclass	2	7	0.000216	-0.01201	0.000693	0.04336	-0.00078	0.000763	0.001081	0.000753
age*ageclass	3	8	0.000188	0.000424	-0.01710	0.04314	-0.00077	0.000759	0.000753	0.001123
age*ageclass	4	9

ageclass	1	1
ageclass	2		1
ageclass	3			1
ageclass	4				1
age		45.45	45.45	45.45	45.45
age*ageclass	1	45.45
age*ageclass	2		45.45
age*ageclass	3			45.45
age*ageclass	4				45.45

1	45.45	514520	-3.3197	0.5074	2209	-6.54	<.0001	0.03490	0.01709
2	45.45	514520	-2.9478	0.2642	2209	-11.16	<.0001	0.04984	0.01251
3	45.45	514520	-2.6121	0.2471	2209	-10.57	<.0001	0.06837	0.01574
4	45.45	514520	-1.2118	0.4404	2209	-2.75	0.0060	0.2294	0.07785

ageclass	1	1
ageclass	2		1
ageclass	3			1
ageclass	4				1
age		40	40	40	40
age*ageclass	1	40
age*ageclass	2		40
age*ageclass	3			40
age*ageclass	4				40

1	40.00	514520	-2.8204	0.4056	2209	-6.95	<.0001	0.05623	0.02153
2	40.00	514520	-2.3854	0.2292	2209	-10.41	<.0001	0.08429	0.01769
3	40.00	514520	-2.1746	0.2810	2209	-7.74	<.0001	0.1021	0.02575
4	40.00	514520	0.02444	0.5534	2209	0.04	0.9648	0.5061	0.1383

ageclass	1	1
ageclass	2		1
ageclass	3			1
ageclass	4				1
leeftijd		50	50	50	50
age*ageclass	1	50
age*ageclass	2		50
age*ageclass	3			50
age*ageclass	4				50

1	50.00	514520	-3.7366	0.6025	2209	-6.20	<.0001	0.02328	0.01370
2	50.00	514520	-3.4173	0.3153	2209	-10.84	<.0001	0.03176	0.009695
3	50.00	514520	-2.9773	0.2523	2209	-11.80	<.0001	0.04846	0.01163
4	50.00	514520	-2.2440	0.3700	2209	-6.07	<.0001	0.09587	0.03207

Basically I just run the model you suggested and presented you the output.

The ref=first statement somehow didn't work so I left it out. The model is now using category 4 as reference, right? I don't think it should matter much, when it is about the estimates/lsmeans that we are interested in? Though it would be nice to still have the youngest ageclass as reference. I tried order = data, but that didn't change things.

Note: I used other age classes now (20-29, 30-39, 40-49, 50-59), since I am using another dataset now (I have two datasets that I use for this project). But it shouldn't matter, since I had the problems of not being able to estimate the correct prevalences in both datasets. I think the method is incorrect and therefore don't expect the datasets to influence the outcomes.

So, what do we get here? We have three lsmeans statements. One that estimates for every ageclass the prevalence at their mean age? And two others which estimate the prevalence at age = 40 and age = 50? Looking at the tables, it looks like it still gives low prevalences? In these age classes, based in the raw data, I expect prevalences of around 30% (ranging from 20% to 40%) depending on the ageclass of interest, but definately not the 2-10% we are seeing now, am I right?

SteveDenham · Posted 12-13-2012 07:53 AM

The mean age lsmeans are at the mean across all groups, not within each group.

The more this goes on, I become worried that age is missing for a lot of these observations, but that your calculation of raw prevalences still includes those subjects with missing age values. Is that a possibility?

Steve Denham

zemastear · Posted 12-13-2012 08:11 AM

I took a look, but age is not missing. It's actually quite complete.

SteveDenham · Posted 12-13-2012 08:15 AM

Any miscodings? Possible high leverage points? I am at a loss for explanations at this point.

Steve Denham

zemastear · Posted 12-13-2012 08:27 AM

I do not expect something to be wrong with the data. It has been used for many other projects and it has been around for years. It is also constantly managed by someone who does all the data operations on it. I am only making use of the data as it is and did some minor editing, which I checked and I am sure are all correct.

If we are not able to find the cause of this, then too bad I'd say. Thanks a lot anyway.

Catch up on SAS Innovate 2026