edhuang
Obsidian | Level 7

Hi,

 

I am new to proc glimmix and am trying to get the intraclass correlation in a 2-level nested model with a binary outcome (polyp: yes or no). Level 1 is the physician and level 2 is the clinic in which physicians are nested.

 

proc glimmix data=ADR noclprint method=laplace nobound;
class MD MDlocation patient  ;
model polyp_yes(event=last)=/CL Dist=binary link=logit solution;
random intercept/sub=MDlocation type=vc s cl;
random intercept/sub=MD(MDlocation) type=vc s cl;
run;

 
Here is my output for Covariance Parameter Estimates:

Cov Parm     Subject           Estimate    Standard Error
Intercept    MDlocation        -0.7062     .
Intercept    MD(MDlocation)    13.6852     .

 

Here are my questions.

1) Is my proc glimmix code correct?

2) How come my standard error is missing? 

3) How do I calculate the intraclass correlation for each of the two levels for a binary outcome? In other words, what % of total variation is accounted for by MD and what % of total variation is accounted for by MDlocation?

sld
Rhodochrosite | Level 12

The missing SEs suggest there is something that is incompatible between your data and your model. Can you show us an example of what your data set looks like? Also, how many clinics? How many physicians within clinics? How many patients within physicians within clinic? (Roughly; it's probably unbalanced.)

 

 

edhuang
Obsidian | Level 7

I have 130,681 patients, 30 physicians, and 3 clinics. If it is unbalanced, is there anything I can do? Should I analyze a subset of patients/MDs?

 

 

Counts of polyp_yes by MDlocation:

MDlocation    polyp_yes=0    polyp_yes=1
0                   15795           8936
1                   37505          17034
2                   38236          13175

 

 

Counts of polyp_yes by MD:

MD    polyp_yes=0    polyp_yes=1
5            4087              4
6            1666              0
7            3188           2751
9            5014           2384
12           2902           1748
13           3221           1259
15           2087           1390
19           1894           1264
20           3574              0
21           1773            518
22           1806            249
25           8277           4525
27           1491           1270
29            688            559
30           4395              1
32           1550              0
33           5342              1
36           5583           2915
37           3360           1802
39           3361              0
40            447              0
41            922            508
43           3855           2454
44           1894           1084
45           4148           2634
46           4618           2114
49           2378           1390
52           4635           1736
54           3682           2641
58           3377           1874

 

 

Counts of MD by MDlocation:

MD    MDlocation=0    MDlocation=1    MDlocation=2
5                0               0             282
6             1666               0               0
7                0               0            5939
9                0               0            7398
12            4650               0               0
13               0            4480               0
15            3477               0               0
19            3158               0               0
20               0               0            3574
21               0            2291               0
22               0            2055               0
25               0           12802               0
27            2761               0               0
29               0               0            1247
30               0               0            4396
32               0            1550               0
33               0               0            5343
36               0            8498               0
37               0               0            5162
39               0               0            3361
40               0               0             447
41               0               0            1430
43               0               0            6309
44               0            2978               0
45               0            6782               0
46               0            6732               0
49            3768               0               0
52               0            6371               0
54               0               0            6323
58            5251               0               0

 

sld
Rhodochrosite | Level 12

Thank you, that's very helpful.

 

With only 3 levels of MDlocation, your ability to estimate a variance is limited; there is just not enough information. In the model run that you reported, the MDlocation variance was estimated as a negative value because you specified the nobound option; otherwise the estimate would have been set to zero. If your goal is to estimate an ICC for the MDlocation and MD variances, you just don't have enough data to work with.
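
For reference, here is a minimal sketch of the same model without nobound, so that GLIMMIX applies its default lower bound of zero to the variance estimates. The data set and variable names come from the original post; dropping patient from the CLASS statement (it is not used in the model) and dropping the s and cl options on the RANDOM statements are my own simplifications, not something the poster ran.

```sas
/* Sketch: same two random intercepts as the original call, but without
   NOBOUND, so variance components are constrained to be >= 0. */
proc glimmix data=ADR noclprint method=laplace;
  class MD MDlocation;
  model polyp_yes(event=last) = / dist=binary link=logit solution cl;
  random intercept / subject=MDlocation;       /* clinic-level variance */
  random intercept / subject=MD(MDlocation);   /* MD-within-clinic variance */
run;
```

With only 3 clinics, expect the MDlocation variance to land on the zero boundary; that outcome is itself a sign that the data cannot support that variance component.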

 

I am puzzled by an aspect of your tables. In the second table, MD = 5 appears to have 4087 + 4 = 4091 observations. But in the third table, MD = 5 is reported as being at MDlocation = 2, with 282 observations. On a quick scan, the next few MDs look OK, but I didn't look at all 30. Maybe it's just a copy/paste error in the message.

 

Also, I'm struck by how some MDs have very few polyp_yes observations, while many have about 30%. What distinguishes these two groups of physicians, if anything?

 

 

 

edhuang
Obsidian | Level 7
Thanks. That's helpful. I made a copy/pasting error on MD5. Thanks for catching that. The polyp observations were missing for some physicians, hence the variability.

Follow-up questions:
1) How do you estimate the number of locations you need to get an SE? Is it a ratio between level 1 and level 2 predictors?
2) Assuming that I did get a SE, is there a formula to calculate the ICC for a two level model with a binary outcome? I understand it is estimate/(estimate+3.29) for 1 level, but what about for two levels?
3) I also tried using patient level (level 1) and then nested within physicians (level 2). However, log said the system can't handle this large number. Is there an easy way to deal with this?
SteveDenham
Jade | Level 19

I can take a shot at numbers 1 and 3, but I really lack expertise on ICC so I will let number 2 go.

 

1.  A good rule of thumb for estimating a variance component for binary data is to have at least 10 clusters for the level in question.  My source here is from the R community (see anything online from Bolker or Zuur).

3.  For this design, the patient level is completely confounded with residual error, so there is no need to include it as a level. There should be one more covariance parameter estimate given for residual in your output.  If not, then there is something else going on.

 

Additionally, although I hate throwing out data, consider eliminating those MDs with 5 or fewer observations.  Also, you may want to have more levels for MDLocation, so perhaps a more granular classification would be in order.  You may not be able to come up with 10, but certainly there is some information that would lead to more than 3 levels.
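
A sketch of one way to build that subset. No MD in the posted tables has 5 or fewer records in total, so this assumes the threshold is meant to apply to the number of polyp_yes = 1 records per MD; that reading, the numeric 0/1 coding of polyp_yes, and the output name ADR_subset are all assumptions.

```sas
/* Keep only records from MDs with more than 5 recorded polyps.
   Assumes polyp_yes is numeric 0/1; ADR_subset is a hypothetical name. */
proc sql;
  create table ADR_subset as
  select *
  from ADR
  where MD in (select MD
               from ADR
               group by MD
               having sum(polyp_yes = 1) > 5);
quit;
```

In PROC SQL the comparison polyp_yes = 1 evaluates to 1 or 0, so the sum counts events per MD. Change the HAVING clause to count(*) > 5 if the cutoff is meant to apply to total records instead.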

 

SteveDenham

edhuang
Obsidian | Level 7
Hi Steve,

Thanks for your reply. This is extremely helpful. I believe I can get further granularity on the location, so I will try that.
I will also try eliminating the MDs with few observations.
sld
Rhodochrosite | Level 12

Q1: There are rules of thumb as @SteveDenham notes; rather than 10, I would have said 20-30, but that's what rules of thumb are 🙂 More formally, you can determine the sample size required to estimate a variance (or standard deviation) with a given precision, analogous to determining sample size required to estimate a mean with a given precision. An internet search will turn up several resources.
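
As a rough illustration of why the number of clusters matters so much, consider a best case that is not stated anywhere in this thread: the cluster effects are normally distributed and each one is measured with negligible error. Under that assumption, the relative standard error of a variance estimated from k clusters is approximately:

```latex
\frac{\mathrm{SE}(\hat{\sigma}^2)}{\sigma^2} \approx \sqrt{\frac{2}{k-1}}
\qquad\Rightarrow\qquad
\begin{array}{c|cccc}
k & 3 & 10 & 20 & 30 \\ \hline
\sqrt{2/(k-1)} & 1.00 & 0.47 & 0.32 & 0.26
\end{array}
```

So with 3 clinics the standard error is about as large as the variance itself, even in this idealized setting, while 20 to 30 clusters bring it down to roughly a quarter to a third. A binary outcome on the logit scale will generally do worse than this.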

 

Q2: Like Steve, I don't know much about computing ICC in a mixed model with binary data. On a quick scan, this blogpost looks reasonable, and it points out that with binary data, there is no residual variance per se, because the variance and the mean are determined by the same parameters. For more detail, there are also papers, among them:

 

Nakagawa S, Johnson P, Schielzeth H (2017) The coefficient of determination R² and intra-class correlation coefficient from generalized linear mixed-effects models revisited and expanded. J. R. Soc. Interface 14. https://doi.org/10.1098/rsif.2017.0213

 

Wu S, Crespi CM, Wong WK. 2012. Comparison of methods for estimating the intraclass correlation coefficient for binary responses in cancer prevention cluster randomized trials. Contemp Clin Trials 33(5):869-80. doi: 10.1016/j.cct.2012.05.004
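
For what it is worth, the commonly used latent-variable extension of the one-level formula quoted above, estimate/(estimate + 3.29), to two random intercepts can be sketched as follows. The symbols are my own notation: the two variances are the MDlocation and MD(MDlocation) covariance parameter estimates on the logit scale, and pi^2/3 (about 3.29) is the variance of the standard logistic distribution. This is one of several ICC definitions compared in the papers above, not the only valid choice.

```latex
\sigma^2_{\text{total}} = \sigma^2_{\text{clinic}} + \sigma^2_{\text{MD}} + \frac{\pi^2}{3}

\mathrm{ICC}_{\text{clinic}} = \frac{\sigma^2_{\text{clinic}}}{\sigma^2_{\text{total}}},
\qquad
\mathrm{ICC}_{\text{MD}} = \frac{\sigma^2_{\text{MD}}}{\sigma^2_{\text{total}}},
\qquad
\mathrm{ICC}_{\text{same MD}} = \frac{\sigma^2_{\text{clinic}} + \sigma^2_{\text{MD}}}{\sigma^2_{\text{total}}}
```

The first two are the shares of latent-scale variance attributable to clinics and to MDs within clinics; the third is the correlation between two patients of the same MD. All three require non-negative variance estimates, so they cannot be computed meaningfully from a fit where nobound produced a negative value.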

 

Q3: For the binary mixed model, the residual variance depends upon the expected value and so it cannot be estimated directly from the data. Thus, you do not want to force your model to estimate a residual variance.

 

See Section 7 in Nakagawa et al (2017) for a discussion of the distinction between using an observation-level variance (estimated using the delta method) versus a distribution-specific variance.
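
A small sketch of the distribution-specific approach in a DATA step: the level-1 variance is fixed at pi^2/3 for the logit link rather than estimated, and the ICCs are computed from the two variance components. The values 0.10 and 0.50 are hypothetical placeholders, not results from this thread (the posted estimates include a negative variance and cannot be used here), and the variable names are my own.

```sas
/* Latent-scale ICCs for a logistic model with two random intercepts.
   The variance values below are hypothetical; replace them with the
   estimates from the Covariance Parameter Estimates table. */
data icc;
  var_location = 0.10;                     /* assumed clinic-level variance */
  var_md       = 0.50;                     /* assumed MD-within-clinic variance */
  var_level1   = constant('pi')**2 / 3;    /* logistic variance, about 3.29 */
  total        = var_location + var_md + var_level1;
  icc_location = var_location / total;             /* share due to clinics */
  icc_md       = var_md / total;                   /* share due to MDs within clinics */
  icc_same_md  = (var_location + var_md) / total;  /* two patients of the same MD */
run;

proc print data=icc noobs;
run;
```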

 

I hope this helps.

 

edhuang
Obsidian | Level 7
Hi Sld,

Thanks to you and Steve for your replies. They are very helpful. I will use your references; they will come in handy in the future!
sld
Rhodochrosite | Level 12

Regarding that polyp data were missing for some physicians: What is the nature of this missingness? Does this mean that some of the patients in the polyp=0 category actually had polyps that their physician did not note in the chart? I would be concerned about this as a source of bias. How many of your physicians are bad record keepers, and should they be omitted from the analysis?

 

edhuang
Obsidian | Level 7
Hi,

You raise good points. The nature of missingness is likely due to both technical issues from data extraction and also lack of reporting by physicians. Yes, so they may actually have polyps. Fortunately, most of these are smaller observations. And you are right that bias may be introduced. Not sure if it is random bias. I will perform my analysis with and without them to determine the final model.
sld
Rhodochrosite | Level 12

It's so dichotomous (physicians have either almost zero or about 30% polyps) that I doubt it is random. The "smaller observations" is actually a problem, not a salve. Merely running the analysis with and without these potentially problematic physicians does not address the underlying issue. I say, give this more thought. Good luck!
