I am trying to understand how to interpret the intraclass correlation coefficient, which I am calculating from proc glimmix.
• For this example I am using the Hospital, Doctor, Patient (HDP) dataset from UCLA. https://stats.oarc.ucla.edu/r/codefragments/mesimulation/
• I will be modeling for the outcome of cancer remission and my fixed predictors will be IL6, CancerStage, Age, and FamilyHx, and patients are nested within hospitals (HID)
• I will assume that I do not have doctor ID's (DID), so I will not be nesting by doctors, nor do I have variables describing the characteristics of hospitals
• I want to know the odds of remission by the specified predictors, while also accounting for differences in remission between or within hospitals - which is where the intraclass correlation coefficient comes in.
• These resource states to calculate the ICC from the "empty, unconditional model with no predictors", only the random intercept: https://support.sas.com/resources/papers/proceedings15/3430-2015.pdf , https://www.lesahoffman.com/CLDP945/CLDP945_Example07a_Binary_Clustered.pdf , https://www.lexjansen.com/sesug/2015/173_Final_PDF.pdf , https://support.sas.com/resources/papers/proceedings12/350-2012.pdf – Do I calculate the ICC from the random intercept model, or from the full model? How do I interpret the ICC?
With the random intercept model, I calculate an ICC of .093. Does this mean that 9% of the variability in going into remission is accounted for by the hospitals? --> ICC = Covariance Parameter Estimate/(Covariance Parameter Estimates + 3.29) -->
ICC = 0.3366/(0.3366+3.29) = .093
filename csvFile url "https://stats.idre.ucla.edu/stat/data/hdp.csv" termstr=crlf;
proc import datafile=csvFile
dbms=csv
out=hdp ;
getnames=yes;
guessingrows=3000;
run;
proc sql;
select
distinct(FamilyHx)
from hdp;
quit;
proc glimmix data=hdp PLOTS=(ALL ODDSRATIO(logbase=10 TYPE=HORIZONTALSTAT)) namelen=30 ;
class HID ;
model remission(event='1') =
/ distribution=binary link=logit cl oddsratio solution ddfm=betwithin;
random intercept / subject=HID solution CL ;
COVTEST /WALD;
run;
quit;
/* Covariance Parameter Estimates for HID = 0.3366
ICC = Covariance Parameter Estimate/(Covariance Parameter Estimates + 3.29)
ICC = 0.3366/(0.3366+3.29) = .093
9% of the variability in going into remission is accounted for by the hospitals? */
With the full model, I calculate an ICC of 10%.
proc glimmix data=hdp PLOTS=(ALL ODDSRATIO(logbase=10 TYPE=HORIZONTALSTAT)) namelen=30 ;
class HID CancerStage(ref='I') FamilyHx(ref="no");
model remission(event='1') = IL6 CancerStage Age FamilyHx
/ distribution=binary link=logit cl oddsratio solution ddfm=betwithin;
random intercept / subject=HID solution CL ;
COVTEST / WALD;
run;
quit;
/* Covariance Parameter Estimates for HID = 0.3721
ICC = Covariance Parameter Estimate/(Covariance Parameter Estimates + 3.29)
ICC = 0.3721/(0.3721+3.29) = .1016
10% of the variability in going into remission is accounted for by the hospitals? */
Thank you.
Hello,
I am sorry for not answering your question directly, but I just wanted to make you aware about this :
S A S S A M P L E L I B R A R Y
Intraclass Correlations
https://support.sas.com/documentation/onlinedoc/stat/ex_code/131/intracc.html
Also, please visit https://www.lexjansen.com/
Then enter 'intraclass correlation' as search terms and hit the magnifying glass.
Good luck,
Koen
@sbxkoenk wrote:
Hello,
I am sorry for not answering your question directly, but I just wanted to make you aware about this :
S A S S A M P L E L I B R A R Y
Intraclass Correlationshttps://support.sas.com/documentation/onlinedoc/stat/ex_code/131/intracc.html
Also, please visit https://www.lexjansen.com/
Then enter 'intraclass correlation' as search terms and hit the magnifying glass.
Good luck,
Koen
Thanks for the feedback. The example from the SAS Sample Library is unfortunately not helpful because it calculates inter-rater reliability. I searched lexjansen.com for "intraclass correlation" AND "3.29" and for "intraclass correlation" AND "glimmix", and the pertinent responses are listed in my original post. I'm now looking for insight from other experts.
Interestingly, a 2018 SAS paper states:
"How Do You Compute an Intra-Class Correlation Coefficient (ICC) for a Non-Normal Response Model in PROC GLIMMIX (Binary, Multinomial, and so on)? SAS is not aware of a generally accepted method for calculating an ICC in a logistic model, mainly because there is no concept of a residual variance in a logistic regression model. You can form your own ratios of the variance components in your model, but SAS does not endorse such an ICC calculation for a non-normal response model." - https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2018/2179-2018.pdf
I wonder if SAS has since changed their position on endorsing an ICC calculation for a binary response response model? @jiltao
I doubt that there has been a change in the position. The variance component ratios can be calculated, but the are NOT correlations. In particular, there is no "residual" variance for the binomial distribution as the variance is a function of the point estimate, and changes for each level of a categorical predictor or each point along a continuous variable.
SteveDenham
Hi @SteveDenham. Is the residual variance for a response variable with a logistic distribution in a mixed model not π2/3 or 3.29? Thanks
"In HGLMs, there is assumed to be no error at level-1, therefore, a slight modification is needed to calculate the ICC. This modification assumes the dichotomous outcome comes from an unknown latent continuous variable with a level-1 residual that follows a logistic distribution with a mean of 0 and a variance of 3.29 (Snijders & Bosker, 1999 as cited in O’Connell et al., 2008). Therefore, 3.29 will be used as our level-1 error variance in calculating the ICC." - https://support.sas.com/resources/papers/proceedings15/3430-2015.pdf
STATA documentation states: "In a mixed-effects logistic and ordered logistic regression, errors are assumed to be logistic with mean 0 and variance γ. Random intercepts are assumed to be normally distributed with mean 0 and variance σ22 and to be independent of error terms. The intraclass correlation for this model is ρ = Corr(y*ij, y*i'j ) = σ22 /γ + σ22 where γ = σ21 for a mixed-effects linear regression, γ = 1 for a mixed-effects probit and ordered probit regression, γ = π 2/3 for a mixed-effects logistic and ordered logistic regression, and γ = π 2/6 for a mixed-effects complementary log–log regression. The intraclass correlation corresponds to the correlation between the latent responses i and i' from the same group j." - https://www.stata.com/manuals/meestaticc.pdf
The logit "model also admits a latent variable formulation, except that this time the individual error terms eij are assumed to follow standard logistic rather than standard normal distributions. To be precise, we assume that Yij = 1 if, and only if, Y ∗ ij > 0, just as before. We further assume that Y ∗ ij follows the linear mixed model in (1) with mean η and ui ∼ N(0, σ2u), but eij now has a logistic distribution with mean 0 and variance σ2e. Just as before, we note that the scale of the latent variable is not identified. For convenience, we take eij to have a standard logistic distribution, which happens to have variance π2/3 or approximately 3.29." - https://ageconsearch.umn.edu/record/116030/files/sjart_st0031.pdf
Well, I learned something there. The variance will converge in distribution to that value, but the estimate could be about anything with small to medium datasets. And note that the mean converges to zero. Now, I believe that is a good assumption for the null hypothesis of no differences, but the realization, given the data at hand, has for each level a non-zero mean, and if this is a multifactor model, the estimates of the level means won't necessarily sum to zero, if there is any imbalance in the data That is one thing to consider.
The other thing that bothers me a bit is that if I plug into the formula for the sampling variance of a binomial (equivalent to the null model), it looks like n * p hat * (i - p hat). If I plug in your null model value (0.3366) and assume a p hat value of 0.5 (a best case estimate) and solve for n, I get an n of about 13. Does that seem correct for your data?
SteveDenham
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.
Find more tutorials on the SAS Users YouTube channel.