I am creating multiple GEEs with the same covariates, but I am testing different clustering variables. The outcome is a binary yes/no variable and both VAR1 and VAR2 are binary, as well. Patients can be seen at 1 of 7 clinics, and each clinic has multiple doctors. Thus, there is an argument to be made that we should cluster based on clinic or doctor (at this time, we are not considering multiple random effect models). I want to create the ROC curve for both the GEE clustered by clinic and clustered by doctor, but when I do so using this SAS note, I am getting the same ROC curve. I have checked the predicted probabilities, and they are different for the two models. Anyone know what is going on or why this is happening? Appreciate the help and insights.
Here is my code and the outputted ROC curve for both.
*CLUSTERED BY DOCTOR;
PROC GLIMMIX data = DATA ORDER=INTERNAL empirical=root ;
class DOCTOR VAR1 VAR2;
model OUTCOME(event='1')= VAR1|VAR2
/ dist=bin link=logit covb solution ;
output out=info1 pred(ilink)=phat lcl(ilink)=low ucl(ilink)=up resid=res student=student;
RANDOM _residual_ / SUBJECT=DOCTOR TYPE=cs gcorr solution ;
Run;
proc logistic data=INFO1;
model OUTCOME(event='1') = / nofit;
roc "GLIMMIX model" PRED=phat;
run;
*CLUSTERED BY CLINIC;
PROC GLIMMIX data = DATA ORDER=INTERNAL empirical=root ;
class CLINIC VAR1 VAR2;
model OUTCOME(event='1')= VAR1|VAR2
/ dist=bin link=logit covb solution ;
output out=info1 pred(ilink)=phat lcl(ilink)=low ucl(ilink)=up resid=res student=student;
RANDOM _residual_ / SUBJECT=CLINIC TYPE=cs gcorr solution ;
Run;
proc logistic data=INFO1;
model OUTCOME(event='1') = / nofit;
roc "GLIMMIX model" PRED=phat;
run;
Is it the random residuals or random intercepts that are supposed to be used to calculate the predicted probabilities? Are you certain you have repeated measures?
- https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2018/2179-2018.pdf
/* "The following SAS DATA step inputs the respiratory data and creates an observation for each response. The baseline and follow-up responses are actually measured on a five-point scale, from terrible to excellent, and this ordinal response is analyzed later in the chapter. For this analysis, the dichotomous outcome of whether the patient experienced good or excellent response
is analyzed with a logistic regression. The second DATA step creates the SAS data set RESP2 and computes response variable DICHOT and dichotomous baseline variable DI_BASE. Note that the baseline variable, which was recorded on a five-point scale, could be managed as either ordinal or
dichotomous." - pg 515. Stokes, Davis, and Koch. Categorical Data Analysis Using SAS, Third Edition.
https://tinyurl.com/3rj67wad
data source: https://support.sas.com/rnd/app/stat/cat/edition3/samples/chapter15.html
*/
data resp;
input center id treatment $ sex $ age baseline
visit1-visit4 @@;
visit=1; outcome=visit1; output;
visit=2; outcome=visit2; output;
visit=3; outcome=visit3; output;
visit=4; outcome=visit4; output;
datalines;
1 53 A F 32 1 2 2 4 2 2 30 A F 37 1 3 4 4 4
1 18 A F 47 2 2 3 4 4 2 52 A F 39 2 3 4 4 4
1 54 A M 11 4 4 4 4 2 2 23 A F 60 4 4 3 3 4
1 12 A M 14 2 3 3 3 2 2 54 A F 63 4 4 4 4 4
1 51 A M 15 0 2 3 3 3 2 12 A M 13 4 4 4 4 4
1 20 A M 20 3 3 2 3 1 2 10 A M 14 1 4 4 4 4
1 16 A M 22 1 2 2 2 3 2 27 A M 19 3 3 2 3 3
1 50 A M 22 2 1 3 4 4 2 16 A M 20 2 4 4 4 3
1 3 A M 23 3 3 4 4 3 2 47 A M 20 2 1 1 0 0
1 32 A M 23 2 3 4 4 4 2 29 A M 21 3 3 4 4 4
1 56 A M 25 2 3 3 2 3 2 20 A M 24 4 4 4 4 4
1 35 A M 26 1 2 2 3 2 2 2 A M 25 3 4 3 3 1
1 26 A M 26 2 2 2 2 2 2 15 A M 25 3 4 4 3 3
1 21 A M 26 2 4 1 4 2 2 25 A M 25 2 2 4 4 4
1 8 A M 28 1 2 2 1 2 2 9 A M 26 2 3 4 4 4
1 30 A M 28 0 0 1 2 1 2 49 A M 28 2 3 2 2 1
1 33 A M 30 3 3 4 4 2 2 55 A M 31 4 4 4 4 4
1 11 A M 30 3 4 4 4 3 2 43 A M 34 2 4 4 2 4
1 42 A M 31 1 2 3 1 1 2 26 A M 35 4 4 4 4 4
1 9 A M 31 3 3 4 4 4 2 14 A M 37 4 3 2 2 4
1 37 A M 31 0 2 3 2 1 2 36 A M 41 3 4 4 3 4
1 23 A M 32 3 4 4 3 3 2 51 A M 43 3 3 4 4 2
1 6 A M 34 1 1 2 1 1 2 37 A M 52 1 2 1 2 2
1 22 A M 46 4 3 4 3 4 2 19 A M 55 4 4 4 4 4
1 24 A M 48 2 3 2 0 2 2 32 A M 55 2 2 3 3 1
1 38 A M 50 2 2 2 2 2 2 3 A M 58 4 4 4 4 4
1 48 A M 57 3 3 4 3 4 2 53 A M 68 2 3 3 3 4
1 5 P F 13 4 4 4 4 4 2 28 P F 31 3 4 4 4 4
1 19 P F 31 2 1 0 2 2 2 5 P F 32 3 2 2 3 4
1 25 P F 35 1 0 0 0 0 2 21 P F 36 3 3 2 1 3
1 28 P F 36 2 3 3 2 2 2 50 P F 38 1 2 0 0 0
1 36 P F 45 2 2 2 2 1 2 1 P F 39 1 2 1 1 2
1 43 P M 13 3 4 4 4 4 2 48 P F 39 3 2 3 0 0
1 41 P M 14 2 2 1 2 3 2 7 P F 44 3 4 4 4 4
1 34 P M 15 2 2 3 3 2 2 38 P F 47 2 3 3 2 3
1 29 P M 19 2 3 3 0 0 2 8 P F 48 2 2 1 0 0
1 15 P M 20 4 4 4 4 4 2 11 P F 48 2 2 2 2 2
1 13 P M 23 3 3 1 1 1 2 4 P F 51 3 4 2 4 4
1 27 P M 23 4 4 2 4 4 2 17 P F 58 1 4 2 2 0
1 55 P M 24 3 4 4 4 3 2 39 P M 11 3 4 4 4 4
1 17 P M 25 1 1 2 2 2 2 40 P M 14 2 1 2 3 2
1 45 P M 26 2 4 2 4 3 2 24 P M 15 3 2 2 3 3
1 40 P M 26 1 2 1 2 2 2 41 P M 15 4 3 3 3 4
1 44 P M 27 1 2 2 1 2 2 33 P M 19 4 2 2 3 3
1 49 P M 27 3 3 4 3 3 2 13 P M 20 1 4 4 4 4
1 39 P M 28 2 1 1 1 1 2 34 P M 20 3 2 4 4 4
1 2 P M 28 2 0 0 0 0 2 45 P M 33 3 3 3 2 3
1 14 P M 30 1 0 0 0 0 2 22 P M 36 2 4 3 3 4
1 10 P M 37 3 2 3 3 2 2 18 P M 38 4 3 0 0 0
1 31 P M 37 1 0 0 0 0 2 35 P M 42 3 2 2 2 2
1 7 P M 43 2 3 2 4 4 2 44 P M 43 2 1 0 0 0
1 52 P M 43 1 1 1 3 2 2 6 P M 45 3 4 2 1 2
1 4 P M 44 3 4 3 4 2 2 46 P M 48 4 4 0 0 0
1 1 P M 46 2 2 2 2 2 2 31 P M 52 2 3 4 3 4
1 46 P M 49 2 2 2 2 2 2 42 P M 66 3 3 3 4 4
1 47 P M 63 2 2 2 2 2
;
data resp2; set resp;
dichot=(outcome=3 or outcome=4);
di_base = (baseline=3 or baseline=4);
run;
/* Kathleen Kiernan. https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2018/2179-2018.pdf
"The data for this example is from a clinical trial (Stokes, Davis, and Koch 2012) that was conducted to compare two treatments for a respiratory illness. Patients in each of two centers were randomly assigned to two groups: one group received the active treatment and one group received a placebo.
During treatment, respiratory status was determined for each of four visits and is represented by the variable OUTCOME (coded here as 0=poor, 1=good). The variables CENTER, TREATMENT, SEX, and BASELINE (baseline respiratory status) are classification variables that have two levels. The variable AGE (age at time of entry into the study) is a continuous variable. The variable ID is the patient
identification number.
The following statements fit the model:"
*/
/* Marginal GEE type of model */
proc glimmix data=resp2 empirical method=rspl;
class id sex treatment visit;
model dichot (event='1')=sex treatment visit age baseline / dist=binary link=logit;
random _residual_ / subject=id type=cs;
output out=gmxout_id pred=xbeta_id pred(ilink)=predprob_id pred(ilink
noblup)=fix_predprob_id;
run;
/* with random residual, clustering by ID */
proc logistic data=gmxout_id plots(only)=roc;
model dichot = predprob_id;
ods select roccurve;
run;
/* C-statistic = 0.79 */
proc glimmix data=resp2 empirical method=rspl;
class center sex treatment visit;
model dichot (event='1')=sex treatment visit age baseline / dist=binary link=logit;
random _residual_ / subject=center type=cs;
output out=gmxout_center pred=xbeta_center pred(ilink)=predprob_center pred(ilink
noblup)=fix_predprob_center;
run;
/* with random residual, clustering by center */
proc logistic data=gmxout_center plots(only)=roc;
model dichot = predprob_center;
ods select roccurve;
run;
/* C-statistic = 0.79 */
/* nofit option still produces the same output as above */
proc logistic data=gmxout_center plots(only)=roc;
model dichot(event='1') = / nofit;
roc pred=predprob;
ods select roccurve;
run;
/*
Kathleen Kiernan. https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2018/2179-2018.pdf
"The choice of the marginal (population-averaged) model or conditional (subject-specific) model often depends on the goal of your analysis: whether you are interested in population-averaged effects or subject-specific effects.
The GEE model is a marginal, or population-averaged model. If you are interested in making predictions about individuals, then you would use GLIMMIX to the fit the conditional model using G-side random effects and obtain the subject specific estimates. For example:"
*/
/* conditional model subject specific estimates */
proc glimmix data=resp2 ;
class id sex treatment visit;
model dichot (event='1')=sex treatment visit age baseline/s dist=binary link=logit;
random intercept /s subject=id;
output out=gmxout_subj_id pred=xbeta_subj_id pred(ilink)=predprob_subj_id pred(ilink
noblup)=fix_predprob_subj_id;
run;
/* with random intercept, clustering by ID */
proc logistic data=gmxout_subj_id plots(only)=roc;
model dichot = predprob_subj_id;
ods select roccurve;
run;
/* C-statistic = 0.88 */
proc glimmix data=resp2 ;
class center sex treatment visit;
model dichot (event='1')=sex treatment visit age baseline/s dist=binary link=logit;
random intercept /s subject=center;
output out=gmxout_subj_cntr pred=xbeta_subj_cntr pred(ilink)=predprob_subj_cntr pred(ilink
noblup)=fix_predprob_subj_cntr;
run;
/* with random intercept, clustering by center */
proc logistic data=gmxout_subj_cntr plots(only)=roc;
model dichot = predprob_subj_cntr;
ods select roccurve;
run;
/* C-statistic = 0.79 */
I would guess that the second GLIMMIX did not converge, or something along those lines. As a result, the dataset INFO1 is never replaced and then the two PROC LOGISTICs are using the same data. Check the output listing for a message regarding either number of iterations or issues with the G matrix not being positive definite. The latter could lead to identical BLUPs if all the G matrix entries went to zero.
The other thing that might be going on is that in both LOGISTIC calls you are modeling the same thing - the OUTCOME variable. I think you need to be modeling the phat (predicted variables on the original scale) values to see the effect of the clustering. On that note, I will defer to @StatDave .
SteveDenham
Appreciate the thoughts and insights. I just double checked; both GLIMMIXs converged. I also tested with the original p-hats (without the ILINK function), and the ROC curves did not change.
I believe that if the two sets of predicted probabilities are only shifted relative to each other, but keep the same ordering, then the ROC curve is unaffected.
I calculated the sensitivity/specificity for both models with a cut point of predicted probability = 50%, and they were different. For model 1, sensitivity = 33% specificity =80% and for model 2, sensitivity = 94% specificity =39%. Thus, the ROC curves should not be the same. Or am I missing something?
Is it the random residuals or random intercepts that are supposed to be used to calculate the predicted probabilities? Are you certain you have repeated measures?
- https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2018/2179-2018.pdf
/* "The following SAS DATA step inputs the respiratory data and creates an observation for each response. The baseline and follow-up responses are actually measured on a five-point scale, from terrible to excellent, and this ordinal response is analyzed later in the chapter. For this analysis, the dichotomous outcome of whether the patient experienced good or excellent response
is analyzed with a logistic regression. The second DATA step creates the SAS data set RESP2 and computes response variable DICHOT and dichotomous baseline variable DI_BASE. Note that the baseline variable, which was recorded on a five-point scale, could be managed as either ordinal or
dichotomous." - pg 515. Stokes, Davis, and Koch. Categorical Data Analysis Using SAS, Third Edition.
https://tinyurl.com/3rj67wad
data source: https://support.sas.com/rnd/app/stat/cat/edition3/samples/chapter15.html
*/
data resp;
input center id treatment $ sex $ age baseline
visit1-visit4 @@;
visit=1; outcome=visit1; output;
visit=2; outcome=visit2; output;
visit=3; outcome=visit3; output;
visit=4; outcome=visit4; output;
datalines;
1 53 A F 32 1 2 2 4 2 2 30 A F 37 1 3 4 4 4
1 18 A F 47 2 2 3 4 4 2 52 A F 39 2 3 4 4 4
1 54 A M 11 4 4 4 4 2 2 23 A F 60 4 4 3 3 4
1 12 A M 14 2 3 3 3 2 2 54 A F 63 4 4 4 4 4
1 51 A M 15 0 2 3 3 3 2 12 A M 13 4 4 4 4 4
1 20 A M 20 3 3 2 3 1 2 10 A M 14 1 4 4 4 4
1 16 A M 22 1 2 2 2 3 2 27 A M 19 3 3 2 3 3
1 50 A M 22 2 1 3 4 4 2 16 A M 20 2 4 4 4 3
1 3 A M 23 3 3 4 4 3 2 47 A M 20 2 1 1 0 0
1 32 A M 23 2 3 4 4 4 2 29 A M 21 3 3 4 4 4
1 56 A M 25 2 3 3 2 3 2 20 A M 24 4 4 4 4 4
1 35 A M 26 1 2 2 3 2 2 2 A M 25 3 4 3 3 1
1 26 A M 26 2 2 2 2 2 2 15 A M 25 3 4 4 3 3
1 21 A M 26 2 4 1 4 2 2 25 A M 25 2 2 4 4 4
1 8 A M 28 1 2 2 1 2 2 9 A M 26 2 3 4 4 4
1 30 A M 28 0 0 1 2 1 2 49 A M 28 2 3 2 2 1
1 33 A M 30 3 3 4 4 2 2 55 A M 31 4 4 4 4 4
1 11 A M 30 3 4 4 4 3 2 43 A M 34 2 4 4 2 4
1 42 A M 31 1 2 3 1 1 2 26 A M 35 4 4 4 4 4
1 9 A M 31 3 3 4 4 4 2 14 A M 37 4 3 2 2 4
1 37 A M 31 0 2 3 2 1 2 36 A M 41 3 4 4 3 4
1 23 A M 32 3 4 4 3 3 2 51 A M 43 3 3 4 4 2
1 6 A M 34 1 1 2 1 1 2 37 A M 52 1 2 1 2 2
1 22 A M 46 4 3 4 3 4 2 19 A M 55 4 4 4 4 4
1 24 A M 48 2 3 2 0 2 2 32 A M 55 2 2 3 3 1
1 38 A M 50 2 2 2 2 2 2 3 A M 58 4 4 4 4 4
1 48 A M 57 3 3 4 3 4 2 53 A M 68 2 3 3 3 4
1 5 P F 13 4 4 4 4 4 2 28 P F 31 3 4 4 4 4
1 19 P F 31 2 1 0 2 2 2 5 P F 32 3 2 2 3 4
1 25 P F 35 1 0 0 0 0 2 21 P F 36 3 3 2 1 3
1 28 P F 36 2 3 3 2 2 2 50 P F 38 1 2 0 0 0
1 36 P F 45 2 2 2 2 1 2 1 P F 39 1 2 1 1 2
1 43 P M 13 3 4 4 4 4 2 48 P F 39 3 2 3 0 0
1 41 P M 14 2 2 1 2 3 2 7 P F 44 3 4 4 4 4
1 34 P M 15 2 2 3 3 2 2 38 P F 47 2 3 3 2 3
1 29 P M 19 2 3 3 0 0 2 8 P F 48 2 2 1 0 0
1 15 P M 20 4 4 4 4 4 2 11 P F 48 2 2 2 2 2
1 13 P M 23 3 3 1 1 1 2 4 P F 51 3 4 2 4 4
1 27 P M 23 4 4 2 4 4 2 17 P F 58 1 4 2 2 0
1 55 P M 24 3 4 4 4 3 2 39 P M 11 3 4 4 4 4
1 17 P M 25 1 1 2 2 2 2 40 P M 14 2 1 2 3 2
1 45 P M 26 2 4 2 4 3 2 24 P M 15 3 2 2 3 3
1 40 P M 26 1 2 1 2 2 2 41 P M 15 4 3 3 3 4
1 44 P M 27 1 2 2 1 2 2 33 P M 19 4 2 2 3 3
1 49 P M 27 3 3 4 3 3 2 13 P M 20 1 4 4 4 4
1 39 P M 28 2 1 1 1 1 2 34 P M 20 3 2 4 4 4
1 2 P M 28 2 0 0 0 0 2 45 P M 33 3 3 3 2 3
1 14 P M 30 1 0 0 0 0 2 22 P M 36 2 4 3 3 4
1 10 P M 37 3 2 3 3 2 2 18 P M 38 4 3 0 0 0
1 31 P M 37 1 0 0 0 0 2 35 P M 42 3 2 2 2 2
1 7 P M 43 2 3 2 4 4 2 44 P M 43 2 1 0 0 0
1 52 P M 43 1 1 1 3 2 2 6 P M 45 3 4 2 1 2
1 4 P M 44 3 4 3 4 2 2 46 P M 48 4 4 0 0 0
1 1 P M 46 2 2 2 2 2 2 31 P M 52 2 3 4 3 4
1 46 P M 49 2 2 2 2 2 2 42 P M 66 3 3 3 4 4
1 47 P M 63 2 2 2 2 2
;
data resp2; set resp;
dichot=(outcome=3 or outcome=4);
di_base = (baseline=3 or baseline=4);
run;
/* Kathleen Kiernan. https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2018/2179-2018.pdf
"The data for this example is from a clinical trial (Stokes, Davis, and Koch 2012) that was conducted to compare two treatments for a respiratory illness. Patients in each of two centers were randomly assigned to two groups: one group received the active treatment and one group received a placebo.
During treatment, respiratory status was determined for each of four visits and is represented by the variable OUTCOME (coded here as 0=poor, 1=good). The variables CENTER, TREATMENT, SEX, and BASELINE (baseline respiratory status) are classification variables that have two levels. The variable AGE (age at time of entry into the study) is a continuous variable. The variable ID is the patient
identification number.
The following statements fit the model:"
*/
/* Marginal GEE type of model */
proc glimmix data=resp2 empirical method=rspl;
class id sex treatment visit;
model dichot (event='1')=sex treatment visit age baseline / dist=binary link=logit;
random _residual_ / subject=id type=cs;
output out=gmxout_id pred=xbeta_id pred(ilink)=predprob_id pred(ilink
noblup)=fix_predprob_id;
run;
/* with random residual, clustering by ID */
proc logistic data=gmxout_id plots(only)=roc;
model dichot = predprob_id;
ods select roccurve;
run;
/* C-statistic = 0.79 */
proc glimmix data=resp2 empirical method=rspl;
class center sex treatment visit;
model dichot (event='1')=sex treatment visit age baseline / dist=binary link=logit;
random _residual_ / subject=center type=cs;
output out=gmxout_center pred=xbeta_center pred(ilink)=predprob_center pred(ilink
noblup)=fix_predprob_center;
run;
/* with random residual, clustering by center */
proc logistic data=gmxout_center plots(only)=roc;
model dichot = predprob_center;
ods select roccurve;
run;
/* C-statistic = 0.79 */
/* nofit option still produces the same output as above */
proc logistic data=gmxout_center plots(only)=roc;
model dichot(event='1') = / nofit;
roc pred=predprob;
ods select roccurve;
run;
/*
Kathleen Kiernan. https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2018/2179-2018.pdf
"The choice of the marginal (population-averaged) model or conditional (subject-specific) model often depends on the goal of your analysis: whether you are interested in population-averaged effects or subject-specific effects.
The GEE model is a marginal, or population-averaged model. If you are interested in making predictions about individuals, then you would use GLIMMIX to the fit the conditional model using G-side random effects and obtain the subject specific estimates. For example:"
*/
/* conditional model subject specific estimates */
proc glimmix data=resp2 ;
class id sex treatment visit;
model dichot (event='1')=sex treatment visit age baseline/s dist=binary link=logit;
random intercept /s subject=id;
output out=gmxout_subj_id pred=xbeta_subj_id pred(ilink)=predprob_subj_id pred(ilink
noblup)=fix_predprob_subj_id;
run;
/* with random intercept, clustering by ID */
proc logistic data=gmxout_subj_id plots(only)=roc;
model dichot = predprob_subj_id;
ods select roccurve;
run;
/* C-statistic = 0.88 */
proc glimmix data=resp2 ;
class center sex treatment visit;
model dichot (event='1')=sex treatment visit age baseline/s dist=binary link=logit;
random intercept /s subject=center;
output out=gmxout_subj_cntr pred=xbeta_subj_cntr pred(ilink)=predprob_subj_cntr pred(ilink
noblup)=fix_predprob_subj_cntr;
run;
/* with random intercept, clustering by center */
proc logistic data=gmxout_subj_cntr plots(only)=roc;
model dichot = predprob_subj_cntr;
ods select roccurve;
run;
/* C-statistic = 0.79 */
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.
Find more tutorials on the SAS Users YouTube channel.