Merdock
Quartz | Level 8

I have a dataset (data1) with assessments taken at different visits: baseline (visit=1), Day 8 (visit=2), Day 30 (visit=3), Day 60 (visit=4), and Day 180 (visit=5) on an ordinal variable with four levels (1=normal, 2=mild, 3=moderate, 4=severe), and a separate dataset (data2) with repeated measures of a continuous variable (a biomarker), taken at the same visits (1, 2, 3, 4, 5).

One of the exploratory objectives of my exercise problem (I’m trying to teach myself GLMM) is to investigate the correlation between the biomarker (as predictor) and the ordinal variable (as outcome). Since this is only an exploratory objective (i.e., the study was not specifically designed for it), not all participants in the study have assessments at all of those visits.

Since this is a longitudinal study, with repeated measurements taken on the same subjects, I am thinking of exploring the correlation between the continuous predictor and categorical outcome from baseline to Day 180 by using repeated measures logistic regression, implemented via PROC GLIMMIX.

 

In preparation for fitting the model, this is what I have done so far:

  1. I merged data1 and data2 by id and visit to create one unified dataset (dataset “final” illustrated below) that contains both the outcome and predictor variables at the different visits.
  2. For simplicity, I converted my 4-level categorical outcome into a binary one (if “normal or mild” then outcome=0; if “moderate or severe” then outcome=1).
  3. I noticed that only 15 out of a total of 46 participants in the study had measurements on both the outcome and predictor variables at both baseline (visit 1) and Day 180 (visit 5). The rest of the participants either didn’t have the baseline visit, or didn’t have visit 5, or both. So for my model I used a smaller dataset that contains only these 15 participants who had both visit 1 and visit 5 (a sketch of these data steps is shown just below).
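For reference, this is roughly the code behind steps 1-3 (a minimal sketch; "severity" is a placeholder name for the 4-level variable in data1; the resulting merged data are listed as datalines further below):

/* Step 1: merge data1 and data2 by id and visit */
proc sort data=data1; by id visit; run;
proc sort data=data2; by id visit; run;

data final;
merge data1 data2;
by id visit;
/* Step 2: collapse to binary (normal/mild -> 0, moderate/severe -> 1) */
if severity in (1, 2) then outcome = 0;
else if severity in (3, 4) then outcome = 1;
run;

/* Step 3: keep only participants seen at both visit 1 and visit 5 */
proc sql;
create table final2 as
select *
from final
where id in (select id from final where visit = 1)
  and id in (select id from final where visit = 5);
quit;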

Outcome variable: Disease status (outcome=0 means no disease present; outcome=1 means disease)

Independent variables: biomarker (biom), plus visit as a fixed-effect factor since we have repeated measurements

Dataset:

data final;
infile datalines expandtabs; /* treat any tabs in the data lines as blanks */
input id $ visit outcome biom;
datalines;
001 1 0	59.7
001 2 0	78.4
001 4 0	75.2
001 5 1	80.7
002 2 0	64.6
002 5 0	389
003 1 0	618
003 2 0	469
003 3 0	478
004 1 1	404
004 2 0	47.3
004 3 1	64.5
004 4 0	88.8
004 5 0	86.7
005 1 0	88.3
006 1 0	234
007 1 0	245
007 2 0	243
007 3 0	237
007 5 0	226
008 1 0	22.2
008 2 0	25.5
008 5 1	35.5
009 3 0	35.3
009 5 0	30.3
010 1 0	134
010 5 0	167
011 4 0	146
011 5 0	135
012 1 0	140
012 4 0	74.6
012 5 0	72.9
013 1 0	79.1
013 3 0	75.6
013 4 0	68.9
014 2 1	291
014 3 0	21.3
015 1 0	17
015 5 1	15.6
016 1 0	16.1
016 2 0	13.9
016 5 0	99.9
017 1 0	105
017 3 0	96.2
017 4 0	102
017 5 0	89.2
018 3 1	25.9
019 2 0	27.3
019 3 0	26.2
020 2 0	26.2
020 5 0	28.9
021 1 1	74.2
021 3 0	75.1
021 4 0	60.4
021 5 0	62.2
022 1 0	61.4
023 1 0	12.7
024 2 0	12.1
024 3 0	13.4
025 1 0	11.6
025 5 0	11.5
026 1 0	45.9
026 5 0	47.2
027 1 0	39
027 2 0	38.7
027 3 1	42.7
027 4 0	18.4
027 5 0	15.3
028 1 0	15.9
028 2 0	16.1
028 4 0	15.8
029 1 0	57.8
029 2 1	86.7
029 3 1	88.3
029 5 0	234
030 1 0	245
030 3 0	243
030 5 0	237
031 1 0	226
031 2 0	22.2
031 4 0	18.4
032 4 0	15.3
032 5 0	15.9
033 1 0	16.1
033 2 0	78.4
034 2 0	75.2
034 3 0	80.7
035 1 0	64.6
035 2 0	389
035 3 0	618
035 4 0	469
036 1 0	478
037 1 0	152
038 2 0	148
038 3 0	29.12
039 2 0	421
040 1 0	520
040 2 0	478
040 3 0	18.4
041 2 0	15.3
041 4 0	15.9
042 1 0	16.1
043 1 0	78.4
044 1 0	325
044 2 0	478
044 3 0	452
045 2 0	25.8
045 4 0	15.9
045 5 0	16.1
046 1 0	78.4
046 4 0	16.8
; 
run;
data final2;
infile datalines expandtabs; /* treat any tabs in the data lines as blanks */
input id $ visit outcome biom;
datalines;
001 1 0	59.7
001 2 0	78.4
001 4 0	75.2
001 5 1	80.7
004 1 1	404
004 2 0	47.3
004 3 1	64.5
004 4 0	88.8
004 5 0	86.7
007 1 0	245
007 2 0	243
007 3 0	237
007 5 0	226
008 1 0	22.2
008 2 0	25.5
008 5 1	35.5
010 1 0	134
010 5 0	167
012 1 0	140
012 4 0	74.6
012 5 0	72.9
015 1 0	17
015 5 1	15.6
016 1 0	16.1
016 2 0	13.9
016 5 0	99.9
017 1 0	105
017 3 0	96.2
017 4 0	102
017 5 0	89.2
021 1 1	74.2
021 3 0	75.1
021 4 0	60.4
021 5 0	62.2
025 1 0	11.6
025 5 0	11.5
026 1 0	45.9
026 5 0	47.2
027 1 0	39
027 2 0	38.7
027 3 1	42.7
027 4 0	18.4
027 5 0	15.3
029 1 0	57.8
029 2 1	86.7
029 3 1	88.3
029 5 0	234
030 1 0	245
030 3 0	243
030 5 0	237
; 
run;

Model:
proc glimmix data=final2 method=quad(qpoints=50) noclprint;
class id visit;
model outcome = visit biom biom*visit / noint dist=binomial link=logit solution;
random intercept / subject=id;
output out=fitglmm pred(ilink noblup)=pred;
run;

My questions are:

  1. Does my action plan for addressing this exploratory question seem correct?
  2. Am I correct in keeping only those participants that have both a baseline visit, as well as visit 5 assessment for the final analysis dataset, and excluding the rest? Or is this problematic? If yes, what are some correct alternatives?
  3. Are there any pre-modeling visualization techniques that I can/should use to further explore my data? Is it ok to use boxplots to look at the distribution of my continuous variable at each level of the binary outcome? Should I maybe use point-biserial correlation first to see if there’s any evidence of a relationship at all between my predictor and dependent variable before fitting the model? If yes, is there such a thing as point-biserial correlation for repeated measures data, or should I just use the baseline values of the variables? (The kind of code I have in mind is sketched after this list.)
  4. Is my model setup correct/complete?
  5. How can I check to see if my model fits the data well? I know that for regular linear models, there are residual plots, QQ plots, checks for outliers and influential points, etc., but I’m not sure what kind of model diagnostics are best for GLMMs.
  6. Any other suggestions/recommendations?
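For reference, this is roughly the kind of quick look I had in mind for question 3 (just a sketch; it treats every observation as independent and ignores the repeated-measures structure):

proc sgplot data=final;
vbox biom / category=outcome; /* biomarker distribution at each level of the binary outcome */
run;

proc corr data=final pearson;
var outcome biom; /* Pearson correlation of a 0/1 variable with a continuous one is the point-biserial correlation */
run;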

Thank you so much for any help and guidance.

SteveDenham
Jade | Level 19

Let's go through these by number:

 

My questions are:

  1. Does my action plan for addressing this exploratory question seem correct? Not really sure why you collapsed a multinomial response down to a binomial, but I think your approach is defensible.  It seems like a good start, in any case.
  2. Am I correct in keeping only those participants that have both a baseline visit, as well as visit 5 assessment for the final analysis dataset, and excluding the rest? Or is this problematic? If yes, what are some correct alternatives? This is certainly a case where, if the data are MNAR (missing not at random), you might want to restrict the analysis to the complete records.  However, if your data are MAR (missing at random) or MCAR (missing completely at random), the maximum likelihood analysis can handle the missing values.
  3. Are there any pre-modeling visualization techniques that I can/should use to further explore my data? Is it ok to use boxplots to look at the distribution of my continuous variable at each level of the binary outcome? Should I maybe use point-biserial correlation first to see if there’s any evidence of a relationship at all between my predictor and dependent variable before fitting the model? If yes, is there such a thing as point-biserial correlation for repeated measures data, or should I just use the baseline values of the variables? What do you expect to learn from the boxplots?  The point-biserial issue can be addressed by a cluster approach: plot time vs. the independent variable, with the binary outcome as two different colors (see the second example in the PROC FASTCLUS documentation for one approach; a rough sketch follows this list).
  4. Is my model setup correct/complete? Your model assumes that the values at the various visits are not correlated.  You may wish to impose some sort of covariance structure on visit.
  5. How can I check to see if my model fits the data well? I know that for regular linear models, there are residual plots, QQ plots, checks for outliers and influential points, etc., but I’m not sure what kind of model diagnostics are best for GLMMs. This is undoubtedly one of the hardest questions in GLMMs.  You have two outcome levels, so you may want to look at cutpoints in your model for classification of false positives and false negatives (an ROC curve).  GLIMMIX doesn't do this automatically like LOGISTIC, but there are examples on the web of how to do this with DATA steps and PROC SGPLOT; one shortcut using PROC LOGISTIC on the predicted probabilities is sketched after this list.
  6. Any other suggestions/recommendations? Not at this stage, but eventually you will probably want to work with the multinomial response.  At that point, things get much fuzzier (in my opinion).
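For item 3, a rough version of that plot (a sketch using PROC SGPLOT rather than the FASTCLUS example itself) could be:

proc sgplot data=final2;
scatter x=visit y=biom / group=outcome; /* binary outcome shown as two colors */
run;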
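For item 5, one shortcut (a sketch, not the only way) is to feed the predicted probabilities from your OUTPUT statement back into PROC LOGISTIC as a single predictor; the OUTROC= data set then carries the counts behind each point on the curve (rocdata is just a name I made up):

proc logistic data=fitglmm plots(only)=roc;
model outcome(event='1') = pred / outroc=rocdata;
run;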

SteveDenham

Merdock
Quartz | Level 8

Hi Steve,

 

Thank you so much for your answers, that's really helpful!

 

To address your question about collapsing the multinomial: I was also planning on doing a ROC analysis to estimate the predictive ability of the biomarker for detecting disease, where participants with “normal” and “mildly decreased” values of the ordinal outcome variable would be categorized as belonging to the “non-disease” group, while participants with “moderately decreased” and “severely decreased” values would be considered part of the “disease" group. I would like to assess the diagnostic performance of the biomarker at several cut-offs by computing accuracy, sensitivity, specificity, positive predictive value, negative predictive value, and likelihood ratios. This ROC analysis is the reason why I dichotomized the multinomial response variable. I figured that I can use PROC GLIMMIX to explore the relationship between the outcome and the biomarker from baseline to visit 5, then take the estimated probabilities of positivity for each observation from the model and use them in PROC LOGISTIC to build the ROC curve, though I'm not sure how to programmatically obtain the positive predictive value, negative predictive value, and likelihood ratios across cut-offs. Hopefully this makes sense.
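The closest I have so far for those is a 2x2 table at a single cut-off, from which sensitivity, specificity, PPV and NPV can be worked out by hand (the cut-off of 50 is just a placeholder value):

data onecut;
set final2;
test_pos = (biom >= 50); /* flag observations at or above the candidate cut-off */
run;

proc freq data=onecut;
tables test_pos*outcome / nopercent nocol;
run;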

 

Regarding your comment on my question 4, about imposing a covariance structure on visit, can I do that by adding another RANDOM visit statement, as shown below? I read that PROC GLIMMIX does not allow a REPEATED statement like PROC MIXED does, but that we can equivalently use a RANDOM statement with the residual option:

proc glimmix data=final2 method=quad(qpoints=50) noclprint;
class id visit;
model outcome = visit biom biom*visit / noint dist=binomial link=logit solution;
random intercept / subject=id;
random visit / subject=id type=UN residual;
output out=fitglmm pred(ilink noblup)=pred;
run;
SteveDenham
Jade | Level 19

That code will not run, as method=quad does not support an R-side (residual) RANDOM statement.  Two choices: remove the residual option from the new RANDOM statement, or leave it in and drop back to the default pseudo-likelihood method.  The first will coincide with more of the published literature, while the second may be more appropriate for controlling Type I error (see Stroup & Claassen, https://econpapers.repec.org/RePEc:spr:jagbes:v:25:y:2020:i:4:d:10.1007_s13253-020-00402-6 ; it's paywalled, but I have a copy that I was able to access through my ASA membership). Both choices are sketched below.
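To make the two choices concrete (untested sketches based on your model):

/* Choice 1: keep method=quad and move visit to the G side (drop the residual option) */
proc glimmix data=final2 method=quad(qpoints=50) noclprint;
class id visit;
model outcome = visit biom biom*visit / noint dist=binomial link=logit solution;
random intercept / subject=id;
random visit / subject=id type=UN;
run;

/* Choice 2: keep the R-side (residual) structure and drop method=quad,
   so GLIMMIX falls back to the default pseudo-likelihood method (RSPL) */
proc glimmix data=final2 noclprint;
class id visit;
model outcome = visit biom biom*visit / noint dist=binomial link=logit solution;
random intercept / subject=id;
random visit / subject=id type=UN residual;
run;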

 

SteveDenham

Merdock
Quartz | Level 8

Hi Steve,

Thank you for the suggestions. I tried fitting the following model on the dataset above, and unfortunately there seems to be an issue: the model does not converge. Could this be due to the small sample size (only 15 participants have both visit 1 and visit 5, so just 30 observations at those two visits, and only 5 observations with outcome=1 at either visit 1 or visit 5), or something else?

Additionally, I'm also getting a message that the model does not have an intercept; is that because I included the noint option? The message does go away if I remove noint. That said, does it even make sense, or is it necessary, to have a random intercept statement here?

 

I know this probably isn't what's causing the issues mentioned, but wouldn't it also be better to use dist=binary instead of dist=binomial, since my dependent variable is 0/1 rather than events/trials?

Sorry for all these questions, I'm completely new to longitudinal/repeated measures analysis and just trying to get a better understanding of things. 

proc glimmix data=final2 method=RSPL noclprint;
class id visit;
model outcome = visit biom biom*visit / noint dist=binomial link=logit solution;
random intercept / subject=id;
random visit/ subject=id type=UN residual;
output out=fitglmm pred(ilink noblup)=pred;
run;
jiltao
SAS Super FREQ

You might have an overspecified model.

You might want to try taking out the RANDOM intercept statement, or using a different covariance structure for the R-side, such as type=AR(1).
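For example (an untested sketch):

proc glimmix data=final2 method=rspl noclprint;
class id visit;
model outcome = visit biom biom*visit / noint dist=binomial link=logit solution;
random visit / subject=id type=AR(1) residual; /* no G-side intercept; AR(1) on the R side */
run;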

Merdock
Quartz | Level 8

Hi jiltao, thanks for your suggestions. I took out the interaction term (visit*biom) and also tried both the AR(1) covariance structure and removing the random intercept statement, but it still doesn’t converge.

Here is what I'm getting in each case:

When I delete the random intercept and use type=AR(1):

[Screenshots of the GLIMMIX iteration history and convergence messages]

 

Versus when I delete the random intercept but still use type=UN:

[Screenshots of the GLIMMIX iteration history and convergence messages]

Could this maybe be due to a lack of variability in the response variable, which would result in a lack of predictive capability, since I only have 9 observations with outcome=1 vs. 41 observations with outcome=0?

SteveDenham
Jade | Level 19

The main issue I see in those outputs is that iterations stop at step 19 in the first and that the pseudo-likelihood update fails in outer iteration 6 in the second.  Additionally in the first, the objective function is not proceeding smoothly, with a jump-reset at iteration 14.

 

Things to try (no guarantee):  Add an NLOPTIONS statement, changing the technique to ridged Newton-Raphson [tech=NRRIDG] and allowing the number of iterations to increase [MAXITER=1000].

Change the method to Laplace. This would require removing the residual option from the RANDOM statement.

Adjust the convergence criterion (see the paper at https://www.lexjansen.com/mwsug/2012/SA/MWSUG-2012-SA15.pdf for some examples).
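A sketch combining the first two suggestions (untested; the residual option has to come off the RANDOM statement when method=laplace is used):

proc glimmix data=final2 method=laplace noclprint;
class id visit;
model outcome = visit biom biom*visit / noint dist=binomial link=logit solution;
random intercept / subject=id;
random visit / subject=id type=UN;
nloptions tech=NRRIDG maxiter=1000; /* ridged Newton-Raphson with more iterations */
run;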

 

SteveDenham

Merdock
Quartz | Level 8

Hi Steve, I tried adding the options you suggested, and it looks like that solved the convergence issue, though now I’m getting another message saying the G matrix is not positive definite. From what I've been reading, one of the possible causes is that there is not enough variation in the response (which would be the case for my data). I also tried removing the random intercept statement; that took care of the message about the G matrix not being positive definite, but instead the output table reports the convergence status as indeterminate. Any further ideas or suggestions? I'm starting to think that maybe a GLMM is not well suited given the scarcity of my data, and I should just use good old summary statistics instead of attempting any modeling.

[Screenshots of the GLIMMIX output: iteration history, covariance parameter estimates, and convergence status]

 

SteveDenham
Jade | Level 19

That looks much like what I found.  You might try fitting just an R-side model in GLIMMIX (no G-side RANDOM statement), or a GEE using PROC GENMOD or PROC GEE.  I think part of the problem with the indeterminate convergence comes from overspecifying the R matrix.  Rather than UN, try CS.  If you switch to a GEE, those two PROCs call that an exchangeable structure.
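A sketch of the GEE route with PROC GENMOD (untested; exchangeable working correlation, no RANDOM statement):

proc genmod data=final2;
class id visit;
model outcome(event='1') = biom visit / dist=binomial link=logit;
repeated subject=id / type=exch corrw; /* exchangeable = compound-symmetry working correlation */
run;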

 

SteveDenham

Merdock
Quartz | Level 8

Hi Steve,

You're right: I tried various options, and it converges if I comment out the random intercept statement, change the covariance structure for the R matrix to CS, and add that NLOPTIONS statement at the end. I might also try playing around with a GEE to see the difference, but ultimately I think this is definitely progress. How meaningful the results might be, or how well this model fits my data, is a different story, but at least I got it to converge :). Thank you sooo much for all your step-by-step help, I truly appreciate it!

 

proc glimmix data=final2 noclprint;
class id visit (ref="1");
model outcome(event="1") = biom visit / dist=binary link=logit solution;
random visit / subject=id residual type=cs;
output out=fitglmm pred(ilink noblup)=pred;
nloptions tech=NRRIDG maxiter=1000;
run;
