I have a dataset (data1) with assessments of an ordinal variable with 4 levels (1=normal, 2=mild, 3=moderate, 4=severe), taken at five visits: baseline (visit=1), Day 8 (visit=2), Day 30 (visit=3), Day 60 (visit=4) and Day 180 (visit=5). A separate dataset (data2) contains repeated measures of a continuous variable (biomarker) taken at the same visits (1, 2, 3, 4, 5).
One of the exploratory objectives of my exercise problem (I’m trying to teach myself GLMMs) asks me to investigate the correlation between the biomarker (as predictor) and the ordinal variable (as outcome). Because this is only an exploratory objective, i.e., the study was not specifically designed for it, not all participants have assessments at all of those visits.
Since this is a longitudinal study, with repeated measurements taken on the same subjects, I am thinking of exploring the correlation between the continuous predictor and categorical outcome from baseline to Day 180 by using repeated measures logistic regression, implemented via PROC GLIMMIX.
In preparation for fitting the model, this is what I have done so far:
I merged data1 and data2 by id and visit to create one unified dataset (dataset “final” illustrated below) that contains both the outcome and predictor variables at the different visits.
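A minimal sketch of that merge step, assuming data1 holds id, visit and the ordinal score under the hypothetical name severity, and data2 holds id, visit and biom. The dataset name merged and the choice to keep only visits present in both sources are also assumptions:

```sas
/* MERGE with BY requires both inputs sorted by the BY variables */
proc sort data=data1; by id visit; run;
proc sort data=data2; by id visit; run;

data merged;
  /* in= flags record which source contributed to each id/visit row */
  merge data1(in=in_outcome) data2(in=in_biom);
  by id visit;
  /* assumption: keep only visits where both outcome and biomarker exist */
  if in_outcome and in_biom;
run;
```

Dropping the subsetting IF would instead keep every id/visit row found in either dataset, with missing values where one source has no record.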
For simplicity, I converted my ordinal outcome into a binary one (if “normal or mild” then outcome=0; if “moderate or severe” then outcome=1).
I noticed that only 15 out of a total of 46 participants in the study had measurements on both the outcome and predictor variables at both baseline (visit 1) and Day 180 (visit 5). The remaining participants were missing the baseline visit, visit 5, or both. So for my model I used a smaller dataset that contains only the 15 participants who had both visit 1 and visit 5.
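The recoding can be sketched as a short data step. The input dataset name merged and the ordinal variable name severity are hypothetical stand-ins, since the original variable names are not shown:

```sas
data final;
  set merged;
  /* keep missing scores missing rather than silently coding them as 0 */
  if missing(severity) then outcome = .;
  /* 1=normal, 2=mild -> 0; 3=moderate, 4=severe -> 1 */
  else outcome = (severity >= 3);
  drop severity;
run;
```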
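One way to derive this subset from the full dataset instead of typing it by hand is a PROC SQL subquery. This is only a sketch; the output name final2_check is hypothetical and is chosen so that it does not overwrite final2:

```sas
proc sql;
  create table final2_check as
  select *
  from final
  where id in (
    /* ids with complete outcome and biom at both visit 1 and visit 5 */
    select id
    from final
    where visit in (1, 5)
      and not missing(outcome)
      and not missing(biom)
    group by id
    having count(distinct visit) = 2
  )
  order by id, visit;
quit;
```

Applied to the dataset final listed below, this should select the same 15 participants that appear in final2.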
Outcome variable: Disease status (outcome= 0 means no disease present; outcome=1 means disease)
Independent variables: Biomarker (biom), plus visit as a fixed-effect factor, since we have repeated measurements
Full dataset (all 46 participants):
data final;
input id $ visit outcome biom;
datalines;
001 1 0 59.7
001 2 0 78.4
001 4 0 75.2
001 5 1 80.7
002 2 0 64.6
002 5 0 389
003 1 0 618
003 2 0 469
003 3 0 478
004 1 1 404
004 2 0 47.3
004 3 1 64.5
004 4 0 88.8
004 5 0 86.7
005 1 0 88.3
006 1 0 234
007 1 0 245
007 2 0 243
007 3 0 237
007 5 0 226
008 1 0 22.2
008 2 0 25.5
008 5 1 35.5
009 3 0 35.3
009 5 0 30.3
010 1 0 134
010 5 0 167
011 4 0 146
011 5 0 135
012 1 0 140
012 4 0 74.6
012 5 0 72.9
013 1 0 79.1
013 3 0 75.6
013 4 0 68.9
014 2 1 291
014 3 0 21.3
015 1 0 17
015 5 1 15.6
016 1 0 16.1
016 2 0 13.9
016 5 0 99.9
017 1 0 105
017 3 0 96.2
017 4 0 102
017 5 0 89.2
018 3 1 25.9
019 2 0 27.3
019 3 0 26.2
020 2 0 26.2
020 5 0 28.9
021 1 1 74.2
021 3 0 75.1
021 4 0 60.4
021 5 0 62.2
022 1 0 61.4
023 1 0 12.7
024 2 0 12.1
024 3 0 13.4
025 1 0 11.6
025 5 0 11.5
026 1 0 45.9
026 5 0 47.2
027 1 0 39
027 2 0 38.7
027 3 1 42.7
027 4 0 18.4
027 5 0 15.3
028 1 0 15.9
028 2 0 16.1
028 4 0 15.8
029 1 0 57.8
029 2 1 86.7
029 3 1 88.3
029 5 0 234
030 1 0 245
030 3 0 243
030 5 0 237
031 1 0 226
031 2 0 22.2
031 4 0 18.4
032 4 0 15.3
032 5 0 15.9
033 1 0 16.1
033 2 0 78.4
034 2 0 75.2
034 3 0 80.7
035 1 0 64.6
035 2 0 389
035 3 0 618
035 4 0 469
036 1 0 478
037 1 0 152
038 2 0 148
038 3 0 29.12
039 2 0 421
040 1 0 520
040 2 0 478
040 3 0 18.4
041 2 0 15.3
041 4 0 15.9
042 1 0 16.1
043 1 0 78.4
044 1 0 325
044 2 0 478
044 3 0 452
045 2 0 25.8
045 4 0 15.9
045 5 0 16.1
046 1 0 78.4
046 4 0 16.8
;
run;
data final2;
input id $ visit outcome biom;
datalines;
001 1 0 59.7
001 2 0 78.4
001 4 0 75.2
001 5 1 80.7
004 1 1 404
004 2 0 47.3
004 3 1 64.5
004 4 0 88.8
004 5 0 86.7
007 1 0 245
007 2 0 243
007 3 0 237
007 5 0 226
008 1 0 22.2
008 2 0 25.5
008 5 1 35.5
010 1 0 134
010 5 0 167
012 1 0 140
012 4 0 74.6
012 5 0 72.9
015 1 0 17
015 5 1 15.6
016 1 0 16.1
016 2 0 13.9
016 5 0 99.9
017 1 0 105
017 3 0 96.2
017 4 0 102
017 5 0 89.2
021 1 1 74.2
021 3 0 75.1
021 4 0 60.4
021 5 0 62.2
025 1 0 11.6
025 5 0 11.5
026 1 0 45.9
026 5 0 47.2
027 1 0 39
027 2 0 38.7
027 3 1 42.7
027 4 0 18.4
027 5 0 15.3
029 1 0 57.8
029 2 1 86.7
029 3 1 88.3
029 5 0 234
030 1 0 245
030 3 0 243
030 5 0 237
;
run;
Model:
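proc glimmix data=final2 method=quad(qpoints=50) noclprint;
  class id visit;
  /* event='1' models the probability of disease; by default GLIMMIX */
  /* models the first ordered response level, here outcome=0 */
  model outcome(event='1') = visit biom biom*visit / noint dist=binomial link=logit solution;
  /* random intercept per participant for the repeated measurements */
  random intercept / subject=id;
  /* predictions on the probability scale (ilink), random effects set to zero (noblup) */
  output out=fitglmm pred(ilink noblup)=pred;
run;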
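Written out, the PROC GLIMMIX call above corresponds to a random-intercept logistic model along the following lines. The symbols are my own notation, not names that appear in the SAS output:

```latex
\[
\operatorname{logit}(\pi_{ij}) = \alpha_j + (\beta + \gamma_j)\,\mathrm{biom}_{ij} + b_i,
\qquad b_i \sim N(0, \sigma_b^2)
\]
```

Here \pi_{ij} is the conditional probability of the modeled outcome level for participant i at visit j. The \alpha_j are visit-specific intercepts (from visit with noint), \beta + \gamma_j is the biomarker slope at visit j (from biom and biom*visit, with \gamma_j set to zero for the last visit level under the default parameterization), and b_i is the participant-level random intercept.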
My questions are:
Does my action plan for addressing this exploratory question seem correct?
Am I correct to keep only those participants who have both a baseline visit and a visit 5 assessment in the final analysis dataset, and to exclude the rest? Or is this problematic? If so, what are some correct alternatives?
Are there any pre-modeling visualization techniques that I can/should use to further explore my data? Is it ok to use boxplots to look at the distribution of my continuous variable at each level of the binary outcome? Should I maybe use point-biserial correlation first to see if there’s any evidence of a relationship at all between my predictor and dependent variable before fitting the model? If yes, is there such a thing as point-biserial correlation for repeated measures data, or should I just use the baseline values of the variables?
Is my model setup correct/complete?
How can I check whether my model fits the data well? I know that for regular linear models there are residual plots, QQ plots, checks for outliers and influential points, etc., but I am not sure what kind of model diagnostics are best for GLMMs.
Any other suggestions/recommendations?
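For the visualization question, the kind of boxplot I have in mind could be sketched as below. The log transform is my own assumption, motivated by the wide range of biom values, and the names plotdata and log_biom are hypothetical:

```sas
data plotdata;
  set final;
  /* optional: natural log to reduce the skew in the biomarker */
  log_biom = log(biom);
run;

proc sgpanel data=plotdata;
  /* one panel per visit, so the repeated measurements are not pooled */
  panelby visit / columns=5;
  /* one box per outcome level within each visit */
  vbox log_biom / category=outcome;
run;
```

With so few outcome=1 records, some boxes will rest on only one or two observations, so the panels are meant for eyeballing rather than inference.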
Thank you so much for any help and guidance.