I have a dataset that contains multiple test results (percentages) per participant, at various time points post kidney transplant. The dataset also contains the rejection group the participant belongs to (rej_group=0 if they didn't have allograft rejection, 1 if they did). The idea is that this test (cfDNA), which is a blood test, has the potential to be a non-invasive biomarker of allograft rejection (i.e., it can discriminate the rejection from the non-rejection group), as opposed to biopsy. Research has shown that participants who express cfDNA levels above 1% usually have a higher likelihood of allograft rejection than those with levels under 1%. What I'm trying to do is compute sensitivity, specificity, NPV, and PPV for the 1% cfDNA threshold. What I've done so far is displayed below (using a made-up dummy dataset with the same structure as my original data): basically, I've summarized the participant-level data by taking the median of the cfDNA results to account for the repeated measures, then categorized participants based on median_result > 1%, and finally computed the Se, Sp, NPV, and PPV. I'm really unsure whether this is the correct way to do it, though; it seems rather crude, but I'm not sure what else I could do with the data I have:
data test;
input id $ transdt :mmddyy10. rej_group date :mmddyy10. result;
format transdt mmddyy10. date mmddyy10.;
datalines;
1 8/26/2009 0 10/4/2019 0.15
1 8/26/2009 0 12/9/2019 0.49
1 8/26/2009 0 3/16/2020 0.41
1 8/26/2009 0 7/10/2020 0.18
1 8/26/2009 0 10/26/2020 1.2
1 8/26/2009 0 4/12/2021 0.2
1 8/26/2009 0 10/11/2021 0.17
1 8/26/2009 0 1/31/2022 0.76
1 8/26/2009 0 8/29/2022 0.12
1 8/26/2009 0 11/28/2022 1.33
1 8/26/2009 0 2/27/2023 1.19
1 8/26/2009 0 5/15/2023 0.16
1 8/26/2009 0 9/25/2023 0.65
2 2/15/2022 0 9/22/2022 1.32
2 2/15/2022 0 3/23/2023 1.38
3 3/25/2021 1 10/6/2021 3.5
3 3/25/2021 1 3/22/2022 0.18
3 3/25/2021 1 10/13/2022 1.90
3 3/25/2021 1 3/30/2023 0.23
4 7/5/2018 0 8/29/2019 0.15
4 7/5/2018 0 3/2/2020 0.12
4 7/5/2018 0 6/19/2020 6.14
4 7/5/2018 0 9/22/2020 0.12
4 7/5/2018 0 10/12/2020 0.12
4 7/5/2018 0 4/12/2021 0.29
5 8/19/2018 1 6/17/2019 0.15
6 1/10/2019 1 4/29/2019 1.58
6 1/10/2019 1 9/9/2019 1.15
6 1/10/2019 1 5/2/2020 0.85
6 1/10/2019 1 8/3/2020 0.21
6 1/10/2019 1 8/16/2021 0.15
6 1/10/2019 1 3/2/2022 0.3
7 7/16/2018 0 8/24/2021 0.28
7 7/16/2018 0 11/2/2021 0.29
7 7/16/2018 0 5/24/2022 2.27
7 7/16/2018 0 10/6/2022 0.45
8 4/3/2019 1 9/24/2020 1.06
8 4/3/2019 1 10/20/2020 0.51
8 4/3/2019 1 1/21/2021 0.39
8 4/3/2019 1 3/25/2021 2.44
8 4/3/2019 1 7/2/2021 0.59
8 4/3/2019 1 9/28/2021 5.54
8 4/3/2019 1 1/5/2022 0.62
8 4/3/2019 1 1/9/2023 1.43
8 4/3/2019 1 4/25/2023 1.41
8 4/3/2019 1 8/3/2023 1.13
9 3/12/2020 1 8/27/2020 0.49
9 3/12/2020 1 10/27/2020 0.29
9 3/12/2020 1 4/16/2021 0.12
9 3/12/2020 1 5/10/2021 0.31
9 3/12/2020 1 9/20/2021 0.31
9 3/12/2020 1 2/26/2022 0.24
9 3/12/2020 1 6/13/2022 0.92
9 3/12/2020 1 12/5/2022 2.34
9 3/12/2020 1 7/3/2023 2.21
10 10/10/2019 0 12/12/2019 0.29
10 10/10/2019 0 1/24/2020 0.32
10 10/10/2019 0 3/3/2020 0.28
10 10/10/2019 0 7/2/2020 0.24
;
run;
proc print data=test; run;
/* Create binary indicator for cfDNA > 1% */
data binary_grouping;
   set test;
   cfDNA_above = (result > 1); /* 1 if cfDNA > 1%, 0 otherwise */
run;

proc freq data=binary_grouping;
   tables cfDNA_above*rej_group;
run;

/* Summarize to one row per participant: median cfDNA across repeated measures */
proc sql;
   create table participant_level as
   select id, rej_group, median(result) as median_result
   from binary_grouping
   group by id, rej_group;
quit;

proc print data=participant_level; run;

data cfDNA_classified;
   set participant_level;
   cfDNA_class = (median_result > 1); /* Positive test if median cfDNA > 1% */
run;

proc freq data=cfDNA_classified;
   tables cfDNA_class*rej_group / nocol nopercent sparse out=confusion_matrix;
run;

data metrics;
   set confusion_matrix;
   if cfDNA_class=1 and rej_group=1 then TP = COUNT; /* True Positives */
   if cfDNA_class=0 and rej_group=1 then FN = COUNT; /* False Negatives */
   if cfDNA_class=0 and rej_group=0 then TN = COUNT; /* True Negatives */
   if cfDNA_class=1 and rej_group=0 then FP = COUNT; /* False Positives */
run;

proc print data=metrics; run;

proc sql;
   select
      sum(TP)/(sum(TP)+sum(FN)) as Sensitivity,
      sum(TN)/(sum(TN)+sum(FP)) as Specificity,
      sum(TP)/(sum(TP)+sum(FP)) as PPV,
      sum(TN)/(sum(TN)+sum(FN)) as NPV
   from metrics;
quit;
Any advice/thoughts would be greatly appreciated!
Thanks!
Given your repeated measurements and a genuinely continuous predictor, a different approach is needed than the simple calculation of those statistics from a 2x2 table as discussed in this note. For these data, you could fit a logistic Generalized Estimating Equations (GEE) model to account for the correlation among measurements within subjects. To get the desired statistics, you could then do an ROC analysis on the fitted model, but note that those statistics are defined for each possible cutoff on the continuous predictor. The change in sensitivity and specificity over the range of the predictor defines the ROC curve. It would be wasteful of valuable information to categorize or dichotomize the predictor. By using its actual values, you can plot how the event (rejection) probability changes over its range, along with confidence bands, which presumably would be very useful.
The following statements do this. The EFFECTPLOT statement produces the plot I refer to above. The ESTIMATE statement provides an estimate of the predicted event probability at your value of interest, 1%. Note that at 1%, the estimated rejection probability is 0.5171, but with a pretty wide confidence interval. In fact, across the range of the predictor, the confidence interval remains wide and extends well beyond 0.5 on both sides. The OUTPUT statement saves the predicted probabilities, which are then used in the ROC statement in PROC LOGISTIC to do the ROC analysis as discussed in this note. The ROC curve is shown with the values of the predictor labeling the points, so you can visually approximate where 1% falls and what its sensitivity and specificity would be. To nail that down a bit better, you can print the OUTROC= data set containing the points on the ROC curve. The point closest to the probability from the ESTIMATE statement (_PROB_=0.517) has probability 0.521, and you can read off its sensitivity (0.4) and specificity (1-0.24=0.76). You can easily compute the PPV (12/(12+7)=0.63) and NPV (22/(22+18)=0.55) from the columns of the 2x2 table produced by classifying the observations as predicted positives and negatives using the 0.521 cutoff on the event probability.
proc gee data=test descending;
   class id rej_group;
   model rej_group(event='1') = result / dist=b;
   repeated subject=id;
   effectplot / ilink;
   estimate '@1%' intercept 1 result 1 / ilink cl;
   output out=gout p=p;
run;

proc logistic data=gout rocoptions(id=id);
   id result;
   model rej_group(event='1') = / nofit outroc=or;
   roc 'GEE model' pred=p;
run;
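To make the last step above concrete, the 2x2 table for PPV and NPV can be produced by classifying each observation at the 0.521 cutoff on the predicted probability. This is a sketch that reuses the GOUT data set and the P variable created by the OUTPUT statement above; the data set and variable names introduced here (CLASSIFIED, PRED_POS) are arbitrary:

```sas
/* Classify observations at the 0.521 cutoff on the predicted probability */
data classified;
   set gout;
   pred_pos = (p >= 0.521); /* 1 = predicted rejection, 0 = predicted non-rejection */
run;

/* 2x2 table of predicted vs. observed status:
   PPV = TP/(TP+FP) from the pred_pos=1 row, NPV = TN/(TN+FN) from the pred_pos=0 row */
proc freq data=classified;
   tables pred_pos*rej_group / nocol nopercent;
run;
```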
A few thoughts / questions about this:
1) do you have a date at which follow-up ended (at either rejection or some other form of censoring)?
2) ...if so, have you looked at the relationship between time to rejection and cfDNA result, e.g., does a higher cfDNA result predict earlier failure?
3) ...related to the above, is there a relationship between the cfDNA result and time since transplant, or is it kind of all over the place?
4) Have you tried binning cfDNA result into, say, quintiles or similar to see what the probability of rejection is at different levels and if that relationship is linear? (see very crude proc logistic below)
5) Do you have other potential explanatory / confounding variables that might make your model more precise?
6) Have you thought about a machine learning type approach (assuming you have a fairly good sample size) where you separate your data into a training set and a validation set (building a model on the training set and testing its predictive ability on the validation data)?
proc format;
   value frescat
      low-<1 = 'res_0001'
      1-<3   = 'res_0103'
      3-<7   = 'res_0307'
      7-high = 'res_07up'
   ;
run;

proc logistic data=participant_level descending;
   format median_result frescat.;
   class median_result (ref='res_0001');
   model rej_group = median_result;
run;
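On point 6, a training/validation split could be sketched with PROC SURVEYSELECT. Splitting at the participant level (one row per id, as in PARTICIPANT_LEVEL above) keeps all of a subject's repeated measures on the same side of the split; the 70/30 rate, the seed, and the data set names SPLIT, TRAIN, and VALID are arbitrary choices for illustration:

```sas
/* Randomly flag ~70% of participants for training; OUTALL keeps every
   record and adds a Selected indicator (1 = sampled, 0 = not) */
proc surveyselect data=participant_level out=split
                  samprate=0.7 seed=12345 outall;
run;

/* Separate into training and validation data sets */
data train valid;
   set split;
   if Selected then output train;
   else output valid;
run;
```

You would then fit the model on TRAIN and score VALID to assess predictive ability out of sample.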
@StatDave, thanks so much, that's super helpful! One quick question: if I'm also interested in getting an estimate of the predicted rejection probability at 0.5%, would I just change the ESTIMATE statement in the GEE model to say estimate '@0.5%' intercept 0.5 result 0.5?
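For later readers of this thread: the values listed in an ESTIMATE statement are coefficients that multiply the model parameters, so the intercept coefficient should stay 1 and only the multiplier on result changes. A prediction at 0.5% would look like:

```sas
/* Coefficient on intercept stays 1; only the multiplier on result changes */
estimate '@0.5%' intercept 1 result 0.5 / ilink cl;
```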
Hi @Merdock,
Glad to see that StatDave's expert advice was useful to you. It would then be fair, and helpful to later readers, if you marked his most helpful reply as the accepted solution rather than your own "thank you" post. Could you please change that? It's very easy: select his post as the solution after clicking "Not the Solution" in the option menu (see icon below) of the current solution.
Hi @FreelanceReinh, thanks for pointing that out! That was an accident on my part (accepting my own response as the solution lol) but I fixed it now.
Cheers!