Hello, everyone. I am troubled by a problem regarding logistic regression.
The task I need to finish is to perform internal validation of a prediction model using the Bootstrap resampling method.
According to Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression,..., the "flow chart" for a basic version of this process is:
(1) Fit a logistic regression model to each Bootstrap sample (i.e. use each Bootstrap sample as the training set);
(2) Apply each of those models to the original dataset (i.e. use the original dataset as the validation set);
(3) Average the statistics (e.g. area under the ROC curve) across the models, each of which was fitted to a Bootstrap sample and evaluated on the original dataset.
Easy as the process might seem, I am stuck on the problem of outputting the statistics without printing them. Here is my code:
/* a is the original dataset; draw 2000 balanced Bootstrap samples */
proc surveyselect data=a out=b method=balbootstrap reps=2000;
run;

ods select none; /* suppress all displayed output for the 2000 fits */

/* (1) fit the logistic model to each Bootstrap sample */
proc logistic data=b outmodel=m;
   by replicate;
   class a b c / param=ref ref=first;
   model y(event='1') = a b c / parmlabel lackfit aggregate scale=pearson
                                selection=backward sls=0.09 ctable stb rsq;
   store n;
run;

/* (2) score the original dataset with the stored models: via PROC PLM ... */
proc plm source=n;
   score data=a out=s;
run;

/* ... and via PROC LOGISTIC with the INMODEL= option */
proc logistic inmodel=m;
   by replicate;
   score data=a out=result fitstat outroc=roc;
run;

ods select all; /* restore displayed output */
As the code shows, I used both PROC PLM and PROC LOGISTIC with the INMODEL= option to do this job, yet neither was satisfactory. Both produced statistics at the individual level; that is, both calculated prior and posterior probabilities for each and every observation in the original dataset. That is not what I want.
In PROC LOGISTIC, SAS can produce fit statistics when the FITSTAT option is added to the SCORE statement. As the code shows, I have done this. But it is frustrating that although the fit statistics (e.g. R-square, Brier score, area under the ROC curve) are exactly what I want, there seems to be no way of outputting them to a dataset: the "result" dataset does not contain these statistics. So it seems that the only way of getting the numbers out of the printing window is to invoke the printing process and send the output to Excel with a statement like ODS TAGSETS.EXCELXP or ODS EXCEL, and then calculate the mean of the statistics in Excel.
But since there are 2000 replications (2000 Bootstrap samples), the printing workload and the time it takes may be formidable. Worse, the computer might simply crash without generating anything useful. That is what happened on another occasion when I tried to send all of the displayed results for 2000 Bootstrap samples to a PDF: the computer eventually produced a 1.3 GB PDF, but there was no way of opening it; it kept reporting errors no matter how many times I tried.
So here is my question: is there a way to output the fit statistics from the SCORE statement of PROC LOGISTIC to a dataset?
Many thanks!
As mentioned in this note, you can always save any table produced by any procedure using the ODS OUTPUT statement; you just need to find the name of the table you want to save. As shown in the "ODS Table Names" section of the LOGISTIC documentation, the name of the table of fit statistics from the FITSTAT option is ScoreFitStat. Also, if you want to suppress the display of all tables for a section of code, specify ODS EXCLUDE ALL; at the start and then ODS SELECT ALL; at the end.
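For instance, a minimal sketch applied to the scoring step from the original post (same dataset and model names as above):

ods exclude all;   /* suppress all displayed tables */
proc logistic inmodel=m;
   by replicate;
   score data=a out=result fitstat outroc=roc;
   ods output ScoreFitStat=fitstats;   /* saves one set of fit statistics per replicate */
run;
ods select all;    /* restore displayed output */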
Many thanks! The method was terrific. It worked perfectly well.
So here is another question: the fit statistics displayed by the FITSTAT option are limited, and I wish to have other statistics as well (e.g. Somers' D, sensitivity, specificity). The latter two (sensitivity and specificity) can be computed at an individual level using PROC LOGISTIC with the INMODEL= option. But what I wish to know is the sensitivity and specificity of each model at a given cut-off level; that is, instead of individual-level sensitivity and specificity, I wish to know model-level sensitivity and specificity.
As for Somers' D, I found no way of computing that statistic, whether via PROC LOGISTIC with the INMODEL= option or via PROC PLM. So is there a way of computing it when the SCORE statement and the INMODEL= option are used?
Thank you!
@StatDave wrote:
What is it that you ultimately are trying to do ... get the ROC area for each model?
Thank you for your detailed explanation. The area under the ROC curve (also known as the concordance index, or c-index for short) is a statistic describing the predictive efficacy (some say discriminative ability) of a model. In internal validation of a predictive model via Bootstrap resampling, I wish to know the concordance index of the predictive model. A basic approach is to average the concordance index (i.e. the area under the ROC curve) over every model that was generated from a Bootstrap sample and applied to the original sample (original dataset).
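A minimal sketch of that averaging step, assuming the per-replicate fit statistics were saved as fitstats via ODS OUTPUT ScoreFitStat= as above (the name of the column holding the area should be checked in the saved dataset; AUC here is an assumption):

/* average the per-replicate areas to obtain the Bootstrap estimate; */
/* the column name AUC is an assumption - verify it in fitstats      */
proc means data=fitstats mean;
   var auc;
run;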
@StatDave wrote:
The sensitivity and specificity at one cutoff are just the coordinates of a single point on the ROC curve for a fitted model.
Yes, you are right. Before opening the SAS Community on my mobile phone to check whether anyone had responded to my question, I suddenly realized that there is no such thing as model-level sensitivity and specificity: the two statistics vary with the cut-off even within a single model.
@StatDave wrote:
And Somers' D is just a transformation of the area under the ROC curve: D = 2c - 1, where c is the area. Both D and c are in a table that PROC LOGISTIC creates; again, its table name is in the "ODS Table Names" section of the documentation.
Thank you for your precious information. 👍 Once again, it dawned on me that the area under the ROC curve for each model had already been output before I checked my message box. You also reminded me of that fact, so that question is now solved.
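For the record, a minimal sketch of that transformation, again assuming the fit statistics were saved as fitstats via ODS OUTPUT ScoreFitStat= and that the area column is named AUC (an assumption to verify in the saved dataset):

/* Somers' D from the saved areas: D = 2c - 1 */
data fitstats_d;
   set fitstats;
   somers_d = 2*auc - 1;   /* auc assumed to hold the area under the ROC curve */
run;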
I would like to give a more detailed explanation of my question regarding sensitivity and specificity. I wonder how the misclassification (error) rate, which can be computed on request via the FITSTAT option in the SCORE statement, is computed. As SAS Help shows, the formula of the misclassification rate is: error rate = (number of observations with F_Yi ≠ I_Yi) / N. Simply speaking, according to this formula, the proportion of observations that were misclassified is designated as the misclassification rate.
But now a series of questions follows. I looked up SAS Help to see what F_Yi and I_Yi stand for. SAS Help states that F_Yi is the normalized level of the response variable Y for each observation in the data set being scored, while I_Yi is the normalized level into which the observation is classified. Since the dependent variable (response variable) is discrete, I wonder how these levels are normalized, which is not illustrated in detail in SAS Help.
A second and more important question is: how is each observation classified? SAS Help states that an observation is classified into the level with the largest probability. Does this mean that, when the dependent variable is binary, SAS uses 0.5 as the cut-off by default? It is easily inferred that for a binary dependent variable, if the posterior probability of "success" for a given observation is larger than 0.5, then the posterior probability of "failure" must be less than 0.5, so the observation would be classified as "success" according to the rule described in SAS Help.
It is well understood that 0.5 is not always the "best" cut-off. Therefore, for each model, I wish to classify the observations at the cut-off with the largest Youden index. Can I do this in SAS?
Many thanks!
Note that if your ultimate goal is to get an estimate of the area under the ROC curve that removes the optimistic bias resulting from using the same data to fit and evaluate the model, then that can easily be done using cross validation, as shown in the second half of this note. That provides a bias-reduced point estimate of the area. If you also want a standard error and confidence interval for the area (and an estimate of Somers' D), the same can be done by using the cross validated predicted probabilities. For example, using the remission data from one of the examples in the PROC LOGISTIC documentation:
/* obtain the leave-one-out (cross validated) predicted probabilities */
proc logistic data=remiss plots(only)=roc;
   model remiss(event="1") = smear blast;
   output out=out predprobs=x;   /* PREDPROBS=X writes cross validated probabilities (XP_1, XP_0) */
run;

/* get the ROC area with its standard error and confidence interval,
   using the cross validated probabilities as the predictor */
proc logistic data=out;
   model remiss(event="1") = smear blast;
   roc pred=xp_1;
run;
Thank you very much for the additional information and the detailed illustration. I took a look at the note you referred to and found that the method you mentioned is not the method I am trying to use. Still, it is of great value for my future studies, as I am not that familiar with the ROC statement of PROC LOGISTIC; your illustration served as a useful example.
The method you mentioned follows the "leave-one-out paradigm" (a term I made up myself), which is somewhat similar to the basic logic of the GCV criterion in generalized additive models.
According to Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression,..., the method I am using is called the "enhanced Bootstrap method", which also takes bias into account and tackles it, albeit not via the "leave-one-out paradigm". The method I described in my original post, where I voiced my frustration at not being able to output the fit statistics to another dataset, is called the "simple Bootstrap method" by the author of the book I cited. Simple as it is, the results it generates are somewhat biased, so I am moving on to the "enhanced Bootstrap method".
It isn't clear that your method would be more effective at bias reduction than cross validation, but it certainly will be more cumbersome to compute.
Trivial algebra can show that classifying by the larger predicted probability is equivalent, in the binary response case, to using 0.5 as a cutoff: since the two probabilities sum to 1, one of them exceeds the other exactly when it exceeds 0.5.
"Normalized" is clearly defined in the "Details: Input and Output Data Sets: OUT= Output Data Set in a SCORE Statement" section of the LOGISTIC documentation. It simply means left-justified, formatted values of 16 characters or less.
As I often point out in this Community, questions about the availability of a particular statistic or method in SAS should begin by checking the list of Frequently Asked-for Statistics (FASTats) in the Important Links section of the Statistical Procedures Community page. It isn't exhaustive, but many things are there, including Youden's index and its formula. It is not directly provided by PROC LOGISTIC, but it can easily be computed for each observation from the sensitivity and specificity values in the OUTROC= data set. You would then find the observation with the maximum value and use its predicted probability as the cutoff.
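A minimal sketch of that computation, assuming the OUTROC= data set roc and the BY variable Replicate from the code in the original post (the OUTROC= data set contains _PROB_, _SENSIT_, and _1MSPEC_):

/* Youden's index J = sensitivity + specificity - 1 at every cutoff */
data youden;
   set roc;
   youden = _SENSIT_ - _1MSPEC_;   /* equals sensitivity + specificity - 1 */
run;

/* within each replicate, keep the cutoff (_PROB_) with the largest J */
proc sort data=youden;
   by replicate descending youden;
run;

data cutoffs;
   set youden;
   by replicate;
   if first.replicate;
run;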
Thank you for your reply. You questioned the reliability of the so-called "enhanced Bootstrap model validation method" compared with "leave-one-out" cross validation, which has aroused my interest in the differences between the two methods. I have found reviews reporting deficiencies of "leave-one-out" cross validation, including a tendency to select models with a relatively large number of variables, but as I am busy on weekdays, I have not yet had time to read them carefully. I also have other questions I would like to raise, so I will respond to your reply after reading the articles and sorting out my questions.