I'd like to check how well a model predicts additional data using ROC curves. The problem is that all of the methods I've found for calculating sensitivity and specificity rely on the ROC option in PROC LOGISTIC. Even the ROC macros don't provide these values, only ROC plots and area-under-the-curve statistics. Given a column of observed responses (1 or 0) and a column of predicted probabilities, does anyone know a way to calculate the sensitivity and specificity for a range of cut points (I think the cut points have to be the same as the predicted probabilities)?
The reason I'm having this problem is that I've estimated the model in PROC GENMOD. I can get the parameter estimates from GENMOD and make the predictions, but I can't get the parameter estimates into LOGISTIC to use the ROC option, and I can't get the sensitivity and specificity from the two ROC macros.
For instance, if your data set is named MyROC and it contains your binary response variable called Y and your predicted probability variable called PRED, this call to the ROC macro will provide point- and confidence interval estimates of the area under the ROC curve:
%roc(data=MyROC, response=Y, var=PRED)
Beginning in SAS 9.2, it's easy to do directly in PROC LOGISTIC:
proc logistic data=MyROC;
See the descriptions of the ROC and ROCCONTRAST statements in the SAS 9.2 LOGISTIC documentation.
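Assuming the same MyROC, Y and PRED names as in the macro call above, a minimal sketch of the SAS 9.2 approach might look like the following. NOFIT and the ROC statement's PRED= option are documented LOGISTIC features; the label 'Predicted' is just a placeholder.

```sas
ods graphics on;
proc logistic data=MyROC;
  /* NOFIT: do not fit a new model; only evaluate the supplied predictions */
  model Y(event='1') = / nofit;
  /* PRED= names a variable that already holds predicted probabilities */
  roc 'Predicted' pred=PRED;
run;
ods graphics off;
```

This should report the area under the ROC curve for PRED and, with ODS Graphics on, draw the curve.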
Can I ask why you are using GENMOD to estimate the parameters of your model instead of fitting it with PROC LOGISTIC? It is not clear from your post why you cannot use PROC LOGISTIC all the way.
But if you really need to fit your model with GENMOD and then construct an ROC curve for that fitted model, you can pass the linear predictor (X*beta_hat) from PROC GENMOD as the predictor variable in the LOGISTIC procedure. If you specify the NOINT option in PROC LOGISTIC and use the linear predictor from the GENMOD procedure (and take care to model the same level of the response as the event of interest), then you will fit exactly the same "model" in PROC LOGISTIC as you had fit using the GENMOD procedure.
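A rough sketch of that hand-off, assuming a logit link and using hypothetical names throughout (MyData, predictors X1 and X2, GenmodOut, XB, RocOut), could be:

```sas
/* Fit in GENMOD; DESCENDING makes Y=1 the modeled event */
proc genmod data=MyData descending;
  model Y = X1 X2 / dist=binomial link=logit;
  /* XBETA= saves the linear predictor, PRED= the predicted probability */
  output out=GenmodOut xbeta=XB pred=PRED;
run;

/* Refit on the linear predictor alone, with no intercept */
proc logistic data=GenmodOut;
  model Y(event='1') = XB / noint outroc=RocOut;
run;
```

The slope on XB should come out at (essentially) 1, and RocOut then holds _SENSIT_ and _1MSPEC_ for each cutpoint.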
Sensitivity and specificity are defined in terms of the counts of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN): sensitivity = TP/(TP+FN) and specificity = TN/(TN+FP). Given a model that forecasts the result, you simply compare the actual to the predicted value to determine which category each observation falls into for a given cut point. The ROC curve is simply a plot of the points (1-specificity, sensitivity) calculated for a range of cut points.
You can write the appropriate data step code to score the data using your model equation, then generate TP, FP, TN and FN counts for selected cutpoints, derive a series of plot points from those counts, and generate your ROC curve.
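As a sketch of that idea, assuming the data are already scored in a data set MyROC with Y (0/1) and PRED as in the original question, the counts for a grid of cutpoints could be built like this. The grid of 21 cutpoints and the names Cutpoints, Counts and RocPoints are arbitrary choices for illustration, and an observation is called positive when PRED >= cut.

```sas
/* Cutpoints 0, 0.05, ..., 1 */
data Cutpoints;
  do i = 0 to 20;
    cut = i / 20;
    output;
  end;
  drop i;
run;

/* Cross every observation with every cutpoint and count the four cells */
proc sql;
  create table Counts as
  select c.cut,
         sum(d.Y = 1 and d.PRED >= c.cut) as TP,
         sum(d.Y = 0 and d.PRED >= c.cut) as FP,
         sum(d.Y = 0 and d.PRED <  c.cut) as TN,
         sum(d.Y = 1 and d.PRED <  c.cut) as FN
  from Cutpoints as c, MyROC as d
  where d.PRED is not missing and d.Y in (0, 1)
  group by c.cut;
quit;

/* Turn counts into ROC plot points */
data RocPoints;
  set Counts;
  if TP + FN > 0 then Sensitivity = TP / (TP + FN);
  if TN + FP > 0 then do;
    Specificity  = TN / (TN + FP);
    OneMinusSpec = 1 - Specificity;
  end;
run;
```

To use the observed predicted probabilities as cutpoints instead, the Cutpoints data set could be replaced by the distinct values of PRED.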
The model does not have to be created by any specific SAS/STAT proc.
If you need a simple explanation or definition of sensitivity, specificity and ROC curves, a Google search explains them well enough as input to such a coding exercise. At the end of it you can validate what you have coded against one of the PROC LOGISTIC samples, using the CTABLE option to generate the counts for the same cutpoints you use in your code.
You may also find it visually helpful to overlay the line from (0,0) to (1,1) on your graph of the points comprising your ROC curve.
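One way to draw that, assuming you already have a data set (hypothetically named RocPoints here) with one row per cutpoint and variables Sensitivity and OneMinusSpec, is PROC SGPLOT with a LINEPARM reference line:

```sas
/* SERIES connects points in data order, so sort along the x axis first */
proc sort data=RocPoints out=RocSorted;
  by OneMinusSpec Sensitivity;
run;

proc sgplot data=RocSorted;
  series x=OneMinusSpec y=Sensitivity / markers;
  /* Chance line through (0,0) with slope 1 */
  lineparm x=0 y=0 slope=1 / lineattrs=(pattern=dash);
  xaxis min=0 max=1 label='1 - Specificity';
  yaxis min=0 max=1 label='Sensitivity';
run;
```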
One advantage of Proc Logistic is that it does provide a measure of the area under the ROC curve for you. Generating the area under the curve using a data step requires a bit of geometry, and is marginally more difficult than the generation of the points for the ROC Curve itself.
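The geometry in question is usually the trapezoidal rule. A minimal sketch, again assuming a hypothetical RocPoints data set with Sensitivity and OneMinusSpec and assuming it includes the end points (0,0) and (1,1):

```sas
proc sort data=RocPoints out=RocSorted;
  by OneMinusSpec Sensitivity;
run;

data Auc;
  set RocSorted end=last;
  /* LAG is called on every row so its queue stays in step with the data */
  prev_x = lag(OneMinusSpec);
  prev_y = lag(Sensitivity);
  /* Sum statement: AUC starts at 0 and is retained across rows */
  if _n_ > 1 then AUC + (OneMinusSpec - prev_x) * (Sensitivity + prev_y) / 2;
  if last then output;
  keep AUC;
run;
```

With a coarse grid of cutpoints this only approximates the value PROC LOGISTIC reports, which is based on every distinct predicted probability.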
Whether you approach this by coding your own utility for generating ROC curves and go as far as calculating the area under the curve as well or use a purely SAS proc driven approach depends on a number of factors, including availability of software and coding competence.
PROC LOGISTIC calculates all the sensitivity and 1-specificity values for the range of cutoff points. That's how it constructs the ROC curve. You can output these values to a dataset that will also allow you to calculate PPV and NPV for the range of cutoff points.
Assume you have a numerical variable that you are going to use for discriminating between two groups (cases, coded as 1, and non-cases, coded as 0). The following statements will plot the ROC curve and produce a dataset with the components that will let you calculate specificity, sensitivity, PPV and NPV for each of the values of the numerical variable:
ods graphics on;
proc logistic data=your_data plots(only)=roc(id=obs);
model case(event='1') = numerical_variable / outroc=data_roc;
run;
ods graphics off;
For each cutoff point (i.e. for each value of the numerical variable), sorted in descending order, the dataset data_roc contains the following variables:
_pos_ (frequency of cases correctly classified as cases, i.e. true positives)
_neg_ (frequency of non-cases correctly classified as non-cases, i.e. true negatives)
_falpos_ (frequency of non-cases wrongly classified as cases)
_falneg_ (frequency of cases wrongly classified as non-cases)
With these variables you can easily calculate the specificity:
Now, to calculate PPV and NPV:
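Using the standard definitions and the four counts listed above, these calculations could be sketched in a single data step. The output name data_roc_measures and the new variable names are placeholders; the OUTROC= dataset also already carries _sensit_ and _1mspec_.

```sas
data data_roc_measures;
  set data_roc;
  /* Sensitivity: share of true cases classified as cases */
  sensitivity = _pos_ / (_pos_ + _falneg_);
  /* Specificity: share of true non-cases classified as non-cases */
  specificity = _neg_ / (_neg_ + _falpos_);
  /* PPV and NPV are undefined when nothing is classified on that side */
  if _pos_ + _falpos_ > 0 then ppv = _pos_ / (_pos_ + _falpos_);
  if _neg_ + _falneg_ > 0 then npv = _neg_ / (_neg_ + _falneg_);
run;
```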
Unfortunately the "data_roc" does not have the values of the numerical_variable. So you need to sort the original dataset and merge it with the "data_roc" set to have everything together. Then you can plot each measure or conduct other analyses.
The sorting and merging would be something like this:
proc sort data=your_data;
by descending numerical_variable;