Can anyone help me with how to plot a calibration curve/line for a binary outcome? When I plot the predicted probability vs. the actual outcome, I get straight lines through 0 and 1 because the outcome is binary. I can't seem to figure out how to do it otherwise.
Thanks
I assume that you have one or more continuous explanatory variables?
I usually use PROC LOGISTIC to model the data and use the PLOTS=EFFECT option or the newer EFFECTPLOT statement to graph the results.
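A minimal sketch of both forms, assuming a hypothetical data set Have with a binary response Y (coded 0/1) and one continuous explanatory variable X. These names are placeholders and do not come from the original post:

```sas
ods graphics on;

/* Assumed names: data set Have, binary response Y (0/1), continuous X */
/* Form 1: request the effect plot with the PLOTS= option */
proc logistic data=Have plots(only)=effect;
   model Y(event='1') = X;
run;

/* Form 2: same model, requesting the plot with the EFFECTPLOT statement */
proc logistic data=Have;
   model Y(event='1') = X;
   effectplot fit(x=X);
run;
```

Both forms graph the predicted probability against X with the observed 0/1 outcomes overlaid.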
If you prefer to do it "by hand," plot the predicted probabilities as Y and the explanatory variable as X.
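One way the "by hand" version might look, again assuming the placeholder names Have, Y, and X. The output data set Pred and the variable PredProb are also assumed names:

```sas
/* Assumed names: Have, Y (0/1), X; PredProb will hold the fitted probability */
proc logistic data=Have noprint;
   model Y(event='1') = X;
   output out=Pred p=PredProb;     /* write predicted probabilities to Pred */
run;

/* Sort by X so that the SERIES statement connects points left to right */
proc sort data=Pred;
   by X;
run;

proc sgplot data=Pred;
   series x=X y=PredProb;
   yaxis min=0 max=1 label="Predicted probability";
run;
```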
Rick,
Thank you for your comment. I do have explanatory variables. However, I am not trying to plot the explanatory variable against the predicted probability, which is what EFFECTPLOT gives. What I am trying to do is plot the predicted probability versus the actual outcome. I already have a model (formula) that I can use to calculate the predicted probability, but my outcome is binary. So when I plot the predicted versus actual outcome, I get two lines through 0 and 1. I think I need to divide the predicted probabilities into deciles and plot them against the actual frequency, but I can't seem to figure out how. Please let me know if this makes sense. Thanks for the help.
Unfortunately I am not understanding what you are trying to do. Can you link to a web page that has a picture of what you are trying to achieve?
Here is the link of the article. It is similar to what I am trying to do. Figure one is what I am trying to get. Thank you.
The link didn't post.
OK. That's clearer. This isn't really a "graph" question, it's a "how do I compute the quantities needed for a graph" question.
Here's what you need:
1) Compute the deciles of the predicted probabilities
2) For each decile, compute the mean and upper/lower 95% confidence interval for the observed outcome. The mean is also the "percentage of observed values that are 1."
A "short answer" is that you can do it like this:
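/* set deciles of predicted risk (quick approximation: equal-width bins) */
data Deciles;
set Pred; /* data that includes variable PredProb for predicted probabilities */
Decile = int(10*PredProb)/10;
run;
/* plot the mean of the binary outcome Y, with confidence limits, for each bin */
proc sgplot data=Deciles;
dot decile / response=y stat=mean limitstat=clm;
run;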
This will get you in the ballpark, and would be sufficient for "internal" plots that you intend for yourself or your group.
Unfortunately there are three problems with this approach if you are trying to EXACTLY reproduce the figure in the paper:
1) The DOT statement displays a graph with the deciles on the vertical axis, which is opposite from the graph in the paper.
2) The LIMITSTAT= option computes confidence limits by using the standard formula for normally distributed data. These data are binary, and therefore you should really use CIs for binomial proportions (not a big deal if you have lots of data, but still...)
3) The plot in the paper also overlays a curve which I assume is a nonparametric smoother (for example, a loess curve) through the (Y, PredProb) points.
All of these problems can be surmounted: call PROC FREQ to get the stats and then overlay a SCATTER statement (with the YERRORUPPER= and YERRORLOWER= options) with a LOESS curve.
Oh, and properly you should use PROC RANK to get the deciles instead of the quick approximation that I used here.
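A rough sketch of how those pieces might fit together, assuming the data set Pred contains PredProb and a binary outcome Y coded 0/1. The data set names Ranked, MeanPred, Obs, and Calib are made up for illustration. The sketch also assumes that every decile contains at least one observation with Y=1; otherwise the LEVEL= option has nothing to match in that group:

```sas
/* Assumed input: Pred with PredProb (predicted probability) and Y (0/1) */
proc rank data=Pred out=Ranked groups=10;
   var PredProb;
   ranks Decile;                      /* Decile takes the values 0,...,9 */
run;

proc sort data=Ranked;
   by Decile;
run;

/* Mean predicted probability in each decile (horizontal coordinate) */
proc means data=Ranked noprint;
   by Decile;
   var PredProb;
   output out=MeanPred(keep=Decile MeanPred) mean=MeanPred;
run;

/* Observed proportion of Y=1 and exact binomial limits in each decile */
proc freq data=Ranked noprint;
   by Decile;
   tables Y / binomial(level='1');
   output out=Obs(keep=Decile _BIN_ XL_BIN XU_BIN) binomial;
run;

data Calib;
   merge MeanPred Obs;
   by Decile;
run;

/* Deciles on the horizontal axis, binomial error bars, loess overlay */
proc sgplot data=Calib noautolegend;
   scatter x=MeanPred y=_BIN_ / yerrorlower=XL_BIN yerrorupper=XU_BIN;
   loess x=MeanPred y=_BIN_ / nomarkers;
   lineparm x=0 y=0 slope=1 / lineattrs=(pattern=dash);
   xaxis label="Mean predicted probability";
   yaxis label="Observed proportion";
run;
```

Here _BIN_ is the observed proportion, and XL_BIN and XU_BIN are the exact binomial confidence limits that PROC FREQ writes to the output data set. L_BIN and U_BIN hold the asymptotic (Wald) limits if you prefer those.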
Dear Rick:
Sorry to bother you! Would you please clarify a couple of points regarding the three differences you mentioned between the graph of interest (Figure 1 in the attachment) and the graph you drew?
1) How would you fix the problem of the deciles being on the vertical axis instead of the horizontal axis?
2) How exactly would you overlay the scatter plot, with its upper and lower limits, and a LOESS curve?
I understand that this code may be super easy for you, but it is actually the bottleneck for me on this question. I googled for a day and found nothing. Thank you!
If you post sample data, we can make concrete suggestions. But it sounds like you want something like the following. Here I am using PROC SGPLOT, which has a simple syntax:
data Have;
input decile y low hi;
datalines;
1 1 0 2
2 3 1 4
3 4 2 5
4 6 4 7
5 5.5 5 6
6 5 4 6
7 4.5 4 7
8 3 1 5
9 2 1 3
;
proc sgplot data=Have;
scatter x=decile y=Y / YErrorLower=low YErrorUpper=hi;
loess x=decile y=y;
run;
Dear Rick:
Thank you so much for the instructions!
Now I'm just one step away from the figure in that article. Below is the code I learned from you and applied to this question. The dataset is attached. The variable phat_mean is the predicted risk by group, and ob_risk is the observed risk by group. The 10 groups were formed from the deciles of the predicted risk. My final question is: how do I reproduce the dashed diagonal line, which appears to be a reference line, in Figure 1 of that article?
proc sgplot data=ning;
scatter x=phat_mean y=ob_risk / YErrorLower=Lower_CI YErrorUpper=Upper_CI;
loess x=phat_mean y=ob_risk;
run;
Looking forward to your further instruction!
Many thanks!
Ning
Glad you are making progress. I suggest you also add the NOMARKERS option to the LOESS statement. Then use the LINEPARM statement, like this:
proc sgplot data=ning;
scatter x=phat_mean y=ob_risk / YErrorLower=Lower_CI YErrorUpper=Upper_CI;
loess x=phat_mean y=ob_risk / nomarkers;
lineparm x=0 y=0 slope=1 / lineattrs=(pattern=dash);
run;
Dear Rick:
Thank you for your tremendous help!
This "calibration plot" is a question in an assignment for a master's-level university course. To the best of my knowledge, your solution is the only, and THE BEST, resource available online as of now. In my homework, I cited your instructions and wrote: "The solution and the codes were developed under the guidance of, and virtually by, Mr. Rick from the SAS Institute. Web link listed below. https://communities.sas.com/t5/forums/replypage/board-id/sas_graph/message-id/11641"
The final graph is enclosed in the attachment for other SAS users' reference. This calibration plot is widely used to illustrate the performance of a "risk prediction model" in the medical field. It compares the predicted risk with the observed risk by level of the predicted risk. Statistically, it is also a visualization of the "Hosmer-Lemeshow test," which examines the extent to which the predicted values produced by the statistical model match the observed values obtained from the real world.
Many thanks again!
Ning
For future reference, here are two blog posts about calibration curves in SAS:
Calibration plots in SAS [using loess curves]
Decile calibration plots in SAS
The second article includes a comparison of the two methods. It ends with the following recommendation:
"Many leading researchers in logistic regression do not recommend the Hosmer-Lemeshow test for these reasons. The decile-based calibration curve shares the same drawbacks. Since SAS can easily create the loess-based calibration curve (see the previous article), there seems to be little reason to prefer the decile-based version."
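For comparison, a loess-based calibration curve can be sketched directly from the raw data without any binning. This is an illustration only, not necessarily the code from the blog posts, and it assumes a data set Pred with the predicted probability PredProb and the binary outcome Y coded 0/1:

```sas
/* Assumed input: Pred with PredProb (predicted probability) and Y (0/1) */
proc sgplot data=Pred noautolegend;
   /* Smooth the 0/1 outcome against the predicted probability */
   loess x=PredProb y=Y / nomarkers clm;
   /* Diagonal reference line: perfect calibration */
   lineparm x=0 y=0 slope=1 / lineattrs=(pattern=dash);
   xaxis label="Predicted probability" grid;
   yaxis label="Observed outcome (loess smooth)" grid;
run;
```

A well-calibrated model gives a smooth curve that stays close to the dashed diagonal.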