10-01-2012 11:23 PM
Can anyone help me on how to plot a calibration curve/line with binary outcome? When I plot the predicted probability vs. the actual outcome I get straight line thru 0 and 1 because of binary outcome. I can't seem to figure out otherwise.
10-02-2012 08:15 AM
I assume that you have one or more continuous explanatory variables?
I usually use PROC LOGISTIC to model the data and use the PLOTS=EFFECT statement or the newer EFFECTPLOT statement to graph the results:
If you prefer to do it "by hand," plot the predicted probabilities as Y and the explanatory variable as X.
10-03-2012 07:00 PM
Thank you for you comment. I do have explanatory variables. However, I am not trying to plot the explanatory variable to predicted probability which is what EFFECTPLOT gives. What I am trying to do is plot the predicted probability versus the actual outcome. I already have a model (formula) that I can calculate the predicted probability with but my outcome is binary. So when I plot the predicted versus actual outcome, I get two lines thru o and 1 for binary. I think I need to divided the prob in deciles and plot against the actual frequency and I can seem to figure out how. Please let me know if this makes sense. Thanks for the help.
10-03-2012 07:28 PM
Unfortunately I am not understanding what you are trying to do. Can you link to a web page that has a picture of what you are trying to achieve?
10-04-2012 12:19 PM
OK. That's clearer. This isn't really a "graph" question, it's a how do I compute the quantities needed for a graph" question.
Here's what you need:
1) Compute the deciles of the predicted probabilities
2) For each decile, compute the mean and upper/lower 95% confidence interval for the observed outcome. The mean is also the "percentage of observed values that are 1."
A "short answer" is that you can do it like this:
/* set deciles of predicted risk */
set Pred; /* data that includes variable PredProb for predicted probabilities */
Decile = int(10*PredProb)/10;
proc sgplot data=all;
dot decile / response=y stat=mean limitstat=clm;
This will get you in the ballpark, and would be sufficient for "internal" plots that you intend for yourself or your group.
Unfortunately there are three problems with this approach if you are trying to EXACTLY reproduce the figure in the paper:
1) The DOT statement displays a graph with the deciles on the vertical axis, which is opposite from the graph in the paper.
2) The LIMITSTAT= option computes confidence limits by using the standard formula for normally distributed data. These data are binary, and therefore you should really use CIs for binomial proportions (not a big deal if you have lots of data, but still...)
3) The plot in the paper also overlays a curve which I assume is a nonparametric smoother (for example, a loess curve) through the (Y, PredProb) points.
All of these problems can be surmounted: call PROC FREQ to get the stats and then overlay the SCATTERPLOT / YUPPERLIMIT= YLOWERLIMIT= statement with a LOESS curve.
02-19-2017 08:14 PM - edited 02-19-2017 08:17 PM
Sorry to bother you! Would you please clarify a couple of points regarding the three differences, as you mentioned, between the graph of interest (Figure 1 in the attachment) and the graph you drew?
1)How would you going to fix the problem of deciles being on the vertical axis instead of the horizontal axis;
2)How exactly would you implement the "overlay the SCATTERPLOT / YUPPERLIMIT= YLOWERLIMIT= statement with a LOESS curve".
I understand that these codes may be super easy to you, but they are actually the bottleneck for me for this question. I googled for one day and found nothing. Thank you!
02-20-2017 08:18 AM - edited 02-20-2017 08:18 AM
If you post sample data, we can make concrete suggestions. But it sounds like you want something like the following. Here I am using PROC SGPLOT, which has a simple syntax:
data Have; input decile y low hi; datalines; 1 1 0 2 2 3 1 2 3 4 2 5 4 6 4 7 5 5.5 5 6 6 5 4 6 7 4.5 4 7 8 3 1 5 9 2 1 3 ; proc sgplot data=Have; scatter x=decile y=Y / YErrorLower=low YErrorUpper=hi; loess x=decile y=y; run;
02-20-2017 12:03 PM
Thank you so much for the instructions!
Now I'm just one step away from the figure in that article. Below are the codes I learned from you and applied to this question. The dataset was attached in the attachment. The variable phat_mean is the predicted risk by group, and the ob_risk is the observed risk by group. The 10 groups were divided based on the deciles of the predicted risk. My final question is: how to reproduce that dashed diagonal line, which appears to be a reference line, in Figure 1 of that article?
proc sgplot data=ning;
scatter x=phat_mean y=ob_risk / YErrorLower=Lower_CI YErrorUpper=Upper_CI;
loess x=phat_mean y=ob_risk;
Looking forward to your further instruction!
02-20-2017 01:01 PM
Glad you are making progress. I suggest you also add the NOMARKERS option to the LOESS statement. Then use the LINEPARM statement, like this:
proc sgplot data=ning; scatter x=phat_mean y=ob_risk / YErrorLower=Lower_CI YErrorUpper=Upper_CI; loess x=phat_mean y=ob_risk / nomarkers; lineparm x=0 y=0 slope=1 / lineattrs=(pattern=dashed); run;
02-20-2017 06:36 PM - edited 02-20-2017 06:42 PM
Thank you for your tremendous help!
This "calibration plot" is an question in my assignment of a master-degree university course. To the best of my knowledge, your solution is the only, and THE BEST, resource that is available online as of now. In my homework, I cited your instructions and wrote: "The solution and the codes were developed under the guidance of, and virtually by, Mr. Rick from the SAS Institute. Web link listed below. https://communities.sas.com/t5/forums/replypage/board-id/sas_graph/message-id/11641"
The final graph is enclosed in the attachment for other SAS users' reference. This calibration plot is widely used to illustrate the performance of a "risk prediction model" in the medical land. It compares the predicted risk with the observed risk by level of the predicted risk. Statistically, it is also a visualization of the "Hosmer-Lemeshow test" that examines the extent to which the predicted values produced from the statistical model match the observed values obtained from the real world.
Many thanks again!