Hi Guys!
Hope someone could help me. Am performing binary logistic regression in SAS and am trying to figure out if it is possible to manually compute (via Excel) for the predicted probabilities (individual, cross validated, and cumulative) generated by SAS? If yes, what is the formula?
I am using the formula exp(bo+(b1*x1)+...+(bn*xn))/(1+exp(bo+(b1*x1)+...+(bn*xn))) but the values don't match with the predicted probabilities.
I've used sample from the SAS documentation and added the predicted probabilities (predprobs) in output.
Data Neuralgia;
input Treatment $ Sex $ Age Duration Pain $ @@;
datalines;
P F 68 1 No B M 74 16 No P F 67 30 No
P M 66 26 Yes B F 67 28 No B F 77 16 No
A F 71 12 No B F 72 50 No B F 76 9 Yes
A M 71 17 Yes A F 63 27 No A F 69 18 Yes
B F 66 12 No A M 62 42 No P F 64 1 Yes
A F 64 17 No P M 74 4 No A F 72 25 No
P M 70 1 Yes B M 66 19 No B M 59 29 No
A F 64 30 No A M 70 28 No A M 69 1 No
B F 78 1 No P M 83 1 Yes B F 69 42 No
B M 75 30 Yes P M 77 29 Yes P F 79 20 Yes
A M 70 12 No A F 69 12 No B F 65 14 No
B M 70 1 No B M 67 23 No A M 76 25 Yes
P M 78 12 Yes B M 77 1 Yes B F 69 24 No
P M 66 4 Yes P F 65 29 No P M 60 26 Yes
A M 78 15 Yes B M 75 21 Yes A F 67 11 No
P F 72 27 No P F 70 13 Yes A M 75 6 Yes
B F 65 7 No P F 68 27 Yes P M 68 11 Yes
P M 67 17 Yes B M 70 22 No A M 65 15 No
P F 67 1 Yes A M 67 10 No P F 72 11 Yes
A F 74 1 No B M 80 21 Yes A F 69 3 No
;
run;
proc logistic data=Neuralgia;
class Treatment Sex;
model Pain= Treatment Sex Treatment*Sex Age Duration / selection=backward expb;
run;
proc logistic data=Neuralgia;
class Treatment Sex;
model Pain= Treatment Sex Age/expb ctable;
output predprobs=(INDIVIDUAL CUMULATIVE CROSSVALIDATE);
run;
proc print data=data1;
Hi @Cindy789 and welcome to the SAS Support Communities!
I haven't opened your Excel table (I don't have Excel installed on my SAS workstation), only the preview. But it seems that you calculated the predicted probabilities as if reference cell coding had been used as the parameterization method for the classification variables. However, the default is effect coding (see documentation of the CLASS statement, option PARAM=). I think the easiest correction will be to switch to reference cell coding:
class Treatment Sex / param=ref;
This means in your case, more explicitly:
class Treatment(ref='P') Sex(ref='M') / param=ref;
If you rerun the PROC LOGISTIC step with this modification (either of the above two versions), you will notice the changes in sections "Class Level Information" and "Analysis of Maximum Likelihood Estimates" of the output. Then you can leave the Excel formulas unchanged and just use the new estimates for Intercept etc. (no change for Age, of course). Otherwise you'd have to take into account the design variable values of −1 for Treatment='P' and Sex='M' (see table "Class Level Information") in your Excel formulas.
The remaining discrepancies (such as 0.0001, 0.0006 etc. for obs. 6, 7, ...) will be due to rounding errors. (Not surprisingly, the only substantial differences had occurred where Treatment='P' or Sex='M'.) You can reduce the rounding errors by using values with more decimals for the estimates. Just request an ODS output dataset:
ods output ParameterEstimates=est;
(before or in the PROC LOGISTIC step) and then print it with a suitable format in order to obtain those more precise values:
proc print data=est;
format estimate best18.;
run;
As to the "cumulative" probabilities: These are redundant for binary logistic regression (given the individual probabilities): As you've surely noticed, CP_No=IP_No and CP_Yes=1 in your dataset DATA1.
I don't think there is an easy way to calculate the cross-validated predicted probabilities in Excel.
Hi @Cindy789 and welcome to the SAS Support Communities!
I haven't opened your Excel table (I don't have Excel installed on my SAS workstation), only the preview. But it seems that you calculated the predicted probabilities as if reference cell coding had been used as the parameterization method for the classification variables. However, the default is effect coding (see documentation of the CLASS statement, option PARAM=). I think the easiest correction will be to switch to reference cell coding:
class Treatment Sex / param=ref;
This means in your case, more explicitly:
class Treatment(ref='P') Sex(ref='M') / param=ref;
If you rerun the PROC LOGISTIC step with this modification (either of the above two versions), you will notice the changes in sections "Class Level Information" and "Analysis of Maximum Likelihood Estimates" of the output. Then you can leave the Excel formulas unchanged and just use the new estimates for Intercept etc. (no change for Age, of course). Otherwise you'd have to take into account the design variable values of −1 for Treatment='P' and Sex='M' (see table "Class Level Information") in your Excel formulas.
The remaining discrepancies (such as 0.0001, 0.0006 etc. for obs. 6, 7, ...) will be due to rounding errors. (Not surprisingly, the only substantial differences had occurred where Treatment='P' or Sex='M'.) You can reduce the rounding errors by using values with more decimals for the estimates. Just request an ODS output dataset:
ods output ParameterEstimates=est;
(before or in the PROC LOGISTIC step) and then print it with a suitable format in order to obtain those more precise values:
proc print data=est;
format estimate best18.;
run;
As to the "cumulative" probabilities: These are redundant for binary logistic regression (given the individual probabilities): As you've surely noticed, CP_No=IP_No and CP_Yes=1 in your dataset DATA1.
I don't think there is an easy way to calculate the cross-validated predicted probabilities in Excel.
Appreciate your help @FreelanceReinh and @Reeza, thank you so much for explaining it very well for me. The changes in the program worked! Have added the (ref=' ') as well as the param=ref. As you've said, there are minimal discrepancies due to rounding errors but other than that, it worked perfectly! Thank you for sharing your expertise!
You forgot the PARAM=REF portion on the CLASS statement. Check the design matrix in the output, it's using a GLM method instead of referential coding. Once thats fixed it matches pretty closely (0.0000X) usually.
proc logistic data=Neuralgia; class Treatment Sex / PARAM=REF; model Pain= Treatment Sex Age/expb ctable; output predprobs=(INDIVIDUAL CUMULATIVE CROSSVALIDATE); run;
But, a blast from the past (5 years ago) I did work this out with someone else, step by step so hopefully this thread can help you out if you have further issues.
PS. Thanks for posting a fully worked example, makes it much easier to verify/answer your question.
EDIT: Didn't see @FreelanceReinh solution for some reason before I posted this, his response is more detailed than mine and should be marked correct.
@Cindy789 wrote:
Hi Guys!
Hope someone could help me. Am performing binary logistic regression in SAS and am trying to figure out if it is possible to manually compute (via Excel) for the predicted probabilities (individual, cross validated, and cumulative) generated by SAS? If yes, what is the formula?
I am using the formula exp(bo+(b1*x1)+...+(bn*xn))/(1+exp(bo+(b1*x1)+...+(bn*xn))) but the values don't match with the predicted probabilities.
I've used sample from the SAS documentation and added the predicted probabilities (predprobs) in output.
Data Neuralgia;
@input Treatment $ Sex $ Age Duration Pain $ @@;
datalines;
P F 68 1 No B M 74 16 No P F 67 30 No
P M 66 26 Yes B F 67 28 No B F 77 16 No
A F 71 12 No B F 72 50 No B F 76 9 Yes
A M 71 17 Yes A F 63 27 No A F 69 18 Yes
B F 66 12 No A M 62 42 No P F 64 1 Yes
A F 64 17 No P M 74 4 No A F 72 25 No
P M 70 1 Yes B M 66 19 No B M 59 29 No
A F 64 30 No A M 70 28 No A M 69 1 No
B F 78 1 No P M 83 1 Yes B F 69 42 No
B M 75 30 Yes P M 77 29 Yes P F 79 20 Yes
A M 70 12 No A F 69 12 No B F 65 14 No
B M 70 1 No B M 67 23 No A M 76 25 Yes
P M 78 12 Yes B M 77 1 Yes B F 69 24 No
P M 66 4 Yes P F 65 29 No P M 60 26 Yes
A M 78 15 Yes B M 75 21 Yes A F 67 11 No
P F 72 27 No P F 70 13 Yes A M 75 6 Yes
B F 65 7 No P F 68 27 Yes P M 68 11 Yes
P M 67 17 Yes B M 70 22 No A M 65 15 No
P F 67 1 Yes A M 67 10 No P F 72 11 Yes
A F 74 1 No B M 80 21 Yes A F 69 3 No
;
run;proc logistic data=Neuralgia;
class Treatment Sex;
model Pain= Treatment Sex Treatment*Sex Age Duration / selection=backward expb;
run;proc logistic data=Neuralgia;
class Treatment Sex;
model Pain= Treatment Sex Age/expb ctable;
output predprobs=(INDIVIDUAL CUMULATIVE CROSSVALIDATE);
run;proc print data=data1;
As I mentioned in my previous response - thank you @Reeza! Am definitely happy to know the correct solution for my problem and also glad I made it easier for you to verify. Thank you again!
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.
Find more tutorials on the SAS Users YouTube channel.