Greetings all. I am trying to create my regression formula from the estimates output from proc logistic. Thinking back to multiple regression (and it was several years ago), I could simply take the intercept + (estimate1*variable1) + ... + (estimateN*variableN). However, if I use this methodology, I seem to get some results that are counterintuitive. My question is, how does the 'Exp(Est)' affect the parameter estimate with respect to putting it in the regression formula? I have attached a copy of my log, and sure would appreciate it if anyone would be willing to put some experienced eyes on it. For example, variable 'bad_debt_at_connection' indicates a customer opened a new account when they had an account that at some point in the past went to a collection agency for non-payment. If I am trying to predict accounts that will go to collections, it seems to me that if this condition were true for any customer, they would be more likely to go to collections, since they have already done so in the past. However, the parameter estimate of -0.1582 seems to me to indicate this is not the case. The two class variables are binary, having either a 1 or 0. The rest of the variables are numeric, and should be treated as such. Thank you.
Greg.
You didn't fit your model properly. You need to specify that you want your class variables to use reference coding, not effect coding.
Try referring to this site to interpret your output and learn the basics of logistic regression. A logistic regression isn't linear in the probability, so the way you're trying to write the equation isn't correct. If you can, find a statistician to help you out.
Thank you Reeza. Unfortunately, I do not have access to an in-house statistician, so I have been trying to figure it out on my own for several days now. Below is what I am trying.
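For reference, here is a short statement of the relationship described above, in generic notation (the symbols \beta_0, \beta_j and x_j are illustrative and are not taken from the attached output). The model is linear in the log-odds, and the probability is recovered with the inverse logit. The 'Exp(Est)' column requested by the expb option is the exponentiated estimate, which multiplies the odds for a one-unit increase in that predictor; it is not substituted for the estimate in the formula.

```latex
\[
\ln\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k
\quad\Longleftrightarrow\quad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}
\]
\[
\frac{\mathrm{odds}(x_j + 1)}{\mathrm{odds}(x_j)} = e^{\beta_j}
\]
```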
proc logistic data = credit.crnp_logreg;
class bad_debt_at_conn dep_unpaid;
model chg_off (event='1') = bad_debt_at_conn dep_unpaid cr_scor arrears qy_fc_c qy_fc_b qy_st_c qy_st_b / selection=none expb;
quit;
Would you be able to show me how to run it so it performs as you suggest, or to point me to a good reference? I have tried googling everything I can think of. Thank you.
Greg
Statistical Computing Seminars: Introduction to SAS proc logistic
If I'm working on a new proc, what I like to do is first try it out with the example data and make sure I understand how to interpret the output for the example data. Then I go back and make sure I understand it for my data.
SAS has a bunch of examples in the documentation that are pretty good. The link above from UCLA is good. You can also try googling proc logistic at lexjansen.com.
My main suggestion is to add /param=ref; to your class statement like below. You may also want to specify which reference levels you want, but that's up to you.
proc logistic data = credit.crnp_logreg;
class bad_debt_at_conn dep_unpaid/param=ref;
model chg_off (event='1') = bad_debt_at_conn dep_unpaid cr_scor arrears qy_fc_c qy_fc_b qy_st_c qy_st_b / selection=none expb;
quit;
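To illustrate why the coding matters when writing the equation by hand, here is a minimal sketch of the value a binary class variable contributes under each parameterization. It assumes the PROC LOGISTIC defaults (effect coding, with the last ordered level, here 1, as the reference level); the dataset name coding_demo and the variables x_effect and x_ref are made up for the illustration.

```sas
/* Sketch: design value used for a 0/1 CLASS variable.
   Assumes the reference level is 1 (the default last ordered level). */
data coding_demo;
  do bad_debt_at_conn = 0, 1;
    /* PARAM=EFFECT (default): non-reference level -> 1, reference level -> -1 */
    x_effect = ifn(bad_debt_at_conn = 0, 1, -1);
    /* PARAM=REF: non-reference level -> 1, reference level -> 0 */
    x_ref = ifn(bad_debt_at_conn = 0, 1, 0);
    output;
  end;
run;

proc print data=coding_demo noobs;
run;
```

Under these defaults the reported estimate belongs to level 0, not level 1. If the -0.1582 in the original output is labeled for level 0, it would mean accounts without prior bad debt have lower log-odds of charge-off, which agrees with intuition. Multiplying the estimate by the raw 0/1 value only reproduces the SAS probabilities when the design value really is 0/1 for the level the estimate refers to.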
Reeza, thank you so, so much for that link. That is *exactly* what I need.
Greg
Ok, I am still missing something. I used / rsq lackfit to see if my model even made sense, and indeed it is a good fit. I guess what I am missing is that SAS must be using some kind of formula to determine the probability of each observation having a response of either 0 or 1. I am still unsure how to do this. Do I need to look at the probability of each variable independently, then put them all together? Thank you.
Greg
I'm confused as to what your question is. What do you need to put them together for?
You can see the basic calculations P(x=0) and P(x=1) in the output dataset if you're looking for that. Take a look at the 'Pred' dataset. If you want to figure out how to calculate those by hand, it's also possible, and a good exercise when starting out.
Using the remission data in the first example in the SAS documentation:
proc logistic data=Remission outest=betas covout;
model remiss(event='1')=cell smear infil li blast temp ;
output out=pred p=phat lower=lcl upper=ucl
predprob=(individual crossvalidate);
run;
proc print data=pred;
run;
Reeza, thank you so much for continuing to answer my newbie questions. I ran my model with the output as you have shown above, and I am googling madly to find help in deciphering the results. In the meantime, perhaps an explanation of what I need to do would help. Using historical account data, with a binary response, two binary predictors, and several interval predictors, I need to come up with a formula to predict which accounts will ultimately not pay, and end up being sent to collections. The system housing the data is a DB2 mainframe, and the DB2 programmers need a formula to sort the bad accounts by their probability of ending up in collections, highest first. They have, in a DB2 table, all of the variables in my SAS model, but they need a formula to calculate the probability that any account will end up in collections. So, while it helps me to see the predicted probability of each observation in my dataset, I need to figure out how to calculate that probability by hand, so I can give the mainframers a formula. I hope this helps explain my dilemma. Thank you again.
Reeza, thank you for the link. That is *exactly* what I am trying to do. I've got close, but can't seem to get it exactly right. I have my spreadsheet set up just like the poster in the other post has done, with the MLE estimates populated above the variable names. The closest I could get to calculating the individual probability of any one observation, attempting to use the formula you suggested to the other poster, is this:
1/(1 + EXP(-1*(var_1*estimate_1)+(var_2*estimate_2)+(varN*estimate_N)))
The value calculated with this formula is slightly different (with an absolute difference less than .01) from the 'Individual Probability of event=1' value calculated by SAS. I'm also not sure where the intercept fits in. I found this document... http://support.sas.com/resources/papers/proceedings12/317-2012.pdf and it seems to imply the intercept need not be taken into consideration in some cases. Do you see anything wrong with my formula? Again, thank you so much for your help.
Greg
How are you getting your estimates from SAS to Excel? A difference of .01 may be rounding.
I'm right-clicking the dataset, then choosing export. It seems to be keeping the precision as is in the SAS dataset.
When I check one, the probabilities I get in Excel agree to within .0001.
Close enough for me. However, you may want to ensure that you've done the parametrization correctly in terms of implementing the equation. That's the only thing I can think of without seeing the results and equations.
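One thing worth checking in the spreadsheet formula quoted earlier is where the parentheses close: as typed, the -1 multiplies only the first term, and the intercept is missing. Here is a minimal sketch of the difference, using made-up coefficients and predictor values (intercept, b1, b2, x1 and x2 are purely illustrative and are not estimates from this thread):

```sas
/* Sketch with hypothetical numbers: effect of parenthesis placement
   and of leaving out the intercept. */
data _null_;
  intercept = -1.5;  b1 = 0.8;  b2 = -0.02;   /* hypothetical estimates */
  x1 = 1;  x2 = 40;                           /* hypothetical predictor values */

  /* linear predictor: intercept plus every estimate*value term */
  xbeta = intercept + b1*x1 + b2*x2;

  /* inverse logit: the minus sign applies to the whole linear predictor */
  p_correct = 1 / (1 + exp(-xbeta));

  /* formula as typed in the post: -1 applies to the first term only,
     and there is no intercept */
  p_as_typed = 1 / (1 + exp(-1*(b1*x1) + (b2*x2)));

  put xbeta= p_correct= p_as_typed=;
run;
```

The two probabilities printed to the log generally differ, which is the kind of discrepancy that can look like rounding when the omitted or mis-signed terms happen to be small.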
Reeza, I took your advice, and started over with a documented example I understood. I looked at the Neuralgia example 51.2 available at http://support.sas.com/documentation/cdl/en/statug/63347/HTML/default/viewer.htm#statug_logistic_sec...
I started by altering the data a bit to match my scenario by changing 'M' and 'F' to 0 and 1, and also changing 'No' and 'Yes' to 0 and 1. Other than that, it pretty much seems the same as my situation, with the exception that I have 3 binary class predictors and 3 numeric predictors. I tried using the same formula that got me close yesterday on my own data, and now I am not even coming close, as all of the results of my formula are .9999 when rounded. I have attached the SAS code for the example, including my changes, the proc logistic, and a final dataset with the current iteration of the formula I am trying to get to work. It just seems like I should be able to replicate the predicted probabilities given the estimates and intercept, but after trying everything I can think of, I just can't get it to work. I have referenced my stats book, and I don't see anything contrary to what you have suggested in the other post, and what I am trying. Thank you.
Greg
The intercept does matter.
I pulled the exact estimates from the model instead of typing them in. But if you type them in, it's pretty close.
proc logistic data=Neuralgia2 outest=sample;
class sex (ref='0') / param=ref ;
model Pain (event='1') = age sex ;
output out=pred p=phat
predprob=(individual crossvalidate) ;
run ;
/* the formula I am trying to replicate in Excel is the 'myformula' variable
in the below data set
*/
data logformula (keep= age sex pain ip_1 myformula difference);
set pred;
* read the single row of estimates once, its values are retained for every observation;
if _n_=1 then set sample (keep = intercept age sex1 rename = (age=age_estimate sex1=sex_estimate));
length sex_estimate age_estimate intercept myformula 8. ;
* inverse logit of the full linear predictor, intercept included;
myformula = 1/(1+exp(-1*(intercept+(sex_estimate*sex) + (age_estimate*age))));
* gap between the SAS predicted probability and the hand calculation;
difference=phat-myformula;
format difference 12.8;
run;
proc print data=logformula (obs=25) ;
var sex age pain ip_1 myformula difference ; * myformula should replicate ip_1 given the MLE values;
run ;
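A further check that may help when a hand calculation does not match: the OUTPUT statement can also write the linear predictor itself through the xbeta= option, so the sum of intercept and estimate*value terms can be compared before any exponentiation. This is a sketch under the same assumptions as the code above (Neuralgia2 with sex and Pain recoded to 0/1); the names pred2, check, xb, my_xb, xb_diff and p_diff are made up for the example.

```sas
/* Sketch: compare the hand-built linear predictor with the one SAS reports. */
proc logistic data=Neuralgia2 outest=sample;
  class sex (ref='0') / param=ref;
  model Pain (event='1') = age sex;
  output out=pred2 p=phat xbeta=xb;   /* xb = linear predictor on the logit scale */
run;

data check;
  set pred2;
  /* one row of estimates, retained across all observations */
  if _n_ = 1 then set sample (keep=intercept age sex1
                              rename=(age=age_estimate sex1=sex_estimate));
  my_xb   = intercept + sex_estimate*sex + age_estimate*age;
  xb_diff = xb - my_xb;                        /* should be near zero */
  p_diff  = phat - 1/(1 + exp(-my_xb));        /* should be near zero */
run;

proc means data=check min max;
  var xb_diff p_diff;
run;
```

If xb_diff is not close to zero, the problem is in the coding of a class variable or a missing term rather than in the logistic transformation.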
