lichee
Quartz | Level 8

Hi all,

 

I'm trying to assess whether the logit model I'm using is a good one for estimating the probability of an event. The overall event rate is around 4%, and the sample size is nearly four million. Below is the decile calibration plot of predicted versus observed probability. Does it mean the model is poor? With such a large sample, should I split it into subsamples to improve model estimation or prediction? Thank you!

[Image: lichee_0-1712164470397.png, decile calibration plot of predicted vs. observed probability]

 

1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User
Since you have a big table, I would suggest PROC HPLOGISTIC.
Look at its PARTITION statement and the Hosmer-Lemeshow test.
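One thing to keep in mind when reading a Hosmer-Lemeshow test on a table this size: for a fixed calibration gap, the statistic grows in proportion to the number of observations. The sketch below is plain Python rather than SAS and is not PROC HPLOGISTIC output. The ten group-level predicted probabilities and the 2% relative gap are made-up values chosen only so the overall rate is about 4%.

```python
def hosmer_lemeshow_stat(groups):
    """Hosmer-Lemeshow chi-square statistic.

    groups: iterable of (n, mean_predicted, observed_rate), one per group.
    """
    stat = 0.0
    for n, p, o in groups:
        expected = n * p   # expected number of events in the group
        observed = n * o   # observed number of events in the group
        stat += (observed - expected) ** 2 / (n * p * (1 - p))
    return stat

# Hypothetical decile means (they average to 0.04); not taken from the thread.
pred_means = [0.005, 0.010, 0.015, 0.020, 0.030,
              0.040, 0.050, 0.060, 0.070, 0.100]
# Assumed miscalibration: observed rate 2% (relative) above predicted.
obs_rates = [p * 1.02 for p in pred_means]

def stat_for(total_n):
    """Same calibration gap, evaluated with total_n split into ten equal groups."""
    n_g = total_n // 10
    return hosmer_lemeshow_stat(
        (n_g, p, o) for p, o in zip(pred_means, obs_rates)
    )

small = stat_for(40_000)
large = stat_for(4_000_000)
CRITICAL_8DF_05 = 15.507  # chi-square critical value, 8 df, alpha = 0.05

print(f"n = 40,000:    HL = {small:.2f}, reject = {small > CRITICAL_8DF_05}")
print(f"n = 4,000,000: HL = {large:.2f}, reject = {large > CRITICAL_8DF_05}")
```

With these assumed numbers the same small gap is not flagged at 40,000 observations but is flagged at 4 million, so a significant test on the full table may reflect sample size as much as poor fit. The calibration plot remains the more informative view of how large the gap actually is.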


6 REPLIES
ballardw
Super User

I suggest providing the LOG from running your code, including the code and all the notes and messages.

 

Without the code it is extremely hard to guess what options you may have used that might affect the output.

The notes in the log also report the number of observations actually used. The data set may have 4 million observations, but fewer may have been used: by default, any observation with a missing value for a variable on the MODEL statement is typically excluded.

Also there might be other diagnostic hints.
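To illustrate the point about missing values, here is a minimal Python sketch (not SAS) of complete-case filtering, which is how "observations read" and "observations used" can come to differ. The records and the variable names `event`, `age`, and `income` are hypothetical.

```python
# Hypothetical records; None marks a missing value.
rows = [
    {"event": 1, "age": 54, "income": 41000},
    {"event": 0, "age": None, "income": 52000},
    {"event": 0, "age": 37, "income": None},
    {"event": 0, "age": 61, "income": 38000},
]

# Response plus predictors that would appear in the model (assumed names).
model_vars = ["event", "age", "income"]

def complete_cases(rows, variables):
    """Keep only rows with no missing value in any model variable."""
    return [r for r in rows if all(r[v] is not None for v in variables)]

used = complete_cases(rows, model_vars)
print(f"observations read: {len(rows)}, observations used: {len(used)}")
```

Comparing the two counts in the log is the quickest way to confirm whether the full data set entered the fit.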

lichee
Quartz | Level 8

Thank you! I'm attaching the log of PROC LOGISTIC and the calibration plot.

 

I believe all the 3.96 million observations were used in the regression. Any insight is appreciated!

lichee
Quartz | Level 8
I followed https://blogs.sas.com/content/iml/2018/05/16/decile-calibration-plots-sas.html to create a calibration plot that compares the observed probability with the estimated probability against the 45-degree diagonal line. With such a large sample size, would splitting the sample into a few smaller random samples make a goodness-of-fit test meaningful? Or should I stratify the sample into a few meaningful subsamples and estimate the probability within each one?
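For readers who want to see what goes into a decile calibration plot, the following is a minimal Python sketch, not the SAS code from the linked post. It sorts observations by predicted probability, cuts them into ten equal-size groups, and compares the mean predicted probability with the observed event rate in each group. The data are simulated under the assumption of a perfectly calibrated model with an event rate near 4%; the function name `decile_calibration` is hypothetical.

```python
import random

def decile_calibration(pred, obs, groups=10):
    """Return (n, mean predicted, observed event rate) for each group,
    with groups formed by sorting on predicted probability."""
    pairs = sorted(zip(pred, obs))
    n = len(pairs)
    table = []
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        obs_rate = sum(y for _, y in chunk) / len(chunk)
        table.append((len(chunk), mean_pred, obs_rate))
    return table

# Simulated data (assumption): predictions drawn from Beta(1, 24), mean 0.04,
# and outcomes drawn so that the predictions are correct by construction.
random.seed(1)
pred = [random.betavariate(1, 24) for _ in range(100_000)]
obs = [1 if random.random() < p else 0 for p in pred]

table = decile_calibration(pred, obs)
for n_g, mean_pred, obs_rate in table:
    print(f"n={n_g:6d}  predicted={mean_pred:.4f}  observed={obs_rate:.4f}")
```

Plotting the observed rate against the mean predicted probability for each row gives the points that are compared with the 45-degree line. Points close to that line across all deciles indicate good calibration regardless of how many observations sit behind each point.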
Ksharp
Super User
I think the approach in
https://blogs.sas.com/content/iml/2018/05/16/decile-calibration-plots-sas.html
is good enough; there is no need to split your data into many small subsets.
Anyway, @Rick_SAS and @StatDave might have some insight into this.
