Hi all,
I'm trying to assess whether the logit model I'm using is a good one for estimating the probability of an event. The overall event rate is around 4%, and the sample size is nearly four million. Below is the decile calibration plot of predicted versus observed probability. Does it mean the model is poor? With such a large sample, should I split it into subsamples to improve model estimation/prediction? Thank you!
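For readers who have not built one, a decile calibration plot is usually made by sorting the records by predicted probability, cutting them into ten groups of near-equal size, and comparing the mean predicted probability with the observed event rate in each group. Below is a minimal sketch in Python (not SAS) using only the standard library. The data are synthetic and calibrated by construction, not the poster's data. The function name `decile_calibration` and the Beta(1, 24) score distribution (mean 0.04, chosen to mimic a roughly 4% event rate) are assumptions made for illustration.

```python
import random


def decile_calibration(pred, obs, n_bins=10):
    """Return one (mean predicted, observed rate, count) row per bin.

    Records are sorted by predicted probability and cut into n_bins
    groups of near-equal size.
    """
    pairs = sorted(zip(pred, obs))
    n = len(pairs)
    rows = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue  # only happens when there are fewer records than bins
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        obs_rate = sum(y for _, y in chunk) / len(chunk)
        rows.append((mean_pred, obs_rate, len(chunk)))
    return rows


# Synthetic data (assumption): each outcome is drawn with exactly the
# predicted probability, so the "model" is calibrated by construction.
random.seed(0)
pred = [random.betavariate(1, 24) for _ in range(200_000)]
obs = [1 if random.random() < p else 0 for p in pred]

table = decile_calibration(pred, obs)
print("mean_pred  obs_rate  n")
for mean_pred, obs_rate, count in table:
    print(f"{mean_pred:9.4f}  {obs_rate:8.4f}  {count}")
```

In a well-calibrated model the two columns track each other closely in every decile. A systematic gap in particular deciles (often the highest ones) points to miscalibration in that range rather than to a problem caused by sample size.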
I suggest posting the LOG from running your code, including the code itself and all the notes and messages that go with it.
Without the code it is extremely hard to guess what options you may have used that might affect the output.
The notes in the log would also show the number of observations actually used. The data set may have 4 million observations, but it is possible that fewer were used. By default, an observation with a missing value for any variable on the MODEL statement is typically excluded.
The log might also contain other diagnostic hints.
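To illustrate the point about missing values, here is a small sketch of a complete-case count, which is the kind of row-wise exclusion described above. It is written in Python rather than SAS, and the variable names (`event`, `age`, `income`) and the helper `count_complete_cases` are hypothetical. The counts the procedure itself reports remain the figures to trust.

```python
import math


def count_complete_cases(rows, model_vars):
    """Return (used, dropped): rows with no missing model variable vs. the rest."""
    def is_missing(value):
        # Treat None and NaN as missing
        return value is None or (isinstance(value, float) and math.isnan(value))

    used = sum(
        1 for row in rows
        if not any(is_missing(row.get(var)) for var in model_vars)
    )
    return used, len(rows) - used


# Hypothetical toy data: two of the four rows have a missing predictor
rows = [
    {"event": 1, "age": 52.0, "income": 41000.0},
    {"event": 0, "age": None, "income": 38000.0},
    {"event": 0, "age": 37.0, "income": float("nan")},
    {"event": 0, "age": 45.0, "income": 52000.0},
]

used, dropped = count_complete_cases(rows, ["event", "age", "income"])
print(f"used={used} dropped={dropped}")
```

A single sparsely populated predictor can remove a large share of the rows, so it is worth comparing the number read with the number used before interpreting any calibration plot.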
Thank you! I'm attaching the PROC LOGISTIC log and the calibration plot.
I believe all the 3.96 million observations were used in the regression. Any insight is appreciated!