Calibration of logit model with large sample size

lichee — Wed, 03 Apr 2024 17:14:39 GMT

Hi all,

I'm trying to assess whether the logit model I'm using estimates the probability of an event well. The overall rate of the event is around 4%. The sample size is nearly four million. Below is the decile calibration plot of predicted probability against observed probability. Does it mean the model is poor? For such a large sample, should I split it into subsamples to improve model estimation/prediction? Thank you!

Re: Calibration of logit model with large sample size

ballardw — Wed, 03 Apr 2024 21:11:31 GMT

I suggest posting the LOG from running your code, including the code itself and all the notes and messages.

Without the code it is extremely hard to guess what options you may have used that might affect the output.

The notes in the log would also show the number of observations actually used. The data set may have 4 million observations, but it is quite possible that fewer were actually used. By default, any observation with a missing value for any variable on the MODEL statement is typically excluded.
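To illustrate the point about missing values, here is a minimal Python sketch (not SAS) of listwise deletion, the kind of default exclusion described above. The rows and variable names are made up for illustration; the only point is that the number of rows used can be smaller than the number of rows read.

```python
# Hypothetical rows for illustration only: None marks a missing value.
rows = [
    {"event": 1, "age": 54, "income": 41000},
    {"event": 0, "age": None, "income": 38000},
    {"event": 0, "age": 37, "income": None},
    {"event": 0, "age": 61, "income": 52000},
]
model_vars = ["event", "age", "income"]  # response plus predictors (assumed names)


def complete_cases(rows, variables):
    """Keep only rows with no missing value in any of the model variables."""
    return [r for r in rows if all(r[v] is not None for v in variables)]


used = complete_cases(rows, model_vars)
n_read, n_used = len(rows), len(used)
print(f"observations read: {n_read}, observations used: {n_used}")
```

Comparing the two counts reported in the log is the quickest way to see whether this happened.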

The log might also contain other diagnostic hints.

Re: Calibration of logit model with large sample size

lichee — Thu, 04 Apr 2024 02:11:47 GMT

Thank you! I'm attaching the log of PROC LOGISTIC and calibration plot.

I believe all the 3.96 million observations were used in the regression. Any insight is appreciated!

Re: Calibration of logit model with large sample size

Ksharp — Thu, 04 Apr 2024 02:17:14 GMT

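Your sample size is so large that the goodness-of-fit test is meaningless: with millions of observations, even a negligible departure from perfect fit is statistically significant.
Check the blog posts by @Rick_SAS: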
https://blogs.sas.com/content/iml/2019/02/20/easier-calibration-plot-sas.html
https://blogs.sas.com/content/iml/2018/05/16/decile-calibration-plots-sas.html
https://blogs.sas.com/content/iml/2018/05/14/calibration-plots-in-sas.html
https://blogs.sas.com/content/iml/2020/11/23/decile-plots-in-sas.html
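
Why sample size matters here can be seen from the Hosmer-Lemeshow statistic itself. The Python sketch below (not SAS) is an idealized illustration under stated assumptions: the ten decile probabilities are invented (they average 4%), the observed rate in each decile is assumed to be off by a fixed 3% in relative terms, and sampling noise is ignored. With the miscalibration held fixed, the statistic grows in proportion to the sample size, so the same tiny discrepancy goes from undetectable to overwhelmingly significant.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; exact closed form for even df only."""
    if df % 2 != 0:
        raise ValueError("closed form used here requires even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i          # builds half**i / i!
        total += term
    return math.exp(-half) * total


def hosmer_lemeshow(n_per_group, predicted, observed):
    """HL statistic from per-group mean predicted and observed event rates."""
    stat = 0.0
    for p, o in zip(predicted, observed):
        expected_events = n_per_group * p
        observed_events = n_per_group * o
        stat += (observed_events - expected_events) ** 2 / (n_per_group * p * (1 - p))
    df = len(predicted) - 2       # usual HL degrees of freedom: groups - 2
    return stat, chi2_sf_even_df(stat, df)


# Assumed decile means of the predicted probability (average is 0.04).
predicted = [0.005, 0.01, 0.015, 0.02, 0.025, 0.03, 0.04, 0.05, 0.08, 0.125]
# Assumed miscalibration: observed rate 3% above or below predicted, alternating.
observed = [p * (1.03 if i % 2 == 0 else 0.97) for i, p in enumerate(predicted)]

hl_small, p_small = hosmer_lemeshow(400, predicted, observed)      # n = 4,000
hl_large, p_large = hosmer_lemeshow(400_000, predicted, observed)  # n = 4,000,000

print(f"n = 4,000:     HL = {hl_small:.3f}, p = {p_small:.4f}")
print(f"n = 4,000,000: HL = {hl_large:.1f}, p = {p_large:.3g}")
```

The model is equally well (or badly) calibrated in both runs; only the sample size differs.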

Re: Calibration of logit model with large sample size

lichee — Thu, 04 Apr 2024 03:00:52 GMT

I followed https://blogs.sas.com/content/iml/2018/05/16/decile-calibration-plots-sas.html to make a calibration plot that compares the observed probability with the estimated probability against the 45-degree diagonal line. With such a large sample size, would splitting the sample into a few smaller random samples make the goodness-of-fit test meaningful? Or should I stratify the sample into a few meaningful subsamples and estimate the probability within each?
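
For readers unfamiliar with the decile calibration plot mentioned above, the numbers behind it can be sketched in a few lines of Python (not SAS, and not the blog's code). The data are synthetic and assumed: predicted probabilities are simulated to average about 4%, and outcomes are drawn from those same probabilities, so this toy model is well calibrated by construction.

```python
import random


def decile_calibration(predicted, outcomes, n_groups=10):
    """Sort by predicted probability, cut into equal-size groups, and return
    (group, size, mean predicted probability, observed event rate) per group."""
    pairs = sorted(zip(predicted, outcomes))
    n = len(pairs)
    table = []
    for g in range(n_groups):
        group = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        mean_pred = sum(p for p, _ in group) / len(group)
        obs_rate = sum(y for _, y in group) / len(group)
        table.append((g + 1, len(group), mean_pred, obs_rate))
    return table


# Synthetic data (assumption): probabilities average roughly 0.04.
rng = random.Random(2024)
n = 20_000
predicted = [0.12 * rng.random() ** 2 for _ in range(n)]
outcomes = [1 if rng.random() < p else 0 for p in predicted]

table = decile_calibration(predicted, outcomes)
for group, size, mean_pred, obs_rate in table:
    print(f"decile {group:2d}  n={size}  predicted={mean_pred:.4f}  observed={obs_rate:.4f}")
```

Plotting the observed rate against the mean predicted probability for each decile gives the ten points that are compared with the diagonal; points close to the line indicate good calibration regardless of what a formal test says.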

Re: Calibration of logit model with large sample size

Ksharp — Thu, 04 Apr 2024 03:15:30 GMT

I think the approach in
https://blogs.sas.com/content/iml/2018/05/16/decile-calibration-plots-sas.html
is good enough. There is no need to split your data into many small subsets.
Anyway, @Rick_SAS and @StatDave might have more insight into this.

Re: Calibration of logit model with large sample size

Ksharp — Thu, 04 Apr 2024 06:27:24 GMT

Since you have a big table, I would like to suggest PROC HPLOGISTIC.
Check its PARTITION statement and the Hosmer-Lemeshow test.