Solved: Calibration of logit model with large sample size

lichee · Posted 04-03-2024 01:14 PM

Hi all,

I'm trying to assess if the logit model I'm using is a good one to estimate probability of an event. The overall rate of the event is around 4%. The sample size is nearly four million. Below is the decile calibration plot of predicted probability and observed probability. Does it mean it's a poor model? For such a large sample, should I split it into subsamples to improve model estimation/prediction? Thank you!

Ksharp · Posted 04-04-2024 02:27 AM

Since you have a large table, I would suggest PROC HPLOGISTIC.
Check its PARTITION statement and the Hosmer-Lemeshow test.


ballardw · Posted 04-03-2024 05:11 PM

I suggest posting the LOG from running your code, including the code itself and all the notes and messages.

Without the code it is extremely hard to guess what options you may have used that might affect the output.

Also, the notes would include the number of observations actually used. The data set may have 4 million observations, but it is possible that fewer were actually used. Observations with missing values for any variable on the MODEL statement are typically excluded by default.

Also there might be other diagnostic hints.
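To illustrate the point about observations read versus observations used, here is a minimal sketch in Python (not SAS) of the complete-case rule described above. The rows and the variable names `y`, `x1`, `x2` are made up for illustration; the SAS log and output remain the authoritative source for the actual counts.

```python
# Hypothetical rows: outcome y plus two predictors; None marks a missing value.
rows = [
    {"y": 0, "x1": 1.2, "x2": 0.4},
    {"y": 1, "x1": None, "x2": 0.9},
    {"y": 0, "x1": 0.7, "x2": None},
    {"y": 0, "x1": 2.1, "x2": 1.5},
]
model_vars = ["y", "x1", "x2"]  # assumed: every variable named in the model


def complete_cases(rows, variables):
    """Keep only rows with no missing value in any of the model variables."""
    return [r for r in rows if all(r[v] is not None for v in variables)]


used = complete_cases(rows, model_vars)
n_read, n_used = len(rows), len(used)
print(f"observations read: {n_read}, observations used: {n_used}")
```

If the two counts differ on the real data, the calibration plot describes only the complete cases.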

lichee · Posted 04-03-2024 10:11 PM

Thank you! I'm attaching the log of PROC LOGISTIC and calibration plot.

I believe all the 3.96 million observations were used in the regression. Any insight is appreciated!

Ksharp · Posted 04-03-2024 10:17 PM

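Your sample size is so large that a goodness-of-fit test is not meaningful: with about four million observations, even a trivial lack of fit will be flagged as statistically significant.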
Check @Rick_SAS blogs:
https://blogs.sas.com/content/iml/2019/02/20/easier-calibration-plot-sas.html
https://blogs.sas.com/content/iml/2018/05/16/decile-calibration-plots-sas.html
https://blogs.sas.com/content/iml/2018/05/14/calibration-plots-in-sas.html
https://blogs.sas.com/content/iml/2020/11/23/decile-plots-in-sas.html
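A rough way to see why the test loses meaning at this scale is the following Python sketch (not SAS). It holds a small miscalibration fixed and grows the sample size. All numbers are assumptions for illustration: ten groups whose predicted probabilities average 4%, and observed rates that are 3% higher in relative terms. The statistic uses noise-free expected counts, so it shows only the systematic part of a Hosmer-Lemeshow-style statistic.

```python
def hl_statistic(pred, obs_rate, n_per_group):
    """Hosmer-Lemeshow-style sum over groups of (O - E)^2 / (n * p * (1 - p))."""
    total = 0.0
    for p, q in zip(pred, obs_rate):
        expected = n_per_group * p   # events the model predicts in the group
        observed = n_per_group * q   # events actually seen (noise-free here)
        total += (observed - expected) ** 2 / (n_per_group * p * (1.0 - p))
    return total


# Assumed per-decile predicted probabilities (their mean is 0.04).
pred = [0.005, 0.01, 0.015, 0.02, 0.03, 0.04, 0.05, 0.06, 0.08, 0.09]
# Assumed miscalibration: observed rates 3% higher than predicted (relative).
obs = [p * 1.03 for p in pred]

CRIT = 15.507  # 0.95 quantile of chi-square with 10 - 2 = 8 degrees of freedom

stats = {}
for n_total in (4_000, 40_000, 400_000, 4_000_000):
    stats[n_total] = hl_statistic(pred, obs, n_total // 10)
    verdict = "reject" if stats[n_total] > CRIT else "do not reject"
    print(f"n={n_total:>9,}  statistic={stats[n_total]:10.3f}  {verdict}")
```

The miscalibration is identical in every row; only the sample size changes. Because the statistic grows in proportion to n, the same small departure goes from undetectable to overwhelmingly significant, which is why the plot is more informative than the test here.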

lichee · Posted 04-03-2024 11:00 PM

I followed https://blogs.sas.com/content/iml/2018/05/16/decile-calibration-plots-sas.html to create a calibration plot that compares the observed probability with the estimated probability along the 45-degree diagonal line. With such a large sample size, would splitting the sample into a few smaller random samples make the goodness-of-fit test meaningful? Or should I stratify the sample into a few meaningful subsamples and estimate the probability within each one?
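For readers who want to see what goes into a decile calibration plot, here is a minimal sketch in Python (not the SAS code from the linked blog post). The data are simulated and the function name `decile_calibration` is hypothetical. Predicted probabilities are drawn with a mean of about 4%, and the outcomes are generated from those same probabilities, so the simulated model is well calibrated by construction.

```python
import random


def decile_calibration(pred, y, n_groups=10):
    """Sort by predicted probability, cut into equal-sized groups, and return
    (group, size, mean predicted probability, observed event rate) per group."""
    pairs = sorted(zip(pred, y))
    n = len(pairs)
    table = []
    for g in range(n_groups):
        chunk = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        obs_rate = sum(o for _, o in chunk) / len(chunk)
        table.append((g + 1, len(chunk), mean_pred, obs_rate))
    return table


rng = random.Random(2024)                                # fixed seed for repeatability
pred = [rng.betavariate(1, 24) for _ in range(20_000)]   # Beta(1, 24) has mean 0.04
y = [1 if rng.random() < p else 0 for p in pred]         # outcomes follow pred exactly

table = decile_calibration(pred, y)
for group, size, mean_pred, obs_rate in table:
    print(f"decile {group:2d}  n={size}  predicted={mean_pred:.4f}  observed={obs_rate:.4f}")
```

Plotting observed against predicted for the ten rows gives the points that are compared with the 45-degree line. Points close to the line across all deciles indicate good calibration regardless of what a formal test reports.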

Ksharp · Posted 04-03-2024 11:15 PM

I think https://blogs.sas.com/content/iml/2018/05/16/decile-calibration-plots-sas.html is good enough. There is no need to split your data into many small subsets.
Anyway, @Rick_SAS and @StatDave might have more insight.

