06-08-2018 04:10 PM
I am estimating a multinomial logit model on a very large dataset (>100 GB), which requires PROC HPLOGISTIC because PROC LOGISTIC takes far too long to complete. We want to test for the independence of irrelevant alternatives (IIA), which requires the covariance matrix of the parameter estimates as an output.
PROC LOGISTIC can output the covariance matrix; however, estimating the model with it takes far too long.
I see from the PROC HPLOGISTIC documentation that the NOSTDERR option suppresses the computation of the covariance matrix and standard errors, so I believe the values are calculated by default; however, I cannot find any information on how to output them. What works on a very small dataset is:
proc logistic data=dat outest=betas_cov covout; model choice = a b c d / link=glogit; run;
What I need is to do the same thing with PROC HPLOGISTIC. Any suggestions? Or is there another way of getting at the Hausman-McFadden test?
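For reference, this is the usual textbook form of the Hausman-McFadden statistic, which is why the covariance matrices are needed. The symbols are my own notation, not something taken from the SAS documentation: the subscript f denotes the model fit on the full choice set, r denotes the model re-fit after dropping one or more alternatives, and only the parameters common to both fits are compared.

```latex
H = (\hat{\beta}_r - \hat{\beta}_f)^{\top}
    \left[ \hat{V}_r - \hat{V}_f \right]^{-1}
    (\hat{\beta}_r - \hat{\beta}_f)
    \;\overset{H_0}{\sim}\; \chi^2_k ,
\qquad k = \operatorname{rank}\!\left( \hat{V}_r - \hat{V}_f \right)
```

Here \hat{V}_r and \hat{V}_f are the estimated covariance matrices of the two coefficient vectors. In finite samples the difference \hat{V}_r - \hat{V}_f is not guaranteed to be positive definite, so a generalized inverse is sometimes used.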
06-08-2018 05:38 PM
I'm guessing that creating a covariance matrix of a 100GB data set is going to take a few bazillion millibleems, that's why it's not in HPLOGISTIC ... but I don't know, I never tried. Instead of telling us 100GB data set, which really doesn't tell us anything, why don't you give us the number of x variables in your Logistic model, and the number of rows in this data set?
06-11-2018 09:10 AM
As PaigeMiller says, the relevant information is the number of observations and variables. However, without seeing any of your data, I predict that if you have hundreds of millions of observations then the Hausman-McFadden test (whatever that might be) will REJECT the null hypothesis. I say this because almost EVERY statistical test rejects when you have a gazillion observations! For some examples and discussion, see "Goodness-of-fit tests: A cautionary tale for large and small samples."
The general philosophy of the HP procedures is that they are for predictive modeling, not for making inferences. The rationale is that "all tests are statistically significant" for large data. That is why many HP procedures do not include the same inferential tests that are provided by their classical counterparts.
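To make the "all tests are statistically significant" point concrete, here is a minimal sketch in Python (not SAS, and not tied to any HP procedure). It assumes a simple two-sided z-test of a mean and a hypothetical, practically negligible true effect of 0.01 standard deviations, and it computes the expected test statistic and p-value at several sample sizes. The effect size, the sample sizes, and the function names are all illustrative assumptions.

```python
import math

# Hypothetical, practically negligible effect (in units of the standard deviation).
DELTA = 0.01
SIGMA = 1.0


def expected_z(delta, sigma, n):
    """Expected z-statistic for a true mean shift `delta` with n observations."""
    return delta * math.sqrt(n) / sigma


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def table(sizes=(100, 10_000, 1_000_000, 100_000_000)):
    """Return (n, expected z, p-value) for each sample size."""
    rows = []
    for n in sizes:
        z = expected_z(DELTA, SIGMA, n)
        rows.append((n, z, two_sided_p(z)))
    return rows


if __name__ == "__main__":
    for n, z, p in table():
        print(f"n={n:>11,d}  z={z:8.2f}  p={p:.3g}")
```

With the same tiny effect, the test is nowhere near significant at n = 100 and overwhelmingly significant at a million rows or more. The p-value reflects the sample size far more than the practical importance of the effect.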
06-11-2018 11:39 AM
Thank you both. To answer PaigeMiller's question, there are several different models I'm trying to retrieve the covariance matrix for, but as an example, one of them is estimated on 42 million rows and has 325 variables.
Rick_SAS, I see your point regarding the size of the estimation data, and rejection is a likely outcome. This is good information regarding the HP procedures.