paulrallo
Calcite | Level 5

I am estimating a multinomial logit model on a very large dataset (>100 GB), which requires PROC HPLOGISTIC because PROC LOGISTIC takes far too long to complete. We want to test for the independence of irrelevant alternatives (IIA), which requires the covariance matrix of the parameter estimates as an output.

 

PROC LOGISTIC can output the covariance matrix; however, fitting the model with PROC LOGISTIC takes far too long.

 

I see from PROC HPLOGISTIC's documentation that the 'nostderr' option suppresses the computation of the covariance matrix and standard errors, so I believe the values are being calculated; however, I cannot find any information on how to get them. What works on a very small dataset is:

proc logistic data=dat outest=betas_cov covout;
   model choice = a b c d / link=glogit;
run;

But what I need is to do the same thing with PROC HPLOGISTIC. Any suggestions? Or is there another way to get at the Hausman-McFadden test?
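For readers unfamiliar with the test, the reason the covariance matrix is needed can be seen from the usual form of the Hausman-McFadden statistic. This is a sketch in generic notation that is not from the original post: the subscript f denotes the model fitted on the full choice set, and r denotes the model fitted on a restricted choice set, with the estimates limited to the parameters the two fits share.

```latex
H = (\hat{\beta}_r - \hat{\beta}_f)^{\top}
    \left[\hat{V}_r - \hat{V}_f\right]^{-1}
    (\hat{\beta}_r - \hat{\beta}_f)
```

Under IIA, H is asymptotically chi-square distributed with degrees of freedom equal to the rank of the covariance difference, so both covariance matrices of the estimates are required in addition to the estimates themselves.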

 

Thank you!

10 REPLIES
PaigeMiller
Diamond | Level 26

I'm guessing that creating a covariance matrix from a 100GB data set is going to take a few bazillion millibleems, and that's why it's not in HPLOGISTIC ... but I don't know, I never tried. Instead of telling us it's a 100GB data set, which really doesn't tell us anything, why don't you give us the number of x variables in your logistic model and the number of rows in this data set?

--
Paige Miller
Rick_SAS
SAS Super FREQ

As PaigeMiller says, the relevant information is the number of observations and variables. However, without seeing any of your data, I predict that if you have hundreds of millions of observations then the Hausman-McFadden test (whatever that might be) will REJECT the null hypothesis. I say this because almost EVERY statistical test rejects when you have a gazillion observations! For some examples and discussion, see "Goodness-of-fit tests: A cautionary tale for large and small samples."
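A back-of-the-envelope illustration of that point, using hypothetical numbers rather than anything from the poster's data: for a simple z test of a mean, a true shift of delta with standard deviation sigma gives a test statistic that grows like the square root of n.

```latex
z \approx \frac{\delta}{\sigma / \sqrt{n}} = \frac{\delta}{\sigma}\sqrt{n},
\qquad
\frac{\delta}{\sigma} = 0.001,\; n = 10^{8}
\;\Longrightarrow\;
z \approx 0.001 \times 10^{4} = 10
```

So a departure from the null of one thousandth of a standard deviation, which is negligible in practice, would still be rejected decisively at that sample size.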

 

The general philosophy of the HP procedures is that they are for predictive modeling, not for making inferences. The rationale is that "all tests are statistically significant" for large data. That is why many HP procedures do not include the same inferential tests that are provided by their classical counterparts.

paulrallo
Calcite | Level 5

Thank you both. To answer PaigeMiller's question, there are several different models I'm trying to retrieve the covariance matrix for, but as an example, one of them is estimated on 42 million rows and has 325 variables.

 

Rick_SAS, I see your point regarding the size of the estimation data, and that is a likely outcome. This is good information regarding the HP procedures.

jmg_va
Calcite | Level 5

Hi,

 

I'm also trying to get the covariance matrix from PROC HPLOGISTIC. Is there a way? I believe that my data set is small enough that this may be possible. I am running stepwise logistic regression on a multiply imputed data set (i.e., by imputation), so I need the covariance matrix to feed into PROC MIANALYZE to get final odds ratio confidence intervals that appropriately account for the variance.

 

Thanks for any help you can give!

Sincerely,

Janet Grubber

Durham VA

Rick_SAS
SAS Super FREQ

I assume you are talking about the COVB matrix = covariance of the betas.

No. The HPLOGISTIC procedure does not output the COVB matrix.

 

Even if it did, I don't think you could use it for "stepwise logistic on a multiply imputed data set." In general, each imputed data set might result in a different selection of effects. Thus the first set of imputed values might select X1, X2, X3 whereas the second set could select X1, X4, X5, and X6.  The corresponding COVB matrices aren't for the same variables, so I don't think you can combine them in any meaningful way.

 

I think the best you can hope for is to select a fixed model and then use PROC LOGISTIC and the COVB option to generate the COVB= data set. Hopefully each BY group won't take too long. I suggest you use the INEST= option on the PROC LOGISTIC statement to provide good starting values for the parameter estimates. You can use PROC HPLOGISTIC to compute the starting values. If you use the final estimates from PROC HPLOGISTIC, then PROC LOGISTIC should only require one iteration.
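One way the INEST= workaround might be sketched for a binary response with no CLASS variables is shown below. The data set work.big, the response y, and the predictors x1-x3 are hypothetical names. The sketch also assumes that the OUTEST option on the PROC HPLOGISTIC statement adds a ParmName column to the ParameterEstimates table, so please check that against the documentation for your release. For a generalized logit model the parameter names differ, and the transposed data set would need to be adapted.

```sas
/* Step 1: fit with PROC HPLOGISTIC and save the parameter estimates.  */
/* Assumption: OUTEST adds a ParmName column to ParameterEstimates.    */
proc hplogistic data=work.big outest;
   model y(event='1') = x1 x2 x3;
   ods output ParameterEstimates=work.pe;
run;

/* Step 2: reshape to one row with one column per parameter, which is  */
/* the layout that PROC LOGISTIC reads from an INEST= data set.        */
proc transpose data=work.pe out=work.inest(type=EST) label=_TYPE_;
   label Estimate=PARMS;
   var Estimate;
   id ParmName;
run;

/* Step 3: start PROC LOGISTIC from those values and write the         */
/* estimates plus their covariance matrix to the OUTEST= data set.     */
proc logistic data=work.big inest=work.inest outest=work.betas_cov covout;
   model y(event='1') = x1 x2 x3;
run;
```

If the starting values are already the converged estimates, PROC LOGISTIC should need very few iterations, so most of the remaining cost is the computation of the covariance matrix itself.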

jmg_va
Calcite | Level 5

Very good points. Thank you!

Janet

 

jmg_va
Calcite | Level 5

Second Thoughts!

 

Upon further thought, it seems it could still be helpful to have a COVB output data set from PROC HPLOGISTIC. Once the final model has been chosen with HPLOGISTIC, it would be convenient to run that final model in HPLOGISTIC as well (instead of switching back to PROC LOGISTIC) by selecting method=none and running the HPLOGISTIC models by imputation. Then it seems the COVB matrix could be fed into PROC MIANALYZE in the same way that it could be fed in from a PROC LOGISTIC output data set. Is this the case?

 

Thank you!

Janet

 

Rick_SAS
SAS Super FREQ

That sounds reasonable, and I don't dispute that you might be able to take advantage of a COVB option in HPLOGISTIC for your imputation. However, HPLOGISTIC does not support that option, so I was suggesting a workaround.
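For the multiple-imputation case, the PROC LOGISTIC route might look like the sketch below. The data set work.mi_out (PROC MI output containing _Imputation_), the response y, and the predictors x1-x3 are hypothetical, and the model is assumed to be fixed in advance rather than selected separately within each imputation.

```sas
/* Fit the same fixed model to each imputed data set and capture the   */
/* parameter estimates and their covariance matrices via ODS.          */
proc logistic data=work.mi_out;
   by _Imputation_;
   model y(event='1') = x1 x2 x3 / covb;
   ods output ParameterEstimates=work.lgsparms CovB=work.lgscovb;
run;

/* Combine the per-imputation estimates and covariance matrices.       */
proc mianalyze parms=work.lgsparms covb(effectvar=stacking)=work.lgscovb;
   modeleffects Intercept x1 x2 x3;
run;
```

The combined estimates and confidence limits are on the log-odds scale, so exponentiating them gives odds ratios and their confidence intervals.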

 

If the data sets are huge, you might ask yourself whether imputation is necessary. For example, if 10% of the observations are missing and you have 1 million observations, you can use listwise deletion and fit the model by using 900K observations.

jmg_va
Calcite | Level 5

Thank you... good workaround, and that is what I'll be doing. For the future, is it possible to add options for outputting COVB and ODDSRATIOS to HPLOGISTIC?

 

Unfortunately, listwise deletion is not going to be an option for this analysis, but it has given me the opportunity to learn more about PROC MI, PROC MIANALYZE, etc.!

 

Thanks again!

Janet

 

Rick_SAS
SAS Super FREQ

I have no idea. Although I work at SAS, I do not speak for SAS, nor am I a developer of that procedure.

 

I discussed your problem with a colleague, and he remembered that PROC GENSELECT supports the COV option on the PROC statement, so maybe you can fit your logistic model by using PROC GENSELECT COV DATA=.... Give it a try.

 
