10-04-2013 05:07 PM
I'm considering options for a multivariable logistic regression that will predict inpatient admission on a given day from past admissions and other time-varying covariates over a two-year study period.
Given that my dataset will contain several million patient*days, only a small percentage of which are inpatient admission days, I'm considering sampling on the dependent variable (also called endogenous stratified sampling), as suggested by Paul Allison in Logistic Regression Using the SAS System for modeling rare binary events.
My strategy is to retain all of the patient*days on which an inpatient admission occurred (to avoid sacrificing the model's power) and to randomly select a small sample, by patient, from the millions of patient*days on which there was no inpatient admission.
Is proc surveylogistic the appropriate procedure to use for this model?
Or do I even need to adjust for the sampling at all? I believe the model coefficients (except the intercept) will not be biased when sampling on the dependent variable.
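
To make the reasoning behind that belief concrete, here is a rough sketch of the standard argument. The notation is my own assumption, not something taken from the book: S = 1 means a patient*day is kept in the sample, pi_1 and pi_0 are the sampling fractions for admission and non-admission days, and the key assumption is that selection depends only on the outcome Y and not on the covariates x. If the non-admission sampling rate differs by patient, or otherwise depends on covariates, this simple form would not hold as written.

```latex
% Assumed population model:
%   logit P(Y=1 | x) = \beta_0 + x^\top \beta
% Assumed sampling scheme (depends on Y only):
%   P(S=1 | Y=1) = \pi_1, \qquad P(S=1 | Y=0) = \pi_0
\begin{align*}
P(Y=1 \mid x, S=1)
  &= \frac{\pi_1 \, P(Y=1 \mid x)}
          {\pi_1 \, P(Y=1 \mid x) + \pi_0 \, P(Y=0 \mid x)} \\[4pt]
\operatorname{logit} P(Y=1 \mid x, S=1)
  &= \log\frac{\pi_1}{\pi_0} + \operatorname{logit} P(Y=1 \mid x) \\
  &= \Bigl(\beta_0 + \log\frac{\pi_1}{\pi_0}\Bigr) + x^\top \beta
\end{align*}
% Only the intercept is shifted; the slopes \beta are unchanged.
% With \beta_0^{*} the intercept fitted on the sample:
%   \beta_0 = \beta_0^{*} - \log(\pi_1 / \pi_0)
% If every admission day is retained (\pi_1 = 1):
%   \beta_0 = \beta_0^{*} + \log \pi_0
```

So under these assumptions the slope estimates should carry over directly, and only the intercept would need the log(pi_1/pi_0) correction if I want predicted probabilities on the original scale. This sketch says nothing about the correlation among patient*days from the same patient, which is a separate question.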