I’m running a logistic regression with PROC LOGISTIC on a sample of approximately 150,000 people described by 1,500 variables. The analysis takes about 8 hours. Is there any methodological way to speed it up, or is it rather a software/hardware problem?
Hi Iryna.
I don't think you really need all 1,500 variables in the model, do you?
So I'd use SELECTION=FORWARD together with STOP=50 to find the (at most) fifty variables that contribute most to your model, and then rerun the model with only those.
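A rough sketch of what that could look like. The dataset name (work.people), the response y (assumed to be coded 0/1) and the predictor names x1-x1500 are made up for illustration, so substitute your own:

```sas
/* Step 1: forward selection, capped at 50 effects.            */
/* work.people, y and x1-x1500 are hypothetical names.         */
proc logistic data=work.people;
    model y(event='1') = x1-x1500
          / selection=forward   /* add one effect at a time           */
            stop=50;            /* stop once 50 effects are in the model */
run;

/* Step 2: refit using only the variables that were selected.  */
/* x3 x17 x42 are placeholders, not actual results.            */
proc logistic data=work.people;
    model y(event='1') = x3 x17 x42;
run;
```

Forward selection over 1,500 candidates still has to score every candidate at each step, so this may not be fast either, but it avoids fitting the full 1,500-variable model.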
Are any of these 1,500 variables highly correlated? If so, you might be able to select one among a group of highly correlated variables or use a small number of principal components (from a Principal Components Analysis) for your logistic regression.
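If you want to try the principal components route, something along these lines might work. Again, work.people, y and x1-x1500 are hypothetical names, and keeping 20 components is an arbitrary choice for the sketch; look at the eigenvalue table to decide how many you actually need:

```sas
/* Compute principal components of the predictors.              */
/* OUT= keeps the original variables and adds scores Prin1, Prin2, ... */
proc princomp data=work.people out=work.pcs n=20;
    var x1-x1500;
run;

/* Fit the logistic model on the component scores instead of the raw variables. */
proc logistic data=work.pcs;
    model y(event='1') = prin1-prin20;
run;
```

The trade-off is interpretability: the coefficients now refer to components, not to the original variables.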
Iryna, I don't think you need all 150,000 records/observations either.
For example, if you are interested in variables that capture respondents' ratings of certain job attributes, you may want to use the data for employed respondents only.
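One way to restrict the records without creating a new dataset is a WHERE statement inside the procedure. This is only a sketch: the flag employed (1 = employed), the dataset work.people, the response y and the predictor list are all assumed names:

```sas
proc logistic data=work.people;
    where employed = 1;              /* hypothetical flag: keep employed respondents only */
    model y(event='1') = x1-x50;     /* hypothetical, reduced predictor list */
run;
```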