11-15-2011 01:42 PM
I finally finished running a linear regression model with PROC GENMOD, but it took 25 hours. The dataset has 1 million cases and 40 categorical variables.
What I don't quite understand is this:
The estimated intercept is 1400, which I took to be the overall mean (all the predictors in the model are categorical and parameterized with effect coding). But the observed mean of the dependent variable is only 250. I don't understand why there is such a big difference. Is it because of poor model fit?
Thanks for any ideas.
11-15-2011 02:11 PM
What is important is the number of levels (= unique values) in your classification variables. If each of the 40 classification variables has 10 levels, then the regression involves approximately 400 dummy variables as regressors.
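To make the column count concrete, here is a rough sketch in Python (not SAS). The level counts are made up for illustration: 40 variables with 10 levels each. It counts the design-matrix columns, excluding the intercept, under GLM-style coding (one indicator per level) and under effect coding (one fewer column per variable).

```python
# Hypothetical level counts: 40 classification variables, 10 levels each.
levels_per_variable = [10] * 40

# GLM-style coding: one indicator column per level.
glm_columns = sum(levels_per_variable)

# Effect (or reference) coding: k - 1 columns for a variable with k levels.
effect_columns = sum(k - 1 for k in levels_per_variable)

print(glm_columns, effect_columns)
```

With real data you would replace the list with the actual number of unique values in each variable; a few variables with many levels can dominate the total.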
If I recall, you are using GENMOD only because you want to use a parameterization that is different from the GLM encoding. How long does it take for your problem to run in GLM? GENMOD solves a maximum likelihood problem, which involves an iterative optimization, so it will be slower than GLM on the same problem.
For effect coding, the main effects estimate the difference between the effect of each nonreference level and the average effect over all levels of that variable. That average effect gets lumped in with the intercept. That's why your Intercept estimate is different from the observed mean.
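A small illustration of why the two numbers can be far apart, sketched in Python with made-up data (one factor, four levels, very unbalanced). It relies on a property of the one-factor case: the least-squares fit reproduces the level means, so the effect-coded intercept is the unweighted average of the level means, whereas the observed mean weights each level by its number of observations.

```python
from statistics import fmean

# Made-up, unbalanced data: level A has many small responses,
# levels B, C, D have few large ones.
groups = {
    "A": [100.0] * 8,
    "B": [400.0, 600.0],
    "C": [900.0],
    "D": [1300.0],
}

level_means = {g: fmean(v) for g, v in groups.items()}

# Effect-coded intercept for a one-factor model:
# the unweighted average of the level means.
intercept = fmean(level_means.values())

# Observed mean: every observation counts once, so big levels dominate.
all_values = [y for v in groups.values() for y in v]
observed_mean = fmean(all_values)

# Effect-coded main effects: level mean minus the intercept.
effects = {g: m - intercept for g, m in level_means.items()}

print(intercept, observed_mean)
```

In this toy example the intercept is 700 while the observed mean is about 333. With 40 unbalanced variables the gap can be much larger, and it says nothing by itself about model fit.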
I assume you know that the predicted values you get from GENMOD are the same as those you get from GLM. The only difference is how to INTERPRET the parameters. For an example with continuous variables, see http://blogs.sas.com/content/iml/2010/11/10/regression-coefficients-for-different-polynomial-bases/
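Here is a minimal sketch of the same point for a categorical variable, in plain Python rather than SAS. The data, the helper names (`solve`, `ols`, `reference_row`, `effect_row`), and the choice of the last level as reference are all assumptions for illustration. The reference coding here stands in for what GLM coding effectively gives once the redundant last-level parameter is set to zero. The two fits produce different parameter estimates but identical predicted values.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(X, y):
    """Ordinary least squares via the normal equations (fine for tiny examples)."""
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)


levels = ["A", "B", "C"]  # last level is treated as the reference
data = [("A", 10.0), ("A", 14.0), ("A", 12.0),
        ("B", 30.0), ("B", 34.0), ("C", 50.0)]


def reference_row(g):
    # Intercept plus 0/1 indicators for the nonreference levels.
    return [1.0] + [1.0 if g == lv else 0.0 for lv in levels[:-1]]


def effect_row(g):
    # Same as reference coding, except the reference level is coded -1.
    if g == levels[-1]:
        return [1.0] + [-1.0] * (len(levels) - 1)
    return reference_row(g)


y = [v for _, v in data]
X_ref = [reference_row(g) for g, _ in data]
X_eff = [effect_row(g) for g, _ in data]

b_ref = ols(X_ref, y)
b_eff = ols(X_eff, y)

pred_ref = [sum(x * b for x, b in zip(row, b_ref)) for row in X_ref]
pred_eff = [sum(x * b for x, b in zip(row, b_eff)) for row in X_eff]

print(b_ref, b_eff)
print(pred_ref, pred_eff)
```

The reference-coded intercept is the mean of level C, the effect-coded intercept is the average of the three level means, and both sets of predictions are just the level means.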