03-14-2013 06:49 PM
Let me start by saying I am new to fitting mixed models, am very excited to learn about them, and feel I have a good understanding of the theory behind them (but not of the optimization SAS is doing). I have been building a model in GLIMMIX, with only fixed effects for the time being (hoping to complicate things after binning the fixed classification effects). Everything was going well, and then I decided to add a spline (experimental in my version of SAS) for the age variable, as follows:
PROC GLIMMIX DATA = SORTED;
  CLASS X1 X2;
  EFFECT AgeSpline = SPLINE(Age);
  MODEL logPrice = X1 X2 AgeSpline /
    DIST = NORMAL
    LINK = IDENTITY;
RUN;
After fitting the spline, many (but not all?) of the standard errors of my class-effect parameter estimates go to 0, leaving me unable to determine which levels remain significantly different from one another (the non-base-level estimates are nonzero). I have 400K observations and 298 columns in X (many more indicator variables for the various classification levels than shown above, for simplicity's sake), so I do not believe I am fitting to every cell. The spline appears to have seven parameters associated with it.
I believe my issue may be in understanding how the spline equations work? Should I be using RANDOM _RESIDUAL_ / TYPE=RSMOOTH instead? Am I somehow overfitting something using the default spline?
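To make the question concrete, here is a small sketch (in Python/numpy, outside SAS, so purely illustrative) of the kind of exact collinearity I worry the default spline might introduce. If the default basis is a B-spline (which I believe it is, and which would be consistent with the seven parameters I see), its columns sum to 1 at every observation, just like a full set of CLASS indicators; an intercept plus full indicator sets plus a full spline basis is then rank-deficient. The degree-1 "hat" basis below is the simplest B-spline basis with this partition-of-unity property:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
age = rng.uniform(0.0, 10.0, n)

# Degree-1 B-spline ("hat") basis on 5 equally spaced knots; like
# higher-degree B-splines, its columns sum to exactly 1 at every
# observation (the partition-of-unity property).
knots = np.linspace(0.0, 10.0, 5)
B = np.column_stack(
    [np.interp(age, knots, np.eye(len(knots))[i]) for i in range(len(knots))]
)

# A 3-level CLASS variable coded with a FULL set of indicators:
# its rows also sum to 1.
cls = rng.integers(0, 3, n)
D = np.eye(3)[cls]

# Intercept + full indicator set + full spline basis.
X = np.column_stack([np.ones(n), D, B])

# X has 9 columns but two exact linear dependencies (the intercept
# equals the row sums of D and of B), so its rank is 2 less than its
# column count and some parameters are aliased, i.e. not estimable.
rank = np.linalg.matrix_rank(X)
```

Whether this is actually what produces zero (rather than missing) standard errors in GLIMMIX, I can't say; it is just the mechanism I suspect.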
Thank you very much for your help in what I hope isn't a foolish question.
03-16-2013 01:36 PM
As you describe it, your problem has nothing to do with the spline function.
If you have 298 columns in X [= intercept + # columns to represent X1 + # columns to represent X2 + # columns to represent the spline function],
then this by itself might explain why many of the standard errors among your 298 parameter estimates equal zero, even with 400K observations. You need at least two observations for each combination of these columns to generate a standard error, and 298 indicator variables for these CLASS effects can generate as many as 2**298 possible combinations, which vastly exceeds 400K.
Besides rethinking what you want to do, consider using subject-matter knowledge to reduce the number of columns in X, perhaps by combining several similar categories of X1 and of X2.
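As a sketch of what I mean by combining categories (Python here, purely illustrative; in SAS you would do the recode in a DATA step or with a format), the point is just a many-to-one mapping that shrinks the number of indicator columns the model has to estimate. The grouping below is arbitrary; yours should come from subject-matter knowledge:

```python
import numpy as np

rng = np.random.default_rng(1)
raw = rng.integers(0, 10, size=1000)  # a hypothetical 10-level classification variable

# Stand-in for a subject-matter grouping: collapse the 10 raw levels
# into 3 coarser bins (levels 0-3, 4-7, and 8-9).
coarse = raw // 4

n_dummies_raw = len(np.unique(raw))        # 10 indicator columns before binning
n_dummies_coarse = len(np.unique(coarse))  # 3 indicator columns after binning
```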
03-17-2013 07:47 AM
Thanks for your help!
I guess I was blaming the spline because, without it, I have no standard-error issues, and I didn't think adding a continuous predictor (actually seven, with the spline) would cause the sparseness issue. Using Age as a simple linear predictor doesn't cause any issues either, which further led me to believe I didn't have the problem you described. My hope was that, since Age is such a useful variable for my predictive model, I could fully parameterize the model and then group my classification variables afterward to trim them down to a reasonable number. There is enough interplay between the variables that I worried a binning conclusion made without all the other adjustments in place might not be valid once they are taken into account. But I guess I'm going to need to deal with it.
03-17-2013 11:36 AM
That is puzzling, that the zero standard errors surfaced only after you included the spline function. Generally speaking, CLASS statement variables are treated as categorical, with one level, usually the last in sort order (although you can change this), assigned as the reference category, with an estimate of zero and a missing value for its standard error. I can't explain why adding the spline would lead to zero standard errors on the other variables unless the number of independent indicator variables equals, or nearly equals, the number of observations. Generally speaking, to obtain stable and reproducible models, one wants at least five, and preferably at least ten, observations per independent variable (where each level of a categorical variable is counted as a separate independent variable).
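To illustrate the reference-category coding I described (Python/numpy here, purely as a sketch; the mechanics mirror SAS's GLM parameterization of CLASS effects), the reference level gets no column of its own, so it has no estimated parameter; its mean is absorbed into the intercept, and the other levels are estimated as differences from it:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
lvl = rng.integers(0, 3, n)             # a CLASS variable with levels 0, 1, 2
true_means = np.array([1.0, 2.5, 4.0])
y = true_means[lvl] + rng.normal(0.0, 0.1, n)

# Reference-cell coding: drop the last level, which plays the role of
# the reference category.
X = np.column_stack([
    np.ones(n),                  # intercept
    (lvl == 0).astype(float),    # indicator for level 0
    (lvl == 1).astype(float),    # indicator for level 1
])                               # no column for level 2: it is the reference

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
# beta[0] estimates the reference-level mean; beta[1] and beta[2]
# estimate the DIFFERENCES of levels 0 and 1 from the reference. The
# reference level has no parameter of its own, which is why SAS shows
# it with a zero estimate and a missing standard error.
```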
03-17-2013 12:14 PM
Sorry. Because you did say that you had about 400,000 observations, my last comment about the number of estimated parameters vs. the number of observations is irrelevant.