08-08-2017 10:29 AM
I am running logit regressions using PROC HPGENSELECT. The model is specified with a CLASS statement, and the class variable can have more than 100 levels. The problem I am facing is that some of the coefficient estimates have zero DF and a missing standard error (SE). My understanding is that the SE can be missing when the Hessian could not be calculated, so I thought that choosing a second-order optimization method would resolve the issue. I have fewer variables with missing SE after using second-order optimization, but some parameters are still missing SE. Can anyone explain why? I am looking for both the 'why' and the 'how to resolve'. I am not at liberty to change the model specification. Thanks!
08-08-2017 10:41 AM
Are you sure that this isn't just the standard behavior when you use the GLM parameterization for a classification variable?
Try switching to a reference parameterization and see if it solves the issue. Compare the following:
proc hpgenselect data=sashelp.class;
   class age; /* use GLM parameterization, which is singular */
   model weight = age;
run;

proc hpgenselect data=sashelp.class;
   class age / ref=LAST; /* use nonsingular parameterization */
   model weight = age;
run;
For documentation of the various parameterizations, see
08-08-2017 05:01 PM
Thank you, Rick. I see that fewer parameters are missing SE when the reference parameterization is used, but I still have missing SE for some variables. I think the parameterization is part of the reason but probably not the only reason, because if I run the same model with GLM parameterization on different input datasets, I do not have missing SE at all.
08-08-2017 06:39 PM
if I run the same model with GLM parameterization on different input datasets, I do not have missing SE at all.
That strongly indicates there is something in your data, perhaps no variability for some variables or combinations thereof.
08-09-2017 04:20 AM
If your class variable with more than 100 levels is not the parameter of interest, then I suggest using a conditional regression. That means the variable is still in the model, but in a non-parametric way, so you will not get estimates for that variable. This reduces the number of parameters a lot, and is therefore also likely to solve the problem. There are different ways to do that. In logistic regression you can simply use the STRATA statement in PROC LOGISTIC. If your response is continuous, you can put the class variable into the ABSORB statement instead of the CLASS statement in PROC GLM.
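A minimal sketch of both approaches follows. The names `have`, `y`, `ycont`, `group`, `x1`, and `x2` are placeholders and not anything from the original poster's data; treat this as an illustration of the statements, not a drop-in replacement for the model.

```sas
/* Conditional logistic regression: group is conditioned out,    */
/* so no parameter estimates are reported for its levels.        */
/* have, y, group, x1, x2 are hypothetical names.                */
proc logistic data=have;
   strata group;
   model y(event='1') = x1 x2;
run;

/* Continuous response: absorb the group effect in PROC GLM.     */
/* ABSORB requires the data to be sorted by the absorbed         */
/* variable. ycont is a hypothetical continuous response.        */
proc sort data=have;
   by group;
run;

proc glm data=have;
   absorb group;
   model ycont = x1 x2;
run;
quit;
```

In both cases the effects of x1 and x2 are estimated within levels of the grouping variable, which is why the level effects themselves never appear in the parameter table.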
08-09-2017 09:29 AM
Hi Rick @Rick_SAS, the syntax is as below.
proc hpgenselect data=subset;
   class GeogKey;
   id prodkey geogkey week;
   ods output parameterestimates=est0 ConvergenceStatus=ConvgStatus;
   model &dep (event='1') = GeogKey &explanatoryvars / dist=Binary link=logit NoInt;
   output out=PredData predicted=Pred_&dep;
run;
@JacobSimonsen Handling the class variable non-parametrically is not an option for this project. I agree with you @ballardw that this is very likely to be a data issue, but all the invariant variables are dropped before this procedure. My question at this point is how the optimization could converge even though it could not calculate the Hessian (my understanding is that the default optimization technique is a second-derivative method).
Many thanks for all of your help.
08-09-2017 09:50 AM
My guess is that missing values are the issue.
Let's say that one level of GEOGKEY (call it GEOGKEY='K') has only a few observations and that one or more of the covariates have missing values for those observations. The procedure will drop the observations that have missing values, which will perhaps leave only ONE observation for the set GEOGKEY='K'. That will lead to the DF=0 issue that you report. (This could also occur if your weight variable has missing or nonpositive values.)
To determine whether this is the issue, you could do the following:
1. Output the design matrix, probably by using PROC LOGISTIC
2. Use the DATA step to delete all rows that have any missing values for the explanatory variables or invalid weight variables.
3. Use PROC FREQ ORDER=FREQ; TABLE GEOGKEY; RUN; to count the number of valid observations in each level of GEOGKEY.
If one or more levels have too few observations, you can use a WHERE clause to exclude them from the analysis.
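Steps 2 and 3 might look like the sketch below, which skips the design-matrix step and works on the raw variables instead. It assumes that &explanatoryvars is a plain space-separated list of variable names (no interaction terms) and that no WEIGHT statement is in use, as in the posted syntax. The data set names Complete and LevelCounts are made up for illustration.

```sas
/* Step 2: keep only the rows the procedure would actually use.  */
/* CMISS counts missing values among numeric and character       */
/* arguments. Complete is a hypothetical data set name.          */
data Complete;
   set subset;
   if cmiss(of &explanatoryvars) = 0 and not missing(&dep);
run;

/* Step 3: count the valid observations in each level of GeogKey */
/* and save the counts. LevelCounts is a hypothetical name.      */
proc freq data=Complete order=freq;
   tables GeogKey / out=LevelCounts;
run;
```

Levels with very small counts in LevelCounts are the ones to inspect first.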
08-31-2017 09:55 AM - edited 08-31-2017 10:00 AM
I had only thought about invariance of the X variables at the level of the whole model; I never thought about invariance within levels of the class variable. In my data there are X variables that are invariant within the geogkeys. I concluded that an X variable that is invariant within a class level caused the singularity. Thank you all for your help.
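For anyone who wants to check for the same problem, here is one possible way to list the X variables that do not vary within a level of GeogKey. It is only a sketch: it assumes every variable in &explanatoryvars is numeric and that the macro variable is a plain list of names. The data set names WithinRange and Invariant are made up.

```sas
/* Range of each X variable within each level of GeogKey.        */
/* With a single statistic and no output names, the output       */
/* variables keep the names of the analysis variables.           */
proc means data=subset noprint nway;
   class GeogKey;
   var &explanatoryvars;
   output out=WithinRange range=;
run;

/* One row per (GeogKey, variable) pair whose range is zero.     */
data Invariant;
   set WithinRange;
   array x{*} &explanatoryvars;
   length Variable $32;
   do i = 1 to dim(x);
      if x{i} = 0 then do;
         Variable = vname(x{i});
         output;
      end;
   end;
   keep GeogKey _FREQ_ Variable;
run;
```

A level with a single nonmissing observation also has a range of zero, so _FREQ_ is kept to help tell the two situations apart.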