ncmc
Fluorite | Level 6

I am running logit regressions using PROC HPGENSELECT. The model is specified with a CLASS statement, and the class variable can have more than 100 levels. The problem I am facing is that some of the coefficient estimates have zero DF and a missing standard error (SE). My understanding is that the SE can be missing when the Hessian cannot be calculated, so I thought that choosing a second-order optimization method would resolve the issue. Fewer parameters have a missing SE after switching to second-order optimization, but some are still missing. Can anyone explain why? I am looking for both the 'why' and the 'how to resolve'. I am not at liberty to change the model specification. Thanks!

 

8 REPLIES
Rick_SAS
SAS Super FREQ

Are you sure that this isn't just the standard behavior when you use the GLM parameterization for a classification variable?

Try switching to a reference parameterization and see if it solves the issue. Compare the following:

 

proc hpgenselect data=sashelp.class;
   class age;                       /* default GLM parameterization, which is singular */
   model weight = age;
run;

proc hpgenselect data=sashelp.class;
   class age / param=ref ref=LAST;  /* reference parameterization, which is nonsingular */
   model weight = age;
run;

For documentation of the various parameterizations, see 

http://support.sas.com/documentation/cdl/en/statug/68162/HTML/default/viewer.htm#statug_introcom_sec...

 

ncmc
Fluorite | Level 6

Thank you, Rick. I see that fewer parameters are missing SE when the reference parameterization is used, but I still have missing SE for some variables. I think parameterization is part of the reason but probably not the only one, because when I run the same model with GLM parameterization on different input datasets, I do not have any missing SE at all.

ballardw
Super User

@ncmc wrote:

if I run the same model with GLM parameterization on different input datasets, I do not have missing SE at all.  


Which strongly indicates that there is something in your data, perhaps no variability for some variables or combinations thereof.

JacobSimonsen
Barite | Level 11

If your class variable with more than 100 levels is not the parameter of interest, then I suggest using a conditional regression. That means the variable is still in the model, but in a non-parametric way, so you will not get estimates for it. This reduces the number of parameters a lot and is therefore also likely to solve the problem. There are different ways to do that. In logistic regression you can simply use the STRATA statement in PROC LOGISTIC. If your response is continuous, you can put the class variable in the ABSORB statement instead of the CLASS statement in PROC GLM.
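A minimal sketch of the STRATA approach, assuming the data set and macro variable names (`subset`, `HoldOut`, `GeogKey`, `&dep`, `&explanatoryvars`) from the PROC HPGENSELECT call posted later in this thread. The observation weight from that call is deliberately left out, so this is an illustration of the idea rather than a drop-in replacement:

```sas
/* Sketch only: conditional logistic regression with GeogKey as strata.  */
/* Names are borrowed from the poster's HPGENSELECT call; the WEIGHT     */
/* statement from that call is intentionally not carried over here.      */
proc logistic data=subset;
   where HoldOut=0;
   strata GeogKey;                            /* conditioned out; no estimates are reported for it */
   model &dep(event='1') = &explanatoryvars;
run;
```

Because the strata are conditioned out, the output contains no per-level GeogKey coefficients, which is the trade-off described above.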

Rick_SAS
SAS Super FREQ

Please show us the syntax that you are using to call PROC HPGENSELECT.

ncmc
Fluorite | Level 6

Hi Rick @Rick_SAS, the syntax is as below.

 

proc hpgenselect data=subset;
   where HoldOut=0;
   class GeogKey;
   id prodkey geogkey week;
   ods output parameterestimates=est0 ConvergenceStatus=ConvgStatus;
   model &dep (event='1') = GeogKey &explanatoryvars / dist=Binary link=logit NoInt;
   weight OBSWGT;
   output out=PredData predicted=Pred_&dep;
run;

 

@JacobSimonsen Treating the class variable non-parametrically is not an option for this project. I agree with you, @ballardw, that this is very likely to be a data issue, but all the invariant variables are dropped before this procedure. My question at this point is how the optimization could converge even though the Hessian could not be calculated (my understanding is that the default optimization technique is a second-derivative method).

 

Many thanks for all of your help. 

Rick_SAS
SAS Super FREQ

My guess is that missing values are the issue.

 

Let's say that one level of GEOGKEY (call it GEOGKEY='K') has only a few observations and that one or more of the covariates have missing values for those observations. The procedure will drop the observations that have missing values, which will perhaps leave only ONE observation for the set GEOGKEY='K'.  That will lead to the DF=0 issue that you report.  (This could also occur if your weight variable has missing or nonpositive values.)

 

To determine whether this is the issue, you could do the following:

1. Output the design matrix, probably by using PROC LOGISTIC

2. Use the DATA step to delete all rows that have any missing values for the explanatory variables or invalid weight variables.

3. Use PROC FREQ ORDER=FREQ; TABLE GEOGKEY; RUN; to count the number of valid observations in each level of GEOGKEY.  

If one or more levels have too few observations, you can use a WHERE clause to exclude them from the analysis.
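Steps 2 and 3 might look roughly like the sketch below. It assumes that `&explanatoryvars` expands to a plain space-separated list of variable names (no interaction terms or class effects), so the raw variables can be checked directly without building the design matrix first. The data set names `valid` and `geog_counts` and the cutoff of 5 are made up for illustration:

```sas
/* Step 2: keep only rows with complete explanatory variables, a       */
/* nonmissing response, and a positive weight (assumed validity rules). */
data valid;
   set subset;
   where HoldOut=0;
   if cmiss(of &explanatoryvars) = 0
      and not missing(&dep)
      and OBSWGT > 0;             /* a missing weight also fails this test */
run;

/* Step 3: count the valid observations in each level of GeogKey */
proc freq data=valid order=freq;
   tables GeogKey / out=geog_counts;
run;

/* List the sparse levels; the threshold of 5 is arbitrary */
proc print data=geog_counts;
   where count < 5;
run;
```

Any GeogKey value that appears in `subset` but not in `geog_counts` lost all of its observations in the filtering step.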

ncmc
Fluorite | Level 6

I had only thought about invariance of the X variables at the level of the whole model, and never about invariance within levels of the class variable. In my data there are X variables that are invariant within the geogkeys. I concluded that an X variable being invariant within a class level caused the singularity. Thank you all for your help.
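For anyone who wants to run the same check, here is a rough sketch. It assumes the explanatory variables are all numeric and that `&explanatoryvars` is a plain space-separated list; the output data set name `within_range` is hypothetical:

```sas
/* Range of every explanatory variable within each GeogKey level.       */
/* AUTONAME creates one output column per variable, named <var>_Range.  */
proc means data=subset noprint nway;
   where HoldOut=0;
   class GeogKey;
   var &explanatoryvars;
   output out=within_range range= / autoname;
run;

proc print data=within_range;
run;
```

A range of 0 means the variable is constant within that GeogKey level. A variable whose range is 0 in every level is an exact linear combination of the GeogKey indicators, so its coefficient cannot be estimated separately from them.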
