I'm running a model similar to the following:
proc logistic data=table;
   model Y = X1 X2 X1*X2 X3 X4 X5;
run;
In this model, Y equals 0 or 1, X1 and X2 are indicator variables (equal to 0 or 1), and X3, X4, and X5 are continuous. In this sample, Y = 0 for all observations where X1*X2 = 1. Thus, the coefficient on X1*X2 should not be estimable. However, SAS still provides a point estimate and a statistically significant p-value for X1*X2 without displaying any error or warning in the log, such as a warning about separation of data points. As far as SAS is concerned, "convergence criterion (GCONV=1E-8) satisfied" and all is dandy in the world.
Why? What is going on? Surely SAS shouldn't be behaving this way? When running this same model on the same sample in Stata, Stata appropriately drops X1*X2 when estimating this model.
Any insights on this would be great.
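One way to confirm the pattern described above is to tabulate Y within each X1-by-X2 cell, restricted to the rows that PROC LOGISTIC actually uses (it excludes observations with a missing response or explanatory variable). This is only a minimal sketch that reuses the data set and variable names from the model above; the WHERE filter assumes all six variables are numeric with ordinary missing values.

```sas
/* Restrict to complete cases, because PROC LOGISTIC drops any row
   with a missing value in the response or an explanatory variable. */
proc freq data=table;
   where nmiss(Y, X1, X2, X3, X4, X5) = 0;
   /* LIST prints one row per combination that occurs in the data */
   tables X1*X2*Y / list missing;
run;
```

If no row with X1=1, X2=1, Y=1 appears in the listing, the analysis sample really does have Y = 0 whenever X1*X2 = 1.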
If X1 and X2 are binary variables, you should not treat them as regression variables. Put a CLASS statement above your MODEL statement, like this:
class X1 X2;
Looks to me like X2 is an excellent predictor for Y. Collinearity is a problem when it occurs between predictors, in which case it is sometimes better to drop one of the culprits. But one does expect some sort of relationship between the dependent variable and its predictors. Issuing a note when that relationship is a little too perfect might be a good idea though.
PG
Edited my original post to clarify the model. However, the original point still stands. You should not be able to obtain a point estimate for a variable in a logistic model via maximum likelihood if Y does not vary among the observations where that variable equals 1. For example, see http://support.sas.com/rnd/app/stat/papers/logistic.pdf or https://www.statalist.org/forums/forum/general-stata-discussion/general/1357105-stata-omits-variable... or page 5 of https://www.stata.com/manuals13/rlogit.pdf.
I would expect SAS to at least throw a warning or an error when this happens. It should not be providing a point estimate with p values and pretending like nothing is wrong. Does anyone know why SAS is behaving this way?
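The iteration history is one place to look for signs that the estimate is drifting rather than settling. The following is a minimal sketch, assuming the same data set and variables as the original model; the only addition is the ITPRINT option in the MODEL statement, which prints the parameter estimates at each iteration.

```sas
proc logistic data=table;
   /* ITPRINT adds the iteration history to the output,
      including the parameter estimates at every step */
   model Y(event='1') = X1 X2 X1*X2 X3 X4 X5 / itprint;
run;
```

If the X1*X2 coefficient keeps growing in magnitude from one iteration to the next while the log likelihood barely changes, that would point to separation even though no warning was printed.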
You haven't provided data, so there is not a lot we can say. Issues like this usually require looking at the data.
I can say that when I try to reproduce your claim by using a simulation, SAS reports the warning that you are expecting. Try running the code below. Do you see these warnings? If so, maybe your data are not what you believe them to be.
SAS Log:
WARNING: There is possibly a quasi-complete separation of data points.
The maximum likelihood estimate may not exist.
WARNING: The LOGISTIC procedure continues in spite of the above warning.
Results shown are based on the last maximum likelihood
iteration. Validity of the model fit is questionable.
SAS Output:
| Model Convergence Status |
|---|
| Quasi-complete separation of data points detected. |
| Warning: The maximum likelihood estimate may not exist. |
data Have;
call streaminit(1234);
do i = 1 to 200;
   x1 = rand("Bernoulli", 0.7);
   x2 = rand("Bernoulli", 0.5);
   x3 = rand("Normal", 2, 3);
   x4 = rand("Normal", 0, 1);
   x5 = rand("Normal", -1, 2);
   eta = x1 - x2 + 0.5*x1*x2 + x3 - 2*x4 + 3*x5;
   if x1*x2=1 then 
      Y = 1;
   else
      Y = rand("Bernoulli", logistic(eta));
   output;
end;
run;
proc logistic data=Have;
 class x1 x2;
 model Y(event='1') = X1 X2 X1*X2 X3 X4 X5;  /* quasi-separation */
 *model Y = X1 X2 X3 X4 X5;  /* model OK */
run;
I can't provide the data on a public forum. However, I know that a warning message is usually displayed. I've seen complete or quasi-complete separation warnings before. (I get the quasi-complete separation warning when running your code.) In my case, however, no warning is being displayed. I assure you my data is as described. Plus, Stata behaves exactly as expected by dropping the variable so...
Maybe I could privately share the dataset with someone at SAS who can diagnose? This may be a rare edge case. SAS has been known to provide misleading coefficients before without appropriate warning messages (https://pdfs.semanticscholar.org/4f17/1322108dff719da6aa0d354d5f73c9c474de.pdf).
SAS Technical Support is always happy to help.
