I need to understand and, ideally, resolve the small discrepancy between (unpenalized) logistic regression results in SAS and Python. As far as I can tell, it does not seem directly attributable to parameter arguments that can be easily changed in either implementation. I'm also curious about the same difference with R, but that's less important, and solving the Python case will probably indirectly answer the R question as well.
I began with a Python-centric approach on StackOverflow, using a semi-famous SAS example from the UCLA Stats department. No one there was able to give an answer, so a couple of days ago I opened a bounty on the question, and today I thought I'd ping you SAS gurus.
Here's the example on StackOverflow:
Here's a transcription in case you don't want to click the link:
I'm trying to match the results of SAS's PROC LOGISTIC with sklearn in Python 3. SAS uses unpenalized regression, which I can achieve in sklearn.linear_model.LogisticRegression with the option C=1e9 or penalty='none'.
That should be the end of the story, but I still notice a small difference when I use a public data set from UCLA and try to replicate their multiple regression of FEMALE and MATH on hiwrite.
This is my Python script:
# module imports
import pandas as pd
from sklearn.linear_model import LogisticRegression

# read in the data
df = pd.read_sas("~/Downloads/hsb2-4.sas7bdat")

# feature engineering: binary target for a "high" writing score
df["hiwrite"] = df["write"] >= 52

print("\n\n")
print("Multiple Regression of Female and Math on hiwrite:")
feature_cols = ['female', 'math']
y = df["hiwrite"]
X = df[feature_cols]

# sklearn fit: effectively unpenalized via a very large C
model = LogisticRegression(fit_intercept=True, C=1e9)
mdl = model.fit(X, y)
print(mdl.intercept_)
print(mdl.coef_)
which yields:
Multiple Regression of Female and Math on hiwrite:
[-10.36619688]
[[1.63062846 0.1978864 ]]
UCLA has this result from SAS:
Analysis of Maximum Likelihood Estimates

Parameter    DF    Estimate    Standard Error    Wald Chi-Square    Pr > ChiSq
Intercept     1    -10.3651            1.5535            44.5153        <.0001
FEMALE        1      1.6304            0.4052            16.1922        <.0001
MATH          1      0.1979            0.0293            45.5559        <.0001
which is close, but as you can see the intercept estimate differs at the 3rd decimal place and the estimate on female differs at the 4th decimal place. I tried changing some of the other parameters (like tol and max_iter, as well as the solver), but it did not change the results. I also tried Logit in statsmodels.api - it matches sklearn, not SAS. R matches Python on the intercept and first coefficient, but is slightly different from both SAS and Python on the second coefficient...
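For reference, here is my best guess at the PROC LOGISTIC call behind UCLA's table; the hsb2 data set name, the hiwrite derivation, and the event='1' coding are assumptions on my part rather than anything shown in their output:
data hsb2;
    set hsb2;                   /* assumes the hsb2 data set has already been read in */
    hiwrite = (write >= 52);    /* same binary target as in the Python script */
run;
proc logistic data=hsb2;
    model hiwrite(event='1') = female math;   /* model the probability of hiwrite=1 */
run;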
Update: I have found that I can affect the estimates by playing with tol, and also by changing the solver to liblinear while removing the penalty in a hack-ish way via the C parameter, but no matter how extreme the values, it still doesn't quite match SAS.
Any thoughts on the source of the error and how to make Python match SAS?
@hackr wrote:
Any thoughts on the source of the error and how to make Python match SAS?
I really wouldn't bother. Clearly the underlying algorithms are different. This isn't like ordinary least squares regression, where a closed form solution exists and so the only difference between software should be round-off error.
These small, practically insignificant differences always trace back to differences in the iterative maximum likelihood algorithm, so you are on the right track. A valid fitting algorithm can take many approaches to deciding when it has reasonably converged on an optimal solution. Among the criteria used are the amount of change in the likelihood, in the gradients, or in the parameters themselves. In PROC LOGISTIC, you will see in the documentation of the MODEL statement that there are several options you can use to select and set the optimization criterion. The default option, GCONV=, uses the change in the gradients. Other choices are FCONV=, ABSFCONV=, and XCONV=. If you require that PROC LOGISTIC declare convergence based on the change in the log likelihood, then PROC LOGISTIC gives the same results as you show from Python - that is, if you specify the MODEL statement options ABSFCONV=1e-8 GCONV=0. These two options together require the log likelihood to change by no more than 1e-8 and disallow convergence based on the gradients. Ultimately, however, this makes no practical difference in almost any case, as both sets of results are correct and quite close to each other.
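A minimal sketch of that suggestion on the MODEL statement, assuming the same derived hiwrite flag as in the question (the hsb2 data set name is an assumption, not part of the original reply):
proc logistic data=hsb2;
    /* Declare convergence on the log-likelihood change only:          */
    /* ABSFCONV=1e-8 requires the absolute change to drop below 1e-8,  */
    /* and GCONV=0 disallows convergence based on the gradients.       */
    model hiwrite(event='1') = female math / absfconv=1e-8 gconv=0;
run;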
Thank you StatDave, this is very helpful.
If you saw my other reply a moment ago, please disregard. I was trying to respond to someone else and I hadn't seen your answer yet.
Also check whether the optimization technique used for estimation is different:
Newton-Raphson with ridging (the PROC HPLOGISTIC default)
vs.
Fisher's scoring (the PROC LOGISTIC default)
Check the following code:
proc logistic data=sashelp.class;
model sex=weight height;
run;
proc hplogistic data=sashelp.class;
model sex=weight height;
run;
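If the reported techniques do differ, a hedged sketch for putting the two procedures on the same footing is PROC LOGISTIC's TECHNIQUE= option on the MODEL statement (FISHER is the default; NEWTON requests Newton-Raphson):
proc logistic data=sashelp.class;
    /* request Newton-Raphson instead of the default Fisher scoring */
    model sex = weight height / technique=newton;
run;
For a binary logit the observed and expected information coincide, so the two techniques should agree at convergence; any remaining difference comes back to the convergence criteria discussed above.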