I need to understand and, ideally, resolve a small discrepancy between (unpenalized) logistic regression results in SAS and Python. As far as I can tell, it is not directly attributable to parameter arguments that can be easily changed in either implementation. I'm also curious about the same difference with R, but that's less important, and solving the Python case will probably indirectly answer the R question as well.

I began with a Python-centric approach on StackOverflow, using a semi-famous SAS example from the UCLA Stats department. No one there was able to give an answer, so a couple of days ago I opened a bounty on the question, and today I thought I'd ping you SAS gurus. Here's the example on StackOverflow:

https://stackoverflow.com/questions/67128818/source-of-small-deviance-between-sas-proc-logistic-and-python-sklearn-logisticre

Here's a transcription in case you don't want to click the link:

I'm trying to match the results of SAS's PROC LOGISTIC with sklearn in Python 3. SAS uses unpenalized regression, which I can achieve in sklearn.linear_model.LogisticRegression with the option C=1e9 or penalty='none'. That should be the end of the story, but I still notice a small difference when I use a public data set from UCLA and try to replicate their multiple regression of FEMALE and MATH on hiwrite. This is my Python script:

# module imports
import pandas as pd
from sklearn.linear_model import LogisticRegression
# read in the data
df = pd.read_sas("~/Downloads/hsb2-4.sas7bdat")
# feature engineering: binary outcome, 1 when write >= 52
df["hiwrite"] = df["write"] >= 52
print("\n\n")
print("Multiple Regression of Female and Math on hiwrite:")
feature_cols = ['female','math']
y=df["hiwrite"]
X=df[feature_cols]
# sklearn fit: a huge C makes the default L2 penalty effectively zero
model = LogisticRegression(fit_intercept=True, C=1e9)
mdl = model.fit(X, y)
print(mdl.intercept_)
print(mdl.coef_)

which yields:

Multiple Regression of Female and Math on hiwrite:
[-10.36619688]
[[1.63062846 0.1978864 ]]
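Side note: the penalty='none' variant I mentioned above looks like the sketch below; as far as I can tell it should converge to the same unpenalized fit as the huge-C version (it reuses X and y from the script above; lbfgs is the default solver, and max_iter is raised just to be safe):

# same fit, but with regularization switched off explicitly rather than via a huge C
model2 = LogisticRegression(fit_intercept=True, penalty='none', solver='lbfgs', max_iter=1000)
mdl2 = model2.fit(X, y)
print(mdl2.intercept_)
print(mdl2.coef_)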
UCLA has this result from SAS:

Analysis of Maximum Likelihood Estimates

                            Standard        Wald
Parameter   DF   Estimate      Error    Chi-Square   Pr > ChiSq
Intercept    1   -10.3651     1.5535       44.5153       <.0001
FEMALE       1     1.6304     0.4052       16.1922       <.0001
MATH         1     0.1979     0.0293       45.5559       <.0001
This is close, but as you can see the intercept estimate differs at the 3rd decimal place and the FEMALE estimate differs at the 4th decimal place. I tried changing some of the other parameters (like tol and max_iter, as well as the solver), but it did not change the results. I also tried Logit in statsmodels.api; it matches sklearn, not SAS. R matches Python on the intercept and the first coefficient, but is slightly different from both SAS and Python on the second coefficient...

Update: I have found that I can affect the estimates by playing with tol, and also by changing the solver to liblinear while removing the penalty in a hack-ish way via the C parameter; however, no matter how extreme the values, it still doesn't quite match SAS.

Any thoughts on the source of the error and how to make Python match SAS?
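For reference, here is roughly what the statsmodels cross-check and the liblinear/tol experiment from the update look like (a sketch only, reusing X and y from the script above; the tol and C values are arbitrary extremes, not anything principled):

import statsmodels.api as sm

# unpenalized MLE via statsmodels (Newton's method by default);
# this reproduces the sklearn estimates, not the SAS ones
sm_fit = sm.Logit(y.astype(float), sm.add_constant(X)).fit()
print(sm_fit.params)

# liblinear with the penalty "removed" via an enormous C and a very tight tol
model3 = LogisticRegression(fit_intercept=True, solver='liblinear', C=1e12, tol=1e-12, max_iter=100000)
mdl3 = model3.fit(X, y)
print(mdl3.intercept_)
print(mdl3.coef_)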