<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Understanding the difference between Logistic regression results between SAS and Python (sklearn) in Statistical Procedures</title>
    <link>https://communities.sas.com/t5/Statistical-Procedures/Understanding-the-difference-between-Logistic-regression-results/m-p/735971#M35734</link>
    <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/379215"&gt;@hackr&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Any thoughts on the source of the error and how to make Python match SAS?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;I really wouldn't bother. Clearly the underlying algorithms are different. This isn't like ordinary least squares regression, where a closed form solution exists and so the only difference between software should be round-off error.&lt;/P&gt;</description>
    <pubDate>Wed, 21 Apr 2021 13:56:50 GMT</pubDate>
    <dc:creator>PaigeMiller</dc:creator>
    <dc:date>2021-04-21T13:56:50Z</dc:date>
    <item>
      <title>Understanding the difference between Logistic regression results between SAS and Python (sklearn)</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Understanding-the-difference-between-Logistic-regression-results/m-p/735963#M35733</link>
      <description>&lt;P&gt;I need to understand and, ideally, resolve the small discrepancy between (unpenalized) Logistic regression results in SAS and Python. As far as I can tell, it does not seem directly attributable to parameter arguments that can be easily changed in either implementation. I'm also curious about the same difference with R, but that's less important, and solving the Python case will probably indirectly answer the same question for R.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I began with a Python-centric approach on StackOverflow, using a semi-famous SAS example from the UCLA Stats department. No one there was able to give an answer, so a couple of days ago I opened a bounty on the question and today I thought I'd ping you SAS gurus.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Here's the example on StackOverflow:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;A href="https://stackoverflow.com/questions/67128818/source-of-small-deviance-between-sas-proc-logistic-and-python-sklearn-logisticre" target="_blank" rel="noopener"&gt;https://stackoverflow.com/questions/67128818/source-of-small-deviance-between-sas-proc-logistic-and-python-sklearn-logisticre&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;&lt;EM&gt;Here's a transcription in case you don't want to click the link:&lt;/EM&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I'm trying to match the results of SAS's&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;PROC LOGISTIC&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;with&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;sklearn&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;in Python 3. 
SAS uses unpenalized regression, which I can achieve in&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;sklearn.linear_model.LogisticRegression&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;with the option&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;C = 1e9&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;or&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;penalty='none'&lt;/CODE&gt;.&lt;/P&gt;&lt;P&gt;That should be the end of the story, but I still notice a small difference when I use a public&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;A href="https://stats.idre.ucla.edu/wp-content/uploads/2016/02/hsb2-4.sas7bdat" target="_blank" rel="nofollow noopener noreferrer"&gt;data set from UCLA&lt;/A&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;and try to replicate&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;A href="https://stats.idre.ucla.edu/unlinked/sas-logistic/proc-logistic-and-logistic-regression-models/" target="_blank" rel="nofollow noopener noreferrer"&gt;their multiple regression&lt;/A&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;of&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;FEMALE&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;and&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;MATH&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;on&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;hiwrite&lt;/CODE&gt;.&lt;/P&gt;&lt;P&gt;This is my Python script:&lt;/P&gt;&lt;PRE class="lang-py s-code-block hljs python"&gt;&lt;CODE&gt;&lt;SPAN class="hljs-comment"&gt;# module imports&lt;/SPAN&gt;
&lt;SPAN class="hljs-keyword"&gt;import&lt;/SPAN&gt; pandas &lt;SPAN class="hljs-keyword"&gt;as&lt;/SPAN&gt; pd
&lt;SPAN class="hljs-keyword"&gt;from&lt;/SPAN&gt; sklearn.linear_model &lt;SPAN class="hljs-keyword"&gt;import&lt;/SPAN&gt; LogisticRegression

&lt;SPAN class="hljs-comment"&gt;# read in the data&lt;/SPAN&gt;
df = pd.read_sas(&lt;SPAN class="hljs-string"&gt;"~/Downloads/hsb2-4.sas7bdat"&lt;/SPAN&gt;)

&lt;SPAN class="hljs-comment"&gt;# FE&lt;/SPAN&gt;
df[&lt;SPAN class="hljs-string"&gt;"hiwrite"&lt;/SPAN&gt;] = df[&lt;SPAN class="hljs-string"&gt;"write"&lt;/SPAN&gt;] &amp;gt;= &lt;SPAN class="hljs-number"&gt;52&lt;/SPAN&gt;

&lt;SPAN class="hljs-built_in"&gt;print&lt;/SPAN&gt;(&lt;SPAN class="hljs-string"&gt;"\n\n"&lt;/SPAN&gt;)
&lt;SPAN class="hljs-built_in"&gt;print&lt;/SPAN&gt;(&lt;SPAN class="hljs-string"&gt;"Multiple Regression of Female and Math on hiwrite:"&lt;/SPAN&gt;)
feature_cols = [&lt;SPAN class="hljs-string"&gt;'female'&lt;/SPAN&gt;,&lt;SPAN class="hljs-string"&gt;'math'&lt;/SPAN&gt;]

y=df[&lt;SPAN class="hljs-string"&gt;"hiwrite"&lt;/SPAN&gt;]
X=df[feature_cols]

&lt;SPAN class="hljs-comment"&gt;# sklearn output&lt;/SPAN&gt;
model = LogisticRegression(fit_intercept = &lt;SPAN class="hljs-literal"&gt;True&lt;/SPAN&gt;, C = &lt;SPAN class="hljs-number"&gt;1e9&lt;/SPAN&gt;)
mdl = model.fit(X, y)
&lt;SPAN class="hljs-built_in"&gt;print&lt;/SPAN&gt;(mdl.intercept_)
&lt;SPAN class="hljs-built_in"&gt;print&lt;/SPAN&gt;(mdl.coef_)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;which yields:&lt;/P&gt;&lt;BLOCKQUOTE&gt;&lt;PRE class="lang-py s-code-block hljs python"&gt;&lt;CODE&gt;Multiple Regression of Female &lt;SPAN class="hljs-keyword"&gt;and&lt;/SPAN&gt; Math on hiwrite:
[-&lt;SPAN class="hljs-number"&gt;10.36619688&lt;/SPAN&gt;]
[[&lt;SPAN class="hljs-number"&gt;1.63062846&lt;/SPAN&gt; &lt;SPAN class="hljs-number"&gt;0.1978864&lt;/SPAN&gt; ]]&lt;/CODE&gt;&lt;/PRE&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;UCLA has this result from SAS:&lt;/P&gt;&lt;BLOCKQUOTE&gt;&lt;PRE class="lang-py s-code-block hljs python"&gt;&lt;CODE&gt;             Analysis of Maximum Likelihood Estimates
                               Standard          Wald
Parameter    DF    Estimate       Error    Chi-Square    Pr &amp;gt; ChiSq
Intercept     &lt;SPAN class="hljs-number"&gt;1&lt;/SPAN&gt;    -&lt;SPAN class="hljs-number"&gt;10.3651&lt;/SPAN&gt;      &lt;SPAN class="hljs-number"&gt;1.5535&lt;/SPAN&gt;       &lt;SPAN class="hljs-number"&gt;44.5153&lt;/SPAN&gt;        &amp;lt;&lt;SPAN class="hljs-number"&gt;.0001&lt;/SPAN&gt;
FEMALE        &lt;SPAN class="hljs-number"&gt;1&lt;/SPAN&gt;      &lt;SPAN class="hljs-number"&gt;1.6304&lt;/SPAN&gt;      &lt;SPAN class="hljs-number"&gt;0.4052&lt;/SPAN&gt;       &lt;SPAN class="hljs-number"&gt;16.1922&lt;/SPAN&gt;        &amp;lt;&lt;SPAN class="hljs-number"&gt;.0001&lt;/SPAN&gt;
MATH          &lt;SPAN class="hljs-number"&gt;1&lt;/SPAN&gt;      &lt;SPAN class="hljs-number"&gt;0.1979&lt;/SPAN&gt;      &lt;SPAN class="hljs-number"&gt;0.0293&lt;/SPAN&gt;       &lt;SPAN class="hljs-number"&gt;45.5559&lt;/SPAN&gt;        &amp;lt;&lt;SPAN class="hljs-number"&gt;.0001&lt;/SPAN&gt;
&lt;/CODE&gt;&lt;/PRE&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;which is close, but as you can see the intercept parameter estimate is different at the 3rd decimal place and the estimate on&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;female&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;is different at the 4th decimal place. I tried changing some of the other parameters (like&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;tol&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;and&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;max_iter&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;as well as the&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;solver&lt;/CODE&gt;) but it did not change the results. I also tried the&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;Logit&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;in&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;statsmodels.api&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;- it matches&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;sklearn&lt;/CODE&gt;, not SAS. R matches Python on the intercept and first coefficient, but is slightly different from both SAS and Python on the second coefficient...&lt;/P&gt;&lt;P&gt;&lt;EM&gt;Update:&lt;/EM&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;I have found that I can affect the estimates by playing with&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;tol&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;and also by changing the solver to&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;liblinear&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;while removing the penalty in a hack-ish way via the&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;C&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;parameter; however, no matter how extreme the values, it still doesn't quite match SAS.&lt;/P&gt;&lt;P&gt;Any thoughts on the source of the error and how to make Python match SAS?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 21 Apr 2021 13:40:37 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Understanding-the-difference-between-Logistic-regression-results/m-p/735963#M35733</guid>
      <dc:creator>hackr</dc:creator>
      <dc:date>2021-04-21T13:40:37Z</dc:date>
    </item>
    <item>
      <title>Re: Understanding the difference between Logistic regression results between SAS and Python (sklearn)</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Understanding-the-difference-between-Logistic-regression-results/m-p/735971#M35734</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/379215"&gt;@hackr&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Any thoughts on the source of the error and how to make Python match SAS?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;I really wouldn't bother. Clearly the underlying algorithms are different. This isn't like ordinary least squares regression, where a closed form solution exists and so the only difference between software should be round-off error.&lt;/P&gt;</description>
      <pubDate>Wed, 21 Apr 2021 13:56:50 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Understanding-the-difference-between-Logistic-regression-results/m-p/735971#M35734</guid>
      <dc:creator>PaigeMiller</dc:creator>
      <dc:date>2021-04-21T13:56:50Z</dc:date>
    </item>
    <item>
      <title>Re: Understanding the difference between Logistic regression results between SAS and Python (sklearn)</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Understanding-the-difference-between-Logistic-regression-results/m-p/735986#M35735</link>
      <description>&lt;P&gt;These small, practically insignificant differences always track back to differences in the iterative maximum likelihood algorithm, so you are on the right track. A valid fitting algorithm can take any of several approaches to deciding when it has reasonably converged on an optimal solution. Among the criteria used are the amount of change in the likelihood, in the gradients, or in the parameters themselves. In PROC LOGISTIC, you will see in the documentation of the MODEL statement that there are several options you can use to select and set the convergence criterion. The default option, GCONV=, uses the change in the gradients. Other choices are FCONV=, ABSFCONV=, and XCONV=. If you require that PROC LOGISTIC declare convergence based on the change in the log likelihood - that is, if you specify the MODEL statement options ABSFCONV=1e-8 GCONV=0 - then PROC LOGISTIC gives the same results as you show from Python. These two options together require that the log likelihood change by no more than 1E-8 and disallow convergence based on the gradients. Ultimately, though, this makes no practical difference in almost any case, as both sets of results are correct and quite close to each other.&lt;/P&gt;</description>
      <pubDate>Wed, 21 Apr 2021 14:25:51 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Understanding-the-difference-between-Logistic-regression-results/m-p/735986#M35735</guid>
      <dc:creator>StatDave</dc:creator>
      <dc:date>2021-04-21T14:25:51Z</dc:date>
    </item>
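The preceding answer's point about stopping rules can be sketched in plain Python: the same Newton-Raphson logistic fit, halted by two equally valid convergence criteria, ends at slightly different iterates. The data below are toy values (not the UCLA hsb2 set), and the two rules are only loose analogues of PROC LOGISTIC's GCONV= and ABSFCONV= options, not SAS's actual implementation.

```python
# Newton-Raphson for a one-predictor logistic model, with a pluggable
# stopping rule. Illustrative only: toy data, simplified criteria.
import math

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]   # not perfectly separable, so the MLE is finite

def loglik(b0, b1):
    ll = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
        ll += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return ll

def newton(stop):
    """Fit intercept b0 and slope b1; stop(grad_norm, ll_change) ends iteration."""
    b0 = b1 = 0.0
    ll_old = loglik(b0, b1)
    for _ in range(50):
        # Gradient and Hessian of the log likelihood at the current iterate.
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w; h01 += w * xi; h11 += w * xi * xi
        # Newton step: solve the 2x2 system H * delta = g by hand.
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
        ll_new = loglik(b0, b1)
        if stop(math.hypot(g0, g1), abs(ll_new - ll_old)):
            break
        ll_old = ll_new
    return b0, b1

grad_fit = newton(lambda g, d: g < 1e-4)   # gradient-based rule (GCONV-like)
ll_fit = newton(lambda g, d: d < 1e-2)     # likelihood-change rule, loose on purpose
print("gradient rule:  ", grad_fit)
print("likelihood rule:", ll_fit)
```

Both runs land near the same maximum, but the printed estimates differ in the trailing decimals, which is exactly the kind of third- or fourth-decimal discrepancy seen between SAS and sklearn in this thread.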
    <item>
      <title>Re: Understanding the difference between Logistic regression results between SAS and Python (sklearn)</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Understanding-the-difference-between-Logistic-regression-results/m-p/736047#M35740</link>
      <description>&lt;P&gt;Thank you StatDave, this is very helpful.&lt;BR /&gt;&lt;BR /&gt;If you saw my other reply a moment ago, please disregard. I was trying to respond to someone else and I hadn't seen your answer yet.&lt;/P&gt;</description>
      <pubDate>Wed, 21 Apr 2021 16:00:38 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Understanding-the-difference-between-Logistic-regression-results/m-p/736047#M35740</guid>
      <dc:creator>hackr</dc:creator>
      <dc:date>2021-04-21T16:00:38Z</dc:date>
    </item>
    <item>
      <title>Re: Understanding the difference between Logistic regression results between SAS and Python (sklearn)</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Understanding-the-difference-between-Logistic-regression-results/m-p/736380#M35755</link>
      <description>&lt;P&gt;Also check whether the estimation algorithm (Optimization Technique) differs:&lt;/P&gt;
&lt;P&gt;Newton-Raphson with Ridging&lt;/P&gt;
&lt;P&gt;vs.&lt;/P&gt;
&lt;P&gt;Fisher's scoring&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Compare the following code:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;proc logistic data=sashelp.class;&lt;BR /&gt;model sex=weight height;&lt;BR /&gt;run;&lt;/P&gt;
&lt;P&gt;proc hplogistic data=sashelp.class;&lt;BR /&gt;model sex=weight height;&lt;BR /&gt;run;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 22 Apr 2021 12:20:10 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Understanding-the-difference-between-Logistic-regression-results/m-p/736380#M35755</guid>
      <dc:creator>Ksharp</dc:creator>
      <dc:date>2021-04-22T12:20:10Z</dc:date>
    </item>
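On the optimizer comparison in the reply above, one detail is worth noting: for binary logistic regression with the canonical logit link, a pure Newton-Raphson step and a Fisher-scoring step coincide, because the observed Hessian equals the expected (Fisher) information X'WX; any remaining gap between procedures would come from extras such as ridging or the convergence criteria. A quick numerical check of that identity, on hypothetical toy values rather than sashelp.class:

```python
# Verify numerically that, at an arbitrary parameter point, the observed
# information (negative Hessian of the log likelihood) matches the expected
# Fisher information X'WX for a logit model. Toy data, one predictor.
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [0, 0, 1, 0, 1]
b0, b1 = -1.0, 0.4          # arbitrary evaluation point

def grad(b0, b1):
    """Gradient of the log likelihood: sum of (y - p) times (1, x)."""
    g0 = g1 = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
        g0 += yi - p
        g1 += (yi - p) * xi
    return g0, g1

# Expected (Fisher) information: X'WX with W = diag(p * (1 - p)).
i00 = i01 = i11 = 0.0
for xi in x:
    p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
    w = p * (1.0 - p)
    i00 += w; i01 += w * xi; i11 += w * xi * xi

# Observed information: negative Hessian via central differences of the gradient.
eps = 1e-6
h00 = -(grad(b0 + eps, b1)[0] - grad(b0 - eps, b1)[0]) / (2 * eps)
h01 = -(grad(b0, b1 + eps)[0] - grad(b0, b1 - eps)[0]) / (2 * eps)
h11 = -(grad(b0, b1 + eps)[1] - grad(b0, b1 - eps)[1]) / (2 * eps)

print("expected information:", (i00, i01, i11))
print("observed information:", (h00, h01, h11))   # agrees up to finite-difference error
```

Since the two information matrices agree for the logit link, the two techniques take the same steps in exact arithmetic; the "Ridging" part of HPLOGISTIC's default, and different stopping rules, are the more plausible sources of small output differences.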
  </channel>
</rss>

