I need to understand and, ideally, resolve a small discrepancy between (unpenalized) logistic regression results in SAS and Python. As far as I can tell, it is not directly attributable to parameter arguments that can be easily changed in either implementation. I'm also curious about the same difference with R, but that's less important, and solving the Python case will probably indirectly answer the R question as well.

I began with a Python-centric approach on StackOverflow, using a semi-famous SAS example from the UCLA Stats department. No one there was able to give an answer, so a couple of days ago I opened a bounty on the question, and today I thought I'd ping you SAS gurus. Here's the example on StackOverflow:

https://stackoverflow.com/questions/67128818/source-of-small-deviance-between-sas-proc-logistic-and-python-sklearn-logisticre

Here's a transcription in case you don't want to click the link:

I'm trying to match the results of SAS's PROC LOGISTIC with sklearn in Python 3. SAS uses unpenalized regression, which I can achieve in sklearn.linear_model.LogisticRegression with the option C=1e9 or penalty='none'. That should be the end of the story, but I still notice a small difference when I use a public data set from UCLA and try to replicate their multiple regression of FEMALE and MATH on hiwrite. This is my Python script:

# module imports
import pandas as pd
from sklearn.linear_model import LogisticRegression
# read in the data
df = pd.read_sas("~/Downloads/hsb2-4.sas7bdat")
# feature engineering: binary outcome, 1 when write >= 52
df["hiwrite"] = df["write"] >= 52
print("\n\n")
print("Multiple Regression of Female and Math on hiwrite:")
feature_cols = ['female','math']
y=df["hiwrite"]
X=df[feature_cols]
# sklearn fit: a huge C makes the default L2 penalty effectively zero
model = LogisticRegression(fit_intercept=True, C=1e9)
mdl = model.fit(X, y)
print(mdl.intercept_)
print(mdl.coef_)

which yields:

Multiple Regression of Female and Math on hiwrite:
[-10.36619688]
[[1.63062846 0.1978864 ]]
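Side note: the penalty='none' variant I mentioned above looks like the sketch below; as far as I can tell it should converge to the same unpenalized fit as the huge-C version (it reuses X and y from the script above; lbfgs is the default solver, and max_iter is raised just to be safe):

# same fit, but with regularization switched off explicitly rather than via a huge C
model2 = LogisticRegression(fit_intercept=True, penalty='none', solver='lbfgs', max_iter=1000)
mdl2 = model2.fit(X, y)
print(mdl2.intercept_)
print(mdl2.coef_)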
UCLA has this result from SAS:

Analysis of Maximum Likelihood Estimates

                            Standard        Wald
Parameter   DF   Estimate      Error    Chi-Square   Pr > ChiSq
Intercept    1   -10.3651     1.5535       44.5153       <.0001
FEMALE       1     1.6304     0.4052       16.1922       <.0001
MATH         1     0.1979     0.0293       45.5559       <.0001
This is close, but as you can see the intercept estimate differs at the 3rd decimal place and the FEMALE estimate differs at the 4th decimal place. I tried changing some of the other parameters (like tol and max_iter, as well as the solver), but it did not change the results. I also tried Logit in statsmodels.api; it matches sklearn, not SAS. R matches Python on the intercept and the first coefficient, but is slightly different from both SAS and Python on the second coefficient...

Update: I have found that I can affect the estimates by playing with tol, and also by changing the solver to liblinear while removing the penalty in a hack-ish way via the C parameter; however, no matter how extreme the values, it still doesn't quite match SAS.

Any thoughts on the source of the error and how to make Python match SAS?
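For reference, here is roughly what the statsmodels cross-check and the liblinear/tol experiment from the update look like (a sketch only, reusing X and y from the script above; the tol and C values are arbitrary extremes, not anything principled):

import statsmodels.api as sm

# unpenalized MLE via statsmodels (Newton's method by default);
# this reproduces the sklearn estimates, not the SAS ones
sm_fit = sm.Logit(y.astype(float), sm.add_constant(X)).fit()
print(sm_fit.params)

# liblinear with the penalty "removed" via an enormous C and a very tight tol
model3 = LogisticRegression(fit_intercept=True, solver='liblinear', C=1e12, tol=1e-12, max_iter=100000)
mdl3 = model3.fit(X, y)
print(mdl3.intercept_)
print(mdl3.coef_)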