<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Understanding the difference between Logistic regression results between SAS and Python (sklearn) in Statistical Procedures</title>
    <link>https://communities.sas.com/t5/Statistical-Procedures/Understanding-the-difference-between-Logistic-regression-results/m-p/735971#M35734</link>
    <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/379215"&gt;@hackr&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Any thoughts on the source of the error and how to make Python match SAS?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;I really wouldn't bother. Clearly the underlying algorithms are different. This isn't like ordinary least squares regression, where a closed form solution exists and so the only difference between software should be round-off error.&lt;/P&gt;</description>
    <pubDate>Wed, 21 Apr 2021 13:56:50 GMT</pubDate>
    <dc:creator>PaigeMiller</dc:creator>
    <dc:date>2021-04-21T13:56:50Z</dc:date>
    <item>
      <title>Understanding the difference between Logistic regression results between SAS and Python (sklearn)</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Understanding-the-difference-between-Logistic-regression-results/m-p/735963#M35733</link>
      <description>&lt;P&gt;I need to understand and, ideally, resolve the small discrepancy between (unpenalized) Logistic regression results in SAS and Python. As far as I can tell, it does not seem directly attributable to parameter arguments that can be easily changed in either implementation. I'm also curious about the same difference with R, but that's less important, and solving the Python case will probably indirectly answer the same question for R.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I began with a Python-centric approach on StackOverflow, using a semi-famous SAS example from the UCLA Stats department. No one there was able to give an answer, so a couple of days ago I opened a bounty on the question and today I thought I'd ping you SAS gurus.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Here's the example on StackOverflow:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;A href="https://stackoverflow.com/questions/67128818/source-of-small-deviance-between-sas-proc-logistic-and-python-sklearn-logisticre" target="_blank" rel="noopener"&gt;https://stackoverflow.com/questions/67128818/source-of-small-deviance-between-sas-proc-logistic-and-python-sklearn-logisticre&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;&lt;EM&gt;Here's a transcription in case you don't want to click the link:&lt;/EM&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I'm trying to match the results of SAS's&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;PROC LOGISTIC&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;with&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;sklearn&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;in Python 3. 
SAS uses unpenalized regression, which I can achieve in&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;sklearn.linear_model.LogisticRegression&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;with the option&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;C = 1e9&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;or&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;penalty='none'&lt;/CODE&gt;.&lt;/P&gt;&lt;P&gt;That should be the end of the story, but I still notice a small difference when I use a public&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;A href="https://stats.idre.ucla.edu/wp-content/uploads/2016/02/hsb2-4.sas7bdat" target="_blank" rel="nofollow noopener noreferrer"&gt;data set from UCLA&lt;/A&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;and try to replicate&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;A href="https://stats.idre.ucla.edu/unlinked/sas-logistic/proc-logistic-and-logistic-regression-models/" target="_blank" rel="nofollow noopener noreferrer"&gt;their multiple regression&lt;/A&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;of&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;FEMALE&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;and&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;MATH&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;on&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;hiwrite&lt;/CODE&gt;.&lt;/P&gt;&lt;P&gt;This is my Python script:&lt;/P&gt;&lt;PRE class="lang-py s-code-block hljs python"&gt;&lt;CODE&gt;&lt;SPAN class="hljs-comment"&gt;# module imports&lt;/SPAN&gt;
&lt;SPAN class="hljs-keyword"&gt;import&lt;/SPAN&gt; pandas &lt;SPAN class="hljs-keyword"&gt;as&lt;/SPAN&gt; pd
&lt;SPAN class="hljs-keyword"&gt;from&lt;/SPAN&gt; sklearn.linear_model &lt;SPAN class="hljs-keyword"&gt;import&lt;/SPAN&gt; LogisticRegression

&lt;SPAN class="hljs-comment"&gt;# read in the data&lt;/SPAN&gt;
df = pd.read_sas(&lt;SPAN class="hljs-string"&gt;"~/Downloads/hsb2-4.sas7bdat"&lt;/SPAN&gt;)

&lt;SPAN class="hljs-comment"&gt;# FE&lt;/SPAN&gt;
df[&lt;SPAN class="hljs-string"&gt;"hiwrite"&lt;/SPAN&gt;] = df[&lt;SPAN class="hljs-string"&gt;"write"&lt;/SPAN&gt;] &amp;gt;= &lt;SPAN class="hljs-number"&gt;52&lt;/SPAN&gt;

&lt;SPAN class="hljs-built_in"&gt;print&lt;/SPAN&gt;(&lt;SPAN class="hljs-string"&gt;"\n\n"&lt;/SPAN&gt;)
&lt;SPAN class="hljs-built_in"&gt;print&lt;/SPAN&gt;(&lt;SPAN class="hljs-string"&gt;"Multiple Regression of Female and Math on hiwrite:"&lt;/SPAN&gt;)
feature_cols = [&lt;SPAN class="hljs-string"&gt;'female'&lt;/SPAN&gt;,&lt;SPAN class="hljs-string"&gt;'math'&lt;/SPAN&gt;]

y=df[&lt;SPAN class="hljs-string"&gt;"hiwrite"&lt;/SPAN&gt;]
X=df[feature_cols]

&lt;SPAN class="hljs-comment"&gt;# sklearn output&lt;/SPAN&gt;
model = LogisticRegression(fit_intercept = &lt;SPAN class="hljs-literal"&gt;True&lt;/SPAN&gt;, C = &lt;SPAN class="hljs-number"&gt;1e9&lt;/SPAN&gt;)
mdl = model.fit(X, y)
&lt;SPAN class="hljs-built_in"&gt;print&lt;/SPAN&gt;(mdl.intercept_)
&lt;SPAN class="hljs-built_in"&gt;print&lt;/SPAN&gt;(mdl.coef_)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;which yields:&lt;/P&gt;&lt;BLOCKQUOTE&gt;&lt;PRE class="lang-py s-code-block hljs python"&gt;&lt;CODE&gt;Multiple Regression of Female &lt;SPAN class="hljs-keyword"&gt;and&lt;/SPAN&gt; Math on hiwrite:
[-&lt;SPAN class="hljs-number"&gt;10.36619688&lt;/SPAN&gt;]
[[&lt;SPAN class="hljs-number"&gt;1.63062846&lt;/SPAN&gt; &lt;SPAN class="hljs-number"&gt;0.1978864&lt;/SPAN&gt; ]]&lt;/CODE&gt;&lt;/PRE&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;UCLA has this result from SAS:&lt;/P&gt;&lt;BLOCKQUOTE&gt;&lt;PRE class="lang-py s-code-block hljs python"&gt;&lt;CODE&gt;             Analysis of Maximum Likelihood Estimates
                               Standard          Wald
Parameter    DF    Estimate       Error    Chi-Square    Pr &amp;gt; ChiSq
Intercept     &lt;SPAN class="hljs-number"&gt;1&lt;/SPAN&gt;    -&lt;SPAN class="hljs-number"&gt;10.3651&lt;/SPAN&gt;      &lt;SPAN class="hljs-number"&gt;1.5535&lt;/SPAN&gt;       &lt;SPAN class="hljs-number"&gt;44.5153&lt;/SPAN&gt;        &amp;lt;&lt;SPAN class="hljs-number"&gt;.0001&lt;/SPAN&gt;
FEMALE        &lt;SPAN class="hljs-number"&gt;1&lt;/SPAN&gt;      &lt;SPAN class="hljs-number"&gt;1.6304&lt;/SPAN&gt;      &lt;SPAN class="hljs-number"&gt;0.4052&lt;/SPAN&gt;       &lt;SPAN class="hljs-number"&gt;16.1922&lt;/SPAN&gt;        &amp;lt;&lt;SPAN class="hljs-number"&gt;.0001&lt;/SPAN&gt;
MATH          &lt;SPAN class="hljs-number"&gt;1&lt;/SPAN&gt;      &lt;SPAN class="hljs-number"&gt;0.1979&lt;/SPAN&gt;      &lt;SPAN class="hljs-number"&gt;0.0293&lt;/SPAN&gt;       &lt;SPAN class="hljs-number"&gt;45.5559&lt;/SPAN&gt;        &amp;lt;&lt;SPAN class="hljs-number"&gt;.0001&lt;/SPAN&gt;
&lt;/CODE&gt;&lt;/PRE&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;which is close, but as you can see the intercept parameter estimate is different at the 3rd decimal place and the estimate on&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;female&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;is different at the 4th decimal place. I tried changing some of the other parameters (like&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;tol&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;and&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;max_iter&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;as well as the&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;solver&lt;/CODE&gt;) but it did not change the results. I also tried the&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;Logit&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;in&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;statsmodels.api&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;- it matches&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;sklearn&lt;/CODE&gt;, not SAS. R matches Python on the intercept and first coefficient, but is slightly different from both SAS and Python on the second coefficient...&lt;/P&gt;&lt;P&gt;&lt;EM&gt;Update:&lt;/EM&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;I have found that I can affect the estimates by playing with&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;tol&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;and also by changing the solver to&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;liblinear&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;while removing the penalty in a hack-ish way via the&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;C&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;parameter; however, no matter how extreme the values, it still doesn't quite match SAS.&lt;/P&gt;&lt;P&gt;Any thoughts on the source of the error and how to make Python match SAS?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 21 Apr 2021 13:40:37 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Understanding-the-difference-between-Logistic-regression-results/m-p/735963#M35733</guid>
      <dc:creator>hackr</dc:creator>
      <dc:date>2021-04-21T13:40:37Z</dc:date>
    </item>
    <item>
      <title>Re: Understanding the difference between Logistic regression results between SAS and Python (sklearn)</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Understanding-the-difference-between-Logistic-regression-results/m-p/735971#M35734</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/379215"&gt;@hackr&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Any thoughts on the source of the error and how to make Python match SAS?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;I really wouldn't bother. Clearly the underlying algorithms are different. This isn't like ordinary least squares regression, where a closed form solution exists and so the only difference between software should be round-off error.&lt;/P&gt;</description>
      <pubDate>Wed, 21 Apr 2021 13:56:50 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Understanding-the-difference-between-Logistic-regression-results/m-p/735971#M35734</guid>
      <dc:creator>PaigeMiller</dc:creator>
      <dc:date>2021-04-21T13:56:50Z</dc:date>
    </item>
    <item>
      <title>Re: Understanding the difference between Logistic regression results between SAS and Python (sklearn)</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Understanding-the-difference-between-Logistic-regression-results/m-p/735986#M35735</link>
      <description>&lt;P&gt;These small, practically insignificant differences always track back to differences in the iterative maximum likelihood algorithm, so you are on the right track. A valid fitting algorithm can take any of several approaches to deciding when it has reasonably converged on an optimal solution. Among the criteria used are the amount of change in the likelihood, in the gradients, or in the parameters themselves. In PROC LOGISTIC, you will see in the documentation of the MODEL statement that there are several options you can use to select and set the convergence criterion. The default option, GCONV=, uses the change in the gradients. Other choices are FCONV=, ABSFCONV=, and XCONV=. If you require that PROC LOGISTIC declare convergence based on the change in the log likelihood - that is, if you specify the MODEL statement options ABSFCONV=1e-8 GCONV=0 - then PROC LOGISTIC gives the same results as you show from Python. These two options together require that the log likelihood change by no more than 1E-8 and disallow convergence based on the gradients. Ultimately, though, this makes no practical difference in almost any case, as both sets of results are correct and quite close to each other.&lt;/P&gt;</description>
      <pubDate>Wed, 21 Apr 2021 14:25:51 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Understanding-the-difference-between-Logistic-regression-results/m-p/735986#M35735</guid>
      <dc:creator>StatDave</dc:creator>
      <dc:date>2021-04-21T14:25:51Z</dc:date>
    </item>
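The preceding answer's point about stopping rules can be sketched in plain Python: the same Newton-Raphson logistic fit, halted by two equally valid convergence criteria, ends at slightly different iterates. The data below are toy values (not the UCLA hsb2 set), and the two rules are only loose analogues of PROC LOGISTIC's GCONV= and ABSFCONV= options, not SAS's actual implementation.

```python
# Newton-Raphson for a one-predictor logistic model, with a pluggable
# stopping rule. Illustrative only: toy data, simplified criteria.
import math

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]   # not perfectly separable, so the MLE is finite

def loglik(b0, b1):
    ll = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
        ll += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return ll

def newton(stop):
    """Fit intercept b0 and slope b1; stop(grad_norm, ll_change) ends iteration."""
    b0 = b1 = 0.0
    ll_old = loglik(b0, b1)
    for _ in range(50):
        # Gradient and Hessian of the log likelihood at the current iterate.
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w; h01 += w * xi; h11 += w * xi * xi
        # Newton step: solve the 2x2 system H * delta = g by hand.
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
        ll_new = loglik(b0, b1)
        if stop(math.hypot(g0, g1), abs(ll_new - ll_old)):
            break
        ll_old = ll_new
    return b0, b1

grad_fit = newton(lambda g, d: g < 1e-4)   # gradient-based rule (GCONV-like)
ll_fit = newton(lambda g, d: d < 1e-2)     # likelihood-change rule, loose on purpose
print("gradient rule:  ", grad_fit)
print("likelihood rule:", ll_fit)
```

Both runs land near the same maximum, but the printed estimates differ in the trailing decimals, which is exactly the kind of third- or fourth-decimal discrepancy seen between SAS and sklearn in this thread.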
    <item>
      <title>Re: Understanding the difference between Logistic regression results between SAS and Python (sklearn)</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Understanding-the-difference-between-Logistic-regression-results/m-p/736047#M35740</link>
      <description>&lt;P&gt;Thank you StatDave, this is very helpful.&lt;BR /&gt;&lt;BR /&gt;If you saw my other reply a moment ago, please disregard. I was trying to respond to someone else and I hadn't seen your answer yet.&lt;/P&gt;</description>
      <pubDate>Wed, 21 Apr 2021 16:00:38 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Understanding-the-difference-between-Logistic-regression-results/m-p/736047#M35740</guid>
      <dc:creator>hackr</dc:creator>
      <dc:date>2021-04-21T16:00:38Z</dc:date>
    </item>
    <item>
      <title>Re: Understanding the difference between Logistic regression results between SAS and Python (sklearn)</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Understanding-the-difference-between-Logistic-regression-results/m-p/736380#M35755</link>
      <description>&lt;P&gt;Also check whether the estimation algorithm (Optimization Technique) differs:&lt;/P&gt;
&lt;P&gt;Newton-Raphson with Ridging&lt;/P&gt;
&lt;P&gt;vs.&lt;/P&gt;
&lt;P&gt;Fisher's scoring&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Compare the following code:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;proc logistic data=sashelp.class;&lt;BR /&gt;model sex=weight height;&lt;BR /&gt;run;&lt;/P&gt;
&lt;P&gt;proc hplogistic data=sashelp.class;&lt;BR /&gt;model sex=weight height;&lt;BR /&gt;run;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 22 Apr 2021 12:20:10 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Understanding-the-difference-between-Logistic-regression-results/m-p/736380#M35755</guid>
      <dc:creator>Ksharp</dc:creator>
      <dc:date>2021-04-22T12:20:10Z</dc:date>
    </item>
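On the optimizer comparison in the reply above, one detail is worth noting: for binary logistic regression with the canonical logit link, a pure Newton-Raphson step and a Fisher-scoring step coincide, because the observed Hessian equals the expected (Fisher) information X'WX; any remaining gap between procedures would come from extras such as ridging or the convergence criteria. A quick numerical check of that identity, on hypothetical toy values rather than sashelp.class:

```python
# Verify numerically that, at an arbitrary parameter point, the observed
# information (negative Hessian of the log likelihood) matches the expected
# Fisher information X'WX for a logit model. Toy data, one predictor.
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [0, 0, 1, 0, 1]
b0, b1 = -1.0, 0.4          # arbitrary evaluation point

def grad(b0, b1):
    """Gradient of the log likelihood: sum of (y - p) times (1, x)."""
    g0 = g1 = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
        g0 += yi - p
        g1 += (yi - p) * xi
    return g0, g1

# Expected (Fisher) information: X'WX with W = diag(p * (1 - p)).
i00 = i01 = i11 = 0.0
for xi in x:
    p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
    w = p * (1.0 - p)
    i00 += w; i01 += w * xi; i11 += w * xi * xi

# Observed information: negative Hessian via central differences of the gradient.
eps = 1e-6
h00 = -(grad(b0 + eps, b1)[0] - grad(b0 - eps, b1)[0]) / (2 * eps)
h01 = -(grad(b0, b1 + eps)[0] - grad(b0, b1 - eps)[0]) / (2 * eps)
h11 = -(grad(b0, b1 + eps)[1] - grad(b0, b1 - eps)[1]) / (2 * eps)

print("expected information:", (i00, i01, i11))
print("observed information:", (h00, h01, h11))   # agrees up to finite-difference error
```

Since the two information matrices agree for the logit link, the two techniques take the same steps in exact arithmetic; the "Ridging" part of HPLOGISTIC's default, and different stopping rules, are the more plausible sources of small output differences.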
  </channel>
</rss>

