Programming the statistical procedures from SAS

Apparent bug in Proc Logistic scoring

Respected Advisor
Posts: 4,756


There seems to be a bug in the treatment of the WHERE= data set option in the SCORE statement of PROC LOGISTIC (SAS 9.3, Windows 32). Consider this simple test:

/* A dataset with selected=0 and selected=1 observations */

data test;
call streaminit(7657);
do x = 1 to 10;
  selected = 1;
  do i = 1 to 15;
    y = rand("NORMAL",5,2) < x;
    output;
  end;
  selected = 0;
  do i = 1 to 5;
    y = rand("BERNOULLI",0.5);
    output;
  end;
end;
run;

/* Estimate parameters with selected=1 observations and score with selected=0 observations */

proc logistic data=test(where=(selected));
model y(event="1") = x;
score data=test(where=(not selected)) fitstat;
run;

The output:

                           Fit Statistics for SCORE Data

                 Total           Log       Error
Data Set     Frequency    Likelihood        Rate         AIC        AICC         BIC

WORK.TEST          150      -61.3126      0.1867    126.6252    126.7069    132.6465

                                     Max-Rescaled                   Brier
Data Set           SC    R-Square        R-Square         AUC       Score

WORK.TEST    132.6465    0.430152        0.574768    0.893183    0.130008

Notice that the scoring was done on the original data (selected=1, n=150) and not on the requested observations (selected=0, n=50).
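For reference, the expected group sizes can be confirmed directly from the TEST data set built above. This is only a quick sanity check, not part of the original test; it should report 150 observations with selected=1 (10 values of x times 15) and 50 with selected=0 (10 values of x times 5), so a Total Frequency of 150 in the fit statistics cannot correspond to the selected=0 group.

```sas
/* Sanity check: count the observations in each group of TEST. */
/* selected=1 is the estimation group, selected=0 the group    */
/* that the SCORE statement was asked to score.                */
proc freq data=test;
  tables selected / nocum nopercent;
run;
```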

PG

Super User
Posts: 6,703

Re: Apparent bug in Proc Logistic scoring

Not sure what you are talking about. The NOTEs in the SAS log clearly show that the scored output data set has only 50 observations.

675  proc logistic data=test(where=(selected));
676  model y(event="1") = x;
677  score data=test(where=(not selected)) fitstat;
678  run;

NOTE: PROC LOGISTIC is modeling the probability that y=1.
NOTE: Convergence criterion (GCONV=1E-8) satisfied.
NOTE: There were 150 observations read from the data set WORK.TEST.
      WHERE selected;
NOTE: The data set WORK.DATA2 has 50 observations and 8 variables.
NOTE: PROCEDURE LOGISTIC used (Total process time):
      real time           0.10 seconds
      cpu time            0.09 seconds

Respected Advisor
Posts: 4,756

Re: Apparent bug in Proc Logistic scoring

Not on my machine:


181
182  proc logistic data=test(where=(selected));
183  model y(event="1") = x;
184  score data=test(where=(not selected)) fitstat;
185  run;

NOTE: PROC LOGISTIC is modeling the probability that y=1.
NOTE: Convergence criterion (GCONV=1E-8) satisfied.
NOTE: There were 150 observations read from the data set WORK.TEST.
      WHERE selected;
NOTE: The data set WORK.DATA2 has 150 observations and 8 variables.
NOTE: PROCEDURE LOGISTIC used (Total process time):
      real time           0.21 seconds
      cpu time            0.04 seconds

PG
Super User
Posts: 18,580

Re: Apparent bug in Proc Logistic scoring

Not on mine either. SAS 9.3 TS1M1

NOTE: PROC LOGISTIC is modeling the probability that y=1.
NOTE: Convergence criterion (GCONV=1E-8) satisfied.
NOTE: There were 150 observations read from the data set WORK.TEST.
      WHERE selected;
NOTE: The data set WORK.DATA2 has 150 observations and 8 variables.
NOTE: PROCEDURE LOGISTIC used (Total process time):
      real time           0.17 seconds
      cpu time            0.01 seconds

I believe it does work fine on SAS 9.2 TS2M3:

NOTE: PROC LOGISTIC is modeling the probability that y=1.
NOTE: Convergence criterion (GCONV=1E-8) satisfied.
NOTE: There were 150 observations read from the data set WORK.TEST.
      WHERE selected;
NOTE: The data set WORK.DATA1 has 50 observations and 8 variables.
NOTE: PROCEDURE LOGISTIC used (Total process time):
      real time           1.01 seconds
      cpu time            0.15 seconds

Super User
Posts: 6,703

Re: Apparent bug in Proc Logistic scoring

Try it with two different datasets.

/* Split TEST into separate estimation and scoring data sets */
data selected notselected;
  set test;
  if selected then output selected;
  else output notselected;
run;

Respected Advisor
Posts: 4,756

Re: Apparent bug in Proc Logistic scoring

It works fine with two different datasets. This, of course, gives me a workaround. That's why I suspect the problem is with the WHERE condition. - PG
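For anyone landing on this thread, here is a minimal sketch of the complete workaround, assuming the TEST data set from the first post. The data set names SELECTED and NOTSELECTED follow the suggestion above; nothing else is new. The point is simply that neither the DATA= option nor the SCORE statement carries a WHERE= data set option anymore.

```sas
/* Split TEST so that no WHERE= option is needed in PROC LOGISTIC */
data selected notselected;
  set test;
  if selected then output selected;
  else output notselected;
run;

/* Fit on the selected=1 rows, score the selected=0 rows */
proc logistic data=selected;
  model y(event="1") = x;
  score data=notselected fitstat;
run;
```

With this version, the Total Frequency reported in "Fit Statistics for SCORE Data" should be 50, the size of the NOTSELECTED data set.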

PG
Super User
Posts: 6,703

Re: Apparent bug in Proc Logistic scoring

You should report it to Tech Support to make sure they have it as a bug they are tracking.

Respected Advisor
Posts: 4,756

Re: Apparent bug in Proc Logistic scoring

When I met this in my work, I would have been totally fooled if my data had been split in half.   - PG

PG
Respected Advisor
Posts: 4,756

Re: Apparent bug in Proc Logistic scoring

Thank you Tom and Reeza, I just submitted a bug report, citing this thread.

PG

Respected Advisor
Posts: 4,756

Re: Apparent bug in Proc Logistic scoring

Here is SAS acknowledgement of the problem:

This does appear to be a defect in the software.  I have let development know and they should be fixing the problem shortly.  In the meantime you will need to use the work-around mentioned in the forum of creating the data sets prior to running logistic.  Thank you for letting us know.

Sincerely,  

Rob Agnelli  

Technical Support Statistician  

SAS    

PG
Contributor
Posts: 60

Re: Apparent bug in Proc Logistic scoring

Is "NOTE: Convergence criterion (GCONV=1E-8) satisfied." a good thing or a bad thing when running the logistic regression?

Respected Advisor
Posts: 4,756

Re: Apparent bug in Proc Logistic scoring

This question is not related to the subject of this discussion; it should have been submitted as a new discussion. But the answer is GOOD. Convergence means that the procedure found parameter values that maximize the log-likelihood, i.e. the best fit between your data and your model.
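If you would rather check convergence in a data set than read the log, one option is to capture the convergence status with ODS. This is a sketch that reuses the model from this thread; the output data set name CONV is arbitrary, and my understanding (worth confirming in the PROC LOGISTIC documentation) is that the ConvergenceStatus table holds a Status variable equal to 0 when the criterion was satisfied.

```sas
/* Capture the convergence status of the fit in a data set. */
/* CONV is an arbitrary name chosen for this example.       */
proc logistic data=test(where=(selected));
  model y(event="1") = x;
  ods output ConvergenceStatus=conv;
run;

/* Assumption: Status=0 indicates that the algorithm converged */
proc print data=conv;
run;
```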

PG
