PGStats
Opal | Level 21

There seems to be a bug in how the WHERE= data set option is handled in the SCORE statement of PROC LOGISTIC (SAS 9.3, Windows 32). Consider this simple test:

/* A dataset with selected=0 and selected=1 observations */

data test;
  call streaminit(7657);
  do x = 1 to 10;
    selected = 1;
    do i = 1 to 15;
      y = rand("NORMAL",5,2) < x;
      output;
    end;
    selected = 0;
    do i = 1 to 5;
      y = rand("BERNOULLI",0.5);
      output;
    end;
  end;
run;

/* Estimate parameters with selected=1 observations and score with selected=0 observations */

proc logistic data=test(where=(selected));
model y(event="1") = x;
score data=test(where=(not selected)) fitstat;
run;

The output:

                           Fit Statistics for SCORE Data

                 Total           Log       Error
Data Set     Frequency    Likelihood        Rate         AIC        AICC         BIC

WORK.TEST          150      -61.3126      0.1867    126.6252    126.7069    132.6465

                                     Max-Rescaled                   Brier
Data Set           SC    R-Square        R-Square         AUC       Score

WORK.TEST    132.6465    0.430152        0.574768    0.893183    0.130008

Notice that the scoring was done on the data used for estimation (selected=1, n=150) and not on the requested observations (selected=0, n=50).
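One way to confirm which observations were actually scored is to keep the scored data set and tabulate selected in it. This is only a sketch: the OUT= data set name scored is my own choice, and it assumes WORK.TEST was created by the data step above.

```sas
/* Keep the scored observations in a named data set (name is arbitrary) */
proc logistic data=test(where=(selected));
  model y(event="1") = x;
  score data=test(where=(not selected)) out=scored fitstat;
run;

/* If the WHERE= option on SCORE is honored, only selected=0 should appear */
proc freq data=scored;
  tables selected / nocum nopercent;
run;
```

If the table shows selected=1 rows, the scoring ran on the estimation data rather than on the requested subset.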

PG

Tom
Super User

Not sure what you are talking about.  The NOTEs in the SAS log clearly show that it only read in 50 observations from the data set test.

675  proc logistic data=test(where=(selected));
676  model y(event="1") = x;
677  score data=test(where=(not selected)) fitstat;
678  run;

NOTE: PROC LOGISTIC is modeling the probability that y=1.
NOTE: Convergence criterion (GCONV=1E-8) satisfied.
NOTE: There were 150 observations read from the data set WORK.TEST.
      WHERE selected;
NOTE: The data set WORK.DATA2 has 50 observations and 8 variables.
NOTE: PROCEDURE LOGISTIC used (Total process time):
      real time           0.10 seconds
      cpu time            0.09 seconds

PGStats
Opal | Level 21

Not on my machine:


181
182  proc logistic data=test(where=(selected));
183  model y(event="1") = x;
184  score data=test(where=(not selected)) fitstat;
185  run;

NOTE: PROC LOGISTIC is modeling the probability that y=1.
NOTE: Convergence criterion (GCONV=1E-8) satisfied.
NOTE: There were 150 observations read from the data set WORK.TEST.
      WHERE selected;
NOTE: The data set WORK.DATA2 has 150 observations and 8 variables.
NOTE: PROCEDURE LOGISTIC used (Total process time):
      real time           0.21 seconds
      cpu time            0.04 seconds

PG
Reeza
Super User

Not on mine either. SAS 9.3 TS1M1

NOTE: PROC LOGISTIC is modeling the probability that y=1.
NOTE: Convergence criterion (GCONV=1E-8) satisfied.
NOTE: There were 150 observations read from the data set WORK.TEST.
      WHERE selected;
NOTE: The data set WORK.DATA2 has 150 observations and 8 variables.
NOTE: PROCEDURE LOGISTIC used (Total process time):
      real time           0.17 seconds
      cpu time            0.01 seconds

I believe it does work fine on SAS 9.2 TS2M3:

NOTE: PROC LOGISTIC is modeling the probability that y=1.
NOTE: Convergence criterion (GCONV=1E-8) satisfied.
NOTE: There were 150 observations read from the data set WORK.TEST.
      WHERE selected;
NOTE: The data set WORK.DATA1 has 50 observations and 8 variables.
NOTE: PROCEDURE LOGISTIC used (Total process time):
      real time           1.01 seconds
      cpu time            0.15 seconds

Tom
Super User

Try it with two different datasets.

/* Split TEST into two separate data sets based on the selected flag */
data selected notselected;
  set test;
  if selected then output selected;
  else output notselected;
run;

PGStats
Opal | Level 21

It works fine with two different datasets. This, of course, gives me a workaround. That's why I suspect the problem is with the WHERE condition. - PG
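For anyone landing here, the workaround can be sketched as follows. It assumes WORK.TEST from the first post exists and reuses the data set names selected and notselected from Tom's suggestion.

```sas
/* Split the data first, so no WHERE= option is needed inside PROC LOGISTIC */
data selected notselected;
  set test;
  if selected then output selected;
  else output notselected;
run;

/* Fit on the selected observations, score the others */
proc logistic data=selected;
  model y(event="1") = x;
  score data=notselected fitstat;
run;
```

With this version the Total Frequency in the Fit Statistics for SCORE Data table should be 50 for the test data above.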

PG
Tom
Super User

You should report it to Tech Support to make sure they are tracking it as a bug.

PGStats
Opal | Level 21

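When I ran into this in my work, I would have been totally fooled if my data had been split in half, because the frequency reported for the scored data would then have matched the count I expected. - PG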

PG
PGStats
Opal | Level 21

Thank you Tom and Reeza, I just submitted a bug report, citing this thread.

PG

PGStats
Opal | Level 21

Here is SAS's acknowledgement of the problem:

This does appear to be a defect in the software.  I have let development know and they should be fixing the problem shortly.  In the meantime you will need to use the work-around mentioned in the forum of creating the data sets prior to running logistic.  Thank you for letting us know.

Sincerely,
Rob Agnelli
Technical Support Statistician
SAS

PG
spraynardz90
Calcite | Level 5

Is "NOTE: Convergence criterion (GCONV=1E-8) satisfied." a good thing or a bad thing when running a logistic regression?

PGStats
Opal | Level 21

This question is not related to the subject of this discussion and should have been submitted as a new one. But the answer is GOOD. Convergence means that the procedure found parameter values that maximize the log-likelihood, i.e. the best fit between your data and your model.
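If you would rather check convergence in a program than read the log, a sketch like the following may help. It assumes PROC LOGISTIC's ODS table ConvergenceStatus, whose Status column should be 0 when the criterion is satisfied. The data set name conv_status is arbitrary, and WORK.TEST is the test data from the first post.

```sas
/* Capture the convergence status table produced by the MODEL statement */
ods output ConvergenceStatus=conv_status;

proc logistic data=test(where=(selected));
  model y(event="1") = x;
run;

/* Status=0 is expected when the convergence criterion was satisfied */
proc print data=conv_status noobs;
run;
```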

PG
