06-06-2012 11:51 AM

There seems to be a bug in the treatment of the WHERE= data set option in the SCORE statement of PROC LOGISTIC (SAS 9.3, Windows 32). Consider this simple test:

/* A dataset with selected=0 and selected=1 observations */
data test;
    call streaminit(7657);
    do x = 1 to 10;
        selected = 1;
        do i = 1 to 15;
            y = rand("NORMAL",5,2) < x;
            output;
        end;
        selected = 0;
        do i = 1 to 5;
            y = rand("BERNOULLI",0.5);
            output;
        end;
    end;
run;

/* Estimate parameters with selected=1 observations and score with selected=0 observations */
proc logistic data=test(where=(selected));
    model y(event="1") = x;
    score data=test(where=(not selected)) fitstat;
run;

The output:

Fit Statistics for SCORE Data

               Total        Log    Error
Data Set   Frequency Likelihood     Rate       AIC      AICC       BIC
WORK.TEST        150   -61.3126   0.1867  126.6252  126.7069  132.6465

                                Max-Rescaled                 Brier
Data Set         SC   R-Square      R-Square       AUC       Score
WORK.TEST  132.6465   0.430152      0.574768  0.893183    0.130008

Notice that the scoring was done on the original data (selected=1, n=150) and not on the requested observations (selected=0, n=50).
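
One way to confirm which observations were actually scored is to save the scored data and tabulate it. This is only a sketch: it assumes the OUT= option of the SCORE statement, and the data set name `scored` is made up for illustration.

```sas
/* Sketch: save the scored observations and count them by SELECTED */
proc logistic data=test(where=(selected));
    model y(event="1") = x;
    /* 'scored' is a hypothetical name for the output data set */
    score data=test(where=(not selected)) out=scored fitstat;
run;

/* If the WHERE= option on SCORE is honored, only selected=0 rows should appear */
proc freq data=scored;
    tables selected;
run;
```

If the frequency table shows selected=1 rows, the WHERE= condition on the SCORE statement was not applied.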

PG

06-06-2012 12:09 PM

Not sure what you are talking about. The NOTEs in my SAS log show that the data set created by scoring has only 50 observations.

675  proc logistic data=test(where=(selected));
676  model y(event="1") = x;
677  score data=test(where=(not selected)) fitstat;
678  run;

NOTE: PROC LOGISTIC is modeling the probability that y=1.
NOTE: Convergence criterion (GCONV=1E-8) satisfied.
NOTE: There were 150 observations read from the data set WORK.TEST.
      WHERE selected;
NOTE: The data set WORK.DATA2 has 50 observations and 8 variables.
NOTE: PROCEDURE LOGISTIC used (Total process time):
      real time           0.10 seconds
      cpu time            0.09 seconds

06-06-2012 01:11 PM

Not on my machine:

181
182  proc logistic data=test(where=(selected));
183  model y(event="1") = x;
184  score data=test(where=(not selected)) fitstat;
185  run;

NOTE: PROC LOGISTIC is modeling the probability that y=1.
NOTE: Convergence criterion (GCONV=1E-8) satisfied.
NOTE: There were 150 observations read from the data set WORK.TEST.
      WHERE selected;
NOTE: The data set WORK.DATA2 has 150 observations and 8 variables.
NOTE: PROCEDURE LOGISTIC used (Total process time):
      real time           0.21 seconds
      cpu time            0.04 seconds

PG

06-06-2012 01:14 PM

Not on mine either. SAS 9.3 TS1M1

NOTE: PROC LOGISTIC is modeling the probability that y=1.
NOTE: Convergence criterion (GCONV=1E-8) satisfied.
NOTE: There were 150 observations read from the data set WORK.TEST.
      WHERE selected;
NOTE: The data set WORK.DATA2 has 150 observations and 8 variables.
NOTE: PROCEDURE LOGISTIC used (Total process time):
      real time           0.17 seconds
      cpu time            0.01 seconds

It does work fine on SAS 9.2 TS2M3 I believe.

NOTE: PROC LOGISTIC is modeling the probability that y=1.
NOTE: Convergence criterion (GCONV=1E-8) satisfied.
NOTE: There were 150 observations read from the data set WORK.TEST.
      WHERE selected;
NOTE: The data set WORK.DATA1 has 50 observations and 8 variables.
NOTE: PROCEDURE LOGISTIC used (Total process time):
      real time           1.01 seconds
      cpu time            0.15 seconds

06-06-2012 01:19 PM

Try it with two different datasets.

data selected notselected;
    set test;
    if selected then output selected;
    else output notselected;
run;

06-06-2012 01:22 PM

It works fine with two different datasets. This, of course, gives me a workaround. That's why I suspect the problem is with the WHERE condition. - PG
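
Put together, the workaround might look like the sketch below. It assumes the `test` data set from the first post and reuses the data set names `selected` and `notselected` suggested above.

```sas
/* Split TEST into two physically separate data sets */
data selected notselected;
    set test;
    if selected then output selected;
    else output notselected;
run;

/* Fit on the selected=1 rows and score the selected=0 rows,
   with no WHERE= option on either data set reference */
proc logistic data=selected;
    model y(event="1") = x;
    score data=notselected fitstat;
run;
```

Because neither data set reference carries a WHERE= option, the SCORE statement has nothing to mishandle.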

PG

06-06-2012 01:34 PM

You should report it to Tech Support to make sure they have it as a bug they are tracking.

06-06-2012 01:26 PM

When I met this in my work, I would have been totally fooled if my data had been split in half. - PG

PG

06-06-2012 01:53 PM

Thank you Tom and Reeza, I just submitted a bug report, citing this thread.

PG

06-07-2012 11:12 AM

Here is SAS acknowledgement of the problem:

**This does appear to be a defect in the software. I have let development know and they should be fixing the problem shortly. In the meantime you will need to use the work-around mentioned in the forum of creating the data sets prior to running logistic. Thank you for letting us know.**

**Sincerely,**

**Rob Agnelli**

**Technical Support Statistician**

**SAS**

PG

07-15-2012 06:33 AM

Is "NOTE: Convergence criterion (GCONV=1E-8) satisfied." a good thing or a bad thing when running the logistic regression?

07-15-2012 12:25 PM

This question is not related to the subject of the discussion; it should have been submitted as a new discussion. But the answer is GOOD. Convergence means that the iterative fitting algorithm met its stopping criterion, so the procedure found parameter values that maximize the log-likelihood, i.e. the best fit between your data and your model.
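
If you want to check convergence in a program rather than by reading the log, something like the sketch below may help. It assumes the `ConvergenceStatus` ODS table of PROC LOGISTIC, in which, as far as I know, a Status value of 0 means the criterion was satisfied; the data set name `conv` is made up for illustration.

```sas
/* Sketch: capture the convergence status as a data set.
   'conv' is a hypothetical name for the output data set. */
ods output ConvergenceStatus=conv;
proc logistic data=test(where=(selected));
    model y(event="1") = x;
run;

/* Inspect the captured status (assumed: Status=0 means converged) */
proc print data=conv;
run;
```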

PG