There seems to be a bug in the handling of the WHERE= data set option in the SCORE statement of PROC LOGISTIC (SAS 9.3, Windows 32-bit). Consider this simple test:
/* A dataset with selected=0 and selected=1 observations */
data test;
    call streaminit(7657);
    do x = 1 to 10;
        /* 15 observations per x used for estimation */
        selected = 1;
        do i = 1 to 15;
            y = rand("NORMAL",5,2) < x;
            output;
        end;
        /* 5 observations per x reserved for scoring */
        selected = 0;
        do i = 1 to 5;
            y = rand("BERNOULLI",0.5);
            output;
        end;
    end;
run;

/* Estimate parameters with selected=1 observations and score with selected=0 observations */
proc logistic data=test(where=(selected));
    model y(event="1") = x;
    score data=test(where=(not selected)) fitstat;
run;
The output:

Fit Statistics for SCORE Data

Data Set    Total Frequency  Log Likelihood  Error Rate  AIC       AICC      BIC
WORK.TEST   150              -61.3126        0.1867      126.6252  126.7069  132.6465

Data Set    SC        R-Square  Max-Rescaled R-Square  AUC       Brier Score
WORK.TEST   132.6465  0.430152  0.574768               0.893183  0.130008
Notice that the scoring was done on the estimation data (selected=1, n=150) and not on the requested observations (selected=0, n=50).
PG
Not sure what you are talking about. The NOTEs in the SAS log clearly show that it only read in 50 observations from the data set test.
675 proc logistic data=test(where=(selected));
676 model y(event="1") = x;
677 score data=test(where=(not selected)) fitstat;
678 run;
NOTE: PROC LOGISTIC is modeling the probability that y=1.
NOTE: Convergence criterion (GCONV=1E-8) satisfied.
NOTE: There were 150 observations read from the data set WORK.TEST.
WHERE selected;
NOTE: The data set WORK.DATA2 has 50 observations and 8 variables.
NOTE: PROCEDURE LOGISTIC used (Total process time):
real time 0.10 seconds
cpu time 0.09 seconds
Not on my machine:
181
182 proc logistic data=test(where=(selected));
183 model y(event="1") = x;
184 score data=test(where=(not selected)) fitstat;
185 run;
NOTE: PROC LOGISTIC is modeling the probability that y=1.
NOTE: Convergence criterion (GCONV=1E-8) satisfied.
NOTE: There were 150 observations read from the data set WORK.TEST.
WHERE selected;
NOTE: The data set WORK.DATA2 has 150 observations and 8 variables.
NOTE: PROCEDURE LOGISTIC used (Total process time):
real time 0.21 seconds
cpu time 0.04 seconds
Not on mine either. SAS 9.3 TS1M1
NOTE: PROC LOGISTIC is modeling the probability that y=1.
NOTE: Convergence criterion (GCONV=1E-8) satisfied.
NOTE: There were 150 observations read from the data set WORK.TEST.
WHERE selected;
NOTE: The data set WORK.DATA2 has 150 observations and 8 variables.
NOTE: PROCEDURE LOGISTIC used (Total process time):
real time 0.17 seconds
cpu time 0.01 seconds
It does work fine on SAS 9.2 TS2M3 I believe.
NOTE: PROC LOGISTIC is modeling the probability that y=1.
NOTE: Convergence criterion (GCONV=1E-8) satisfied.
NOTE: There were 150 observations read from the data set WORK.TEST.
WHERE selected;
NOTE: The data set WORK.DATA1 has 50 observations and 8 variables.
NOTE: PROCEDURE LOGISTIC used (Total process time):
real time 1.01 seconds
cpu time 0.15 seconds
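One way to check which observations were actually scored, without relying on the log notes alone, is to name the scored data set and tabulate the selected flag. This is only a sketch: the data set name scored is an arbitrary choice, and it assumes the OUT= data set carries over the variables of the SCORE input, including selected.

```sas
proc logistic data=test(where=(selected));
    model y(event="1") = x;
    /* OUT= names the scored data set instead of the default WORK.DATAn */
    score data=test(where=(not selected)) out=scored fitstat;
run;

/* If the WHERE= option is honored, only selected=0 should appear (50 rows) */
proc freq data=scored;
    tables selected;
run;
```

If the frequency table shows 150 rows with selected=1, the scoring used the estimation subset, as in the 9.3 logs above.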
Try it with two different datasets.
data selected notselected;
set test;
if selected then output selected;
else output notselected;
run;
It works fine with two different datasets. This, of course, gives me a workaround. That's why I suspect the problem is with the WHERE condition. - PG
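For reference, the workaround can be sketched end to end as follows. It splits the test data set from the first post into two physical data sets, so no WHERE= option is needed on either the DATA= or the SCORE statement.

```sas
/* Split the data once, up front */
data selected notselected;
    set test;
    if selected then output selected;
    else output notselected;
run;

/* Fit on one data set, score the other; no WHERE= data set options involved */
proc logistic data=selected;
    model y(event="1") = x;
    score data=notselected fitstat;
run;
```

With this layout the Fit Statistics for SCORE Data table should refer to WORK.NOTSELECTED, which has 50 observations.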
You should report it to Tech Support to make sure they have it as a bug they are tracking.
When I ran into this in my work, I would have been totally fooled if my data had been split in half, since both subsets would then have had the same number of observations. - PG
Thank you Tom and Reeza, I just submitted a bug report, citing this thread.
PG
Here is SAS's acknowledgement of the problem:
This does appear to be a defect in the software. I have let development know and they should be fixing the problem shortly. In the meantime you will need to use the workaround mentioned in the forum of creating the data sets prior to running logistic. Thank you for letting us know.
Sincerely,
Rob Agnelli
Technical Support Statistician
SAS
Is "NOTE: Convergence criterion (GCONV=1E-8) satisfied." a good thing or a bad thing when running a logistic regression?
This question is not related to the subject of this discussion and should have been submitted as a new discussion. But the answer is: a good thing. Convergence means that the iterative procedure found parameter values that maximize the log-likelihood, i.e. the best fit between your model and your data.
PG
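If you would rather check convergence in a data set than read the log, a sketch along these lines may help. It assumes that PROC LOGISTIC exposes an ODS table named ConvergenceStatus and that its Status variable is 0 when the criterion is satisfied; verify both against the documentation for your release. The name conv is arbitrary.

```sas
/* Capture the convergence status table in a data set */
ods output ConvergenceStatus=conv;
proc logistic data=test(where=(selected));
    model y(event="1") = x;
run;

/* Assumption: Status = 0 means the convergence criterion was satisfied */
proc print data=conv;
run;
```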